© Hortonworks Inc. 2013
Hortonworks
Data Science with Hadoop – A Primer
Hadoop Summit, June 2013
Ofer Mendelevitch
ofer@hortonworks.com
@ofermend
Who am I?
currently <- c(
  role = "director of data sciences",
  company = "Hortonworks")
• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc…
• Blog: www.achessdad.com
What will I be talking about?
•What is Data Science?
•Hadoop and Data Science
•Use-cases: data science with Hadoop
•How to get started?
What is Data Science?
• Data science: the art of building data products
• Data scientist: a person who does this
• Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data
Data science & big data
With Hadoop…
Time and cost of building large scale
data products is dramatically reduced
An Apache Hadoop Platform: Hortonworks Data Platform (HDP)
• Hadoop Core: distributed storage & processing (HDFS, MapReduce)
• Platform Services: enterprise readiness (HA, DR, snapshots, security, …)
• Data Services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
• Operational Services: manage & operate at scale (Oozie, Ambari)
• Deployment options: OS / VM, cloud, appliance
A typical Big Data Architecture
[Figure: architecture diagram in three layers, with the labels below]
• Data sources: traditional sources (RDBMS, OLTP, OLAP), new sources (web logs, email, sensor data, social media), mobile data, OLTP and POS systems
• Data systems: traditional repos (RDBMS, EDW, MPP) and the Hortonworks Data Platform
• Applications: business analytics, custom applications, packaged applications
• Operational tools: manage & monitor
• Dev & data tools: build & test
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed to work together
• Affordable at scale
– Use “commodity” hardware nodes
– Self-healing; failure handled by software
– Very good at batch processing of large datasets
Hadoop improves productivity of data scientists
• All data in one place
– Ability to store all the data in raw format
– Data silo convergence
– Data scientists will find innovative uses of combined data assets
• Data/compute capabilities available as shared asset
– Data scientists can quickly prototype a new idea without an up-front request for funding
Data-driven innovation is accelerated since Hadoop is “schema on read”
• “Schema change” project (timeline: start, 6 months, 9 months): “I need new data” → “Finally, we start collecting” → “Let me see… is it any good?”
• Schema on read (timeline: 3 months): “Let’s just put it in a folder on HDFS” → “Let me see… is it any good?” → “My model is awesome!”
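To make “schema on read” concrete, here is a minimal Python sketch. It assumes raw, tab-separated log lines in a made-up format with hypothetical field names; none of this comes from the talk. The lines are stored untouched, and structure is applied only when someone reads them.

```python
# Schema on read: raw lines are stored as-is; a schema is applied only at read time.
# The log format and the field names below are hypothetical, for illustration only.

RAW_LINES = [
    "2013-06-01\tU101\t/products/42\t200",
    "2013-06-01\tU102\t/cart\t500",
    "malformed line without tabs",
]

# The "schema" is a reader-side description of how to interpret each field.
SCHEMA = [("date", str), ("user_id", str), ("path", str), ("status", int)]


def read_with_schema(lines, schema):
    """Parse raw lines using the given schema, skipping lines that do not fit."""
    records = []
    for line in lines:
        parts = line.split("\t")
        if len(parts) != len(schema):
            continue  # the raw data stays untouched; the reader decides what to skip
        try:
            record = {name: cast(value) for (name, cast), value in zip(schema, parts)}
        except ValueError:
            continue  # a field could not be converted to the expected type
        records.append(record)
    return records


records = read_with_schema(RAW_LINES, SCHEMA)
errors = [r for r in records if r["status"] >= 500]
print(len(records), "parsed;", len(errors), "server errors")
```

A different question can be asked of the same raw lines later by passing a different schema, with no reload of the data.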
Hadoop is ideal for pre-processing of large raw datasets
Raw data → processed data, using steps such as:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
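Three of the steps named on this slide (stripping HTML, term normalization, document vector generation) can be sketched on a single document as follows. This is a minimal illustration using only the Python standard library; the sample document and all function names are assumptions, and on a cluster the same per-document logic would run inside a map task.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return only the text found between the tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(html):
    """Map a raw HTML document to a sparse term-count vector."""
    return Counter(normalize_terms(strip_html(html)))


# Hypothetical service record, made up for this example.
raw_doc = "<html><body><h1>Pump failure</h1><p>The pump failed; PUMP replaced.</p></body></html>"
vector = document_vector(raw_doc)
print(vector.most_common(3))
```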
In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001)
• More examples to learn from
• More possible feature types
– We’re looking for the most useful for our task
Use-cases
A (partial) map of data science “tasks”
• Discovery
– Clustering: detect natural groupings
– Outlier detection: detect anomalies
– Affinity analysis: co-occurrence patterns
• Prediction
– Classification: predict a category
– Regression: predict a value
– Recommendation: predict a preference
• Big Data Science: high energy physics, genomics, etc.
Use-case: product recommendation
• Inputs:
– Explicit product ratings (when provided)
– Implicit information: purchase transactions, page views, comments
• Data sources: ratings, page views, forum comments
• Ratings matrix (rows: users U101, U102, U103, U104, U105, …; columns: movies; “?” = rating unknown):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Goal: predict a preference
Observed ratings (rows: users U101, U102, U103, U104, U105, …; “?” = rating unknown):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Predicted ratings (same rows and columns, with every “?” filled in):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       1     3
4     1      5       2     3
1     2      4       1     3
3     2      3       1     5
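One common way to fill in the unknown ratings is user-based collaborative filtering. The sketch below applies it to the four rating rows shown on the slide. The talk does not say which algorithm produced the slide's filled-in matrix, so this technique is an assumption and its predictions will not match the slide's numbers.

```python
from math import sqrt

MOVIES = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]

# The four rows of ratings shown on the slide; None stands for "?" (unknown).
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def similarity(a, b):
    """Cosine similarity over the movies both users have rated (0.0 if none)."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not common:
        return 0.0
    dot = sum(x * y for x, y in common)
    norm_a = sqrt(sum(x * x for x, _ in common))
    norm_b = sqrt(sum(y * y for _, y in common))
    return dot / (norm_a * norm_b)


def predict(ratings, user, movie):
    """Similarity-weighted average of the other users' ratings for one movie."""
    weighted, total = 0.0, 0.0
    for other, row in enumerate(ratings):
        if other == user or row[movie] is None:
            continue
        weight = similarity(ratings[user], row)
        weighted += weight * row[movie]
        total += weight
    return weighted / total if total > 0 else None


def fill_missing(ratings):
    """Copy the matrix, predicting every unknown rating where possible."""
    return [
        [r if r is not None else predict(ratings, u, m) for m, r in enumerate(row)]
        for u, row in enumerate(ratings)
    ]


completed = fill_missing(RATINGS)
print(MOVIES)
for row in completed:
    print(["?" if r is None else round(r, 1) for r in row])
```

Known ratings are copied through unchanged; only the `None` cells are predicted.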
Using Hadoop for recommendation
[Figure: pipeline diagram with the labels: data sources (transactions, page views, content), HDFS, MapReduce, pre-process, recommend (custom logic), SQL, online serving]
With Hadoop, we can process very large preference datasets
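As an illustration of the kind of batch job that fits this pipeline, the sketch below counts how often two items appear together in one user's history, written as a map phase and a reduce phase in plain Python. The transactions are hypothetical and the function names are assumptions; a real job would run on the cluster through MapReduce, Pig or Hive rather than in a single process.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase transactions: (user_id, item) pairs.
TRANSACTIONS = [
    ("U101", "Epic"), ("U101", "Hobbit"),
    ("U102", "Hobbit"), ("U102", "Argo"),
    ("U103", "Epic"), ("U103", "Hobbit"), ("U103", "Argo"),
]


def map_phase(baskets):
    """Emit ((item_a, item_b), 1) for every pair of items in one user's basket."""
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            yield pair, 1


def reduce_phase(pairs):
    """Sum the counts per item pair, as a reducer would after the shuffle."""
    counts = defaultdict(int)
    for pair, n in pairs:
        counts[pair] += n
    return dict(counts)


# Group items by user (on a cluster this grouping would itself be a map/reduce step).
baskets = defaultdict(set)
for user, item in TRANSACTIONS:
    baskets[user].add(item)

cooccurrence = reduce_phase(map_phase(baskets))
print(cooccurrence)
```

Pairs with high counts are candidates for "people who bought X also bought Y" recommendations.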
Use-case: failure prediction
• Inputs:
– Equipment history: install date, model, past issues
– Equipment sensor data
– Product catalog: product families, expected lifetime
• Data sources: history, sensor data, product catalog
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Building a prediction model
Labeled data (TTF known) is used to build the model:
SKU     Install date  Service Person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Unseen data is passed to the model, which predicts TTF:
SKU     Install date  Service Person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67
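The labeled-data-to-model-to-prediction flow can be sketched with a one-feature least-squares fit. This is a toy under stated assumptions: it uses only the average temperature column, and three labeled rows are far too few for a meaningful model. The talk does not name the modeling algorithm, so linear regression here is a stand-in.

```python
# Labeled rows from the slide: (average temperature, time to failure in days).
LABELED = [(72, 180), (68, 450), (82, 332)]

# Unseen equipment from the slide: SKU -> average temperature.
UNSEEN = {"332456": 71, "442343": 67}


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# "Build the model" from labeled data, then apply it to the unseen rows.
intercept, slope = fit_line(LABELED)
predictions = {sku: intercept + slope * temp for sku, temp in UNSEEN.items()}
for sku, ttf in predictions.items():
    print(sku, round(ttf))
```

A realistic version would use every column of the feature matrix and hold out part of the labeled data to check the model's error.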
Using Hadoop for failure prediction
• HDFS: central repository for all data
– Service records (Word, PDF, etc.)
– Equipment purchase transaction data
– Product catalog: SKUs, model numbers, etc.
• Pre-process
– Convert service records to item features: remove PDF formatting, detect entities in records
– Normalize data using service records, product catalog
– Create feature matrix, ready for modeling algorithm
Use-case: SaaS application security
• Inputs:
– Click-stream: user interaction with application
• Data sources: user data, clicks
User ID  User since  Logins/month  Avg DL KB/day  …
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…        …           …             …
Detecting anomalous behavior records
• User access profile modeled as vector of features
• Detect anomalies in application access patterns
– Rule-based
– Machine-learning-based (determine an “outlier factor” between 0 and 1)
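The slide does not say how the “outlier factor” is computed, so the sketch below substitutes a simple heuristic: each user's distance from the per-feature median, scaled by the median absolute deviation and squashed into the range 0 to 1. The feature vectors are the three user rows shown earlier (logins per month, average download KB per day); the scoring formula itself is an assumption.

```python
from statistics import median

# User rows from the slide: user_id -> (logins per month, average download KB per day).
PROFILES = {
    "123456": (6, 30),
    "998323": (1, 5),
    "345375": (22, 120),
}


def outlier_factors(profiles):
    """Score each user in [0, 1): 0 means typical, values near 1 mean unusual."""
    columns = list(zip(*profiles.values()))
    medians = [median(col) for col in columns]
    # Median absolute deviation per feature; fall back to 1.0 to avoid dividing by zero.
    mads = [median(abs(v - m) for v in col) or 1.0 for col, m in zip(columns, medians)]
    scores = {}
    for user, features in profiles.items():
        # Largest scaled deviation across the user's features.
        distance = max(abs(v - m) / s for v, m, s in zip(features, medians, mads))
        scores[user] = distance / (1.0 + distance)
    return scores


scores = outlier_factors(PROFILES)
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(user, round(score, 2))
```

A rule-based check would instead compare individual features against fixed thresholds; the two approaches can be combined.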
Using Hadoop for anomaly detection
• HDFS: central repository for all raw data
– Raw user-access logs
– User information (organization, demographics)
• Pre-process
– Build access-profile (behavioral) for each user
• Detect anomalies
– In Hadoop
– Using existing tools: R, SAS, rules engine, etc
How do I get started?
Getting started with data science on Hadoop
1. Pick a good use-case that delivers immediate business value
2. Implement a proof-of-value (POV)
3. Build a team (hire/train)
Implement a proof-of-value
• Put together a Hadoop cluster
• Define the POV business use-case
• Pull raw data you need into the cluster
• Build it
• Show the business value of your data assets
Contact us. We can help!
Build a team: the data scientist skillset continuum
Roles shown on the continuum: Software Engineer, Research Scientist, Data Engineer, Data Scientist, Applied Scientist
• Role: Data Engineer
– Function: builds production-grade data products
– Good at: data and systems architecture; Hadoop, Pig/Hive, MapReduce, Mahout; Java, Python, Perl, SQL, C++, etc.; NoSQL (HBase, Cassandra, Mongo)
• Role: Applied Scientist
– Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
– Good at: statistics, machine learning; text processing, NLP; R, Matlab, SAS, SQL; scripting, prototyping; visualization / telling the story
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend
We’re hiring!
Data Science training: www.hortonworks.com/training

  • 2. Who am I?
      currently <- c(role="director of data sciences", company="Hortonworks")
      • Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc.
      • Blog: www.achessdad.com
  • 3. What will I be talking about?
      • What is Data Science?
      • Hadoop and Data Science
      • Use-cases: data science with Hadoop
      • How to get started?
  • 4. What is Data Science?
      • Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data.
      • What is data science? The art of building data products.
      • What is a data scientist? A person who does this.
  • 5. Data science & big data
  • 6. With Hadoop, the time and cost of building large-scale data products are dramatically reduced.
  • 7. An Apache Hadoop Platform: Hortonworks Data Platform (HDP)
      • Hadoop Core (distributed storage & processing): HDFS, MapReduce
      • Platform Services (enterprise readiness): HA, DR, snapshots, security, …
      • Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
      • Operational Services (manage & operate at scale): Oozie, Ambari
      • OS / VM, cloud, appliance
  • 8. A typical Big Data Architecture
      • Applications: business analytics, custom applications, packaged applications
      • Data systems: traditional repos (RDBMS, EDW, MPP) and Hortonworks Data Platform
      • Data sources: mobile data; OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media)
      • Operational tools: manage & monitor
      • Dev & data tools: build & test
  • 9. Keys to Hadoop’s power
      • Computation co-located with data
          – Data and computation system co-designed to work together
      • Affordable at scale
          – Use “commodity” hardware nodes
          – Self-healing; failure handled by software
          – Very good at batch processing of large datasets
  • 10. Hadoop improves productivity of data scientists
      • All data in one place
          – Ability to store all the data in raw format
          – Data silo convergence
          – Data scientists will find innovative uses of combined data assets
      • Data/compute capabilities available as shared asset
          – Data scientists can quickly prototype a new idea without an up-front request for funding
  • 11. Data-driven innovation is accelerated since Hadoop is “schema on read”
      • Without Hadoop (timeline marks: start, 6 months, 9 months): “I need new data”, “Schema change” project, “Finally, we start collecting”, “Let me see… is it any good?”
      • With Hadoop (timeline marks: start, 3 months): “Let’s just put it in a folder on HDFS”, “Let me see… is it any good?”, “My model is awesome!”
  • 12. Hadoop is ideal for pre-processing of large raw datasets (turning raw data into processed data)
      • Strip away HTML/PDF/DOC/PPT
      • Entity resolution
      • Term normalization
      • Document vector generation
      • Sampling, filtering
      • Joins
  • 13. In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001)
      • More examples to learn from
      • More possible feature types
          – We’re looking for the most useful for our task
  • 14. Use-cases
  • 15. A (partial) map of data science “tasks”
      • Discovery
          – Clustering: detect natural groupings
          – Outlier detection: detect anomalies
          – Affinity analysis: co-occurrence patterns
      • Prediction
          – Classification: predict a category
          – Regression: predict a value
          – Recommendation: predict a preference
      • Big Data Science: high energy physics, genomics, etc.
  • 16. Use-case: product recommendation
      • Inputs:
          – Explicit product ratings (when provided)
          – Implicit information: purchase transactions, page views, comments
      • Data sources: ratings, page views, forum comments
      • Ratings matrix of users (U101, U102, U103, U104, U105, …) by titles (Epic, X-Men, Hobbit, Argo, Pirates), with “?” marking an unknown rating: 5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5
  • 17. Goal: predict a preference
      • Before (users U101, U102, U103, U104, U105, … by titles Epic, X-Men, Hobbit, Argo, Pirates): 5 2 4 ? ? ? ? 5 2 ? 1 2 ? ? 3 ? 2 3 1 5
      • After, with each “?” replaced by a predicted rating: 5 2 4 1 3 4 1 5 2 3 1 2 4 1 3 3 2 3 1 5
  • 18. Using Hadoop for recommendation
      • Diagram components: data sources (transactions, page views, content); HDFS; MapReduce; pre-process; recommend; SQL; custom logic; online serving
      • With Hadoop, we can process very large preference datasets
  • 19. Use-case: failure prediction
      • Inputs:
          – Equipment history: install date, model, past issues
          – Equipment sensor data
          – Product catalog: product families, expected lifetime
      • Data sources: history, sensor data, product catalog
      • Example records:
          SKU    | Install date | Service person ID | Zip code | Avg temp | TTF (days)
          113454 | 5/1/2011     | 1345              | 94002    | 72       | 180
          998323 | 5/3/2009     | 3234              | 88321    | 68       | 450
          345375 | 8/2/2005     | 1112              | 53323    | 82       | 332
          …
  • 20. Building a prediction model
      • Labeled data (used to build the model):
          SKU    | Install date | Service person ID | Zip code | Avg temp | TTF (days)
          113454 | 5/1/2011     | 1345              | 94002    | 72       | 180
          998323 | 5/3/2009     | 3234              | 88321    | 68       | 450
          345375 | 8/2/2005     | 1112              | 53323    | 82       | 332
          …
      • Unseen data (the model predicts TTF):
          SKU    | Install date | Service person ID | Zip code | Avg temp
          332456 | 3/3/2013     | 1345              | 94005    | 71
          442343 | 6/6/2013     | 1112              | 77485    | 67
  • 21. Using Hadoop for failure prediction
      • HDFS: central repository for all data
          – Service records (Word, PDF, etc.)
          – Equipment purchase transaction data
          – Product catalog: SKUs, model numbers, etc.
      • Pre-process
          – Convert service records to item features: remove PDF formatting, detect entities in records
          – Normalize data using service records, product catalog
          – Create feature matrix; ready for modeling algorithm
  • 22. Use-case: SaaS application security
      • Inputs:
          – Click-stream: user interaction with application
      • Data sources: user data, clicks
      • Example user records:
          User ID | User since | Logins/month | Avg DL KB/day | …
          123456  | 1/3/2004   | 6            | 30            |
          998323  | 5/3/2009   | 1            | 5             |
          345375  | 8/2/2005   | 22           | 120           |
          …
  • 23. Detecting anomalous behavior records
      • User access profile modeled as vector of features
      • Detect anomalies in application access patterns
          – Rules based
          – Machine learning based (determine “outlier factor”: 0…1)
  • 24. Using Hadoop for anomaly detection
      • HDFS: central repository for all raw data
          – Raw user-access logs
          – User information (organization, demographics)
      • Pre-process
          – Build access-profile (behavioral) for each user
      • Detect anomalies
          – In Hadoop
          – Using existing tools: R, SAS, rules engine, etc.
  • 25. How do I get started?
  • 26. Getting started with data science on Hadoop
      1. Pick a good use-case that delivers immediate business value
      2. Implement a proof-of-value (POV)
      3. Build a team (hire/train)
  • 27. Implement a proof-of-value
      • Put together a Hadoop cluster
      • Define the POV business use-case
      • Pull raw data you need into the cluster
      • Build it
      • Show the business value of your data assets
      Contact us. We can help!
  • 28. Build a team: The data scientist skillset continuum
      • Continuum: from software engineer to research scientist, with data engineer, data scientist and applied scientist in between
      • Data Engineer
          – Function: builds production-grade data products
          – Good at: data and systems architecture; Hadoop, Pig/Hive, MapReduce, Mahout; Java, Python, Perl, SQL, C++, etc.; NoSQL (HBase, Cassandra, Mongo)
      • Applied Scientist
          – Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
          – Good at: statistics, machine learning; text processing, NLP; R, Matlab, SAS, SQL; scripting, prototyping; visualization / telling the story
  • 29. Thank you! Any questions?
      Ofer Mendelevitch, Director, Data Sciences @ Hortonworks
      ofer@hortonworks.com, @ofermend
      We’re hiring!
      Data Science training: www.hortonworks.com/training
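
Slide 12 names several pre-processing steps (stripping HTML, term normalization, document vector generation). A minimal sketch of the per-document logic, assuming plain Python and a simple term-count vector, might look like the following. The names `TextExtractor`, `strip_html`, `normalize` and `document_vector` are hypothetical, and the deck does not say how these steps were actually implemented on Hadoop.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops all markup (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize(text):
    """Term normalization (assumed here): lowercase, keep alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw):
    """Term-count vector for one raw document."""
    return Counter(normalize(strip_html(raw)))


vec = document_vector("<p>Hadoop stores <b>raw</b> data; Hadoop processes raw data.</p>")
```

On a cluster, a function like `document_vector` would typically be applied to each record in the map phase; that wiring is left out of this sketch.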
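
Slides 16 and 17 frame recommendation as filling in the unknown entries of a ratings matrix. One common way to do that, shown here only as an illustration, is user-based collaborative filtering with cosine similarity; the deck does not state which algorithm was used. The `ratings` data below is made up for the example and is not the matrix from the slides.

```python
from math import sqrt

# Hypothetical ratings (user -> {title: rating}); not the data from the slides.
ratings = {
    "u1": {"A": 5, "B": 2, "C": 4},
    "u2": {"A": 4, "B": 1, "C": 5, "D": 2},
    "u3": {"A": 1, "B": 5, "D": 4},
}


def cosine(a, b):
    """Cosine similarity over the titles that both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = sqrt(sum(a[t] ** 2 for t in shared))
    norm_b = sqrt(sum(b[t] ** 2 for t in shared))
    return dot / (norm_a * norm_b)


def predict(user, title):
    """Similarity-weighted average of other users' ratings for `title`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or title not in theirs:
            continue
        sim = cosine(ratings[user], theirs)
        num += sim * theirs[title]
        den += sim
    return num / den if den else None


estimate = predict("u1", "D")
```

Here `u1` agrees more with `u2` than with `u3`, so the estimate for title `D` lands closer to `u2`'s rating. At the scale slide 18 has in mind, the pairwise similarity step is the part that would be distributed.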
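
Slides 19 to 21 describe building a feature matrix from labeled records and using a model to predict TTF for unseen equipment. The sketch below uses the install year and average temperature from the slide tables and a 1-nearest-neighbor lookup as a stand-in for the "modeling algorithm", which the deck does not name. The feature choice and the helper names `scaler` and `predict_ttf` are assumptions, and three training rows are far too few for a real model.

```python
# Labeled rows from slide 19, reduced to (install year, avg temp, TTF in days).
labeled = [(2011, 72, 180), (2009, 68, 450), (2005, 82, 332)]
# Unseen rows from slide 20: (install year, avg temp).
unseen = [(2013, 71), (2013, 67)]


def scaler(rows):
    """Return a function that min-max scales each feature column of `rows`."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1 for c in cols]  # avoid division by zero
    return lambda row: [(v - lo) / sp for v, lo, sp in zip(row, lows, spans)]


# Feature matrix: scaled features paired with the known TTF label.
scale = scaler([row[:2] for row in labeled])
train = [(scale(row[:2]), row[2]) for row in labeled]


def predict_ttf(row):
    """1-nearest-neighbor: return the TTF of the closest labeled row."""
    x = scale(row)

    def dist(example):
        return sum((a - b) ** 2 for a, b in zip(x, example[0]))

    return min(train, key=dist)[1]


predictions = [predict_ttf(row) for row in unseen]
```

Scaling before measuring distance keeps the year and temperature columns comparable; without it, whichever feature has the larger numeric range would dominate.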
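
Slide 23 models each user's access profile as a feature vector and mentions an "outlier factor" between 0 and 1, without defining it. As one possible reading, the sketch below scores each profile by its distance from the mean profile in z-score units and squashes that distance with d / (1 + d). The scoring formula and the name `outlier_factors` are assumptions made for illustration; the profiles are the three rows shown on slide 22.

```python
from math import sqrt
from statistics import mean, pstdev

# Rows from slide 22: user ID -> (logins/month, avg download KB/day).
profiles = {"123456": (6, 30), "998323": (1, 5), "345375": (22, 120)}


def outlier_factors(profiles):
    """Score each profile in [0, 1); higher means further from the mean profile."""
    cols = list(zip(*profiles.values()))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    scores = {}
    for user, vec in profiles.items():
        # Euclidean length of the z-score vector for this user.
        d = sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(vec, mus, sigmas)))
        scores[user] = d / (1.0 + d)
    return scores


scores = outlier_factors(profiles)
most_anomalous = max(scores, key=scores.get)
```

A rules-based detector, the other option on slide 23, would replace the score with explicit thresholds on the same profile features.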

Editor's Notes

  1. Data science is not new. But now we need to do it with much larger datasets.
  2. As the volume of data has exploded, organizations increasingly acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It does not replace the database; as a complement, it must integrate easily with existing tools and approaches. This means it must interoperate with:
       • Existing applications, such as Tableau, SAS, Business Objects, etc.
       • Existing databases and data warehouses, for loading data to and from the data warehouse
       • Development tools used for building custom applications
       • Operational tools for managing and monitoring