Measuring Big Data
Understanding data by usage
Charles Smith
Big Data Platform Architecture - Netflix
About Me
▪ Netflix
- I joined Netflix in 2011
- I spend my time working to make big data easy and efficient
- Usually from the perspective of someone trying to use the platform
▪ University of Florida
- Research in Information Retrieval
- How much information does a document have?
What would you measure?
What do you want to know?
~20 PB of compressed data
~500 billion events a day
~18K data sets
~4200 nodes in our clusters
Our largest two datasets:
1.4 PB
1.2 PB
~11K Hive
~3K Pig
~2.5K Presto
Task Hour Cost = (cost of node)/(tasks per node) * sum(task duration ms)/(60*60*1000)
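A minimal sketch of the formula above in Python. It assumes "cost of node" is an hourly price and that task durations arrive as a list of milliseconds; the function name, parameter names, and sample numbers are illustrative and do not come from the talk.

```python
MS_PER_HOUR = 60 * 60 * 1000


def task_hour_cost(node_cost_per_hour, tasks_per_node, task_durations_ms):
    """Cost of a job: (cost of node / tasks per node) * total task hours."""
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    cost_per_task_hour = node_cost_per_hour / tasks_per_node
    task_hours = sum(task_durations_ms) / MS_PER_HOUR
    return cost_per_task_hour * task_hours


# Hypothetical numbers: a node priced at 2.0 per hour with 8 task slots,
# and a job made of three tasks lasting 30, 60 and 90 minutes.
example_cost = task_hour_cost(2.0, 8, [1_800_000, 3_600_000, 5_400_000])
print(example_cost)
```

Dividing the node cost by the number of tasks per node spreads the price of a machine evenly across its task slots, so a job is charged only for the slot time it occupies.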
100 Jobs comprise 86% of the cost
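A statement like this one can be derived from per-job costs by ranking jobs and summing the top of the list. The sketch below assumes a hypothetical mapping of job name to cost; the job names and values are made up for illustration and are not figures from the talk.

```python
def top_n_cost_share(job_costs, n):
    """Fraction of total cost attributable to the n most expensive jobs."""
    total = sum(job_costs.values())
    if total == 0:
        return 0.0
    top = sorted(job_costs.values(), reverse=True)[:n]
    return sum(top) / total


# Hypothetical per-job costs.
costs = {"job_a": 50.0, "job_b": 30.0, "job_c": 15.0, "job_d": 5.0}
share = top_n_cost_share(costs, 2)
print(f"top 2 jobs: {share:.0%} of cost")
```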
What data is important?
Make people tell you the answer: tagging.
Manual data doesn’t stay current unless it needs to.
How do we actually use the data?
Parse the job (or ask the tool that parses it)
Charlotte
Presto
SQL Parser (Hive)
SQL Parser (Teradata)
Lipstick (Pig)
Metacat*
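To make "parse the job" concrete, here is a deliberately naive sketch that pulls input table names out of a SQL string with a regular expression. It is a stand-in for a real SQL parser, not how the tools named above work; the function name, the regex, and the column names in the sample query are assumptions for illustration.

```python
import re

# Matches the identifier that follows FROM or JOIN. This ignores subqueries,
# CTEs, comments and quoted identifiers, which a real parser would handle.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def input_tables(sql):
    """Return the sorted, de-duplicated table names a query appears to read."""
    return sorted({name.lower() for name in TABLE_REF.findall(sql)})


# Hypothetical query; the column names are invented.
query = """
    SELECT c.country_name, COUNT(*)
    FROM dse.ttl_title_d t
    JOIN dse.geo_country_d c ON t.country_id = c.country_id
    GROUP BY c.country_name
"""
tables = input_tables(query)
print(tables)
```

Asking the engine's own parser, as the slide suggests, avoids the blind spots listed in the comment, which is why a regex like this is only suitable as a first approximation.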
Dataset                        Distinct Queries
…                              2000
…                              1052
prodhive/dse/geo_country_d     1009
prodhive/dse/ttl_title_d       580
…                              565
…                              512
…                              466
…                              427
…                              395
…                              317
Dataset                        Queries
prodhive/dse/geo_country_d     11405
prodhive/dse/ttl_title_d       8194
…                              5928
…                              5451
…                              4849
…                              4654
…                              4334
…                              3620
…                              3046
…                              2823
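The two tables above differ in whether repeated runs of the same query are counted once or every time. A minimal sketch of both counts follows, assuming each execution record is a pair of query text and the datasets it read, and that two queries are "the same" when their text matches after lowercasing and collapsing whitespace. That definition of distinct is an assumption; the talk does not specify one.

```python
from collections import Counter, defaultdict


def query_counts(executions):
    """Return (total runs per dataset, distinct query texts per dataset)."""
    total = Counter()
    distinct = defaultdict(set)
    for query_text, datasets in executions:
        # Assumed normalization: case-insensitive, whitespace-insensitive.
        normalized = " ".join(query_text.lower().split())
        for dataset in set(datasets):
            total[dataset] += 1
            distinct[dataset].add(normalized)
    return total, {name: len(texts) for name, texts in distinct.items()}


# Hypothetical execution log.
executions = [
    ("SELECT * FROM geo_country_d", ["prodhive/dse/geo_country_d"]),
    ("select *  from geo_country_d", ["prodhive/dse/geo_country_d"]),
    ("SELECT 1 FROM geo_country_d JOIN ttl_title_d",
     ["prodhive/dse/geo_country_d", "prodhive/dse/ttl_title_d"]),
]
total, distinct = query_counts(executions)
print(total["prodhive/dse/geo_country_d"], distinct["prodhive/dse/geo_country_d"])
```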
Related To geo_country_d             Shared Queries
prodhive/dse/ttl_title_country_r     2277
…                                    1697
prodhive/dse/ttl_show_d              1540
prodhive/dse/ttl_season_d            1405
prodhive/dse/ttl_title_d             1392
…                                    926
…                                    817
…                                    743
prodhive/dse/ttl_season_country_r    638
…                                    628
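A "shared queries" ranking like the one above can be approximated by counting, for every query that reads the target dataset, which other datasets appear in the same query. The sketch below assumes the input is a list of per-query dataset lists; the function name and sample data are hypothetical.

```python
from collections import Counter


def shared_query_counts(query_inputs, target):
    """Count how many queries read both `target` and each other dataset."""
    shared = Counter()
    for datasets in query_inputs:
        unique = set(datasets)
        if target in unique:
            shared.update(unique - {target})
    return shared


# Hypothetical per-query input lists.
query_inputs = [
    ["prodhive/dse/geo_country_d", "prodhive/dse/ttl_title_d"],
    ["prodhive/dse/geo_country_d", "prodhive/dse/ttl_title_d",
     "prodhive/dse/ttl_show_d"],
    ["prodhive/dse/ttl_show_d"],
]
related = shared_query_counts(query_inputs, "prodhive/dse/geo_country_d")
print(related.most_common())
```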
Datasets                       Input Jobs    Queries
prodhive/cdn/occ…              2016          66
teradata/gdw_stg_prod/seg…     1587          36
prodhive/dse/msg…              1527          14
prodhive/dse/msg…              1512          30
teradata/gdw_stg_prod/seg…     1043          50
teradata/gdw_stg_prod/cdn…     970           10
teradata/gdw_tbl_prod/seg…     903           1
prodhive/rpt/pbe…              811           11
prodhive/gps/gro…              904           137
prodhive/cdn/ttl…              631           39
Challenges
▪ Knowing what questions you should try to answer.
▪ Getting this data isn’t easy.
▪ The data is noisy.
Thanks
▪ Charles Smith – Big Data Platform Architecture Netflix
▪ @charles_s_smith

OSCON 2015