SlideShare a Scribd company logo
1 of 71
Download to read offline
Big Data at Twitter, Chirp 2010
Big Data at Twitter
#chirpdata

     Kevin Weil
     @kevinweil

     Twitter Analytics
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
•   10,000 CDs
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
•   10,000 CDs
•   5 million floppy disks
Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
•   10,000 CDs
•   5 million floppy disks
•   225 GB while I give this talk
Syslog?
• Started with syslog-ng
• As our volume grew, it didn’t scale
Syslog?
• Started with syslog-ng
• As our volume grew, it didn’t scale
• Resources
  overwhelmed
• Lost data
Scribe
• Surprise! FB had same problem, built
and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories
• It does the rest
Scribe
                           FE   FE   FE
• Runs locally; reliable
in network outage
Scribe
                           FE         FE     FE
• Runs locally; reliable
in network outage
• Nodes only know
downstream writer;              Agg        Agg
hierarchical, scalable
Scribe
                              FE         FE     FE
• Runs locally; reliable
in network outage
• Nodes only know
downstream writer;                 Agg        Agg
hierarchical, scalable
• Pluggable outputs
                       File          HDFS
Scribe at Twitter
• Solved our problem, opened new
vistas
• Currently 30 different categories
logged from javascript, RoR, Scala, etc
• We improved logging, monitoring,
writing to Hadoop, compression
Scribe at Twitter
 • Continuing to work with FB
 • GSoC project! Help make it more
 awesome.


• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB
How do you store 7TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB
• Uh oh.
Where do I put 7TB/day?
• Need a cluster of
machines
Where do I put 7TB/day?
• Need a cluster of
machines

• ... which adds new
layers of complexity
Hadoop
• Distributed file system
   • Automatic replication, fault
   tolerance
• MapReduce-based parallel computation
   • Key-value based computation
   interface allows for wide applicability
Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4000 node cluster
• Powerful: sorted 1TB random integers
in 62 seconds

• Easy packaging: free Cloudera RPMs
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Inputs
         MapReduce Workflow
            Shuffle/
         Map Sort
                                   • Challenge: how many tweets per
         Map
                           Outputs user, given tweets table?
         Map      Reduce           • Input: key=row, value=tweet info
                                   • Map: output key=user_id, value=1
         Map      Reduce
                                   • Shuffle: sort by user_id
         Map      Reduce           • Reduce: for each user_id, sum
         Map
                                   • Output: user_id, tweet count
                                   • With 2x machines, runs 2x faster
         Map
Two Analysis Challenges
1. Compute friendships in Twitter’s social
graph
    • grep, awk? No way.
    • Data is in MySQL... self join on an n-
    billion row table?
    • n,000,000,000 x n,000,000,000 = ?
Two Analysis Challenges
1. Compute friendships in Twitter’s social
graph
    • grep, awk? No way.
    • Data is in MySQL... self join on an n-
    billion row table?
    • n,000,000,000 x n,000,000,000 = ?
    • I don’t know either.
Two Analysis Challenges
2. Large-scale grouping and counting
   • select count(*) from users? maybe.
   • select count(*) from tweets? uh...
   • Imagine joining them.
   • And grouping.
   • And sorting.
Back to Hadoop
• Didn’t we have a cluster of machines?
• Hadoop makes it easy to distribute the
calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
Back to Hadoop
• Didn’t we have a cluster of machines?
• Hadoop makes it easy to distribute the
calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
Analysis at Scale
• Now we’re rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble
social graph aggregates
• Run pagerank across users to calculate
reputations
But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
Three Challenges
• Collecting Data
• Large-Scale Storage & Analysis
• Rapid Learning over Big Data
Pig
• High level language
• Transformations on
sets of records
• Process data one
step at a time
• Easier than SQL?
Why Pig?
 Because I bet you can read
 the following script
A Real Pig Script




• Just for fun... the same calculation in Java
No, Seriously.
Pig Makes it Easy
• 5% of the code
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes
innovation and iteration
One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes
innovation and iteration
• More minds contributing = more value
from your data
Counting Big Data
• How many requests per day?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
Correlating Big Data
• Usage difference for mobile users?
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
Research on Big Data
• What can we tell from a user’s tweets?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-
bird
• Realtime analytics with Cassandra
Questions?
        Follow me at
        twitter.com/kevinweil

More Related Content

Viewers also liked

Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Milind Bhandarkar
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitMilind Bhandarkar
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsMilind Bhandarkar
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterMilind Bhandarkar
 
Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011ZoeMM
 
The Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalThe Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalRushin Naik
 
DataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestDataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestOllie Parsley
 
Tweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningTweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningSngular Meaning
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Hortonworks
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongFastly
 
Demo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designDemo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designChristine Outram
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutesdwmclary
 
Twitter as a data mining source
Twitter  as  a data mining sourceTwitter  as  a data mining source
Twitter as a data mining sourceAtaxo Group
 
Social media data for Social science research
Social media data for Social science researchSocial media data for Social science research
Social media data for Social science researchDavide Bennato
 
Data Mining on Twitter
Data Mining on TwitterData Mining on Twitter
Data Mining on TwitterPulkit Goyal
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingRapheephan Thongkham-Uan
 

Viewers also liked (20)

Scaling hadoopapplications
Scaling hadoopapplicationsScaling hadoopapplications
Scaling hadoopapplications
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?Hadoop: The Default Machine Learning Platform ?
Hadoop: The Default Machine Learning Platform ?
 
Extending Hadoop for Fun & Profit
Extending Hadoop for Fun & ProfitExtending Hadoop for Fun & Profit
Extending Hadoop for Fun & Profit
 
Future of Data Intensive Applicaitons
Future of Data Intensive ApplicaitonsFuture of Data Intensive Applicaitons
Future of Data Intensive Applicaitons
 
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to HamsterThe Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster
 
Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011Presentation sdimi risks, challenges and benefits of social media 2011
Presentation sdimi risks, challenges and benefits of social media 2011
 
The Asset Consultancy_PPT _final
The Asset Consultancy_PPT _finalThe Asset Consultancy_PPT _final
The Asset Consultancy_PPT _final
 
DataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - DevnestDataSift Update - May 3rd 2011 - Devnest
DataSift Update - May 3rd 2011 - Devnest
 
Tweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion miningTweet alert - semantic analysis in social networks for citizen opinion mining
Tweet alert - semantic analysis in social networks for citizen opinion mining
 
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
Demystify Big Data Breakfast Briefing: Martha Bennett, Forrester
 
Measuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrongMeasuring CDN performance and why you're doing it wrong
Measuring CDN performance and why you're doing it wrong
 
Demo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product designDemo or Die: Where advertising meets product design
Demo or Die: Where advertising meets product design
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
Twitter as a data mining source
Twitter  as  a data mining sourceTwitter  as  a data mining source
Twitter as a data mining source
 
Social media data for Social science research
Social media data for Social science researchSocial media data for Social science research
Social media data for Social science research
 
PPT FOR BIG
PPT FOR BIGPPT FOR BIG
PPT FOR BIG
 
Data Mining on Twitter
Data Mining on TwitterData Mining on Twitter
Data Mining on Twitter
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
 

Similar to Big Data at Twitter, Chirp 2010

Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindEMC
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Codemotion
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examplesYoshitomo Matsubara
 
Multi-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphXMulti-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphXQingbo Hu
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Databricks
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupCsaba Toth
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra PerfectSATOSHI TAGOMORI
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at myliferesponseteam
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Webmaria.grineva
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorterManchor Ko
 
Infrastructure for cloud_computing
Infrastructure for cloud_computingInfrastructure for cloud_computing
Infrastructure for cloud_computingJULIO GONZALEZ SANZ
 

Similar to Big Data at Twitter, Chirp 2010 (20)

Enar short course
Enar short courseEnar short course
Enar short course
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 
ENAR short course
ENAR short courseENAR short course
ENAR short course
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Multi-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphXMulti-label graph analysis and computations using GraphX
Multi-label graph analysis and computations using GraphX
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
IOE MODULE 6.pptx
IOE MODULE 6.pptxIOE MODULE 6.pptx
IOE MODULE 6.pptx
 
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q...
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Scaling
ScalingScaling
Scaling
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
 
Cache aware hybrid sorter
Cache aware hybrid sorterCache aware hybrid sorter
Cache aware hybrid sorter
 
Infrastructure for cloud_computing
Infrastructure for cloud_computingInfrastructure for cloud_computing
Infrastructure for cloud_computing
 

Recently uploaded

Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 

Recently uploaded (20)

Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 

Big Data at Twitter, Chirp 2010

  • 2. Big Data at Twitter #chirpdata Kevin Weil @kevinweil Twitter Analytics
  • 3. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 4. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 5. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess?
  • 6. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr)
  • 7. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs
  • 8. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks
  • 9. Data, Data Everywhere • You guys generate a lot of data • Anybody want to guess? • 7 TB/day (2+ PB/yr) • 10,000 CDs • 5 million floppy disks • 225 GB while I give this talk
  • 10. Syslog? • Started with syslog-ng • As our volume grew, it didn’t scale
  • 11. Syslog? • Started with syslog-ng • As our volume grew, it didn’t scale • Resources overwhelmed • Lost data
  • 12. Scribe • Surprise! FB had same problem, built and open-sourced Scribe • Log collection framework over Thrift • You write log lines, with categories • It does the rest
  • 13. Scribe FE FE FE • Runs locally; reliable in network outage
  • 14. Scribe FE FE FE • Runs locally; reliable in network outage • Nodes only know downstream writer; Agg Agg hierarchical, scalable
  • 15. Scribe FE FE FE • Runs locally; reliable in network outage • Nodes only know downstream writer; Agg Agg hierarchical, scalable • Pluggable outputs File HDFS
  • 16. Scribe at Twitter • Solved our problem, opened new vistas • Currently 30 different categories logged from javascript, RoR, Scala, etc • We improved logging, monitoring, writing to Hadoop, compression
  • 17. Scribe at Twitter • Continuing to work with FB • GSoC project! Help make it more awesome. • http://github.com/traviscrawford/scribe • http://wiki.developers.facebook.com/index.php/User:GSoC
  • 18. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 19. How do you store 7TB/day? • Single machine? • What’s HD write speed?
  • 20. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s
  • 21. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB
  • 22. How do you store 7TB/day? • Single machine? • What’s HD write speed? • 80 MB/s • 24.3 hrs to write 7 TB • Uh oh.
  • 23. Where do I put 7TB/day? • Need a cluster of machines
  • 24. Where do I put 7TB/day? • Need a cluster of machines • ... which adds new layers of complexity
  • 25. Hadoop • Distributed file system • Automatic replication, fault tolerance • MapReduce-based parallel computation • Key-value based computation interface allows for wide applicability
  • 26. Hadoop • Open source: top-level Apache project • Scalable: Y! has a 4000 node cluster • Powerful: sorted 1TB random integers in 62 seconds • Easy packaging: free Cloudera RPMs
  • 27. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 28. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 29. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 30. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 31. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 32. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 33. Inputs MapReduce Workflow Shuffle/ Map Sort • Challenge: how many tweets per Map Outputs user, given tweets table? Map Reduce • Input: key=row, value=tweet info • Map: output key=user_id, value=1 Map Reduce • Shuffle: sort by user_id Map Reduce • Reduce: for each user_id, sum Map • Output: user_id, tweet count • With 2x machines, runs 2x faster Map
  • 34. Two Analysis Challenges 1. Compute friendships in Twitter’s social graph • grep, awk? No way. • Data is in MySQL... self join on an n- billion row table? • n,000,000,000 x n,000,000,000 = ?
  • 35. Two Analysis Challenges 1. Compute friendships in Twitter’s social graph • grep, awk? No way. • Data is in MySQL... self join on an n- billion row table? • n,000,000,000 x n,000,000,000 = ? • I don’t know either.
  • 36. Two Analysis Challenges 2. Large-scale grouping and counting • select count(*) from users? maybe. • select count(*) from tweets? uh... • Imagine joining them. • And grouping. • And sorting.
  • 37. Back to Hadoop • Didn’t we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment
  • 38. Back to Hadoop • Didn’t we have a cluster of machines? • Hadoop makes it easy to distribute the calculation • Purpose-built for parallel calculation • Just a slight mindset adjustment • But a fun one!
  • 39. Analysis at Scale • Now we’re rolling • Count all tweets: 12 billion, 5 minutes • Hit FlockDB in parallel to assemble social graph aggregates • Run pagerank across users to calculate reputations
  • 40. But... • Analysis typically in Java • Single-input, two-stage data flow is rigid • Projections, filters: custom code • Joins lengthy, error-prone • n-stage jobs: hard to manage • Exploration requires compilation
  • 41. Three Challenges • Collecting Data • Large-Scale Storage & Analysis • Rapid Learning over Big Data
  • 42. Pig • High level language • Transformations on sets of records • Process data one step at a time • Easier than SQL?
  • 43. Why Pig? Because I bet you can read the following script
  • 44. A Real Pig Script • Just for fun... the same calculation in Java
  • 46. Pig Makes it Easy • 5% of the code
  • 47. Pig Makes it Easy • 5% of the code • 5% of the dev time
  • 48. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time
  • 49. Pig Makes it Easy • 5% of the code • 5% of the dev time • Within 25% of the running time • Readable, reusable
  • 50. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions
  • 51. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions. • Value the system that promotes innovation and iteration
  • 52. One Thing I’ve Learned • It’s easy to answer questions. • It’s hard to ask the right questions. • Value the system that promotes innovation and iteration • More minds contributing = more value from your data
  • 53. Counting Big Data • How many requests per day?
  • 54. Counting Big Data • How many requests per day? • Average latency? 95% latency?
  • 55. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour?
  • 56. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day?
  • 57. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries?
  • 58. Counting Big Data • How many requests per day? • Average latency? 95% latency? • Response code distribution per hour? • Searches per day? • Unique users searching, unique queries? • Geographic distribution of queries?
  • 59. Correlating Big Data • Usage difference for mobile users?
  • 60. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients?
  • 61. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses
  • 62. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked?
  • 63. Correlating Big Data • Usage difference for mobile users? • ... for users on desktop clients? • Cohort analyses • What features get users hooked? • What do successful users use often?
  • 64. Research on Big Data • What can we tell from a user’s tweets?
  • 65. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers?
  • 66. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow?
  • 67. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth?
  • 68. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection
  • 69. Research on Big Data • What can we tell from a user’s tweets? • ... from the tweets of their followers? • ... from the tweets of those they follow? • What influences retweet tree depth? • Duplicate detection, language detection • Machine learning
  • 70. If We Had More Time... • HBase backing namesearch • LZO compression • Protocol Buffers and Hadoop • Our open source: hadoop-lzo, elephant- bird • Realtime analytics with Cassandra
  • 71. Questions? Follow me at twitter.com/kevinweil

Editor's Notes