Analyzing Big Data at Twitter

Web 2.0 Expo, 2010
Kevin Weil @kevinweil
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
My Background
‣   Studied Mathematics and Physics at Harvard, Physics at Stanford
‣   Tropos Networks (city-wide wireless): GBs of data
‣   Cooliris (web media): Hadoop for analytics, TBs of data
‣   Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
Data, Data Everywhere
‣   You guys generate a lot of data
‣   Anybody want to guess how much?
‣   12 TB/day (4+ PB/yr)
‣   20,000 CDs
‣   10 million floppy disks
‣   450 GB while I give this talk
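A quick back-of-the-envelope check shows these figures are mutually consistent. The sketch below uses decimal units; the 700 MB CD, 1.44 MB floppy, and ~54-minute talk length are illustrative assumptions, not from the deck:

```python
# Back-of-the-envelope check of the slide's data-volume figures (decimal units).
TB, GB, MB, PB = 1e12, 1e9, 1e6, 1e15

per_day = 12 * TB
per_year = per_day * 365 / PB                      # ~4.4 PB/yr -> "4+ PB/yr"
cds = per_day / (700 * MB)                         # ~17,000 CDs (assume 700 MB each)
floppies = per_day / (1.44 * MB)                   # ~8.3 million floppies
talk_minutes = 450 * GB / (per_day / 86400) / 60   # ~54 minutes to produce 450 GB

print(round(per_year, 1), round(cds), round(floppies / 1e6, 1), round(talk_minutes))
```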
Syslog?
‣   Started with syslog-ng
‣   As our volume grew, it didn’t scale
‣   Resources overwhelmed
‣   Lost data
Scribe
‣   Surprise! FB had the same problem; they built and open-sourced Scribe
‣   Log collection framework over Thrift
‣   You “scribe” log lines, with categories
‣   It does the rest
Scribe
‣   Runs locally; reliable in network outage
‣   Nodes only know downstream writer; hierarchical, scalable
‣   Pluggable outputs
[Diagram: FE nodes → aggregator (Agg) nodes → File / HDFS outputs]
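The reliability idea above — buffer locally, forward downstream, keep data when the network fails — can be sketched in a few lines of Python. This is a toy illustration of the design, not Scribe's actual Thrift API; all names here are hypothetical:

```python
# Toy illustration of Scribe-style category logging: buffer locally,
# forward to a downstream writer, and retain entries on failure.
class ToyScribe:
    def __init__(self, forward):
        self.forward = forward        # callable sending an entry downstream
        self.buffer = []              # local buffer; survives a network outage

    def scribe(self, category, line):
        self.buffer.append((category, line))
        self.flush()

    def flush(self):
        remaining = []
        for entry in self.buffer:
            try:
                self.forward(entry)          # e.g. a downstream aggregator
            except OSError:
                remaining.append(entry)      # keep for retry; no data lost
        self.buffer = remaining

sent = []
client = ToyScribe(forward=sent.append)
client.scribe("web", "GET /timeline 200")
print(sent)  # [('web', 'GET /timeline 200')]
```

When the downstream writer is unreachable, entries simply stay in the local buffer until the next flush succeeds — which is the property that made Scribe reliable during network outages.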
Scribe at Twitter
‣   Solved our problem, opened new vistas
‣   Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc.
‣   We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc.
‣   Continuing to work with FB to make it better

    http://github.com/traviscrawford/scribe
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
How Do You Store 12 TB/day?
‣   Single machine?
‣   What’s hard drive write speed?
‣   ~80 MB/s
‣   42 hours to write 12 TB
‣   Uh oh.
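The 42-hour figure falls straight out of the arithmetic:

```python
# How long does a single drive at ~80 MB/s take to write 12 TB?
bytes_total = 12e12        # 12 TB
write_speed = 80e6         # ~80 MB/s sequential write
hours = bytes_total / write_speed / 3600
print(round(hours, 1))  # 41.7
```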
Where Do I Put 12 TB/day?
‣   Need a cluster of machines
‣   ... which adds new layers of complexity
Hadoop
‣   Distributed file system
‣   Automatic replication
‣   Fault tolerance
‣   Transparently read/write across multiple machines
‣   MapReduce-based parallel computation
‣   Key-value based computation interface allows for wide applicability
‣   Fault tolerance, again
Hadoop
‣   Open source: top-level Apache project
‣   Scalable: Y! has a 4,000-node cluster
‣   Powerful: sorted 1 TB of random integers in 62 seconds
‣   Easy packaging/install: free Cloudera RPMs
MapReduce Workflow
[Diagram: Inputs → Map tasks → Shuffle/Sort → Reduce tasks → Outputs]
‣   Challenge: how many tweets per user, given the tweets table?
‣   Input: key=row, value=tweet info
‣   Map: output key=user_id, value=1
‣   Shuffle: sort by user_id
‣   Reduce: for each user_id, sum
‣   Output: user_id, tweet count
‣   With 2x machines, runs 2x faster
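The per-user tweet count can be simulated in plain Python — a toy walk-through of the map, shuffle, and reduce phases, not Hadoop's actual API (the sample tweets are made up):

```python
from itertools import groupby
from operator import itemgetter

# Toy tweets table: (row_id, user_id, tweet_text)
tweets = [
    (1, "alice", "hello"),
    (2, "bob", "hi"),
    (3, "alice", "again"),
]

# Map: emit (user_id, 1) for every tweet
mapped = [(user_id, 1) for _, user_id, _ in tweets]

# Shuffle: sort by key so equal user_ids become adjacent
shuffled = sorted(mapped, key=itemgetter(0))

# Reduce: sum the 1s for each user_id
counts = {user: sum(v for _, v in group)
          for user, group in groupby(shuffled, key=itemgetter(0))}

print(counts)  # {'alice': 2, 'bob': 1}
```

Because the map and reduce steps touch each key independently, splitting the input across twice the machines halves the runtime — the "2x machines, 2x faster" claim on the slide.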
Two Analysis Challenges
‣   Compute mutual followings in Twitter’s interest graph
‣   grep, awk? No way.
‣   If data is in MySQL... self-join on an n-billion-row table?
‣   n,000,000,000 x n,000,000,000 = ?
‣   I don’t know either.
Two Analysis Challenges
‣   Large-scale grouping and counting
‣   select count(*) from users? Maybe.
‣   select count(*) from tweets? Uh...
‣   Imagine joining these two.
‣   And grouping.
‣   And sorting.
Back to Hadoop
‣   Didn’t we have a cluster of machines?
‣   Hadoop makes it easy to distribute the calculation
‣   Purpose-built for parallel calculation
‣   Just a slight mindset adjustment
‣   But a fun one!
Analysis at Scale
‣   Now we’re rolling
‣   Count all tweets: 20+ billion, 5 minutes
‣   Parallel network calls to FlockDB to compute interest graph aggregates
‣   Run PageRank across users and interest graph
But...
‣   Analysis typically in Java
‣   Single-input, two-stage data flow is rigid
‣   Projections, filters: custom code
‣   Joins lengthy, error-prone
‣   n-stage jobs hard to manage
‣   Data exploration requires compilation
Three Challenges
‣   Collecting Data
‣   Large-Scale Storage and Analysis
‣   Rapid Learning over Big Data
Pig
‣   High-level language
‣   Transformations on sets of records
‣   Process data one step at a time
‣   Easier than SQL?
Why Pig?
Because I bet you can read the following script
A Real Pig Script
[Pig script shown as an image in the original deck]
‣   Just for fun... the same calculation in Java next
No, Seriously.
Pig Makes it Easy
‣   5% of the code
‣   5% of the dev time
‣   Within 20% of the running time
‣   Readable, reusable
‣   As Pig improves, your calculations run faster
One Thing I’ve Learned
‣   It’s easy to answer questions
‣   It’s hard to ask the right questions
‣   Value the system that promotes innovation and iteration
‣   More minds contributing = more value from your data
Counting Big Data
‣   How many requests per day?
‣   Average latency? 95th-percentile latency?
‣   Response code distribution per hour?
‣   Twitter searches per day?
‣   Unique users searching, unique queries?
‣   Links tweeted per day? By domain?
‣   Geographic distribution of all of the above
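Once the logs are collected, questions like average and 95th-percentile latency reduce to simple aggregations. A minimal sketch in plain Python (nearest-rank percentile; the sample latencies are made up, and the real computation would run as a distributed job):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 11, 900]

avg = sum(latencies_ms) / len(latencies_ms)
p95 = percentile(latencies_ms, 95)
print(avg, p95)  # 125.4 900
```

Note how the two slow outliers dominate both the mean and the tail — exactly why the slide asks about the 95th percentile, not just the average.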
Correlating Big Data
‣   Usage difference for mobile users?
‣   ... for users on desktop clients?
‣   ... for users of #newtwitter?
‣   Cohort analyses
‣   What features get users hooked?
‣   What features power Twitter users use often?
Research on Big Data
‣   What can we tell from a user’s tweets?
‣   ... from the tweets of their followers?
‣   ... from the tweets of those they follow?
‣   What influences retweets? Depth of the retweet tree?
‣   Duplicate detection (spam)
‣   Language detection (search)
‣   Machine learning
‣   Natural language processing
Diving Deeper
‣   HBase and building products from Hadoop
‣   LZO compression
‣   Protocol Buffers and Hadoop
‣   Our analytics-related open source: hadoop-lzo, elephant-bird
‣   Moving analytics to realtime

    http://github.com/kevinweil/hadoop-lzo
    http://github.com/kevinweil/elephant-bird
Questions?
Follow me at twitter.com/kevinweil
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Analyzing Big Data at Twitter (Web 2.0 Expo NYC Sep 2010)

  • 1. Analyzing Big Data at Twitter — Web 2.0 Expo, 2010 — Kevin Weil @kevinweil
  • 2. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 3. My Background ‣ Studied Mathematics and Physics at Harvard, Physics at Stanford ‣ Tropos Networks (city-wide wireless): GBs of data ‣ Cooliris (web media): Hadoop for analytics, TBs of data ‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data
  • 4. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 5. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess?
  • 6. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr)
  • 7. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs
  • 8. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs ‣ 10 million floppy disks
  • 9. Data, Data Everywhere ‣ You guys generate a lot of data ‣ Anybody want to guess? ‣ 12 TB/day (4+ PB/yr) ‣ 20,000 CDs ‣ 10 million floppy disks ‣ 450 GB while I give this talk
  • 10. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale
  • 11. Syslog? ‣ Started with syslog-ng ‣ As our volume grew, it didn’t scale ‣ Resources overwhelmed ‣ Lost data
  • 12. Scribe ‣ Surprise! FB had same problem, built and open-sourced Scribe ‣ Log collection framework over Thrift ‣ You “scribe” log lines, with categories ‣ It does the rest
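The real Scribe client is a Thrift service, not shown on the slide; the toy Python sketch below (the `ScribeLikeCollector` name is hypothetical) only illustrates the core idea of handing over log lines tagged with categories and letting the collector do the routing.

```python
from collections import defaultdict

class ScribeLikeCollector:
    """Toy stand-in for a Scribe client: callers hand it (category, line)
    pairs and it buffers them per category, the way Scribe routes log
    lines to per-category stores downstream."""
    def __init__(self):
        self.buffers = defaultdict(list)

    def log(self, category, line):
        # the caller's only job: "scribe" a line under a category
        self.buffers[category].append(line)

    def flush(self, category):
        # downstream writer drains one category's buffer
        lines, self.buffers[category] = self.buffers[category], []
        return lines

collector = ScribeLikeCollector()
collector.log("web", "GET /home 200")
collector.log("search", "q=hadoop")
collector.log("web", "GET /about 404")
```

In the real system the buffered lines would be forwarded to a downstream Scribe node rather than drained locally, which is what makes it resilient to network outages.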
  • 13. Scribe ‣ Runs locally; reliable in network outage [diagram: FE nodes]
  • 14. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable [diagram: FE nodes → Agg nodes]
  • 15. Scribe ‣ Runs locally; reliable in network outage ‣ Nodes only know downstream writer; hierarchical, scalable ‣ Pluggable outputs [diagram: FE nodes → Agg nodes → File, HDFS]
  • 16. Scribe at Twitter ‣ Solved our problem, opened new vistas ‣ Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc. ‣ We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc. ‣ Continuing to work with FB to make it better http://github.com/traviscrawford/scribe
  • 17. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 18. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed?
  • 19. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s
  • 20. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s ‣ 42 hours to write 12 TB
  • 21. How Do You Store 12 TB/day? ‣ Single machine? ‣ What’s hard drive write speed? ‣ ~80 MB/s ‣ 42 hours to write 12 TB ‣ Uh oh.
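The slide's 42-hour figure is easy to verify as a back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the slide's numbers:
# 12 TB/day arriving, ~80 MB/s sequential write speed on one disk.
TB = 10**12
MB = 10**6

daily_bytes = 12 * TB
write_speed = 80 * MB            # bytes per second, single drive

hours_to_write = daily_bytes / write_speed / 3600
print(round(hours_to_write))     # -> 42
```

Forty-two hours to write one day of data: a single machine falls behind before it even fills up, hence the cluster.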
  • 22. Where Do I Put 12TB/day? ‣ Need a cluster of machines ‣ ... which adds new layers of complexity
  • 23. Hadoop ‣ Distributed file system ‣ Automatic replication ‣ Fault tolerance ‣ Transparently read/write across multiple machines
  • 24. Hadoop ‣ Distributed file system ‣ Automatic replication ‣ Fault tolerance ‣ Transparently read/write across multiple machines ‣ MapReduce-based parallel computation ‣ Key-value based computation interface allows for wide applicability ‣ Fault tolerance, again
  • 25. Hadoop ‣ Open source: top-level Apache project ‣ Scalable: Y! has a 4000 node cluster ‣ Powerful: sorted 1TB random integers in 62 seconds ‣ Easy packaging/install: free Cloudera RPMs
  • 26. MapReduce Workflow ‣ Challenge: how many tweets per user, given tweets table? ‣ Input: key=row, value=tweet info ‣ Map: output key=user_id, value=1 ‣ Shuffle: sort by user_id ‣ Reduce: for each user_id, sum ‣ Output: user_id, tweet count ‣ With 2x machines, runs 2x faster [diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs]
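The tweets-per-user job on the slide can be sketched in miniature with explicit map, shuffle, and reduce phases; a real Hadoop job would run each phase across many machines, but the data flow is the same:

```python
from collections import defaultdict

def map_phase(rows):
    # each row is (row_id, tweet_info); emit (user_id, 1) per tweet
    for _, tweet in rows:
        yield tweet["user_id"], 1

def shuffle(pairs):
    # group all values by key (Hadoop does this by sorting on key)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # for each user_id, sum the ones -> tweet count
    return {user_id: sum(ones) for user_id, ones in groups.items()}

tweets = [(1, {"user_id": "alice"}),
          (2, {"user_id": "bob"}),
          (3, {"user_id": "alice"})]
counts = reduce_phase(shuffle(map_phase(tweets)))
```

Because each map call and each reduce call is independent, doubling the machines roughly halves the runtime, which is the slide's closing point.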
  • 33. Two Analysis Challenges ‣ Compute mutual followings in Twitter’s interest graph ‣ grep, awk? No way. ‣ If data is in MySQL... self join on an n-billion row table? ‣ n,000,000,000 x n,000,000,000 = ?
  • 34. Two Analysis Challenges ‣ Compute mutual followings in Twitter’s interest graph ‣ grep, awk? No way. ‣ If data is in MySQL... self join on an n-billion row table? ‣ n,000,000,000 x n,000,000,000 = ? ‣ I don’t know either.
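In miniature, the mutual-followings computation is just a hash lookup on reversed edges rather than the n-billion-row self join the slide warns about; a Hadoop job shards this same idea across machines. The edge list here is made up for illustration:

```python
# follows contains directed edges: (a, b) means "a follows b".
follows = {("a", "b"), ("b", "a"),
           ("a", "c"),
           ("c", "b"), ("b", "c")}

# A pair is mutual when the reversed edge also exists.
# sorted() canonicalizes each pair so (a, b) and (b, a) dedupe.
mutual = {tuple(sorted(edge)) for edge in follows
          if (edge[1], edge[0]) in follows}
```

Set membership makes each edge check O(1), so the whole pass is linear in the number of edges instead of quadratic in the number of rows.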
  • 35. Two Analysis Challenges ‣ Large-scale grouping and counting ‣ select count(*) from users? maybe. ‣ select count(*) from tweets? uh... ‣ Imagine joining these two. ‣ And grouping. ‣ And sorting.
  • 36. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel calculation ‣ Just a slight mindset adjustment
  • 37. Back to Hadoop ‣ Didn’t we have a cluster of machines? ‣ Hadoop makes it easy to distribute the calculation ‣ Purpose-built for parallel calculation ‣ Just a slight mindset adjustment ‣ But a fun one!
  • 38. Analysis at Scale ‣ Now we’re rolling ‣ Count all tweets: 20+ billion, 5 minutes ‣ Parallel network calls to FlockDB to compute interest graph aggregates ‣ Run PageRank across users and interest graph
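The slide mentions running PageRank across users and the interest graph; Twitter's actual implementation isn't shown, but a minimal power-iteration sketch over a toy follow graph conveys the computation:

```python
def pagerank(graph, damping=0.85, iters=50):
    """graph maps each node to the list of nodes it links to."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # everyone starts each round with the teleport share
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = damping * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:
                # dangling node: spread its rank evenly
                for m in nodes:
                    new[m] += damping * rank[n] / len(nodes)
        rank = new
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
rank = pagerank(graph)
```

At Twitter scale each iteration becomes a MapReduce pass: map emits each node's share to its out-neighbors, reduce sums the incoming shares.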
  • 39. But... ‣ Analysis typically in Java ‣ Single-input, two-stage data flow is rigid ‣ Projections, filters: custom code ‣ Joins lengthy, error-prone ‣ n-stage jobs hard to manage ‣ Data exploration requires compilation
  • 40. Three Challenges ‣ Collecting Data ‣ Large-Scale Storage and Analysis ‣ Rapid Learning over Big Data
  • 41. Pig ‣ High level language ‣ Transformations on sets of records ‣ Process data one step at a time ‣ Easier than SQL?
  • 42. Why Pig? Because I bet you can read the following script
  • 43. A Real Pig Script ‣ Just for fun... the same calculation in Java next
  • 45. Pig Makes it Easy ‣ 5% of the code
  • 46. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time
  • 47. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time
  • 48. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable
  • 49. Pig Makes it Easy ‣ 5% of the code ‣ 5% of the dev time ‣ Within 20% of the running time ‣ Readable, reusable ‣ As Pig improves, your calculations run faster
  • 50. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions
  • 51. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration
  • 52. One Thing I’ve Learned ‣ It’s easy to answer questions ‣ It’s hard to ask the right questions ‣ Value the system that promotes innovation and iteration ‣ More minds contributing = more value from your data
  • 53. Counting Big Data ‣ How many requests per day?
  • 54. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency?
  • 55. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour?
  • 56. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day?
  • 57. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries?
  • 58. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain?
  • 59. Counting Big Data ‣ How many requests per day? ‣ Average latency? 95% latency? ‣ Response code distribution per hour? ‣ Twitter searches per day? ‣ Unique users searching, unique queries? ‣ Links tweeted per day? By domain? ‣ Geographic distribution of all of the above
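Most of the request-log questions above reduce to grouping plus order statistics. As a small illustration (the latency sample is invented, and the nearest-rank percentile here is one of several common definitions), here is average and 95th-percentile latency:

```python
def percentile(values, p):
    # nearest-rank definition: the smallest value such that at
    # least p% of the data is less than or equal to it
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 500, 15,
                14, 13, 12, 11, 16, 15, 14, 13, 12, 11]

avg = sum(latencies_ms) / len(latencies_ms)
p95 = percentile(latencies_ms, 95)
```

Note how the two outliers drag the average far above the typical request, which is exactly why the slide asks for 95th-percentile latency and not just the mean.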
  • 60. Correlating Big Data ‣ Usage difference for mobile users?
  • 61. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients?
  • 62. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter?
  • 63. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses
  • 64. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked?
  • 65. Correlating Big Data ‣ Usage difference for mobile users? ‣ ... for users on desktop clients? ‣ ... for users of #newtwitter? ‣ Cohort analyses ‣ What features get users hooked? ‣ What features power Twitter users use often?
  • 66. Research on Big Data ‣ What can we tell from a user’s tweets?
  • 67. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers?
  • 68. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow?
  • 69. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree?
  • 70. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam)
  • 71. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search)
  • 72. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning
  • 73. Research on Big Data ‣ What can we tell from a user’s tweets? ‣ ... from the tweets of their followers? ‣ ... from the tweets of those they follow? ‣ What influences retweets? Depth of the retweet tree? ‣ Duplicate detection (spam) ‣ Language detection (search) ‣ Machine learning ‣ Natural language processing
  • 74. Diving Deeper ‣ HBase and building products from Hadoop ‣ LZO Compression ‣ Protocol Buffers and Hadoop ‣ Our analytics-related open source: hadoop-lzo, elephant-bird ‣ Moving analytics to realtime http://github.com/kevinweil/hadoop-lzo http://github.com/kevinweil/elephant-bird
  • 75. Questions? Follow me at twitter.com/kevinweil
