9. Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
• 10,000 CDs
• 5 million floppy disks
• 225 GB while I give this talk
11. Syslog?
• Started with syslog-ng
• As our volume grew, it didn’t scale
• Resources overwhelmed
• Lost data
12. Scribe
• Surprise! FB had the same problem, and built and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories (see the client sketch below)
• It does the rest
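A minimal sketch of what “write log lines, with categories” looks like from Java, using Thrift-generated Scribe bindings. The host, port, and category here are illustrative; the LogEntry and ResultCode names follow scribe.thrift, but check them against your generated code.

import java.util.Collections;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class ScribeLogger {
    public static void main(String[] args) throws Exception {
        // Scribe conventionally listens locally on port 1463.
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 1463));
        transport.open();
        scribe.Client client = new scribe.Client(new TBinaryProtocol(transport));

        // Each entry carries a category; Scribe routes on the category.
        LogEntry entry = new LogEntry("web_requests", "GET /timeline 200 12ms");
        ResultCode rc = client.Log(Collections.singletonList(entry));
        if (rc == ResultCode.TRY_LATER) {
            // Server is busy or buffering; the client is expected to retry later.
        }
        transport.close();
    }
}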
15. Scribe
[Diagram: FE ×3 → Agg ×2 → File / HDFS]
• Runs locally; reliable in network outage
• Nodes only know their downstream writer; hierarchical, scalable
• Pluggable outputs (see the sample config below)
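And a hedged sketch of the “pluggable outputs” idea in Scribe’s own config format: a buffer store that forwards to a downstream aggregator and spills to local disk during a network outage. The hostname and paths are hypothetical.

port=1463
<store>
category=default
type=buffer

<primary>
# Forward to a hypothetical downstream aggregator
type=network
remote_host=agg1.example.com
remote_port=1463
</primary>

<secondary>
# Spill to local disk when the aggregator is unreachable
type=file
file_path=/var/tmp/scribe
</secondary>
</store>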
16. Scribe at Twitter
• Solved our problem, opened new vistas
• Currently 30 different categories logged from JavaScript, RoR, Scala, etc.
• We improved logging, monitoring, writing to Hadoop, compression
17. Scribe at Twitter
• Continuing to work with FB
• GSoC project! Help make it more awesome.
• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
22. How do you store 7 TB/day?
• Single machine?
• What’s HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB (7,000,000 MB ÷ 80 MB/s ≈ 87,500 s)
• Uh oh.
24. Where do I put 7 TB/day?
• Need a cluster of machines
• ... which adds new layers of complexity
25. Hadoop
• Distributed file system (HDFS; see the sketch below)
• Automatic replication, fault tolerance
• MapReduce-based parallel computation
• Key-value based computation interface allows for wide applicability
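To make “distributed file system” concrete, a minimal sketch of writing a file with the standard Hadoop FileSystem API. The namenode URI and path are hypothetical; block replication and placement happen transparently per cluster config.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode URI; normally read from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Looks like ordinary stream I/O; HDFS splits the file into blocks
        // and replicates each block across datanodes automatically.
        FSDataOutputStream out = fs.create(new Path("/logs/sample.txt"));
        out.writeBytes("hello, cluster\n");
        out.close();
    }
}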
26. Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4,000-node cluster
• Powerful: sorted 1 TB of random integers in 62 seconds
• Easy packaging: free Cloudera RPMs
27. MapReduce Workflow
[Diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs]
• Challenge: how many tweets per user, given the tweets table? (see the Java sketch after this slide)
• Input: key=row, value=tweet info
• Map: output key=user_id, value=1
• Shuffle: sort by user_id
• Reduce: for each user_id, sum
• Output: user_id, tweet count
• With 2x machines, runs 2x faster
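Here’s a minimal, hedged Java sketch of that job with the Hadoop MapReduce API. It assumes tab-separated tweet records with user_id as the first field; the field layout and class names are illustrative, not Twitter’s actual code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TweetsPerUser {
    // Map: emit (user_id, 1) for each tweet record.
    public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable row, Text tweetInfo, Context ctx)
                throws IOException, InterruptedException {
            // Assumption: user_id is the first tab-separated field.
            userId.set(tweetInfo.toString().split("\t", 2)[0]);
            ctx.write(userId, ONE);
        }
    }

    // Reduce: the shuffle has grouped and sorted by user_id; sum the 1s.
    public static class TweetReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text userId, Iterable<LongWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable one : ones) {
                sum += one.get();
            }
            ctx.write(userId, new LongWritable(sum)); // user_id, tweet count
        }
    }
}

The shuffle is what brings every record for a given user_id to the same reducer, which is why adding machines speeds the job up roughly linearly.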
35. Two Analysis Challenges
1. Compute friendships in Twitter’s social graph
• grep, awk? No way.
• Data is in MySQL... self join on an n-billion-row table?
• n,000,000,000 x n,000,000,000 = ?
• I don’t know either.
36. Two Analysis Challenges
2. Large-scale grouping and counting
• select count(*) from users? maybe.
• select count(*) from tweets? uh...
• Imagine joining them.
• And grouping.
• And sorting.
38. Back to Hadoop
• Didn’t we have a cluster of machines?
• Hadoop makes it easy to distribute the calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
39. Analysis at Scale
• Now we’re rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble social graph aggregates
• Run PageRank across users to calculate reputations
40. But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
49. Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable (see the sketch below)
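As a hedged illustration of the “5% of the code” claim: the same tweets-per-user job as three lines of Pig Latin, driven from Java via Pig’s PigServer API. The input path and schema are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TweetsPerUserPig {
    public static void main(String[] args) throws Exception {
        // Run on the cluster (ExecType.LOCAL is handy for testing).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // The whole MapReduce job from earlier, as Pig Latin:
        pig.registerQuery("tweets = LOAD '/data/tweets' AS (user_id:long, text:chararray);");
        pig.registerQuery("by_user = GROUP tweets BY user_id;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group AS user_id, COUNT(tweets) AS n;");

        // Materialize the result; Pig compiles all of this to MapReduce jobs.
        pig.store("counts", "/data/tweets_per_user");
    }
}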
52. One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes innovation and iteration
• More minds contributing = more value from your data
58. Counting Big Data
• How many requests per day?
• Average latency? 95th-percentile latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
63. Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
69. Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
70. If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-bird
• Realtime analytics with Cassandra