2. Data Management at Facebook
(Back in the Day)
Jeff Hammerbacher
VP Product and Chief Scientist, Cloudera
October 22, 2008
3. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Managed the Facebook Data Team through September 2008
▪ Over 25 amazing engineers and data scientists
▪ Now a cofounder of Cloudera
▪ Hadoop support and optimization
4. Common Themes
1. Simplicity
▪ Do one thing well ...
2. Scalability
▪ ... a lot
3. Manageability
▪ Remove the humans
4. Open Source
▪ Build a community
5. Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Three main server profiles:
▪ Web Tier (more than 10,000 servers)
▪ Memcached Tier (around 1,000 servers)
▪ MySQL Tier (around 2,000 servers)
▪ Simplified away:
▪ AJAX
▪ Photo and Video
▪ Services
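A minimal sketch of the cache-aside read path this tier split implies, assuming the python-memcached client; the key scheme, table, and host names are invented for illustration:

import memcache

mc = memcache.Client(["memcached-tier:11211"])

def get_profile(user_id, db):
    key = "profile:%d" % user_id
    profile = mc.get(key)               # 1. try the memcached tier first
    if profile is None:
        cur = db.cursor()               # 2. on a miss, hit the MySQL tier
        cur.execute("SELECT name, network FROM profiles WHERE id = %s",
                    (user_id,))
        profile = cur.fetchone()
        mc.set(key, profile, time=300)  # 3. repopulate the cache with a TTL
    return profile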
6. Services Infrastructure
What’s an SOA?
▪ Almost all services written in Thrift
▪ Network Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
▪ Network transport libraries
▪ Serialization libraries
▪ Code generation
▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
7. Services Infrastructure
Thrift, Mainly
▪ Developing a Thrift service:
▪ Define your data structures
▪ JSON-like data model
▪ Define your service endpoints
▪ Select your languages
▪ Generate stub code
▪ Write service logic
▪ Write client
▪ Configure and deploy
▪ Monitor, provision, and upgrade
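On the Python side, the generated-stub workflow above might look like this; the SearchService module, query endpoint, and port are hypothetical, not from the deck:

# Hypothetical client for a Thrift search service; assumes `thrift --gen py
# search.thrift` produced the search.SearchService stub module.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from search import SearchService  # generated stub (hypothetical)

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = SearchService.Client(protocol)

transport.open()
results = client.query("data warehousing")  # endpoint defined in the IDL
transport.close()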
8. Data Infrastructure
Offline Batch Processing
[Diagram: Scribe tier and MySQL tier feed a Data Collection Server, which loads an Oracle Database Server]
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth
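The hand-coded scripts themselves are not shown in the deck; a minimal sketch of the daily cron-driven pattern, with invented table names and connection strings, and the MySQLdb/cx_Oracle drivers as assumptions:

# Minimal daily ETL in the spirit of the hand-coded scripts (illustrative).
# Run from cron, e.g.:  0 4 * * * /usr/bin/python /etl/daily_load.py
import datetime
import MySQLdb
import cx_Oracle

ds = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

src = MySQLdb.connect(host="mysql-tier", db="facebook")
cur = src.cursor()
cur.execute("SELECT user_id, action, COUNT(*) FROM actions "
            "WHERE ds = %s GROUP BY user_id, action", (ds,))
rows = cur.fetchall()

dst = cx_Oracle.connect("warehouse/secret@oracle-db")
out = dst.cursor()
out.executemany(
    "INSERT INTO daily_actions (ds, user_id, action, n) "
    "VALUES (:1, :2, :3, :4)",
    [(ds, u, a, n) for (u, a, n) in rows])
dst.commit()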
9. Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
▪ Distributor.pl: distribute binaries to processing nodes
▪ C++ Binaries: parse, agg, load
[Diagram: Cheetah Master distributes work over partitioned log files on the filer to a processing tier]
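Cheetah itself shipped C++ binaries via distributor.pl; purely to illustrate the parse/aggregate step over partitioned log files, a single-machine Python analogue (file layout and log format invented):

# Single-machine analogue of the Cheetah parse/agg step (illustrative only;
# the real system distributed C++ binaries to a processing tier).
import glob
from collections import Counter
from multiprocessing import Pool

def summarize(path):
    counts = Counter()
    for line in open(path):
        action = line.split("\t")[1]   # assume tab-delimited log lines
        counts[action] += 1
    return counts

if __name__ == "__main__":
    partitions = glob.glob("/filer/logs/2008-10-21/part-*")
    total = Counter()
    for c in Pool(8).map(summarize, partitions):
        total.update(c)                # merge per-partition summaries
    print(total.most_common(10))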
10. Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
▪ Limited filer bandwidth
▪ No centralized log file metadata
▪ Writing a new Cheetah job requires writing C++ binaries
▪ Jobs are difficult to monitor and debug
▪ No support for ad hoc querying
▪ Not open source
12. Initial Hadoop Applications
Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many of the tools needed to support his project had to be built
▪ Understanding serialization format of wall post logs
▪ Common data operations: project, filter, join, group by
▪ Developed using Hadoop streaming for rapid prototyping in Python (see the sketch after this list)
▪ Scheduling regular processing and recovering from failures
▪ Making it easy to regularly load new data
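Hadoop streaming jobs are plain scripts reading stdin and writing tab-separated key/value pairs to stdout; a hedged sketch of a mapper/reducer pair counting brand mentions in wall posts (the input field layout and brand list are invented):

#!/usr/bin/env python
# mapper.py -- emit (brand, 1) for each wall post mentioning a brand.
import sys

BRANDS = {"nike", "apple", "starbucks"}

for line in sys.stdin:
    text = line.rstrip("\n").split("\t")[-1].lower()  # assume post text is last field
    for word in text.split():
        if word in BRANDS:
            print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py -- sum counts per brand (input arrives sorted by key).
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))

Submitted with the streaming jar, e.g.: hadoop jar hadoop-streaming.jar -input wall_posts -output brand_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py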
14. Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ “Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest”
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
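Breiman's recipe reduces to bootstrap sampling plus random feature selection at each split; a minimal classification sketch using scikit-learn's DecisionTreeClassifier (an assumption, the deck names no library), with majority voting across trees:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    # X, y are NumPy arrays; integer class labels assumed.
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))  # bootstrap sample
        tree = DecisionTreeClassifier(
            max_features="sqrt",               # random feature subset per split
            random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    # Majority vote across trees; averaging instead gives regression.
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)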
15. More Hadoop Applications
Insights
▪ Monitor the performance of your Facebook Ad, Page, or Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
17. More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
▪ Notifications
▪ News Feed story insertion
▪ Invitations
▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application
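The deck doesn't give the scoring formula; a hypothetical illustration of how feedback signals could map to a notification quota:

# Hypothetical reputation-to-quota mapping; the real signals and formula
# are not disclosed in the deck.
def reputation(sent, blocked, reported):
    # Fraction of an app's notifications users did not block or report.
    if sent == 0:
        return 1.0
    return max(0.0, 1.0 - float(blocked + reported) / sent)

def daily_notification_quota(score, base=20):
    return int(base * score)  # higher reputation -> larger quota

print(daily_notification_quota(reputation(sent=10000, blocked=400, reported=100)))  # 19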
18. Hive
Structured Data Management with Hadoop
▪ Hadoop:
▪ HDFS
▪ MapReduce
▪ Resource Manager
▪ Job Scheduler
▪ Hive:
▪ Logical data partitioning
▪ Metadata store (command line and web interfaces)
▪ Query Operators
▪ Query Language
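No queries appear in the deck; as an illustration of the logical date partitioning, a HiveQL query over an invented ds-partitioned table, driven through the hive command line:

import subprocess

# Illustrative only: table, column, and partition names are invented.
query = """
SELECT action, COUNT(1)
FROM action_log
WHERE ds = '2008-10-21'   -- logical partition: one directory per day
GROUP BY action
"""
subprocess.call(["hive", "-e", query])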
20. Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
▪ Prasad Chakka
21. Hive
Some Stats
▪ Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity
▪ Total data (compressed, deduplicated) - 180 TB
▪ Net data per day
▪ 10 TB uncompressed - 4 TB from databases, 6 TB from logs
▪ Over 2 TB compressed
▪ Data Processing Statistics
▪ 3,200 Jobs and 800,000 Tasks per day
▪ 55 TB of compressed data processed per day
▪ 15 TB of compressed data produced per day
▪ 80 M minutes of compute time per day
22. Cassandra
Structured Storage over a P2P Network
▪ Conceptually: BigTable data model on Dynamo infrastructure
▪ Design Goals:
▪ High availability
▪ Incremental scalability
▪ Eventual consistency (trade consistency for availability)
▪ Optimistic replication
▪ Low total cost of ownership
▪ Minimal administrative overhead
▪ Tunable tradeoffs between consistency, durability, and latency
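One common way to expose such tunable tradeoffs is per-request quorum counts: with N replicas, W write acks, and R read acks, choosing R + W > N makes every read quorum overlap the latest write quorum. A toy sketch of the idea (not Cassandra's actual client API):

# Sketch of tunable quorum replication (not Cassandra's real interface).
import time

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        self.replicas = [dict() for _ in range(n)]
        self.w, self.r = w, r

    def put(self, key, value):
        stamped = (time.time(), value)
        for replica in self.replicas[:self.w]:   # wait for W acks
            replica[key] = stamped
        # remaining replicas updated asynchronously (eventual consistency)

    def get(self, key):
        votes = [rep[key] for rep in self.replicas[:self.r] if key in rep]
        return max(votes)[1] if votes else None  # newest timestamp wins

store = QuorumStore(n=3, w=2, r=2)  # lowering w trades consistency for latency
store.put("user:42", "online")
print(store.get("user:42"))         # 'online'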
25. Cassandra
The Team
▪ Avinash Lakshman
▪ Prashant Malik
▪ Karthik Ranganathan
▪ Kannan Muthukkaruppan
26. Cassandra
Some Stats
▪ Cluster size - 120 nodes
▪ Single instance across two data centers
▪ Total data stored - 36 TB
▪ Writes - 300 million writes per day
▪ Reads - 1 million reads per day
▪ Read Latencies
▪ Min - 6.03 ms
▪ Mean - 90.6 ms
▪ Median - 18.24 ms
27. © 2008 Facebook, Inc. or its licensors. “Facebook” is a registered trademark of Facebook, Inc. All rights reserved. 1.0