2. Data Management at Facebook
(Back in the Day)
Jeff Hammerbacher
VP Product and Chief Scientist, Cloudera
October 22, 2008
3. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Managed the Facebook Data Team through September 2008
▪ Over 25 amazing engineers and data scientists
▪ Now a cofounder of Cloudera
▪ Hadoop support and optimization
4. Common Themes
1. Simplicity
▪ Do one thing well ...
2. Scalability
▪ ... a lot
3. Manageability
▪ Remove the humans
4. Open Source
▪ Build a community
5. Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Three main server profiles:
▪ Web Tier (more than 10,000 servers)
▪ Memcached Tier (around 1,000 servers)
▪ MySQL Tier (around 2,000 servers)
▪ Simplified away:
▪ AJAX
▪ Photo and Video
▪ Services
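A minimal sketch of the cache-aside read path this tier split implies, assuming the python-memcached client; the key scheme, table, and host names are invented for illustration:

import memcache

mc = memcache.Client(["memcached-tier:11211"])

def get_profile(user_id, db):
    key = "profile:%d" % user_id
    profile = mc.get(key)               # 1. try the memcached tier first
    if profile is None:
        cur = db.cursor()               # 2. on a miss, hit the MySQL tier
        cur.execute("SELECT name, network FROM profiles WHERE id = %s",
                    (user_id,))
        profile = cur.fetchone()
        mc.set(key, profile, time=300)  # 3. repopulate the cache with a TTL
    return profile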
6. Services Infrastructure
What’s an SOA?
▪ Almost all services written in Thrift
▪ Network Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
▪ Network transport libraries
▪ Serialization libraries
▪ Code generation
▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
7. Services Infrastructure
Thrift, Mainly
▪ Developing a Thrift service:
▪ Define your data structures
▪ JSON-like data model
▪ Define your service endpoints
▪ Select your languages
▪ Generate stub code
▪ Write service logic
▪ Write client
▪ Configure and deploy
▪ Monitor, provision, and upgrade
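On the Python side, the generated-stub workflow above might look like this; the SearchService module, query endpoint, and port are hypothetical, not from the deck:

# Hypothetical client for a Thrift search service; assumes `thrift --gen py
# search.thrift` produced the search.SearchService stub module.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from search import SearchService  # generated stub (hypothetical)

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = SearchService.Client(protocol)

transport.open()
results = client.query("data warehousing")  # endpoint defined in the IDL
transport.close()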
8. Data Infrastructure
Offline Batch Processing
[Diagram: Scribe tier and MySQL tier feed a Data Collection Server, which loads an Oracle Database Server]
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth
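The hand-coded scripts themselves are not shown in the deck; a minimal sketch of the daily cron-driven pattern, with invented table names and connection strings, and the MySQLdb/cx_Oracle drivers as assumptions:

# Minimal daily ETL in the spirit of the hand-coded scripts (illustrative).
# Run from cron, e.g.:  0 4 * * * /usr/bin/python /etl/daily_load.py
import datetime
import MySQLdb
import cx_Oracle

ds = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()

src = MySQLdb.connect(host="mysql-tier", db="facebook")
cur = src.cursor()
cur.execute("SELECT user_id, action, COUNT(*) FROM actions "
            "WHERE ds = %s GROUP BY user_id, action", (ds,))
rows = cur.fetchall()

dst = cx_Oracle.connect("warehouse/secret@oracle-db")
out = dst.cursor()
out.executemany(
    "INSERT INTO daily_actions (ds, user_id, action, n) "
    "VALUES (:1, :2, :3, :4)",
    [(ds, u, a, n) for (u, a, n) in rows])
dst.commit()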
9. Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
▪ Distributor.pl: distribute binaries to processing nodes
▪ C++ Binaries: parse, agg, load
[Diagram: Cheetah Master distributes work over partitioned log files on the filer to a processing tier]
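Cheetah itself shipped C++ binaries via distributor.pl; purely to illustrate the parse/aggregate step over partitioned log files, a single-machine Python analogue (file layout and log format invented):

# Single-machine analogue of the Cheetah parse/agg step (illustrative only;
# the real system distributed C++ binaries to a processing tier).
import glob
from collections import Counter
from multiprocessing import Pool

def summarize(path):
    counts = Counter()
    for line in open(path):
        action = line.split("\t")[1]   # assume tab-delimited log lines
        counts[action] += 1
    return counts

if __name__ == "__main__":
    partitions = glob.glob("/filer/logs/2008-10-21/part-*")
    total = Counter()
    for c in Pool(8).map(summarize, partitions):
        total.update(c)                # merge per-partition summaries
    print(total.most_common(10))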
10. Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
▪ Limited filer bandwidth
▪ No centralized log file metadata
▪ Writing a new Cheetah job requires writing C++ binaries
▪ Jobs are difficult to monitor and debug
▪ No support for ad hoc querying
▪ Not open source
12. Initial Hadoop Applications
Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many of the tools needed to support his project had to be built
▪ Understanding serialization format of wall post logs
▪ Common data operations: project, filter, join, group by
▪ Developed using Hadoop streaming for rapid prototyping in Python (see the sketch after this list)
▪ Scheduling regular processing and recovering from failures
▪ Making it easy to regularly load new data
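Hadoop streaming jobs are plain scripts reading stdin and writing tab-separated key/value pairs to stdout; a hedged sketch of a mapper/reducer pair counting brand mentions in wall posts (the input field layout and brand list are invented):

#!/usr/bin/env python
# mapper.py -- emit (brand, 1) for each wall post mentioning a brand.
import sys

BRANDS = {"nike", "apple", "starbucks"}

for line in sys.stdin:
    text = line.rstrip("\n").split("\t")[-1].lower()  # assume post text is last field
    for word in text.split():
        if word in BRANDS:
            print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py -- sum counts per brand (input arrives sorted by key).
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))

Submitted with the streaming jar, e.g.: hadoop jar hadoop-streaming.jar -input wall_posts -output brand_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py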
14. Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ “Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest”
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
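Breiman's recipe reduces to bootstrap sampling plus random feature selection at each split; a minimal classification sketch using scikit-learn's DecisionTreeClassifier (an assumption, the deck names no library), with majority voting across trees:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    # X, y are NumPy arrays; integer class labels assumed.
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))  # bootstrap sample
        tree = DecisionTreeClassifier(
            max_features="sqrt",               # random feature subset per split
            random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])
    # Majority vote across trees; averaging instead gives regression.
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)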
15. More Hadoop Applications
Insights
▪ Monitor the performance of your Facebook Ad, Page, or Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
17. More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
▪ Notifications
▪ News Feed story insertion
▪ Invitations
▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application
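The deck doesn't give the scoring formula; a hypothetical illustration of how feedback signals could map to a notification quota:

# Hypothetical reputation-to-quota mapping; the real signals and formula
# are not disclosed in the deck.
def reputation(sent, blocked, reported):
    # Fraction of an app's notifications users did not block or report.
    if sent == 0:
        return 1.0
    return max(0.0, 1.0 - float(blocked + reported) / sent)

def daily_notification_quota(score, base=20):
    return int(base * score)  # higher reputation -> larger quota

print(daily_notification_quota(reputation(sent=10000, blocked=400, reported=100)))  # 19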
18. Hive
Structured Data Management with Hadoop
▪ Hadoop:
▪ HDFS
▪ MapReduce
▪ Resource Manager
▪ Job Scheduler
▪ Hive:
▪ Logical data partitioning
▪ Metadata store (command line and web interfaces)
▪ Query Operators
▪ Query Language
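No queries appear in the deck; as an illustration of the logical date partitioning, a HiveQL query over an invented ds-partitioned table, driven through the hive command line:

import subprocess

# Illustrative only: table, column, and partition names are invented.
query = """
SELECT action, COUNT(1)
FROM action_log
WHERE ds = '2008-10-21'   -- logical partition: one directory per day
GROUP BY action
"""
subprocess.call(["hive", "-e", query])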
20. Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
▪ Prasad Chakka
21. Hive
Some Stats
▪ Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity
▪ Total data (compressed, deduplicated) - 180 TB
▪ Net data per day
▪ 10 TB uncompressed - 4 TB from databases, 6 TB from logs
▪ Over 2 TB compressed
▪ Data Processing Statistics
▪ 3,200 Jobs and 800,000 Tasks per day
▪ 55 TB of compressed data processed per day
▪ 15 TB of compressed data produced per day
▪ 80 M minutes of compute time per day
22. Cassandra
Structured Storage over a P2P Network
▪ Conceptually: BigTable data model on Dynamo infrastructure
▪ Design Goals:
▪ High availability
▪ Incremental scalability
▪ Eventual consistency (trade consistency for availability)
▪ Optimistic replication
▪ Low total cost of ownership
▪ Minimal administrative overhead
▪ Tunable tradeoffs between consistency, durability, and latency
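One common way to expose such tunable tradeoffs is per-request quorum counts: with N replicas, W write acks, and R read acks, choosing R + W > N makes every read quorum overlap the latest write quorum. A toy sketch of the idea (not Cassandra's actual client API):

# Sketch of tunable quorum replication (not Cassandra's real interface).
import time

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        self.replicas = [dict() for _ in range(n)]
        self.w, self.r = w, r

    def put(self, key, value):
        stamped = (time.time(), value)
        for replica in self.replicas[:self.w]:   # wait for W acks
            replica[key] = stamped
        # remaining replicas updated asynchronously (eventual consistency)

    def get(self, key):
        votes = [rep[key] for rep in self.replicas[:self.r] if key in rep]
        return max(votes)[1] if votes else None  # newest timestamp wins

store = QuorumStore(n=3, w=2, r=2)  # lowering w trades consistency for latency
store.put("user:42", "online")
print(store.get("user:42"))         # 'online'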
25. Cassandra
The Team
▪ Avinash Lakshman
▪ Prashant Malik
▪ Karthik Ranganathan
▪ Kannan Muthukkaruppan
26. Cassandra
Some Stats
▪ Cluster size - 120 nodes
▪ Single instance across two data centers
▪ Total data stored - 36 TB
▪ Writes - 300 million writes per day
▪ Reads - 1 million reads per day
▪ Read Latencies
▪ Min - 6.03 ms
▪ Mean - 90.6 ms
▪ Median - 18.24 ms
27. © 2008 Facebook, Inc. or its licensors. “Facebook” is a registered trademark of Facebook, Inc. All rights reserved. 1.0