WaterlooHiveTalk
1. Petabyte Scale Data Warehousing at Facebook Ning Zhang Data Infrastructure Facebook
2. Overview Motivations Data-driven model Challenges Data Infrastructure Hadoop & Hive In-house tools Hive Details Architecture Data model Query language Extensibility Research Problems
5. … at Large Scale The social graph is large 400 million monthly active users 250 million daily active users 160 million active objects (groups/events/pages) 130 friend connections per user on average 60 object (groups/events/pages) connections per user on average Activities on the social graph People spend 500 billion minutes per month on FB Average user creates 70 pieces of content each month 25 billion pieces of content are shared each month Millions of search queries per day Facebook is still growing fast New users, features, services …
7. Under the Hood Data flow from users’ perspective Clients (browser/phone/3rd-party apps) Web Services Users Web services themselves are another big topic To close the feedback loop … The developers want to know how a new app/feature is received by users (A/B testing) The advertisers want to know how their ads perform (dashboards/reports) Based on historical data, how to build a model and predict the future (machine learning) Need data analytics! Data warehouse: ETL, data processing, BI … Closing the loop: decision-making based on analyzing the data (users’ feedback)
8. Data-driven Business/R&D/Science … DSS is not new, but the Web gives it new elements. “In 2009, more data will be generated by individuals than the entire history of mankind through 2008.” -- Andreas Weigend, Harvard Business Review “The center of the universe has shifted from e-business to me-business.” -- same source “Invariably, simple models and a lot of data trump more elaborate models based on less data.” -- Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data
9. Problems and Challenges Data-driven development/business Huge amounts of log data/user data generated every day Need to analyze this data to feed back into development/business decisions Machine learning, report/dashboard generation, A/B testing And many more problems Scalability (more than petabytes) Availability (HA) Manageability (e.g., scheduling) Performance (CPU, memory, disk/network I/O) And many more…
10. Facebook Engineering Teams (backend) Facebook Infrastructure Building foundations that serve end users/applications OLTP workload Components include MySQL, memcached, HipHop (PHP), Thrift, Cassandra, Haystack, flashcache, … Facebook Data Infrastructure (data warehouse) Building systems that serve data analysts, research scientists, engineers, product managers, executives, etc. OLAP workload Components include Hadoop, Hive, HDFS, Scribe, HBase, tools (ETL, UI, workflow management, etc.) Other engineering teams Platform, search, site integrity, monetization, apps, growth, etc.
11. DI Key Challenges (I) – scalability Data, data and more data 200 GB/day in March 2008 12 TB/day at the end of 2009 About 8x increase per year Total size is 5 PB now (x3 when considering replication) Same order as the Web (~25 billion indexable pages)
12. DI Key Challenges (II) – Performance Queries, queries and more queries More than 200 unique users query the data warehouse every day 7K queries/day at the end of 2009 25K queries/day now Workload is a mixture of ad-hoc queries and ETL/reporting queries Fast, faster and real-time Users expect faster response times on fresher data (e.g., fighting spam/fraud in near real-time) Sampling a subset of the data is not always good enough
13. Other Requirements Accessibility Everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!) Schema discovery (more than 20K tables) Data exploration and visualization (learning the data by looking) Leverage existing prevalent and familiar tools (e.g., BI tools) Flexibility Schema changes frequently (adding new columns, changing column types, different partitioning of tables, etc.) Data formats can differ (plain text, row store, column store, complex data types) Extensibility Easy to plug in user-defined functions, aggregations, etc. Data storage could be files, web services, “NoSQL stores”
14. Why not Existing Data Warehousing Systems? Cost of analysis and storage on proprietary systems does not support the trend towards more data Cost based on data size (15 PB costs a lot!) Expensive hardware and support Limited scalability: products designed decades ago are not suitable for a petabyte-scale DW ETL is a big bottleneck Long product development & release cycles User requirements change frequently (agile programming practice) Closed and proprietary systems
15. Let’s try Hadoop (MapReduce + HDFS) … Pros Superior availability/scalability/manageability (99.9%) Large and healthy open source community (popular in both industry and academic organizations)
16. But not quite … Cons: Programmability and Metadata Efficiency is not that great, but you can throw more hardware at it MapReduce is hard to program (users know SQL/bash/Python) Hard to debug, so it takes longer to get results No schema Solution: Hive!
17. What is Hive? A system for managing and querying structured data built on top of Hadoop Map-Reduce for execution HDFS for storage RDBMS for metadata Key Building Principles: SQL is a familiar language on data warehouses Extensibility – Types, Functions, Formats, Scripts (connecting to HBase, Pig, Hypertable, Cassandra, etc.) Scalability and Performance Interoperability (JDBC/ODBC/Thrift)
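A minimal HiveQL sketch of these principles (table and column names here are hypothetical, not from the talk): a partitioned table is declared over raw log files in HDFS and then queried with plain SQL.

CREATE TABLE page_views (
  uhash     STRING,
  page_url  STRING,
  unix_time BIGINT
)
PARTITIONED BY (ds STRING)                      -- one partition per day
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

SELECT page_url, COUNT(1) AS views
FROM page_views
WHERE ds = '2010-03-01'                         -- restrict to one day's partition
GROUP BY page_url;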
26. Optimizations Column Pruning Also pushed down to the scan in columnar storage (RCFile) Predicate Pushdown Not pushed below non-deterministic functions (e.g., rand()) Partition Pruning Sample Pruning Handle small files Merge while writing CombineHiveInputFormat while reading Small Jobs SELECT * with partition predicates in the client Restartability (Work In Progress)
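An illustrative query (reusing the hypothetical page_views table sketched earlier) showing where these optimizations apply: the ds predicate enables partition pruning, selecting only two columns enables column pruning (pushed into the scan when the table is stored as RCFile), and the deterministic filter on page_url is eligible for predicate pushdown.

SELECT uhash, page_url                          -- column pruning: only these columns are read
FROM page_views
WHERE ds = '2010-03-01'                         -- partition pruning: only this partition is scanned
  AND page_url LIKE '%profile%';                -- deterministic predicate, pushed down to the scan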
28. MapReduce Scripts Examples
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM (
  SELECT TRANSFORM(uhash, page_url, unix_time)
         USING 'page_url_to_id.py' AS (uhash, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY uhash
  SORT BY uhash, unix_time
) mylog2
SELECT TRANSFORM(uhash, page_id, unix_time)
       USING 'my_python_session_cutter.py' AS (uhash, session_info);
30. Hive: Making Optimizations Transparent Joins: Joins try to reduce the number of map/reduce jobs needed Memory-efficient joins by streaming the largest tables Map Joins User-specified small tables stored in hash tables on the mappers No reducer needed Aggregations: Map-side partial aggregations Hash-based aggregates Serialized key/values in hash tables 90% speed improvement on the query SELECT count(1) FROM t; Load balancing for data skew
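A sketch of a user-specified map join (table and column names are hypothetical): the MAPJOIN hint tells Hive to load the small table into an in-memory hash table on every mapper, so the join completes without a reduce phase.

SELECT /*+ MAPJOIN(c) */ p.page_url, c.country_name
FROM page_views p
JOIN countries c ON (p.country_code = c.country_code)   -- countries is the small side
WHERE p.ds = '2010-03-01';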
31. Hive: Making Optimizations Transparent Storage: Column oriented data formats Column and Partition pruning to reduce scanned data Lazy de-serialization of data Plan Execution Parallel Execution of Parts of the Plan
32. Hive: Open & Extensible Different on-disk storage (file) formats Text File, Sequence File, … Different serialization formats and data types LazySimpleSerDe, ThriftSerDe, … User-provided map/reduce scripts In any language; use stdin/stdout to transfer data … User-defined Functions Substr, Trim, From_unixtime, … User-defined Aggregation Functions Sum, Average, … User-defined Table Functions Explode, …
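A sketch of these extension points in use (the jar, class, and column names are hypothetical): a Java UDF is registered at query time, and the explode() UDTF turns an array column into one row per element via LATERAL VIEW.

ADD JAR my_udfs.jar;                                         -- hypothetical jar with a custom UDF
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrl';

SELECT normalize_url(page_url) AS url, friend_id
FROM page_views
  LATERAL VIEW explode(friend_ids) f AS friend_id            -- friend_ids assumed to be an ARRAY column
WHERE ds = '2010-03-01';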
33. Hive: Interoperability with Other Tools JDBC Enables integration with JDBC-based SQL clients ODBC Enables integration with MicroStrategy Thrift Enables writing cross-language clients Main form of integration with the PHP-based Web UI
36. Usage Types of Applications: Reporting E.g.: daily/weekly aggregations of impression/click counts Measures of user engagement MicroStrategy reports Ad hoc Analysis E.g.: how many group admins broken down by state/country Machine Learning (assembling training data) Ad Optimization E.g.: user engagement as a function of user attributes Many others
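A hedged sketch of the ad hoc example above (the table and columns are guesses, not the real warehouse schema): counting group admins broken down by country is a short aggregation query in Hive.

SELECT country, COUNT(DISTINCT admin_uhash) AS num_group_admins
FROM group_admins                              -- hypothetical table of group-admin records
WHERE ds = '2010-03-01'
GROUP BY country;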
37. Hadoop & Hive Cluster @ Facebook Hadoop/Hive cluster 13600 cores Raw storage capacity ~ 17 PB 8 cores + 12 TB per node 32 GB RAM per node Two-level network topology 1 Gbit/sec from node to rack switch 4 Gbit/sec to top-level rack switch 2 clusters One for ad hoc users One for strict-SLA jobs
38. Hive & Hadoop Usage @ Facebook Statistics per day: 800 TB of I/O per day 10K – 25K Hive jobs per day Hive simplifies Hadoop: New engineers go through a Hive training session Analysts (non-engineers) use Hadoop through Hive Most jobs are Hive jobs
39. Data Flow Architecture at Facebook [Architecture diagram; components shown: Web Servers, Scribe-Hadoop Cluster, Scribe-HDFS, Hive replication, Ad hoc Hive-Hadoop Cluster, Production Hive-Hadoop Cluster, Oracle RAC, Federated MySQL]
40. Scribe-HDFS: 101 [Diagram: scribed daemons receive <category, msgs> and append to /staging/<category>/<file> on HDFS data nodes in the Scribe-HDFS cluster]
41. Scribe-HDFS: Near real time Hadoop clusters co-located with the web servers Network is the biggest bottleneck A typical cluster has about 50 nodes Stats: 50 TB/day of raw data logged 99% of the time data is available within 20 seconds
42. Warehousing at Facebook Instrumentation (PHP/Python etc.) Automatic ETL Continuously copy data into Hive tables Metadata Discovery (CoHive) Query (Hive) Workflow specification and execution (Chronos) Reporting tools Monitoring and alerting
43. Future Work Scaling in a Dynamic and Fast-Growing Environment Erasure codes for Hadoop Namenode scalability past 150 million objects Isolating ad hoc queries from jobs with strict deadlines Hive Replication Resource Sharing Pools for slots More scalable loading of data Incremental load of site data Continuous load of log data
44. Future Work Discovering Data from > 20K tables Collaborative Hive Finding Unused/rarely used Data
45. Future Dynamic Inserts into multiple partitions More join optimizations Persistent UDFs, UDAFs and UDTFs Benchmarks for monitoring performance IN, exists and correlated sub-queries Statistics Materialized Views
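A sketch of what a dynamic-partition insert looks like (table names hypothetical; the feature is listed above as future work): instead of one INSERT statement per target partition, Hive routes each row to the partition named by its ds value.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE page_views_summary PARTITION (ds)
SELECT page_url, COUNT(1) AS views, ds          -- the dynamic partition column comes last
FROM page_views
GROUP BY page_url, ds;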
46. Research Challenges Reducing response time for small/medium jobs From 20 thousand queries per day to 1 million queries per day Indexes on Hadoop, data mart strategy Near real-time query processing – pipelining MapReduce Distributed systems problems at large scale: Job scheduling: mixed throughput and response-time workloads Orchestrating commits on thousands of machines (scribe conf files) Cross-data-center replication and consistency Full SQL compliance Required by 3rd-party tools (e.g., BI) through ODBC/JDBC
47. Query Optimizations Efficiently compute histograms, median, distinct values in a distributed shared-nothing architecture Cost models in the MapReduce framework
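An illustrative query (hypothetical table) for the kinds of statistics mentioned above; exact distinct counts in particular are expensive to compute across a shared-nothing cluster, which motivates the research question.

SELECT ds,
       COUNT(1)              AS row_count,      -- rows per day
       COUNT(DISTINCT uhash) AS distinct_users, -- expensive: exact distinct count
       AVG(LENGTH(page_url)) AS avg_url_length
FROM page_views
GROUP BY ds;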
48. Social Graph Every user sees a different, personalized stream of information (news feed) 130 friend + 60 object updates in real time Edge-rank: ranking of updates that should be shown at the top Social graph is stored in distributed MySQL databases Data replication between data centers: an update to one data center should be replicated to other data centers as well How to partition a dense graph such that data transfer across partitions is minimized?
Motivations: - The problems we face - The role of the data infrastructure team at FB - Why did we choose the current infrastructure?
List of apps, news feed, ads/notifications Dynamic web site What it boils down to is a set of web services, not a big deal
-- As of Feb 2010, the U.S. Library of Congress had archived about 160 terabytes of data. -- As of March 2009, there were 25.21 billion indexable web pages; given an average page size of 300 KB, the size of the Web is around 5000 petabytes. Google's index size is estimated at 200 TB–2 PB.
1 Gb connectivity within a rack, 100 Mb across racks? Are all disks 7200 RPM SATA?