WaterlooHiveTalk
1. Petabyte Scale Data Warehousing at Facebook Ning Zhang Data Infrastructure Facebook
2. Overview Motivations Data-driven model Challenges Data Infrastructure Hadoop & Hive In-house tools Hive Details Architecture Data model Query language Extensibility Research Problems
5. … at Large Scale The social graph is large 400 million monthly active users 250 million daily active users 160 million active objects (groups/events/pages) 130 friend connections per user on average 60 object (groups/events/pages) connections per user on average Activities on the social graph People spend 500 billion minutes per month on FB Average user creates 70 pieces of content each month 25 billion pieces of content are shared each month Millions of search queries per day Facebook is still growing fast New users, features, services …
7. Under the Hood Data flow from users’ perspective Clients (browser/phone/3rd-party apps) Web Services Users Web services themselves are another big topic To close the feedback loop … The developers want to know how a new app/feature is received by users (A/B testing) The advertisers want to know how their ads perform (dashboards/reports) Based on historical data, how to build a model and predict the future (machine learning) Need data analytics! Data warehouse: ETL, data processing, BI … Closing the loop: decision-making based on analyzing the data (users’ feedback)
8. Data-driven Business/R&D/Science … DSS is not new, but the Web gives it new elements. “In 2009, more data will be generated by individuals than the entire history of mankind through 2008.” -- Andreas Weigend, Harvard Business Review “The center of the universe has shifted from e-business to me-business.” -- same source “Invariably, simple models and a lot of data trump more elaborate models based on less data.” -- Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data
9. Problems and Challenges Data-driven development/business Huge amounts of log data/user data generated every day Need to analyze this data to feed back into development/business decisions Machine learning, report/dashboard generation, A/B testing And many more problems Scalability (more than petabytes) Availability (HA) Manageability (e.g., scheduling) Performance (CPU, memory, disk/network I/O) And many more…
10. Facebook Engineering Teams (backend) Facebook Infrastructure Building foundations that serve end users/applications OLTP workload Components include MySQL, memcached, HipHop (PHP), Thrift, Cassandra, Haystack, flashcache, … Facebook Data Infrastructure (data warehouse) Building systems that serve data analysts, research scientists, engineers, product managers, executives, etc. OLAP workload Components include Hadoop, Hive, HDFS, Scribe, HBase, tools (ETL, UI, workflow management, etc.) Other engineering teams Platform, search, site integrity, monetization, apps, growth, etc.
11. DI Key Challenges (I) – scalability Data, data and more data 200 GB/day in March 2008 12 TB/day at the end of 2009 About 8x increase per year Total size is 5 PB now (x3 when considering replication) Same order as the Web (~25 billion indexable pages)
12. DI Key Challenges (II) – Performance Queries, queries and more queries More than 200 unique users query the data warehouse every day 7K queries/day at the end of 2009 25K queries/day now Workload is a mixture of ad-hoc queries and ETL/reporting queries Fast, faster and real-time Users expect faster response times on fresher data (e.g., fighting spam/fraud in near real-time) Sampling a subset of the data is not always good enough
13. Other Requirements Accessibility Everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!) Schema discovery (more than 20K tables) Data exploration and visualization (learning the data by looking) Leverage existing prevalent and familiar tools (e.g., BI tools) Flexibility Schema changes frequently (adding new columns, changing column types, different partitioning of tables, etc.) Data formats can differ (plain text, row store, column store, complex data types) Extensibility Easy to plug in user-defined functions, aggregations, etc. Data storage could be files, web services, “NoSQL stores”
14. Why not Existing Data Warehousing Systems? Cost of analysis and storage on proprietary systems does not support the trend towards more data Cost based on data size (15 PB costs a lot!) Expensive hardware and support Limited scalability: products designed decades ago are not suitable for a petabyte-scale DW ETL is a big bottleneck Long product development & release cycles User requirements change frequently (agile programming practice) Closed and proprietary systems
15. Let’s try Hadoop (MapReduce + HDFS) … Pros Superior availability/scalability/manageability (99.9%) Large and healthy open source community (popular in both industry and academic organizations)
16. But not quite … Cons: Programmability and Metadata Efficiency is not that great, but you can throw more hardware at it MapReduce is hard to program (users know SQL/bash/Python) Hard to debug, so it takes longer to get results No schema Solution: Hive!
17. What is Hive? A system for managing and querying structured data built on top of Hadoop Map-Reduce for execution HDFS for storage RDBMS for metadata Key Building Principles: SQL is a familiar language on data warehouses Extensibility – Types, Functions, Formats, Scripts (connecting to HBase, Pig, Hypertable, Cassandra, etc.) Scalability and Performance Interoperability (JDBC/ODBC/Thrift)
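A minimal HiveQL sketch of these principles (table and column names here are hypothetical, not from the talk): a partitioned table is declared over raw log files in HDFS and then queried with plain SQL.

CREATE TABLE page_views (
  uhash     STRING,
  page_url  STRING,
  unix_time BIGINT
)
PARTITIONED BY (ds STRING)                      -- one partition per day
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

SELECT page_url, COUNT(1) AS views
FROM page_views
WHERE ds = '2010-03-01'                         -- restrict to one day's partition
GROUP BY page_url;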
26. Optimizations Column Pruning Also pushed down to the scan in columnar storage (RCFile) Predicate Pushdown Not pushed below non-deterministic functions (e.g., rand()) Partition Pruning Sample Pruning Handle small files Merge while writing CombineHiveInputFormat while reading Small Jobs SELECT * with partition predicates in the client Restartability (Work In Progress)
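An illustrative query (reusing the hypothetical page_views table sketched earlier) showing where these optimizations apply: the ds predicate enables partition pruning, selecting only two columns enables column pruning (pushed into the scan when the table is stored as RCFile), and the deterministic filter on page_url is eligible for predicate pushdown.

SELECT uhash, page_url                          -- column pruning: only these columns are read
FROM page_views
WHERE ds = '2010-03-01'                         -- partition pruning: only this partition is scanned
  AND page_url LIKE '%profile%';                -- deterministic predicate, pushed down to the scan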
28. MapReduce Scripts Examples
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM (
  SELECT TRANSFORM(uhash, page_url, unix_time)
         USING 'page_url_to_id.py' AS (uhash, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY uhash
  SORT BY uhash, unix_time
) mylog2
SELECT TRANSFORM(uhash, page_id, unix_time)
       USING 'my_python_session_cutter.py' AS (uhash, session_info);
30. Hive: Making Optimizations Transparent Joins: Joins try to reduce the number of map/reduce jobs needed Memory-efficient joins by streaming the largest tables Map Joins User-specified small tables stored in hash tables on the mappers No reducer needed Aggregations: Map-side partial aggregations Hash-based aggregates Serialized key/values in hash tables 90% speed improvement on the query SELECT count(1) FROM t; Load balancing for data skew
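A sketch of a user-specified map join (table and column names are hypothetical): the MAPJOIN hint tells Hive to load the small table into an in-memory hash table on every mapper, so the join completes without a reduce phase.

SELECT /*+ MAPJOIN(c) */ p.page_url, c.country_name
FROM page_views p
JOIN countries c ON (p.country_code = c.country_code)   -- countries is the small side
WHERE p.ds = '2010-03-01';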
31. Hive: Making Optimizations Transparent Storage: Column oriented data formats Column and Partition pruning to reduce scanned data Lazy de-serialization of data Plan Execution Parallel Execution of Parts of the Plan
32. Hive: Open & Extensible Different on-disk storage (file) formats Text File, Sequence File, … Different serialization formats and data types LazySimpleSerDe, ThriftSerDe, … User-provided map/reduce scripts In any language; use stdin/stdout to transfer data … User-defined Functions Substr, Trim, From_unixtime, … User-defined Aggregation Functions Sum, Average, … User-defined Table Functions Explode, …
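A sketch of these extension points in use (the jar, class, and column names are hypothetical): a Java UDF is registered at query time, and the explode() UDTF turns an array column into one row per element via LATERAL VIEW.

ADD JAR my_udfs.jar;                                         -- hypothetical jar with a custom UDF
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrl';

SELECT normalize_url(page_url) AS url, friend_id
FROM page_views
  LATERAL VIEW explode(friend_ids) f AS friend_id            -- friend_ids assumed to be an ARRAY column
WHERE ds = '2010-03-01';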
33. Hive: Interoperability with Other Tools JDBC Enables integration with JDBC-based SQL clients ODBC Enables integration with MicroStrategy Thrift Enables writing cross-language clients Main form of integration with the PHP-based Web UI
36. Usage Types of Applications: Reporting E.g.: daily/weekly aggregations of impression/click counts Measures of user engagement MicroStrategy reports Ad hoc Analysis E.g.: how many group admins broken down by state/country Machine Learning (assembling training data) Ad Optimization E.g.: user engagement as a function of user attributes Many others
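A hedged sketch of the ad hoc example above (the table and columns are guesses, not the real warehouse schema): counting group admins broken down by country is a short aggregation query in Hive.

SELECT country, COUNT(DISTINCT admin_uhash) AS num_group_admins
FROM group_admins                              -- hypothetical table of group-admin records
WHERE ds = '2010-03-01'
GROUP BY country;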
37. Hadoop & Hive Cluster @ Facebook Hadoop/Hive cluster 13600 cores Raw storage capacity ~ 17 PB 8 cores + 12 TB per node 32 GB RAM per node Two-level network topology 1 Gbit/sec from node to rack switch 4 Gbit/sec to top-level rack switch 2 clusters One for ad hoc users One for strict-SLA jobs
38. Hive & Hadoop Usage @ Facebook Statistics per day: 800 TB of I/O per day 10K – 25K Hive jobs per day Hive simplifies Hadoop: New engineers go through a Hive training session Analysts (non-engineers) use Hadoop through Hive Most jobs are Hive jobs
39. Data Flow Architecture at Facebook [Architecture diagram; components shown: Web Servers, Scribe-Hadoop Cluster, Scribe-HDFS, Hive replication, Ad hoc Hive-Hadoop Cluster, Production Hive-Hadoop Cluster, Oracle RAC, Federated MySQL]
40. Scribe-HDFS: 101 [Diagram: scribed daemons receive <category, msgs> and append to /staging/<category>/<file> on HDFS data nodes in the Scribe-HDFS cluster]
41. Scribe-HDFS: Near real time Hadoop clusters co-located with the web servers Network is the biggest bottleneck A typical cluster has about 50 nodes Stats: 50 TB/day of raw data logged 99% of the time data is available within 20 seconds
42. Warehousing at Facebook Instrumentation (PHP/Python etc.) Automatic ETL Continuously copy data into Hive tables Metadata Discovery (CoHive) Query (Hive) Workflow specification and execution (Chronos) Reporting tools Monitoring and alerting
43. Future Work Scaling in a Dynamic and Fast-Growing Environment Erasure codes for Hadoop Namenode scalability past 150 million objects Isolating ad hoc queries from jobs with strict deadlines Hive Replication Resource Sharing Pools for slots More scalable loading of data Incremental load of site data Continuous load of log data
44. Future Work Discovering Data from > 20K tables Collaborative Hive Finding Unused/rarely used Data
45. Future Dynamic Inserts into multiple partitions More join optimizations Persistent UDFs, UDAFs and UDTFs Benchmarks for monitoring performance IN, exists and correlated sub-queries Statistics Materialized Views
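A sketch of what a dynamic-partition insert looks like (table names hypothetical; the feature is listed above as future work): instead of one INSERT statement per target partition, Hive routes each row to the partition named by its ds value.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE page_views_summary PARTITION (ds)
SELECT page_url, COUNT(1) AS views, ds          -- the dynamic partition column comes last
FROM page_views
GROUP BY page_url, ds;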
46. Research Challenges Reducing response time for small/medium jobs From 20 thousand queries per day to 1 million queries per day Indexes on Hadoop, data mart strategy Near real-time query processing – pipelining MapReduce Distributed systems problems at large scale: Job scheduling: mixed throughput and response-time workloads Orchestrating commits on thousands of machines (scribe conf files) Cross-data-center replication and consistency Full SQL compliance Required by 3rd-party tools (e.g., BI) through ODBC/JDBC
47. Query Optimizations Efficiently compute histograms, median, distinct values in a distributed shared-nothing architecture Cost models in the MapReduce framework
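An illustrative query (hypothetical table) for the kinds of statistics mentioned above; exact distinct counts in particular are expensive to compute across a shared-nothing cluster, which motivates the research question.

SELECT ds,
       COUNT(1)              AS row_count,      -- rows per day
       COUNT(DISTINCT uhash) AS distinct_users, -- expensive: exact distinct count
       AVG(LENGTH(page_url)) AS avg_url_length
FROM page_views
GROUP BY ds;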
48. Social Graph Every user sees a different, personalized stream of information (news feed) 130 friend + 60 object updates in real time Edge-rank: ranking of updates that should be shown at the top Social graph is stored in distributed MySQL databases Data replication between data centers: an update to one data center should be replicated to other data centers as well How to partition a dense graph such that data transfer across partitions is minimized?
Motivations: - The problems we face - The role of the data infrastructure team at FB - Why did we choose the current infrastructure?
List of apps, news feed, ads/notifications Dynamic web site What it boils down to is a set of web services, not a big deal
-- As of Feb 2010, the U.S. Library of Congress had archived about 160 terabytes of data. -- As of March 2009, there were 25.21 billion indexable web pages; given an average page size of 300 KB, the size of the Web is around 5000 petabytes. Google's index size is estimated at 200 TB–2 PB.
1 Gb connectivity within a rack, 100 Mb across racks? Are all disks 7200 RPM SATA?