Petabyte Scale Data Warehousing at Facebook Ning Zhang Data Infrastructure Facebook
Overview Motivations Data-driven model Challenges Data Infrastructure Hadoop & Hive In-house tools Hive Details Architecture Data model Query language Extensibility Research Problems
Motivations
Facebook is just a Set of Web Services …
… at Large Scale The social graph is large: 400 million monthly active users, 250 million daily active users, 160 million active objects (groups/events/pages), 130 friend connections per user on average, 60 object (groups/events/pages) connections per user on average. Activities on the social graph: people spend 500 billion minutes per month on FB, the average user creates 70 pieces of content each month, 25 billion pieces of content are shared each month, millions of search queries per day. Facebook is still growing fast: new users, features, services …
Facebook is still growing and changing
Under the Hood Data flow from users’ perspective: Clients (browser/phone/3rd-party apps) → Web Services → Users. Another big topic is on the Web Services side. To complete the feedback system: the developers want to know how a new app/feature is received by users (A/B tests), the advertisers want to know how their ads perform (dashboards/reports), and, based on historical data, how to construct a model and predict the future (machine learning). Need data analytics! → Data warehouse: ETL, data processing, BI … Closing the loop: decision-making based on analyzing the data (users’ feedback)
Data-driven Business/R&D/Science … DSS is not new, but the Web gives it new elements. “In 2009, more data will be generated by individuals than the entire history of mankind through 2008.” -- Andreas Weigend, Harvard Business Review “The center of the universe has shifted from e-business to me-business.” -- same as above “Invariably, simple models and a lot of data trump more elaborate models based on less data.” -- Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data
Problems and Challenges Data-driven development/business: huge amounts of log data and user data are generated every day, and these data must be analyzed to feed development/business decisions (machine learning, report/dashboard generation, A/B testing, and many more problems). Scalability (more than petabytes), availability (HA), manageability (e.g., scheduling), performance (CPU, memory, disk/network I/O), and many more…
Facebook Engineering Teams (backend) Facebook Infrastructure: building foundations that serve end users/applications; OLTP workload; components include MySQL, memcached, HipHop (PHP), Thrift, Cassandra, Haystack, flashcache, … Facebook Data Infrastructure (data warehouse): building systems that serve data analysts, research scientists, engineers, product managers, executives, etc.; OLAP workload; components include Hadoop, Hive, HDFS, Scribe, HBase, and tools (ETL, UI, workflow management, etc.) Other engineering teams: platform, search, site integrity, monetization, apps, growth, etc.
DI Key Challenges (I) – Scalability Data, data and more data: 200 GB/day in March 2008, 12 TB/day at the end of 2009, about an 8x increase per year. Total size is 5 PB now (x3 when considering replication) – the same order as the Web (~25 billion indexable pages)
DI Key Challenges (II) – Performance Queries, queries and more queries: more than 200 unique users query the data warehouse every day; 7K queries/day at the end of 2009, 25K queries/day now. Workload is a mixture of ad-hoc queries and ETL/reporting queries. Fast, faster and real-time: users expect faster response times on fresher data (e.g., fighting spam/fraud in near real-time); sampling subsets of the data is not always good enough
Other Requirements Accessibility: everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!); schema discovery (more than 20K tables); data exploration and visualization (learning the data by looking); leverage existing prevalent and familiar tools (e.g., BI tools). Flexibility: schemas change frequently (adding new columns, changing column types, different partitionings of tables, etc.); data formats can differ (plain text, row store, column store, complex data types). Extensibility: easy to plug in user-defined functions, aggregations, etc.; data storage could be files, web services, “NoSQL stores”
Why not Existing Data Warehousing Systems? The cost of analysis and storage on proprietary systems does not support the trend towards more data: cost is based on data size (15 PB costs a lot!), with expensive hardware and support. Limited scalability: products designed decades ago (not suitable for a petabyte DW), and ETL is a big bottleneck. Long product development & release cycles, while user requirements change frequently (agile programming practice). Closed and proprietary systems
Let’s try Hadoop (MapReduce + HDFS) … Pros: superior in availability/scalability/manageability (99.9%); large and healthy open-source community (popular in both industry and academic organizations)
But not quite … Cons: programmability and metadata. Efficiency is not that great, but you can throw more hardware at it. MapReduce is hard to program (users know SQL/bash/Python) and hard to debug, so it takes longer to get results. No schema. Solution: Hive!
What is Hive? A system for managing and querying structured data built on top of Hadoop: Map-Reduce for execution, HDFS for storage, an RDBMS for metadata. Key building principles: SQL is a familiar language on data warehouses; extensibility – types, functions, formats, scripts (connecting to HBase, Pig, Hypertable, Cassandra, etc.); scalability and performance; interoperability (JDBC/ODBC/Thrift)
Hive: Familiar Schema Concepts
Column Data Types Primitive types: integer types, float, string, date, boolean. Nest-able collections: array&lt;any-type&gt;, map&lt;primitive-type, any-type&gt;. User-defined types: structures with attributes, which can themselves be of any type.
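To make this data model concrete, here is a minimal DDL sketch using these types; the table and column names are hypothetical, not taken from the talk.

CREATE TABLE user_actions (
  user_id    BIGINT,
  actions    ARRAY<STRING>,                        -- nest-able collection
  properties MAP<STRING, STRING>,                  -- nest-able collection
  device     STRUCT<os: STRING, version: STRING>   -- user-defined structure type
)
PARTITIONED BY (ds STRING)                         -- one partition per day
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';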
Optimizations Column pruning (also pushed down to the scan in columnar storage, RCFile). Predicate pushdown (not pushed below non-deterministic functions, e.g., rand()). Partition pruning. Sample pruning. Handling small files: merge while writing, CombineHiveInputFormat while reading. Small jobs: SELECT * with partition predicates is handled directly in the client. Restartability (work in progress)
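As a rough illustration of partition, predicate and sample pruning (the page_views table is hypothetical and assumed to be bucketed by user_id), queries of this kind could look like:

-- Partition pruning: only the files under ds='2010-03-01' are scanned;
-- the country predicate is pushed down to the scan.
SELECT page_id, COUNT(1)
FROM page_views
WHERE ds = '2010-03-01' AND country = 'US'
GROUP BY page_id;

-- Sample pruning: read only 1 of 32 buckets of a table CLUSTERED BY user_id.
SELECT COUNT(DISTINCT user_id)
FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id)
WHERE ds = '2010-03-01';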
Hive: Simplifying Hadoop Programming
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
vs.
hive> SELECT key, count(1) FROM kv1 WHERE key > 100 GROUP BY key;
MapReduce Scripts Example
add file page_url_to_id.py;
add file my_python_session_cutter.py;
FROM
  (SELECT TRANSFORM(uhash, page_url, unix_time) USING 'page_url_to_id.py'
          AS (uhash, page_id, unix_time)
   FROM mylog
   DISTRIBUTE BY uhash
   SORT BY uhash, unix_time) mylog2
SELECT TRANSFORM(uhash, page_id, unix_time) USING 'my_python_session_cutter.py'
       AS (uhash, session_info);
Hive Architecture
Hive: Making Optimizations Transparent Joins: Hive tries to reduce the number of map/reduce jobs needed; memory-efficient joins by streaming the largest table; map joins (user-specified small tables are stored in hash tables on the mappers, so no reducer is needed). Aggregations: map-side partial aggregations; hash-based aggregates with serialized keys/values in the hash tables (~90% speed improvement on the query SELECT count(1) FROM t); load balancing for data skew
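A hedged sketch of how a user can invoke these optimizations in the Hive dialect (the clicks and dim_countries tables are made up): a MAPJOIN hint asks Hive to load the small table into mapper-side hash tables, and hive.map.aggr enables hash-based partial aggregation on the map side.

-- Map join: dim_countries is small enough to hash in each mapper, so no reducer is needed.
SELECT /*+ MAPJOIN(d) */ c.ad_id, d.country_name
FROM clicks c JOIN dim_countries d ON (c.country_code = d.country_code);

-- Map-side partial aggregation with in-memory hash tables.
SET hive.map.aggr = true;
SELECT country_code, COUNT(1)
FROM clicks
GROUP BY country_code;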
Hive: Making Optimizations Transparent Storage: column-oriented data formats; column and partition pruning to reduce the amount of data scanned; lazy deserialization of data. Plan execution: parallel execution of independent parts of the plan
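For instance (a sketch with an assumed table name, and assuming the RCFile storage format and the hive.exec.parallel setting are available in the deployed Hive version), both the columnar format and parallel stage execution are exposed as simple declarations and settings:

-- Store the table in the columnar RCFile format so that column pruning
-- also skips unread columns on disk.
CREATE TABLE page_views_rc (user_id BIGINT, page_id BIGINT, country STRING)
PARTITIONED BY (ds STRING)
STORED AS RCFILE;

-- Allow independent stages of a multi-job query plan to run in parallel.
SET hive.exec.parallel = true;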
Hive: Open & Extensible Different on-disk storage(file) formats Text File, Sequence File, … Different serialization formats and data types LazySimpleSerDe, ThriftSerDe … User-provided map/reduce scripts In any language, use stdin/stdout to transfer data … User-defined Functions Substr, Trim, From_unixtime … User-defined Aggregation Functions Sum, Average … User-define Table Functions Explode …
Hive: Interoperability with Other Tools JDBC: enables integration with JDBC-based SQL clients. ODBC: enables integration with MicroStrategy. Thrift: enables writing cross-language clients; the main form of integration with the PHP-based Web UI
Powered by Hive
Usage in Facebook
Usage Types of applications: Reporting, e.g. daily/weekly aggregations of impression/click counts, measures of user engagement, MicroStrategy reports. Ad hoc analysis, e.g. how many group admins broken down by state/country. Machine learning (assembling training data), e.g. ad optimization: user engagement as a function of user attributes. Many others
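A representative reporting query of this kind might look like the following sketch (the ad_impressions table and its columns are assumptions, not from the talk):

-- Daily aggregation of impression/click counts per ad: one output row per (day, ad).
SELECT ds, ad_id,
       SUM(impressions) AS impressions,
       SUM(clicks)      AS clicks
FROM ad_impressions
WHERE ds >= '2010-03-01'
GROUP BY ds, ad_id;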
Hadoop & Hive Cluster @ Facebook Hadoop/Hive cluster 13600 cores Raw Storage capacity ~ 17PB 8 cores + 12 TB per node 32 GB RAM per node Two level network topology 1 Gbit/sec from node to rack switch 4 Gbit/sec to top level rack switch 2 clusters One for adhoc users One for strict SLA jobs
Hive & Hadoop Usage @ Facebook Statistics per day: 800TB of I/O per day 10K – 25K Hive jobs per day Hive simplifies Hadoop: New engineers go though a Hive training session Analysts (non-engineers) use Hadoop through Hive Most of jobs are Hive Jobs
Data Flow Architecture at Facebook (diagram components): Web Servers, Scribe-Hadoop Cluster (Scribe-HDFS), Adhoc Hive-Hadoop Cluster, Production Hive-Hadoop Cluster (with Hive replication between them), Federated MySQL, Oracle RAC
Scribe-HDFS: 101 (diagram): multiple scribed daemons send &lt;category, msgs&gt; into the Scribe-HDFS cluster and append them to /staging/&lt;category&gt;/&lt;file&gt; on the HDFS data nodes
Scribe-HDFS: Near real time Hadoop Clusters collocated with the web servers Network is the biggest bottleneck Typical cluster has about 50 nodes. Stats: 50TB/day of raw data logged 99% of the time data is available within 20 seconds
Warehousing at Facebook Instrumentation (PHP/Python etc.) Automatic ETL Continuously copy data to Hive tables Metadata discovery (CoHive) Query (Hive) Workflow specification and execution (Chronos) Reporting tools Monitoring and alerting
Future Work Scaling in a dynamic and fast-growing environment: erasure codes for Hadoop, Namenode scalability past 150 million objects, isolating ad hoc queries from jobs with strict deadlines, Hive replication, resource sharing (pools for slots). More scalable loading of data: incremental load of site data, continuous load of log data
Future Work Discovering data from > 20K tables Collaborative Hive Finding unused/rarely used data

Future: dynamic inserts into multiple partitions, more join optimizations, persistent UDFs, UDAFs and UDTFs, benchmarks for monitoring performance, IN, EXISTS and correlated sub-queries, statistics, materialized views
Research Challenges Reducing response time for small/medium jobs: from 20 thousand to 1 million queries per day; indexes on Hadoop, data mart strategy; near real-time query processing – pipelining MapReduce. Distributed-systems problems at large scale: job scheduling for mixed throughput and response-time workloads, orchestrating commits on thousands of machines (Scribe conf files), cross-data-center replication and consistency. Full SQL compliance: required by 3rd-party tools (e.g., BI) through ODBC/JDBC
Query Optimizations Efficiently computing histograms, medians and distinct values in a distributed shared-nothing architecture Cost models in the MapReduce framework
Social Graph Every user sees a different, personalized stream of information (news feed): 130 friend + 60 object updates in real time. Edge-rank: ranking of the updates that should be shown at the top. The social graph is stored in distributed MySQL databases. Data replication between data centers: an update to one data center should be replicated to the other data centers as well. How to partition a dense graph such that data transfer between partitions is minimized

Editor's Notes

  1. Motivations: - The problems we face - The role of data infrastructure team in FB - Why we chose the current infrastructure?
  2. List of apps: news feed, ads/notifications. Dynamic web site. What it boils down to is a set of web services – not a big deal
  3. -- As of Feb 2010, the U.S. Library of Congress had archived about 160 terabytes of data. -- As of March 2009, there were 25.21 billion indexable web pages; given an average page size of 300 KB, the indexable web is on the order of several petabytes. Google’s index size is estimated at 200 TB–2 PB.
  6. 1 Gbit connectivity within a rack, 100 Mbit across racks? Are all disks 7200 RPM SATA?