FASTER, FASTER, FASTER:
THE TRUE STORY OF A
MOBILE ANALYTICS DATA
MART ON HIVE
Mithun Radhakrishnan
Josh Walters
About myself
• Mithun Radhakrishnan
• Hive Engineer at Yahoo
• Hive Committer
• Has an irrational fear of spider monkeys
• mithun@apache.org
• @mithunrk
RECAP
2015 Hadoop Summit, San Jose, California
From: The [REDACTED] ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following 6 partition keys:
{hourly-timestamp, name, property, geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the remaining
partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to be faster, how come
queries on our table take forever just to get off the ground?
Yours gigantically,
Project [REDACTED]
ABOUT ME
• Josh Walters
• Data Engineer at Yahoo
• I build lots of data pipelines
• Can eat a whole plate of deep fried cookie dough
• http://joshwalters.com
• @joshwalters
WHAT IS THE CUSTOMER NEED?
• Faster ETL
• Faster queries
• Faster ramp up
CASE STUDY: MOBILE DATA MART
• Mobile app usage data
• Optimize performance
• Interactive analytics
LOW HANGING FRUIT
• Tez Tez Tez!
• Vectorized query execution
• Map-side aggregations
• Auto-convert map join
DATA PARTITIONING
• Want thousands of partitions
• Deep data partitioning
• Difficult to do at scale
DEEP PARTITIONING
• Greatly helps with compression
• 2015 XLDB talk on methods used
• http://www.youtube.com/watch?v=P-vrzYYdfL8
• http://www-conf.slac.stanford.edu/xldb2015/lightning_abstracts.asp
SOLID STATE DRIVES
• Didn’t really help
• Ended up CPU bound
• Regular drives are fine
ORC!
• Used in largest data systems
• 90% boost on sorted columns
• 30x compression versus raw text
• Fits well with our tech stack
SKETCH ALL THE THINGS
• Very accurate
• Can store sketches in Hive
• Union, intersection, difference
• 75% boost on relevant queries
SKETCH ALL THE THINGS
SELECT COUNT(DISTINCT id)
FROM DB.TABLE
WHERE ...; -- ~100 seconds
SELECT estimate(sketch(id))
FROM DB.TABLE
WHERE ...; -- ~25 seconds
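The slides do not show what the `sketch` and `estimate` UDFs do internally. As a rough illustration of how a fixed-size summary can stand in for an exact distinct count, here is a minimal K-Minimum-Values sketch in Python. It is a stand-in written for this example, not the implementation used in the talk; the class name `KMVSketch` and all of its methods are hypothetical.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k=4096):
        self.k = k
        self._heap = []        # negated hashes: a max-heap over the k smallest
        self._members = set()  # hashes currently retained, for de-duplication

    @staticmethod
    def _hash01(item):
        # Map an item to a roughly uniform float between 0 and 1.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def _add_hash(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        self._add_hash(self._hash01(item))

    def union(self, other):
        # The k smallest hashes of the combined stream are always among
        # the hashes retained by the two input sketches.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest
```

Because a sketch keeps at most `k` hashes regardless of input size, sketches built per partition can be stored and later combined with `union` instead of rescanning the raw rows.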
SKETCH ALL THE THINGS
Standard Deviation     1       2       3
Confidence Interval    68%     95%     99%
K = 16                 25%     51%     77%
K = 512                4%      8%      13%
K = 4096               1%      3%      4%
K = 16384              < 1%    1%      2%
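The error figures in this table appear consistent with a relative standard error of about 1/sqrt(K - 1), multiplied by the number of standard deviations and truncated to a whole percent. That formula is inferred from the numbers, not stated on the slide, and the function name below is hypothetical.

```python
import math


def relative_error_bound(k, num_std_dev):
    """Approximate relative error for a sketch of nominal size k.

    Assumed formula: num_std_dev / sqrt(k - 1).
    """
    return num_std_dev / math.sqrt(k - 1)


# Print one row per K, with errors at 1, 2 and 3 standard deviations,
# truncated to whole percentages.
for k in (16, 512, 4096, 16384):
    row = [int(relative_error_bound(k, n) * 100) for n in (1, 2, 3)]
    print("K = %-6d" % k, "  ".join("%d%%" % p for p in row))
```

A truncated value of 0% corresponds to the "< 1%" cell. Under this assumption, halving the error requires roughly four times the sketch size.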
MORE SKETCH INFO
• Summarization, Approx. and Sampling: Tradeoffs for Improving Query, Hadoop Summit, 2015
• http://datasketches.github.io
ADVANCED QUERIES
• Desire for complex queries
• Retention, funnels, etc
• A lot can be done with UDFs
FUNNEL ANALYSIS
• Complex to write, difficult to reuse
• Slow, requires multiple joins
• Using UDFs, now runs in seconds, not hours
• https://github.com/yahoo/hive-funnel-udf
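One plausible reading of why a UDF can replace the joins: a funnel can be evaluated in a single time-ordered pass over each user's events, instead of one self-join per step. The Python sketch below shows that idea only. It is not the API of the linked hive-funnel-udf project; the function names and the event layout (user, timestamp, action) are hypothetical.

```python
from collections import defaultdict


def funnel_depth(events, steps):
    """Return how many funnel steps one user completed, in order.

    events: iterable of (timestamp, action) pairs for a single user.
    steps:  ordered list of actions that define the funnel.
    """
    depth = 0
    for _, action in sorted(events):
        if depth < len(steps) and action == steps[depth]:
            depth += 1
    return depth


def funnel_counts(rows, steps):
    """Return, for each step, the number of users who reached it.

    rows: iterable of (user, timestamp, action) tuples.
    """
    by_user = defaultdict(list)
    for user, timestamp, action in rows:
        by_user[user].append((timestamp, action))
    counts = [0] * len(steps)
    for user_events in by_user.values():
        for i in range(funnel_depth(user_events, steps)):
            counts[i] += 1
    return counts


sample_rows = [
    ("a", 1, "open"), ("a", 2, "search"), ("a", 3, "buy"),
    ("b", 1, "open"), ("b", 2, "buy"),
    ("c", 1, "search"), ("c", 2, "open"), ("c", 3, "search"),
]
print(funnel_counts(sample_rows, ["open", "search", "buy"]))  # [3, 2, 1]
```

Changing the funnel means changing the `steps` list rather than rewriting a chain of joins, which is where the reuse benefit would come from.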
REALLY FAST OLAP
• OLAP type queries are the most common
• Aggregate only queries: group, count, sum, …
• Can we optimize for such queries?
OLAP WITH DRUID
• Interactive, sub-second latency
• Ingest raw records, then aggregate
• Open source, actively developed
• http://druid.io
BI TOOL
• Many options
• Don’t cover all needs
• Need graphs and dashboards
CARAVEL
• Hive, Druid, Redshift, MySQL, …
• Simple query construction
• Open source, actively developed
• https://github.com/airbnb/caravel
WHAT WE LEARNED
• Product teams need custom data marts
• Complex to build and run
• Just want to focus on business logic
DATA MART IN A BOX!
• Generalized ETL pipeline
• Easy to spin-up
• Automatic continuous delivery
• Just give us a query!
DATA MART ARCHITECTURE
INFRASTRUCTURE WORK
• We didn’t do this alone
• Partners in grid team fixed many pain points
Y!HIVE
Hive on Tez - Interactive Queries in Shared Clusters
Dedicated Queue Metrics:
Shared Cluster Metrics:
[Figure: Increased Configurability or Increased complexity? LOC (y-axis, 0 to 4500) by Hive release (x-axis, Hive 0.2 through Hive 2.0 and Hive Master)]
Performance Tuning
• Out of the box:
  • Tez container reuse
    • set tez.am.container.reuse.enabled=true;
  • Tez speculative execution
    • set tez.am.speculation.enabled=true;
  • Reduce-side vectorization
    • set hive.vectorized.execution.reduce.enabled=true;
    • set hive.vectorized.execution.reduce.groupby.enabled=true;
Performance Tuning
• Understand your data:
  • Use ORC’s index-based filtering:
    • set hive.optimize.index.filter=true;
  • Bloom filters
    • ALTER TABLE my_orc SET TBLPROPERTIES("orc.bloom.filter.columns"="foo,bar");
    • Cardinality?
  • Sort on filter-column
    • Trade-offs: Parallelism vs. filtering
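Index-based filtering relies on min/max statistics kept per group of rows, so how much it can skip depends on how the filter column is laid out. The toy model below counts how many row groups a point lookup must read when the column is sorted versus scattered. It is not ORC code; the group size, data and function names are made up for illustration.

```python
def row_group_stats(values, group_size):
    """Return (min, max) for each consecutive group of rows."""
    return [
        (min(values[i:i + group_size]), max(values[i:i + group_size]))
        for i in range(0, len(values), group_size)
    ]


def groups_to_read(values, group_size, target):
    """Count groups whose [min, max] range could contain target."""
    stats = row_group_stats(values, group_size)
    return sum(lo <= target <= hi for lo, hi in stats)


# 10,000 distinct keys in a scattered order. 7919 is coprime with 10000,
# so this is a permutation of 0..9999.
scattered = [(i * 7919) % 10000 for i in range(10000)]
ordered = sorted(scattered)

print(groups_to_read(scattered, 1000, 4242))  # 10: every group must be read
print(groups_to_read(ordered, 1000, 4242))    # 1: only one group can match
```

Sorting concentrates the matching rows into a few groups, which is what makes the filter effective. The same concentration is one way to read the trade-off noted above: fewer groups to scan can also mean less work to spread across parallel tasks.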
Performance Tuning
• Understand your queries:
  • Prefer LIKE and INSTR over REGEXP*
  • Compile-time date/time functions:
    • current_date()
    • current_timestamp()
  • Queries generated from UI tools
Performance Improvements - ORC
• Index-based filtering available to Pig / MR users
  • HCatLoader, HCatInputFormat
• Split-calculation improvements
  • Block-based BI
  • Parallel ETL
• Disabled dictionaries for Complex data types
  • OOMs
Performance Improvements - Joins
• Skew Joins
  • Already solved for Pig
  • Hive for ETL
  • Current Hive solution: Explicit values. (Wishful thinking)
  • Poisson sampling
• Faster sorted-merge joins
  • Wide-tables
  • SpillableRowContainers
Performance Improvements – Various Sundries
• Improvements for data-discovery
  • HCatClient-based users
    • Oozie, GDM
  • 10x improvement
• Fetch Operator improvements:
  • SELECT * FROM partitioned_table LIMIT 100;
  • Lazy-load partitions
Performance Improvements: Hive’s AvroSerDe
• Avro Format is popular
  • Self describing
  • Flexible
  • Generic
  • Quirky
• Intermediate stages in pipelines
• Development
“There is no mature, no stable. The only constant is change…
... [Our] work on feeds often involves new columns, several times a day.”
The Problem
• AvroSerDe needs read-schema at job-runtime (i.e. map-side)
  • Stored on HDFS
• ETL Jobs need 10-20K maps
  • Replication factor
  • Data-node outage
• It gets steadily worse
  • Block-replication on node-loss
  • Task attempt retry
  • More nodes lost
  • Rinse and repeat
The Solution
• Reconcile metastore-schema against read-schema?
  • toAvroSchema( fromAvroSchema( avroSchema )) != avroSchema
• Store schema in TBLPROPERTIES?
• Cache read-schema during SerDe::initialize()
  • Once per map-task
• Prefetch read-schema at query-planning phase
  • Once per job
  • Separate optimizer
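The practical difference between caching per map-task and prefetching per job is how many times the schema file is read. The sketch below models that with a simple read counter. `SchemaStore`, `schema_reads` and the path are hypothetical stand-ins for illustration, not Hive or HDFS APIs.

```python
class SchemaStore:
    """Stand-in for a schema file on a shared filesystem; counts reads."""

    def __init__(self, schema_text):
        self._schema_text = schema_text
        self.reads = 0

    def read(self, path):
        self.reads += 1
        return self._schema_text


def schema_reads(store, path, num_tasks, strategy):
    """Return how many schema reads one job performs under a strategy."""
    store.reads = 0
    if strategy == "per_job":
        # Prefetch once while planning, then hand the text to every task.
        schema = store.read(path)
        task_schemas = [schema] * num_tasks
    elif strategy == "per_task":
        # Each task reads the file once when it initializes.
        task_schemas = [store.read(path) for _ in range(num_tasks)]
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    assert all(task_schemas)
    return store.reads


store = SchemaStore('{"type": "record", "name": "Example", "fields": []}')
print(schema_reads(store, "/hypothetical/schema.avsc", 20000, "per_task"))  # 20000
print(schema_reads(store, "/hypothetical/schema.avsc", 20000, "per_job"))   # 1
```

With 10-20K maps, per-task caching still sends thousands of reads to the few nodes holding the file's replicas, while prefetching at planning time reduces that to a single read per job.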
We’re not done yet
• Row-oriented format
• Skew-join
• Stats storage
Thanks
• Team effort
  • Chris Drome
  • Selina Zhang
  • Michael Natkovich
  • Olga Natkovich
  • Sameer Raheja
  • Ravi Sankurati
Q&A
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

More Related Content

What's hot

Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid DataWorks Summit
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaDataWorks Summit
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021StreamNative
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop DataWorks Summit/Hadoop Summit
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 

What's hot (20)

Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid Sherlock: an anomaly detection service on top of Druid
Sherlock: an anomaly detection service on top of Druid
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
How Tencent Applies Apache Pulsar to Apache InLong - Pulsar Summit Asia 2021
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 

Viewers also liked

Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisDataWorks Summit/Hadoop Summit
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondDataWorks Summit/Hadoop Summit
 
Solving Performance Problems on Hadoop
Solving Performance Problems on HadoopSolving Performance Problems on Hadoop
Solving Performance Problems on HadoopTyler Mitchell
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataDataWorks Summit/Hadoop Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJDaniel Madrigal
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewDataWorks Summit/Hadoop Summit
 
Open Source Ingredients for Interactive Data Analysis in Spark
Open Source Ingredients for Interactive Data Analysis in Spark Open Source Ingredients for Interactive Data Analysis in Spark
Open Source Ingredients for Interactive Data Analysis in Spark DataWorks Summit/Hadoop Summit
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
 
Workload Automation + Hadoop?
Workload Automation + Hadoop?Workload Automation + Hadoop?
Workload Automation + Hadoop?
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
 
Solving Performance Problems on Hadoop
Solving Performance Problems on HadoopSolving Performance Problems on Hadoop
Solving Performance Problems on Hadoop
 
Machine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of DataMachine Learning for Any Size of Data, Any Type of Data
Machine Learning for Any Size of Data, Any Type of Data
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJIntro to Spark with Zeppelin Crash Course Hadoop Summit SJ
Intro to Spark with Zeppelin Crash Course Hadoop Summit SJ
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture View
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Open Source Ingredients for Interactive Data Analysis in Spark
Open Source Ingredients for Interactive Data Analysis in Spark Open Source Ingredients for Interactive Data Analysis in Spark
Open Source Ingredients for Interactive Data Analysis in Spark
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
 
Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security Apache Ranger Hive Metastore Security
Apache Ranger Hive Metastore Security
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Toward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFSToward Better Multi-Tenancy Support from HDFS
Toward Better Multi-Tenancy Support from HDFS
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
SQL and Search with Spark in your browser
SQL and Search with Spark in your browserSQL and Search with Spark in your browser
SQL and Search with Spark in your browser
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
 
Big Data Security and Governance
Big Data Security and GovernanceBig Data Security and Governance
Big Data Security and Governance
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 

Similar to Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017Roy Russo
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Nicolas Poggi
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on EverythingDavid Phillips
 
BTV PHP - Building Fast Websites
BTV PHP - Building Fast WebsitesBTV PHP - Building Fast Websites
BTV PHP - Building Fast WebsitesJonathan Klein
 
Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04marc_harrison
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleData Con LA
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleBig Data Joe™ Rossi
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Managing growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsManaging growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsDataWorks Summit
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018Roy Russo
 
Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015polo li
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterJohn Adams
 
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldSearch in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldAlex Moundalexis
 

Similar to Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive (20)

Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Interactive query using hadoop
Interactive query using hadoopInteractive query using hadoop
Interactive query using hadoop
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Presto: Fast SQL on Everything
Presto: Fast SQL on EverythingPresto: Fast SQL on Everything
Presto: Fast SQL on Everything
 
BTV PHP - Building Fast Websites
BTV PHP - Building Fast WebsitesBTV PHP - Building Fast Websites
BTV PHP - Building Fast Websites
 
Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04Elasticsearch meetup final_2014_04
Elasticsearch meetup final_2014_04
 
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of AltiscaleDebugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - AltiscaleOC Big Data Monthly Meetup #5 - Session 1 - Altiscale
OC Big Data Monthly Meetup #5 - Session 1 - Altiscale
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Managing growth in Production Hadoop Deployments
Managing growth in Production Hadoop DeploymentsManaging growth in Production Hadoop Deployments
Managing growth in Production Hadoop Deployments
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015Hadoop Robot from eBay at China Hadoop Summit 2015
Hadoop Robot from eBay at China Hadoop Summit 2015
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldSearch in the Apache Hadoop Ecosystem: Thoughts from the Field
Search in the Apache Hadoop Ecosystem: Thoughts from the Field
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production



Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive

  • 1.
  • 2. FASTER, FASTER, FASTER: THE TRUE STORY OF A MOBILE ANALYTICS DATA MART ON HIVE Mithun Radhakrishnan Josh Walters
  • 3. About myself • Mithun Radhakrishnan • Hive Engineer at Yahoo • Hive Committer • Has an irrational fear of spider monkeys • mithun@apache.org • @mithunrk
  • 5. 2015 Hadoop Summit, San Jose, California
  • 6. From: The [REDACTED] ETL team To: The Yahoo Hive Team Subject: A small matter of size... Dear YHive team, We have partitioned our table using the following 6 partition keys: {hourly-timestamp, name, property, geo-location, shoe-size, and so on…}. For a given timestamp, the combined cardinality of the remaining partition-keys is about 10000/hr. If queries on partitioned tables are supposed to be faster, how come queries on our table take forever just to get off the ground? Yours gigantically, Project [REDACTED]
  • 7. ABOUT ME • Josh Walters • Data Engineer at Yahoo • I build lots of data pipelines • Can eat a whole plate of deep fried cookie dough • http://joshwalters.com • @joshwalters
  • 8. WHAT IS THE CUSTOMER NEED? • Faster ETL • Faster queries • Faster ramp up
  • 9. CASE STUDY: MOBILE DATA MART • Mobile app usage data • Optimize performance • Interactive analytics
  • 10. LOW HANGING FRUIT • Tez Tez Tez! • Vectorized query execution • Map-side aggregations • Auto-convert map join
  • 11. DATA PARTITIONING • Want thousands of partitions • Deep data partitioning • Difficult to do at scale
  • 12. DEEP PARTITIONING • Greatly helps with compression • 2015 XLDB talk on methods used • http://www.youtube.com/watch?v=P-vrzYYdfL8 • http://www-conf.slac.stanford.edu/xldb2015/lightning_abstracts.asp
  • 13. SOLID STATE DRIVES • Didn’t really help • Ended up CPU bound • Regular drives are fine
  • 14. ORC! • Used in largest data systems • 90% boost on sorted columns • 30x compression versus raw text • Fits well with our tech stack
  • 15. SKETCH ALL THE THINGS • Very accurate • Can store sketches in Hive • Union, intersection, difference • 75% boost on relevant queries
  • 16. SKETCH ALL THE THINGS
       SELECT COUNT(DISTINCT id) FROM DB.TABLE WHERE ...; -- ~100 seconds
       SELECT estimate(sketch(id)) FROM DB.TABLE WHERE ...; -- ~25 seconds
  • 17. SKETCH ALL THE THINGS
       Standard Deviation     1      2      3
       Confidence Interval    68%    95%    99%
       K = 16                 25%    51%    77%
       K = 512                4%     8%     13%
       K = 4096               1%     3%     4%
       K = 16384              < 1%   1%     2%
  • 18. MORE SKETCH INFO • Summarization, Approx. and Sampling: Tradeoffs for Improving Query, Hadoop Summit, 2015 • http://datasketches.github.io
  • 19. ADVANCED QUERIES • Desire for complex queries • Retention, funnels, etc • A lot can be done with UDFs
  • 20. FUNNEL ANALYSIS • Complex to write, difficult to reuse • Slow, requires multiple joins • Using UDFs, now runs in seconds, not hours • https://github.com/yahoo/hive-funnel-udf
  • 21. REALLY FAST OLAP • OLAP type queries are the most common • Aggregate only queries: group, count, sum, … • Can we optimize for such queries?
  • 22. OLAP WITH DRUID • Interactive, sub-second latency • Ingest raw records, then aggregate • Open source, actively developed • http://druid.io
  • 23. BI TOOL • Many options • Don’t cover all needs • Need graphs and dashboards
  • 24. CARAVEL • Hive, Druid, Redshift, MySQL, … • Simple query construction • Open source, actively developed • https://github.com/airbnb/caravel
  • 25. WHAT WE LEARNED • Product teams need custom data marts • Complex to build and run • Just want to focus on business logic
  • 26. DATA MART IN A BOX! • Generalized ETL pipeline • Easy to spin-up • Automatic continuous delivery • Just give us a query!
  • 28. INFRASTRUCTURE WORK • We didn’t do this alone • Partners in grid team fixed many pain points
  • 30. Hive on Tez - Interactive Queries in Shared Clusters • Dedicated Queue Metrics: • Shared Cluster Metrics:
  • 31. 31
  • 32. Increased Configurability or Increased complexity? [Chart: LOC (0 to 4500) by Hive release, Hive 0.2 through Hive Master]
  • 33. Performance Tuning • Out of the box:
       • Tez container reuse: set tez.am.container.reuse.enabled=true;
       • Tez speculative execution: set tez.am.speculation.enabled=true;
       • Reduce-side vectorization: set hive.vectorized.execution.reduce.enabled=true; set hive.vectorized.execution.reduce.groupby.enabled=true;
  • 34. Performance Tuning • Understand your data:
       • Use ORC’s index-based filtering: set hive.optimize.index.filter=true;
       • Bloom filters: ALTER TABLE my_orc SET TBLPROPERTIES("orc.bloom.filter.columns"="foo,bar");
       • Cardinality?
       • Sort on filter-column
       • Trade-offs: Parallelism vs. filtering
  • 35. Performance Tuning • Understand your queries:
       • Prefer LIKE and INSTR over REGEXP*
       • Compile-time date/time functions: current_date(), current_timestamp()
       • Queries generated from UI tools
  • 36. Performance Improvements - ORC • Index-based filtering available to Pig / MR users • HCatLoader, HCatInputFormat • Split-calculation improvements • Block-based BI • Parallel ETL • Disabled dictionaries for Complex data types • OOMs
  • 37. Performance Improvements - Joins • Skew Joins • Already solved for Pig • Hive for ETL • Current Hive solution: Explicit values. (Wishful thinking) • Poisson sampling • Faster sorted-merge joins • Wide-tables • SpillableRowContainers
  • 38. Performance Improvements – Various Sundries • Improvements for data-discovery • HCatClient-based users • Oozie, GDM • 10x improvement • Fetch Operator improvements: • SELECT * FROM partitioned_table LIMIT 100; • Lazy-load partitions
  • 39. Performance Improvements: Hive’s AvroSerDe • Avro Format is popular • Self describing • Flexible • Generic • Quirky • Intermediate stages in pipelines • Development
  • 40. “There is no mature, no stable. The only constant is change… ... [Our] work on feeds often involves new columns, several times a day.”
  • 41. 41
  • 42. The Problem • AvroSerDe needs read-schema at job-runtime (i.e. map-side) • Stored on HDFS • ETL Jobs need 10-20K maps • Replication factor • Data-node outage • It gets steadily worse • Block-replication on node-loss • Task attempt retry • More nodes lost • Rinse and repeat
  • 43. 43
  • 44. The Solution • Reconcile metastore-schema against read-schema? • toAvroSchema( fromAvroSchema( avroSchema )) != avroSchema • Store schema in TBLPROPERTIES? • Cache read-schema during SerDe::initialize() • Once per map-task • Prefetch read-schema at query-planning phase • Once per job • Separate optimizer
  • 45. We’re not done yet • Row-oriented format • Skew-join • Stats storage
  • 46. Thanks • Team effort • Chris Drome • Selina Zhang • Michael Natkovich • Olga Natkovich • Sameer Raheja • Ravi Sankurati
  • 47. Q&A
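
Slides 15 to 17 describe sketch-based distinct counts without showing how a sketch works. As a rough illustration only, here is a minimal K-minimum-values (KMV) sketch in Python. It is not the DataSketches implementation behind the Hive UDFs; the class name `KMVSketch`, its methods, and the choice of BLAKE2b hashing are all assumptions made for this example. The sketch retains only the K smallest hash values it has seen, so memory stays bounded in a single pass, and two sketches can be unioned by merging their retained hashes.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-minimum-values sketch for approximate distinct counts (hypothetical)."""

    def __init__(self, k):
        self.k = k
        self._heap = []         # max-heap (values stored negated) of the k smallest hashes
        self._members = set()   # the same hashes, for O(1) duplicate checks

    @staticmethod
    def _hash(item):
        # Map any item to a pseudo-uniform float in the unit interval.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2.0**64

    def _insert_hash(self, h):
        if h in self._members:
            return                                   # duplicate of a retained hash
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        self._insert_hash(self._hash(item))

    def union(self, other):
        # The k smallest hashes of the combined input are all retained by
        # one sketch or the other, so merging retained hashes is enough.
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._insert_hash(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))            # exact, barring hash collisions
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=4096)
    for i in range(50_000):
        sketch.update(f"user-{i % 25_000}")          # every id appears twice
    print(round(sketch.estimate()))                  # approximate distinct count
```

A larger K costs more memory and gives a tighter estimate, which is the same trend the error table on slide 17 shows.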

Editor's Notes

  1. This table is our largest. We use this to test and break our system.
  2. Customers always want data faster. Everyone wants data ETL’ed faster. Analysts and product owners want faster queries. Users need to ramp up quickly and be able to use the data.
  3. Mobile app data: swipes, clicks, usage time, etc. Query at the speed of thought. Analysts need results now, not in 3 hours.
  4. Tez gave our jobs massive performance improvements, though it is not used for all jobs at Yahoo. Vectorized execution is easy to enable; it performs transformations on batches of records, greatly increasing performance. Map-side aggregations limit the work done in the reduce stage by performing some of the transformations in the map stage. With auto-convert join, you don’t have to provide hints in the query (which few users do), and it can speed up lots of join queries. All of this makes things faster, but still not fast enough.
  5. More partitions mean more control over reads, less data read, and faster queries! We want really deep partitioning: multiple nested levels of partitions. But too many partitions bring other problems: too many part files clutter up your namespace, and reducers have to keep handles open to many part files, causing a slowdown. HCatalog can’t handle that many partitions; the time it takes HCatalog to look up the metadata can greatly reduce the gains from partitioning the data. We would like thousands of partitions, but have to settle for hundreds.
  6. Deep partitions group similar data together, helping with compression. We gave a talk on this at Stanford’s XLDB conference; if you want more info, watch the video!
  7. Next we wanted to see whether swapping our cluster to SSDs would help Hive performance. There was not much improvement; our jobs were mostly CPU bound. Does it make sense for intermediate task attempt output to be stored on solid state drives?
  8. Our audience data pipeline processes about 200 billion events a day. This comes out to roughly 400 Terabytes of uncompressed data a day. This compresses down to 15 Terabytes of data a day with ORC. We have to store this data for 18 months, so you can see where compression can be really important to us. Our users may also want to run queries over that whole time period, so our file format must be efficient enough to handle that.
  9. Sketches, or streaming algorithms, provide some useful features for very large datasets. Queries like distinct count are very common for analysts, and can be quite slow. Sketches can perform these queries in a single pass, with minimal memory usage. These sketches can be used for distinct counts, but also for unions, intersections, and differences. We observed a 75% speed boost on relevant queries.
  10. Information about sketches has been presented before. The code is open source, and there are UDFs for Hive and Pig.
  11. Users occasionally want to run very complex queries that would be too difficult to write in Pig or Hive. One of the most common for our users was funnel analysis. In these instances, UDFs can provide a lot of help to our data users.
  12. Funnel analysis measures how users flow through a series of actions. For example: How many people go to the signup page? How many of those people complete the information? How many of them then submit the information? Each stage should have the same or fewer users than the one before. Usually you would need multiple selects and joins to get this to work, and the query can become very large and unwieldy. We came up with a simple UDF to perform this whole process in a single map reduce job, greatly simplifying and speeding up the process. This UDF is open source, feel free to contribute!
  13. Analyst queries can commonly be answered by an OLAP system. Can these queries run with sub-second latencies? Aggregate only, no single-record results.
  14. Really fast, usable, interactive queries. You don’t have to do anything special to the data: Druid ingests the raw records and then aggregates. Open source system, lots of contributors, very actively developed.
  15. We began a search for a user interface to sit on top of these data marts. There are many options, but they don’t cover all our needs: support for many database systems, open source, actively developed, and so on. Dashboards were one of the most important features we were looking for.
  16. Caravel, out of Airbnb, was what we decided to go with. It supports Druid and any system that has a SQLAlchemy connector (which is just about everything). The project is very active, and we are contributing to it.
  17. This mobile data mart wasn’t the only one of its kind at Yahoo. We had many different teams trying to build similar systems, so we decided it would be a good idea to build a data mart framework for other teams to use. A data mart is a slice of a data warehouse: a small projection and transformation for a specific business unit, covering that unit’s use case. Analysts, marketing, and sales teams may not know Oozie, how to set up continuous delivery for data pipelines, or other data pipeline best practices. Could they just provide some ETL logic, and magically build a data mart pipeline?
  18. A data pipeline framework. Fast to spin up (less than an hour). Only needs a Hive ETL query. Comes with continuous delivery, windowed aggregates, low latency OLAP processing, and a business intelligence UI. Low latency OLAP? How?
  19. Simple architecture: by keeping it general, it is able to cover many different use cases. Features such as windowed aggregates and Druid can easily be removed if not needed. It can be made even more real-time by using a lower time granularity in the initial ETL step. We have successfully used 10 minute granularity, resulting in an almost real-time data system.
  20. The example project runs in a busy 4000-node cluster. A dedicated queue is configured to control query concurrency. For the example use case, 50% of queries run in < 0.5 min, and 75% of queries run in < 5 mins. Y! does not have the luxury of dedicated, underutilized clusters purely for interactive use. HDFS bandwidth, disk bandwidth, and network bandwidth are all shared, even if the YARN queue is different.
  21. (1min – T6 min) For batch and interactive queries; also used as an ETL tool. Supports Looker/Tableau/MicroStrategy for dashboards and ad-hoc queries. The example project runs in a busy 4000-node cluster. A dedicated queue is configured to control query concurrency. For the example use case, 50% of queries run in < 0.5 min, and 75% of queries run in < 5 mins.
  22. HiveConf.java is huge now. Configuration can be tricky.
  23. Here are some settings that you should be enabling out of the box. Container reuse: Useful not just to amortize the cost of container spin-up, but also to place task output closer to the next stage. Speculative execution: Same as in MR. Slow task-attempts can be worked around. Reduce-side vectorization: “Only” 10-30% improvements.
  24. Explain index-based filtering, aka PPD (predicate pushdown). ORC files are split into stripes, with several row-groups per stripe. Each stripe has rows stored in columnar fashion, along with column statistics, including max/min values per column. Index-based filtering skips a row-group based on your query predicate if the value doesn’t fall within the min/max limits for the row-group. Simple, right? Hive 1.2 now has Bloom filters. You can choose your columns. Greater likelihood of false positives if cardinality is large? Confirm! Sorting on a column has tradeoffs. Keeping similar column-values contiguous helps compression/encoding, and lets more rows be skipped together. But you could end up with a few tasks holding all the data to be processed.
  25. REGEXP is generic, and will perform worse than LIKE and INSTR. Prefer the latter if you don’t absolutely require REGEXP. Hive 1.2 has compile-time date/time functions, evaluated at query-build as opposed to once per row. Using BI/UI tools? Look closely at the generated queries. They might be using REGEXP, unix_timestamp(), etc. Tableau used to use “SELECT * FROM your_table LIMIT 0;” to discover metadata.
  26. Column-projection pushdown was available in Pig through HCat for some time. Now, PPD as well. We’ve improved split-calculation: Block-based BI: 1 split per block! (Checked in independently in Apache.) ETL: Not usable at large scale, in current form. Better memory usage when writing complex types in ORC, by disabling dictionaries (just for complex types).
  27. Skew-joins are available in Pig. The need for it is apparently specific to Y!. Current approach in Hive is a little clunky. We have a fix coming. Better memory usage with SpillableRowContainers, especially for wide-tables.
  28. Self describing: inline write-schema. Flexible: INT -> STRING -> struct{ INT, STRING, … }. Generic: custom read-schema to span the data-evolution interval; unions. Quirky: self referential.
  29. The loyalty to a data-format can approach fundamentalist proportions, as illustrated by this Y!Hive user, who was asked to consider ORC format for when his column-schema matures.
  30. Whatchusay?
  31. At scale, reading from a single schema-file on HDFS can be detrimental.
  32. This has gotten entirely too silly.
  33. Eugene O’Neill: “There is no present or future… Only the past happening over and over again, now.” Schema stored on disk. Statistics/histograms stored alongside data.
  34. Bucky Lasek at the X-Games in 2001. Notice where he’s looking… Not at the camera, but setting up his next trick.
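
Note 12 describes funnel analysis as counting how many users reach each step of an ordered series of actions. The idea can be sketched in plain Python as a single pass over per-user, time-ordered events. This is not the code or the API of the hive-funnel-udf project; the function name `funnel_counts` and the `(user_id, timestamp, action)` tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count how many users reached each funnel step, in order.

    events: iterable of (user_id, timestamp, action) tuples (hypothetical layout)
    steps:  ordered list of action names that make up the funnel
    """
    # Group events by user, the way a reducer would see them.
    by_user = defaultdict(list)
    for user_id, timestamp, action in events:
        by_user[user_id].append((timestamp, action))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort()                  # replay each user's events in time order
        depth = 0                           # number of funnel steps completed so far
        for _, action in user_events:
            if depth < len(steps) and action == steps[depth]:
                depth += 1
        for i in range(depth):              # the user counts toward every step reached
            counts[i] += 1
    return counts


if __name__ == "__main__":
    events = [
        ("a", 1, "signup_page"), ("a", 2, "fill_form"), ("a", 3, "submit"),
        ("b", 1, "signup_page"), ("b", 2, "fill_form"),
        ("c", 1, "signup_page"),
        ("d", 1, "fill_form"), ("d", 2, "signup_page"),   # out of order: only step 1 counts
    ]
    print(funnel_counts(events, ["signup_page", "fill_form", "submit"]))
```

Because a user only advances when the next expected step appears, each stage ends up with the same or fewer users than the stage before it, and no joins are needed.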
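
Note 24 explains index-based filtering: a row-group is skipped when the predicate value falls outside that row-group's min/max statistics, and sorting on the filter column lets more row-groups be skipped. The toy model below shows that effect. It is not ORC's reader: `RowGroup`, `build_row_groups`, and `scan_equals` are hypothetical names, and the 1,000-row group size is chosen only to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    values: List[int]   # column values stored in this row-group
    min_value: int      # per-row-group statistics, as kept in an index
    max_value: int


def build_row_groups(column: List[int], rows_per_group: int) -> List[RowGroup]:
    """Split a column into fixed-size row-groups and record min/max for each."""
    groups = []
    for start in range(0, len(column), rows_per_group):
        chunk = column[start:start + rows_per_group]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_equals(groups: List[RowGroup], target: int) -> Tuple[int, int]:
    """Evaluate `col = target`; return (matching rows, row-groups actually read)."""
    matches = 0
    groups_read = 0
    for group in groups:
        if target < group.min_value or target > group.max_value:
            continue                        # statistics rule this row-group out
        groups_read += 1
        matches += sum(1 for v in group.values if v == target)
    return matches, groups_read


if __name__ == "__main__":
    # 10,000 rows whose values are spread evenly over 0..999 in scrambled order.
    column = [(i * 7919) % 1000 for i in range(10_000)]
    unsorted_groups = build_row_groups(column, rows_per_group=1000)
    sorted_groups = build_row_groups(sorted(column), rows_per_group=1000)
    print(scan_equals(unsorted_groups, 421))   # every group spans 0..999, none skipped
    print(scan_equals(sorted_groups, 421))     # only the group covering 400..499 is read
```

Both scans find the same 10 matching rows, but the unsorted layout reads all 10 row-groups while the sorted layout reads 1. That is the filtering side of the trade-off in the note; the parallelism side (a few tasks ending up with all the relevant data) is not modeled here.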