Hive on Spark
Szehon Ho // Cloudera Software Engineer, Apache Hive PMC
© 2014 Cloudera, Inc. All rights reserved.
Background (Hive)
•  Apache Hive: SQL-based data query and management tool for a
distributed dataset
•  Founded in 2007 at Facebook; most of our customers run Hive
jobs in production.
Background (Hive)
•  Inflexibility of the MapReduce framework => inefficient Hive
•  Map(), Reduce() primitives are not designed for long data pipelines
•  Complex SQL-like queries are inefficiently expressed as many MR stages.
•  Disk IO between MR stages
•  Shuffle-sort between map and reduce
[Figure: Hive query executed as a chain of Map()/Red() stages, with HDFS between stages]
Background (Hive)
•  In 2013, the Hive community started work on Hive on Tez
•  Tez runs a query as a DAG execution graph
[Figure: Hive query executed as a single Tez DAG of Map() and Red() vertices, writing to HDFS]
Background (Spark)
•  Generalized distributed processing framework created in ~2011 by
UC Berkeley AMPLab
•  Popular framework, positioned to succeed MapReduce
Background (Spark)
•  Clean programming abstraction: Resilient Distributed Dataset (RDD):
•  A fault-tolerant dataset, can be a stage in a data pipeline.
•  Created from an existing data set like an HDFS file, or by transformation from another RDD (chain-up RDD's)
•  Expressive API's, much more than MapReduce
•  Transformations: map, filter, groupBy
•  Actions and persistence: save, cache
•  => More efficient representation of Hive queries
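The split between transformations and actions can be illustrated with a toy, single-process stand-in for an RDD. This is only a sketch in plain Python: `ToyRDD` and its methods are hypothetical names made up for illustration and are not Spark's API. Transformations only record a step in the chain; nothing runs until the `collect()` action is called.

```python
from collections import defaultdict


class ToyRDD:
    """Hypothetical, single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source        # any list of rows
        self._steps = tuple(steps)   # recorded transformations (the chain)

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def group_by(self, key_fn):
        return ToyRDD(self._source, self._steps + (("group_by", key_fn),))

    # Action: replays the recorded steps over the source data.
    def collect(self):
        rows = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            elif kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:  # group_by
                groups = defaultdict(list)
                for r in rows:
                    groups[fn(r)].append(r)
                # Sorted only so this toy prints deterministically.
                rows = sorted(groups.items())
        return rows


updates = [("us", "2014-10-01"), ("eu", "2014-10-01"),
           ("us", "2014-09-30"), ("us", "2014-10-01")]
chain = (ToyRDD(updates)
         .filter(lambda r: r[1] == "2014-10-01")
         .map(lambda r: r[0])
         .group_by(lambda region: region))
result = chain.collect()
print(result)  # [('eu', ['eu']), ('us', ['us', 'us'])]
```

The whole pipeline is a single chained object, which is the property that lets a multi-stage Hive query be expressed without writing intermediate results between stages.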
  
Background (Spark)
•  Community Momentum:
•  Spark Summit 2014: already the most active project in the Hadoop ecosystem and among the top
3 most active Apache projects.
•  Since Spark 1.0 in June, two more major releases: 1.1 and 1.2
[Figure: Spark compared to other projects (MapReduce, YARN, HDFS, Storm): commits and lines of code changed, activity in past 6 months]
Background (Spark)
•  Community Momentum:
•  Advanced analytics, data science, ML, graph processing, etc.
•  Integration with many Hadoop tools, e.g. Pig, Flume, Mahout, Crunch, Solr
•  Hive jobs can now leverage these Spark clusters as well
Hive on Spark
•  Shark Project:
•  AMPLab github project, fork of Hive
•  Not maintained by Hive community, sunsetted 2014
•  Hive on Spark:
•  Done in Hive community
•  Architecturally compatible, by keeping same physical abstraction for Hive on Spark as Hive on Tez/MR.
•  Code maintenance
•  Maximize re-use of common functionality across execution engines
High-Level Design
[Figure: Hive Query -> Logical Op Tree -> TaskCompiler (MapRedCompiler, TezCompiler, SparkCompiler) -> Task (MapRedTask, TezTask, SparkTask); each Task holds Work (MapWork, ReduceWork)]
Common across engines:
•  HQL syntax
•  Tool integrations (auditing plugins, authorization, drivers, Thrift clients, UDF, StorageHandler)
•  Logical optimizations
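The layering on this slide can be sketched as a toy model in plain Python. Everything here is an assumption made for illustration: `Work`, `Task`, `split_into_works` and `compile_task` are hypothetical names, not Hive classes, and cutting a linear operator list after each ReduceSink is a simplification of what a real task compiler does. The point is only that the same Work units come out regardless of the engine, and just the Task wrapper differs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    kind: str               # "MapWork" or "ReduceWork", as named on the slide
    operators: List[str]


@dataclass
class Task:
    engine: str             # "mr", "tez" or "spark" (labels chosen for this sketch)
    works: List[Work] = field(default_factory=list)


def split_into_works(op_tree: List[str]) -> List[Work]:
    """Toy rule: close the current Work after each ReduceSink operator."""
    works, current = [], []
    for op in op_tree:
        current.append(op)
        if op == "ReduceSink":
            works.append(Work("ReduceWork" if works else "MapWork", current))
            current = []
    if current:
        works.append(Work("ReduceWork" if works else "MapWork", current))
    return works


def compile_task(engine: str, op_tree: List[str]) -> Task:
    """Each engine reuses the same Work units; only the Task wrapper changes."""
    if engine not in ("mr", "tez", "spark"):
        raise ValueError(f"unknown engine: {engine}")
    return Task(engine, split_into_works(op_tree))


ops = ["TableScan", "Filter", "Select", "GroupBy", "ReduceSink",
       "GroupBy", "Select", "FileOutput"]
spark_task = compile_task("spark", ops)
mr_task = compile_task("mr", ops)
print([w.kind for w in spark_task.works])  # ['MapWork', 'ReduceWork']
```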
Simple Example
Hive Query:
SELECT COUNT(*) FROM status_updates
WHERE ds = '2014-10-01' GROUP BY region;
Operator Tree:
TableScan (status_updates) -> Filter (ds='2014-10-01') -> Select (region) -> Group-By (count) -> Select
GBY triggers the reduce boundary.
Simple Example
MapRed Work Tree:
•  Map->Reduce
Mapper: TableScan -> Filter -> Select -> Group-By -> ReduceSink
ShuffleSort
Reducer: GroupBy -> Select -> FileOutput
Simple Example
Spark Work Tree:
•  RDD Chain
mapPartition(): TableScan -> Filter -> Select -> Group-By -> ReduceSink
groupBy() (no sorting)
mapPartition(): GroupBy -> Select -> FileOutput
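Why a group-by does not need a sort can be shown with a small plain-Python sketch of the plan for this query. It is an illustration under assumptions, not Hive or Spark code: `map_side`, `shuffle_by_hash` and `reduce_side` are hypothetical helper names, and rows are modeled as `(region, ds)` tuples. Keys are routed to a reducer by hash, so equal keys meet in the same bucket without any ordering step.

```python
from collections import Counter, defaultdict


def map_side(partition, ds):
    """TableScan -> Filter -> Select -> partial Group-By for one input split."""
    return Counter(region for region, row_ds in partition if row_ds == ds)


def shuffle_by_hash(partials, num_reducers):
    """Route each (region, partial_count) pair to a reducer by key hash; no sort."""
    buckets = defaultdict(list)
    for partial in partials:
        for region, count in partial.items():
            buckets[hash(region) % num_reducers].append((region, count))
    return buckets


def reduce_side(bucket):
    """Final Group-By: merge the partial counts that share a key."""
    totals = Counter()
    for region, count in bucket:
        totals[region] += count
    return dict(totals)


splits = [
    [("us", "2014-10-01"), ("eu", "2014-10-01")],
    [("us", "2014-10-01"), ("us", "2014-09-30")],
]
partials = [map_side(split, "2014-10-01") for split in splits]
buckets = shuffle_by_hash(partials, num_reducers=2)

result = {}
for bucket in buckets.values():
    result.update(reduce_side(bucket))  # each region lives in exactly one bucket
print(result == {"us": 2, "eu": 1})  # True
```

A MapReduce shuffle would additionally sort every bucket by key before the reducer sees it; for a plain count per group that ordering is wasted work.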
Join Example
Hive Query:
SELECT * FROM
(SELECT key FROM src WHERE src.key < 10) src1
JOIN
(SELECT key FROM src WHERE src.key < 10) src2
ON src1.key = src2.key
ORDER BY src1.key;
[Figure: operator tree: two branches (TableScan -> Filter -> Select) -> Join -> Select -> Sort -> Select]
Join Example
MapRed Work Tree:
•  2 MapReduce Works
[Figure: first MapReduce work: two Map branches (TableScan -> Filter -> Select -> Reduce Sink) -> ShuffleSort -> Reduce (Join -> Select -> FileOutput) -> HDFS (disk IO); second MapReduce work: Map (TableScan -> ReduceSink (Sort)) -> ShuffleSort -> Reduce (Select -> FileOutput)]
Join Example
Spark Work Tree:
•  RDD Transform Chain
[Figure: two mapPartition() branches (TableScan -> Filter -> Select -> Reduce Sink) -> union() -> Partition/Sort() -> mapPartition() (Join -> Select -> Reduce Sink) -> sortBy() -> mapPartition() (Select -> FileOutput); no spill to disk between stages]
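The shape of this chain can be mimicked in a few lines of plain Python. This is a single-process sketch under assumptions, not Hive or Spark code: `scan_filter_select` and `join_pipeline` are hypothetical names, the `src` table is modeled as a list of integer keys, and one partition stands in for the whole cluster. Each comment names the step of the slide it corresponds to.

```python
from itertools import groupby
from operator import itemgetter


def scan_filter_select(src, tag):
    """One map-side branch: TableScan -> Filter (key < 10) -> Select, tagged by side."""
    return [(key, tag) for key in src if key < 10]


def join_pipeline(src):
    # union(): both branches feed one dataset of (key, tag) pairs.
    tagged = scan_filter_select(src, 0) + scan_filter_select(src, 1)
    # Partition/Sort(): bring equal keys together (one partition here for brevity).
    tagged.sort(key=itemgetter(0))
    joined = []
    for key, rows in groupby(tagged, key=itemgetter(0)):
        rows = list(rows)
        left = sum(1 for _, tag in rows if tag == 0)
        right = len(rows) - left
        # Join: emit one (src1.key, src2.key) row per matching pair.
        joined.extend([(key, key)] * (left * right))
    # sortBy(): ORDER BY src1.key (already ordered here, kept to mirror the plan).
    return sorted(joined, key=itemgetter(0))


rows = join_pipeline([5, 12, 5, 3])
print(rows)  # [(3, 3), (5, 5), (5, 5), (5, 5), (5, 5)]
```

All intermediate lists stay in memory from the scan to the final sort, which is the contrast with the two-job MapReduce plan that writes the join result to HDFS before sorting it.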
Demo
Improvements to Spark
•  Largest MR Java app ported on to Spark, can serve as reference.
•  Spark Umbrella JIRA for improvements needed by Hive: SPARK-3145
•  Implement Java version of Scala API's (various), shade Spark Guava Library: SPARK-2848
•  Monitoring API's (SPARK-2636, various)
•  Shuffle-Sort Transform: SPARK-2978
•  Spark had group(), sort(), but not partition+sort like MR-style shuffle-sort.
•  Elastic scaling of Spark application: SPARK-3174
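The difference between a global sort and an MR-style partition+sort can be sketched in plain Python. The helper below is hypothetical and written only for illustration; it does not call Spark. In Spark itself, the transformation that came out of SPARK-2978 is, as far as can be told, `repartitionAndSortWithinPartitions`, but treat that mapping as an assumption. Keys are first routed to a partition, then each partition is sorted on its own, so order holds within a partition but not across partitions.

```python
def partition_and_sort(pairs, num_partitions, partitioner=hash):
    """Route (key, value) pairs to partitions by key, then sort each partition by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partitioner(key) % num_partitions].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])  # ordered within a partition only
    return partitions


# Integer keys hash to themselves in Python, so the routing below is deterministic.
parts = partition_and_sort([(3, "c"), (4, "d"), (1, "a"), (2, "b")], num_partitions=2)
print(parts)  # [[(2, 'b'), (4, 'd')], [(1, 'a'), (3, 'c')]]
```

This is the shape a reduce-side join or group-by needs: every reducer sees its own keys in order, without paying for a total order over the whole dataset.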
  
Community
•  Thanks to contributors from many organizations:
•  Follow our progress on HIVE-7292
•  Thank you!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Hive on Spark: An Efficient Way to Run SQL Queries

  • 1. Hive on Spark Szehon Ho // Cloudera Software Engineer, Apache Hive PMC
  • 2. © 2014 Cloudera, Inc. All rights reserved. Background (Hive)
      •  Apache Hive: SQL-based data query and management tool for a distributed dataset
      •  Founded in 2007 at Facebook; most of our customers run Hive jobs in production.
  • 3. Background (Hive)
      •  Inflexibility of MapReduce framework => Inefficient Hive
      •  Map(), Reduce() primitives, not designed for long data pipelines
      •  Complex SQL-like queries inefficiently expressed as many MR stages.
      •  Disk IO between MR’s
      •  Shuffle-sort between M+R
      •  [Diagram: Hive Query; three Map()/Red() stages; HDFS]
  • 4. Background (Hive)
      •  In 2013, the Hive community started work on Hive on Tez
      •  Tez DAG execution graph
      •  [Diagram: Hive Query; Map() Red(), Map() Red(), Red(); HDFS]
  • 5. Background (Spark)
      •  Generalized distributed processing framework created in ~2011 by UC Berkeley AMPLab
      •  Popular framework, on track to succeed MapReduce
  • 6. Background (Spark)
      •  Clean programming abstraction: Resilient Distributed Dataset (RDD):
      •  A fault-tolerant dataset, can be a stage in a data pipeline.
      •  Created from an existing data set like an HDFS file, or by transformation from another RDD (chain-up RDD’s)
      •  Expressive API’s, much more than MapReduce
      •  Transformations: map, filter, groupBy
      •  Actions: cache, save
      •  => More efficient representation of Hive queries
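The RDD idea on slide 6 (a dataset defined as a chain of transformations, evaluated only when an action asks for results) can be sketched in plain Python. This is a minimal illustration, not Spark's API: the class `TinyDataset` and its methods `from_records`, `map`, `filter`, `group_by`, and `collect` are hypothetical names invented for this sketch, and the sample records are made up.

```python
from collections import defaultdict


class TinyDataset:
    """Hypothetical stand-in for an RDD: it stores a recipe, not the data."""

    def __init__(self, compute):
        # compute is a zero-argument function that returns a fresh iterator.
        self._compute = compute

    @classmethod
    def from_records(cls, records):
        # "Created from an existing data set": wrap an in-memory list.
        return cls(lambda: iter(records))

    def map(self, fn):
        # Transformation: returns a new dataset, runs nothing yet.
        return TinyDataset(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        # Transformation: returns a new dataset, runs nothing yet.
        return TinyDataset(lambda: (x for x in self._compute() if pred(x)))

    def group_by(self, key_fn):
        # Transformation: grouping happens only when the chain is evaluated.
        def run():
            groups = defaultdict(list)
            for x in self._compute():
                groups[key_fn(x)].append(x)
            return iter(groups.items())

        return TinyDataset(run)

    def collect(self):
        # Action: evaluates the whole chain and returns the results.
        return list(self._compute())


rows = TinyDataset.from_records([("us", 3), ("eu", 5), ("us", 7), ("eu", 1)])
big = rows.filter(lambda r: r[1] > 2).map(lambda r: (r[0], r[1] * 10))
by_region = big.group_by(lambda r: r[0])
result = dict(by_region.collect())
```

Because each dataset keeps only the recipe for producing its records, the chain can be re-run from the source at any time, which is the intuition behind rebuilding lost data from lineage.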
  • 7. Background (Spark)
      •  Community Momentum:
      •  Spark Summit 2014: Already the most active project in the Hadoop ecosystem and one of the top 3 most active Apache projects.
      •  Since Spark 1.0 in June, two more major releases: 1.1 and 1.2
      •  [Charts: "Compared to Other Projects", activity in past 6 months: commits and lines of code changed for MapReduce, YARN, HDFS, Storm, Spark]
  • 8. Background (Spark)
      •  Community Momentum:
      •  Advanced analytics, data science, ML, graph processing, etc.
      •  Integration with many Hadoop tools, e.g. Pig, Flume, Mahout, Crunch, Solr
      •  Hive jobs can now leverage these Spark clusters as well
  • 9. Hive on Spark
      •  Shark Project:
      •  AMPLab GitHub project, fork of Hive
      •  Not maintained by the Hive community, sunset in 2014
      •  Hive on Spark:
      •  Done in the Hive community
      •  Architecturally compatible, by keeping the same physical abstraction for Hive on Spark as for Hive on Tez/MR.
      •  Code maintenance
      •  Maximize re-use of common functionality across execution engines
  • 10. High-Level Design
      •  [Diagram: Hive Query, Logical Op Tree, TaskCompiler (MapRedCompiler, TezCompiler, SparkCompiler), Task (MapRedTask, TezTask, SparkTask), Work (MapWork, ReduceWork)]
      •  Common across engines:
      •  HQL syntax
      •  Tool Integrations (auditing plugins, authorization, Drivers, Thrift clients, UDF, StorageHandler)
      •  Logical optimizations
  • 11. Simple Example
      •  Hive Query: SELECT COUNT(*) from status_updates where ds = '2014-10-01' group by region;
      •  Operator Tree: TableScan (status_updates), Filter (ds='2014-10-01'), Select (region), Group-By (count), Select
      •  GBY triggers the reduce boundary
  • 12. Simple Example
      •  MapRed Work Tree:
      •  Mapper: TableScan, Filter, Select, Group-By, ReduceSink
      •  Map->Reduce ShuffleSort
      •  Reducer: GroupBy, Select, FileOutput
  • 13. Simple Example
      •  Spark Work Tree: RDD Chain
      •  mapPartition(): TableScan, Filter, Select, Group-By, ReduceSink
      •  groupBy(): No sorting
      •  mapPartition(): GroupBy, Select, FileOutput
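The difference between slides 12 and 13 (a sort-based shuffle versus a grouping step with no sorting) can be illustrated in plain Python for the COUNT(*) ... GROUP BY region query. This is a sketch under stated assumptions, not Hive or Spark code: the sample rows and the function names `map_side`, `count_with_shuffle_sort`, and `count_with_hash_grouping` are hypothetical.

```python
from collections import Counter
from itertools import groupby

# Made-up sample rows standing in for the status_updates table.
status_updates = [
    {"region": "emea", "ds": "2014-10-01"},
    {"region": "apac", "ds": "2014-10-01"},
    {"region": "emea", "ds": "2014-10-01"},
    {"region": "amer", "ds": "2014-10-01"},
    {"region": "emea", "ds": "2014-09-30"},
]


def map_side(rows, ds):
    # TableScan -> Filter (ds) -> Select (region): emit one key per matching row.
    return [row["region"] for row in rows if row["ds"] == ds]


def count_with_shuffle_sort(regions):
    # MapReduce style: sort all keys first so that each key forms one
    # contiguous run, then let the "reducer" count each run.
    return {key: sum(1 for _ in run) for key, run in groupby(sorted(regions))}


def count_with_hash_grouping(regions):
    # Grouping without sorting: a hash table is enough to count per key.
    return dict(Counter(regions))


regions = map_side(status_updates, "2014-10-01")
sorted_counts = count_with_shuffle_sort(regions)
hashed_counts = count_with_hash_grouping(regions)
```

Both functions return the same counts; the second one simply skips the sort, which an aggregation like COUNT does not need.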
  • 14. Join Example
      •  Hive Query: SELECT * FROM (SELECT key FROM src WHERE src.key < 10) src1 JOIN (SELECT key FROM src WHERE src.key < 10) src2 ON src1.key = src2.key ORDER BY src1.key;
      •  Operator Tree: two branches (TableScan, Filter, Select) feeding Join, Select, Sort, Select
  • 15. Join Example
      •  MapRed Work Tree: 2 MapReduce Works
      •  Work 1: Map (TableScan, Filter, Select, ReduceSink) for each side of the join; ShuffleSort; Reduce (Join, Select, FileOutput)
      •  Disk IO through HDFS between the two works
      •  Work 2: Map (TableScan, ReduceSink (Sort)); ShuffleSort; Reduce (Select, FileOutput)
  • 16. Join Example
      •  Spark Work Tree: RDD Transform Chain
      •  Two mapPartition() stages (TableScan, Filter, Select, ReduceSink), one per side of the join
      •  union(), then Partition/Sort()
      •  mapPartition() (Join, Select, ReduceSink), then sortBy()
      •  mapPartition() (Select, FileOutput)
      •  No spill to disk
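The join chain on slide 16 can be followed end to end in a small in-memory Python sketch: both filtered inputs are tagged and unioned, rows are grouped by key, joined per key, and the result is sorted once at the end with no intermediate write. The sample `src` rows and the function names `scan_filter_select` and `join_and_sort` are hypothetical; this mirrors the shape of the chain only, not Hive's actual operators.

```python
from collections import defaultdict

# Made-up (key, value) rows standing in for the src table.
src = [(12, "val_12"), (3, "val_3"), (8, "val_8"), (3, "val_3"), (15, "val_15"), (0, "val_0")]


def scan_filter_select(rows, tag):
    # TableScan -> Filter (key < 10) -> Select key, tagged with the join side.
    return [(key, tag) for key, _ in rows if key < 10]


def join_and_sort(rows):
    left = scan_filter_select(rows, 0)
    right = scan_filter_select(rows, 1)

    # union() of both sides, then group by key, keeping the sides apart.
    by_key = defaultdict(lambda: ([], []))
    for key, tag in left + right:
        by_key[key][tag].append(key)

    # Join: pair every left row with every right row that shares its key.
    joined = [(l, r) for ls, rs in by_key.values() for l in ls for r in rs]

    # ORDER BY src1.key: one final sort, all in memory.
    return sorted(joined, key=lambda pair: pair[0])


joined_rows = join_and_sort(src)
```

In the MapReduce plan on slide 15 the joined rows would be written to HDFS and read back by a second job before sorting; here the sort is just the next step in the same chain.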
  • 17. Demo
  • 18. Improvements to Spark
      •  Largest MR Java app ported on to Spark, can serve as reference.
      •  Spark Umbrella JIRA for improvements needed by Hive: SPARK-3145
      •  Implement Java version of Scala API’s (various), shade Spark Guava Library: SPARK-2848
      •  Monitoring API’s (SPARK-2636, various)
      •  Shuffle-Sort Transform: SPARK-2978
      •  Spark had group(), sort(), but not partition+sort like MR-style shuffle-sort.
      •  Elastic scaling of Spark application: SPARK-3174
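The "partition+sort" behaviour that slide 18 describes as missing (an MR-style shuffle-sort) can be sketched in plain Python: records are first routed to a partition by key, and each partition is then sorted by key on its own, with no global sort across partitions. The function name `partition_and_sort` and the sample pairs are hypothetical, and integer keys are assumed so that Python's `hash` gives a stable partition assignment.

```python
def partition_and_sort(pairs, num_partitions):
    # Step 1 (partition): route each (key, value) pair by a hash of its key,
    # so that all records with the same key land in the same partition.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))

    # Step 2 (sort): sort each partition by key independently. Python's sort
    # is stable, so records with equal keys keep their arrival order.
    return [sorted(part, key=lambda kv: kv[0]) for part in partitions]


pairs = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c"), (2, "b2")]
shuffled = partition_and_sort(pairs, 2)
```

Within each partition a reducer-style consumer sees every key as one contiguous, ordered run, which is what operators written for MapReduce's shuffle expect.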
  • 19. Community
      •  Thanks to contributors from many organizations
      •  Follow our progress on HIVE-7292
      •  Thank you!