SlideShare a Scribd company logo
1 of 30
Download to read offline
Hive Join Optimizations:
MR and Spark
Szehon Ho
@hkszehon
Cloudera Software Engineer, Hive Committer and PMC
2© 2014 Cloudera, Inc. All rights reserved.
Background
•  Joins were one of the more challenging pieces of the Hive on Spark
project
•  Many joins added throughout the years in Hive
•  Common (Reduce-side) Join
•  Broadcast (Map-side) Join
•  Bucket Map Join
•  Sort Merge Bucket Join
•  Skew Join
•  More to come
•  Share our research on how different joins work in MR
•  Share how joins are implemented in Hive on Spark
3© 2014 Cloudera, Inc. All rights reserved.
Common Join
•  Known as Reduce-side join
•  Background: Hive (Equi) Join High-Level Requirement:
•  Scan n tables
•  Rows with same value on joinKeys are combined -> Result
•  Process:
•  Mapper: scan, process n tables and produces HiveKey = {JoinKey, TableAlias}, Value = {row}
•  Shuffle Phase:
•  JoinKey used to hash rows of same joinKey value to same reducer
•  TableAlias makes sure reducers gets rows in sorted order by origin table
•  Reducer: Join operator combine rows from different tables to produce JoinResult
•  Worst performance
•  All table data is shuffled around
4© 2014 Cloudera, Inc. All rights reserved.
Common Join
•  Ex: Join by CityId
•  CityId=1 goes to First Reducer (sorted by table)
•  CityId=2 goes to Second Reducer (sorted by table)
CityId CityName
1 San Jose
2 SF
CityId Sales
1 500
2 600
2 400
CityId TableAlias Row Value
1 C San Jose
1 S 500
CityId TableAlias Row Value
2 C SF
2 S 600
2 S 400
{1, San Jose, 500}
{2, San Jose, 600}
{2, San Francisco, 400}
Mapper (Reduce Sink) Reducer (Join Operator)
Cities: C
Sales: S
{1, C}
{1, S}
{2, C}
{2, S}
{2, S}
HiveKey Result
5© 2014 Cloudera, Inc. All rights reserved.
Common Join (MR)
TS Sel/FIl RS
TS Sel/FIl RS
Join Sel/FIl FileSinkOperator Tree
MR Work Tree
MapRedWork
ReduceWorkMapWork
TS Sel/FIl RS
TS Sel/FIl RS
Join Sel/FIl FileSink
Produces HIveKey
Execute on Mapper Execute on Reducer
6© 2014 Cloudera, Inc. All rights reserved.
Common Join
Spark Work Tree
SparkWork
MapWork
TS Sel/FIl ReduceSink
MapWork
TS Sel/FIl ReduceSink
ReduceWork
Join Sel/FIl ReduceSink
union() RepartitionAndSort
WithinPartitions()
Shuffle-Sort Transform
(SPARK-2978)
mapPartition()
mapPartition()
MapWork
MapWork
mapPartition()
ReduceWork
Spark RDD
Transforms
In Spark:
•  Table = RDD
•  Data Operation = RDD transformation
Table RDD
Table RDD
7© 2014 Cloudera, Inc. All rights reserved.
MapJoin
•  Known as Broadcast join
•  Create hashtable from (n-1) small table(s) keyed by Joinkey, broadcasted them in-memory to
mappers processing big-table.
•  Each big-table mapper does lookup of joinkey in small table(s) hashmap -> Join Result
•  Ex: Join by “CityId”
CityId CityName
1 San Jose
2 San Francisco
CityId Sales
1 500
2 600
2 400
CityId Sales
1 700
2 200
2 100
Small Table (HashTable) Big Table (Mapper)
{1, San Jose, 500}
{2, San Francisco, 600}
{2, San Francisco, 400}
{1, San Jose, 700}
{2, San Francisco, 200}
{2, San Francisco, 200}
8© 2014 Cloudera, Inc. All rights reserved.
MapJoin Overview (MR)
HS2
Node1
(Small Table Data)
Node2
(Small Table Data)
Node3 Node4
1. Local Work read,
process small table
HS2
Node3 Node4
2. Create/Upload hashtable
file to distributed cache
Node1 Node2
HS2
Node1
(Big Table)
3. Big Table Mapper
Reads hashTable from
Distributed Cache
Node2
Node3
(Big Table)
Node4
(Big Table)
LocalWork
MapWork MapWork
MapWork
LocalWork
•  More efficient than common join
•  Only small-table(s) are moved around
9© 2014 Cloudera, Inc. All rights reserved.
MapJoin Overview (Spark)
•  Spark Work for Mapjoin very similar to MR Version, use hashtable file
with high replication factor
•  Note1: We run the small-table processing non-local (parallel)
•  Note2: Consideration of Spark broadcast variables for broadcast
10© 2014 Cloudera, Inc. All rights reserved.
MapJoin Decision Implementation
•  Memory req: N-1 tables need to fit into mapper memory
•  Two ways Hive decides a mapjoin
•  Query Hints:
•  SELECT /*+ MAPJOIN(cities) */ * FROM cities JOIN sales on cities.cityId=sales.cityId;
•  Auto-converesion based on file-size (“hive.auto.convert.join”)
•  If N-1 small tables smaller than: “hive.mapjoin.smalltable.filesize”
11© 2014 Cloudera, Inc. All rights reserved.
MapJoin Optimizers
•  Multiple decision-points in query-planning = Different Optimizer Paths
•  MapJoin Optimizers are processors that convert Query Plan
•  “Logical (Compile-time) optimizers” modify a operator-tree, if known at compile-time
how to optimize to mapjoin
•  “Physical (Runtime) optimizers” modify a physical work (MapRedWork, TezWork,
SparkWork), involves more-complex conditional task, when Hive has no info at
compile-time
MapRedLocalWork
TS (Small
Table)
Sel/FIl HashTableSink
MapRedWork
MapWork
TS
(Big Table)
Sel/
FIl MapJoin
Sel/
FIl
FileSink
HashTableDummy
Local
Work
Logical Optimizers => TS (Big Table) Sel/FIl
TS (Small
Table)
Sel/FIl RS
MapJoin Sel/FIl FileSink
Physical Optimizers =>
12© 2014 Cloudera, Inc. All rights reserved.
MapJoin Optimizers (MR)
•  Query Hint: Big/Small Table(s) known at compile-time from hints.
•  Logical Optimizer: MapJoinProcessor
•  Auto-conversion: Table size not known at compile-time
•  Physical Optimizer: CommonJoinResolver, MapJoinResolver.
•  Create Conditional Tasks with all big/small table possibilities: one picked at runtime
•  Noconditional mode: For some cases, table file-size is known at compile-time and can skip
conditional task, but cannot do this for all queries (join of intermediate results..)
MapRedLocalWork
Cities to HashTable
MapRedWork
Sales to MapJoin
MapRedLocalWork
Sales to HashTable
MapRedWork
Cities to MapJoin
MapRedLocalWork
Cities Sales
Join
Condition1 (Cities Small) Condition2 (Sales Small) Condition3 (Neither Small Enough)
13© 2014 Cloudera, Inc. All rights reserved.
MapJoin Optimizers: Spark
•  Spark Plan: Support for both query-hints and auto-conversion
decisions.
•  Query Hints
•  Logical Optimizer: Reuse MapJoinProcessor
•  Auto-conversion: Use statistics annotated on operators that give estimated
output size (like Tez, CBO), so big/small tables known at compile-time too
•  Logical Optimizer: SparkMapJoinOptimizer
Logical Optimizers =>
MapJoin Operators
TS (Big Table) Sel/FIl
TS (Small
Table)
Sel/FIl RS
MapJoin Sel/FIl FileSink
14© 2014 Cloudera, Inc. All rights reserved.
BucketMapJoin
•  Bucketed tables: rows hash to different bucket files based on bucket-key
•  CREATE TABLE cities (cityid int, value string) CLUSTERED BY (cityId) INTO 2 BUCKETS;
•  Join tables bucketed on join key: For each bucket of table, rows with matching joinKey
values will be in corresponding bucket of other table
•  Like Mapjoin, but big-table mappers load to memory only relevant small-table bucket’s
hashmap
•  Ex: Bucketed by “CityId”, Join by “CityId”
CityId CityName
3 New York
1 San Jose
CityId CityName
2 San Francisco
4 Los Angeles
CityId Sales
1 500
3 6000
1 400
CityId Sales
4 50
2 200
4 45
{1, San Jose, 500}
{3, New York, 6000}
{1, San Jose, 400}
{4, Los Angeles, 50}
{2, San Francisco, 200}
{4, Los Angeles, 45}
15© 2014 Cloudera, Inc. All rights reserved.
Bucket MapJoin Execution
•  Very similar to MapJoin
•  HashTableSink (small-table) writes per-bucket instead of per-table
•  HashTableLoader (big-table mapper) reads per-bucket
16© 2014 Cloudera, Inc. All rights reserved.
BucketMapJoin Optimizers (MR, Spark)
•  Memory Req: Corresponding bucket(s) of small table(s) fit into memory of big
table mapper (less than mapjoin)
•  MR:
•  Query hint && “hive.optimize.bucket.mapjoin”, all information known at compile-time
•  Logical Optimizer: MapJoinProcessor (intermediate operator tree)
•  Spark:
•  Query hint && “hive.optimize.buckert.mapjoin”
•  Logical Optimizer: Reuse MapJoinProcessor
•  Auto-Trigger, done via stats like mapjoin (size calculation estimated to be size/numBuckets)
•  Logical Optimizer: SparkMapJoinOptimizer, does size calculation of small tables via statistics, divides original
number by numBuckets
17© 2014 Cloudera, Inc. All rights reserved.
SMB Join
•  CREATE TABLE cities (cityid int, cityName string) CLUSTERED BY (cityId)
SORTED BY (cityId) INTO 2 BUCKETS;
•  Join tables are bucketed and sorted (per bucket)
•  This allows sort-merge join per bucket.
•  Advance table until find a match
CityId CityName
1 San Jose
3 New York
CityId Sales
1 500
1 400
3 6000
CityId Sales
2 200
4 50
4 45
CityId CityName
2 San Francisco
4 Los Angeles
{1, San Jose, 500}
{1, San Jose, 400}
{3, New York, 6000}
{2, San Francisco, 200}
{4, Los Angeles, 50}
{4, Los Angeles, 45}
18© 2014 Cloudera, Inc. All rights reserved.
SMB Join
•  Same Execution in MR and Spark
•  Run mapper process against a “big-table”, which loads corresponding small-table buckets
•  Mapper reads directly from small-table, no need to create, broadcast small-table hashmap.
•  No size limit on small table (no need to load table into memory)
Node: Small
Table Bucket 2
Node: Small
Table Bucket 1
Node: Big Table Bucket
1
Node: Big Table Bucket
2
MR: Mappers
Spark: MapPartition() Transform
MapWork
MapWork
HS2
19© 2014 Cloudera, Inc. All rights reserved.
SMB Join Optimizers: MR
•  SMB plan needs to identify ‘big-table’: one that mappers run against, will
load ‘small-tables’. Generally can be determined at compile-time
•  User gives query-hints to identify small-tables
•  Triggered by “hive.optimize.bucketmapjoin.sortedmerge”
•  Logical Optimizer: SortedMergeBucketMapJoinProc
•  Auto-trigger: “hive.auto.convert.sortmerge.join.bigtable.selection.policy”
class chooses big-table
•  Triggered by “hive.auto.convert.sortmerge.join”
•  Logical Optimizer: SortedBucketMapJoinProc
TS (Big Table) Sel/FIl
TS (Small
Table)
Sel/FIl DummyStore
SMBMapJoin Sel/FIl FileSink
Logical Optimizers:
SMB Join Operator
20© 2014 Cloudera, Inc. All rights reserved.
SMB Join Optimizers: Spark
•  Query-hints
•  Logical Optimizer: SparkSMBJoinHintOptimizer
•  Auto-Conversion
•  Logical Optimizer: SparkSortMergeJoinOptimizer
TS (Big Table) Sel/FIl
TS (Small
Table)
Sel/FIl DummyStore
SMBMapJoin Sel/FIl FileSink
Logical Optimizers:
SMB Join Operator
21© 2014 Cloudera, Inc. All rights reserved.
SMB vs MapJoin Decision (MR)
•  SMB->MapJoin path
•  In many cases, mapjoin is faster than SMB join so we choose mapjoin if possible
•  We spawn 1 mapper per bucket = large overhead if table has huge number bucket files
•  Enabled by “hive.auto.convert.sortmerge.join.to.mapjoin”
•  Physical Optimizer: SortMergeJoinResolver
MapRedWork
MapWork
SMB Join Work
MapRedLocalWork
MapRedWork
MapRedLocalWork
MapRedWork
Conditional MapJoin Work
MapRedWork
MapWork
MapJoin Option MapJoin Option SMB Option
22© 2014 Cloudera, Inc. All rights reserved.
SMB vs MapJoin Decision (Spark)
•  Make decision at compile-time via stats and config for Mapjoin vs
SMB join (can determine mapjoin at compile-time)
•  Logical Optimizer: SparkJoinOptimizer
•  If hive.auto.convert.join && hive.auto.convert.sortmerge.join.to.mapjoin and
tables fit into memory, delegate to MapJoin logical optimizers
•  If SMB enabled && (! hive.auto.convert.sortmerge.join.to.mapjoin or tables do
not fit into memory) , delegate to SMB Join logical optimizers.
23© 2014 Cloudera, Inc. All rights reserved.
Skew Join
•  Skew keys = key with high frequencies, will overwhelm that key’s
reducer in common join
•  Perform a common join for non-skew keys, and perform map join for skewed
keys.
•  A join B on A.id=B.id, with A skewing for id=1, becomes
•  A join B on A.id=B.id and A.id!=1 union
•  A join B on A.id=B.id and A.id=1
•  If B doesn’t skew on id=1, then #2 will be a map join.
24© 2014 Cloudera, Inc. All rights reserved.
Skew Join Optimizers (Compile Time, MR)
•  Skew keys identified by: create table … skewed by (key) on
(key_value);
•  Activated by “hive.optimize.skewjoin.compiletime”
•  Logical Optimizer: SkewJoinOptimizer looks at table metadata
•  We fixed bug with converting to mapjoin for skewed rows, HIVE-8610
TS Fil (Skewed Rows) ReduceSink
TS Fil (Skewed Rows) ReduceSink
Join
TS Fil (non-skewed) ReduceSink
TS Fil (non-skewed) ReduceSink
Join
Union
25© 2014 Cloudera, Inc. All rights reserved.
Skew Join Optimizers (Runtime, MR)
•  Activated by “hive.optimize.skewjoin”
•  Physical Optimizer: SkewJoinResolver
•  During join operator, key is skewed if it passes “hive.skewjoin.key” threshold
•  Skew key is skipped and values are copied to separate directories
•  Those directories are processed by conditional mapjoin task.
MapRedLocalWork
Tab1 to HashTable
MapRedWork
Tab2 is bigtable
MapRedLocalWork
Tab2 to HashTable
MapRedWork
Tab1 is bigtable
MapRedLocalWork
Tab1 Tab2
Join
Condition1(Skew Key Join) Condition2(Skew Key Join)Task3
Tab1 Skew Keys
Tab2 Skew Keys
26© 2014 Cloudera, Inc. All rights reserved.
Skew Join (Spark)
•  Compile-time optimizer
•  Logical Optimizer: Re-use SkewJoinOptimizer
•  Runtime optimizer
•  Physical Optimizer: SparkSkewJoinResolver, similar to SkewJoinResolver.
•  Main challenge is to break up some SparkTask that involve aggregations follow
by join, in skewjoin case, in order to insert conditional task.
27© 2014 Cloudera, Inc. All rights reserved.
MR Join Class Diagram (Enjoy)
SkewJoinOptimizer
(hive.optimize.skewjoin.compiletime)
MapJoinProcessor
BucketMapJoinOptimizer
(hive.optimize.bucket.mapjoin)
Tables are skewed N-1 join tables fit
in memory
User provides join hints
&& Tables bucketed
Users provides
Join hints
&& Tables bucketed
&& Tables Sorted
User provides
Join hints
Tables are skewed,
Skew metadata
available
Tables bucketed &&
Tables Sorted
SortedMergeBucketMapJoinOptimizer
(hive.optimize.bucketmapjoin.sortedmerge)
SortedMergeBucketMapJoinProc
(if contains MapJoin operator)
SortedBucketMapJoinProc
(ihive.auto.convert.sortmerge.join)
MapJoinFactory
(if contains MapJoin, SMBJoin operator)
SortMergeJoinResolver
(hive.auto.convert.join && hive.auto.convert.sortmerge.join.to.mapjoin)
MapJoinResolver
(if contains MapWork with MapLocalWork)
SkewJoinResolver
(hive.optimize.skew.join)
CommonJoinResolver
(hive.auto.convert.join)
SMB MapJoin
Skew Join
With MapJoin
Skew Join
With MapJoin
MapJoin MapJoin Bucket MapJoin Bucket MapJoin Bucket MapJoin
28© 2014 Cloudera, Inc. All rights reserved.
Spark Join Class Diagram (Enjoy)
SkewJoinOptimizer
(hive.optimize.skewjoin.compiletime)
SparkMapJoinProcessor
BucketMapJoinOptimizer
(hive.optimize.bucket.mapjoin)
Tables are skewed,
Skew metadata
available
N-1 join tables fit
in memory
User provides join hints
&& Tables bucketed
Users provides
Join hints
&& Tables bucketed
&& Tables Sorted
User provides
Join hints
Tables are skewed Tables bucketed &&
Tables Sorted
SparkSMBJoinHintOptimizer
(if contains MapJoin operator)
SparkSortMergeJoinOptimizer
(hive.auto.convert.sortmerge.join &&
! hive.auto.convert.sortmerge.join.to.mapjoin)
GenSparkWork
SparkSortMergeMapJoinFactory
(if contains SMBMapJoin operator)
SparkMapJoinResolver
(if SparkWork contains MapJoinOperator)
SkewJoinResolver
(hive.optimize.skew.join)
Skew Join
With MapJoin
Skew Join
With MapJoin
MapJoin or
Bucket Mapjoin
MapJoin Bucket MapJoin SMB MapJoin SMB MapJoin
SparkMapJoinOptimizer
(hive.auto.convert.join &&
hive.auto.convert.sortmerge.join.to.mapjoin)
29© 2014 Cloudera, Inc. All rights reserved.
Hive on Spark Join Team
•  Szehon Ho (Cloudera)
•  Chao Sun (Cloudera)
•  Jimmy Xiang (Cloudera)
•  Rui Li (Intel)
•  Suhas Satish (MapR)
•  Na Yang (MapR)
Thank you.

More Related Content

What's hot

What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Chris Fregly
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureVARUN SAXENA
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsDataWorks Summit
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...DataStax
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j InternalsTobias Lindaaker
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsDataWorks Summit
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is FailingDataWorks Summit
 

What's hot (20)

What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hive
HiveHive
Hive
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Compression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of TradeoffsCompression Options in Hadoop - A Tale of Tradeoffs
Compression Options in Hadoop - A Tale of Tradeoffs
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, Dat...
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
An overview of Neo4j Internals
An overview of Neo4j InternalsAn overview of Neo4j Internals
An overview of Neo4j Internals
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 

Viewers also liked

Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwordsSzehon Ho
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamZheng Shao
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresSteve Loughran
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Mark Rittman
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 

Viewers also liked (18)

Hive tuning
Hive tuningHive tuning
Hive tuning
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive Team
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object Stores
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Empower Hive with Spark
Empower Hive with SparkEmpower Hive with Spark
Empower Hive with Spark
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 

Similar to Hive Join Optimizations in MR and Spark

Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGPradeep MG
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsAndrew Brust
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Databricks
 
Streaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesStreaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesC4Media
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationAhmad El Tawil
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010John Sichi
 

Similar to Hive Join Optimizations in MR and Spark (20)

Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Hive
HiveHive
Hive
 
Hadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MGHadoop_EcoSystem_Pradeep_MG
Hadoop_EcoSystem_Pradeep_MG
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Streaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+TablesStreaming SQL Foundations: Why I ❤ Streams+Tables
Streaming SQL Foundations: Why I ❤ Streams+Tables
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Patch Maps
Patch MapsPatch Maps
Patch Maps
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 

Recently uploaded

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

Hive Join Optimizations in MR and Spark

  • 1. Hive Join Optimizations: MR and Spark Szehon Ho @hkszehon Cloudera Software Engineer, Hive Committer and PMC
  • 2. 2© 2014 Cloudera, Inc. All rights reserved. Background •  Joins were one of the more challenging pieces of the Hive on Spark project •  Many joins added throughout the years in Hive •  Common (Reduce-side) Join •  Broadcast (Map-side) Join •  Bucket Map Join •  Sort Merge Bucket Join •  Skew Join •  More to come •  Share our research on how different joins work in MR •  Share how joins are implemented in Hive on Spark
  • 3. 3© 2014 Cloudera, Inc. All rights reserved. Common Join •  Known as Reduce-side join •  Background: Hive (Equi) Join High-Level Requirement: •  Scan n tables •  Rows with same value on joinKeys are combined -> Result •  Process: •  Mapper: scan, process n tables and produces HiveKey = {JoinKey, TableAlias}, Value = {row} •  Shuffle Phase: •  JoinKey used to hash rows of same joinKey value to same reducer •  TableAlias makes sure reducers gets rows in sorted order by origin table •  Reducer: Join operator combine rows from different tables to produce JoinResult •  Worst performance •  All table data is shuffled around
  • 4. 4© 2014 Cloudera, Inc. All rights reserved. Common Join •  Ex: Join by CityId •  CityId=1 goes to First Reducer (sorted by table) •  CityId=2 goes to Second Reducer (sorted by table) CityId CityName 1 San Jose 2 SF CityId Sales 1 500 2 600 2 400 CityId TableAlias Row Value 1 C San Jose 1 S 500 CityId TableAlias Row Value 2 C SF 2 S 600 2 S 400 {1, San Jose, 500} {2, San Jose, 600} {2, San Francisco, 400} Mapper (Reduce Sink) Reducer (Join Operator) Cities: C Sales: S {1, C} {1, S} {2, C} {2, S} {2, S} HiveKey Result
  • 5. 5© 2014 Cloudera, Inc. All rights reserved. Common Join (MR) TS Sel/FIl RS TS Sel/FIl RS Join Sel/FIl FileSinkOperator Tree MR Work Tree MapRedWork ReduceWorkMapWork TS Sel/FIl RS TS Sel/FIl RS Join Sel/FIl FileSink Produces HIveKey Execute on Mapper Execute on Reducer
  • 6. 6© 2014 Cloudera, Inc. All rights reserved. Common Join Spark Work Tree SparkWork MapWork TS Sel/FIl ReduceSink MapWork TS Sel/FIl ReduceSink ReduceWork Join Sel/FIl ReduceSink union() RepartitionAndSort WithinPartitions() Shuffle-Sort Transform (SPARK-2978) mapPartition() mapPartition() MapWork MapWork mapPartition() ReduceWork Spark RDD Transforms In Spark: •  Table = RDD •  Data Operation = RDD transformation Table RDD Table RDD
  • 7. 7© 2014 Cloudera, Inc. All rights reserved. MapJoin •  Known as Broadcast join •  Create hashtable from (n-1) small table(s) keyed by Joinkey, broadcasted them in-memory to mappers processing big-table. •  Each big-table mapper does lookup of joinkey in small table(s) hashmap -> Join Result •  Ex: Join by “CityId” CityId CityName 1 San Jose 2 San Francisco CityId Sales 1 500 2 600 2 400 CityId Sales 1 700 2 200 2 100 Small Table (HashTable) Big Table (Mapper) {1, San Jose, 500} {2, San Francisco, 600} {2, San Francisco, 400} {1, San Jose, 700} {2, San Francisco, 200} {2, San Francisco, 200}
  • 8. 8© 2014 Cloudera, Inc. All rights reserved. MapJoin Overview (MR) HS2 Node1 (Small Table Data) Node2 (Small Table Data) Node3 Node4 1. Local Work read, process small table HS2 Node3 Node4 2. Create/Upload hashtable file to distributed cache Node1 Node2 HS2 Node1 (Big Table) 3. Big Table Mapper Reads hashTable from Distributed Cache Node2 Node3 (Big Table) Node4 (Big Table) LocalWork MapWork MapWork MapWork LocalWork •  More efficient than common join •  Only small-table(s) are moved around
  • 9. 9© 2014 Cloudera, Inc. All rights reserved. MapJoin Overview (Spark) •  Spark Work for Mapjoin very similar to MR Version, use hashtable file with high replication factor •  Note1: We run the small-table processing non-local (parallel) •  Note2: Consideration of Spark broadcast variables for broadcast
  • 10. 10© 2014 Cloudera, Inc. All rights reserved. MapJoin Decision Implementation •  Memory req: N-1 tables need to fit into mapper memory •  Two ways Hive decides a mapjoin •  Query Hints: •  SELECT /*+ MAPJOIN(cities) */ * FROM cities JOIN sales on cities.cityId=sales.cityId; •  Auto-converesion based on file-size (“hive.auto.convert.join”) •  If N-1 small tables smaller than: “hive.mapjoin.smalltable.filesize”
  • 11. 11© 2014 Cloudera, Inc. All rights reserved. MapJoin Optimizers •  Multiple decision-points in query-planning = Different Optimizer Paths •  MapJoin Optimizers are processors that convert Query Plan •  “Logical (Compile-time) optimizers” modify a operator-tree, if known at compile-time how to optimize to mapjoin •  “Physical (Runtime) optimizers” modify a physical work (MapRedWork, TezWork, SparkWork), involves more-complex conditional task, when Hive has no info at compile-time MapRedLocalWork TS (Small Table) Sel/FIl HashTableSink MapRedWork MapWork TS (Big Table) Sel/ FIl MapJoin Sel/ FIl FileSink HashTableDummy Local Work Logical Optimizers => TS (Big Table) Sel/FIl TS (Small Table) Sel/FIl RS MapJoin Sel/FIl FileSink Physical Optimizers =>
  • 12. 12© 2014 Cloudera, Inc. All rights reserved. MapJoin Optimizers (MR) •  Query Hint: Big/Small Table(s) known at compile-time from hints. •  Logical Optimizer: MapJoinProcessor •  Auto-conversion: Table size not known at compile-time •  Physical Optimizer: CommonJoinResolver, MapJoinResolver. •  Create Conditional Tasks with all big/small table possibilities: one picked at runtime •  Noconditional mode: For some cases, table file-size is known at compile-time and can skip conditional task, but cannot do this for all queries (join of intermediate results..) MapRedLocalWork Cities to HashTable MapRedWork Sales to MapJoin MapRedLocalWork Sales to HashTable MapRedWork Cities to MapJoin MapRedLocalWork Cities Sales Join Condition1 (Cities Small) Condition2 (Sales Small) Condition3 (Neither Small Enough)
  • 13. 13© 2014 Cloudera, Inc. All rights reserved. MapJoin Optimizers: Spark •  Spark Plan: Support for both query-hints and auto-conversion decisions. •  Query Hints •  Logical Optimizer: Reuse MapJoinProcessor •  Auto-conversion: Use statistics annotated on operators that give estimated output size (like Tez, CBO), so big/small tables known at compile-time too •  Logical Optimizer: SparkMapJoinOptimizer Logical Optimizers => MapJoin Operators TS (Big Table) Sel/FIl TS (Small Table) Sel/FIl RS MapJoin Sel/FIl FileSink
  • 14. 14© 2014 Cloudera, Inc. All rights reserved. BucketMapJoin •  Bucketed tables: rows hash to different bucket files based on bucket-key •  CREATE TABLE cities (cityid int, value string) CLUSTERED BY (cityId) INTO 2 BUCKETS; •  Join tables bucketed on join key: For each bucket of table, rows with matching joinKey values will be in corresponding bucket of other table •  Like Mapjoin, but big-table mappers load to memory only relevant small-table bucket’s hashmap •  Ex: Bucketed by “CityId”, Join by “CityId” CityId CityName 3 New York 1 San Jose CityId CityName 2 San Francisco 4 Los Angeles CityId Sales 1 500 3 6000 1 400 CityId Sales 4 50 2 200 4 45 {1, San Jose, 500} {3, New York, 6000} {1, San Jose, 400} {4, Los Angeles, 50} {2, San Francisco, 200} {4, Los Angeles, 45}
  • 15. 15© 2014 Cloudera, Inc. All rights reserved. Bucket MapJoin Execution •  Very similar to MapJoin •  HashTableSink (small-table) writes per-bucket instead of per-table •  HashTableLoader (big-table mapper) reads per-bucket
  • 16. 16© 2014 Cloudera, Inc. All rights reserved. BucketMapJoin Optimizers (MR, Spark) •  Memory Req: Corresponding bucket(s) of small table(s) fit into memory of big table mapper (less than mapjoin) •  MR: •  Query hint && “hive.optimize.bucket.mapjoin”, all information known at compile-time •  Logical Optimizer: MapJoinProcessor (intermediate operator tree) •  Spark: •  Query hint && “hive.optimize.buckert.mapjoin” •  Logical Optimizer: Reuse MapJoinProcessor •  Auto-Trigger, done via stats like mapjoin (size calculation estimated to be size/numBuckets) •  Logical Optimizer: SparkMapJoinOptimizer, does size calculation of small tables via statistics, divides original number by numBuckets
  • 17. 17© 2014 Cloudera, Inc. All rights reserved. SMB Join •  CREATE TABLE cities (cityid int, cityName string) CLUSTERED BY (cityId) SORTED BY (cityId) INTO 2 BUCKETS; •  Join tables are bucketed and sorted (per bucket) •  This allows sort-merge join per bucket. •  Advance table until find a match CityId CityName 1 San Jose 3 New York CityId Sales 1 500 1 400 3 6000 CityId Sales 2 200 4 50 4 45 CityId CityName 2 San Francisco 4 Los Angeles {1, San Jose, 500} {1, San Jose, 400} {3, New York, 6000} {2, San Francisco, 200} {4, Los Angeles, 50} {4, Los Angeles, 45}
  • 18. 18© 2014 Cloudera, Inc. All rights reserved. SMB Join •  Same Execution in MR and Spark •  Run mapper process against a “big-table”, which loads corresponding small-table buckets •  Mapper reads directly from small-table, no need to create, broadcast small-table hashmap. •  No size limit on small table (no need to load table into memory) Node: Small Table Bucket 2 Node: Small Table Bucket 1 Node: Big Table Bucket 1 Node: Big Table Bucket 2 MR: Mappers Spark: MapPartition() Transform MapWork MapWork HS2
  • 19. 19© 2014 Cloudera, Inc. All rights reserved. SMB Join Optimizers: MR •  SMB plan needs to identify ‘big-table’: one that mappers run against, will load ‘small-tables’. Generally can be determined at compile-time •  User gives query-hints to identify small-tables •  Triggered by “hive.optimize.bucketmapjoin.sortedmerge” •  Logical Optimizer: SortedMergeBucketMapJoinProc •  Auto-trigger: “hive.auto.convert.sortmerge.join.bigtable.selection.policy” class chooses big-table •  Triggered by “hive.auto.convert.sortmerge.join” •  Logical Optimizer: SortedBucketMapJoinProc TS (Big Table) Sel/FIl TS (Small Table) Sel/FIl DummyStore SMBMapJoin Sel/FIl FileSink Logical Optimizers: SMB Join Operator
  • 20. 20© 2014 Cloudera, Inc. All rights reserved. SMB Join Optimizers: Spark •  Query-hints •  Logical Optimizer: SparkSMBJoinHintOptimizer •  Auto-Conversion •  Logical Optimizer: SparkSortMergeJoinOptimizer TS (Big Table) Sel/FIl TS (Small Table) Sel/FIl DummyStore SMBMapJoin Sel/FIl FileSink Logical Optimizers: SMB Join Operator
  • 21. 21© 2014 Cloudera, Inc. All rights reserved. SMB vs MapJoin Decision (MR) •  SMB->MapJoin path •  In many cases, mapjoin is faster than SMB join so we choose mapjoin if possible •  We spawn 1 mapper per bucket = large overhead if table has huge number bucket files •  Enabled by “hive.auto.convert.sortmerge.join.to.mapjoin” •  Physical Optimizer: SortMergeJoinResolver MapRedWork MapWork SMB Join Work MapRedLocalWork MapRedWork MapRedLocalWork MapRedWork Conditional MapJoin Work MapRedWork MapWork MapJoin Option MapJoin Option SMB Option
  • 22. 22© 2014 Cloudera, Inc. All rights reserved. SMB vs MapJoin Decision (Spark) •  Make decision at compile-time via stats and config for Mapjoin vs SMB join (can determine mapjoin at compile-time) •  Logical Optimizer: SparkJoinOptimizer •  If hive.auto.convert.join && hive.auto.convert.sortmerge.join.to.mapjoin and tables fit into memory, delegate to MapJoin logical optimizers •  If SMB enabled && (! hive.auto.convert.sortmerge.join.to.mapjoin or tables do not fit into memory) , delegate to SMB Join logical optimizers.
  • 23. 23© 2014 Cloudera, Inc. All rights reserved. Skew Join •  Skew keys = key with high frequencies, will overwhelm that key’s reducer in common join •  Perform a common join for non-skew keys, and perform map join for skewed keys. •  A join B on A.id=B.id, with A skewing for id=1, becomes •  A join B on A.id=B.id and A.id!=1 union •  A join B on A.id=B.id and A.id=1 •  If B doesn’t skew on id=1, then #2 will be a map join.
  • 24. 24© 2014 Cloudera, Inc. All rights reserved. Skew Join Optimizers (Compile Time, MR) •  Skew keys identified by: create table … skewed by (key) on (key_value); •  Activated by “hive.optimize.skewjoin.compiletime” •  Logical Optimizer: SkewJoinOptimizer looks at table metadata •  We fixed bug with converting to mapjoin for skewed rows, HIVE-8610 TS Fil (Skewed Rows) ReduceSink TS Fil (Skewed Rows) ReduceSink Join TS Fil (non-skewed) ReduceSink TS Fil (non-skewed) ReduceSink Join Union
  • 25. 25© 2014 Cloudera, Inc. All rights reserved. Skew Join Optimizers (Runtime, MR) •  Activated by “hive.optimize.skewjoin” •  Physical Optimizer: SkewJoinResolver •  During join operator, key is skewed if it passes “hive.skewjoin.key” threshold •  Skew key is skipped and values are copied to separate directories •  Those directories are processed by conditional mapjoin task. MapRedLocalWork Tab1 to HashTable MapRedWork Tab2 is bigtable MapRedLocalWork Tab2 to HashTable MapRedWork Tab1 is bigtable MapRedLocalWork Tab1 Tab2 Join Condition1(Skew Key Join) Condition2(Skew Key Join)Task3 Tab1 Skew Keys Tab2 Skew Keys
  • 26. 26© 2014 Cloudera, Inc. All rights reserved. Skew Join (Spark) •  Compile-time optimizer •  Logical Optimizer: Re-use SkewJoinOptimizer •  Runtime optimizer •  Physical Optimizer: SparkSkewJoinResolver, similar to SkewJoinResolver. •  Main challenge is to break up some SparkTask that involve aggregations follow by join, in skewjoin case, in order to insert conditional task.
  • 27. 27© 2014 Cloudera, Inc. All rights reserved. MR Join Class Diagram (Enjoy) SkewJoinOptimizer (hive.optimize.skewjoin.compiletime) MapJoinProcessor BucketMapJoinOptimizer (hive.optimize.bucket.mapjoin) Tables are skewed N-1 join tables fit in memory User provides join hints && Tables bucketed Users provides Join hints && Tables bucketed && Tables Sorted User provides Join hints Tables are skewed, Skew metadata available Tables bucketed && Tables Sorted SortedMergeBucketMapJoinOptimizer (hive.optimize.bucketmapjoin.sortedmerge) SortedMergeBucketMapJoinProc (if contains MapJoin operator) SortedBucketMapJoinProc (ihive.auto.convert.sortmerge.join) MapJoinFactory (if contains MapJoin, SMBJoin operator) SortMergeJoinResolver (hive.auto.convert.join && hive.auto.convert.sortmerge.join.to.mapjoin) MapJoinResolver (if contains MapWork with MapLocalWork) SkewJoinResolver (hive.optimize.skew.join) CommonJoinResolver (hive.auto.convert.join) SMB MapJoin Skew Join With MapJoin Skew Join With MapJoin MapJoin MapJoin Bucket MapJoin Bucket MapJoin Bucket MapJoin
  • 28. 28© 2014 Cloudera, Inc. All rights reserved. Spark Join Class Diagram (Enjoy) SkewJoinOptimizer (hive.optimize.skewjoin.compiletime) SparkMapJoinProcessor BucketMapJoinOptimizer (hive.optimize.bucket.mapjoin) Tables are skewed, Skew metadata available N-1 join tables fit in memory User provides join hints && Tables bucketed Users provides Join hints && Tables bucketed && Tables Sorted User provides Join hints Tables are skewed Tables bucketed && Tables Sorted SparkSMBJoinHintOptimizer (if contains MapJoin operator) SparkSortMergeJoinOptimizer (hive.auto.convert.sortmerge.join && ! hive.auto.convert.sortmerge.join.to.mapjoin) GenSparkWork SparkSortMergeMapJoinFactory (if contains SMBMapJoin operator) SparkMapJoinResolver (if SparkWork contains MapJoinOperator) SkewJoinResolver (hive.optimize.skew.join) Skew Join With MapJoin Skew Join With MapJoin MapJoin or Bucket Mapjoin MapJoin Bucket MapJoin SMB MapJoin SMB MapJoin SparkMapJoinOptimizer (hive.auto.convert.join && hive.auto.convert.sortmerge.join.to.mapjoin)
  • 29. 29© 2014 Cloudera, Inc. All rights reserved. Hive on Spark Join Team •  Szehon Ho (Cloudera) •  Chao Sun (Cloudera) •  Jimmy Xiang (Cloudera) •  Rui Li (Intel) •  Suhas Satish (MapR) •  Na Yang (MapR)