Effective Hive Queries
Secrets From the Pros

We will be starting at 11:03 PDT
Use the Chat Pane in GoToWebinar to Ask Questions!

Assess your level and learn new stuff
This webinar is intended for intermediate audiences (familiar with Apache Hive and Hadoop, but not experts)
  
AGENDA
This webinar provides tips on improving performance and better utilizing resources using the following best practices:
•  Data Layout (Partitions and Buckets)
•  Data Sampling (Bucket and Block sampling)
•  Data Processing (Bucket Map Join and Parallel execution)
  
Dataset Used

Airline Bookings All (includes stops at intermediate cities)
•  # of records: 276M
•  Columns:
–  Itinerary ID
–  Year & Quarter of Travel
–  Trip Origin City & State
–  Trip Destination City & State
–  Distance between Origin & Destination

Airline Bookings Origin Only (only first leg of travel)
•  # of records: 116M
•  Columns:
–  Itinerary ID
–  Year & Quarter of Travel
–  Trip Origin City & State
–  Trip Destination City & State
–  Distance between Origin & Destination

Census (human population by US state)
•  # of records: 50
•  Columns:
–  State code & Name
–  Population
  
#1 - Data Partitioning
•  Problem Pattern
–  Query a subset of data in a table
–  Subset identified by a "Column_Name = X" filter
•  Solution Pattern
–  Lay out data in sub-directories, with each directory associated with a value of the partition column
–  A filter on the partition column then picks just a single sub-directory
•  Approach
–  Use the PARTITIONED BY clause
•  Benefit
–  Partition pruning
–  2.7x faster on a query on the Airline Bookings dataset (29 seconds)
  
#1 - Data Partitioning
[Figure: the Airline Bookings All table stored as one sub-directory per Origin State (the partition column), for example CA, WY and AL. Each sub-directory holds the files inside that partition: File1001.dat, File1002.dat, ... File100n.dat; File3001.dat, File3002.dat, ... File300n.dat; Filex001.dat, Filex002.dat, ... Filex00n.dat.]

CREATE TABLE Airline_Bookings_All
…
PARTITIONED BY (origin_state STRING)

SELECT origin_city, origin_state
FROM Airline_Bookings_All
WHERE origin_state = 'CA'
#2 - Data Bucketing
•  Problem Pattern
–  Join data in two large tables efficiently
–  Sample data inside a table efficiently
•  Solution Pattern
–  More efficient processing by storing data in hash buckets
•  Approach
–  Use bucketing with CLUSTERED BY .. INTO n BUCKETS
•  Benefit
–  Bucket Map Join
–  Bucket Sampling
  
#2 - Data Bucketing
CREATE TABLE Airline_Bookings_All
…
CLUSTERED BY (itinid) INTO 64 BUCKETS

set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE Airline_Bookings_All
SELECT …
FROM ..

[Figure: Airline_Bookings_All stored as files File00.dat, File01.dat, ... File63.dat. Each file contains all the rows that correspond to the same hash of the itinid column.]
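
The bucket layout can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Hive's code: `bucket_for` uses the integer itinid modulo the bucket count as a stand-in hash, and `write_bucketed` and the sample rows are hypothetical.

```python
NUM_BUCKETS = 64


def bucket_for(itinid, num_buckets=NUM_BUCKETS):
    """Stand-in hash: map an integer itinid to a bucket number in [0, num_buckets)."""
    return itinid % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Distribute rows into num_buckets 'files' according to the hash of itinid."""
    files = [[] for _ in range(num_buckets)]
    for row in rows:
        files[bucket_for(row["itinid"], num_buckets)].append(row)
    return files


# Hypothetical rows; only the itinid column matters for bucketing.
rows = [{"itinid": i, "origin_state": "CA"} for i in range(1000)]
files = write_bucketed(rows)

print(len(files), "files")
print([r["itinid"] for r in files[5][:3]])  # every itinid in file 5 hashes to bucket 5
```

Because a given itinid always lands in the same file, a reader that needs one itinid (or one bucket's worth of them) knows which file to open.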
  
#2 - Data Bucketing
[Figure: tables a and b, each stored as files containing table data bucketed on a column (File1001.dat, File1002.dat, ... File100n.dat and Filex001.dat, Filex002.dat, ... Filex00m.dat).]

set hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(b) */ a.*, b.*
FROM Airline_Bookings_All a JOIN Airline_Bookings_Origin_Only b
ON a.itinid = b.itinid

Note:
1.  Both tables are bucketed on the itinid column
2.  The number of buckets in one table is a multiple of the number of buckets in the other
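
The two conditions in the note are what let the join proceed bucket by bucket. The following Python sketch illustrates that reasoning only; it is not Hive's algorithm. The bucket counts (64 and 32), the modulo stand-in hash, and the function names are assumptions made for the example.

```python
def write_bucketed(rows, num_buckets):
    """Distribute rows into bucket 'files' by itinid modulo the bucket count."""
    files = [[] for _ in range(num_buckets)]
    for row in rows:
        files[row["itinid"] % num_buckets].append(row)
    return files


def bucket_map_join(big_files, small_files):
    """Join bucket by bucket, holding only one small bucket in memory at a time."""
    if len(big_files) % len(small_files) != 0:
        raise ValueError("bucket counts must be multiples of each other")
    joined = []
    for i, big_bucket in enumerate(big_files):
        # A row with itinid % 64 == i can only match rows with itinid % 32 == i % 32.
        small_bucket = small_files[i % len(small_files)]
        lookup = {}
        for row in small_bucket:
            lookup.setdefault(row["itinid"], []).append(row)
        for row in big_bucket:
            for match in lookup.get(row["itinid"], []):
                joined.append((row, match))
    return joined


# Hypothetical data: two legs per itinerary in the big table,
# one row for every second itinerary in the small table.
all_rows = [{"itinid": i, "leg": leg} for i in range(200) for leg in (1, 2)]
origin_rows = [{"itinid": i, "leg": 1} for i in range(0, 200, 2)]

a = write_bucketed(all_rows, 64)
b = write_bucketed(origin_rows, 32)
joined = bucket_map_join(a, b)

print(len(joined), "joined pairs")
```

If the bucket counts were not multiples of each other, a bucket of one table could match rows scattered across many buckets of the other, and the one-bucket-at-a-time lookup would no longer be enough.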
  
#3 - Bucket Sampling
•  Problem Pattern
–  Work on joinable samples of data from different tables
•  Solution Pattern
–  Use Bucket Sampling
•  Approach
–  TABLESAMPLE (BUCKET x OUT OF y ON column)
•  Benefit
–  Useful while working with sample data and joins
  
#3 - Bucket Sampling
[Figure: table a is stored as files Filex001.dat, Filex002.dat, ... Filex030.dat, ... Filex063.dat, Filex064.dat and table b as files Filey001.dat, Filey002.dat, ... Filey030.dat, ... Filey063.dat, Filey064.dat. Both sets of files contain bookings data bucketed on itinid.]

SELECT a.*, b.*
FROM Airline_Bookings_All TABLESAMPLE(bucket 30 out of 64 on itinid) a
   , Airline_Bookings_Origin_Only TABLESAMPLE(bucket 30 out of 64 on itinid) b
WHERE a.itinid = b.itinid
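
Why the two samples stay joinable can be shown with a small Python sketch. It assumes that bucket x (numbered from 1) holds the rows whose itinid modulo y equals x - 1; that mapping, the `table_sample` helper, and the sample rows are illustrative assumptions rather than Hive internals.

```python
def table_sample(rows, bucket, out_of, key="itinid"):
    """Keep the rows whose key falls into the given 1-based bucket number."""
    return [r for r in rows if r[key] % out_of == bucket - 1]


# Hypothetical tables sharing the itinid column.
all_rows = [{"itinid": i, "stops": i % 3} for i in range(1, 1001)]
origin_rows = [{"itinid": i} for i in range(1, 1001, 2)]

# Take the same bucket from both tables.
a = table_sample(all_rows, 30, 64)
b = table_sample(origin_rows, 30, 64)

b_ids = {r["itinid"] for r in b}
joined = [r for r in a if r["itinid"] in b_ids]

print(len(a), len(b), len(joined))
```

Sampling the same bucket on the join column from both tables selects the same itinid values on each side, so every sampled row in one table that has a partner in the other still finds it. Two independent random samples would mostly miss each other.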
  
#4 - Block Sampling
•  Problem Pattern
–  View a sample of the data within a table
–  Sample size expressed as a number of rows, a percentage of the data, or a number of MBs
•  Solution Pattern
–  Use Block Sampling
•  Approach
–  Use TABLESAMPLE (n PERCENT), TABLESAMPLE (nM), or TABLESAMPLE (n ROWS)
•  Benefit
–  Getting a random sample from the table
–  More options for specifying how large the sample should be
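
As a rough illustration of percentage-based block sampling, the Python sketch below picks whole blocks until at least n% of the total size is covered. It is a sketch of the idea under assumed block-level granularity, not Hive's selection logic; the function name, the seed parameter, and the block sizes are hypothetical.

```python
import random


def block_sample_percent(block_sizes, percent, seed=0):
    """Pick whole blocks in random order until at least `percent` of the bytes is covered."""
    total = sum(block_sizes)
    target = total * percent / 100
    order = list(range(len(block_sizes)))
    random.Random(seed).shuffle(order)

    picked, covered = [], 0
    for idx in order:
        if covered >= target:
            break
        picked.append(idx)
        covered += block_sizes[idx]
    return picked, covered


# Hypothetical table made of ten blocks of 128 MB each.
blocks = [128] * 10
picked, covered = block_sample_percent(blocks, percent=25)

print(len(picked), "blocks,", covered, "MB sampled")
```

Because whole blocks are taken, the sample can be somewhat larger than the requested percentage: here 25% of 1280 MB is 320 MB, and three 128 MB blocks are needed to reach it.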
  
#5 - Parallel Execution
SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct
FROM
(
  SELECT itinid,
         year,
         quarter,
         origin,
         originstate
  FROM air_travel_bookings_8
) a
JOIN
(
  SELECT itinid,
         origin,
         originstate
  FROM air_travel_origins_8
) b
ON
( a.itinid = b.itinid
  AND a.origin = b.origin
  AND a.originstate = b.originstate )
GROUP BY
a.year, a.quarter, a.origin, a.originstate;

[Figure: the query's three stages (Stage 1, Stage 2, Stage 3) shown under both settings. With "set hive.exec.parallel = false;" the stages run one after another; with "set hive.exec.parallel = true;" Stage 1 and Stage 2 sit side by side, followed by Stage 3.]
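
The effect of running independent stages in parallel can be sketched in Python. The example assumes a plan with two stages that do not depend on each other followed by a third that needs both; `run_stage` is a hypothetical stand-in that sleeps instead of doing real work, and nothing here reflects how Hive schedules jobs internally.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_stage(name, seconds=0.2):
    """Stand-in for one stage of a query plan; sleeps instead of doing real work."""
    time.sleep(seconds)
    return name


def run_plan(parallel):
    """Run stages 1 and 2 (independent of each other), then stage 3 (needs both)."""
    start = time.perf_counter()
    if parallel:
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [
                pool.submit(run_stage, "stage-1"),
                pool.submit(run_stage, "stage-2"),
            ]
            inputs = [f.result() for f in futures]
    else:
        inputs = [run_stage("stage-1"), run_stage("stage-2")]
    result = run_stage("stage-3") + " <- " + ", ".join(inputs)
    return result, time.perf_counter() - start


sequential_result, sequential_time = run_plan(parallel=False)
parallel_result, parallel_time = run_plan(parallel=True)

print(sequential_result, round(sequential_time, 1))
print(parallel_result, round(parallel_time, 1))
```

Both runs produce the same result; the parallel run only saves the time of the stage that would otherwise have waited its turn, so the gain depends on how many stages are truly independent.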
  
Summary
•  Iterate quickly on Query Design
–  Use Bucket and Block Sampling
•  Run queries faster
–  Partitioning to invoke Partition Pruning
–  Bucketing to invoke Bucket Map Joins
–  Execute complex queries in parallel
  
THANK YOU

Managed Cluster | Built-In Connectors | Friendly User-Interface | Dedicated Support

What's Included?
•  100% Managed Hadoop Cluster in the Cloud
•  Auto-Scaling Cluster. Full Life-cycle Management
•  +12 Connectors to Applications and Data Sources
•  14-Day Free Trial (free account available)
•  24/7 Customer Support

www.qubole.com/try
  

Effective Hive Queries

  • 1. Effec%ve  Hive  Queries                                Secrets  From  the  Pros   We  will  be  star,ng  at  11:03  PDT   Use  the  Chat  Pane  in  GoToWebinar  to  Ask  Ques%ons!   Assess  your  level  and  learn  new  stuff   This  webinar  is  intended  for  intermediate  audiences   (familiar  with  Apache  Hive  and  Hadoop,  but  not  experts)   ?  
  • 2. AGENDA   This  Webinar  provides  %ps  on  improving  the  performance  and   beJer  u%lizing  resources  using  the  following  best  prac%ces:   •  Data  Layout  (Par%%ons  and  Buckets)   •  Data  Sampling  (Bucket  and  Block  sampling)   •  Data  Processing  (Bucket  Map  Join  and  Parallel   execu%on)  
  • 3. Dataset  Used   #  of  records:    276M  records   Columns:   I%nerary  ID   Year  &Quarter  of  Travel   Trip  Origin  City  &  State   Trip  Des%na%on  City  &  State   Distance  between  Origin  &  Des%na%on   Airline  Bookings  All   Includes  stops  at  intermediate  ci%es   #  of  records:    116M  records   Columns:   I%nerary  ID   Year  &Quarter  of  Travel   Trip  Origin  City  &  State   Trip  Des%na%on  City  &  State   Distance  between  Origin  &  Des%na%on   Airline  Bookings  Origin  Only   Only  first  leg  of  travel   #  of  records:    50   Columns:   State  code  &  Name   Popula%on   Census   Human  popula%on  by  US  State  
  • 4. #1  -­‐  Data  Par%%oning     •  Problem  PaJern   –  Query  a  subset  of  data  in  a  table   –  Subset  iden%fied  by  “Column_Name  =  X”  filter   •  Solu%on  paJern   –  Layout  data  in  sub-­‐directories  with  each  directory  associated   with  a  value  of  the  par%%on  column   –  The  filter  on  par%%on  column  just  picks  a  single  sub  directory   •  Approach   –  Use  PARTITION  BY  clause   •  Benefit   –  Par%%on  pruning   –  2.7x  faster  on  a  query  on  Airline  Bookings  Dataset  (29  seconds)  
  • 5. #1 - Data Partitioning
       Diagram: the Airline Bookings All table is laid out with one sub-directory per Origin State (the partition column), e.g. CA, WY, AL. Each sub-directory holds the files inside that partition (File1001.dat, File1002.dat, ..., File100n.dat; File3001.dat, File3002.dat, ..., File300n.dat; Filex001.dat, Filex002.dat, ..., Filex00n.dat).
       CREATE TABLE Airline_Bookings_All
       ….
       PARTITIONED BY (origin_state STRING)

       SELECT origin_city, origin_state
       FROM Airline_Bookings_All
       WHERE origin_state = 'CA'
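
As a rough sketch of how such a layout might be produced, a partitioned table can be filled with a dynamic-partition insert. The table names `bookings_by_state` and `bookings_raw` and the column list are hypothetical, chosen only for illustration, because the slide's own DDL elides its columns.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE bookings_by_state (
  itinid      STRING,
  origin_city STRING
)
PARTITIONED BY (origin_state STRING);

-- Let Hive derive each row's partition from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- bookings_raw is a hypothetical unpartitioned source table.
-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE bookings_by_state PARTITION (origin_state)
SELECT itinid, origin_city, origin_state
FROM bookings_raw;

-- List the partitions (one sub-directory each) that were created.
SHOW PARTITIONS bookings_by_state;

-- With the filter on the partition column, only the
-- origin_state=CA sub-directory needs to be read.
SELECT origin_city, origin_state
FROM bookings_by_state
WHERE origin_state = 'CA';
```

Pruning only applies when the filter is on the partition column, so the choice of partition column should follow the most common WHERE clauses.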
  • 6. #2 - Data Bucketing
       • Problem Pattern
           – Join data in two large tables efficiently
           – Sample data inside a table efficiently
       • Solution Pattern
           – More efficient processing by storing data in hash buckets
       • Approach
           – Use bucketing via CLUSTERED BY .. INTO n BUCKETS
       • Benefit
           – Bucket Map Join
           – Bucket Sampling
  • 7. #2 - Data Bucketing
       CREATE TABLE Airline_Bookings_All
       …
       CLUSTERED BY (itinid) INTO 64 BUCKETS

       set hive.enforce.bucketing = true;
       INSERT OVERWRITE TABLE Airline_Bookings_All
       SELECT …
       FROM ..

       Diagram: Airline_Bookings_All is stored as File00.dat, File01.dat, ..., File63.dat. Each file contains all the rows that correspond to the same hash of the itinid column.
  • 8. #2 - Data Bucketing
       Diagram: tables a (File1001.dat, File1002.dat, ..., File100n.dat) and b (Filex001.dat, Filex002.dat, ..., Filex00m.dat), each stored as files containing table data bucketed on a column.
       set hive.optimize.bucketmapjoin = true;

       SELECT /*+ MAPJOIN(a, b) */ a.*, b.*
       FROM Airline_Bookings_All a JOIN Airline_Bookings_Origin_Only b
       ON a.itinid = b.itinid

       Note:
       1. Both tables are bucketed on the itinid column
       2. The number of buckets in one table is a multiple of the number of buckets in the other
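
The two conditions in the note can be sketched as a pair of table definitions. The table names, column lists, and the 64/32 bucket counts below are hypothetical. What matters is that both tables are clustered on `itinid` and that 64 is a multiple of 32. The hint here names only the smaller table `b`, which is an assumption about which side should be held in memory. The slide's own query hints both.

```sql
-- Hypothetical tables: both bucketed on itinid, and 64 is a multiple of 32.
CREATE TABLE bookings_all_bucketed (
  itinid STRING,
  origin STRING
)
CLUSTERED BY (itinid) INTO 64 BUCKETS;

CREATE TABLE bookings_origin_bucketed (
  itinid STRING,
  origin STRING
)
CLUSTERED BY (itinid) INTO 32 BUCKETS;

-- Enable the bucket map join optimization before joining.
SET hive.optimize.bucketmapjoin = true;

-- Each mapper over a bucket of a only needs the matching bucket of b.
SELECT /*+ MAPJOIN(b) */ a.itinid, a.origin, b.origin
FROM bookings_all_bucketed a
JOIN bookings_origin_bucketed b
  ON a.itinid = b.itinid;
```

Both tables must have been populated with bucketing enforced (as on the previous slide), otherwise the files on disk will not line up with the declared buckets.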
  • 9. #3 - Bucket Sampling
       • Problem Pattern
           – Work on joinable samples of data from different tables
       • Solution Pattern
           – Use Bucket Sampling
       • Approach
           – TABLESAMPLE (BUCKET x OUT OF y ON column)
       • Benefit
           – Useful while working with sample data and joins
  • 10. #3 - Bucket Sampling
       Diagram: table a is stored as files Filex001.dat, Filex002.dat, ..., Filex030.dat, ..., Filex063.dat, Filex064.dat containing bookings data bucketed on itinid; table b is stored likewise as Filey001.dat, Filey002.dat, ..., Filey030.dat, ..., Filey063.dat, Filey064.dat.
       SELECT a.*, b.*
       FROM Airline_Bookings_All TABLESAMPLE(bucket 30 out of 64 on itinid) a
          , Airline_Bookings_Origin_Only TABLESAMPLE(bucket 30 out of 64 on itinid) b
       WHERE a.itinid = b.itinid
  • 11. #4 - Block Sampling
       • Problem Pattern
           – View a sample of the data within a table
           – Sample size expressed as a number of rows, a percentage of the data, or a number of MBs
       • Solution Pattern
           – Use Block Sampling
       • Approach
           – Use TABLESAMPLE (n%, nM, or n ROWS)
       • Benefit
           – Getting a random sample from the table
           – More options to specify how many samples to generate
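
The three forms of block sampling might look like the following sketch, run against the deck's `Airline_Bookings_All` table. The sample sizes are arbitrary, and note that Hive spells the percentage form as `n PERCENT`.

```sql
-- Percentage of the data: sample at least 0.1% of the table's input size.
SELECT * FROM Airline_Bookings_All TABLESAMPLE(0.1 PERCENT) s;

-- Size of the data: sample roughly 100 MB of input.
SELECT * FROM Airline_Bookings_All TABLESAMPLE(100M) s;

-- Number of rows: take 10 rows from each input split.
SELECT * FROM Airline_Bookings_All TABLESAMPLE(10 ROWS) s;
```

The percentage and size forms work at the granularity of HDFS blocks, so the amount actually read can exceed what was requested. The row form applies per input split, so the total row count depends on how many splits the query has.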
  • 12. #5 - Parallel Execution
       SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct
       FROM (
         SELECT itinid, year, quarter, origin, originstate
         FROM air_travel_bookings_8
       ) a
       JOIN (
         SELECT itinid, origin, originstate
         FROM air_travel_origins_8
       ) b
       ON (a.itinid = b.itinid AND a.origin = b.origin AND a.originstate = b.originstate)
       GROUP BY a.year, a.quarter, a.origin, a.originstate;

       Diagram: Stage 1, Stage 2, and Stage 3 of the query, shown with set hive.exec.parallel = false; and with set hive.exec.parallel = true;
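
One way to check whether a query has stages that can overlap is to look at its plan. The sketch below uses the deck's table names in a hypothetical query that was not part of the webinar: the two branches of the UNION ALL do not depend on each other, so they are candidates for running concurrently. The thread count of 8 is an arbitrary illustrative value.

```sql
-- Allow independent stages of one query to run at the same time.
SET hive.exec.parallel = true;
-- Upper bound on how many stages may run concurrently (illustrative value).
SET hive.exec.parallel.thread.number = 8;

-- The STAGE DEPENDENCIES section of the EXPLAIN output shows
-- which stages must wait for which others.
EXPLAIN
SELECT u.src, u.ct
FROM (
  SELECT 'bookings' AS src, count(*) AS ct FROM air_travel_bookings_8
  UNION ALL
  SELECT 'origins' AS src, count(*) AS ct FROM air_travel_origins_8
) u;
```

Parallel execution only helps when the plan contains stages with no dependency between them, and the concurrent stages still compete for the same cluster resources.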
  • 13. Summary
       • Iterate quickly on Query Design
           – Use Bucket and Block Sampling
       • Run queries faster
           – Partitioning to invoke Partition Pruning
           – Bucketing to invoke Bucket Map Joins
           – Execute complex queries in parallel
  • 14. THANK YOU
       Managed Cluster; Built-In Connectors; Friendly User-Interface; Dedicated Support
       What's Included?
       • 100% Managed Hadoop Cluster in the Cloud
       • Auto-Scaling Cluster. Full Life-cycle Management
       • +12 Connectors to Applications and Data Sources
       • 14-Day Free Trial (free account available)
       • 24/7 Customer Support
       www.qubole.com/try