Deep Dive content by Hortonworks, Inc. is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Unported License.
Hive & Performance
Toronto Hadoop User Group
July 23 2013
Page 1
Presenter:
Adam Muise – Hortonworks
amuise@hortonworks.com
Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
Page 2
Hive – SQL Analytics For Any Data Size
Page 3
Data sources: Sensor, Mobile, Weblog, Operational / MPP.
Store and query all data in Hive.
Use existing SQL tools and existing SQL processes.
Hive’s Focus
• Scalable SQL processing over data in Hadoop
• Scales to 100PB+
• Structured and Unstructured data
Page 4
Comparing Hive with RDBMS
Page 5
Hive                                       RDBMS
SQL interface.                             SQL interface.
Focus on analytics.                        May focus on online or analytics.
No transactions.                           Transactions usually supported.
Partition adds, no random INSERTs.         Random INSERT and UPDATE supported.
In-place updates not natively
supported (but are possible).
Distributed processing via map/reduce.     Distributed processing varies by
                                           vendor (if available).
Scales to hundreds of nodes.               Seldom scales beyond 20 nodes.
Built for commodity hardware.              Often built on proprietary hardware
                                           (especially when scaling out).
Low cost per petabyte.                     What's a petabyte?
Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
Page 6
Hive: The SQL Interface to Hadoop
Page 7
Clients (CLI, Web UI, JDBC/ODBC) connect to HiveServer2, which drives Hive
(Compiler, Optimizer, Executor) on top of Hadoop (JobTracker, NameNode,
TaskTracker, DataNode).
1. User issues SQL query.
2. Hive parses and plans query.
3. Query converted to Map/Reduce.
4. Map/Reduce run by Hadoop.
Hive: Reliable SQL Processing at Scale
Page 8
Hive turns SQL into Map/Reduce jobs whose progress is stored in HDFS across
the cluster's nodes:
Time 1: Job = 50% complete; progress stored in HDFS.
Time 2: Node 3 fails; job = 85% complete.
Time 3: Job moves to Node 4; job = 100% complete.
SQL Coverage: SQL 92 with Extensions
Page 9
SQL Datatypes                    SQL Semantics
INT                              SELECT, LOAD, INSERT from query
TINYINT/SMALLINT/BIGINT          Expressions in WHERE and HAVING
BOOLEAN                          GROUP BY, ORDER BY, SORT BY
FLOAT                            CLUSTER BY, DISTRIBUTE BY
DOUBLE                           Sub-queries in FROM clause
STRING                           GROUP BY, ORDER BY
BINARY                           ROLLUP and CUBE
TIMESTAMP                        UNION
ARRAY, MAP, STRUCT, UNION        LEFT, RIGHT and FULL INNER/OUTER JOIN
DECIMAL                          CROSS JOIN, LEFT SEMI JOIN
CHAR                             Windowing functions (OVER, RANK, etc.)
VARCHAR                          Sub-queries for IN/NOT IN, HAVING
DATE                             EXISTS / NOT EXISTS
                                 INTERSECT, EXCEPT
Legend: Available / Roadmap / New in Hive 0.11
Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
Page 10
Data Abstractions in Hive
Page 11
Partitions, buckets and skews facilitate faster, more direct data access.
Hierarchy: Database > Table > Partition > Bucket.
Skewed keys vs. unskewed keys are optional per table.
“I heard you should avoid joins…”
• “Joins are evil” – Cal Henderson
– Joins should be avoided in online systems.
• Joins are unavoidable in analytics.
– Making joins fast is the key design point.
Page 12
Quick Refresher on Joins
Page 13
customer                        order
first    last      id           cid      price    quantity
Nick     Toner     11911        4150     10.50    3
Jessie   Simonds   11912        11914    12.25    27
Kasi     Lamers    11913        3491     5.99     5
Rodger   Clayton   11914        2934     39.99    22
Verona   Hollen    11915        11914    40.50    10

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Joins match values from one table against values in another table.
Hive Join Strategies
Page 14
Type               Approach                        Pros                   Cons
Shuffle Join       Join keys are shuffled using    Works regardless of    Most resource-intensive
                   map/reduce and the join is      data size or layout.   and slowest join type.
                   performed reduce side.
Broadcast Join     Small tables are loaded into    Very fast, single      All but one table must
                   memory on all nodes; the        scan through the       be small enough to fit
                   mapper scans through the        largest table.         in RAM.
                   large table and joins.
Sort-Merge-Bucket  Mappers take advantage of       Very fast for tables   Data must be sorted and
Join               co-location of keys to do       of any size.           bucketed ahead of time.
                   efficient joins.
Shuffle Joins in Map Reduce
Page 15
customer                        order
first    last      id           cid      price    quantity
Nick     Toner     11911        4150     10.50    3
Jessie   Simonds   11912        11914    12.25    27
Kasi     Lamers    11913        3491     5.99     5
Rodger   Clayton   11914        2934     39.99    22
Verona   Hollen    11915        11914    40.50    10

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Map phase (customer):
{ id: 11911, { first: Nick, last: Toner }}
{ id: 11914, { first: Rodger, last: Clayton }}
...

Map phase (order):
{ cid: 4150, { price: 10.50, quantity: 3 }}
{ cid: 11914, { price: 12.25, quantity: 27 }}
...

Reducer 1:
{ id: 11914, { first: Rodger, last: Clayton }}
{ cid: 11914, { price: 12.25, quantity: 27 }}

Reducer 2:
{ id: 11911, { first: Nick, last: Toner }}
{ cid: 4150, { price: 10.50, quantity: 3 }}
...

Identical keys are shuffled to the same reducer; the join is done reduce-side.
Expensive from a network utilization standpoint.
Broadcast Join
•  Star schemas use dimension tables small enough to fit in RAM.
•  Small tables held in memory by all nodes.
•  Single pass through the large table.
•  Used for star-schema type joins common in DW.
Page 16
When both are too large for memory:
Page 17
customer                        order
first    last      id           cid      price    quantity
Nick     Toner     11911        4150     10.50    3
Jessie   Simonds   11912        11914    12.25    27
Kasi     Lamers    11913        11914    40.50    10
Rodger   Clayton   11914        12337    39.99    22
Verona   Hollen    11915        15912    40.50    10

SELECT * FROM customer JOIN order ON customer.id = order.cid;

CREATE TABLE customer (id int, first string, last string)
CLUSTERED BY(id) SORTED BY(id) INTO 32 BUCKETS;

CREATE TABLE order (cid int, price float, quantity int)
CLUSTERED BY(cid) SORTED BY(cid) INTO 32 BUCKETS;

Cluster and sort by the most common join key.
Hive’s Clustering and Sorting
Page 18
customer                        order
first    last      id           cid      price    quantity
Nick     Toner     11911        4150     10.50    3
Jessie   Simonds   11912        11914    12.25    27
Kasi     Lamers    11913        11914    40.50    10
Rodger   Clayton   11914        12337    39.99    22
Verona   Hollen    11915        15912    40.50    10

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Observation 1: Sorting by the join key makes joins easy. All possible
matches reside in the same area on disk.
Hive’s Clustering and Sorting
Page 19
customer                        order
first    last      id           cid      price    quantity
Nick     Toner     11911        4150     10.50    3
Jessie   Simonds   11912        11914    12.25    27
Kasi     Lamers    11913        11914    40.50    10
Rodger   Clayton   11914        12337    39.99    22
Verona   Hollen    11915        15912    40.50    10

SELECT * FROM customer JOIN order ON customer.id = order.cid;

Observation 2: Hash bucketing a join key ensures all matching values reside
on the same node. Equi-joins can then run with no shuffle.
Controlling Data Locality with Hive
• Bucketing:
– Hash partition values into a configurable number of buckets.
– Usually coupled with sorting.
• Skews:
– Split values out into separate files.
– Used when certain values are frequently seen.
• Replication Factor:
– Increase replication factor to accelerate reads.
– Controlled at the HDFS layer.
• Sorting:
– Sort the values within given columns.
– Greatly accelerates queries when used with ORCFile filter pushdown.
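As a sketch of the skew mechanism, Hive's list-bucketing DDL can split frequent values into their own directories (the table, column, and skewed values here are hypothetical):

```sql
-- Rows whose cust_id is one of the hot values land in dedicated
-- directories that queries on other values can skip entirely.
CREATE TABLE clicks (cust_id BIGINT, url STRING)
SKEWED BY (cust_id) ON (1, 2)
STORED AS DIRECTORIES;
```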
Page 20
Guidelines for Architecting Hive Data
Page 21
Table Size   Data Profile              Query Pattern             Recommendation
Small        Hot data                  Any                       Increase replication factor.
Any          Any                       Very precise filters      Sort on the column most frequently
                                                                 used in precise queries.
Large        Any                       Joined to another         Sort and bucket both tables
                                       large table               along the join key.
Large        One value >25% of count   Any                       Split the frequent value into
             within a high-                                      a separate skew.
             cardinality column
Large        Any                       Queries tend to have a    Partition the data along the
                                       natural boundary such     natural boundary.
                                       as date
Hive Persistence Formats
• Built-in Formats:
– ORCFile
– RCFile
– Avro
– Delimited Text
– Regular Expression
– S3 Logfile
– Typed Bytes
• 3rd-Party Addons:
– JSON
– XML
Page 22
Hive allows mixed format.
• Use Case:
– Ingest data in a write-optimized format like JSON or delimited.
– Every night, run a batch job to convert to read-optimized ORCFile.
Page 23
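A minimal sketch of that nightly conversion, assuming a delimited staging table and an ORC-backed table partitioned by date (all names illustrative):

```sql
-- Rewrite one day of write-optimized staging data into the
-- read-optimized ORC table.
INSERT OVERWRITE TABLE events_orc PARTITION (ds='2013-07-22')
SELECT * FROM events_staging WHERE ds = '2013-07-22';
```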
ORCFile – Efficient Columnar Layout
Large block size well suited for HDFS. The columnar format arranges columns
adjacent within the file for compression and fast access.
ORCFile Advantages
• High Compression
– Many tricks used out-of-the-box to ensure high compression rates.
– RLE, dictionary encoding, etc.
• High Performance
– Inline indexes record value ranges within blocks of ORCFile data.
– Filter pushdown allows efficient scanning during precise queries.
• Flexible Data Model
– All Hive types including maps, structs and unions.
Page 25
High Compression with ORCFile
Page 26
Some ORCFile Samples
Page 27
sale
id       timestamp              productsk   storesk   amount   state
10000    2013-06-13T09:03:05    16775       670       $70.50   CA
10001    2013-06-13T09:03:05    10739       359       $52.99   IL
10002    2013-06-13T09:03:06    4671        606       $67.12   MA
10003    2013-06-13T09:03:08    7224        174       $96.85   CA
10004    2013-06-13T09:03:12    9354        123       $67.76   CA
10005    2013-06-13T09:03:18    1192        497       $25.73   IL

CREATE TABLE sale (
    id int, timestamp timestamp,
    productsk int, storesk int,
    amount decimal, state string
) STORED AS orc;
ORCFile Options and Defaults
Page 28
Key                    Default         Notes
orc.compress           ZLIB            High-level compression (one of NONE,
                                       ZLIB, SNAPPY).
orc.compress.size      262,144         Number of bytes in each compression
                       (= 256 KiB)     chunk.
orc.stripe.size        268,435,456     Number of bytes in each stripe.
                       (= 256 MiB)
orc.row.index.stride   10,000          Number of rows between index entries
                                       (must be >= 1,000).
orc.create.index       true            Whether to create row indexes.
No Compression: Faster but Larger
Page 29
sale
id       timestamp              productsk   storesk   amount   state
10000    2013-06-13T09:03:05    16775       670       $70.50   CA
10001    2013-06-13T09:03:05    10739       359       $52.99   IL
10002    2013-06-13T09:03:06    4671        606       $67.12   MA
10003    2013-06-13T09:03:08    7224        174       $96.85   CA
10004    2013-06-13T09:03:12    9354        123       $67.76   CA
10005    2013-06-13T09:03:18    1192        497       $25.73   IL

CREATE TABLE sale (
    id int, timestamp timestamp,
    productsk int, storesk int,
    amount decimal, state string
) STORED AS orc tblproperties ("orc.compress"="NONE");
Column Sorting to Facilitate Skipping
Page 30
sale
id       timestamp              productsk   storesk   amount   state
10005    2013-06-13T09:03:18    1192        497       $25.73   IL
10002    2013-06-13T09:03:06    4671        606       $67.12   MA
10003    2013-06-13T09:03:08    7224        174       $96.85   CA
10004    2013-06-13T09:03:12    9354        123       $67.76   CA
10001    2013-06-13T09:03:05    10739       359       $52.99   IL
10000    2013-06-13T09:03:05    16775       670       $70.50   CA

CREATE TABLE sale (
    id int, timestamp timestamp,
    productsk int, storesk int,
    amount decimal, state string
) STORED AS orc;

INSERT INTO TABLE sale SELECT * FROM staging SORT BY productsk;

ORCFile skipping speeds queries like
WHERE productsk = X, productsk IN (Y, Z), etc.
Not Your Traditional Database
• Traditional solution to all RDBMS problems:
– Put an index on it!
• Doing this in Hadoop = #fail
Page 31
Your Oracle DBA
Going Fast in Hadoop
• Hadoop:
– Really good at coordinated sequential scans.
– No random I/O. Traditional index pretty much useless.
• Keys to speed in Hadoop:
– Sorting and skipping take the place of indexing.
– Minimizing data shuffle is the other key consideration.
• Skipping data:
– Divide data among different files which can be pruned out.
– Partitions, buckets and skews.
– Skip records during scans using small embedded indexes.
– Automatic when you use ORCFile format.
• Sort data ahead of time:
– Simplifies joins and makes skipping more effective.
Page 32
Data Layout Considerations for Fast Hive
Page 33
Skip Reads:
– Partition tables and/or use skew.
– Sort secondary columns when using ORCFile.
Minimize Shuffle:
– Sort and bucket on common join keys.
– Use broadcast joins when joining small tables.
Reduce Latency:
– Increase replication factor for hot data.
– Enable short-circuit read.
– Take advantage of Tez + Tez Service (future).
Partitioning and Virtual Columns
• Partitioning makes queries go fast.
• You will almost always use some sort of partitioning.
• When partitioning you will use 1 or more virtual
columns.
• Virtual columns cause directories to be created in
HDFS.
– Files for that partition are stored within that subdirectory.
Page 34
# Notice how xdate and state are not "real" column names.
CREATE TABLE sale (
    id int, amount decimal, ...
) partitioned by (xdate string, state string);
Loading Data with Virtual Columns
• By default at least one virtual column must be hard-
coded.
• You can load all partitions in one shot:
– set hive.exec.dynamic.partition.mode=nonstrict;
– Warning: You can easily overwhelm your cluster this way.
Page 35
INSERT INTO TABLE sale PARTITION (xdate='2013-03-01', state='CA')
SELECT * FROM staging_table
WHERE xdate = '2013-03-01' AND state = 'CA';

set hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sale PARTITION (xdate, state)
SELECT * FROM staging_table;
You May Need to Re-Order Columns
• Virtual columns must be last within the inserted data
set.
• You can use the SELECT statement to re-order.
Page 36
INSERT INTO TABLE sale PARTITION (xdate, state='CA')
SELECT
    id, amount, other_stuff,
    xdate, state
FROM staging_table
WHERE state = 'CA';
The Most Essential Hive Query Tunings
Page 37
Tune Split Size – Always
• mapred.max.split.size and mapred.min.split.size
• Hive processes data in chunks subject to these bounds.
• min too large -> too few mappers.
• max too small -> too many mappers.
• Tune variables until mappers occupy:
– All map slots if you own the cluster.
– A reasonable number of map slots if you don't.
• Example:
– set mapred.max.split.size=100000000;
– set mapred.min.split.size=1000000;
• Manual today, automatic in a future version of Hive.
• You will need to set these for most queries.
Page 38
Tune io.sort.mb – Sometimes
• Hive and Map/Reduce maintain some separate buffers.
• If Hive maps need lots of local memory you may need to shrink the
map/reduce buffers.
• If your maps spill, try it out.
• Example:
– set io.sort.mb=100;
Page 39
Other Settings You Need
• All the time:
– set hive.optimize.mapjoin.mapreduce=true;
– set hive.optimize.bucketmapjoin=true;
– set hive.optimize.bucketmapjoin.sortedmerge=true;
– set hive.auto.convert.join=true;
– set hive.auto.convert.sortmerge.join=true;
– set hive.auto.convert.sortmerge.join.noconditionaltask=true;
• When bucketing data:
– set hive.enforce.bucketing=true;
– set hive.enforce.sorting=true;
• These and more are set by default in HDP 1.3.
– Check for them in hive-site.xml.
– If not present, set them in your query script.
Page 40
Check Your Settings
• In Hive shell:
– See all settings with “set;”
– See a particular setting by adding it.
– Change a setting.
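For instance (the variable shown is one from earlier slides):

```sql
-- Dump all settings, inspect one, then change it.
set;
set hive.auto.convert.join;
set hive.auto.convert.join=true;
```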
Page 41
Example Workflow: Create a Staging table
Page 42
CREATE EXTERNAL TABLE pos_staging (
    txnid STRING,
    txntime STRING,
    givenname STRING,
    lastname STRING,
    postalcode STRING,
    storeid STRING,
    ind1 STRING,
    productid STRING,
    purchaseamount FLOAT,
    creditcard STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/hdfs/staging_data/pos_staging';
The raw data is the result of initial loading or the output of a MapReduce
or Pig job. We create an external table over the results of that job, as we
only intend to use it to load an optimized table.
Example Workflow: Choose a partition scheme
Page 43
hive> select distinct concat(year(txntime),month(txntime)) as part_dt
    > from pos_staging;
...
OK
20121
201210
201211
201212
20122
20123
20124
20125
20126
20127
20128
20129
Time taken: 21.823 seconds, Fetched: 12 row(s)
Execute a query to determine if the partition choice returns a reasonable result. We will
use this projection to create partitions for our data set. You want to keep your partitions
large enough to be useful in partition pruning and efficient for HDFS storage. Hive has
configurable bounds to ensure you do not exceed per node and total partition counts
(defaults shown):
hive.exec.max.dynamic.partitions=1000
hive.exec.max.dynamic.partitions.pernode=100
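If a load legitimately needs more partitions than the defaults allow, the bounds can be raised for the session (values here are illustrative):

```sql
-- Raise the dynamic-partition limits before a large INSERT.
set hive.exec.max.dynamic.partitions=2000;
set hive.exec.max.dynamic.partitions.pernode=200;
```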
Example Workflow: Define optimized table
Page 44
CREATE TABLE fact_pos (
    txnid STRING,
    txntime STRING,
    givenname STRING,
    lastname STRING,
    postalcode STRING,
    storeid STRING,
    ind1 STRING,
    productid STRING,
    purchaseamount FLOAT,
    creditcard STRING
) PARTITIONED BY (part_dt STRING)
CLUSTERED BY (txnid)
SORTED BY (txnid)
INTO 24 BUCKETS
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
The part_dt field is defined in the PARTITIONED BY clause and cannot have
the same name as any other field. In this case we derive the partition key
from txntime. The CLUSTERED BY and SORTED BY clauses name the only key we
intend to join the table on. The table is stored as ORCFile with Snappy
compression.
Example Workflow: Load Data Into Optimized Table
Page 45
set hive.enforce.sorting=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapreduce.reduce.input.limit=-1;

FROM pos_staging
INSERT OVERWRITE TABLE fact_pos
PARTITION (part_dt)
SELECT
    txnid,
    txntime,
    givenname,
    lastname,
    postalcode,
    storeid,
    ind1,
    productid,
    purchaseamount,
    creditcard,
    concat(year(txntime),month(txntime)) as part_dt
SORT BY productid;
We use this command to load data from our staging table into the optimized
ORCFile format. Note that we are using dynamic partitioning with a
projection of the txntime field. This results in a MapReduce job that
copies the staging data into an ORCFile-format, Hive-managed table.
Example Workflow: Increase replication factor
Page 46
hadoop fs -setrep -R -w 5 /apps/hive/warehouse/fact_pos
Increase the replication factor for the high performance table.
This increases the chance for data locality. In this case, the
increase in replication factor is not for additional resiliency.
This is a trade-off of storage for performance.
In fact, to conserve space, you may choose to reduce the
replication factor for older data sets or even delete them
altogether. With the raw data in place and untouched, you can
always recreate the ORCFile high performance tables. Most
users place the steps in this example workflow into an Oozie
job to automate the work.
Example Workflow: Enabling Short Circuit Read
Page 47
In hdfs-site.xml (or your custom Ambari settings for HDFS; restart the
service after):

dfs.block.local-path-access.user=hdfs
dfs.client.read.shortcircuit=true
dfs.client.read.shortcircuit.skip.checksum=false

Short-circuit reads allow mappers to bypass the overhead of opening a port
to the DataNode when the data is local. The permissions on the local block
files need to permit hdfs to read them (they should by default). See
HDFS-2246 for more details.
Example Workflow: Execute your query
Page 48
set hive.mapred.reduce.tasks.speculative.execution=false;
set io.sort.mb=300;
set mapreduce.reduce.input.limit=-1;

select productid, ROUND(SUM(purchaseamount),2) as total
from fact_pos
where part_dt between '201210' and '201212'
group by productid
order by total desc
limit 100;

...
OK
20535   3026.87
39079   2959.69
28970   2869.87
45594   2821.15
...
15649   2242.05
47704   2241.22
8140    2238.61
Time taken: 40.087 seconds, Fetched: 100 row(s)
In the case above, we have a simple query executed to test out our table. We have some
example parameters set before our query. The good news is that most of the parameters
regarding join and engine optimizations are already set for you in Hive 0.11 (HDP). The
io.sort.mb is presented as an example of one of the tunable parameters you may want to
change for this particular SQL (note this value assumes 2-3GB JVMs for mappers). We are
also partition pruning for the holiday shopping season, Oct to Dec.
Example Workflow: Check Execution Path in Ambari
Page 49
You can check the execution path in Ambari’s Job viewer. This gives a high level overview of
the stages and particular number of map and reduce tasks. With Tez, it also shows task
number and execution order. The counts here are small as this is a sample from a single-node
HDP Sandbox. For more detailed analysis, you will need to read the query plan via “explain”.
Learn To Read The Query Plan
• “explain extended” in front of your query.
• Sections:
– Abstract syntax tree – you can usually ignore this.
– Stage dependencies – dependencies and # of stages.
– Stage plans – important info on how Hive is running the job.
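For example, against the fact table from the earlier workflow:

```sql
-- Print the full plan without executing the query.
explain extended
select productid, sum(purchaseamount)
from fact_pos
group by productid;
```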
Page 50
A sample query plan.
Page 51
4 Stages
Stage Details
Use Case: Star Schema Join
Page 52
request (fact table) joined to the client and path dimensions.

# Most popular URLs in US.
select path, country, count(*) as cnt
  from request
  join client on request.clientsk = client.id
  join path on request.pathsk = path.id
where
  client.country = 'us'
group by
  path, country
order by cnt desc
limit 5;
Ideal =
Single scan of fact table.
Case 1: hive.auto.convert.join=false;
• Generates 5 stages.
Page 53
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-2
  Stage-4 depends on stages: Stage-3
  Stage-0 is a root stage
Case 1: hive.auto.convert.join=false;
• 4 Map Reduces tells you something is not right.
Page 54
	
Stage: Stage-1
  Map Reduce
Stage: Stage-2
  Map Reduce
Stage: Stage-3
  Map Reduce
Stage: Stage-4
  Map Reduce
Case 2: hive.auto.convert.join=true;
• Only 4 stages this time.
Page 55
STAGE DEPENDENCIES:
  Stage-9 is a root stage
  Stage-8 depends on stages: Stage-9
  Stage-4 depends on stages: Stage-8
  Stage-0 is a root stage
Case 2: hive.auto.convert.join=true;
• Only 2 Map Reduces.
Page 56
	
Stage: Stage-8
  Map Reduce
Stage: Stage-4
  Map Reduce
Case 2: hive.auto.convert.join=true;
• Stage-9 is Map Reduce Local Work. What is that?
• Hive is loading the dimension tables (client and path)
directly. Client is filtered by country.
• Remaining map/reduce are for join and order by.
Page 57
	
  	
  	
  	
  	
  	
  	
  	
        client
          TableScan
            alias: client
            Filter Operator
              predicate:
                  expr: (country = 'us')
        ...
        path
          TableScan
            alias: path
        ...
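The settings behind the two cases above, as a sketch (property names from Hive's configuration; the size threshold is an illustrative value):

```sql
-- Allow Hive to convert qualifying joins into in-memory map joins,
-- loading small dimension tables directly as in Stage-9 above.
set hive.auto.convert.join=true;

-- Tables below this size (in bytes) are considered small enough to
-- load into memory (illustrative value).
set hive.mapjoin.smalltable.filesize=25000000;
```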
Hive Fast Query Checklist
Page 58
• Partitioned data along natural query boundaries (e.g. date).
• Minimized data shuffle by co-locating the most commonly joined data.
• Took advantage of skews for high-frequency values.
• Enabled short-circuit read.
• Used ORCFile.
• Sorted columns to facilitate row skipping for common targeted queries.
• Verified the query plan to ensure a single scan through the largest table.
• Checked the query plan to ensure partition pruning is happening.
• Used at least one ON clause in every JOIN.
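Several of the items above come together at table-creation time. A sketch, with illustrative table name, columns and bucket count:

```sql
create table request (
  clientsk bigint,
  pathsk   bigint,
  bytes    int
)
partitioned by (dt string)            -- natural query boundary
clustered by (clientsk)               -- co-locate commonly joined data
sorted by (pathsk) into 32 buckets    -- sorted columns enable row skipping
stored as orc;                        -- ORCFile
```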
For Even More Hive Performance
Page 59
• Increased replication factor for frequently accessed data and dimensions.
• Tuned io.sort.mb to avoid spilling.
• Tuned mapred.max.split.size and mapred.min.split.size to ensure 1 mapper wave.
• Tuned mapred.reduce.tasks to an appropriate value based on map output.
• Checked the jobtracker to ensure “row container” spilling does not occur.
• Gave extra memory for mapjoins like broadcast joins.
• Disabled orc.compress (file size will increase) and tuned orc.row.index.stride.
• Ensured the job ran in a single wave of mappers.
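As a sketch, the session-level equivalents of several items above (values are illustrative and depend on your JVM and cluster sizing; property names are the pre-YARN mapred.* names used in this deck):

```sql
set io.sort.mb=512;                    -- raise sort buffer to avoid spilling
set mapred.min.split.size=268435456;   -- 256MB splits toward one mapper wave
set mapred.max.split.size=268435456;
set mapred.reduce.tasks=40;            -- size reducer count to map output
```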
Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
Page 60
Loading Data in Hive
• Sqoop
– Data transfer from external RDBMS to Hive.
– Sqoop can load data directly to/from HCatalog.
• Hive LOAD
– Load files from HDFS or local filesystem.
– Format must agree with table format.
• Insert from query
– CREATE TABLE AS SELECT or INSERT INTO.
• WebHDFS + WebHCat
– Load data via REST APIs.
Page 61
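The LOAD and insert-from-query paths look like this (paths and table names illustrative):

```sql
-- Move a file already in HDFS into a table partition; the file format
-- must agree with the table format.
load data inpath '/staging/requests/2013-07-23'
  into table request partition (dt='2013-07-23');

-- Or populate a new table from a query.
create table request_summary as
select dt, count(*) as cnt from request group by dt;
```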
Optimized Sqoop Connectors
• Current:
– Teradata
– Oracle
– Netezza
– SQL Server
– MySQL
– Postgres
• Future:
– Vertica
High Performance Teradata Sqoop Connector
Page 63
• High performance Sqoop driver.
• Fully parallel data load using
Enhanced FastLoad.
• Multiple readers and writers for
efficient, wire-speed transfer.
• Copy data between Teradata and
HDP Hive, HBase or HDFS.
Enhanced FastLoad developed in partnership with Teradata.
Free to download & use with Hortonworks Data Platform.
Multiple active data channels for fully
parallel data movement (N to M)
ACID Properties
• Data loaded into Hive partition- or table-at-a-time.
– No INSERT or UPDATE statement. No transactions.
• Atomicity:
– Partition loads are atomic through directory renames in HDFS.
• Consistency:
– Ensured by HDFS. All nodes see the same partitions at all times.
– Immutable data = no update or delete consistency issues.
• Isolation:
– Read committed with an exception for partition deletes.
– Partitions can be deleted during queries. New partitions will not be
seen by jobs started before the partition add.
• Durability:
– Data is durable in HDFS before partition exposed to Hive.
Page 64
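The atomicity described above is what makes a pattern like this safe: write the data to HDFS first, then expose it to Hive in a single metadata operation (paths illustrative):

```sql
alter table request
  add partition (dt='2013-07-23')
  location '/data/requests/dt=2013-07-23';
```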
Handling Semi-Structured Data
• Hive supports arrays, maps, structs and unions.
• SerDes map JSON, XML and other formats natively
into Hive.
Page 65
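A sketch of the complex types in a table definition (schema illustrative):

```sql
create table session_log (
  user_id bigint,
  tags    array<string>,
  props   map<string, string>,
  device  struct<os:string, version:string>
);
```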
Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
Page 66
Hive Authorization
• Hive provides Users, Groups, Roles and Privileges
• Granular permissions on tables, DDL and DML
operations.
• Not designed for high security:
– On a non-kerberized cluster, it is up to the client to supply their
user name.
– Suitable for preventing accidental data loss.
Page 67
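A sketch of the role-based model (names illustrative); remember this prevents accidents rather than providing hard security:

```sql
create role analysts;
grant select on table request to role analysts;
grant role analysts to user alice;
```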
HiveServer2
• HiveServer2 is a gateway / JDBC / ODBC endpoint
Hive clients can talk to.
• Supports secure and non-secure clusters.
• DoAs support allows Hive query to run as the
requester.
• (Coming Soon) LDAP authentication.
Page 68
Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
Page 69
Roadmap to 100x Faster
Page 70
Hadoop / Hive improvements, delivered across three phases (Phase 1, Phase 2, Phase 3):

• ORCFile – column store, high compression, predicate/filter pushdowns.
• Buffer Caching – cache accessed data, optimized for the vector engine.
• Tez – express data processing tasks more simply, eliminate disk writes.
• Tez Service – pre-warmed containers, low-latency dispatch.
• Vector Query Engine – optimized for modern processor architectures.
• Base Optimizations – generate simplified DAGs, join improvements.
• Query Planner – intelligent cost-based optimizer.
Phase 1 Improvements
Path to Making Hive 100x Faster
Page 71
Join Optimizations
• Performance improvements in Hive 0.11 include new and improved join types:
– In-memory Hash Join: Fast for fact-to-dimension joins.
– Sort-Merge-Bucket Join: Scalable for large-table to large-table
joins.
• More Efficient Query Plan Generation
– Joins done in-memory when possible, saving map-reduce steps.
– Combine map/reduce jobs when GROUP BY and ORDER BY use
the same key.
• More Than 30x Performance Improvement for Star
Schema Join
Page 72
Star Schema Join Improvements in 0.11
Page 73
Hive 0.11 Star Schema Join Performance
Page 74
In-Memory Hash Join
Hive 0.11 SMB Join Improvements
Page 75
Sort-Merge-Bucket Join
ORCFile – Efficient Columnar Layout
Large block size
well suited for
HDFS.
Columnar format
arranges columns
adjacent within the
file for compression
and fast access.
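Choosing ORCFile is a storage clause at table-creation time; compression and index stride are per-table properties (table name and values illustrative):

```sql
create table request_orc (clientsk bigint, pathsk bigint)
stored as orc
tblproperties (
  'orc.compress'         = 'ZLIB',   -- or SNAPPY / NONE
  'orc.row.index.stride' = '10000'   -- rows between row-index entries
);
```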
ORCFile Provides High Compression
Phase 2 & 3 Improvements
Path to Making Hive 100x Faster
Page 78
Vectorization
• Designed for Modern Processor Architectures
– Make the most use of L1 and L2 cache.
– Avoid branching whenever possible.
• How It Works
– Process records in batches of 1,000 to maximize cache use.
– Generate code on-the-fly to minimize branching.
• What It Gives
– 30x+ improvement in number of rows processed per second.
Page 79
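Once this work shipped, vectorization is switched on per session; the property name below is from the later Hive releases that delivered it:

```sql
set hive.vectorized.execution.enabled=true;
```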
Other Runtime Optimizations
• Optimized Query Planner
– Automatically determine optimal execution parameters.
• Buffering
– Cache hot data in memory.
Page 80
Hadoop 2.0 - YARN
•  Re-architected Hadoop framework
•  Focus on scale and innovation
– Support 10,000+ computer clusters
– Extensible to encourage innovation
•  Next generation execution
– Improves MapReduce performance
•  Supports new frameworks beyond
MapReduce
– Low latency, streaming, etc
– Do more with a single Hadoop cluster
•  Separation of Resource Management from MapReduce

(Diagram: YARN cluster resource management layered over HDFS – redundant, reliable storage – supporting MapReduce, Tez, graph processing and other frameworks.)
Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
– A new incubation project in the ASF
• Who else is involved?
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo,
Microsoft
• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution
of Enterprise Hadoop
Tez on YARN: Going Beyond Batch
Tez Optimizes Execution
New runtime engine for
more efficient data processing
Always-On Tez Service
Low latency processing for
all Hadoop data processing
Tez - Core Idea
Page 84
YARN ApplicationMaster to run DAG of Tez Tasks
Task with pluggable Input, Processor and Output
Tez Task = <Input, Processor, Output>

(Diagram: a Task wires an Input into a Processor into an Output.)
Tez: Building blocks for scalable data processing
Page 85
• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
Hive/MR versus Hive/Tez
Page 86

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

(Diagram: with MapReduce, the query runs as a chain of separate jobs – SELECT b.id / JOIN(a, b), SELECT c.price / JOIN(a, c), then GROUP BY a.state with COUNT(*) and AVERAGE(c.price) – writing intermediate results to HDFS between every job. With Tez, the same stages run as a single DAG of tasks, so Tez avoids unneeded writes to HDFS.)
Tez Service
• Map/Reduce query startup is expensive.
• Solution: Tez Service
  – Hot containers ready for immediate use
  – Removes task and job launch overhead (~5s – 30s)
  – Hive submits the query plan directly to the Tez Service
  – Native Hadoop service, not ad-hoc
Page 87
Agenda
• Hive – What Is It Good For?
• Hive’s Architecture and SQL Compatibility
• Turning Hive Performance to 11
• Get Data In and Out of Hive
• Hive Security
• Project Stinger – Making Hive 100x Faster
• Connecting to Hive From Popular Tools
Page 88
Hive: The De-Facto SQL Interface for Hadoop
Page 89
Connectivity
• JDBC – Included with Hive.
• ODBC – Free driver available at hortonworks.com.
• WebHCat – Run jobs using a simple REST interface.
• BI Ecosystem – Most popular BI tools support Hive.
Page 90
Learn More with Hortonworks HDP Sandbox
Page 91
Built-in tutorials show how to connect
to Hive with ODBC and Excel.
Learn to connect to Hadoop with
popular BI tools.

2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控Michael Zhang
 
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性Michael Zhang
 
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试Michael Zhang
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_reportMichael Zhang
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and HadoopMichael Zhang
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Michael Zhang
 
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Michael Zhang
 
Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Michael Zhang
 
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Michael Zhang
 
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Michael Zhang
 
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Michael Zhang
 
Q con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyQ con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyMichael Zhang
 

More from Michael Zhang (20)

廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐
 
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier ArchitectureHKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
 
2014 GITC 帶上數據去創業 talkingdata—高铎
 2014 GITC 帶上數據去創業 talkingdata—高铎 2014 GITC 帶上數據去創業 talkingdata—高铎
2014 GITC 帶上數據去創業 talkingdata—高铎
 
Fastsocket Linxiaofeng
Fastsocket LinxiaofengFastsocket Linxiaofeng
Fastsocket Linxiaofeng
 
Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
2014 Hpocon 李志刚   1号店 - puppet在1号店的实践2014 Hpocon 李志刚   1号店 - puppet在1号店的实践
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
 
2014 Hpocon 姚仁捷 唯品会 - data driven ops
2014 Hpocon 姚仁捷   唯品会 - data driven ops2014 Hpocon 姚仁捷   唯品会 - data driven ops
2014 Hpocon 姚仁捷 唯品会 - data driven ops
 
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
 
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
 
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
 
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_report
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.
 
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
 
Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]
 
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
 
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
 
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
 
Q con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyQ con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodology
 

Recently uploaded

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Recently uploaded (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Hive tuning

  • 1. Deep Dive content by Hortonworks, Inc. is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Hive & Performance – Toronto Hadoop User Group, July 23 2013. Presenter: Adam Muise – Hortonworks, amuise@hortonworks.com
  • 2. Agenda • Hive – What Is It Good For? • Hive's Architecture and SQL Compatibility • Turning Hive Performance to 11 • Get Data In and Out of Hive • Hive Security • Project Stinger – Making Hive 100x Faster • Connecting to Hive From Popular Tools
  • 3. Hive – SQL Analytics For Any Data Size. Sensor, mobile, weblog and operational/MPP data: store and query all data in Hive, using existing SQL tools and existing SQL processes.
  • 4. Hive's Focus • Scalable SQL processing over data in Hadoop • Scales to 100PB+ • Structured and unstructured data
  • 5. Comparing Hive with RDBMS
    Hive: SQL interface. | RDBMS: SQL interface.
    Hive: Focus on analytics. | RDBMS: May focus on online or analytics.
    Hive: No transactions. | RDBMS: Transactions usually supported.
    Hive: Partition adds, no random INSERTs; in-place updates not natively supported (but are possible). | RDBMS: Random INSERT and UPDATE supported.
    Hive: Distributed processing via map/reduce. | RDBMS: Distributed processing varies by vendor (if available).
    Hive: Scales to hundreds of nodes. | RDBMS: Seldom scales beyond 20 nodes.
    Hive: Built for commodity hardware. | RDBMS: Often built on proprietary hardware (especially when scaling out).
    Hive: Low cost per petabyte. | RDBMS: What's a petabyte?
  • 6. Agenda • Hive – What Is It Good For? • Hive's Architecture and SQL Compatibility • Turning Hive Performance to 11 • Get Data In and Out of Hive • Hive Security • Project Stinger – Making Hive 100x Faster • Connecting to Hive From Popular Tools
  • 7. Hive: The SQL Interface to Hadoop. [Architecture diagram: clients (Web UI, JDBC/ODBC, CLI) connect to HiveServer2; Hive's Compiler, Optimizer and Executor run on top of Hadoop's JobTracker, NameNode, TaskTracker and DataNode.] (1) User issues SQL query. (2) Hive parses and plans query. (3) Query converted to Map/Reduce. (4) Map/Reduce run by Hadoop.
  • 8. Hive: Reliable SQL Processing at Scale. [Diagram: job progress stored in HDFS across four nodes.] Time 1: job = 50% complete, progress stored in HDFS. Time 2: Node 3 fails, job = 85% complete. Time 3: job moves to Node 4, job = 100% complete.
  • 9. SQL Coverage: SQL 92 with Extensions
    SQL datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, ARRAY, MAP, STRUCT, UNION, DECIMAL, CHAR, VARCHAR, DATE.
    SQL semantics: SELECT, LOAD, INSERT from query; expressions in WHERE and HAVING; GROUP BY, ORDER BY, SORT BY; CLUSTER BY, DISTRIBUTE BY; sub-queries in FROM clause; ROLLUP and CUBE; UNION; LEFT, RIGHT and FULL INNER/OUTER JOIN; CROSS JOIN, LEFT SEMI JOIN; windowing functions (OVER, RANK, etc.); sub-queries for IN/NOT IN, HAVING; EXISTS / NOT EXISTS; INTERSECT, EXCEPT.
    Legend: Available / Roadmap / New in Hive 0.11.
  • 10. Agenda • Hive – What Is It Good For? • Hive's Architecture and SQL Compatibility • Turning Hive Performance to 11 • Get Data In and Out of Hive • Hive Security • Project Stinger – Making Hive 100x Faster • Connecting to Hive From Popular Tools
  • 11. Data Abstractions in Hive. Partitions, buckets and skews facilitate faster, more direct data access. Database → Table → Partitions → Buckets; optionally per table, skewed keys are split out from unskewed keys.
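The abstractions above map directly onto HiveQL DDL. A minimal sketch (table and column names are illustrative, not from the deck):

```sql
-- Partitions: one directory per dt value; queries filtering on dt
-- read only the matching directories.
CREATE TABLE weblog (
  ip STRING, url STRING, ts TIMESTAMP
)
PARTITIONED BY (dt STRING)
-- Buckets: rows hashed on ip into a fixed number of files,
-- enabling bucketed joins and sampling.
CLUSTERED BY (ip) INTO 32 BUCKETS
STORED AS ORC;

-- Skews: frequently seen values split into their own files
-- (here, the hypothetical hot value 0).
CREATE TABLE clicks (user_id INT, url STRING)
SKEWED BY (user_id) ON (0) STORED AS DIRECTORIES;
```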
  • 12. "I heard you should avoid joins…" • "Joins are evil" – Cal Henderson. Joins should be avoided in online systems. • Joins are unavoidable in analytics; making joins fast is the key design point.
  • 13. Quick Refresher on Joins. Joins match values from one table against values in another table. Example tables: customer (first, last, id) and order (cid, price, quantity). SELECT * FROM customer JOIN order ON customer.id = order.cid;
  • 14. Hive Join Strategies
    Shuffle join: join keys are shuffled using map/reduce and joins are performed reduce-side. Pros: works regardless of data size or layout. Cons: most resource-intensive and slowest join type.
    Broadcast join: small tables are loaded into memory on all nodes; the mapper scans through the large table and joins. Pros: very fast, single scan through the largest table. Cons: all but one table must be small enough to fit in RAM.
    Sort-merge-bucket join: mappers take advantage of co-location of keys to do efficient joins. Pros: very fast for tables of any size. Cons: data must be sorted and bucketed ahead of time.
  • 15. Shuffle Joins in Map Reduce. SELECT * FROM customer JOIN order ON customer.id = order.cid; Map output examples: { id: 11911, { first: Nick, last: Toner }}, { id: 11914, { first: Rodger, last: Clayton }}, { cid: 4150, { price: 10.50, quantity: 3 }}, { cid: 11914, { price: 12.25, quantity: 27 }}. Identical keys are shuffled to the same reducer and the join is done reduce-side. Expensive from a network utilization standpoint.
  • 16. Broadcast Join • Star schemas use dimension tables small enough to fit in RAM. • Small tables held in memory by all nodes. • Single pass through the large table. • Used for star-schema type joins common in DW.
  • 17. When both tables are too large for memory, cluster and sort by the most common join key: CREATE TABLE customer (id int, first string, last string) CLUSTERED BY(id) SORTED BY(id) INTO 32 BUCKETS; CREATE TABLE order (cid int, price float, quantity int) CLUSTERED BY(cid) SORTED BY(cid) INTO 32 BUCKETS;
  • 18. Hive's Clustering and Sorting. Observation 1: sorting by the join key makes joins easy; all possible matches reside in the same area on disk.
  • 19. Hive's Clustering and Sorting. Observation 2: hash-bucketing a join key ensures all matching values reside on the same node; equi-joins can then run with no shuffle.
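With both tables bucketed and sorted as above, Hive can be steered onto the bucketed and sort-merge-bucket join paths. A hedged sketch of the relevant session settings (property names are from Hive's documented configuration; availability and defaults vary by Hive version):

```sql
-- Let Hive convert eligible joins to broadcast (map) joins automatically
SET hive.auto.convert.join = true;
-- Use a bucket map join when both sides are bucketed on the join key
SET hive.optimize.bucketmapjoin = true;
-- Upgrade to a sort-merge-bucket join when both sides are also sorted
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

SELECT * FROM customer JOIN order ON customer.id = order.cid;
```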
  • 20. Controlling Data Locality with Hive • Bucketing: hash partition values into a configurable number of buckets; usually coupled with sorting. • Skews: split values out into separate files; used when certain values are frequently seen. • Replication factor: increase replication factor to accelerate reads; controlled at the HDFS layer. • Sorting: sort the values within given columns; greatly accelerates queries when used with ORCFile filter pushdown.
  • 21. Guidelines for Architecting Hive Data
    Small table, hot data, any query pattern → increase replication factor.
    Any size, any profile, very precise filters → sort on the column most frequently used in precise queries.
    Large table joined to another large table → sort and bucket both tables along the join key.
    Large table where one value is >25% of the count within a high-cardinality column → split the frequent value into a separate skew.
    Large table whose queries tend to have a natural boundary such as date → partition the data along the natural boundary.
  • 22. Hive Persistence Formats • Built-in formats: ORCFile, RCFile, Avro, delimited text, regular expression, S3 logfile, typed bytes. • 3rd-party add-ons: JSON, XML.
  • 23. Hive allows mixed formats. • Use case: ingest data in a write-optimized format like JSON or delimited text, then run a nightly batch job to convert it to read-optimized ORCFile.
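That nightly conversion can be sketched in HiveQL. Assuming a delimited staging table (all names and the partition value below are illustrative):

```sql
-- Write-optimized landing table: plain delimited text, cheap to append to
CREATE TABLE sale_staging (
  id INT, ts TIMESTAMP, productsk INT,
  storesk INT, amount DECIMAL, state STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Read-optimized table, partitioned by load date
CREATE TABLE sale_orc (
  id INT, ts TIMESTAMP, productsk INT,
  storesk INT, amount DECIMAL, state STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Nightly batch: move yesterday's data into ORC
INSERT INTO TABLE sale_orc PARTITION (dt = '2013-07-22')
SELECT * FROM sale_staging;
```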
  • 24. ORCFile – Efficient Columnar Layout. Large block size well suited to HDFS. The columnar format arranges columns adjacent within the file for compression and fast access.
  • 25. ORCFile Advantages • High compression: many tricks used out of the box to ensure high compression rates (RLE, dictionary encoding, etc.). • High performance: inline indexes record value ranges within blocks of ORCFile data; filter pushdown allows efficient scanning during precise queries. • Flexible data model: all Hive types, including maps, structs and unions.
  • 26. High Compression with ORCFile
  • 27. Some ORCFile Samples. Example table sale (id, timestamp, productsk, storesk, amount, state): CREATE TABLE sale (id int, timestamp timestamp, productsk int, storesk int, amount decimal, state string) STORED AS orc;
  • 28. ORCFile Options and Defaults
    orc.compress (default ZLIB): high-level compression (one of NONE, ZLIB, SNAPPY).
    orc.compress.size (default 262,144 = 256 KiB): number of bytes in each compression chunk.
    orc.stripe.size (default 268,435,456 = 256 MiB): number of bytes in each stripe.
    orc.row.index.stride (default 10,000): number of rows between index entries (must be >= 1,000).
    orc.create.index (default true): whether to create row indexes.
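These keys are set per table via TBLPROPERTIES at creation time. For example (the values here are illustrative overrides of the defaults above, and the table name is hypothetical):

```sql
CREATE TABLE sale_snappy (
  id INT, productsk INT, amount DECIMAL
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"         = "SNAPPY",   -- lighter-weight compression than ZLIB
  "orc.stripe.size"      = "67108864", -- 64 MiB stripes instead of 256 MiB
  "orc.row.index.stride" = "10000"     -- keep the default index granularity
);
```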
  • 29. No Compression: Faster but Larger. CREATE TABLE sale (id int, timestamp timestamp, productsk int, storesk int, amount decimal, state string) STORED AS orc tblproperties ("orc.compress"="NONE");
  • 30. Column Sorting to Facilitate Skipping

  sale (sorted by productsk):
    id     timestamp            productsk  storesk  amount  state
    10005  2013-06-13T09:03:18  1192       497      $25.73  IL
    10002  2013-06-13T09:03:06  4671       606      $67.12  MA
    10003  2013-06-13T09:03:08  7224       174      $96.85  CA
    10004  2013-06-13T09:03:12  9354       123      $67.76  CA
    10001  2013-06-13T09:03:05  10739      359      $52.99  IL
    10000  2013-06-13T09:03:05  16775      670      $70.50  CA

  CREATE TABLE sale (
      id int, timestamp timestamp,
      productsk int, storesk int,
      amount decimal, state string
  ) STORED AS orc;

  INSERT INTO TABLE sale
  SELECT * FROM staging SORT BY productsk;

  ORCFile skipping speeds queries like WHERE productsk = X, productsk IN (Y, Z), etc.
  • 31. Not Your Traditional Database
  • Traditional solution to all RDBMS problems:
    – Put an index on it!
  • Doing this in Hadoop = #fail
  (Image: your Oracle DBA)
  • 32. Going Fast in Hadoop
  • Hadoop:
    – Really good at coordinated sequential scans.
    – No random I/O; a traditional index is pretty much useless.
  • Keys to speed in Hadoop:
    – Sorting and skipping take the place of indexing.
    – Minimizing data shuffle is the other key consideration.
  • Skipping data:
    – Divide data among different files which can be pruned out (partitions, buckets and skews).
    – Skip records during scans using small embedded indexes (automatic when you use the ORCFile format).
    – Sort data ahead of time: this simplifies joins and makes skipping more effective.
  • 33. Data Layout Considerations for Fast Hive
  • Skip reads:
    – Partition tables and/or use skew.
    – Sort secondary columns when using ORCFile.
  • Minimize shuffle:
    – Sort and bucket on common join keys.
    – Use broadcast joins when joining small tables.
  • Reduce latency:
    – Increase the replication factor for hot data.
    – Enable short-circuit read.
    – Take advantage of Tez + Tez Service (future).
  • 34. Partitioning and Virtual Columns
  • Partitioning makes queries go fast.
  • You will almost always use some sort of partitioning.
  • When partitioning you will use one or more virtual columns.
  • Virtual columns cause directories to be created in HDFS.
    – Files for that partition are stored within that subdirectory.

  -- Notice how xdate and state are not "real" column names.
  CREATE TABLE sale (
      id int, amount decimal, ...
  ) partitioned by (xdate string, state string);
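  As a sketch of what this looks like on disk (warehouse path and file names here are hypothetical), loading two partitions of the sale table above would create nested directories such as:

  ```
  /apps/hive/warehouse/sale/xdate=2013-03-01/state=CA/000000_0
  /apps/hive/warehouse/sale/xdate=2013-03-01/state=IL/000000_0
  ```

  Queries that filter on xdate and state can then prune whole directories instead of scanning them.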
  • 35. Loading Data with Virtual Columns
  • By default at least one virtual column must be hard-coded.
  • You can load all partitions in one shot:
    – set hive.exec.dynamic.partition.mode=nonstrict;
    – Warning: you can easily overwhelm your cluster this way.

  INSERT INTO TABLE sale PARTITION (xdate='2013-03-01', state='CA')
  SELECT * FROM staging_table
  WHERE xdate = '2013-03-01' AND state = 'CA';

  set hive.exec.dynamic.partition.mode=nonstrict;
  INSERT INTO TABLE sale PARTITION (xdate, state)
  SELECT * FROM staging_table;
  • 36. You May Need to Re-Order Columns
  • Virtual columns must be last within the inserted data set.
  • You can use the SELECT statement to re-order.

  INSERT INTO TABLE sale PARTITION (xdate, state)
  SELECT
      id, amount, other_stuff,
      xdate, state
  FROM staging_table
  WHERE state = 'CA';
  • 37. The Most Essential Hive Query Tunings
  • 38. Tune Split Size – Always
  • mapred.max.split.size and mapred.min.split.size
  • Hive processes data in chunks subject to these bounds.
  • min too large -> too few mappers.
  • max too small -> too many mappers.
  • Tune the variables until mappers occupy:
    – All map slots if you own the cluster.
    – A reasonable number of map slots if you don't.
  • Example:
    – set mapred.max.split.size=100000000;
    – set mapred.min.split.size=1000000;
  • Manual today, automatic in a future version of Hive.
  • You will need to set these for most queries.
  • 39. Tune io.sort.mb – Sometimes
  • Hive and Map/Reduce maintain some separate buffers.
  • If Hive maps need lots of local memory you may need to shrink the map/reduce buffers.
  • If your maps spill, try it out.
  • Example:
    – set io.sort.mb=100;
  • 40. Other Settings You Need
  • All the time:
    – set hive.optimize.mapjoin.mapreduce=true;
    – set hive.optimize.bucketmapjoin=true;
    – set hive.optimize.bucketmapjoin.sortedmerge=true;
    – set hive.auto.convert.join=true;
    – set hive.auto.convert.sortmerge.join=true;
    – set hive.auto.convert.sortmerge.join.noconditionaltask=true;
  • When bucketing data:
    – set hive.enforce.bucketing=true;
    – set hive.enforce.sorting=true;
  • These and more are set by default in HDP 1.3.
    – Check for them in hive-site.xml.
    – If not present, set them in your query script.
  • 41. Check Your Settings
  • In the Hive shell:
    – See all settings with "set;"
    – See a particular setting by adding its name.
    – Change a setting by assigning it a new value.
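  The three operations above can be sketched as a short Hive shell session (the setting name is real; the session itself is illustrative):

  ```sql
  set;                                             -- list all current settings
  set hive.exec.dynamic.partition.mode;            -- show one particular setting
  set hive.exec.dynamic.partition.mode=nonstrict;  -- change it for this session
  ```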
  • 42. Example Workflow: Create a Staging Table

  CREATE EXTERNAL TABLE pos_staging (
      txnid STRING,
      txntime STRING,
      givenname STRING,
      lastname STRING,
      postalcode STRING,
      storeid STRING,
      ind1 STRING,
      productid STRING,
      purchaseamount FLOAT,
      creditcard STRING
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  LOCATION '/user/hdfs/staging_data/pos_staging';

  The raw data is the result of initial loading or the output of a MapReduce or Pig job. We create an external table over the results of that job because we only intend to use it to load an optimized table.
  • 43. Example Workflow: Choose a partition scheme

  hive> select distinct concat(year(txntime), month(txntime)) as part_dt
      > from pos_staging;
  ...
  OK
  20121
  201210
  201211
  201212
  20122
  20123
  20124
  20125
  20126
  20127
  20128
  20129
  Time taken: 21.823 seconds, Fetched: 12 row(s)

  Execute a query to determine whether the partition choice returns a reasonable result. We will use this projection to create partitions for our data set. Keep your partitions large enough to be useful for partition pruning and efficient for HDFS storage. Hive has configurable bounds to ensure you do not exceed per-node and total partition counts (defaults shown):
  hive.exec.max.dynamic.partitions=1000
  hive.exec.max.dynamic.partitions.pernode=100
  • 44. Example Workflow: Define optimized table

  CREATE TABLE fact_pos (
      txnid STRING,
      txntime STRING,
      givenname STRING,
      lastname STRING,
      postalcode STRING,
      storeid STRING,
      ind1 STRING,
      productid STRING,
      purchaseamount FLOAT,
      creditcard STRING
  ) PARTITIONED BY (part_dt STRING)
  CLUSTERED BY (txnid)
  SORTED BY (txnid)
  INTO 24 BUCKETS
  STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

  The part_dt field is defined in the PARTITIONED BY clause and cannot have the same name as any other field; in this case we derive the partition key from txntime. The CLUSTERED BY and SORTED BY clauses contain the only key we intend to join the table on. The table is stored as ORCFile with Snappy compression.
  • 45. Example Workflow: Load Data Into Optimized Table

  set hive.enforce.sorting=true;
  set hive.enforce.bucketing=true;
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set mapreduce.reduce.input.limit=-1;

  FROM pos_staging
  INSERT OVERWRITE TABLE fact_pos
  PARTITION (part_dt)
  SELECT
      txnid,
      txntime,
      givenname,
      lastname,
      postalcode,
      storeid,
      ind1,
      productid,
      purchaseamount,
      creditcard,
      concat(year(txntime), month(txntime)) as part_dt
  SORT BY productid;

  We use this command to load data from our staging table into the optimized ORCFile format. Note that we are using dynamic partitioning with the projection of the txntime field. This runs a MapReduce job that copies the staging data into an ORCFile-format, Hive-managed table.
  • 46. Example Workflow: Increase replication factor

  hadoop fs -setrep -R -w 5 /apps/hive/warehouse/fact_pos

  Increase the replication factor for the high-performance table; this increases the chance of data locality. Here the increase in replication factor is not for additional resiliency: it is a trade-off of storage for performance. In fact, to conserve space, you may choose to reduce the replication factor for older data sets or even delete them altogether. With the raw data in place and untouched, you can always recreate the ORCFile high-performance tables. Most users place the steps in this example workflow into an Oozie job to automate the work.
  • 47. Example Workflow: Enabling Short-Circuit Read

  In hdfs-site.xml (or your custom Ambari settings for HDFS; restart the service after):

  dfs.block.local-path-access.user=hdfs
  dfs.client.read.shortcircuit=true
  dfs.client.read.shortcircuit.skip.checksum=false

  Short-circuit reads allow mappers to bypass the overhead of opening a port to the DataNode when the data is local. The permissions on the local block files need to permit hdfs to read them (they should by default). See HDFS-2246 for more details.
  • 48. Example Workflow: Execute your query

  set hive.mapred.reduce.tasks.speculative.execution=false;
  set io.sort.mb=300;
  set mapreduce.reduce.input.limit=-1;

  select productid, ROUND(SUM(purchaseamount), 2) as total
  from fact_pos
  where part_dt between '201210' and '201212'
  group by productid
  order by total desc
  limit 100;

  ...
  OK
  20535  3026.87
  39079  2959.69
  28970  2869.87
  45594  2821.15
  ...
  15649  2242.05
  47704  2241.22
  8140   2238.61
  Time taken: 40.087 seconds, Fetched: 100 row(s)

  Here a simple query is executed to test out our table, with a few example parameters set beforehand. The good news is that most of the parameters for join and engine optimizations are already set for you in Hive 0.11 (HDP). io.sort.mb is presented as an example of a tunable parameter you may want to change for this particular SQL (this value assumes 2-3 GB JVMs for mappers). We are also partition pruning for the holiday shopping season, October to December.
  • 49. Example Workflow: Check Execution Path in Ambari
  You can check the execution path in Ambari's Job viewer, which gives a high-level overview of the stages and the number of map and reduce tasks. With Tez, it also shows task number and execution order. The counts here are small because this is a sample from a single-node HDP Sandbox. For more detailed analysis, you will need to read the query plan via "explain".
  • 50. Learn To Read The Query Plan
  • Put "explain extended" in front of your query.
  • Sections:
    – Abstract syntax tree – you can usually ignore this.
    – Stage dependencies – dependencies and number of stages.
    – Stage plans – important info on how Hive is running the job.
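  For example, the plan for one of the workflow queries can be requested by prefixing it with those keywords (the query body here is illustrative):

  ```sql
  EXPLAIN EXTENDED
  SELECT productid, SUM(purchaseamount)
  FROM fact_pos
  WHERE part_dt = '201212'
  GROUP BY productid;
  ```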
  • 51. A sample query plan (screenshot: 4 stages, with stage details).
  • 52. Use Case: Star Schema Join
  Tables: request (fact table), client (dimension), path (dimension).

  -- Most popular URLs in US.
  select path, country, count(*) as cnt
      from request
      join client on request.clientsk = client.id
      join path on request.pathsk = path.id
  where
      client.country = 'us'
  group by
      path, country
  order by cnt desc
  limit 5;

  Ideal = a single scan of the fact table.
  • 53. Case 1: hive.auto.convert.join=false;
  • Generates 5 stages.

  STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-2 depends on stages: Stage-1
      Stage-3 depends on stages: Stage-2
      Stage-4 depends on stages: Stage-3
      Stage-0 is a root stage
  • 54. Case 1: hive.auto.convert.join=false;
  • Four MapReduce jobs tell you something is not right.

  Stage: Stage-1
      Map Reduce
  Stage: Stage-2
      Map Reduce
  Stage: Stage-3
      Map Reduce
  Stage: Stage-4
      Map Reduce
  • 55. Case 2: hive.auto.convert.join=true;
  • Only 4 stages this time.

  STAGE DEPENDENCIES:
      Stage-9 is a root stage
      Stage-8 depends on stages: Stage-9
      Stage-4 depends on stages: Stage-8
      Stage-0 is a root stage
  • 56. Case 2: hive.auto.convert.join=true;
  • Only 2 MapReduce jobs.

  Stage: Stage-8
      Map Reduce
  Stage: Stage-4
      Map Reduce
  • 57. Case 2: hive.auto.convert.join=true;
  • Stage-9 is Map Reduce Local Work. What is that?
  • Hive is loading the dimension tables (client and path) directly. Client is filtered by country.
  • The remaining map/reduce stages handle the join and the order by.

  client
      TableScan
          alias: client
          Filter Operator
              predicate:
                  expr: (country = 'us')
  ...
  path
      TableScan
          alias: path
  ...
  • 58. Hive Fast Query Checklist
  – Partitioned data along natural query boundaries (e.g. date).
  – Minimized data shuffle by co-locating the most commonly joined data.
  – Took advantage of skews for high-frequency values.
  – Enabled short-circuit read.
  – Used ORCFile.
  – Sorted columns to facilitate row skipping for common targeted queries.
  – Verified the query plan to ensure a single scan through the largest table.
  – Checked the query plan to ensure partition pruning is happening.
  – Used at least one ON clause in every JOIN.
  • 59. For Even More Hive Performance
  – Increased the replication factor for frequently accessed data and dimensions.
  – Tuned io.sort.mb to avoid spilling.
  – Tuned mapred.max.split.size and mapred.min.split.size to ensure 1 mapper wave.
  – Tuned mapred.reduce.tasks to an appropriate value based on map output.
  – Checked the jobtracker to ensure "row container" spilling does not occur.
  – Gave extra memory for mapjoins like broadcast joins.
  – Disabled orc.compress (file size will increase) and tuned orc.row.index.stride.
  – Ensured the job ran in a single wave of mappers.
  • 60. Agenda
  • Hive – What Is It Good For?
  • Hive's Architecture and SQL Compatibility
  • Turning Hive Performance to 11
  • Get Data In and Out of Hive
  • Hive Security
  • Project Stinger – Making Hive 100x Faster
  • Connecting to Hive From Popular Tools
  • 61. Loading Data in Hive
  • Sqoop
    – Data transfer from external RDBMS to Hive.
    – Sqoop can load data directly to/from HCatalog.
  • Hive LOAD
    – Load files from HDFS or the local filesystem.
    – Format must agree with the table format.
  • Insert from query
    – CREATE TABLE AS SELECT or INSERT INTO.
  • WebHDFS + WebHCat
    – Load data via REST APIs.
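  The middle two options can be sketched as follows (the landing path and the product_totals table name are hypothetical). Note that LOAD moves files as-is, so their format must match the table definition, while CREATE TABLE AS SELECT writes data in the new table's own format:

  ```sql
  -- Hive LOAD: move files that already match the table's delimited format.
  LOAD DATA INPATH '/landing/pos/2013-07-23' INTO TABLE pos_staging;

  -- Insert from query: create and populate a table in one statement.
  CREATE TABLE product_totals AS
  SELECT productid, SUM(purchaseamount) AS total
  FROM pos_staging
  GROUP BY productid;
  ```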
  • 62. Optimized Sqoop Connectors
  • Current: Teradata, Oracle, Netezza, SQL Server, MySQL, Postgres.
  • Future: Vertica.
  • 63. High Performance Teradata Sqoop Connector
  • High-performance Sqoop driver.
  • Fully parallel data load using Enhanced FastLoad.
  • Multiple readers and writers for efficient, wire-speed transfer.
  • Multiple active data channels for fully parallel data movement (N to M).
  • Copy data between Teradata and HDP Hive, HBase or HDFS.
  Enhanced FastLoad was developed in partnership with Teradata and is free to download and use with Hortonworks Data Platform.
  • 64. ACID Properties
  • Data is loaded into Hive partition- or table-at-a-time.
    – No INSERT or UPDATE statement. No transactions.
  • Atomicity:
    – Partition loads are atomic through directory renames in HDFS.
  • Consistency:
    – Ensured by HDFS. All nodes see the same partitions at all times.
    – Immutable data = no update or delete consistency issues.
  • Isolation:
    – Read committed, with an exception for partition deletes.
    – Partitions can be deleted during queries. New partitions will not be seen by jobs started before the partition add.
  • Durability:
    – Data is durable in HDFS before the partition is exposed to Hive.
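  A minimal sketch of the partition-at-a-time model (the path and partition values are hypothetical): the data is written to HDFS first, then exposed to Hive in a single metadata operation, which is what makes the load atomic and durable:

  ```sql
  -- Data already sits at this HDFS location; this one statement makes it visible.
  ALTER TABLE sale ADD PARTITION (xdate='2013-03-02', state='CA')
  LOCATION '/data/sale/xdate=2013-03-02/state=CA';
  ```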
  • 65. Handling Semi-Structured Data
  • Hive supports arrays, maps, structs and unions.
  • SerDes map JSON, XML and other formats natively into Hive.
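  A small sketch of the complex types (the table and column names are hypothetical):

  ```sql
  CREATE TABLE events (
      id    int,
      tags  array<string>,                  -- ordered list of values
      attrs map<string, string>,            -- key/value pairs
      loc   struct<lat:double, lon:double>  -- named fields
  );

  -- Element access in queries:
  SELECT tags[0], attrs['browser'], loc.lat FROM events;
  ```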
  • 66. Agenda
  • Hive – What Is It Good For?
  • Hive's Architecture and SQL Compatibility
  • Turning Hive Performance to 11
  • Get Data In and Out of Hive
  • Hive Security
  • Project Stinger – Making Hive 100x Faster
  • Connecting to Hive From Popular Tools
  • 67. Hive Authorization
  • Hive provides users, groups, roles and privileges.
  • Granular permissions on tables, DDL and DML operations.
  • Not designed for high security:
    – On a non-kerberized cluster, it is up to the client to supply their user name.
    – Suitable for preventing accidental data loss.
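  An illustrative sketch of the role and privilege model (the role and user names are hypothetical):

  ```sql
  CREATE ROLE analysts;
  GRANT SELECT ON TABLE fact_pos TO ROLE analysts;
  GRANT ROLE analysts TO USER alice;
  ```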
  • 68. HiveServer2
  • HiveServer2 is a gateway / JDBC / ODBC endpoint that Hive clients can talk to.
  • Supports secure and non-secure clusters.
  • DoAs support allows a Hive query to run as the requester.
  • (Coming soon) LDAP authentication.
  • 69. Agenda
  • Hive – What Is It Good For?
  • Hive's Architecture and SQL Compatibility
  • Turning Hive Performance to 11
  • Get Data In and Out of Hive
  • Hive Security
  • Project Stinger – Making Hive 100x Faster
  • Connecting to Hive From Popular Tools
  • 70. Roadmap to 100x Faster
  Built on Hadoop and Hive, delivered in three phases (Phase 1 through Phase 3):
  • ORCFile: column store, high compression, predicate/filter pushdowns.
  • Base optimizations: generate simplified DAGs, join improvements.
  • Query planner: intelligent cost-based optimizer.
  • Tez: express data processing tasks more simply, eliminate disk writes.
  • Tez Service: pre-warmed containers, low-latency dispatch.
  • Buffer caching: cache accessed data, optimized for the vector engine.
  • Vector query engine: optimized for modern processor architectures.
  • 71. Phase 1 Improvements – Path to Making Hive 100x Faster
  • 72. Join Optimizations
  • Performance improvements in Hive 0.11.
  • New join types added or improved in Hive 0.11:
    – In-memory Hash Join: fast for fact-to-dimension joins.
    – Sort-Merge-Bucket Join: scalable for large-table to large-table joins.
  • More efficient query plan generation:
    – Joins done in-memory when possible, saving map-reduce steps.
    – Combine map/reduce jobs when GROUP BY and ORDER BY use the same key.
  • More than 30x performance improvement for the star schema join.
  • 73. Star Schema Join Improvements in 0.11
  • 74. Hive 0.11 Star Schema Join Performance (chart: In-Memory Hash Join)
  • 75. Hive 0.11 SMB Join Improvements (chart: Sort-Merge-Bucket Join)
  • 76. ORCFile – Efficient Columnar Layout
  • Large block size well suited for HDFS.
  • Columnar format arranges columns adjacent within the file for compression and fast access.
  • 77. ORCFile Provides High Compression
  • 78. Phase 2 & 3 Improvements – Path to Making Hive 100x Faster
  • 79. Vectorization
  • Designed for modern processor architectures:
    – Make the most use of L1 and L2 cache.
    – Avoid branching whenever possible.
  • How it works:
    – Process records in batches of 1,000 to maximize cache use.
    – Generate code on-the-fly to minimize branching.
  • What it gives:
    – 30x+ improvement in the number of rows processed per second.
  • 80. Other Runtime Optimizations
  • Optimized query planner:
    – Automatically determine optimal execution parameters.
  • Buffering:
    – Cache hot data in memory.
  • 81. Hadoop 2.0 - YARN
  • Re-architected Hadoop framework.
  • Focus on scale and innovation:
    – Support 10,000+ computer clusters.
    – Extensible to encourage innovation.
  • Next-generation execution:
    – Improves MapReduce performance.
  • Supports new frameworks beyond MapReduce:
    – Low latency, streaming, etc.
    – Do more with a single Hadoop cluster.
  • Separation of resource management from MapReduce.
  (Diagram: HDFS provides redundant, reliable storage; YARN provides cluster resource management; MapReduce, Tez, graph processing and other frameworks run on top of YARN.)
  • 82. Tez ("Speed")
  • What is it?
    – A data processing framework as an alternative to MapReduce.
    – A new incubation project in the ASF.
  • Who else is involved?
    – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft.
  • Why does it matter?
    – Widens the platform for Hadoop use cases.
    – Crucial to improving the performance of low-latency applications.
    – Core to the Stinger initiative.
    – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop.
  • 83. Tez on YARN: Going Beyond Batch
  • Tez Optimizes Execution – a new runtime engine for more efficient data processing.
  • Always-On Tez Service – low-latency processing for all Hadoop data processing.
  • 84. Tez - Core Idea
  • A YARN ApplicationMaster runs a DAG of Tez Tasks.
  • Each task has a pluggable Input, Processor and Output.
  • Tez Task = <Input, Processor, Output>
  • 85. Tez: Building blocks for scalable data processing
  • Classical 'Map': HDFS Input -> Map Processor -> Sorted Output.
  • Classical 'Reduce': Shuffle Input -> Reduce Processor -> HDFS Output.
  • Intermediate 'Reduce' (for Map-Reduce-Reduce): Shuffle Input -> Reduce Processor -> Sorted Output.
  • 86. Hive/MR versus Hive/Tez

  SELECT a.state, COUNT(*), AVG(c.price)
  FROM a
  JOIN b ON (a.id = b.id)
  JOIN c ON (a.itemId = c.itemId)
  GROUP BY a.state

  Hive on MapReduce runs this as a chain of separate map/reduce jobs (SELECT b.id; JOIN(a, b); SELECT c.price; JOIN(a, c); GROUP BY a.state with COUNT(*) and AVG(c.price)), writing intermediate results to HDFS between jobs. Hive on Tez runs the same plan as a single DAG of map and reduce tasks. Tez avoids unneeded writes to HDFS.
  • 87. Tez Service
  • Map/Reduce query startup is expensive.
  • Solution – Tez Service:
    – Hot containers ready for immediate use.
    – Removes task and job launch overhead (~5s – 30s).
    – Hive submits the query plan directly to the Tez Service.
    – A native Hadoop service, not ad-hoc.
  • 88. Agenda
  • Hive – What Is It Good For?
  • Hive's Architecture and SQL Compatibility
  • Turning Hive Performance to 11
  • Get Data In and Out of Hive
  • Hive Security
  • Project Stinger – Making Hive 100x Faster
  • Connecting to Hive From Popular Tools
  • 89. Hive: The De-Facto SQL Interface for Hadoop
  • 90. Connectivity
  • JDBC – included with Hive.
  • ODBC – free driver available at hortonworks.com.
  • WebHCat – run jobs using a simple REST interface.
  • BI ecosystem – most popular BI tools support Hive.
  • 91. Learn More with Hortonworks HDP Sandbox
  Built-in tutorials show how to connect to Hive with ODBC and Excel. Learn to connect to Hadoop with popular BI tools.