HBaseCon 2013: Integration of Apache Hive and HBase
- 1. © Hortonworks Inc. 2011
Integration of Apache Hive
and HBase
Enis Soztutar
enis [at] apache [dot] org
Ashutosh Chauhan
hashutosh [at] apache [dot] org
- 2. © Hortonworks Inc. 2011
About Us
Enis Soztutar
• In the Hadoop space since 2007
• Committer and PMC Member in Apache HBase and Hadoop
• Twitter: @enissoz
Ashutosh Chauhan
• In the Hadoop space since 2009
• Committer and PMC Member in Apache Hive and Pig
- 3. © Hortonworks Inc. 2011
Agenda
• Overview of Hive
• Hive + HBase Features and Improvements
• Future of Hive and HBase
• Q&A
- 4. © Hortonworks Inc. 2011
Apache Hive Overview
• Apache Hive is a data warehouse system for Hadoop
• SQL-like query language called HiveQL
• Built for PB scale data
• Main purpose is analysis and ad hoc querying
• Database / table / partition / bucket – DDL Operations
• SQL Types + Complex Types (ARRAY, MAP, etc)
• Very extensible
• Not for: small data sets, OLTP
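A minimal HiveQL sketch of the DDL and ad hoc querying described above (the table and column names are illustrative, not from the deck):
-- Partitioned table with a complex (MAP) column
CREATE TABLE page_views (
  user_id BIGINT,
  url STRING,
  props MAP<STRING,STRING>
)
PARTITIONED BY (dt STRING);
-- Ad hoc aggregation over one partition
SELECT url, count(*) AS views
FROM page_views
WHERE dt = '2013-06-13'
GROUP BY url;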
- 5. © Hortonworks Inc. 2011
Apache Hive Architecture
[Architecture diagram: clients (CLI, JDBC/ODBC via the Hive Thrift Server, and the Hive Web Interface) submit queries to the Driver, which contains the Parser, Planner, Optimizer, and Execution layers; the Driver talks to the Metastore (backed by an RDBMS) through the MS client, and executes queries as MapReduce jobs over data in HDFS]
- 6. © Hortonworks Inc. 2011
Hive + HBase Features and
Improvements
- 7. © Hortonworks Inc. 2011
Hive + HBase Motivation
• Hive over HDFS and Hive over HBase have different characteristics
– Batch vs online
– Structured vs unstructured
– Analysts vs programmers
• Hive data warehouses on HDFS mean
– Long ETL times
– No access to real-time data
• Analyzing HBase data with MapReduce requires custom coding
• Hive and SQL are already known by many analysts
- 8. © Hortonworks Inc. 2011
Use Case 1: HBase as ETL Data Sink
From HUG: "Hive/HBase Integration or, MaybeSQL?", John Sichi, Facebook, April 2010
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
[Diagram: an INSERT … SELECT … FROM … query reads from HDFS tables and writes into HBase, which then serves online queries]
- 9. © Hortonworks Inc. 2011
Use Case 2: HBase as Data Source
From HUG: "Hive/HBase Integration or, MaybeSQL?", John Sichi, Facebook, April 2010
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
[Diagram: a SELECT … JOIN … GROUP BY … query reads from both HDFS tables and HBase to produce the query result]
- 10. © Hortonworks Inc. 2011
Use Case 3: Low Latency Warehouse
From HUG: "Hive/HBase Integration or, MaybeSQL?", John Sichi, Facebook, April 2010
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
[Diagram: continuous updates flow into HBase; periodic dumps land in HDFS tables; Hive queries run over both stores]
- 11. © Hortonworks Inc. 2011
Hive + HBase Example (HBase table)
hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}
hbase(main):014:0> scan 'short_urls'
ROW          COLUMN+CELL
bit.ly/aaaa  column=s:hits, value=100
bit.ly/aaaa  column=u:url, value=hbase.apache.org/
bit.ly/abcd  column=s:hits, value=123
bit.ly/abcd  column=u:url, value=example.com/foo
- 12. © Hortonworks Inc. 2011
Hive + HBase Example (Hive table)
CREATE TABLE short_urls(
short_url string,
url string,
hit_count int
)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,u:url,s:hits")
TBLPROPERTIES
("hbase.table.name" = "short_urls");
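Once defined, the table can be read and written like any other Hive table; a small usage sketch against the rows from the HBase scan above (staging_urls is a hypothetical source table):
-- Read HBase rows through Hive
SELECT short_url, hit_count
FROM short_urls
WHERE hit_count > 100;
-- Write into HBase through Hive (each row becomes an HBase put)
INSERT OVERWRITE TABLE short_urls
SELECT short_url, url, hit_count
FROM staging_urls;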
- 13. © Hortonworks Inc. 2011
Storage Handler
• Hive defines the HiveStorageHandler interface for plugging in
different storage backends: HBase, Cassandra, MongoDB, etc.
• A storage handler has hooks for
– Getting input / output formats
– Metadata operations: CREATE TABLE, DROP TABLE, etc.
• A storage handler is a table-level concept
– Does not support Hive partitions and buckets
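Because the metadata hooks mirror Hive DDL into HBase, dropping a managed handler table also drops the underlying HBase table (a sketch; an EXTERNAL table would leave the HBase table in place):
-- Meta hook: removes the Hive metadata and the HBase table itself
DROP TABLE short_urls;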
- 14. © Hortonworks Inc. 2011
Apache Hive + HBase Architecture
[Architecture diagram: the same Hive architecture as before (Metastore on an RDBMS, Hive Thrift Server, Driver with Parser, Planner, Optimizer, and Execution, CLI, Hive Web Interface, MS client, HDFS, MapReduce), extended with an HBase StorageHandler so query execution can read from and write to HBase as well as HDFS]
- 15. © Hortonworks Inc. 2011
Hive + HBase Integration
• For InputFormat / OutputFormat, getSplits(), etc., the underlying
HBase classes are used
• Column selection and certain filters can be pushed down
• HBase tables can be used together with other (Hadoop-native)
tables and SQL constructs
• Hive DDL operations are converted to HBase DDL operations via
the client hook
– All operations are performed by the client
– No two-phase commit
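As an example of mixing HBase-backed and native tables, the short_urls table can be joined against an ordinary HDFS-backed table (referral_logs is hypothetical):
-- HBase-backed table joined with a native Hive table
SELECT u.url, sum(l.clicks) AS total_clicks
FROM short_urls u
JOIN referral_logs l ON (u.short_url = l.short_url)
GROUP BY u.url;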
- 16. © Hortonworks Inc. 2011
Schema / Type Mapping
- 17. © Hortonworks Inc. 2011
Schema Mapping
• Hive table + columns + column types <=> HBase table + column
families (+ column qualifiers)
• Every field in the Hive table is mapped to either
– The table key (using :key as the selector)
– A column family (cf:) -> MAP fields in Hive
– A single column (cf:cq)
• The Hive table does not need to include all columns in HBase
- 19. © Hortonworks Inc. 2011
Schema Mapping - Example
CREATE TABLE short_urls(
short_url string,
url string,
hit_count int,
props map<string,string>
)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,u:url,s:hits,p:");
- 20. © Hortonworks Inc. 2011
Type Mapping
• Added in Hive 0.9.0
• Previously all types were converted to strings in HBase
• Hive has:
– Primitive types: INT, STRING, BINARY, DOUBLE, etc.
– ARRAY<Type>
– MAP<PrimitiveType, Type>
– STRUCT<a:INT, b:STRING, c:STRING>
• HBase does not have types
– Everything is raw bytes (Bytes.toBytes())
- 21. © Hortonworks Inc. 2011
Type Mapping
• Table-level property
"hbase.table.default.storage.type" = "binary"
• A type mapping can be given per column after #
– Any prefix of "binary", e.g. u:url#b
– Any prefix of "string", e.g. u:url#s
– The dash char "-", e.g. u:url#- (falls back to the table default)
- 22. © Hortonworks Inc. 2011
Type Mapping - Example
CREATE TABLE short_urls(
short_url string,
url string,
hit_count int,
props map<string,string>
)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s");
- 23. © Hortonworks Inc. 2011
Type Mapping
• If the type is not a primitive or a MAP, it is converted to a JSON
string and serialized
• Still a few rough edges for schema and type mapping:
– No support for the DECIMAL and BINARY Hive types
– No mapping of the HBase timestamp (it can only supply the put
timestamp)
– No arbitrary mapping of STRUCTs / ARRAYs into an HBase schema
- 24. © Hortonworks Inc. 2011
Bulk Load
• Steps to bulk load (see the sketch below):
– Sample the source data for range partitioning
– Save the sampling results to a file
– Run a CLUSTER BY query using HiveHFileOutputFormat and
TotalOrderPartitioner
– Import the HFiles into the HBase table
• The ideal setup would be a single statement:
SET hive.hbase.bulk=true;
INSERT OVERWRITE TABLE web_table SELECT …
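Until that single-statement form lands, the steps above map to a manual recipe; a hedged sketch loosely following the Hive wiki's HBase bulk-load procedure (the source table, paths, and sampling fraction are illustrative assumptions):
-- 1. Sample the source into a boundary-key table for range partitioning
CREATE TABLE hb_range_keys(short_url STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat';
INSERT OVERWRITE TABLE hb_range_keys
SELECT short_url
FROM source_urls TABLESAMPLE(BUCKET 1 OUT OF 1000 ON short_url) s
ORDER BY short_url;
-- (a real run keeps only reducer-count minus 1 boundary rows)
-- 2. Copy the resulting file where TotalOrderPartitioner can find it,
--    then run the CLUSTER BY query through HiveHFileOutputFormat
SET mapred.reduce.tasks=12;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path=/tmp/hb_range_key_list;
CREATE TABLE hbsort(short_url STRING, url STRING, hits STRING)
STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES('hfile.family.path' = '/tmp/hbsort/u');
INSERT OVERWRITE TABLE hbsort
SELECT short_url, url, hits
FROM source_urls
CLUSTER BY short_url;
-- 3. Hand the generated HFiles to HBase, e.g. via the
--    completebulkload tool shipped with HBase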
- 26. © Hortonworks Inc. 2011
Filter Pushdown
• The idea is to pass filter expressions down to the storage layer to
minimize the amount of data scanned
• Gives access to indexes at the HDFS or HBase level
• Example:
CREATE EXTERNAL TABLE users (userid BIGINT, email STRING, … )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…")
SELECT ... FROM users WHERE userid > 1000000 AND email LIKE
'%@gmail.com';
-> scan.setStartRow(Bytes.toBytes(1000000))
- 27. © Hortonworks Inc. 2011
Filter Decomposition
• The optimizer pushes predicates down into the query plan
• Storage handlers can negotiate with the Hive optimizer to
decompose the filter:
x > 3 AND upper(y) = 'XYZ'
• The handler takes x > 3 and sends upper(y) = 'XYZ' back to Hive
as the residual
• Works with:
key = 3, key > 3, etc.
key > 3 AND key < 100
• Only works against constant expressions
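Applied to the short_urls table above: the row-key predicate can be pushed into the HBase scan, while the non-key expression is left to Hive as the residual (a sketch):
SELECT short_url, url
FROM short_urls
WHERE short_url > 'bit.ly/aaaa'    -- pushed down: sets the scan start row
AND upper(url) LIKE '%APACHE%';    -- residual: evaluated by Hive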
- 28. © Hortonworks Inc. 2011
Future of Hive + HBase
• Improve on schema / type mapping
• Fully secure Hive deployment options
• HBase bulk import improvements
• Filter pushdown: non-key column filters
• Sortable signed numeric types in HBase
• Use HBase's new typing APIs (upcoming in HBase)
• Integration with Phoenix / extract common modules, hbase-sql?
- 29. © Hortonworks Inc. 2011
References
• Type mapping / Filter Pushdown
– https://issues.apache.org/jira/browse/HIVE-1634
– https://issues.apache.org/jira/browse/HIVE-1226
– https://issues.apache.org/jira/browse/HIVE-1643
– https://issues.apache.org/jira/browse/HIVE-2815