Apache Hive provides SQL-like access to data stored in Apache Hadoop. Apache HBase stores tabular data in Hadoop and supports update operations. Combining the two capabilities is often desirable, but the current integration shows limitations such as performance issues. In this talk, Enis Soztutar presents an overview of Hive and HBase and discusses recent updates and improvements from the community on the integration of the two projects. He also covers techniques used to reduce data exchange and improve efficiency.
Integration of Hive and HBase
1. Integration of Apache Hive and HBase
Enis Soztutar
enis [at] apache [dot] org
@enissoz
Architecting the Future of Big Data
© Hortonworks Inc. 2011 Page 1
2. About Me
• User of and committer to Hadoop since 2007
• Contributor to Apache Hadoop, HBase, Hive, and Gora
• Joined Hortonworks as Member of Technical Staff
• Twitter: @enissoz
3. Agenda
• Overview of Hive and HBase
• Hive + HBase Features and Improvements
• Future of Hive and HBase
• Q&A
4. Apache Hive Overview
• Apache Hive is a data warehouse system for Hadoop
• SQL-like query language called HiveQL
• Built for PB-scale data
• Main purpose is analysis and ad hoc querying
• Database / table / partition / bucket – DDL operations
• SQL types + complex types (ARRAY, MAP, etc.)
• Very extensible
• Not for: small datasets, low-latency queries, OLTP
5. Apache Hive Architecture
[Architecture diagram: the CLI, JDBC/ODBC, Hive Thrift Server, and Hive Web Interface submit queries to the Driver (Parser, Planner, Optimizer, Execution), which reaches the Metastore through a Metastore client; jobs execute on MapReduce over HDFS, and the Metastore is backed by an RDBMS.]
6. Overview of Apache HBase
• Apache HBase is the Hadoop database
• Modeled after Google's BigTable
• A sparse, distributed, persistent, multi-dimensional sorted map
• The map is indexed by a row key, column key, and a timestamp
• Each value in the map is an uninterpreted array of bytes
• Low-latency random data access
7. Overview of Apache HBase
• Logical view:
From: Bigtable: A Distributed Storage System for Structured Data, Chang et al.
8. Apache HBase Architecture
[Architecture diagram: clients talk to ZooKeeper and to the region servers; the HMaster coordinates the region servers, each of which hosts multiple regions, with data persisted on HDFS.]
9. Hive + HBase Features and Improvements
10. Hive + HBase Motivation
• Hive and HBase have different characteristics:
  – Hive: high latency, structured data, targets analysts
  – HBase: low latency, unstructured data, targets programmers
• Hive data warehouses on Hadoop are high latency
  – Long ETL times
  – No access to real-time data
• Analyzing HBase data with MapReduce requires custom coding
• Hive and SQL are already known by many analysts
11. Use Case 1: HBase as ETL Data Sink
From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
12. Use Case 2: HBase as Data Source
From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
13. Use Case 3: Low Latency Warehouse
From HUG - Hive/HBase Integration or, MaybeSQL? April 2010 John Sichi Facebook
http://www.slideshare.net/hadoopusergroup/hive-h-basehadoopapr2010
14. Example: Hive + HBase (HBase table)
hbase(main):001:0> create 'short_urls', {NAME => 'u'}, {NAME => 's'}
hbase(main):014:0> scan 'short_urls'
ROW          COLUMN+CELL
 bit.ly/aaaa column=s:hits, value=100
 bit.ly/aaaa column=u:url, value=hbase.apache.org/
 bit.ly/abcd column=s:hits, value=123
 bit.ly/abcd column=u:url, value=example.com/foo
15. Example: Hive + HBase (Hive table)
CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
  ("hbase.columns.mapping" = ":key, u:url, s:hits")
TBLPROPERTIES
  ("hbase.table.name" = "short_urls");
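Once the table above is defined, it can be read and written like any other Hive table; a minimal sketch (row values follow the earlier HBase shell scan, and the increment query is purely illustrative):

```sql
-- Reads go through the HBase storage handler; the row key maps to short_url.
SELECT short_url, url, hit_count
FROM short_urls
WHERE short_url = 'bit.ly/abcd';

-- Writes are issued to HBase as Puts under the hood.
INSERT OVERWRITE TABLE short_urls
SELECT short_url, url, hit_count + 1
FROM short_urls;
```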
16. Storage Handler
• Hive defines the HiveStorageHandler class for different storage backends: HBase / Cassandra / MongoDB / etc.
• Storage handler has hooks for
  – Getting input / output formats
  – Metadata operations: CREATE TABLE, DROP TABLE, etc.
• Storage handler is a table-level concept
  – Does not support Hive partitions and buckets
17. Apache Hive + HBase Architecture
[Architecture diagram: as in the Hive architecture, the CLI, Hive Thrift Server, and Hive Web Interface feed the Driver (Parser, Planner, Optimizer, Execution), with a Metastore client talking to the Metastore; a StorageHandler layer now routes execution to MapReduce and HBase over HDFS, and the Metastore remains backed by an RDBMS.]
18. Hive + HBase Integration
• For InputFormat/OutputFormat, getSplits(), etc., the underlying HBase classes are used
• Column selection and certain filters can be pushed down
• HBase tables can be used with other (Hadoop-native) tables and SQL constructs
• Hive DDL operations are converted to HBase DDL operations via the client hook
  – All operations are performed by the client
  – No two-phase commit
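As a sketch of mixing an HBase-backed table with a native Hive table in one query (the page_views table and its referrer column are hypothetical, not from the talk):

```sql
SELECT u.short_url, u.hit_count, p.referrer
FROM short_urls u          -- HBase-backed table
JOIN page_views p          -- hypothetical native Hive table on HDFS
  ON (u.url = p.url)
WHERE u.hit_count > 100;
```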
19. Schema / Type Mapping
20. Schema Mapping
• Hive table + columns + column types <=> HBase table + column families (+ column qualifiers)
• Every field in the Hive table is mapped, in order, to either
  – The table key (using :key as selector)
  – A column family (cf:) -> MAP fields in Hive
  – A column (cf:cq)
• The Hive table does not need to include all columns in HBase
• CREATE TABLE short_urls(
    short_url string,
    url string,
    hit_count int,
    props map<string,string>
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES
    ("hbase.columns.mapping" = ":key, u:url, s:hits, p:")
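Because the p: column family is mapped to a Hive MAP, individual qualifiers become map entries reachable by key at query time; a sketch (the 'owner' qualifier is illustrative, not from the talk):

```sql
-- Each qualifier under the p: family appears as an entry in props.
SELECT short_url, props['owner']
FROM short_urls
WHERE props['owner'] IS NOT NULL;
```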
21. Type Mapping
• Recently added to Hive (0.9.0)
• Previously all types were converted to strings in HBase
• Hive has:
  – Primitive types: INT, STRING, BINARY, DATE, etc.
  – ARRAY<Type>
  – MAP<PrimitiveType, Type>
  – STRUCT<a:INT, b:STRING, c:STRING>
• HBase does not have types
  – Bytes.toBytes()
22. Type Mapping
• Table-level property
  "hbase.table.default.storage.type" = "binary"
• Type mapping can be given per column after #
  – Any prefix of "binary", e.g. u:url#b
  – Any prefix of "string", e.g. u:url#s
  – The dash char "-", e.g. u:url#-
CREATE TABLE short_urls(
  short_url string,
  url string,
  hit_count int,
  props map<string,string>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
  ("hbase.columns.mapping" = ":key#b,u:url#b,s:hits#b,p:#s")
23. Type Mapping
• If the type is not a primitive or MAP, it is converted to a JSON string and serialized
• Still a few rough edges for schema and type mapping:
  – No Hive BINARY support in HBase mapping
  – No mapping of the HBase timestamp (can only provide the put timestamp)
  – No arbitrary mapping of STRUCTs / ARRAYs into the HBase schema
24. Bulk Load
• Steps to bulk load:
  – Sample source data for range partitioning
  – Save sampling results to a file
  – Run a CLUSTER BY query using HiveHFileOutputFormat and TotalOrderPartitioner
  – Import HFiles into the HBase table
• The ideal setup would be
  SET hive.hbase.bulk=true;
  INSERT OVERWRITE TABLE web_table SELECT …;
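The four manual steps above can be sketched roughly as follows, assuming the procedure from the Hive HBase bulk-load documentation; the table names, path, reducer count, and sampling rate are illustrative, and hbase_hfiles is assumed to be a table declared with HiveHFileOutputFormat:

```sql
-- 1-2. Sample the source to pick range-partition boundaries, save to a table.
CREATE TABLE hb_range_keys (short_url STRING)
STORED AS SEQUENCEFILE;
INSERT OVERWRITE TABLE hb_range_keys
SELECT short_url
FROM (SELECT short_url
      FROM source_urls TABLESAMPLE(BUCKET 1 OUT OF 1000 ON rand()) s
      ORDER BY short_url) sampled;

-- 3. Write HFiles with a total-order sort on the row key.
SET mapred.reduce.tasks=12;
SET hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
SET total.order.partitioner.path=/tmp/hb_range_key_list;
INSERT OVERWRITE TABLE hbase_hfiles
SELECT * FROM source_urls CLUSTER BY short_url;

-- 4. Import the HFiles into the HBase table (from the command line):
--    hadoop jar hbase-VERSION.jar completebulkload /tmp/hbsort short_urls
```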
26. Filter Pushdown
• The idea is to push filter expressions down to the storage layer to minimize scanned data
• Enables use of indexes at the HDFS or HBase level
• Example:
  CREATE EXTERNAL TABLE users (userid BIGINT, email STRING, … )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,…")
  SELECT ... FROM users WHERE userid > 1000000 AND email LIKE '%@gmail.com';
  -> scan.setStartRow(Bytes.toBytes(1000000))
27. Filter Decomposition
• The optimizer pushes predicates down into the query plan
• Storage handlers can negotiate with the Hive optimizer to decompose the filter:
  x > 3 AND upper(y) = 'XYZ'
• Handle x > 3 in HBase; send upper(y) = 'XYZ' back to Hive as the residual
• Works with:
  key = 3, key > 3, etc.
  key > 3 AND key < 100
• Only works against constant expressions
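A sketch of a decomposable predicate against the short_urls table from earlier: the conjunction on the row key can be turned into an HBase scan range, while the non-key condition stays behind as a residual filter evaluated by Hive.

```sql
-- short_url is the row key: both bounds push down into the HBase scan.
-- hit_count is a non-key column: evaluated by Hive as the residual.
SELECT short_url, hit_count
FROM short_urls
WHERE short_url > 'bit.ly/a'
  AND short_url < 'bit.ly/b'
  AND hit_count > 50;
```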
29. Security – Big Picture
• Security becomes more important for supporting enterprise-level and multi-tenant applications
• Five different components ensure / impose security:
  – HDFS
  – MapReduce
  – HBase
  – ZooKeeper
  – Hive
• Each component has:
  – Authentication
  – Authorization
30. HBase Security – Closer look
• Released with HBase 0.92
• Fully optional module, disabled by default
• Needs an underlying secure Hadoop release
• SecureRPCEngine: optional engine enforcing SASL authentication
  – Kerberos
  – DIGEST-MD5 based tokens
  – TokenProvider coprocessor
• Access control is implemented as a coprocessor: AccessController
• Stores and distributes ACL data via ZooKeeper
  – Sensitive data is only accessible by HBase daemons
  – Clients do not need to authenticate to ZooKeeper
31. Hive Security – Closer look
• Hive has different deployment options; security considerations should take the different deployments into account
• Authentication is only supported at the Metastore, not in HiveServer, the web interface, or JDBC
• Authorization is enforced at the query layer (Driver)
• Pluggable authorization providers. The default one stores global/table/partition/column permissions in the Metastore:
  GRANT ALTER ON TABLE web_table TO USER bob;
  CREATE ROLE db_reader;
  GRANT SELECT, SHOW_DATABASE ON DATABASE mydb TO ROLE db_reader;
32. Hive Deployment Option 1
[Deployment diagram: the client runs the CLI and the full Driver stack (Parser, Planner, Optimizer, Execution) with authorization at the Driver and authentication at the Metastore; MapReduce, HBase, HDFS, and the Metastore RDBMS each enforce their own authentication/authorization.]
33. Hive Deployment Option 2
[Deployment diagram: as in option 1, the client hosts the CLI and Driver with authorization at the Driver and authentication at the Metastore; MapReduce, HBase, HDFS, and the Metastore RDBMS each enforce their own authentication/authorization.]
34. Hive Deployment Option 3
[Deployment diagram: clients connect via JDBC/ODBC, the CLI, the Hive Thrift Server, or the Hive Web Interface to a shared server hosting the Driver (Parser, Planner, Optimizer, Execution) with authorization at the Driver and authentication at the Metastore; MapReduce, HBase, HDFS, and the Metastore RDBMS each enforce their own authentication/authorization.]
35. Hive + HBase + Hadoop Security
• Regardless of Hive's own security, for Hive to work on secure Hadoop and HBase, we should:
  – Obtain delegation tokens for Hadoop and HBase jobs
  – Obey the storage-level (HDFS, HBase) permission checks
  – In HiveServer deployments, authenticate and impersonate the user
• Delegation tokens for Hadoop are already working
• Support for obtaining HBase delegation tokens was released in Hive 0.9.0
36. Future of Hive + HBase
• Improve on schema / type mapping
• Fully secure Hive deployment options
• HBase bulk import improvements
• Sortable signed numeric types in HBase
• Filter pushdown: non-key column filters
• Hive random access support for HBase
  – https://cwiki.apache.org/HCATALOG/random-access-framework.html
37. References
• Security
  – https://issues.apache.org/jira/browse/HIVE-2764
  – https://issues.apache.org/jira/browse/HBASE-5371
  – https://issues.apache.org/jira/browse/HCATALOG-245
  – https://issues.apache.org/jira/browse/HCATALOG-260
  – https://issues.apache.org/jira/browse/HCATALOG-244
  – https://cwiki.apache.org/confluence/display/HCATALOG/Hcat+Security+Design
• Type mapping / Filter pushdown
  – https://issues.apache.org/jira/browse/HIVE-1634
  – https://issues.apache.org/jira/browse/HIVE-1226
  – https://issues.apache.org/jira/browse/HIVE-1643
  – https://issues.apache.org/jira/browse/HIVE-2815
38. Other Resources
• Hadoop Summit
  – June 13-14
  – San Jose, California
  – www.Hadoopsummit.org
• Hadoop Training and Certification
  – Developing Solutions Using Apache Hadoop
  – Administering Apache Hadoop
  – Online classes available in the US, India, and EMEA
  – http://hortonworks.com/training/
© Hortonworks Inc. 2012 Page 38
39. Thanks
Questions?