Apache Hive 2.0:
SQL, Speed, Scale
Alan Gates
Hive PMC Member
Co-founder Hortonworks
May 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Acknowledgements
 The Apache Hive community for building all this awesome tech
 Content of some of these slides is based on earlier presentations by Sergey Shelukhin
and Siddharth Seth
 alias Hive=‘Apache Hive’
alias Hadoop=‘Apache Hadoop’
alias Spark=‘Apache Spark’
alias Tez=‘Apache Tez’
alias Parquet=‘Apache Parquet’
alias ORC=‘Apache ORC’
alias Omid=‘Apache Omid (incubating)’
alias Calcite=‘Apache Calcite’
Apache Hive History
 Initially Hive provided SQL on Hadoop
– Provided a table view instead of file view of data
– Translated SQL to MapReduce
– Mostly used for ETL (Extract Transform Load)
– Big, batch, high start up time
 Around 2012 it became clear users wanted to do all data warehousing on Hadoop,
not just batch ETL
 Hive has shifted over time to focus on traditional data warehousing problems
– Still does large ETL well
– Now also can be used for analytics, reporting
– Work being done to better support BI (Business Intelligence) tools
 Not OLTP, very focused on backend analytics
Hive 1.x and 2.x
 New feature development in Hive moving at a fast pace
– Stressful for those who use Hive for its original purpose (ETL type SQL on MapReduce)
– Realizing the full potential of Hive as data warehouse on Hadoop requires more changes
 Compromise: follow Hadoop’s example, split into stable and new feature lines
 1.x
– Stable
– Backwards compatible
– Ongoing bug fixes
 2.x
– Major new features
– Backwards compatible where possible, but some things will be broken
– Hive 2.0 released February 15, 2016 – Not considered production ready
Hive 2.0 New Features Overview
 1039 JIRAs resolved with 2.0 as fix version
– 666 bugs
– 140 improvements or new features
 HPLSQL
 LLAP
 HBase Metastore
 Hive-On-Spark Improvements
 Cost Based Optimizer Improvements
 Many, many new features and bug fixes I will not have time to cover
Adding Procedural SQL: HPLSQL
 Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
– Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)
 Aims to be compatible with all major dialects of procedural SQL to maximize re-use of
existing scripts
 Currently external to Hive, communicates with Hive via JDBC.
– User runs command using hplsql binary
– Goal is to tightly integrate it so that Hive’s parser can execute HPLSQL, store HPLSQL procedures,
etc.
Sub-second Queries in Hive: LLAP (Live Long and Process)
 Persistent daemons
– Saves time on process start up (eliminates container allocation and JVM start up time)
– All code JITed within a query or two
 Data caching with an async I/O elevator
– Hot data cached in memory (columnar aware, so only hot columns cached)
– When possible, work is scheduled on a node that has the data cached; otherwise it runs on another node
 Operators can be executed inside LLAP when it makes sense
– Large, ETL style queries usually don’t make sense
– User code not run in LLAP for security
 Working on interface to allow other data engines to read securely in parallel
 Beta in 2.0
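The cache-affinity rule described above (schedule work on a node that already has the data cached when possible, otherwise run it on another node) can be illustrated with a toy model. This is a minimal sketch and not LLAP's actual scheduler: the function name `pick_node`, the node names, the per-node slot limit, and the "least-loaded node" fallback are all assumptions made for illustration.

```python
from typing import Dict, Set


def pick_node(column: str,
              cached: Dict[str, Set[str]],
              running: Dict[str, int],
              slots: int) -> str:
    """Pick a node for a task that reads `column`.

    Hypothetical rule: prefer a node that has the column cached and still
    has a free slot; otherwise fall back to the least-loaded node.
    """
    # Nodes that hold the column in cache and can still accept work.
    warm = [n for n, cols in cached.items()
            if column in cols and running[n] < slots]
    if warm:
        return min(warm, key=lambda n: (running[n], n))
    # No warm node is free: run elsewhere and accept a cache miss.
    return min(running, key=lambda n: (running[n], n))


# Hypothetical cluster state: which columns each node has cached,
# and how many tasks each node is currently running.
cached = {"node1": {"price", "qty"}, "node2": {"qty"}, "node3": set()}
running = {"node1": 2, "node2": 1, "node3": 0}

print(pick_node("qty", cached, running, slots=2))    # node2 (cached, free slot)
print(pick_node("price", cached, running, slots=2))  # node3 (node1 is full)
```

In this sketch the cache is tracked per column, which mirrors the "only hot columns cached" point above; a real scheduler would also weigh factors this model ignores.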
Hive With LLAP Execution Options
[Figure: three execution layouts shown side by side: Tez Only, LLAP + Tez, and LLAP only. Node labels in the diagram: AM, T, R, M.]
LLAP Performance
[Figure: bar chart "LLAP vs Hive 1.x 10TB Scale". Y axis: time in seconds (0 to 50). X axis: query3, query12, query20, query21, query26, query27, query42, query52, query55, query73, query89, query91, query98. Series: LLAP, Hive 1.x.]
LLAP Performance Continued
[Figure: chart "Hive / LLAP, Hive 1.2.1 Query Times". Y axis: time in seconds (0 to 500). Series: LLAP, Hive 1.2.1.]
38 out of 61 queries ran 50% faster
25 out of 61 queries ran 70% faster
12 out of 61 queries ran 80% faster
1 query ran 90% faster
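To put the counts above in proportion, the short sketch below converts each one into a share of the 61 queries. It uses only the numbers reported on the slide; the helper name `share` is an illustrative assumption, and the sketch does not try to interpret what "N% faster" means in terms of run time.

```python
TOTAL_QUERIES = 61

# (number of queries, reported "% faster" threshold), as listed on the slide
reported = [(38, 50), (25, 70), (12, 80), (1, 90)]


def share(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


for count, pct in reported:
    print(f"{count}/{TOTAL_QUERIES} queries ({share(count, TOTAL_QUERIES)}%) "
          f"ran {pct}% faster")
```

This gives roughly 62.3%, 41.0%, 19.7%, and 1.6% of the query set for the four rows.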
LLAP Limitations
 Currently in Beta
 Read only, no write path yet
 Does not work with ACID yet (see previous bullet)
 User must decide whether query runs fully in LLAP, mixed mode, or not at all
– Should be handled by CBO
 Currently only reads ORC files
 Currently only integrates with Tez as an engine
Speeding up Query Planning: HBase Metastore
 Add option to use HBase to store Hive’s metadata
 Why?
– Planning a query that reads several thousand partitions in Hive 1.2 takes 5+ seconds, mostly for metadata
acquisition
– ORM layer produces complex, slow schema (40+ tables)
– The need to work across 5 different databases limits performance optimizations and maximizes test
matrix for developers
– Limits caching opportunities as we cannot store too much data in a single node RDBMS
– The need to limit number of concurrent connections forces all metadata operations to be done during
query planning
– HBase addresses each of these
 Goal: cut metadata access time for query with thousands of partitions to 200 milliseconds
– Not there yet, currently at 1-1.5 seconds
 Challenges
– HBase lacks transactions, addressing via Apache Omid (incubating)
 Alpha in Hive 2.0
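The timing figures above translate into speedup factors as follows. This is a back-of-the-envelope sketch: it treats the "5+ seconds" figure for Hive 1.2 as exactly 5 seconds (so the real factors may be larger), and the helper name `speedup` is an illustrative assumption.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster `after_ms` is compared with `before_ms`."""
    return before_ms / after_ms


BASELINE_MS = 5000            # "5+ seconds" in Hive 1.2, taken as a lower bound
GOAL_MS = 200                 # stated goal
CURRENT_MS = (1000, 1500)     # "currently at 1-1.5 seconds"

print("goal vs. baseline:", speedup(BASELINE_MS, GOAL_MS))  # 25.0
for ms in CURRENT_MS:
    print(f"current ({ms} ms) vs. baseline: {speedup(BASELINE_MS, ms):.1f}x, "
          f"still {speedup(ms, GOAL_MS):.1f}x away from the goal")
```

Under these assumptions the goal is a 25x improvement; the current state is about 3.3x to 5x faster than the baseline and still 5x to 7.5x short of the goal.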
Improvements to Hive on Spark
 Dynamic partition pruning
 Make use of Spark persistence for self-join, self-union, and CTEs
 Vectorized map-join and other map-join improvements
 Parallel order by
 Pre-warming of containers
 Support for Spark 1.5
 Many bug fixes
Cost Based Optimizer (CBO) Improvements
 Hive’s CBO uses Calcite
– Not all optimization rules migrated yet, but 2.0 continues work towards that
 CBO on by default in 2.0 (wasn’t in 1.x)
 Main focus of CBO work has been BI queries (using TPC-DS as guide)
– Some work on machine generated queries, since tools generate some funky queries
 Focus on improving stats collection and estimating stats more accurately between
operators in the plan
And Many, Many More
• SQL Standard Auth is the default authorization (actually works)
• CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
• Codahale-based metrics (also in 1.3)
• HS2 Web UI
• Stability Improvements and bugfixes for ACID (almost production ready now)
• Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
• Improvements to Parquet performance (PPD, memory manager, etc.)
• ORC schema evolution (beta)
• Improvement to windowing functions, refactoring ORC before split, SIMD
optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez
session management, many more
Hive 2.0 Incompatibilities
 Java 7 & 8 supported, 6 no longer supported
 Requires Hadoop 2.x, Hadoop 1.x no longer supported
 MapReduce deprecated, Tez or Spark recommended instead
– At some future date MR will be removed
 Some configuration defaults changed, e.g.
– bucketing enforced by default
– metadata schema no longer created if it is missing
– SQL Standard authorization used by default
 We plan to remove Hive CLI in the future and replace with beeline CLI
– Why?
• Makes it easier for users to deploy secure clusters where all access is via [OJ]DBC
• It is cleaner to maintain one code path
– Does not require HiveServer2, can run HS2 embedded in beeline
Thank You

More Related Content

What's hot

Managing Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache AmbariManaging Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache AmbariHortonworks
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghaiYifeng Jiang
 
Hortonworks Data In Motion Series Part 3 - HDF Ambari
Hortonworks Data In Motion Series Part 3 - HDF Ambari Hortonworks Data In Motion Series Part 3 - HDF Ambari
Hortonworks Data In Motion Series Part 3 - HDF Ambari Hortonworks
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics OptimizationHortonworks
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureDataWorks Summit
 
Transactional SQL in Apache Hive
Transactional SQL in Apache HiveTransactional SQL in Apache Hive
Transactional SQL in Apache HiveDataWorks Summit
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseHortonworks
 
Apache NiFi Toronto Meetup
Apache NiFi Toronto MeetupApache NiFi Toronto Meetup
Apache NiFi Toronto MeetupHortonworks
 
Apache Ambari: Past, Present, Future
Apache Ambari: Past, Present, FutureApache Ambari: Past, Present, Future
Apache Ambari: Past, Present, FutureHortonworks
 
Scaling real time streaming architectures with HDF and Dell EMC Isilon
Scaling real time streaming architectures with HDF and Dell EMC IsilonScaling real time streaming architectures with HDF and Dell EMC Isilon
Scaling real time streaming architectures with HDF and Dell EMC IsilonHortonworks
 
MiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talkMiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talkJoe Percivall
 
Attunity Hortonworks Webinar- Sept 22, 2016
Attunity Hortonworks Webinar- Sept 22, 2016Attunity Hortonworks Webinar- Sept 22, 2016
Attunity Hortonworks Webinar- Sept 22, 2016Hortonworks
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudDataWorks Summit
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsHortonworks
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFiHortonworks
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureDataWorks Summit
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in EnterpriseDataWorks Summit
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution Hortonworks
 

What's hot (20)

Managing Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache AmbariManaging Enterprise Hadoop Clusters with Apache Ambari
Managing Enterprise Hadoop Clusters with Apache Ambari
 
Hive present-and-feature-shanghai
Hive present-and-feature-shanghaiHive present-and-feature-shanghai
Hive present-and-feature-shanghai
 
Hortonworks Data In Motion Series Part 3 - HDF Ambari
Hortonworks Data In Motion Series Part 3 - HDF Ambari Hortonworks Data In Motion Series Part 3 - HDF Ambari
Hortonworks Data In Motion Series Part 3 - HDF Ambari
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Log Analytics Optimization
Log Analytics OptimizationLog Analytics Optimization
Log Analytics Optimization
 
Hadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and FutureHadoop Operations - Past, Present, and Future
Hadoop Operations - Past, Present, and Future
 
Transactional SQL in Apache Hive
Transactional SQL in Apache HiveTransactional SQL in Apache Hive
Transactional SQL in Apache Hive
 
Double Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSenseDouble Your Hadoop Hardware Performance with SmartSense
Double Your Hadoop Hardware Performance with SmartSense
 
Apache NiFi Toronto Meetup
Apache NiFi Toronto MeetupApache NiFi Toronto Meetup
Apache NiFi Toronto Meetup
 
Apache Ambari: Past, Present, Future
Apache Ambari: Past, Present, FutureApache Ambari: Past, Present, Future
Apache Ambari: Past, Present, Future
 
Scaling real time streaming architectures with HDF and Dell EMC Isilon
Scaling real time streaming architectures with HDF and Dell EMC IsilonScaling real time streaming architectures with HDF and Dell EMC Isilon
Scaling real time streaming architectures with HDF and Dell EMC Isilon
 
MiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talkMiNiFi 0.0.1 MeetUp talk
MiNiFi 0.0.1 MeetUp talk
 
Attunity Hortonworks Webinar- Sept 22, 2016
Attunity Hortonworks Webinar- Sept 22, 2016Attunity Hortonworks Webinar- Sept 22, 2016
Attunity Hortonworks Webinar- Sept 22, 2016
 
Running Enterprise Workloads in the Cloud
Running Enterprise Workloads in the CloudRunning Enterprise Workloads in the Cloud
Running Enterprise Workloads in the Cloud
 
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power SystemsDelivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
Delivering a Flexible IT Infrastructure for Analytics on IBM Power Systems
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
An Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, FutureAn Overview on Optimization in Apache Hive: Past, Present, Future
An Overview on Optimization in Apache Hive: Past, Present, Future
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution
 

Similar to Apache Hive 2.0; SQL, Speed, Scale

Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016alanfgates
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesBig Data Spain
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...VMware Tanzu
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?DataWorks Summit
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData DayJohn Park
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BIDataWorks Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionDataWorks Summit/Hadoop Summit
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive DataWorks Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 

Similar to Apache Hive 2.0; SQL, Speed, Scale (20)

Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016
 
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan GatesApache Hive 2.0 SQL, Speed, Scale by Alan Gates
Apache Hive 2.0 SQL, Speed, Scale by Alan Gates
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
LLAP: Building Cloud First BI
LLAP: Building Cloud First BILLAP: Building Cloud First BI
LLAP: Building Cloud First BI
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 

More from Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

More from Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 


Apache Hive 2.0; SQL, Speed, Scale

  • 1. Apache Hive 2.0: SQL, Speed, Scale Alan Gates Hive PMC Member Co-founder Hortonworks May 2016
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Acknowledgements  The Apache Hive community for building all this awesome tech  Content of some of these slides based on earlier presentations by Sergey Shelukhin and Siddarth Seth  alias Hive=‘Apache Hive’ alias Hadoop=‘Apache Hadoop’ alias Spark=‘Apache Spark’ alias Tez=‘Apache Tez’ alias Parquet=‘Apache Parquet’ alias ORC=‘Apache ORC’ alias Omid=‘Apache Omid (incubating)’ alias Calcite=‘Apache Calcite’
  • 3. Apache Hive History
      • Initially Hive provided SQL on Hadoop
        – Provided a table view instead of file view of data
        – Translated SQL to MapReduce
        – Mostly used for ETL (Extract Transform Load)
        – Big, batch, high start up time
      • Around 2012 it became clear users wanted to do all data warehousing on Hadoop, not just batch ETL
      • Hive has shifted over time to focus on traditional data warehousing problems
        – Still does large ETL well
        – Now also can be used for analytics, reporting
        – Work being done to better support BI (Business Intelligence) tools
      • Not OLTP, very focused on backend analytics
  • 4. Hive 1.x and 2.x
      • New feature development in Hive moving at a fast pace
        – Stressful for those who use Hive for its original purpose (ETL type SQL on MapReduce)
        – Realizing the full potential of Hive as data warehouse on Hadoop requires more changes
      • Compromise: follow Hadoop’s example, split into stable and new feature lines
      • 1.x
        – Stable
        – Backwards compatible
        – Ongoing bug fixes
      • 2.x
        – Major new features
        – Backwards compatible where possible, but some things will be broken
        – Hive 2.0 released February 15, 2016
        – Not considered production ready
  • 5. Hive 2.0 New Features Overview
      • 1039 JIRAs resolved with 2.0 as fix version
        – 666 bugs
        – 140 improvements or new features
      • HPLSQL
      • LLAP
      • HBase Metastore
      • Hive-On-Spark Improvements
      • Cost Based Optimizer Improvements
      • Many, many new features and bug fixes I will not have time to cover
  • 6. Adding Procedural SQL: HPLSQL
      • Procedural SQL, akin to Oracle’s PL/SQL and Teradata’s stored procedures
        – Adds cursors, loops (FOR, WHILE, LOOP), branches (IF), HPLSQL procedures, exceptions (SIGNAL)
      • Aims to be compatible with all major dialects of procedural SQL to maximize re-use of existing scripts
      • Currently external to Hive, communicates with Hive via JDBC
        – User runs command using hplsql binary
        – Goal is to tightly integrate it so that Hive’s parser can execute HPLSQL, store HPLSQL procedures, etc.
  • 7. Sub-second Queries in Hive: LLAP (Live Long and Process)
      • Persistent daemons
        – Saves time on process start up (eliminates container allocation and JVM start up time)
        – All code JITed within a query or two
      • Data caching with an async I/O elevator
        – Hot data cached in memory (columnar aware, so only hot columns cached)
        – When possible, work is scheduled on the node with the data cached; if not, work is run on another node
      • Operators can be executed inside LLAP when it makes sense
        – Large, ETL style queries usually don’t make sense
        – User code not run in LLAP for security
      • Working on interface to allow other data engines to read securely in parallel
      • Beta in 2.0
  • 8. Hive With LLAP Execution Options
      [Diagram: placement of tasks (AM, M, R, T) under three execution options: Tez Only, LLAP + Tez, LLAP only]
  • 9. LLAP Performance
      [Bar chart: LLAP vs Hive 1.x at 10TB scale. Y-axis: time in seconds (0 to 50). X-axis: query3, query12, query20, query21, query26, query27, query42, query52, query55, query73, query89, query91, query98. Series: LLAP, Hive 1.x]
  • 10. LLAP Performance Continued
      [Bar chart: Hive / LLAP vs Hive 1.2.1 query times. Y-axis: time in seconds (0 to 500). Series: LLAP, Hive 1.2.1]
      • 38 out of 61 queries ran 50% faster
      • 25 out of 61 queries ran 70% faster
      • 12 out of 61 queries ran 80% faster
      • 1 query ran 90% faster
  • 11. LLAP Limitations
      • Currently in Beta
      • Read only, no write path yet
      • Does not work with ACID yet (see previous bullet)
      • User must decide whether query runs fully in LLAP, mixed mode, or not at all
        – Should be handled by CBO
      • Currently only reads ORC files
      • Currently only integrates with Tez as an engine
  • 12. Speeding up Query Planning: HBase Metastore
      • Add option to use HBase to store Hive’s metadata
      • Why?
        – Planning a query that reads several thousand partitions in Hive 1.2 takes 5+ seconds, mostly for metadata acquisition
        – ORM layer produces complex, slow schema (40+ tables)
        – The need to work across 5 different databases limits performance optimizations and maximizes test matrix for developers
        – Limits caching opportunities as we cannot store too much data in a single node RDBMS
        – The need to limit number of concurrent connections forces all metadata operations to be done during query planning
        – HBase addresses each of these
      • Goal: cut metadata access time for query with thousands of partitions to 200 milliseconds
        – Not there yet, currently at 1-1.5 seconds
      • Challenges
        – HBase lacks transactions, addressing via Apache Omid (incubating)
      • Alpha in Hive 2.0
  • 13. Improvements to Hive on Spark
      • Dynamic partition pruning
      • Make use of Spark persistence for self-join, self-union, and CTEs
      • Vectorized map-join and other map-join improvements
      • Parallel order by
      • Pre-warming of containers
      • Support for Spark 1.5
      • Many bug fixes
  • 14. Cost Based Optimizer (CBO) Improvements
      • Hive’s CBO uses Calcite
        – Not all optimization rules migrated yet, but 2.0 continues work towards that
      • CBO on by default in 2.0 (wasn’t in 1.x)
      • Main focus of CBO work has been BI queries (using TPC-DS as guide)
        – Some work on machine generated queries, since tools generate some funky queries
      • Focus on improving stats collection and estimating stats more accurately between operators in the plan
  • 15. And Many, Many More
      • SQL Standard Auth is the default authorization (actually works)
      • CLI mode for beeline (WIP to replace and deprecate CLI in Hive 2.*)
      • Codahale-based metrics (also in 1.3)
      • HS2 Web UI
      • Stability Improvements and bugfixes for ACID (almost production ready now)
      • Native vectorized mapjoin, vectorized reducesink, improved vectorized GBY, etc.
      • Improvements to Parquet performance (PPD, memory manager, etc.)
      • ORC schema evolution (beta)
      • Improvement to windowing functions, refactoring ORC before split, SIMD optimizations, new LIMIT syntax, parallel compilation in HS2, improvements to Tez session management, many more
  • 16. Hive 2.0 Incompatibilities
      • Java 7 & 8 supported, 6 no longer supported
      • Requires Hadoop 2.x, Hadoop 1.x no longer supported
      • MapReduce deprecated, Tez or Spark recommended instead
        – At some future date MR will be removed
      • Some configuration defaults changed, e.g.
        – bucketing enforced by default
        – metadata schema no longer created if it is missing
        – SQL Standard authorization used by default
      • We plan to remove Hive CLI in the future and replace with beeline CLI
        – Why?
          • Makes it easier for users to deploy secure clusters where all access is via [OJ]DBC
          • It is cleaner to maintain one code path
        – Does not require HiveServer2, can run HS2 embedded in beeline
  • 17. Thank You

Editor's Notes

  1. 10 compute nodes, with 512GB RAM per node, running HDP 2.3
      – Scale 10K (10TB), interactive queries
      – Single query runs: via Hive CLI
      – Concurrency runs: via HiveServer2 and jmeter
      – Hive1: Hive 1.2 + Tez 0.7
      – Pre-warm and container reuse enabled
      – LLAP: close to the 2.0 Hive branch, Tez close to the current master branch
      – Caching enabled (as of November 2015)
  2. Notes on the Hive on Spark improvements:
      1. DPP: Implemented in two sequential jobs. The first one processes the pruning part, saving the dynamic values on HDFS. The second job uses these values to filter out unwanted partitions. Not fully tested yet.
      2. Spark RDD persistence is used to store the temporary results from repeated subqueries to avoid re-computation. This is similar to a materialized view and happens automatically. This is especially useful for cases of self-join, self-union, and CTE.
      3. Vectorized map-join, optimized hashtable for mapjoin. These are very similar to Tez.
      4. Use parallel order by provided by Spark to do global sorting without limiting to one reducer. Internally, however, Spark does the sampling.
      5. Wait for a few seconds after SparkContext is created before submitting the job, to make sure that enough executors are launched. SparkContext allows a job to be submitted right away, even if the executors are still starting up. Parallelism at the reducer is partially determined by the number of executors available at the time the job is submitted. This is useful for short-lived sessions, such as those launched by Oozie.
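
The two-job dynamic partition pruning (DPP) flow described in the notes above can be illustrated with a small, self-contained Python sketch. This is not Hive or Spark code: the table contents and the names `pruning_job` and `scan_job` are hypothetical, and an in-memory set stands in for the values the first job would save on HDFS. It only shows the idea that the first job computes the surviving partition keys and the second job skips every other partition.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical fact table, partitioned by a date key.
SALES_PARTITIONS: Dict[int, List[Tuple[str, float]]] = {
    20160101: [("widget", 10.0), ("gadget", 5.0)],
    20160102: [("widget", 7.5)],
    20160103: [("gadget", 2.5), ("widget", 1.0)],
}

# Hypothetical small dimension table that the query filters on.
DATE_DIM: List[Dict] = [
    {"date_key": 20160101, "is_holiday": True},
    {"date_key": 20160102, "is_holiday": False},
    {"date_key": 20160103, "is_holiday": True},
]


def pruning_job(dim_rows: List[Dict], predicate: Callable[[Dict], bool]) -> Set[int]:
    """Job 1: evaluate the filter on the dimension side and collect the
    partition keys that can still match (stand-in for writing them to HDFS)."""
    return {row["date_key"] for row in dim_rows if predicate(row)}


def scan_job(partitions: Dict[int, List[Tuple[str, float]]], keep_keys: Set[int]):
    """Job 2: read only the partitions whose key survived job 1."""
    scanned: List[int] = []
    rows: List[Tuple[int, str, float]] = []
    for key in sorted(partitions):
        if key not in keep_keys:
            continue  # pruned: this partition is never read
        scanned.append(key)
        rows.extend((key, item, amount) for item, amount in partitions[key])
    return scanned, rows


# Run the two jobs in sequence, as in the note.
keys = pruning_job(DATE_DIM, lambda row: row["is_holiday"])
scanned, rows = scan_job(SALES_PARTITIONS, keys)
total = sum(amount for _, _, amount in rows)
print(f"scanned partitions: {scanned}, total sales: {total}")
```

In this toy run only two of the three partitions are read; the benefit in a real warehouse comes from the same effect applied to thousands of partitions.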