© Hortonworks Inc. 2013
Stinger Initiative: Deep Dive
Interactive Query on Hadoop
Chris Harris
E-Mail : charris@hortonworks.com
Twitter : cj_harris5
Agenda
• Key Hive Use Cases
• Brief Refresher on Hive
• The Stinger Initiative: Interactive Query for Hive
Key Hive Use Cases
• RDBMS / MPP Offload
–More data under query.
–Database unable to keep up with SLAs.
• Analysis of semi-structured data.
• ETL / Data Refinement
• Increasingly: Business Intelligence and interactive query
BI Use Cases
• Enterprise Reports
• Dashboard / Scorecard
• Parameterized Reports
• Visualization
• Data Mining
Organize Tiers and Process with Metadata
Raw Tier: Extract & Load
• WebHDFS
• Flume
• Sqoop
Work Tier: Standardize, Cleanse, Transform
• MapReduce
• Pig
Gold Tier: Transform, Integrate, Storage
• MapReduce
• Pig
Access Tier: Conform, Summarize, Access
• HiveQL
• Pig
HCat: Provides unified metadata access to Pig, Hive & MapReduce
• Organize data based on source/derived relationships
• Allows for fault and rebuild process
Hive Current Focus Area
Workloads by response time (diagram; axis label: Data Size)
• Real-Time (0-5s): Online systems, R-T analytics, CEP
• Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration
• Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards
• Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining
Highlighted region: Current Hive Sweet Spot
Stinger: Extending Hive's Sweet Spot
Workloads by response time (diagram; axis label: Data Size)
• Real-Time (0-5s): Online systems, R-T analytics, CEP
• Interactive (5s – 1m): Parameterized Reports, Drilldown, Visualization, Exploration
• Non-Interactive (1m – 1h): Data preparation, Incremental batch processing, Dashboards / Scorecards
• Batch (1h+): Operational batch processing, Enterprise Reports, Data Mining
Highlighted regions: Current Hive Sweet Spot, Future Hive Expansion
Improve Latency & Throughput
• Query engine improvements
• New “Optimized RCFile” column store
• Next-gen runtime (eliminates M/R latency)
Extend Deep Analytical Ability
• Analytics functions
• Improved SQL coverage
• Continued focus on core Hive use cases
The top BI vendors support Hive today
Agenda
• Key Hive Use Cases
• Brief Refresher on Hive
• The Stinger Initiative: Interactive Query for Hive
Brief Refresher on Hive
The State of Hive Today (0.10)
Hive's Origins
Hive was originally developed at Facebook.
More data than existing RDBMS could handle.
60,000+ Hive queries per day.
More than 1,000 users per day.
100+ PB of data.
15+ TB of data loaded daily.
Hive is a proven solution at extreme scale.
Hive 0.10 Capabilities
• De-facto SQL Interface for Hadoop
• Multiple persistence options:
–Flat text for simple data imports.
–Columnar format (RCFile) for high performance processing.
• Secure and concurrent remote access
• ODBC/JDBC connectivity
• Highly extensible:
–Supports User Defined Functions and User Defined Aggregation
Functions.
–Ships with more than 150 UDF/UDAF.
–Extensible readers/writers can process any persisted data.
• Support from 10+ BI vendors
HDP 1.2: ODBC Access for Popular BI Tools
• Seamless integration with BI
tools such as Excel, PowerPivot,
MicroStrategy, and Tableau
• Efficiently maps advanced SQL
functionality into HiveQL
– With configurable pass-through of
HiveQL for Hive-aware apps
• ODBC 3.52 standard compliant
• Supports Linux & Windows
High quality ODBC driver developed in partnership with Simba.
Free to download & use with Hortonworks Data Platform.
Diagram: Applications & Spreadsheets, Visualization & Intelligence → ODBC → Hortonworks Data Platform
0 to Big Data in 15 Minutes
• Hands on tutorials integrated into Sandbox
• HDP environment for evaluation
Agenda
• Key Hive Use Cases
• Brief Refresher on Hive
• The Stinger Initiative: Interactive Query for Hive
The Stinger Initiative
Interactive Query on Hadoop
Stinger Initiative: 2-Pronged Approach
Improve Latency and Throughput
Tez
• New primitives move beyond map-reduce and beyond batch
• Avoid unnecessary persistence of temporary data
• Hive, Pig and others generate Tez plans for high performance
Query Engine Improvements
• Cost-based optimizer
• In-memory joins
• Caching hot tables
• Vector processing
State-of-the-art Column Store
• “Optimized RCFile” or ORCFile
• Minimizes disk IO and deserialization
Tez Service
• Always-on service for query interactivity
Extend Deep Analytical Ability
Analytics Functions
• SQL:2003 Compliant
• OVER with PARTITION BY and ORDER BY
• Wide variety of windowing functions:
–RANK
–LEAD/LAG
–ROW_NUMBER
–FIRST_VALUE
–LAST_VALUE
–Many more
• Aligns well with BI ecosystem
Improved SQL Coverage
• Non-correlated Subqueries using IN in WHERE
• Expanded SQL types including DATETIME, VARCHAR, etc.
Making Hive Best for Interactive Query
Hive: Performance Improvements
Stinger Initiative At A Glance
Base Optimizations: Intelligent Optimizer
• Introduction of In-Memory Hash Join:
–For joins where one side fits in memory:
–New in-memory-hash-join algorithm.
–Hive reads the small table into a hash table.
–Scans through the big file to produce the output.
• Introduction of Sort-Merge-Bucket Join:
–Applies when tables are bucketed on the same key.
–Dramatic speed improvements seen in benchmarks.
• Other Improvements:
–Lower the footprint of the fact tables in memory.
–Enable the optimizer to automatically pick map joins.
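The in-memory hash join and the automatic choice of a map join can be sketched in Python. This is an illustration of the idea only, not Hive's implementation; the function names, the sample rows, and the 25 MB threshold are assumptions made for the example.

```python
from collections import defaultdict

def hash_join(small_rows, big_rows, small_key, big_key):
    """Join two lists of dicts by building a hash table on the small side."""
    # Build phase: load the small table into an in-memory hash table.
    table = defaultdict(list)
    for row in small_rows:
        table[row[small_key]].append(row)
    # Probe phase: scan the big table once and emit matching pairs.
    for big in big_rows:
        for small in table.get(big[big_key], []):
            yield {**small, **big}

def pick_join(small_size_bytes, threshold_bytes=25_000_000):
    """Hypothetical rule: use a map (hash) join when one side fits in memory."""
    return "map_join" if small_size_bytes <= threshold_bytes else "shuffle_join"

# Hypothetical dimension and fact rows.
dim = [{"state_id": 1, "state": "CA"}, {"state_id": 2, "state": "NY"}]
fact = [{"state_id": 1, "price": 10}, {"state_id": 2, "price": 5},
        {"state_id": 1, "price": 7}, {"state_id": 3, "price": 1}]
joined = list(hash_join(dim, fact, "state_id", "state_id"))
```

The big table is read exactly once and never held in memory, which is why the approach suits one large table joined to a small one.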
Dimensionally Structured Data
• Extremely common pattern in EDW.
• Results in large “fact tables” and small “dimension
tables”.
• Dimension tables often small enough to fit in RAM.
• Sometimes called Star Schema.
A Query on Dimensional Data
• Derived from TPC-DS Query 27
• Dramatic speedup on Hive 0.11
SELECT col5, avg(col6)
FROM fact_table
join dim1 on (fact_table.col1 = dim1.col1)
join dim2 on (fact_table.col2 = dim2.col1)
join dim3 on (fact_table.col3 = dim3.col1)
join dim4 on (fact_table.col4 = dim4.col1)
GROUP BY col5
ORDER BY col5
LIMIT 100;
Star Schema Join Improvements in 0.11
Hive: Bucketing
• Bucketing causes Hive to physically co-locate rows
within files.
• Buckets can be sorted or unsorted.
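CREATE EXTERNAL TABLE IF NOT EXISTS test_table
(
  id INT,
  name STRING,
  -- Bucketing columns must be declared as table columns.
  country STRING,
  continent STRING
)
PARTITIONED BY (dt STRING, hour STRING)
-- Replace n with a concrete bucket count before running.
CLUSTERED BY (country, continent) SORTED BY (country, continent) INTO n BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/home/test_dir';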
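How bucketing enables a sort-merge-bucket join can be sketched in Python, assuming both tables use the same bucket count and key. The CRC32 hash, the helper names, and the sample rows are stand-ins chosen for the example, not Hive's actual hash function or storage layout.

```python
import zlib

def bucket_of(key, num_buckets):
    """Assign a key to a bucket with a deterministic hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def write_buckets(rows, key, num_buckets):
    """Split rows into buckets by key and sort each bucket on that key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key])
    return buckets

def merge_join(left, right, key):
    """Merge two lists already sorted on key, emitting matching pairs."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out

def smb_join(left_rows, right_rows, key, num_buckets):
    """Join bucket i of one table only with bucket i of the other."""
    left_buckets = write_buckets(left_rows, key, num_buckets)
    right_buckets = write_buckets(right_rows, key, num_buckets)
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join(lb, rb, key))
    return result

# Hypothetical tables bucketed on the same key.
orders = [{"country": "FR", "order": 1}, {"country": "US", "order": 2},
          {"country": "FR", "order": 3}]
regions = [{"country": "US", "continent": "NA"},
           {"country": "FR", "continent": "EU"}]
result = smb_join(orders, regions, "country", 4)
```

Because equal keys always land in the same bucket number, each pair of buckets can be merged independently and no hash table has to be built.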
ORCFile - Optimized Column Storage
• Make a better columnar storage file
–Tightly aligned to Hive data model
• Decompose complex row types into primitive fields
–Better compression and projection
• Only read bytes from HDFS for the required columns.
• Store column level aggregates in the files
–Only need to read the file meta information for common queries
–Stored both for file and each section of a file
–Aggregates: min, max, sum, average, count
–Allows fast access by sorted columns
• Ability to add bloom filters for columns
–Enables quick checks for whether a value is present
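The use of column-level aggregates can be sketched in Python. This is a simplified model rather than the ORC file layout: each file section (called a stripe here) keeps min, max, sum, and count for one column, and the class and function names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class StripeStats:
    """Per-stripe aggregates for one column, as kept in file metadata."""
    minimum: int
    maximum: int
    total: int
    count: int

def build_stats(values):
    """Compute the aggregates that would be stored when a stripe is written."""
    return StripeStats(min(values), max(values), sum(values), len(values))

def stripes_to_read(stats, lo, hi):
    """Return indexes of stripes whose [min, max] range overlaps [lo, hi]."""
    return [i for i, s in enumerate(stats) if s.maximum >= lo and s.minimum <= hi]

def count_and_sum(stats):
    """Answer COUNT and SUM from metadata alone, without reading row data."""
    return sum(s.count for s in stats), sum(s.total for s in stats)

# Hypothetical column values, already split into three stripes.
stripes = [[1, 4, 9], [10, 12, 19], [20, 25, 31]]
stats = [build_stats(s) for s in stripes]
needed = stripes_to_read(stats, 11, 15)   # a filter like WHERE col BETWEEN 11 AND 15
```

With sorted data the ranges barely overlap, so a range filter touches few stripes; that is the fast access by sorted columns mentioned above.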
Performance Futures - Vectorization
• Operates on blocks of 1K or more records, rather than
one record at a time
• Each block contains an array of Java scalars, one for
each column
• Avoids many function calls, virtual dispatch, CPU pipeline
stalls
• Size to fit in L1 cache, avoid cache misses
• Generate code for operators on the fly to avoid branches
in code, maximize deep pipelines of modern processors
• Up to 30x faster processing of records
• Beta possible in 2H 2013
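The batch-oriented data layout can be sketched in Python. The sketch shows only the structure (one array per column, a filter that yields a selection vector); it does not reproduce the speedup, which depends on a compiled runtime. The batch size, function names, and data are assumptions made for the example.

```python
from array import array

BATCH_SIZE = 1024  # the slide mentions blocks of 1K or more records

def make_batches(prices, quantities, batch_size=BATCH_SIZE):
    """Split two columns into batches, each holding one typed array per column."""
    for start in range(0, len(prices), batch_size):
        yield (array("d", prices[start:start + batch_size]),
               array("q", quantities[start:start + batch_size]))

def filter_batch(quantities, threshold):
    """Tight loop over a single column; returns a selection vector of row indexes."""
    return [i for i, q in enumerate(quantities) if q > threshold]

def sum_revenue(batches, threshold):
    """Compute SUM(price * quantity) WHERE quantity > threshold, batch by batch."""
    total = 0.0
    for prices, quantities in batches:
        selected = filter_batch(quantities, threshold)
        total += sum(prices[i] * quantities[i] for i in selected)
    return total

# Hypothetical columns of 5,000 rows.
prices = [float(i % 10) for i in range(5000)]
quantities = [i % 4 for i in range(5000)]
vectorized = sum_revenue(make_batches(prices, quantities), threshold=1)
row_at_a_time = sum(p * q for p, q in zip(prices, quantities) if q > 1)
```

Both paths compute the same answer; the batch path evaluates each operator over a whole column slice instead of dispatching per row.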
Performance Futures – Cost-Based Optimizer
• Generate more intelligent DAGs based on properties of
data being queried, e.g. table size, statistics, histograms,
etc.
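A toy version of cost-based planning can be sketched in Python. It assumes the only statistic available is the fraction of fact rows surviving each dimension join, and it scores a plan by the rows flowing between joins; the cost model, names, and numbers are hypothetical and far simpler than a real optimizer.

```python
from itertools import permutations

def plan_cost(fact_rows, selectivities, order):
    """Toy cost: total rows flowing out of each join in a left-deep plan."""
    rows, cost = fact_rows, 0.0
    for dim in order:
        rows *= selectivities[dim]   # rows surviving the join with this dimension
        cost += rows
    return cost

def best_order(fact_rows, selectivities):
    """Enumerate all join orders and keep the cheapest under the toy cost model."""
    return min(permutations(selectivities),
               key=lambda order: plan_cost(fact_rows, selectivities, order))

# Hypothetical statistics: fraction of fact rows that survive each join.
selectivities = {"dim1": 0.5, "dim2": 0.01, "dim3": 0.2}
order = best_order(1_000_000, selectivities)
```

Under this model the most selective join runs first, so later joins process far fewer rows.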
Performance Futures - Buffering
• Query workloads always have hotspots:
–Metadata
–Small dimension tables
• Build into YARN or Tez Service ways of buffering
frequently used data into memory so it is not always read
from disk.
• Part of the “last mile” of latency efforts.
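One way to picture buffering of hot data is a small least-recently-used cache, sketched here in Python. The slides do not say which eviction policy YARN or the Tez Service would use, so LRU, the class name, and the loader function are assumptions made for the example.

```python
from collections import OrderedDict

class HotTableCache:
    """Keep the most recently used small tables in memory (LRU eviction)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # function that reads a table from disk
        self.tables = OrderedDict()
        self.disk_reads = 0

    def get(self, name):
        if name in self.tables:
            self.tables.move_to_end(name)     # mark as most recently used
            return self.tables[name]
        self.disk_reads += 1
        data = self.loader(name)
        self.tables[name] = data
        if len(self.tables) > self.capacity:
            self.tables.popitem(last=False)   # evict the least recently used
        return data

def fake_loader(name):
    """Stand-in for reading a dimension table from HDFS."""
    return [f"{name}-row-{i}" for i in range(3)]

cache = HotTableCache(capacity=2, loader=fake_loader)
for table in ["dim_date", "dim_store", "dim_date", "dim_date", "dim_item", "dim_store"]:
    cache.get(table)
```

Six lookups cause four disk reads here; the repeated requests for the hot table are served from memory.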
YARN
Moving Hive and Hadoop beyond MapReduce
Hadoop 2.0 Innovations - YARN
• Focus on scale and innovation
– Support clusters of 10,000+ computers
– Extensible to encourage innovation
• Next generation execution
– Improves MapReduce performance
• Supports new frameworks beyond
MapReduce
– Low latency, Streaming, Services
– Do more with a single Hadoop cluster
Stack diagram:
• HDFS: Redundant, Reliable Storage
• YARN: Cluster Resource Management
• Frameworks on YARN: MapReduce, Tez, Graph Processing, Other
Tez
Moving Hive and Hadoop beyond MapReduce
Tez
• Low level data-processing execution engine
• Use it for the base of
MapReduce, Hive, Pig, Cascading etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end
of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
–Much lighter disk and network usage
• Built on YARN
Tez - Core Idea
Task with pluggable Input, Processor & Output
YARN ApplicationMaster to run DAG of Tez Tasks
Tez Task - <Input, Processor, Output>
Diagram: Input → Processor → Output (one Task)
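The idea of a task assembled from a pluggable input, processor, and output, and of running a DAG of such tasks, can be sketched in Python. This is not the Tez API: the Task class, run_dag, and the dictionary used as a data store are all hypothetical and exist only to show the shape of the design.

```python
class Task:
    """A task wired from three pluggable pieces: input, processor, output."""

    def __init__(self, name, read_input, process, write_output):
        self.name = name
        self.read_input = read_input
        self.process = process
        self.write_output = write_output

    def run(self, store):
        self.write_output(store, self.process(self.read_input(store)))

def run_dag(tasks, edges, store):
    """Run tasks in dependency order; edges maps a task name to its upstream names."""
    done, order = set(), []
    pending = {t.name: t for t in tasks}
    while pending:
        ready = [n for n in pending if all(u in done for u in edges.get(n, []))]
        if not ready:
            raise ValueError("cycle in DAG")
        for name in ready:
            pending.pop(name).run(store)
            done.add(name)
            order.append(name)
    return order

def reader(key):
    """Input that reads one entry of the shared store."""
    return lambda store: store[key]

def writer(key):
    """Output that writes one entry of the shared store."""
    def write(store, data):
        store[key] = data
    return write

def count_words(lines):
    """Map-style processor: emit sorted (word, 1) pairs."""
    return sorted((word, 1) for line in lines for word in line.split())

def sum_counts(pairs):
    """Reduce-style processor: total the counts per word."""
    totals = {}
    for word, n in pairs:
        totals[word] = totals.get(word, 0) + n
    return totals

store = {"lines": ["a b a", "b c"]}
tasks = [
    Task("reduce", reader("shuffle"), sum_counts, writer("result")),
    Task("map", reader("lines"), count_words, writer("shuffle")),
]
order = run_dag(tasks, {"reduce": ["map"]}, store)
```

Swapping the reader or writer changes where data comes from or goes without touching the processor, which is the point of the three-part task.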
Tez – Blocks for building tasks
• MapReduce 'Map' Task: HDFS Input → Map Processor → Sorted Output
• MapReduce 'Reduce' Task: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
Tez – More tasks
• Special Pig/Hive 'Map' (Tez Task): HDFS Input → Map Processor → Pipeline Sorter Output
• In-memory Map (Tez Task): HDFS Input → Map Processor → In-memory Sorted Output
• Special Pig/Hive 'Reduce' (Tez Task): Shuffle Skip-merge Input → Reduce Processor → Sorted Output
Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Pig/Hive - MR: Job 1, Job 2, Job 3, with an I/O synchronization barrier between consecutive jobs
Pig/Hive - Tez: Single Job
FastQuery: Beyond Batch with YARN
Tez Generalizes Map-Reduce
• Simplified execution plans process data more efficiently
Always-On Tez Service
• Low latency processing for all Hadoop data processing
Tez Service
• MR Query Startup Expensive
–Job launch & task-launch latencies are fatal for short queries (in
order of 5s to 30s)
• Solution
–Tez Service
– Removes task-launch overhead
– Removes job-launch overhead
–Hive/Pig
– Submit query-plan to Tez Service
–Native Hadoop service, not ad-hoc
Tez Service Delivers Low Latency
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Stage                    Existing Hive   Hive/Tez   Tez and Tez Service
Parse Query              0.5s            0.5s       0.5s
Create Plan              0.5s            0.5s       0.5s
Launch Map-Reduce        20s             20s        -
Submit to Tez Service    -               -          0.5s
Process Map-Reduce       10s             2s         2s
Total                    31s             23s        3.5s
* Numbers for illustration only
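The totals in the comparison above can be checked with a few lines of Python. The stage latencies are the slide's illustrative numbers, not measurements; the dictionary keys are shorthand chosen for the example.

```python
# Illustrative per-stage latencies in seconds, taken from the slide's comparison.
stages = {
    "Existing Hive": {"parse": 0.5, "plan": 0.5, "launch": 20.0, "process": 10.0},
    "Hive/Tez": {"parse": 0.5, "plan": 0.5, "launch": 20.0, "process": 2.0},
    "Tez and Tez Service": {"parse": 0.5, "plan": 0.5, "submit": 0.5, "process": 2.0},
}

# Total latency per configuration.
totals = {name: sum(parts.values()) for name, parts in stages.items()}

# End-to-end ratio between the slowest and fastest configuration.
speedup = totals["Existing Hive"] / totals["Tez and Tez Service"]

# Share of the existing Hive total spent on job launch alone.
launch_share = stages["Existing Hive"]["launch"] / totals["Existing Hive"]
```

In these numbers job launch accounts for roughly two thirds of the existing Hive total, so faster processing alone (the Hive/Tez column) leaves most of the latency in place.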
Recap and Questions: Hive Performance
Improving Hive's SQL Support
Stinger: Deep Analytical Capabilities
• SQL:2003 Window Functions
–OVER clauses
– Multiple PARTITION BY and ORDER BY supported
– Windowing supported (ROWS PRECEDING/FOLLOWING)
– Large variety of aggregates
– RANK
– FIRST_VALUE
– LAST_VALUE
– LEAD / LAG
– Distributions
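What RANK, ROW_NUMBER, and LAG compute over a PARTITION BY and ORDER BY window can be sketched in Python. The function and the sample rows are hypothetical; the sketch assumes ascending order and a single ordering column.

```python
from itertools import groupby
from operator import itemgetter

def window(rows, partition_by, order_by):
    """Add row_number, rank, and lag(order_by) columns per partition."""
    out = []
    rows = sorted(rows, key=itemgetter(partition_by, order_by))
    for _, group in groupby(rows, key=itemgetter(partition_by)):
        prev, rank = None, 0
        for row_number, row in enumerate(group, start=1):
            value = row[order_by]
            if value != prev:
                rank = row_number      # ties share the rank of their first row
            out.append({**row, "row_number": row_number, "rank": rank, "lag": prev})
            prev = value
    return out

# Hypothetical rows; the window is PARTITION BY state ORDER BY price.
sales = [{"state": "CA", "price": 10}, {"state": "NY", "price": 5},
         {"state": "CA", "price": 30}, {"state": "CA", "price": 10}]
result = window(sales, "state", "price")
```

Tied rows get the same rank and the next distinct value skips ahead, while row_number stays unique within each partition.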
Hive Data Type Conformance
• Data Types:
–Add fixed point NUMERIC and DECIMAL type (in progress)
–Add VARCHAR and CHAR types with limited field size
–Add DATETIME
–Add size ranges from 1 to 53 for FLOAT
–Add synonyms for compatibility
– BLOB for BINARY
– TEXT for STRING
– REAL for FLOAT
• SQL Semantics:
–Sub-queries in IN, NOT IN, HAVING.
–EXISTS and NOT EXISTS
Questions?
Thank You!
Questions & Answers
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Recently uploaded (20)

Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Stinger Initiative - Deep Dive

  • 1. © Hortonworks Inc. 2013 Stinger Initiative: Deep Dive Interactive Query on Hadoop Page 1 Chris Harris E-Mail : charris@hortonworks.com Twitter : cj_harris5
  • 2. Agenda • Key Hive Use Cases • Brief Refresher on Hive • The Stinger Initiative: Interactive Query for Hive
  • 3. Key Hive Use Cases • RDBMS / MPP Offload – More data under query. – Database unable to keep up with SLAs. • Analysis of semi-structured data. • ETL / Data Refinement • +++ Increasingly: Business Intelligence and interactive query
  • 4. BI Use Cases • Enterprise Reports • Dashboard / Scorecard • Parameterized Reports • Visualization • Data Mining
  • 5. Organize Tiers and Process with Metadata • Raw Tier: Extract & Load (WebHDFS, Flume, Sqoop) • Work Tier: Standardize, Cleanse, Transform (MapReduce, Pig) • Gold Tier: Transform, Integrate, Storage (MapReduce, Pig) • Access Tier: Conform, Summarize, Access (HiveQL, Pig) • HCat: provides unified metadata access to Pig, Hive & MapReduce • Organize data based on source/derived relationships • Allows for fault and rebuild process
  • 6. Hive Current Focus Area (chart of workload classes by latency and data size, marking the current Hive sweet spot) • Real-Time (0-5s): online systems, R-T analytics, CEP • Interactive (5s – 1m): parameterized reports, drilldown, visualization, exploration • Non-Interactive (1m – 1h): data preparation, incremental batch processing, dashboards / scorecards • Batch (1h+): operational batch processing, enterprise reports, data mining
  • 7. Stinger: Extending Hive's Sweet Spot (same chart as slide 6, marking both the current Hive sweet spot and the future Hive expansion) • Improve Latency & Throughput – Query engine improvements – New "Optimized RCFile" column store – Next-gen runtime (eliminates M/R latency) • Extend Deep Analytical Ability – Analytics functions – Improved SQL coverage – Continued focus on core Hive use cases
  • 8. The top BI vendors support Hive today
  • 9. Agenda • Key Hive Use Cases • Brief Refresher on Hive • The Stinger Initiative: Interactive Query for Hive
  • 10. Brief Refresher on Hive: The State of Hive Today (0.10)
  • 11. Hive's Origins • Hive was originally developed at Facebook. • More data than existing RDBMS could handle. • 60,000+ Hive queries per day. • More than 1,000 users per day. • 100+ PB of data. • 15+ TB of data loaded daily. • Hive is a proven solution at extreme scale.
  • 12. Hive 0.10 Capabilities • De-facto SQL Interface for Hadoop • Multiple persistence options: – Flat text for simple data imports. – Columnar format (RCFile) for high performance processing. • Secure and concurrent remote access • ODBC/JDBC connectivity • Highly extensible: – Supports User Defined Functions and User Defined Aggregation Functions. – Ships with more than 150 UDF/UDAF. – Extensible readers/writers can process any persisted data. • Support from 10+ BI vendors
  • 13. HDP 1.2: ODBC Access for Popular BI Tools • Seamless integration with BI tools such as Excel, PowerPivot, MicroStrategy, and Tableau • Efficiently maps advanced SQL functionality into HiveQL – With configurable pass-through of HiveQL for Hive-aware apps • ODBC 3.52 standard compliant • Supports Linux & Windows • High quality ODBC driver developed in partnership with Simba. Free to download & use with Hortonworks Data Platform. (Diagram: Applications & Spreadsheets and Visualization & Intelligence tools connect through ODBC to the Hortonworks Data Platform.)
  • 14. 0 to Big Data in 15 Minutes: Hands-on tutorials integrated into the Sandbox HDP environment for evaluation
  • 15. Agenda • Brief Refresher on Hive • Key Hive Use Cases • The Stinger Initiative: Interactive Query for Hive
  • 16. The Stinger Initiative: Interactive Query on Hadoop
  • 17. Stinger Initiative: 2-Pronged Approach (Making Hive Best for Interactive Query) • Improve Latency and Throughput – Tez: new primitives move beyond map-reduce and beyond batch; avoid unnecessary persistence of temporary data; Hive, Pig and others generate Tez plans for high performance – Query Engine Improvements: cost-based optimizer; in-memory joins; caching hot tables; vector processing – State-of-the-art Column Store: "Optimized RCFile" or ORCFile; minimizes disk IO and deserialization – Tez Service: always-on service for query interactivity • Extend Deep Analytical Ability – Analytics Functions: SQL:2003 compliant; OVER with PARTITION BY and ORDER BY; wide variety of windowing functions (RANK, LEAD/LAG, ROW_NUMBER, FIRST_VALUE, LAST_VALUE, many more); aligns well with BI ecosystem – Improved SQL Coverage: non-correlated subqueries using IN in WHERE; expanded SQL types including DATETIME, VARCHAR, etc.
  • 18. Hive: Performance Improvements
  • 19. Stinger Initiative At A Glance
  • 20. Base Optimizations: Intelligent Optimizer • Introduction of In-Memory Hash Join: – For joins where one side fits in memory. – New in-memory-hash-join algorithm. – Hive reads the small table into a hash table. – Scans through the big file to produce the output. • Introduction of Sort-Merge-Bucket Join: – Applies when tables are bucketed on the same key. – Dramatic speed improvements seen in benchmarks. • Other Improvements: – Lower the footprint of the fact tables in memory. – Enable the optimizer to automatically pick map joins.
  • 21. Dimensionally Structured Data • Extremely common pattern in EDW. • Results in large "fact tables" and small "dimension tables". • Dimension tables often small enough to fit in RAM. • Sometimes called Star Schema.
  • 22. A Query on Dimensional Data • Derived from TPC-DS Query 27 • Dramatic speedup on Hive 0.11 • Query: SELECT col5, avg(col6) FROM fact_table JOIN dim1 ON (fact_table.col1 = dim1.col1) JOIN dim2 ON (fact_table.col2 = dim2.col1) JOIN dim3 ON (fact_table.col3 = dim3.col1) JOIN dim4 ON (fact_table.col4 = dim4.col1) GROUP BY col5 ORDER BY col5 LIMIT 100;
  • 23. Star Schema Join Improvements in 0.11
  • 24. Hive: Bucketing • Bucketing causes Hive to physically co-locate rows within files. • Buckets can be sorted or unsorted. • Example (n stands for the bucket count; the bucketing columns must be declared as table columns): CREATE EXTERNAL TABLE IF NOT EXISTS test_table (id INT, name STRING, country STRING, continent STRING) PARTITIONED BY (dt STRING, hour STRING) CLUSTERED BY (country, continent) SORTED BY (country, continent) INTO n BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/home/test_dir';
  • 25. ORCFile - Optimized Column Storage • Make a better columnar storage file – Tightly aligned to Hive data model • Decompose complex row types into primitive fields – Better compression and projection • Only read bytes from HDFS for the required columns. • Store column level aggregates in the files – Only need to read the file meta information for common queries – Stored both for file and each section of a file – Aggregates: min, max, sum, average, count – Allows fast access by sorted columns • Ability to add bloom filters for columns – Enables quick checks for whether a value is present
  • 26. Performance Futures - Vectorization • Operates on blocks of 1K or more records, rather than one record at a time • Each block contains an array of Java scalars, one for each column • Avoids many function calls, virtual dispatch, CPU pipeline stalls • Blocks are sized to fit in L1 cache, avoiding cache misses • Generate code for operators on the fly to avoid branches in code and make the most of the deep pipelines of modern processors • Up to 30x faster processing of records • Beta possible in 2H 2013
  • 27. Performance Futures – Cost-Based Optimizer • Generate more intelligent DAGs based on properties of data being queried, e.g. table size, statistics, histograms, etc.
  • 28. Performance Futures - Buffering • Query workloads always have hotspots: – Metadata – Small dimension tables • Build into YARN or Tez Service ways of buffering frequently used data into memory so it is not always read from disk. • Part of the "last mile" of latency efforts.
  • 29. YARN: Moving Hive and Hadoop beyond MapReduce
  • 30. Hadoop 2.0 Innovations - YARN • Focus on scale and innovation – Support 10,000+ computer clusters – Extensible to encourage innovation • Next generation execution – Improves MapReduce performance • Supports new frameworks beyond MapReduce – Low latency, Streaming, Services – Do more with a single Hadoop cluster (Diagram: HDFS for redundant, reliable storage; YARN for cluster resource management; MapReduce, Tez, graph processing and other frameworks on top.)
  • 31. Tez: Moving Hive and Hadoop beyond MapReduce
  • 32. Tez • Low level data-processing execution engine • Use it for the base of MapReduce, Hive, Pig, Cascading etc. • Enables pipelining of jobs • Removes task and job launch times • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline • Does not write intermediate output to HDFS – Much lighter disk and network usage • Built on YARN
  • 33. Tez - Core Idea • Task with pluggable Input, Processor & Output • YARN ApplicationMaster to run DAG of Tez Tasks • Tez Task - <Input, Processor, Output>
  • 34. Tez – Blocks for building tasks • MapReduce 'Map' task: <HDFS Input, Map Processor, Sorted Output> • MapReduce 'Reduce' task: <Shuffle Input, Reduce Processor, HDFS Output> • Intermediate 'Reduce' for Map-Reduce-Reduce: <Shuffle Input, Reduce Processor, Sorted Output>
  • 35. Tez – More tasks • Special Pig/Hive 'Map' Tez task: <HDFS Input, Map Processor, Pipeline Sorter Output> • In-memory Map Tez task: <HDFS Input, Map Processor, In-memory Sorted Output> • Special Pig/Hive 'Reduce' Tez task: <Shuffle Skip-merge Input, Reduce Processor, Sorted Output>
  • 36. Pig/Hive-MR versus Pig/Hive-Tez • Query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state • Pig/Hive - MR: Job 1, Job 2 and Job 3, separated by I/O synchronization barriers • Pig/Hive - Tez: a single job
  • 37. FastQuery: Beyond Batch with YARN • Tez Generalizes Map-Reduce: simplified execution plans process data more efficiently • Always-On Tez Service: low latency processing for all Hadoop data processing
  • 38. Tez Service • MR Query Startup Expensive – Job-launch & task-launch latencies are fatal for short queries (on the order of 5s to 30s) • Solution – Tez Service: removes task-launch overhead; removes job-launch overhead – Hive/Pig: submit query-plan to Tez Service – Native Hadoop service, not ad-hoc
  • 39. Tez Service Delivers Low Latency • Query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state • Existing Hive: Parse Query 0.5s, Create Plan 0.5s, Launch Map-Reduce 20s, Process Map-Reduce 10s, Total 31s • Hive/Tez: Parse Query 0.5s, Create Plan 0.5s, Launch Map-Reduce 20s, Process Map-Reduce 2s, Total 23s • Tez and Tez Service: Parse Query 0.5s, Create Plan 0.5s, Submit to Tez Service 0.5s, Process Map-Reduce 2s, Total 3.5s • Numbers for illustration only
  • 40. Recap and Questions: Hive Performance
  • 41. Improving Hive's SQL Support
  • 42. Stinger: Deep Analytical Capabilities • SQL:2003 Window Functions – OVER clauses – Multiple PARTITION BY and ORDER BY supported – Windowing supported (ROWS PRECEDING/FOLLOWING) – Large variety of aggregates: RANK, FIRST_VALUE, LAST_VALUE, LEAD / LAG, Distributions
  • 43. Hive Data Type Conformance • Data Types: – Add fixed point NUMERIC and DECIMAL type (in progress) – Add VARCHAR and CHAR types with limited field size – Add DATETIME – Add size ranges from 1 to 53 for FLOAT – Add synonyms for compatibility: BLOB for BINARY, TEXT for STRING, REAL for FLOAT • SQL Semantics: – Sub-queries in IN, NOT IN, HAVING. – EXISTS and NOT EXISTS
  • 44. Questions?
  • 45. Thank You! Questions & Answers
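
The in-memory hash join described on slide 20 (read the small table into a hash table, then scan the big table once) can be sketched in a few lines of Python. This is only an illustration of the general technique, not Hive's implementation; the function name `hash_join` and the sample tables `dim_store` and `fact_sales` are hypothetical.

```python
from collections import defaultdict

def hash_join(small_rows, big_rows, small_key, big_key):
    """Inner-join two row iterables, holding only the small side in memory."""
    # Build phase: load the small (dimension) table into a hash table.
    table = defaultdict(list)
    for row in small_rows:
        table[row[small_key]].append(row)
    # Probe phase: stream through the big (fact) table exactly once.
    for row in big_rows:
        for match in table.get(row[big_key], ()):
            yield {**match, **row}

# Hypothetical sample data: a small dimension table and a larger fact table.
dim_store = [{"store_id": 1, "state": "CA"}, {"store_id": 2, "state": "NY"}]
fact_sales = [
    {"store_id": 1, "price": 10.0},
    {"store_id": 2, "price": 5.0},
    {"store_id": 1, "price": 7.5},
    {"store_id": 3, "price": 1.0},  # no dimension row, so the inner join drops it
]

joined = list(hash_join(dim_store, fact_sales, "store_id", "store_id"))
for row in joined:
    print(row)
```

Because the big table is only streamed, memory use depends on the size of the small table, which is why the slide restricts this join to cases where one side fits in memory.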

Editor's Notes

  1. Enterprise Reports: your cell phone bill is an example. Dashboard: KPI tracking. Parameterized Reports: what are the hot prospects in my region? Visualization: visual exploration of data. Data Mining: large scale data processing and extraction, usually fed to other tools.
  2. The OVER clause is similar to GROUP BY, except that GROUP BY produces a single row for each group, whereas OVER produces a result for each row in the group. You specify which partition you would like to use and how you would like to order it, and then you can give it a window.
  3. Sort Merge Bucket (SMB) join. If both tables are: - sorted the same - bucketed the same - and joined on the sort/bucket column, then each process: - reads a bucket from each table - processes the row with the lowest value.
  4. Community-developed frameworks: machine learning / analytics (MPI, GraphLab, Giraph, Hama, Spark, …); services inside Hadoop (memcache, HBase, Storm, …); low latency computing (CEP or stream processing).
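
The sort-merge-bucket join outlined in note 3 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Hive's code: both tables are assumed to be split into the same number of buckets by the same function of the join key, each bucket is assumed to be sorted on that key, and the names `merge_join`, `smb_join` and the sample buckets are hypothetical.

```python
def merge_join(left, right, key):
    """Inner-join two lists of dicts that are both sorted ascending on `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1  # advance the side currently holding the lowest value
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left row carrying the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

def smb_join(left_buckets, right_buckets, key):
    """Join bucket n of the left table only with bucket n of the right table."""
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join(lb, rb, key))
    return result

# Hypothetical data, bucketed by id % 2 and sorted by id within each bucket.
left_buckets = [
    [{"id": 2, "a": "x"}, {"id": 4, "a": "y"}],
    [{"id": 1, "a": "z"}, {"id": 3, "a": "w"}],
]
right_buckets = [
    [{"id": 2, "b": 10}, {"id": 2, "b": 11}, {"id": 6, "b": 12}],
    [{"id": 3, "b": 13}],
]

rows = smb_join(left_buckets, right_buckets, "id")
for row in rows:
    print(row)
```

Since matching keys can only live in buckets with the same index, each pair of buckets can be handled by a separate process, and no hash table or extra sort is needed.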