Copyright © 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker | marcel@cloudera.com 
April 2014
Analytic Workloads on Hadoop: Where Do We Stand?
“DeWitt Clause” prohibits
using DBMS vendor name
Hadoop for Analytic Workloads
•Hadoop has traditionally been used for offline batch processing:
ETL and ELT
•Next step: Hadoop for traditional business intelligence (BI)/data
warehouse (EDW) workloads:
•interactive
•concurrent users
•Topic of this talk: a Hadoop-based open-source stack for EDW
workloads:
•HDFS: a high-performance storage system
•Parquet: a state-of-the-art columnar storage format
•Impala: a modern, open-source SQL engine for Hadoop
Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in the Hadoop stack
•the Hadoop stack is an effective solution for certain EDW workloads
•a Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
HDFS: A Storage System for Analytic
Workloads
•Available in HDFS today:
•high-efficiency data scans at or near hardware speed, both
from disk and memory
•On the immediate roadmap:
•co-partitioned tables for even faster distributed joins
•temp-FS: write temp table data straight to memory,
bypassing disk

HDFS: The Details
•High efficiency data transfers
•short-circuit reads: bypass DataNode protocol when reading
from local disk

-> read at 100+MB/s per disk
•HDFS caching: access explicitly cached data without copying or
checksumming

-> access memory-resident data at memory bus speed

-> enable in-memory processing
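To put the per-disk figure in perspective, here is a rough back-of-the-envelope sketch of aggregate scan throughput. The cluster shape (12 disks per node, 20 nodes) and the 1 TB scan size are hypothetical values chosen for illustration, not figures from the talk, and the model assumes every disk is read in parallel at the full rate.

```python
def aggregate_scan_rate_mb_s(per_disk_mb_s, disks_per_node, nodes):
    """Idealized cluster-wide scan rate: every disk streams at the per-disk rate."""
    return per_disk_mb_s * disks_per_node * nodes


def scan_time_s(data_mb, rate_mb_s):
    """Time to scan data_mb megabytes at the given aggregate rate."""
    return data_mb / rate_mb_s


# Hypothetical cluster: 20 nodes with 12 disks each, 100 MB/s per disk.
rate = aggregate_scan_rate_mb_s(100, 12, 20)
seconds_for_1tb = scan_time_s(1_000_000, rate)
print(f"aggregate rate: {rate} MB/s, 1 TB scan: {seconds_for_1tb:.1f} s")
```

Real scans will be slower than this ideal whenever data is skewed across disks or the CPU cannot keep up with decompression and decoding.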
HDFS: The Details
•Coming attractions:
•affinity groups: collocate blocks from different files

-> create co-partitioned tables for improved join
performance
•temp-fs: write temp table data straight to memory,
bypassing disk

-> ideal for iterative interactive data analysis
Parquet: Columnar Storage for Hadoop
•What it is:
•state-of-the-art, open-source columnar file format that’s
available for (most) Hadoop processing frameworks:

Impala, Hive, Pig, MapReduce, Cascading, …
•offers both high compression and high scan efficiency
•co-developed by Twitter and Cloudera; hosted on GitHub and
soon to be an Apache Incubator project
•with contributors from Criteo, Stripe, Berkeley AMPLab,
LinkedIn
•used in production at Twitter and Criteo
Parquet: The Details
•columnar storage: column-major instead of the traditional
row-major layout; used by all high-end analytic DBMSs
•optimized storage of nested data structures: patterned
after Dremel’s ColumnIO format
•extensible set of column encodings:
•run-length and dictionary encodings in current version (1.2)
•delta and optimized string encodings in 2.0
•embedded statistics: version 2.0 stores inlined column
statistics for further optimization of scan efficiency
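The dictionary and run-length encodings mentioned above can be illustrated with a small conceptual sketch. This is not Parquet's actual on-disk layout (the real format is more elaborate); the function names and the sample column are made up for illustration. It only shows why a low-cardinality column with repeated values compresses well.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Invert both encodings to recover the original column."""
    return [dictionary[code] for code, count in runs for _ in range(count)]


column = ["US", "US", "US", "DE", "DE", "US", "FR", "FR", "FR", "FR"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary, runs)
```

Ten strings shrink to a three-entry dictionary plus four (code, length) pairs, and `decode(dictionary, runs)` returns the original column.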
Parquet: Storage Efficiency
Parquet: Scan Efficiency
Impala: A Modern, Open-Source SQL Engine
•implementation of an MPP SQL query engine for the Hadoop
environment
•highest-performance SQL engine for the Hadoop ecosystem;

already outperforms some of its commercial competitors
•effective for EDW-style workloads
•maintains Hadoop flexibility by utilizing standard Hadoop
components (HDFS, HBase, Metastore, YARN)
•plays well with traditional BI tools:
exposes/interacts with industry-standard interfaces (ODBC/JDBC,
Kerberos and LDAP, ANSI SQL)
Impala: A Modern, Open-Source SQL Engine
•history:
•developed by Cloudera and fully open-source; hosted on
GitHub
•released as beta in 10/2012
•1.0 version available in 05/2013
•current version is 1.2.3, available for CDH4 and CDH5 beta
Impala from The User’s Perspective
•create tables as virtual views over data stored in HDFS
or HBase;
schema metadata is stored in Metastore (shared with
Hive, Pig, etc.; basis of HCatalog)
•connect via ODBC/JDBC; authenticate via Kerberos or
LDAP
•run standard SQL:
•current version: ANSI SQL-92 (limited to SELECT and bulk
insert) minus correlated subqueries, has UDFs and UDAs
Impala from The User’s Perspective
•2014 roadmap:
•1.3: admission control 
•1.4: Decimal(<precision>, <scale>)
•2.0 or earlier: analytic window functions, Order By without
Limit, support for nested types (structs, arrays, maps),
UDTFs, disk-based joins and aggregation
Impala Architecture
•distributed service:
•daemon process (impalad) runs on every node with data
•easily deployed with Cloudera Manager
•each node can handle user requests; load balancer
configuration for multi-user environments recommended
•query execution phases:
•client request arrives via ODBC/JDBC
•planner turns request into collection of plan fragments
•coordinator initiates execution on remote impalad nodes
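The execution phases listed above can be pictured with a toy, single-process sketch: each "fragment" filters and pre-aggregates the rows local to one node, and a coordinator merges the partial results. Everything here (function names, the in-memory partitions, the sequential loop standing in for parallel remote execution) is a simplifying assumption for illustration and does not reflect Impala's actual interfaces.

```python
from collections import Counter


def scan_fragment(partition, predicate):
    """Runs where the data lives: filter local rows and pre-aggregate by key."""
    partial = Counter()
    for key, value in partition:
        if predicate(value):
            partial[key] += value
    return partial


def coordinator(partitions, predicate):
    """Dispatch one fragment per partition and merge the partial sums."""
    total = Counter()
    for partition in partitions:  # on a real cluster these run in parallel
        total.update(scan_fragment(partition, predicate))
    return dict(total)


# Two hypothetical nodes, each holding (key, value) rows.
partitions = [[("a", 1), ("b", 5)], [("a", 3), ("b", -2)]]
result = coordinator(partitions, lambda v: v > 0)
print(result)
```

Pre-aggregating inside each fragment keeps the data sent to the coordinator small, which is the point of pushing work to the nodes that hold the data.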
Impala Query Execution
• Request arrives via ODBC/JDBC
Impala Query Execution
• Planner turns request into collection of plan fragments
• Coordinator initiates execution on remote impalad nodes
Impala Query Execution
• Intermediate results are streamed between impalad nodes
• Query results are streamed back to client
Impala Architecture: Query Planning
•2-phase process:
•single-node plan: left-deep tree of query operators
•partitioning into plan fragments for distributed parallel
execution:

maximize scan locality/minimize data movement, parallelize
all query operators
•cost-based join order optimization
•cost-based join distribution optimization
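As a minimal sketch of what a cost-based join distribution choice might look like, the function below compares the network traffic of two strategies: broadcasting the right-hand input to every node versus hash-partitioning both inputs. The cost formulas are a deliberately simplified assumption (bytes moved only); the talk does not describe the planner's actual cost model.

```python
def choose_join_distribution(lhs_bytes, rhs_bytes, num_nodes):
    """Pick the strategy that moves fewer bytes over the network.

    Assumed cost model (illustrative only):
      broadcast   -> every node receives a full copy of the right-hand side
      partitioned -> both inputs are reshuffled once by join key
    """
    broadcast_cost = rhs_bytes * num_nodes
    partitioned_cost = lhs_bytes + rhs_bytes
    if broadcast_cost <= partitioned_cost:
        return "broadcast", broadcast_cost
    return "partitioned", partitioned_cost


# Large fact table joined with a small dimension table on 20 nodes.
small_rhs = choose_join_distribution(10**12, 10**8, 20)
# Two large tables on the same cluster.
large_rhs = choose_join_distribution(10**12, 5 * 10**11, 20)
print(small_rhs, large_rhs)
```

Under this model a small right-hand side favors broadcast, while two large inputs favor partitioning.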
Impala Architecture: Query Execution
•execution engine designed for efficiency, written from scratch
in C++; no reuse of decades-old open-source code
•circumvents MapReduce completely
•in-memory execution:
•aggregation results and right-hand side inputs of joins are
cached in memory
•example: join with 1TB table, reference 2 of 200 cols, 10% of
rows 

-> need to cache 1GB across all nodes in cluster

-> not a limitation for most workloads
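The arithmetic behind the 1GB figure can be spelled out as follows. The sketch assumes all 200 columns have the same width, so that referencing 2 of them touches 1% of the table's bytes; that uniform-width assumption is implied by the slide's numbers rather than stated.

```python
TB = 10**12
GB = 10**9


def join_cache_bytes(table_bytes, cols_referenced, total_cols, row_percent):
    """Bytes to hold in memory: selected columns of the qualifying rows.

    Assumes uniform column widths; integer math keeps the result exact.
    """
    return table_bytes * cols_referenced * row_percent // (total_cols * 100)


# 1TB table, 2 of 200 columns referenced, 10% of rows qualify.
needed = join_cache_bytes(1 * TB, 2, 200, 10)
print(needed // GB, "GB across the cluster")
```

Because that 1GB is spread across all nodes, the per-node share shrinks further as the cluster grows.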
Impala Architecture: Query Execution
•runtime code generation:
•uses LLVM to JIT-compile the runtime-intensive parts of a
query
•effect the same as custom-coding a query:
•remove branches
•propagate constants, offsets, pointers, etc.
•inline function calls
•optimized execution for modern CPUs (instruction pipelines)
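The idea of "custom-coding a query" can be approximated in Python as an analogy. The generic `interpret` function below walks an expression tree and branches on the operator for every row; `generate` instead emits source code for that one expression, with constants and column names baked in, and compiles it once. Impala does this with LLVM and native code; the tuple-based expression format and all function names here are hypothetical.

```python
def interpret(expr, row):
    """Generic evaluator: re-inspects the expression tree for every row."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if op == "add":
        return left + right
    if op == "mul":
        return left * right
    if op == "gt":
        return left > right
    raise ValueError(f"unknown operator: {op}")


SYMBOLS = {"add": "+", "mul": "*", "gt": ">"}


def to_source(expr):
    """Turn the expression tree into a Python expression string."""
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]!r}]"
    if op == "const":
        return repr(expr[1])
    if op not in SYMBOLS:
        raise ValueError(f"unknown operator: {op}")
    return f"({to_source(expr[1])} {SYMBOLS[op]} {to_source(expr[2])})"


def generate(expr):
    """Compile a function specialized to this one expression."""
    source = f"def specialized(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["specialized"]


# Predicate: price * 2 > 100
predicate = ("gt", ("mul", ("col", "price"), ("const", 2)), ("const", 100))
specialized = generate(predicate)
rows = [{"price": 40}, {"price": 60}]
print([specialized(r) for r in rows])
```

The generated function contains no operator dispatch at all, which is the same effect the slide describes as removing branches, propagating constants, and inlining function calls.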
Impala Architecture: Query Execution
Impala vs MR for Analytic Workloads
•Impala vs. SQL-on-MR
•Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”)
•file formats: Parquet/ORCfile
•TPC-DS, 3TB data set running on 5-node cluster
Impala vs MR for Analytic Workloads
• Impala speedup:
• interactive: 8-69x
• report: 6-68x
• deep analytics: 10-58x
Impala vs non-MR for Analytic Workloads
•Impala 1.2.3/Presto 0.6/Shark
•file formats: RCfile (+ Parquet)
•TPC-DS, 15TB data set running on 21-node cluster
Impala vs non-MR for Analytic Workloads
Impala vs non-MR for Analytic Workloads
• Multi-user benchmark:
• 10 users concurrently
• same dataset, same
hardware
• workload: queries from
“interactive” group
Impala vs non-MR for Analytic Workloads
Scalability in Hadoop
•Hadoop’s promise of linear scalability: add more
nodes to cluster, gain a proportional increase in
capabilities

-> adapt to any kind of workload changes simply by
adding more nodes to cluster
•scaling dimensions for EDW workloads:
•response time
•concurrency/query throughput
•data size
Scalability in Hadoop
•Scalability results for Impala:
•tests show linear scaling along all 3 dimensions
•setup:
•2 clusters: 18 and 36 nodes
•15TB TPC-DS data set
•6 “interactive” TPC-DS queries
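What "linear scaling along all 3 dimensions" would mean in the ideal case can be written down as a small model: latency grows in proportion to data size and concurrent users, and shrinks in proportion to node count. This is an idealized assumption used to interpret the setup above, not a measured result, and the 10-second baseline is a hypothetical number.

```python
def ideal_latency_s(base_latency_s, base_nodes, nodes, data_scale=1.0, user_scale=1.0):
    """Latency under perfectly linear scaling relative to a baseline cluster."""
    return base_latency_s * data_scale * user_scale * base_nodes / nodes


# Hypothetical baseline: a query takes 10 s on the 18-node cluster.
same_load = ideal_latency_s(10, 18, 36)                     # response time
double_users = ideal_latency_s(10, 18, 36, user_scale=2.0)  # concurrency
double_data = ideal_latency_s(10, 18, 36, data_scale=2.0)   # data size
print(same_load, double_users, double_data)
```

In this model, doubling the cluster halves latency for an unchanged workload, and holds latency constant when either the user count or the data size doubles.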
Impala Scalability: Latency
Impala Scalability: Concurrency
• Comparison: 10 vs. 20 concurrent users
Impala Scalability: Data Size
• Comparison: 15TB vs. 30TB data set
Summary: Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in the Hadoop stack
•Impala/Parquet/HDFS is an effective solution for certain EDW
workloads
•a Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
The End
Summary: Hadoop for Analytic Workloads
•what the future holds:
•further performance gains
•more complete SQL capabilities
•improved resource mgmt and ability to handle multiple
concurrent workloads in a single cluster

 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 

Real-time SQL-on-Hadoop, Parquet and Impala for EDW Workloads

  • 1. Copyright © 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW. Marcel Kornacker | marcel@cloudera.com, April 2014
  • 2. Analytic Workloads on Hadoop: Where Do We Stand? (“DeWitt Clause” prohibits using DBMS vendor name)
  • 3. Hadoop for Analytic Workloads •Hadoop has traditionally been used for offline batch processing: ETL and ELT •Next step: Hadoop for traditional business intelligence (BI)/data warehouse (EDW) workloads: •interactive •concurrent users •Topic of this talk: a Hadoop-based open-source stack for EDW workloads: •HDFS: a high-performance storage system •Parquet: a state-of-the-art columnar storage format •Impala: a modern, open-source SQL engine for Hadoop
  • 4. Hadoop for Analytic Workloads •Thesis of this talk: •techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack •the Hadoop stack is an effective solution for certain EDW workloads •a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
  • 5. HDFS: A Storage System for Analytic Workloads •Available in HDFS today: •high-efficiency data scans at or near hardware speed, both from disk and memory •On the immediate roadmap: •co-partitioned tables for even faster distributed joins •temp-FS: write temp table data straight to memory, bypassing disk
  • 6. HDFS: The Details •High-efficiency data transfers •short-circuit reads: bypass the DataNode protocol when reading from local disk -> read at 100+ MB/s per disk •HDFS caching: access explicitly cached data without copying or checksumming -> access memory-resident data at memory bus speed -> enable in-memory processing
  • 7. HDFS: The Details •Coming attractions: •affinity groups: collocate blocks from different files -> create co-partitioned tables for improved join performance •temp-FS: write temp table data straight to memory, bypassing disk -> ideal for iterative interactive data analysis
  • 8. Parquet: Columnar Storage for Hadoop •What it is: •state-of-the-art, open-source columnar file format that’s available for (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, … •offers both high compression and high scan efficiency •co-developed by Twitter and Cloudera; hosted on GitHub and soon to be an Apache Incubator project •with contributors from Criteo, Stripe, Berkeley AMPLab, LinkedIn •used in production at Twitter and Criteo
  • 9. Parquet: The Details •columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs •optimized storage of nested data structures: patterned after Dremel’s ColumnIO format •extensible set of column encodings: •run-length and dictionary encodings in current version (1.2) •delta and optimized string encodings in 2.0 •embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency
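A minimal sketch of the two encodings named on this slide, applied to a single column pulled out of row-major data. It illustrates the idea only and is not Parquet's on-disk format; the function names and sample rows are made up for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in group)) for code, group in groupby(codes)]


def decode(dictionary, runs):
    """Invert both encodings to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Row-major input (hypothetical data): (country, order_id)
rows = [("US", 1), ("US", 2), ("US", 3), ("DE", 4), ("DE", 5), ("US", 6)]

# Column-major view of the first column: similar values end up adjacent.
country_column = [row[0] for row in rows]

dictionary, codes = dictionary_encode(country_column)
runs = run_length_encode(codes)
```

Storing the column contiguously is what makes the runs long enough for run-length encoding to pay off; in a row-major layout the country values would be interleaved with the order ids.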
  • 10. Parquet: Storage Efficiency
  • 11. Parquet: Scan Efficiency
  • 12. Impala: A Modern, Open-Source SQL Engine •implementation of an MPP SQL query engine for the Hadoop environment •highest-performance SQL engine for the Hadoop ecosystem; already outperforms some of its commercial competitors •effective for EDW-style workloads •maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, HBase, Metastore, YARN) •plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)
  • 13. Impala: A Modern, Open-Source SQL Engine •history: •developed by Cloudera and fully open-source; hosted on GitHub •released as beta in 10/2012 •1.0 version available in 05/2013 •current version is 1.2.3, available for CDH4 and CDH5 beta
  • 14. Impala from The User’s Perspective •create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in Metastore (shared with Hive, Pig, etc.; basis of HCatalog) •connect via ODBC/JDBC; authenticate via Kerberos or LDAP •run standard SQL: •current version: ANSI SQL-92 (limited to SELECT and bulk insert) without correlated subqueries; supports UDFs and UDAs
  • 15. Impala from The User’s Perspective •2014 roadmap: •1.3: admission control •1.4: Decimal(<precision>, <scale>) •2.0 or earlier: analytic window functions, ORDER BY without LIMIT, support for nested types (structs, arrays, maps), UDTFs, disk-based joins and aggregation
  • 16. Impala Architecture •distributed service: •daemon process (impalad) runs on every node with data •easily deployed with Cloudera Manager •each node can handle user requests; load balancer configuration for multi-user environments recommended •query execution phases: •client request arrives via ODBC/JDBC •planner turns request into collection of plan fragments •coordinator initiates execution on remote impalad nodes
  • 17. Impala Query Execution • Request arrives via ODBC/JDBC
  • 18. Impala Query Execution • Planner turns request into collection of plan fragments • Coordinator initiates execution on remote impalad nodes
  • 19. Impala Query Execution • Intermediate results are streamed between impalad nodes • Query results are streamed back to client
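The execution phases on these slides can be pictured with a toy scatter/gather aggregation: a fragment computes a partial result next to each node's local data, and a coordinator merges the partial results and returns them to the client. This is a single-process sketch under simplifying assumptions (no networking, no streaming, a fixed COUNT(*) ... GROUP BY query); the function names and data are hypothetical and are not Impala APIs.

```python
from collections import Counter


def scan_fragment(local_rows):
    """Plan fragment run next to the data: partial COUNT(*) per group key."""
    partial = Counter()
    for key in local_rows:
        partial[key] += 1
    return partial


def coordinator(node_data):
    """Start one fragment per node, merge the partial counts, return the result."""
    merged = Counter()
    for local_rows in node_data:
        # In a real cluster the fragments run in parallel on remote nodes
        # and stream their output back; here they run one after another.
        merged.update(scan_fragment(local_rows))
    return dict(merged)


# Hypothetical data: the group-key column as stored on three nodes.
node_data = [["a", "b", "a"], ["b", "b"], ["c", "a"]]
result = coordinator(node_data)
```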
  • 20. Impala Architecture: Query Planning •2-phase process: •single-node plan: left-deep tree of query operators •partitioning into plan fragments for distributed parallel execution: maximize scan locality/minimize data movement, parallelize all query operators •cost-based join order optimization •cost-based join distribution optimization
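The slide does not say how the join distribution cost is computed. The sketch below assumes a simple network-transfer model, which is an assumption of this example rather than Impala's documented cost function: a broadcast join ships the right-hand input to every node, while a partitioned join ships each input across the network once.

```python
def choose_join_distribution(lhs_bytes, rhs_bytes, num_nodes):
    """Pick the strategy that moves fewer bytes under the assumed cost model."""
    broadcast_cost = rhs_bytes * num_nodes      # right side copied to every node
    partitioned_cost = lhs_bytes + rhs_bytes    # both sides repartitioned once
    if broadcast_cost <= partitioned_cost:
        return "broadcast", broadcast_cost
    return "partitioned", partitioned_cost


# Hypothetical sizes on a 20-node cluster.
small_rhs = choose_join_distribution(10**12, 10**9, 20)
large_rhs = choose_join_distribution(10**12, 5 * 10**11, 20)
```

Under this model a small right-hand input favors broadcasting, because the large left-hand input never leaves the node that scans it; once the right-hand input grows, copying it to every node costs more than repartitioning both sides.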
  • 21. Impala Architecture: Query Execution •execution engine designed for efficiency, written from scratch in C++; no reuse of decades-old open-source code •circumvents MapReduce completely •in-memory execution: •aggregation results and right-hand side inputs of joins are cached in memory •example: join with 1TB table, reference 2 of 200 cols, 10% of rows -> need to cache 1GB across all nodes in cluster -> not a limitation for most workloads
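The 1GB figure in the example follows from simple arithmetic if all 200 columns are assumed to be equally wide (an assumption the slide leaves implicit). A short sketch, with a hypothetical function name and decimal units:

```python
def join_cache_bytes(table_bytes, cols_referenced, cols_total, row_selectivity_pct):
    """Bytes of the right-hand join input that must be held in memory."""
    # Only the referenced columns are read (columnar storage)...
    column_bytes = table_bytes * cols_referenced // cols_total
    # ...and only the qualifying rows are kept.
    return column_bytes * row_selectivity_pct // 100


TB = 10**12
GB = 10**9

# 1TB table, 2 of 200 columns, 10% of rows.
cached = join_cache_bytes(1 * TB, 2, 200, 10)
```

Here 1TB x 2/200 = 10GB of column data, and 10% of the rows leaves 1GB to cache.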
  • 22. Impala Architecture: Query Execution •runtime code generation: •uses LLVM to JIT-compile the runtime-intensive parts of a query •the effect is the same as custom-coding a query: •remove branches •propagate constants, offsets, pointers, etc. •inline function calls •optimized execution for modern CPUs (instruction pipelines)
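A rough analogy for runtime code generation, using Python's built-in `exec` as a stand-in for LLVM. The generic path looks up the column offset, the operator, and the constant for every row; the generated path bakes all three into source text that is compiled once per query. The names `interpret` and `generate` and the sample data are hypothetical, and the sketch shows only constant propagation and the removal of per-row dispatch, not real machine-code generation.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def interpret(row, predicate):
    """Generic path: unpack the predicate and dispatch on the operator per row."""
    col, op, const = predicate
    return OPS[op](row[col], const)


def generate(predicate):
    """Specialized path: emit source with offset, operator, constant baked in."""
    col, op, const = predicate
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    source = f"def compiled(row):\n    return row[{col:d}] {op} {const!r}\n"
    namespace = {}
    exec(source, namespace)  # compile once per query, then reuse for every row
    return namespace["compiled"]


# Hypothetical rows (id, amount) and the predicate "amount > 99".
rows = [(1, 50), (2, 150), (3, 100)]
predicate = (1, ">", 99)

compiled = generate(predicate)
generic_result = [row for row in rows if interpret(row, predicate)]
compiled_result = [row for row in rows if compiled(row)]
```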
  • 23. Impala Architecture: Query Execution
  • 24. Impala vs MR for Analytic Workloads •Impala vs. SQL-on-MR •Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”) •file formats: Parquet/ORCfile •TPC-DS, 3TB data set running on 5-node cluster
  • 25. Impala vs MR for Analytic Workloads • Impala speedup: • interactive: 8-69x • report: 6-68x • deep analytics: 10-58x
  • 26. Impala vs non-MR for Analytic Workloads •Impala 1.2.3/Presto 0.6/Shark •file formats: RCfile (+ Parquet) •TPC-DS, 15TB data set running on 21-node cluster
  • 27. Impala vs non-MR for Analytic Workloads
  • 28. Impala vs non-MR for Analytic Workloads • Multi-user benchmark: • 10 users concurrently • same dataset, same hardware • workload: queries from “interactive” group
  • 29. Impala vs non-MR for Analytic Workloads
  • 30. Scalability in Hadoop •Hadoop’s promise of linear scalability: add more nodes to cluster, gain a proportional increase in capabilities -> adapt to any kind of workload changes simply by adding more nodes to cluster •scaling dimensions for EDW workloads: •response time •concurrency/query throughput •data size
  • 31. Scalability in Hadoop •Scalability results for Impala: •tests show linear scaling along all 3 dimensions •setup: •2 clusters: 18 and 36 nodes •15TB TPC-DS data set •6 “interactive” TPC-DS queries
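One way to make "linear scaling" concrete for the response-time dimension is a scaling-efficiency ratio: the measured speedup divided by the growth in node count, where 1.0 means perfectly linear. The chart data from the talk is not reproduced in this transcript, so the timings below are hypothetical placeholders; only the 18- and 36-node cluster sizes come from the slide.

```python
def scaling_efficiency(small_nodes, small_seconds, large_nodes, large_seconds):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = small_seconds / large_seconds
    ideal = large_nodes / small_nodes
    return speedup / ideal


# Hypothetical latencies for one query on the 18- and 36-node clusters.
perfect = scaling_efficiency(18, 10.0, 36, 5.0)    # latency halves: linear
sublinear = scaling_efficiency(18, 10.0, 36, 6.25)  # latency drops less
```

The same ratio can be applied to the other two dimensions by swapping in throughput at a fixed latency, or data size at a fixed latency.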
  • 32. Impala Scalability: Latency
  • 33. Impala Scalability: Concurrency • Comparison: 10 vs 20 concurrent users
  • 34. Impala Scalability: Data Size • Comparison: 15TB vs. 30TB data set
  • 35. Summary: Hadoop for Analytic Workloads •Thesis of this talk: •techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack •Impala/Parquet/HDFS is an effective solution for certain EDW workloads •a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
  • 37. Summary: Hadoop for Analytic Workloads •what the future holds: •further performance gains •more complete SQL capabilities •improved resource management and ability to handle multiple concurrent workloads in a single cluster