Page1 © Hortonworks Inc. 2014
Discardable, In-Memory
Materialized Query for Hadoop
Julian Hyde
June 3rd, 2014
Page2 © Hortonworks Inc. 2014
About me
Julian Hyde
Architect at Hortonworks
Open source:
• Founder & lead, Apache Optiq (query optimization framework)
• Founder & lead, Pentaho Mondrian (analysis engine)
• Committer, Apache Drill
• Contributor, Apache Hive
• Contributor, Cascading Lingual (SQL interface to Cascading)
Past:
• SQLstream (streaming SQL)
• Broadbase (data warehouse)
• Oracle (SQL kernel development)
Page3 © Hortonworks Inc. 2014
Before we get started…
The bad news
• This software is not available to download
• I am a database bigot
• I am a BI bigot
The good news
• I believe in Hadoop
• Now is a great time to discuss where Hadoop is going
Page4 © Hortonworks Inc. 2014
Hadoop today
Brute force
Hadoop brings a lot of CPU, disk, IO
YARN, Tez, and vectorization are making Hadoop faster
How to use that brute force is left to the application
Business Intelligence
Best practice is to pull data out of Hadoop
• Populate enterprise data warehouse
• In-memory analytics
• Custom analytics, e.g. Lambda architecture
Ineffective use of memory
Opportunity to make Hadoop smarter
Page5 © Hortonworks Inc. 2014
Brute force + diplomacy
Page6 © Hortonworks Inc. 2014
Hardware trends
Typical Hadoop server configuration
Lots of memory - but not enough for all data
More heterogeneous mix (disk + SSD + memory)
What to do about memory?
Year             2009        2014
Cores            4 – 8       24
Memory           8 GB        128 GB
SSD              None        1 TB
Disk             4 x 1 TB    12 x 4 TB
Disk : memory    512 : 1     384 : 1
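The disk-to-memory ratios in the table follow directly from the disk and memory rows. A minimal sketch of that arithmetic, assuming 1 TB = 1024 GB (the function name is illustrative, not from the talk):

```python
def disk_to_memory_ratio(disks: int, tb_per_disk: int, memory_gb: int) -> float:
    """Return total disk capacity divided by memory, both expressed in GB."""
    disk_gb = disks * tb_per_disk * 1024  # assumption: 1 TB = 1024 GB
    return disk_gb / memory_gb


ratio_2009 = disk_to_memory_ratio(disks=4, tb_per_disk=1, memory_gb=8)
ratio_2014 = disk_to_memory_ratio(disks=12, tb_per_disk=4, memory_gb=128)
print(f"2009: {ratio_2009:.0f} : 1, 2014: {ratio_2014:.0f} : 1")
```

Memory has grown faster than disk, but a node still holds hundreds of times more data on disk than fits in memory.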
Page7 © Hortonworks Inc. 2014
Dumb use of memory - Buffer cache
Example:
50 TB data
1 TB memory
Full-table scan
Analogous to virtual memory
• Great while it works, but…
• Table scan nukes the cache!
[Diagram labels: Table (disk), Buffer cache (memory), Table scan, Query operators (Tez, or whatever), Pre-fetch]
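The "table scan nukes the cache" effect can be reproduced with a toy simulation. This is only a sketch: it assumes a plain LRU eviction policy and block-granular reads, neither of which is stated on the slide, and it scales the example down to a scan 50 times larger than the cache.

```python
from collections import OrderedDict


class LRUCache:
    """Toy buffer cache with least-recently-used eviction (an assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0

    def read(self, block_id: int) -> None:
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
        else:
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used


def hit_rate(cache: LRUCache, block_ids) -> float:
    """Read the given blocks and return the fraction served from the cache."""
    before = cache.hits
    for block_id in block_ids:
        cache.read(block_id)
    return (cache.hits - before) / len(block_ids)


cache = LRUCache(capacity=100)
hot_set = list(range(50))           # small working set that fits in memory
hit_rate(cache, hot_set)            # first pass warms the cache
warm_rate = hit_rate(cache, hot_set)

for block_id in range(1000, 6000):  # full-table scan, 50x the cache size
    cache.read(block_id)
after_scan_rate = hit_rate(cache, hot_set)

print(warm_rate, after_scan_rate)
```

Before the scan every read of the working set is a hit; after it, the working set has been evicted entirely and must be re-read from disk.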
Page8 © Hortonworks Inc. 2014
“Document-oriented” analysis
Operate on working sets small enough to fit in memory
Analogous to working on a document (e.g. a spreadsheet)
• Works well for problems that fit into memory (e.g. some machine-learning algorithms)
• If your problem grows, you’re out of luck
[Diagram labels: Working set (memory), Table (disk), Interactive user]
Page9 © Hortonworks Inc. 2014
Smarter use of memory - Materialized queries
[Diagram labels: In-memory materialized queries, Tables on disk]
Page10 © Hortonworks Inc. 2014
Census data
Census table – 300M records
Which state has the most males?
CREATE TABLE Census (id, gender,
zipcode, state, age);
Brute force – read 150M records
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = 'M'
GROUP BY state
ORDER BY c DESC LIMIT 1;
Smarter – read 50 records
CREATE TABLE CensusSummary AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state, c
FROM CensusSummary
WHERE gender = 'M'
ORDER BY c DESC LIMIT 1;
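The equivalence of the two approaches can be checked on a toy dataset. The sketch below runs the slide's queries in SQLite via Python's `sqlite3` module; the six sample rows and the column types are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Census (id INTEGER, gender TEXT, zipcode TEXT, state TEXT, age INTEGER)"
)
# Hypothetical sample rows standing in for the 300M-record table.
conn.executemany(
    "INSERT INTO Census VALUES (?, ?, ?, ?, ?)",
    [
        (1, "M", "94105", "CA", 34),
        (2, "F", "94105", "CA", 29),
        (3, "M", "90001", "CA", 51),
        (4, "M", "10001", "NY", 42),
        (5, "F", "10001", "NY", 38),
        (6, "F", "73301", "TX", 27),
    ],
)

# Brute force: scan the base table.
brute_force = conn.execute(
    """SELECT state, COUNT(*) AS c
       FROM Census
       WHERE gender = 'M'
       GROUP BY state
       ORDER BY c DESC LIMIT 1"""
).fetchone()

# Smarter: aggregate once, then answer from the much smaller summary.
conn.execute(
    """CREATE TABLE CensusSummary AS
       SELECT state, gender, COUNT(*) AS c
       FROM Census
       GROUP BY state, gender"""
)
smarter = conn.execute(
    """SELECT state, c
       FROM CensusSummary
       WHERE gender = 'M'
       ORDER BY c DESC LIMIT 1"""
).fetchone()
summary_rows = conn.execute("SELECT COUNT(*) FROM CensusSummary").fetchone()[0]

print(brute_force, smarter, summary_rows)
```

Both queries return the same answer; the summary holds at most one row per (state, gender) pair regardless of how large the base table grows.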
Page11 © Hortonworks Inc. 2014
Materialized view – Automatic smartness
A materialized view is a table that is declared to be identical to a given query
Optimizer rewrites query on “Census” to use “CensusSummary” instead
Even smarter – read 50 records
CREATE MATERIALIZED VIEW CensusSummary
STORAGE (MEMORY, DISCARDABLE) AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = 'M'
GROUP BY state
ORDER BY c DESC LIMIT 1;
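To make the rewrite step concrete, here is a deliberately tiny sketch of the matching logic. It is not how Optiq or any real optimizer is implemented: it handles only COUNT(*) queries with equality filters, and the `CountQuery`, `CountView`, and `rewrite` names are hypothetical. The rule it encodes is that a view can answer a query if every column the query groups or filters on is one of the view's grouping columns, in which case COUNT(*) becomes a SUM over the view's count column.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CountQuery:
    """SELECT <group_by>, COUNT(*) FROM <table> WHERE <col = literal>... GROUP BY <group_by>."""
    table: str
    group_by: Tuple[str, ...]
    filters: Tuple[Tuple[str, str], ...] = ()  # (column, literal) equality predicates


@dataclass(frozen=True)
class CountView:
    """Materialized view holding COUNT(*) per combination of grouping columns."""
    name: str
    table: str
    group_by: Tuple[str, ...]
    count_column: str = "c"


def rewrite(query: CountQuery, view: CountView) -> Optional[str]:
    """Return SQL over the view, or None if the view cannot answer the query."""
    needed = set(query.group_by) | {column for column, _ in query.filters}
    if query.table != view.table or not needed <= set(view.group_by):
        return None
    # Counts roll up by summation: COUNT(*) on the base table = SUM(c) on the view.
    select = ", ".join(query.group_by + (f"SUM({view.count_column}) AS c",))
    sql = f"SELECT {select} FROM {view.name}"
    if query.filters:
        sql += " WHERE " + " AND ".join(f"{col} = '{val}'" for col, val in query.filters)
    if query.group_by:
        sql += " GROUP BY " + ", ".join(query.group_by)
    return sql


view = CountView(name="CensusSummary", table="Census", group_by=("state", "gender"))
males_by_state = CountQuery(table="Census", group_by=("state",), filters=(("gender", "M"),))
by_age = CountQuery(table="Census", group_by=("age",))

rewritten = rewrite(males_by_state, view)
not_rewritten = rewrite(by_age, view)
print(rewritten)
print(not_rewritten)
```

A query grouped by `age` cannot use the view because `age` was aggregated away; a real optimizer would fall back to the base table, or to another view, on cost grounds.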
Page12 © Hortonworks Inc. 2014
Indistinguishable from magic
Run query #1
SELECT state, COUNT(*) AS c
FROM Census
WHERE gender = 'M'
GROUP BY state
ORDER BY c DESC LIMIT 1;
System creates materialized view in background
CREATE MATERIALIZED VIEW CensusSummary
STORAGE (MEMORY, DISCARDABLE) AS
SELECT state, gender, COUNT(*) AS c
FROM Census
GROUP BY state, gender;
Related query #2 runs faster
SELECT state,
COUNT(NULLIF(gender, 'F')) AS males,
COUNT(NULLIF(gender, 'M')) AS females
FROM Census
GROUP BY state
HAVING females > males;
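Query #2 looks quite different from query #1, yet it can also be answered from CensusSummary. The sketch below checks this in SQLite on made-up sample rows. The rewritten form (SUM over CASE expressions) is one hand-written equivalent, not necessarily what an optimizer would produce, and the HAVING clauses repeat the aggregate expressions rather than relying on column aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Census (id INTEGER, gender TEXT, state TEXT)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Census VALUES (?, ?, ?)",
    [(1, "M", "CA"), (2, "F", "CA"), (3, "M", "CA"),
     (4, "M", "NY"), (5, "F", "NY"), (6, "F", "TX")],
)
conn.execute(
    """CREATE TABLE CensusSummary AS
       SELECT state, gender, COUNT(*) AS c
       FROM Census
       GROUP BY state, gender"""
)

# Query #2 against the base table.
direct = conn.execute(
    """SELECT state,
              COUNT(NULLIF(gender, 'F')) AS males,
              COUNT(NULLIF(gender, 'M')) AS females
       FROM Census
       GROUP BY state
       HAVING COUNT(NULLIF(gender, 'M')) > COUNT(NULLIF(gender, 'F'))
       ORDER BY state"""
).fetchall()

# The same question answered from the summary: counts roll up as sums.
from_summary = conn.execute(
    """SELECT state,
              SUM(CASE WHEN gender = 'M' THEN c ELSE 0 END) AS males,
              SUM(CASE WHEN gender = 'F' THEN c ELSE 0 END) AS females
       FROM CensusSummary
       GROUP BY state
       HAVING SUM(CASE WHEN gender = 'F' THEN c ELSE 0 END)
            > SUM(CASE WHEN gender = 'M' THEN c ELSE 0 END)
       ORDER BY state"""
).fetchall()

print(direct, from_summary)
```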
Page13 © Hortonworks Inc. 2014
Materialized views - Classic
Classic materialized view (Oracle, DB2, Teradata, MSSql)
1. A table defined using a SQL query
2. Designed by DBA
3. Storage same as a regular table
   1. On disk
   2. Can define indexes
4. DB populates the table
5. Queries are rewritten to use the table**
6. DB updates the table to reflect changes to source data (usually deferred)*
*Magic required
SELECT t.year, AVG(s.units)
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
GROUP BY t.year;
CREATE MATERIALIZED VIEW SalesMonthZip AS
SELECT t.year, t.month,
c.state, c.zipcode,
COUNT(*), SUM(s.units), SUM(s.price)
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
JOIN CustomerDim AS c USING (customerId)
GROUP BY t.year, t.month,
c.state, c.zipcode;
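The yearly-average query above can be answered from SalesMonthZip even though the view stores no averages, because AVG decomposes into SUM and COUNT. The sketch below verifies this in SQLite. The sample rows and the column names `cnt`, `sum_units`, and `sum_price` are assumptions (the slide leaves the aggregate columns unnamed), and it assumes `units` is never NULL so that COUNT(*) matches the number of values averaged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TimeDim (timeId INTEGER, year INTEGER, month INTEGER);
    CREATE TABLE CustomerDim (customerId INTEGER, state TEXT, zipcode TEXT);
    CREATE TABLE SalesFact (timeId INTEGER, customerId INTEGER, units INTEGER, price REAL);
    INSERT INTO TimeDim VALUES (1, 2013, 1), (2, 2013, 2), (3, 2014, 1);
    INSERT INTO CustomerDim VALUES (10, 'CA', '94105'), (11, 'NY', '10001');
    INSERT INTO SalesFact VALUES
      (1, 10, 1, 9.5), (1, 11, 2, 19.0), (2, 10, 6, 57.0),
      (3, 10, 4, 38.0), (3, 11, 5, 47.5);

    -- Stand-in for the materialized view, stored as an ordinary table.
    CREATE TABLE SalesMonthZip AS
    SELECT t.year AS year, t.month AS month,
           c.state AS state, c.zipcode AS zipcode,
           COUNT(*) AS cnt, SUM(s.units) AS sum_units, SUM(s.price) AS sum_price
    FROM SalesFact AS s
    JOIN TimeDim AS t USING (timeId)
    JOIN CustomerDim AS c USING (customerId)
    GROUP BY t.year, t.month, c.state, c.zipcode;
    """
)

direct = conn.execute(
    """SELECT t.year, AVG(s.units)
       FROM SalesFact AS s
       JOIN TimeDim AS t USING (timeId)
       GROUP BY t.year
       ORDER BY t.year"""
).fetchall()

# AVG rolls up as SUM(sum) / SUM(count); 1.0 * forces floating-point division.
rolled_up = conn.execute(
    """SELECT year, 1.0 * SUM(sum_units) / SUM(cnt)
       FROM SalesMonthZip
       GROUP BY year
       ORDER BY year"""
).fetchall()

print(direct, rolled_up)
```

This is why the view stores COUNT(*) alongside the sums: averaging the per-group averages would give the wrong answer whenever groups differ in size.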
Page14 © Hortonworks Inc. 2014
Materialized views - DIMMQ
DIMMQs - Discardable, In-memory Materialized Queries
Differences from classic materialized views
1. May be in-memory
2. HDFS may discard – based on DDM (Discardable Distributed Memory)
3. Lifecycle support:
   1. Assume table is populated
   2. Don’t populate & maintain
   3. User can flag as valid, invalid, or change definition (e.g. date range)
   4. HDFS may discard
4. More design options:
   1. DBA specifies
   2. Retain query results (or partial results)
   3. An agent builds MVs based on query traffic
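The lifecycle in item 3 can be pictured as a small state machine over catalog entries. The following is a sketch under stated assumptions, not the design from the talk: the `State`, `Dimmq`, and `usable` names are hypothetical, and the rule that a discarded entry cannot simply be flagged valid again is an assumption made here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    VALID = "valid"
    INVALID = "invalid"
    DISCARDED = "discarded"


@dataclass
class Dimmq:
    """Catalog entry for one discardable, in-memory materialized query."""
    name: str
    definition: str
    state: State = State.VALID  # the table is assumed to be populated already

    def flag(self, valid: bool) -> None:
        """User marks the contents as valid or invalid, e.g. after source data changes."""
        if self.state is State.DISCARDED:
            # Assumption: discarded data is gone and would have to be rebuilt.
            raise ValueError(f"{self.name} was discarded and must be rebuilt")
        self.state = State.VALID if valid else State.INVALID

    def discard(self) -> None:
        """Storage layer reclaims the memory."""
        self.state = State.DISCARDED


def usable(catalog):
    """Names of the entries the optimizer may still rewrite queries onto."""
    return [mv.name for mv in catalog if mv.state is State.VALID]


catalog = [
    Dimmq("CensusSummary", "SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender"),
    Dimmq("SalesMonthZip", "SELECT year, month, state, zipcode, COUNT(*) FROM SalesFact GROUP BY 1, 2, 3, 4"),
]
initially = usable(catalog)
catalog[0].flag(valid=False)   # source data changed
after_invalidation = usable(catalog)
catalog[0].flag(valid=True)    # user vouches for the contents again
catalog[1].discard()           # storage layer needed the memory
after_discard = usable(catalog)
print(initially, after_invalidation, after_discard)
```

Because queries are written against the base tables, losing an entry costs only performance; results stay correct.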
Page15 © Hortonworks Inc. 2014
DIMMQ compared to Spark RDDs
It’s not “either / or”
• Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon
• Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible
                      Spark RDD                        DIMMQ
Access                By reference                     By algebraic expression
Sharing               Within session                   Across sessions
Execution model       Built-in                         External
Recovery              Yes                              No
Discard               Failure or GC                    Failure or cost-based
Native organization   Memory                           Disk
Language              Scala (or other JVM language)    Language-independent
Page16 © Hortonworks Inc. 2014
Data independence
This is not just about SQL standards compliance!
Materialized views are supposed to be transparent in creation, maintenance and use.
If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded
Data independence
Ability to move data around and not tell your application
Replicas
Redundant copies
Moving between disk and memory
Sort order, projections (à la Vertica), aggregates (à la Microstrategy)
Indexes, and other weird data structures
Page17 © Hortonworks Inc. 2014
Implementing DIMMQs
Relational algebra
Apache Optiq (just entered incubator – yay!)
Algebra, rewrite rules, cost model
Metadata
Hive: “CREATE MATERIALIZED VIEW”
Definitions of materialized views in HCatalog
HDFS - Discardable Distributed Memory (DDM)
Off-heap data in memory-mapped files
Discard policy
Build in-memory, replicate to disk; or vice versa
Central namespace
Evolution of existing components
Page18 © Hortonworks Inc. 2014
Tiled queries in distributed memory
Query: SELECT x, SUM(y) FROM t GROUP BY x
[Diagram labels: In-memory materialized queries, Tables on disk]
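The slide does not spell out how tiles work. One plausible reading, sketched below, is that each tile holds the partial result of the query over one slice of `t`, and the full answer is obtained by combining tiles. The slicing scheme and the `build_tile` and `combine` helpers are assumptions made for illustration.

```python
from collections import defaultdict


def build_tile(rows):
    """Partial result of SELECT x, SUM(y) FROM t GROUP BY x over one slice of t."""
    tile = defaultdict(int)
    for x, y in rows:
        tile[x] += y
    return dict(tile)


def combine(tiles):
    """Merge partial results; SUM is distributive, so per-tile sums simply add up."""
    total = defaultdict(int)
    for tile in tiles:
        for x, partial_sum in tile.items():
            total[x] += partial_sum
    return dict(total)


# Hypothetical slices of table t, e.g. one per partition or HDFS block.
slices = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("c", 5)],
]
tiles = [build_tile(rows) for rows in slices]
from_tiles = combine(tiles)
from_scan = build_tile([row for rows in slices for row in rows])
print(from_tiles, from_scan)
```

Under this reading, losing one tile only requires rescanning the slice it covered rather than the whole table.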
Page19 © Hortonworks Inc. 2014
An adaptive system
Ongoing activities:
• Agent suggests new MVs
• MVs are built in background
• Ongoing query activity uses MVs
• User marks MVs as invalid due to source data changes
• HDFS throws out MVs that are not pulling their weight
Dynamic equilibrium
DIMMQs continually created & destroyed
System moves data around to adapt to
changing usage patterns
Page20 © Hortonworks Inc. 2014
Lambda architecture
From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)
Page21 © Hortonworks Inc. 2014
Lambda architecture in Hadoop via DIMMQs
Use DIMMQs for materialized historic & streaming data
[Diagram labels: MV on disk, MV in key-value store, Hadoop, Query via SQL]
Page22 © Hortonworks Inc. 2014
Variations on a theme
Materialized queries don’t have to be in memory
Materialized queries don’t need to be discardable
Materialized queries don’t need to be accessed via SQL
Materialized queries allow novel data structures to be described
Maintaining materialized queries - build on Hive ACID
Fine-grained invalidation
Streaming into DIMMQs
In-memory tables don’t have to be materialized queries
Data aging – Older data to cheaper storage
[Diagram labels: Discardable, In-memory, Algebraic]
Page23 © Hortonworks Inc. 2014
Lattice
Space of possible materialized views
A star schema, with mandatory many-to-one relationships
Each view is a projected, filtered aggregation
• Sales by zipcode and quarter in 2013
• Sales by state in Q1, 2012
Lattice gathers stats
• “I used MV m to answer query q and avoided fetching r rows”
• Cost of MV = construction effort + memory * time
• Utility of MV = query processing effort saved
Recommends & builds optimal MVs
CREATE LATTICE SalesStar AS
SELECT *
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
JOIN CustomerDim AS c USING (customerId);
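The cost and utility formulas on this slide are enough to rank materialized views against each other. A minimal sketch, assuming all quantities are expressed in one common (hypothetical) unit of effort; the `MvStats` class, the sample numbers, and the `pick_discard` helper are illustrative and not part of the lattice design:

```python
from dataclasses import dataclass


@dataclass
class MvStats:
    """Statistics gathered for one materialized view (units are hypothetical)."""
    name: str
    construction_effort: float
    memory: float          # memory occupied
    time_resident: float   # how long it has occupied that memory
    effort_saved: float    # query processing effort avoided by using the view

    @property
    def cost(self) -> float:
        # Cost of MV = construction effort + memory * time
        return self.construction_effort + self.memory * self.time_resident

    @property
    def utility(self) -> float:
        # Utility of MV = query processing effort saved
        return self.effort_saved

    @property
    def net_benefit(self) -> float:
        return self.utility - self.cost


def pick_discard(views):
    """Choose the view that is pulling its weight the least."""
    return min(views, key=lambda mv: mv.net_benefit)


views = [
    MvStats("sales_by_zip_quarter_2013", construction_effort=10, memory=2,
            time_resident=5, effort_saved=100),
    MvStats("sales_by_state_q1_2012", construction_effort=30, memory=8,
            time_resident=5, effort_saved=40),
]
victim = pick_discard(views)
print(victim.name, victim.cost, victim.net_benefit)
```

Subtracting cost from utility is only one way to combine the two figures; a ratio or a decay over time would be equally defensible.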
Page24 © Hortonworks Inc. 2014
Conclusion
Broadening the tent – batch, interactive BI, streaming, iterative
Declarative, algebraic, transparent
Seamless movement between memory and disk
Adding brains to complement Hadoop’s brawn
Page25 © Hortonworks Inc. 2014
Thank you!
My next talk:
“Cost-based query optimization in Hive”
4:35pm Wednesday
@julianhyde
http://hortonworks.com/blog/dmmq/
http://hortonworks.com/blog/ddm/
http://incubator.apache.org/projects/optiq.html

More Related Content

What's hot

Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.Julian Hyde
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management frameworkJulian Hyde
 
SQL on Big Data using Optiq
SQL on Big Data using OptiqSQL on Big Data using Optiq
SQL on Big Data using OptiqJulian Hyde
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Michael Rys
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Julian Hyde
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Julian Hyde
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Jason L Brugger
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Michael Rys
 
Mondrian update (Pentaho community meetup 2012, Amsterdam)
Mondrian update (Pentaho community meetup 2012, Amsterdam)Mondrian update (Pentaho community meetup 2012, Amsterdam)
Mondrian update (Pentaho community meetup 2012, Amsterdam)Julian Hyde
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketDremio Corporation
 

What's hot (20)

Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.Why is data independence (still) so important? Optiq and Apache Drill.
Why is data independence (still) so important? Optiq and Apache Drill.
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
 
SQL on Big Data using Optiq
SQL on Big Data using OptiqSQL on Big Data using Optiq
SQL on Big Data using Optiq
 
ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
Apache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them All
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)
 
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
 
Mondrian update (Pentaho community meetup 2012, Amsterdam)
Mondrian update (Pentaho community meetup 2012, Amsterdam)Mondrian update (Pentaho community meetup 2012, Amsterdam)
Mondrian update (Pentaho community meetup 2012, Amsterdam)
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 

Viewers also liked

The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too muchJulian Hyde
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
Oracle enterprise architects day
Oracle enterprise architects dayOracle enterprise architects day
Oracle enterprise architects dayAyodele Peter Boglo
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentJulian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 

Viewers also liked (11)

The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
Oracle enterprise architects day
Oracle enterprise architects dayOracle enterprise architects day
Oracle enterprise architects day
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 

Similar to Discardable In-Memory Materialized Queries With Hadoop

Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache HadoopHortonworks
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architectureJoseph D'Antoni
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemJames Serra
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data ModelingAdam Doyle
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...Lucas Jellema
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptxIke Ellis
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupVictor Coustenoble
 

Similar to Discardable In-Memory Materialized Queries With Hadoop (20)

Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Spark
SparkSpark
Spark
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
NoSQL
NoSQLNoSQL
NoSQL
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
Predictive Analytics and Machine Learning…with SAS and Apache HadoopPredictive Analytics and Machine Learning…with SAS and Apache Hadoop
Predictive Analytics and Machine Learning …with SAS and Apache Hadoop
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
The modern analytics architecture
The modern analytics architectureThe modern analytics architecture
The modern analytics architecture
 
Modern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform SystemModern Data Warehousing with the Microsoft Analytics Platform System
Modern Data Warehousing with the Microsoft Analytics Platform System
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
 

More from Julian Hyde

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteJulian Hyde
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Julian Hyde
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQLJulian Hyde
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...

Discardable In-Memory Materialized Queries With Hadoop

  • 1. Page1 © Hortonworks Inc. 2014
       Discardable, In-Memory Materialized Query for Hadoop
       Julian Hyde
       June 3rd, 2014
  • 2. About me
       Julian Hyde
       Architect at Hortonworks
       Open source:
       • Founder & lead, Apache Optiq (query optimization framework)
       • Founder & lead, Pentaho Mondrian (analysis engine)
       • Committer, Apache Drill
       • Contributor, Apache Hive
       • Contributor, Cascading Lingual (SQL interface to Cascading)
       Past:
       • SQLstream (streaming SQL)
       • Broadbase (data warehouse)
       • Oracle (SQL kernel development)
  • 3. Before we get started…
       The bad news
       • This software is not available to download
       • I am a database bigot
       • I am a BI bigot
       The good news
       • I believe in Hadoop
       • Now is a great time to discuss where Hadoop is going
  • 4. Hadoop today
       Brute force
       Hadoop brings a lot of CPU, disk, IO
       YARN, Tez, vectorization are making Hadoop faster
       How to use that brute force is left to the application
       Business Intelligence
       Best practice is to pull data out of Hadoop
       • Populate enterprise data warehouse
       • In-memory analytics
       • Custom analytics, e.g. Lambda architecture
       Ineffective use of memory
       Opportunity to make Hadoop smarter
  • 5. Brute force + diplomacy
  • 6. Hardware trends
       Typical Hadoop server configuration
       Year             2009        2014
       Cores            4 – 8       24
       Memory           8 GB        128 GB
       SSD              None        1 TB
       Disk             4 x 1 TB    12 x 4 TB
       Disk : memory    512 : 1     384 : 1
       Lots of memory - but not enough for all data
       More heterogeneous mix (disk + SSD + memory)
       What to do about memory?
  • 7. Dumb use of memory - Buffer cache
       Example: 50 TB data, 1 TB memory, full-table scan
       Analogous to virtual memory
       • Great while it works, but…
       • Table scan nukes the cache!
       Diagram labels: Table (disk); Pre-fetch; Buffer cache (memory); Table scan; Query operators (Tez, or whatever)
  • 8. “Document-oriented” analysis
       Operate on working sets small enough to fit in memory
       Analogous to working on a document (e.g. a spreadsheet)
       • Works well for problems that fit into memory (e.g. some machine-learning algorithms)
       • If your problem grows, you’re out of luck
       Diagram labels: Working set (memory); Table (disk); Interactive user
  • 9. Smarter use of memory - Materialized queries
       Diagram labels: In-memory materialized queries; Tables on disk
  • 10. Census data
        Census table – 300M records
        Which state has the most males?
        Brute force – read 150M records
          CREATE TABLE Census (id, gender, zipcode, state, age);
          SELECT state, COUNT(*) AS c FROM Census WHERE gender = 'M'
            GROUP BY state ORDER BY c DESC LIMIT 1;
        Smarter – read 50 records
          CREATE TABLE CensusSummary AS
            SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender;
          SELECT state, c FROM CensusSummary WHERE gender = 'M'
            ORDER BY c DESC LIMIT 1;
  • 11. Materialized view – Automatic smartness
        A materialized view is a table that is declared to be identical to a given query
        Optimizer rewrites query on “Census” to use “CensusSummary” instead
        Even smarter – read 50 records
          CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS
            SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender;
          SELECT state, COUNT(*) AS c FROM Census WHERE gender = 'M'
            GROUP BY state ORDER BY c DESC LIMIT 1;
  • 12. Indistinguishable from magic
        Run query #1
          SELECT state, COUNT(*) AS c FROM Census WHERE gender = 'M'
            GROUP BY state ORDER BY c DESC LIMIT 1;
        System creates materialized view in background
          CREATE MATERIALIZED VIEW CensusSummary STORAGE (MEMORY, DISCARDABLE) AS
            SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender;
        Related query #2 runs faster
          SELECT state, COUNT(NULLIF(gender, 'F')) AS males, COUNT(NULLIF(gender, 'M')) AS females
            FROM Census GROUP BY state HAVING females > males;
  • 13. Materialized views - Classic
        Classic materialized view (Oracle, DB2, Teradata, MSSql)
        1. A table defined using a SQL query
        2. Designed by DBA
        3. Storage same as a regular table
           1. On disk
           2. Can define indexes
        4. DB populates the table
        5. Queries are rewritten to use the table**
        6. DB updates the table to reflect changes to source data (usually deferred)*
        *Magic required
          SELECT t.year, AVG(s.units)
            FROM SalesFact AS s JOIN TimeDim AS t USING (timeId)
            GROUP BY t.year;
          CREATE MATERIALIZED VIEW SalesMonthZip AS
            SELECT t.year, t.month, c.state, c.zipcode, COUNT(*), SUM(s.units), SUM(s.price)
            FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) JOIN CustomerDim AS c USING (customerId)
            GROUP BY t.year, t.month, c.state, c.zipcode;
  • 14. Materialized views - DIMMQ
        DIMMQs - Discardable, In-memory Materialized Queries
        Differences from classic materialized views
        1. May be in-memory
        2. HDFS may discard – based on DDM (Discardable Distributed Memory)
        3. Lifecycle support:
           1. Assume table is populated
           2. Don’t populate & maintain
           3. User can flag as valid, invalid, or change definition (e.g. date range)
           4. HDFS may discard
        4. More design options:
           1. DBA specifies
           2. Retain query results (or partial results)
           3. An agent builds MVs based on query traffic
  • 15. DIMMQ compared to Spark RDDs
        It’s not “either / or”
        • Spark-on-YARN, SQL-on-Spark already exist; Cascading-on-Spark common soon
        • Hive-queries-on-RDDs, DIMMQs populated by Spark, Spark-on-Hive-tables are possible
                              Spark RDD                        DIMMQ
        Access                By reference                     By algebraic expression
        Sharing               Within session                   Across sessions
        Execution model       Built-in                         External
        Recovery              Yes                              No
        Discard               Failure or GC                    Failure or cost-based
        Native organization   Memory                           Disk
        Language              Scala (or other JVM language)    Language-independent
  • 16. Data independence
        This is not just about SQL standards compliance!
        Materialized views are supposed to be transparent in creation, maintenance and use.
        If not one DBA ever types “CREATE MATERIALIZED VIEW”, we have still succeeded
        Data independence
          Ability to move data around and not tell your application
          Replicas
          Redundant copies
          Moving between disk and memory
          Sort order, projections (à la Vertica), aggregates (à la Microstrategy)
          Indexes, and other weird data structures
  • 17. Implementing DIMMQs
        Relational algebra
          Apache Optiq (just entered incubator – yay!)
          Algebra, rewrite rules, cost model
        Metadata
          Hive: “CREATE MATERIALIZED VIEW”
          Definitions of materialized views in HCatalog
        HDFS - Discardable Distributed Memory (DDM)
          Off-heap data in memory-mapped files
          Discard policy
          Build in-memory, replicate to disk; or vice versa
          Central namespace
        Evolution of existing components
  • 18. Tiled queries in distributed memory
        Query: SELECT x, SUM(y) FROM t GROUP BY x
        Diagram labels: In-memory materialized queries; Tables on disk
  • 19. An adaptive system
        Ongoing activities:
        • Agent suggests new MVs
        • MVs are built in background
        • Ongoing query activity uses MVs
        • User marks MVs as invalid due to source data changes
        • HDFS throws out MVs that are not pulling their weight
        Dynamic equilibrium
          DIMMQs continually created & destroyed
          System moves data around to adapt to changing usage patterns
  • 20. Lambda architecture
        From “Runaway complexity in Big Data and a plan to stop it” (Nathan Marz)
  • 21. Lambda architecture in Hadoop via DIMMQs
        Use DIMMQs for materialized historic & streaming data
        Diagram labels: MV on disk; MV in key-value store; Hadoop; Query via SQL
  • 22. Variations on a theme
        Materialized queries don’t have to be in memory
        Materialized queries don’t need to be discardable
        Materialized queries don’t need to be accessed via SQL
        Materialized queries allow novel data structures to be described
        Maintaining materialized queries - build on Hive ACID
        Fine-grained invalidation
        Streaming into DIMMQs
        In-memory tables don’t have to be materialized queries
        Data aging – Older data to cheaper storage
        Diagram labels: Discardable; In-memory; Algebraic
  • 23. Lattice
        Space of possible materialized views
        A star schema, with mandatory many-to-one relationships
        Each view is a projected, filtered aggregation
        • Sales by zipcode and quarter in 2013
        • Sales by state in Q1, 2012
        Lattice gathers stats
        • “I used MV m to answer query q and avoided fetching r rows”
        • Cost of MV = construction effort + memory * time
        • Utility of MV = query processing effort saved
        Recommends & builds optimal MVs
          CREATE LATTICE SalesStar AS
            SELECT * FROM SalesFact AS s
            JOIN TimeDim AS t USING (timeId)
            JOIN CustomerDim AS c USING (customerId);
  • 24. Conclusion
        Broadening the tent – batch, interactive BI, streaming, iterative
        Declarative, algebraic, transparent
        Seamless movement between memory and disk
        Adding brains to complement Hadoop’s brawn
  • 25. Thank you!
        My next talk: “Cost-based query optimization in Hive”, 4:35pm Wednesday
        @julianhyde
        http://hortonworks.com/blog/dmmq/
        http://hortonworks.com/blog/ddm/
        http://incubator.apache.org/projects/optiq.html
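
Slide 7's claim that a table scan "nukes" the buffer cache can be seen in a small simulation. This is only a sketch: it assumes a least-recently-used (LRU) eviction policy, which the slides do not specify, and the names `LruBufferCache` and `hit_count` and the block counts are hypothetical.

```python
from collections import OrderedDict


class LruBufferCache:
    """Toy block cache with LRU eviction (an assumed policy, not from the slides)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> True, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            # Cache hit: mark the block as most recently used.
            self.blocks.move_to_end(block_id)
            self.hits += 1
        else:
            # Cache miss: load the block, evicting the oldest one if full.
            self.misses += 1
            self.blocks[block_id] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)


def hit_count(cache, block_ids):
    """Read the given blocks and return how many of the reads were hits."""
    before = cache.hits
    for block_id in block_ids:
        cache.read(block_id)
    return cache.hits - before


cache = LruBufferCache(capacity=4)
hot_blocks = [0, 1, 2]                            # small interactive working set

hit_count(cache, hot_blocks)                      # warm-up pass: all misses
warm_hits = hit_count(cache, hot_blocks)          # working set is now cached
hit_count(cache, range(100, 150))                 # full-table scan of 50 cold blocks
post_scan_hits = hit_count(cache, hot_blocks)     # working set has been evicted

print("hits before scan:", warm_hits)
print("hits after scan:", post_scan_hits)
```

With these numbers the working set is served entirely from the cache before the scan and not at all after it, which is the "great while it works" behaviour the slide describes.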
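
The Census example on slides 10 to 12 can be tried end to end with a toy data set. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a stand-in engine: SQLite has no materialized views, so `CensusSummary` is an ordinary table and the "rewritten" queries are written by hand, whereas the slides propose that an optimizer would do the rewrite automatically. The sample rows and column types are assumptions.

```python
import sqlite3

# Seven made-up census rows: (id, gender, zipcode, state, age).
ROWS = [
    (1, "M", "94105", "CA", 34),
    (2, "M", "94105", "CA", 51),
    (3, "F", "94110", "CA", 29),
    (4, "M", "10001", "NY", 45),
    (5, "F", "10002", "NY", 38),
    (6, "F", "10003", "NY", 62),
    (7, "F", "73301", "TX", 23),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Census "
    "(id INTEGER, gender TEXT, zipcode TEXT, state TEXT, age INTEGER)")
conn.executemany("INSERT INTO Census VALUES (?, ?, ?, ?, ?)", ROWS)

# Stand-in for the materialized view: one row per (state, gender).
conn.execute(
    "CREATE TABLE CensusSummary AS "
    "SELECT state, gender, COUNT(*) AS c FROM Census GROUP BY state, gender")

# Query #1 as the user writes it, against the base table.
direct_q1 = conn.execute(
    "SELECT state, COUNT(*) AS c FROM Census WHERE gender = 'M' "
    "GROUP BY state ORDER BY c DESC LIMIT 1").fetchall()

# Query #1 answered from the summary instead.
rewritten_q1 = conn.execute(
    "SELECT state, c FROM CensusSummary WHERE gender = 'M' "
    "ORDER BY c DESC LIMIT 1").fetchall()

# Query #2 as the user writes it (ORDER BY added only to make output stable).
direct_q2 = conn.execute(
    "SELECT state, COUNT(NULLIF(gender, 'F')) AS males, "
    "COUNT(NULLIF(gender, 'M')) AS females "
    "FROM Census GROUP BY state HAVING females > males "
    "ORDER BY state").fetchall()

# Query #2 answered from the summary: counts become sums of pre-computed counts.
rewritten_q2 = conn.execute(
    "SELECT state, "
    "SUM(CASE WHEN gender = 'M' THEN c ELSE 0 END) AS males, "
    "SUM(CASE WHEN gender = 'F' THEN c ELSE 0 END) AS females "
    "FROM CensusSummary GROUP BY state HAVING females > males "
    "ORDER BY state").fetchall()

print(direct_q1, rewritten_q1)
print(direct_q2, rewritten_q2)
```

Both forms of each query return the same rows, but the summary-based forms touch at most one row per state and gender rather than one row per person. The rewrite of query #2 relies on `gender` only ever being 'M' or 'F', as the slide's use of `NULLIF` also does.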
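
Slide 13 pairs a query for average units per year with a finer-grained materialized view, `SalesMonthZip`. One way such a query could be answered from the view is to roll up its stored sums and counts, since an average is a sum divided by a count. The sketch below is an illustration under assumptions: it runs on SQLite with a plain table in place of the view, the column aliases `cnt`, `sum_units` and `sum_price` and the sample rows are invented, and `units` is declared `NOT NULL` so that `COUNT(*)` matches the number of values `AVG` would consider.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TimeDim (timeId INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE CustomerDim (customerId INTEGER PRIMARY KEY, state TEXT, zipcode TEXT);
CREATE TABLE SalesFact (timeId INTEGER, customerId INTEGER,
                        units INTEGER NOT NULL, price REAL);

INSERT INTO TimeDim VALUES (1, 2013, 1), (2, 2013, 2), (3, 2014, 1);
INSERT INTO CustomerDim VALUES (1, 'CA', '94105'), (2, 'NY', '10001');
INSERT INTO SalesFact VALUES
  (1, 1, 10, 1.0), (1, 2, 20, 2.0), (2, 1, 30, 3.0),
  (3, 1, 5, 0.5), (3, 2, 10, 1.0);

-- Stand-in for the slide's materialized view, with explicit column aliases.
CREATE TABLE SalesMonthZip AS
SELECT t.year AS year, t.month AS month, c.state AS state, c.zipcode AS zipcode,
       COUNT(*) AS cnt, SUM(s.units) AS sum_units, SUM(s.price) AS sum_price
FROM SalesFact AS s
JOIN TimeDim AS t USING (timeId)
JOIN CustomerDim AS c USING (customerId)
GROUP BY t.year, t.month, c.state, c.zipcode;
""")

# The query from the slide, against the fact and dimension tables.
direct = conn.execute(
    "SELECT t.year, AVG(s.units) "
    "FROM SalesFact AS s JOIN TimeDim AS t USING (timeId) "
    "GROUP BY t.year ORDER BY t.year").fetchall()

# Hand-written roll-up over the view: AVG = SUM of sums / SUM of counts.
rolled_up = conn.execute(
    "SELECT year, 1.0 * SUM(sum_units) / SUM(cnt) "
    "FROM SalesMonthZip GROUP BY year ORDER BY year").fetchall()

print(direct)
print(rolled_up)
```

This is also why the view stores `COUNT(*)` and `SUM(s.units)` rather than a per-group average: sums and counts can be combined across months and zipcodes, while averages of averages generally cannot.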
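
Slides 19 and 23 describe discarding materialized views that are "not pulling their weight", using cost = construction effort + memory * time and utility = query processing effort saved. A minimal sketch of that comparison follows. The class, field names, units and the discard rule (utility below cost) are all assumptions made for illustration; the slides give the two formulas but not a concrete policy.

```python
from dataclasses import dataclass


@dataclass
class MaterializedView:
    """Hypothetical bookkeeping record for one materialized view."""
    name: str
    construction_effort: float   # effort spent building the view
    memory: float                # memory the view occupies
    time_resident: float         # how long it has occupied that memory
    rows_avoided: float          # total rows queries did not fetch thanks to it

    def cost(self, memory_price=1.0):
        # Cost of MV = construction effort + memory * time (slide 23).
        return self.construction_effort + memory_price * self.memory * self.time_resident

    def utility(self, effort_per_row=1.0):
        # Utility of MV = query processing effort saved (slide 23),
        # approximated here as a fixed effort per row avoided.
        return effort_per_row * self.rows_avoided


def views_to_discard(views):
    """Return the names of views whose utility is below their cost."""
    return [v.name for v in views if v.utility() < v.cost()]


views = [
    MaterializedView("CensusSummary", construction_effort=100.0,
                     memory=10.0, time_resident=5.0, rows_avoided=1000.0),
    MaterializedView("RarelyUsed", construction_effort=100.0,
                     memory=10.0, time_resident=5.0, rows_avoided=20.0),
]

discard = views_to_discard(views)
print(discard)
```

The `memory_price` and `effort_per_row` parameters exist only to put the two quantities in comparable units. A real policy would also have to decide whether construction effort, once spent, should still count against a view that is already built.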

Editor's Notes

  1. This software does not yet exist. I hope and believe that it will. I don’t believe in promoting vaporware! I want to have a discussion about where Hadoop is going so that as many of us as possible can shape it. This seems like a great time and place to have it. My background is in databases. I think like a database guy. I’ve been thinking really hard about how to bring the “good stuff” from databases over to Hadoop. I have a deep knowledge of BI, and of what BI tools need from their underlying data system, having written Mondrian.
  2. Hadoop today brings a lot of brute force. It could be more effective if it were a bit smarter. We cannot make effective use of memory unless we are a bit smarter.
  3. Lots of memory now. But still not enough.
  4. Bringing all kinds of data into the Hadoop tent: batch, interactive, streaming, iterative. Declarative, algebraic, transparent: applications use different copies of data without knowing it. Making Hadoop smarter: adding brains to complement Hadoop’s formidable brawn.