Solving performance problems on Hadoop
Moving analytic workloads into production
1
Tyler Mitchell
Sr. Software Engineer
Actian Center of Excellence
Topics
How we got (stuck) here
Performance best practices
Sample business cases
Benchmarking results
2
Actian’s Lineage
Ingres – 1970s · Pervasive – 1982 · Versant – 1988 · Vectorwise – 2003 · ParAccel – 2006
3
Actian
Actian at a Glance
4
400+ Employees · 10,000+ Customers
HQ Palo Alto · 8 Countries; 7 US Cities
3 Businesses: Data Management, Data Integration, Big Data Analytics
Banking, Insurance, Telecom and Media
How We Got (Stuck) Here
5
Accidental Hadoop Tourist – Brief History
6
Data → Business: Data Capture → Data Management & Integration → Analytics (Query & Analyze) → Solutions. Problem solved.
Accidental Hadoop Tourist – Brief History
7
Data → Business: Data Capture → Data Management & Integration → Analytics → Solutions: ??????
Accidental Hadoop Tourist – Brief History
8
Data → Business: Data Capture → Data Management & Integration → Analytics (???) → Solutions (???)
Modern, best-in-class analytic database technology provides:
9
Measurable business impact: monetize Big Data to grow revenue, reduce cost, mitigate risk, enable new business
The ability to make data-driven business decisions using a massively scalable platform
Decisive reduction in the cost of high-performance analytics at scale
Performance that can meet all SLAs
Full leverage of existing SQL skills while deploying a modern analytic infrastructure
Grow Revenue · Reduce Cost · Mitigate Risk · Create New Business
Business Solution Architecture Challenges
Wide Range of Use Cases
10
Financial Services – Advanced Credit Risk Analytics across billions of data points
Internet Scale Application – Predictive Analytics across hundreds of millions of customers
Media – Data Science and Discovery across trillions of IoT events
Dept of Defense – Cyber-Security: network intrusion models every second
Credit Card Processing – Fraud detection every millisecond
Performance Best Practices
11
3 Essential Big Data Concepts
12
0. Take nothing for granted
1. Partitioning vs data skew (see the sketch after this list)
2. Data types matter
3. Maximize memory / minimize bottlenecks
4. Take nothing for granted
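A quick, hypothetical sketch of concept 1: before committing to a partition key, it is worth checking how evenly that key would spread rows across partitions. The helper below is made up for illustration (skew_report and its arguments are not part of any Actian or Hadoop API); it hashes a candidate key into a fixed number of buckets and reports how lopsided the largest bucket is.

```python
from collections import Counter

def skew_report(rows, key_fn, partitions=30):
    """Rough skew check for a candidate partition key (illustrative only).

    Hashes each row's key into one of `partitions` buckets and reports how
    lopsided the largest bucket is relative to the average.
    """
    counts = Counter(hash(key_fn(row)) % partitions for row in rows)
    sizes = [counts.get(p, 0) for p in range(partitions)]
    avg = sum(sizes) / partitions
    largest = max(sizes)
    return {"avg_rows": avg, "largest": largest,
            "skew_factor": largest / avg if avg else float("inf")}

# A high-cardinality key (customer_id) spreads evenly; a low-cardinality,
# lopsided key (country) piles most rows onto one or two partitions.
sample = [{"customer_id": i, "country": "US" if i % 10 else "DE"}
          for i in range(100_000)]
print(skew_report(sample, lambda r: r["customer_id"]))  # skew_factor near 1
print(skew_report(sample, lambda r: r["country"]))      # skew_factor far above 1
```

A high skew factor means a few nodes would do most of the work while the rest sit idle, which is exactly the situation the deck warns about.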
6 Game Changing Database Innovations
13
6 Game Changing Database Innovations
14
1. Use the CPU! – Vector Processing (see the sketch after this list)
2. Minimize bottlenecks – Exploiting Chip Cache
3. Got columnar?
4. Smarter compression
5. Smarter indexing
6. Multi-core matters
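As a rough illustration of why items 1–3 matter (this is a generic NumPy sketch, not a description of Actian's engine), the same filtered aggregate is computed below row-at-a-time in interpreted Python and then as whole-column vectorized operations over a columnar layout; the vectorized form keeps the inner loop in tight compiled code where SIMD and caches can do their work.

```python
import time
import numpy as np

N = 2_000_000
# Columnar layout: each column is one contiguous array (as in a column store).
price = np.random.rand(N)
qty = np.random.randint(1, 10, N)

# Row-at-a-time: one branchy, interpreted iteration per row.
t0 = time.perf_counter()
total = 0.0
for i in range(N):
    if qty[i] > 5:
        total += price[i] * qty[i]
row_seconds = time.perf_counter() - t0

# Vectorized: one operation applied to a whole column at a time.
t0 = time.perf_counter()
mask = qty > 5
vec_total = float((price[mask] * qty[mask]).sum())
vec_seconds = time.perf_counter() - t0

print(f"row-at-a-time: {row_seconds:.2f}s   vectorized: {vec_seconds:.3f}s")
print(f"same answer: {np.isclose(total, vec_total)}")
```

The vectorized version usually runs orders of magnitude faster on the same machine, which is the same effect a vectorized, cache-friendly engine gets inside the database.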
Actian VectorH Innovations
15
Big Data Business Use Cases
16
Customer 360: Understanding Experience, Driving Revenue
17
Telecom Challenge
Vast and growing repository of proprietary click data, customer records, service call records, smartphone and device data, GPS location, webserver, telephone, and network usage.
Queries took minutes or hours, and sometimes never returned at all.
Critical business analysis on a consolidated customer 360 data lake was grinding to a halt.
The ability to gain deeper market insights and visualization, and the desired data management and operational optimization, were at risk.
Customer 360: Initial Architecture
18
Development System
• 300+ node cluster
• HIVE access
• SQL based BI / Data Science
• Pre-processed as performance was unacceptable
• Views taking days to return snapshot views
Customer 360: Technical Improvements
19
Production Prototype
• 30 node cluster (10% the size of the Hive cluster)
• Actian Vector on Hadoop solution
• SQL based BI / Data Science
• No materialized view building required
• Join on demand faster than aggregate tables in Hive
• Reduced storage requirements
• 91TB – two years of data, 1,100 columns when joined
Customer 360: Understanding Experience, Driving Revenue
20
Results
Customer 360 across prior data silos
Leveraged for customer retention strategies
Predict and take proactive, tailored responses
Enables next gen data-driven troubleshooting, impact analysis and root cause analysis
Impact:
• Accelerated operations intelligence
• Improved customer experience
• Reduced customer churn
Financial Risk: Upgrading Legacy to Meet SLA
21
Challenge
Legacy single-purpose risk application took 3 hours to generate the end-of-day risk report and failed to meet changing SLAs for reporting risk.
In deciding to replace the risk application, the bank opted to build a multi-purpose risk application addressing multiple business requirements.
Financial Risk: Upgrading Legacy to Meet SLA
22
Legacy System
• Single server architecture, MS SSAS, Oracle - ~30 applications
• Pre-processing of desired measures exploding data volumes
• Cube and Analysis engines being maxed out as they exceed 1.5TB range
• Unable to scale to the desired range of > 200GB/day new data
• Impala attempt failed
• Highly invested in apps built on Analysis Services
Financial Risk: Upgrading Legacy to Meet SLA
23
New Possibilities
• Clustered solution – 5- and 10-node Hadoop clusters
• No pre-processing cubes, SSAS partly kept
• Tested solutions 1TB -> 20TB at a time
• Produced interactive queries across large datasets
• Focused query results in 2s or less
• Processing all data in the database 6s – 80s
• 2x nodes ~ 200% speed improvement
Financial Risk: Upgrading Legacy to Meet SLA
24
Results
Increased data analyzed by 100X: 2–200B rows / 1–20TB
Report run in 28 seconds vs. 3 hours
Use of application for:
• Intra-day reporting (surveillance)
• End of day reporting (compliance)
• Overnight float investment options
• Annual CCAR Analysis
(Chart: goal vs. actual)
Delivering the Results With Better Engineering
25
Technical Benchmarks
26
Technical Benchmarks – Single Machine
27
Technical Benchmarks – Single Machine
28
Technical Benchmarks: VectorH - SQL on Hadoop
29
TPC-H SF1000 *
VectorH vs other platforms, faster by how much?
Tuned platforms
Identical hardware **
* Not an official TPC result ** 10 nodes, each 2 x Intel 3.0GHz E5-2690v2 CPUs, 256GB RAM,
24x600GB HDD, 10Gb Ethernet, Hadoop 2.6.0
Actian VectorH Delivers More Efficient File Format
30
Better compression & functionality
Vector advantages:
• skip blocks via MinMax indexes (see the sketch below)
• sophisticated query processing
• efficient block format, esp. 64-bit int
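The MinMax-index idea above is easy to show in miniature. The sketch below is a simplified, hypothetical model (the block size, data layout and function names are invented, not VectorH internals): keep a per-block min/max summary for a column and scan only blocks whose range could contain matches.

```python
import numpy as np

BLOCK_ROWS = 1024  # illustrative block size, not the real storage block size

def build_minmax(column):
    """Split one column into fixed-size blocks and record each block's min/max."""
    blocks = [column[i:i + BLOCK_ROWS] for i in range(0, len(column), BLOCK_ROWS)]
    return blocks, [(int(b.min()), int(b.max())) for b in blocks]

def select_greater_than(blocks, minmax, threshold):
    """Return rows > threshold, reading only blocks whose max exceeds it."""
    scanned, hits = 0, []
    for block, (lo, hi) in zip(blocks, minmax):
        if hi <= threshold:          # the whole block can be skipped
            continue
        scanned += 1
        hits.append(block[block > threshold])
    result = np.concatenate(hits) if hits else np.empty(0, dtype=blocks[0].dtype)
    return result, scanned

# Roughly ordered data (e.g. an event timestamp) lets a selective filter
# skip almost every block; fully random data would not.
col = np.sort(np.random.randint(0, 1_000_000, size=1_000_000))
blocks, minmax = build_minmax(col)
rows, scanned = select_greater_than(blocks, minmax, threshold=990_000)
print(f"scanned {scanned} of {len(blocks)} blocks for {len(rows)} matching rows")
```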
Summary
Conscientious data handling & next-gen engineering take SQL in Hadoop to new levels.
All Hadoop users can move from development into production while delivering compelling business results.
31
Delivering the Results With Better Engineering
32
VectorH v5 – Spark integration, external table support, and more
SIGMOD 2016 Paper
33
Thank you!
tyler.mitchell@actian.com - @1tylermitchell
Blogs at Actian.com - MakeDataUseful.com
Visit us in booth 503
34

Editor's Notes

1: We use vectorized processing to exploit modern CPU architecture. We execute one operation at a time on a vector of data, which allows for tight inner code loops without branching. This way we can use SIMD instructions and, because of the lack of branching, make sure the CPU pipelines are not thrashed. A vector is typically 1024 rows of a single column, so it is a manageable amount of data while the per-row overhead is still negligible.
2: A vector fits in the CPU cache together with the code for a particular operation, so all execution is in-cache.
3: To feed this engine with enough data, we also apply the vectorized paradigm to the storage subsystem. First of all, we use a column store, so only relevant columns are read from disk. Data is stored in blocks of typically 512MB, and a single block contains only data from a single column (there are exceptions). Blocks of different columns can be interleaved, but typically more than one block of the same column is grouped. To keep the stable storage fast and defragmented, we use in-memory overlays to store updates to the data; these overlays are automatically flushed to stable storage when needed.
4: The blocks are stored compressed on disk. We have a number of lightweight compression algorithms, and the most efficient one is chosen per block depending on the data characteristics (see the sketch after these notes). Decompression takes place per vector and can be done in the CPU cache, which ties in neatly with our in-cache execution. A buffer manager predicts which blocks are needed when and makes sure no blocks that will be used in the near future are evicted from the buffer cache.
5: We have min-max indexes on the disk blocks, so when data is not completely random we can narrow down the ranges of blocks we need to read from disk, per column.
6: Multi-core parallelism: the well-tuned query optimizer takes into account query sequencing, data partitioning, and HDFS block locality to leverage the number of threads available and produce results in parallel, balancing the workload across system resources to improve throughput and response time.
All in all, the execution engine can do about 1.5GB/s per core, and high-end I/O subsystems are able to keep up with this.
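Point 4 of these notes (choosing the cheapest lightweight encoding per block) can also be sketched in a few lines. This is a toy illustration rather than the real VectorH compression code or its actual algorithms: for each column block it compares raw storage, a simple run-length encoding, and a general-purpose codec, then keeps whichever is smallest.

```python
import zlib
import numpy as np

def run_lengths(block):
    """Split a 1-D array into (values, run_counts) pairs."""
    change = np.flatnonzero(np.diff(block)) + 1
    starts = np.concatenate(([0], change))
    counts = np.diff(np.concatenate((starts, [len(block)])))
    return block[starts], counts.astype(np.int32)

def pick_block_encoding(block):
    """Toy per-block codec choice: keep whichever encoding is smallest."""
    raw = block.tobytes()
    values, counts = run_lengths(block)
    candidates = {
        "raw": raw,
        "rle": values.tobytes() + counts.tobytes(),  # wins on long runs
        "zlib": zlib.compress(raw, 1),               # general-purpose fallback
    }
    name, data = min(candidates.items(), key=lambda kv: len(kv[1]))
    return name, len(data), len(raw)

repetitive = np.repeat(np.arange(10, dtype=np.int64), 1_000)          # long runs
random_ids = np.random.randint(0, 1 << 30, 10_000).astype(np.int64)   # no runs
print(pick_block_encoding(repetitive))   # RLE should win by a wide margin
print(pick_block_encoding(random_ids))   # raw or zlib, depending on the data
```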