A presentation reviewing an architecture that enables Tableau users to query Hadoop directly from Tableau. We review how Jethro's index-access architecture provides a SQL-on-Hadoop solution suited to BI use and enables interactive, limitless BI directly on Hadoop or any other data source. Find out more at http://jethro.io
2. Topics
• BI on Big Data Trade-Off
• SQL-on-Hadoop Performance Challenges
• Live Demo: Tableau on Hadoop
Impala / Redshift / Jethro
• Jethro Technology Overview
3. • What Does Jethro Do?
– Acceleration server for BI on Big Data
• How Does It Work?
– Full Indexing and cube caching
– Combines Columnar SQL DB design
with search-indexing technology
• When to Use It?
– Reporting, dashboards, discovery, ad-hoc
• How to Get It
– Download & free evaluation
• Partnerships
– BI & Hadoop vendors
About Us
4. • Typical usage is based on extracting selected
data from remote data sources
• Extracted data then dynamically loaded into
memory for interactive analysis
• Challenges:
– Size: performance typically degrades at
~250M rows
– Refresh lag time
BI & Big Data: Extract (In-Memory)
Tableau & Big Data
5. • For every user interaction Tableau issues
SQL queries to the target DB
• DB retrieves requested data, processes
SQL aggregations and returns to Tableau
• Challenges:
– DB performance is significantly slower than
in-mem speed
BI & Big Data: Live-Connect (In-DB)
6. SQL makes it possible to change the data platform while keeping the analytic apps intact
Analytics: ETL, Predictive, Reporting, BI
• 10x-100x data
• 1/10 HW $ cost
• Open platform
Big Data Platforms: Hadoop vs. EDW Appliances
SQL-on-Hadoop Performance Challenges
7. SQL-on-Hadoop
ETL Predictive Reporting
BI
Too SLOW on Hadoop
It’s unrealistic to expect the same performance when data is much larger and highly optimized
hardware is replaced with commodity boxes
The Hadoop Trade-Off: Scale & Cost vs. Performance
8. More Hardware
– Add nodes, RAM, CPU, SSD, network
Different SQL-on-Hadoop engines
– Hive, Impala, Drill, SparkSQL, HAWQ,
Presto, Actian, etc.
Rigid Data Model
– Less granularity, more pre-aggregations
– Pre-defined OLAP Cubes
– De-normalize into single large table
– Multiple partition keys (replication)
Replicate from Hadoop to EDW
– Traditional: Teradata, Vertica, Netezza,
…
– Cloud: Redshift
– As-a-Svc: BigQuery, Snowflake, Qubole
No Hadoop, No EDW
– Search: Elastic + Kibana
– NoSQL: HBase, Cassandra, MongoDB
BI & Data Combined
– Full-stack Hadoop: Platfora, Arcadia
– As-a-Svc: DOMO, QuickSight, PowerBI,
…
BI on Big Data: Technology Alternatives
9. A Library Analogy:
Billions of books, Thousands of racks
Query:
List books by author “Stephen King”
Process:
Every librarian pulls out book after book
from their rack and checks the author
• Hive
• Impala
• Presto
• SparkSQL
• Drill
• Pivotal/HAWQ
• IBM/Big SQL
• Actian
• …
SQL-on-Hadoop: MPP/Full-Scan Architecture
Unsuitable for BI
10. Query:
List books by author “Stephen King”
Process:
Access Author index, entry of “Stephen
King”, get list of books, fetch only these
books
Result:
Fast, minimal resources, scalable
SQL-on-Hadoop: Index-Access Architecture
Optimal for BI
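The library analogy can be sketched in a few lines of Python. This is only an illustration of full-scan versus index-access, not Jethro's implementation: the `BOOKS` rows and the function names are hypothetical, and the sketch assumes each book's id equals its position in the list.

```python
from collections import defaultdict

# Hypothetical toy "library": (book_id, author, title) rows.
# Assumption: book_id equals the row's position in the list.
BOOKS = [
    (0, "Stephen King", "It"),
    (1, "Jane Austen", "Emma"),
    (2, "Stephen King", "Misery"),
    (3, "Mark Twain", "Roughing It"),
]

def full_scan(books, author):
    """Full-scan style: check every book, counting how many were touched."""
    touched = 0
    hits = []
    for _book_id, book_author, title in books:
        touched += 1
        if book_author == author:
            hits.append(title)
    return hits, touched

def build_author_index(books):
    """Build an author -> list of book ids mapping once, ahead of queries."""
    index = defaultdict(list)
    for book_id, book_author, _title in books:
        index[book_author].append(book_id)
    return index

def index_access(books, index, author):
    """Index-access style: look up the author entry, fetch only those books."""
    row_ids = index.get(author, [])
    hits = [books[row_id][2] for row_id in row_ids]
    return hits, len(row_ids)

author_index = build_author_index(BOOKS)
scan_hits, scan_touched = full_scan(BOOKS, "Stephen King")
index_hits, index_touched = index_access(BOOKS, author_index, "Stephen King")
print(scan_hits, scan_touched)
print(index_hits, index_touched)
```

Both approaches return the same books; the difference is how many rows each one has to touch, which is the point the two slides make.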
11. Hardware  Data Format     Hadoop Cluster / Compute Cluster     Total RAM, CPU   AWS $ per hr.
Jethro        Jethro indexes  3x m1.xlarge / 2x r3.4xlarge (spot)  290GB, 44 cores  $0.75
Impala        Parquet         6x r3.2xlarge, 1x r3.xlarge          390GB, 52 cores  $4.25
Redshift      Redshift        6x dc1.large                         90GB, 12 cores   $1.50
• Point browser to: tableau.jethrodata.com
– Login: demo / demo
• Choose workbook: Jethro, Impala, Redshift
• Dashboard interaction: choose year,
category or any other filters to drill-down
• Data
– Based on TPC-DS benchmark
– 1TB raw data (400GB fact)
– Fact table: ~2.9B rows
– 7 Dimensions
LIVE Benchmark: Tableau on Hadoop (and Redshift)
Live Benchmark
12. Indexing Data for Jethro Acceleration
• Identify BI-worthy datasets
– Not all data in Hadoop needs to be indexed by Jethro
• Jethro “loader” creates an indexed version
– Stores back in same HDFS
• If Hadoop is not used, the indexed version can also be stored on a local
filesystem, network storage or cloud storage (e.g. S3)
– Highly efficient: ~1B rows/hour, 3x compression
• Incremental refresh
– As frequently as every min, hour, day, …
– Does not require a full-rebuild of index
14. Jethro Indexes – Superior Technology
Patent Pending: http://www.google.com/patents/WO2013001535A3?cl=en
Complete
– Every column is indexed
Simple
– Inverted-list indexes map each
column value to a list of rows
Fast to read
– Index-of-index provides direct
access to a value entry
– No need to scan entire index,
or load index to memory
Scalable
– Distributed, highly hierarchical
compressed bitmaps
Fast to write
– Appendable index structure for
fast incremental refresh
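A minimal sketch of the inverted-list idea, assuming a plain Python integer stands in for each bitmap. It does not model the index-of-index, compression, or distribution described above, and `ColumnIndex`, `rows_for` and `bitmap_to_rows` are hypothetical names rather than Jethro's API.

```python
from collections import defaultdict

class ColumnIndex:
    """Maps each distinct column value to a bitmap of row numbers.

    Bit i of a bitmap is set when row i holds that value.
    """

    def __init__(self):
        self.bitmaps = defaultdict(int)
        self.row_count = 0

    def append(self, values):
        """Index newly loaded rows; existing rows are not re-read."""
        for value in values:
            self.bitmaps[value] |= 1 << self.row_count
            self.row_count += 1

    def rows_for(self, value):
        """Return the bitmap for one value (0 if the value never occurs)."""
        return self.bitmaps.get(value, 0)

def bitmap_to_rows(bitmap):
    """Expand a bitmap into a sorted list of row numbers."""
    rows = []
    position = 0
    while bitmap:
        if bitmap & 1:
            rows.append(position)
        bitmap >>= 1
        position += 1
    return rows

# Every column gets its own index.
year = ColumnIndex()
category = ColumnIndex()
year.append([2001, 2002, 2001, 2002])
category.append(["Books", "Books", "Music", "Music"])

# WHERE year = 2002 AND category = 'Books' becomes a bitwise AND.
first_match = bitmap_to_rows(year.rows_for(2002) & category.rows_for("Books"))

# Incremental load: one more row is appended to both indexes.
year.append([2002])
category.append(["Books"])
second_match = bitmap_to_rows(year.rows_for(2002) & category.rows_for("Books"))
print(first_match, second_match)
```

Each extra filter is another AND over the bitmaps, so adding filters narrows the set of rows that must be fetched.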
15. Automated Cube & Query Caching
• Every query is cached
– Based on result-set size vs. execution time
• Cubes generated automatically
– Identify repeat query patterns
– For example: adding the filter as a column to a
GROUP BY
• All stored in HDFS
– Tens of thousands of cached cubes and queries
• Incremental refresh
– Query executes ONLY on the incremental data
and then merges with cached results
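The incremental-refresh bullet can be illustrated with a small sketch. It assumes an append-only table and an additive aggregate (a SUM per group), where merging old and new partial results is a simple addition; `CachedSum` and `aggregate` are hypothetical names, not part of Jethro.

```python
from collections import Counter

def aggregate(rows):
    """Equivalent of SELECT category, SUM(amount) ... GROUP BY category."""
    totals = Counter()
    for category, amount in rows:
        totals[category] += amount
    return totals

class CachedSum:
    """Caches a grouped SUM and refreshes it from newly appended rows only."""

    def __init__(self):
        self.totals = Counter()
        self.rows_seen = 0

    def refresh(self, table):
        # Run the query only on rows added since the last refresh...
        new_rows = table[self.rows_seen:]
        # ...then merge the partial result into the cached result.
        self.totals.update(aggregate(new_rows))
        self.rows_seen = len(table)
        return dict(self.totals)

table = [("Books", 10), ("Music", 5)]
cache = CachedSum()
first = cache.refresh(table)

table.append(("Books", 7))      # incremental load
second = cache.refresh(table)   # touches only the one new row
print(first, second)
```

Aggregates that are not additive (for example a distinct count) would need a different merge step than the one sketched here.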
16. What Is Jethro for Tableau?
An indexing & caching server
1. Tableau uses live connect
(ODBC) to send SQL queries
2. Jethro checks if query can
be served from existing
cubes
– Yes: reply to Tableau
3. Jethro uses indexed table to
access only necessary data
– Auto create a cube based on
this and similar queries
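The three steps above can be sketched as a small routing function. This is an assumption-laden illustration: the cache here matches only on identical query text, whereas the slide describes cubes that also serve similar queries, and `AccelerationServer` and `run_on_index` are hypothetical names.

```python
class AccelerationServer:
    """Serves a query from cache when possible, otherwise from the index."""

    def __init__(self, run_on_index):
        # run_on_index: callable that executes SQL against the indexed table.
        self.run_on_index = run_on_index
        self.cache = {}
        self.stats = {"cache_hits": 0, "index_runs": 0}

    def query(self, sql):
        key = sql.strip()
        # Step 2: can the query be served from an existing cached result?
        if key in self.cache:
            self.stats["cache_hits"] += 1
            return self.cache[key]
        # Step 3: otherwise use the index, then keep the result for reuse.
        result = self.run_on_index(sql)
        self.stats["index_runs"] += 1
        self.cache[key] = result
        return result

executed = []

def fake_index_engine(sql):
    """Stand-in for the indexed-table engine; records what it was asked."""
    executed.append(sql)
    return [("2002", 42)]

server = AccelerationServer(fake_index_engine)
# Step 1: the BI tool sends SQL over its live connection.
first = server.query("SELECT year, COUNT(*) FROM sales GROUP BY year")
second = server.query("  SELECT year, COUNT(*) FROM sales GROUP BY year ")
print(server.stats)
```

The second call never reaches the index engine, which mirrors the "Yes: reply to Tableau" branch.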
17. Why Is Jethro the Right Technology for BI on Big
Data?
Limitless BI on Big Data: Supporting the full range of BI use cases.
Jethro’s technology is a unique and optimal fit.
1. Full indexing enables interactive discovery and fast drill down
– Eliminates need to repeatedly read unnecessary data. The deeper you go the
faster it gets!
2. Auto cubes & cache enable interactive dashboards and fast reports
– Optimize repeat query performance
3. Incremental-refresh enables LIVE BI over streaming data
– Reduces maintenance and cuts lag time
18. Ready to Try Jethro?
1. Register: jethro.io
– Download and Install on-prem or cloud
2. Schedule a 30min POC review with Jethro SA (free!)
3. Index BI-worthy datasets
4. Use Tableau
5. Train Jethro with BI apps
– Continuous performance improvement
That’s It!