Introduction to Apache Tajo:
Future of Data Warehouse
Jihoon Son / Gruter Inc.
I am
● Jihoon Son (@jihoonson)
○ Ph.D. at Korea Univ.
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ LinkedIn
■ https://www.linkedin.com/in/jihoonson
2
Today's Topic: Tajo
● What is Tajo?
○ Tajo / tάːzo / 타조
○ "Ostrich" in Korean
■ Fastest two-legged animal in the world
3
Today's Topic: Tajo
● What is Apache Tajo?
○ Our Ostrich can do SQL processing on big data!
■ SQL-on-Hadoop system
■ Apache Top-level project
4
Maybe You Think ...
5
SQL-on-Hadoop?
Boring...
This Ostrich is Different!
6
SQL-on-Hadoop Systems
9
[Figure: SQL-on-Hadoop systems, labeled with two workloads: long-running ETL jobs and low-latency interactive analysis]
SQL-on-Hadoop Systems
10
Long-running ETL jobs
● Requirements
○ Stable query execution
■ Fault-tolerance
● Can avoid query resubmission
○ Adaptation to dynamic environment
■ Available resources, unpredictable delays, ...
SQL-on-Hadoop Systems
11
Low-latency interactive analysis
● Requirements
○ Fast query execution
■ Several query execution techniques
■ In-memory processing
Tajo is designed for Both Workloads
12
● Long-running ETL jobs
● Low-latency interactive analysis
Who is using Tajo?
13
Use Cases: SK Telecom
● Data warehousing & analysis
○ 1st telco in South Korea
■ 40 TB/day compressed data (2014)
14
SK Telecom: Before Tajo
15
[Diagram: ETL pipeline from Operational Systems (Marketing, Sales, ERP, SCM) through the Integration Layer (ODS, Staging Area, Data Vault) to the Data Warehouse (Data Marts, Strategic Marts); platforms shown: Hadoop, MPP DBMS]
SK Telecom: After Tajo
17
[Diagram: ETL pipeline from Operational Systems (Marketing, Sales, ERP, SCM) through the Integration Layer (ODS, Staging Area, Data Vault) to the Data Warehouse (Data Marts, Strategic Marts)]
● Long-running ETL jobs
● Ad-hoc analysis
Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ Freed-up resources make more exploratory analysis possible
18
Use Cases: Bluehole Studio
● Game log analysis
○ Finding principal causes of service-quality deficiencies
19
Use Cases: Bluehole Studio
● Tajo on EMR
20
Use Cases: Bluehole Studio
● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Gentle learning curve thanks to standard SQL
● User complaints and hidden bugs can now be acted on immediately
21
Use Cases: Melon
● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for targeted marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible
22
So, Why should you use Tajo?
● Easy to use
○ ANSI SQL standard compliance (SQL:2003)
■ CTAS, window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without modification
○ Various data format support
■ Text, JSON, ORC, Parquet, ...
27
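The two SQL features named above, CTAS and window functions, can be tried out in a small sketch. This is not Tajo itself: SQLite (through Python's standard library) stands in for the engine, and the `purchases` table, its columns, and the data are hypothetical. It assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

# SQLite stands in for a Tajo cluster here (assumption); the table and
# column names are hypothetical and exist only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, "song_a", 1.5), (1, "song_b", 0.5), (2, "song_c", 3.0)],
)

# CTAS (CREATE TABLE AS SELECT): materialize a per-user summary table.
conn.execute(
    """
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM purchases
    GROUP BY user_id
    """
)

# Window function: rank users by total spend.
ranked = conn.execute(
    """
    SELECT user_id, total, RANK() OVER (ORDER BY total DESC) AS spend_rank
    FROM user_totals
    ORDER BY spend_rank
    """
).fetchall()
print(ranked)
```

Both statements are plain standard SQL, which is the point of the slide: a query written this way should not need engine-specific changes.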
So, Why should you use Tajo?
● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Near-maximum I/O throughput (~120 MB/s) per disk
■ Off-heap data processing
● Mitigates GC overhead
29
So, Why should you use Tajo?
● Optimized performance
○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimizes the query plan during query execution
● Especially effective for long-running queries
■ => Efficient star schema processing
30
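A minimal sketch of what join ordering and size-based algorithm selection can look like for a star schema. This is an illustration only, not Tajo's optimizer: the broadcast threshold, the greedy smallest-dimension-first rule, and the table sizes are all assumptions, and a real cost-based planner would also track intermediate result sizes.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical threshold: a table at or below this size is broadcast to every
# worker instead of being repartitioned. This is not a Tajo default.
BROADCAST_LIMIT_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class Table:
    name: str
    size_bytes: int


def choose_join_algorithm(left: Table, right: Table) -> str:
    """Pick a join strategy from the input sizes alone."""
    if min(left.size_bytes, right.size_bytes) <= BROADCAST_LIMIT_BYTES:
        return "broadcast"
    return "repartition"


def plan_star_join(fact: Table, dimensions: List[Table]) -> List[Tuple[str, str]]:
    """Greedy ordering: join the fact table with the smallest dimension first."""
    plan = []
    for dim in sorted(dimensions, key=lambda t: t.size_bytes):
        plan.append((dim.name, choose_join_algorithm(fact, dim)))
    return plan


# Table names follow TPC-DS; the sizes are made up for the example.
fact = Table("store_sales", 400 * 1024**3)
dims = [
    Table("customer", 2 * 1024**3),
    Table("date_dim", 10 * 1024**2),
    Table("item", 50 * 1024**2),
]
plan = plan_star_join(fact, dims)
print(plan)
```

Small dimension tables end up early in the plan and are broadcast, while the one large dimension is joined last with a repartition join.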
So, Why should you use Tajo?
● Various storage type support
31
Logical Data Warehouse with Tajo
34
[Diagram: a global view over Application, DBMS, NoSQL, cloud storage, and on-premise storage]
● Fast delivery
● Easy maintenance
● Simple data flow
How fast is Tajo?
35
Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 cores, 30 GB RAM
36
Target Systems
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Uses all available CPU and memory
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Uses all available CPU and memory
37
Target Systems
● Spark SQL (1.5.0)
○ Default configuration provided by GCP
■ Uses all available CPU and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions was adjusted for better performance
38
TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Queries
○ Only queries that can be executed on every system without modification
■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
39
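To make the implicit-join point concrete, here is a small sketch of the kind of rewrite involved. These are not the benchmark queries themselves: SQLite stands in for the engines under test, the column names follow the TPC-DS schema, and the rows are made up.

```python
import sqlite3

# SQLite is a stand-in engine (assumption); the data is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store_sales (ss_item_sk INTEGER, ss_net_paid REAL);
    CREATE TABLE item (i_item_sk INTEGER, i_category TEXT);
    INSERT INTO store_sales VALUES (1, 10.0), (2, 5.0), (1, 2.5);
    INSERT INTO item VALUES (1, 'Music'), (2, 'Books');
    """
)

# Implicit join: tables listed in FROM, join condition placed in WHERE.
implicit_join = """
    SELECT i_category, SUM(ss_net_paid)
    FROM store_sales, item
    WHERE ss_item_sk = i_item_sk
    GROUP BY i_category
    ORDER BY i_category
"""

# Explicit join: the same query rewritten with JOIN ... ON.
explicit_join = """
    SELECT i_category, SUM(ss_net_paid)
    FROM store_sales
    JOIN item ON ss_item_sk = i_item_sk
    GROUP BY i_category
    ORDER BY i_category
"""

implicit_rows = conn.execute(implicit_join).fetchall()
explicit_rows = conn.execute(explicit_join).fetchall()
print(implicit_rows == explicit_rows)
```

The two forms return the same rows; only the syntax differs, which is why the rewrite affects every query but not the results.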
SF 1000, 50 instances
40
[Charts, slides 40-42: benchmark results for SF 1000 on 50 instances]
Chart annotation: "Cannot be run on 1TB"
SF 10000, 50 instances
43
[Charts, slides 43-44: benchmark results for SF 10000 on 50 instances]
Demo
45
Simple Demo on EMR
46
● Uses the TPC-H data set, with tables spread across storage systems
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
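The demo layout can be imitated with a small sketch of a query that joins tables living in different stores. This is a simulation, not the demo itself: three attached in-memory SQLite databases play the roles of HDFS, PostgreSQL, and S3, the schema names `hdfs`, `pg`, and `s3` are hypothetical, and the rows are made up. Column names follow the TPC-H schema.

```python
import sqlite3

# One SQLite connection with three attached in-memory databases simulates
# three separate storage systems (assumption; the demo used HDFS, PostgreSQL,
# and S3 behind Tajo).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS hdfs")
conn.execute("ATTACH DATABASE ':memory:' AS pg")
conn.execute("ATTACH DATABASE ':memory:' AS s3")

conn.executescript(
    """
    CREATE TABLE hdfs.lineitem (l_orderkey INTEGER, l_extendedprice REAL);
    CREATE TABLE pg.orders (o_orderkey INTEGER, o_custkey INTEGER);
    CREATE TABLE s3.customer (c_custkey INTEGER, c_name TEXT);
    INSERT INTO hdfs.lineitem VALUES (1, 100.0), (1, 50.0), (2, 25.0);
    INSERT INTO pg.orders VALUES (1, 10), (2, 20);
    INSERT INTO s3.customer VALUES (10, 'Customer#10'), (20, 'Customer#20');
    """
)

# A single query spans all three "stores", as in a federated global view.
revenue_by_customer = conn.execute(
    """
    SELECT c.c_name, SUM(l.l_extendedprice) AS revenue
    FROM hdfs.lineitem AS l
    JOIN pg.orders AS o ON l.l_orderkey = o.o_orderkey
    JOIN s3.customer AS c ON o.o_custkey = c.c_custkey
    GROUP BY c.c_name
    ORDER BY revenue DESC
    """
).fetchall()
print(revenue_by_customer)
```

The query text does not care where each table is stored; only the table's qualified name changes.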
Apache Tajo
● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data sources
47
Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org
48
Useful Links
49
● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
● How to set up Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
Q & A
50

More Related Content

What's hot

Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter
 
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...Insight Technology, Inc.
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexesDaniel Lemire
 
Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Gruter
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Zekeriya Besiroglu
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseHakan Ilter
 
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...DataStax
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallSpark Summit
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Lviv Startup Club
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016DataStax
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAmir Sedighi
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationMongoDB
 
Realtime Analytics with Druid
Realtime Analytics with DruidRealtime Analytics with Druid
Realtime Analytics with DruidSeungWoo Han
 
Heuritech: Apache Spark REX
Heuritech: Apache Spark REXHeuritech: Apache Spark REX
Heuritech: Apache Spark REXdidmarin
 

What's hot (20)

Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
[db tech showcase Tokyo 2017] C23: Lessons from SQLite4 by SQLite.org - Richa...
 
Engineering fast indexes
Engineering fast indexesEngineering fast indexes
Engineering fast indexes
 
Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
Bigdata Nedir? Hadoop Nedir? MapReduce Nedir? Big Data.
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...
SASI: Cassandra on the Full Text Search Ride (DuyHai DOAN, DataStax) | C* Sum...
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
Vitalii Bondarenko - Масштабована бізнес-аналітика у Cloud Big Data Cluster. ...
 
Treasure Data Cloud Strategy
Treasure Data Cloud StrategyTreasure Data Cloud Strategy
Treasure Data Cloud Strategy
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoop
 
Webinar: Performance Tuning + Optimization
Webinar: Performance Tuning + OptimizationWebinar: Performance Tuning + Optimization
Webinar: Performance Tuning + Optimization
 
Realtime Analytics with Druid
Realtime Analytics with DruidRealtime Analytics with Druid
Realtime Analytics with Druid
 
Heuritech: Apache Spark REX
Heuritech: Apache Spark REXHeuritech: Apache Spark REX
Heuritech: Apache Spark REX
 

Viewers also liked

MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기Gruter
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoGruter
 
응답하라 청춘
응답하라 청춘 응답하라 청춘
응답하라 청춘 hhj7934
 
Boomer 마케팅2팀 네이버&멜론 제안서
Boomer 마케팅2팀 네이버&멜론 제안서Boomer 마케팅2팀 네이버&멜론 제안서
Boomer 마케팅2팀 네이버&멜론 제안서jackboom
 
Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014Gruter
 
Introduction to Apache Tajo
Introduction to Apache TajoIntroduction to Apache Tajo
Introduction to Apache TajoGruter
 
내가 좋아하는 인터넷_서비스
내가 좋아하는 인터넷_서비스내가 좋아하는 인터넷_서비스
내가 좋아하는 인터넷_서비스oinging17
 
Data-Driven Government: Explore the Four Pillars of Value
Data-Driven Government: Explore the Four Pillars of ValueData-Driven Government: Explore the Four Pillars of Value
Data-Driven Government: Explore the Four Pillars of ValueThomas Robbins
 
홍익경영혁신2015 a831219 이규희
홍익경영혁신2015 a831219 이규희홍익경영혁신2015 a831219 이규희
홍익경영혁신2015 a831219 이규희gyuhee
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetupiwrigley
 
Webinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase PowerWebinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase PowerZaloni
 
India, Internet of things and the role of government
India, Internet of things and the role of governmentIndia, Internet of things and the role of government
India, Internet of things and the role of governmentSyam Madanapalli
 
디지털 음악시장 현황보고
디지털 음악시장 현황보고디지털 음악시장 현황보고
디지털 음악시장 현황보고SeungBeom Ha
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
What is data-driven government for public safety?
What is data-driven government for public safety?What is data-driven government for public safety?
What is data-driven government for public safety?IBM Analytics
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
IoT - Make in India: Challenges and Opportunities
IoT - Make in India: Challenges and OpportunitiesIoT - Make in India: Challenges and Opportunities
IoT - Make in India: Challenges and OpportunitiesSyam Madanapalli
 
중간발표
중간발표중간발표
중간발표rururuadf
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldCloudera, Inc.
 

Viewers also liked (20)

MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기MelOn 빅데이터 플랫폼과 Tajo 이야기
MelOn 빅데이터 플랫폼과 Tajo 이야기
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with Tajo
 
응답하라 청춘
응답하라 청춘 응답하라 청춘
응답하라 청춘
 
최인철
최인철최인철
최인철
 
Boomer 마케팅2팀 네이버&멜론 제안서
Boomer 마케팅2팀 네이버&멜론 제안서Boomer 마케팅2팀 네이버&멜론 제안서
Boomer 마케팅2팀 네이버&멜론 제안서
 
Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014Elastic Search Performance Optimization - Deview 2014
Elastic Search Performance Optimization - Deview 2014
 
Introduction to Apache Tajo
Introduction to Apache TajoIntroduction to Apache Tajo
Introduction to Apache Tajo
 
내가 좋아하는 인터넷_서비스
내가 좋아하는 인터넷_서비스내가 좋아하는 인터넷_서비스
내가 좋아하는 인터넷_서비스
 
Data-Driven Government: Explore the Four Pillars of Value
Data-Driven Government: Explore the Four Pillars of ValueData-Driven Government: Explore the Four Pillars of Value
Data-Driven Government: Explore the Four Pillars of Value
 
홍익경영혁신2015 a831219 이규희
홍익경영혁신2015 a831219 이규희홍익경영혁신2015 a831219 이규희
홍익경영혁신2015 a831219 이규희
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Webinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase PowerWebinar -Data Warehouse Augmentation: Cut Costs, Increase Power
Webinar -Data Warehouse Augmentation: Cut Costs, Increase Power
 
India, Internet of things and the role of government
India, Internet of things and the role of governmentIndia, Internet of things and the role of government
India, Internet of things and the role of government
 
디지털 음악시장 현황보고
디지털 음악시장 현황보고디지털 음악시장 현황보고
디지털 음악시장 현황보고
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
What is data-driven government for public safety?
What is data-driven government for public safety?What is data-driven government for public safety?
What is data-driven government for public safety?
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
IoT - Make in India: Challenges and Opportunities
IoT - Make in India: Challenges and OpportunitiesIoT - Make in India: Challenges and Opportunities
IoT - Make in India: Challenges and Opportunities
 
중간발표
중간발표중간발표
중간발표
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
 

Similar to Introduction to Apache Tajo: Future of Data Warehouse

Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeDataWorks Summit
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series databasefelixbarny
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Jonathan Singer
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsZhenxiao Luo
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloudOVHcloud
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleItai Yaffe
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraShubham Tagra
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodbPGConf APAC
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheapMarc Cluet
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
High performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodbHigh performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodbWei Shan Ang
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_enOgibayashi
 

Similar to Introduction to Apache Tajo: Future of Data Warehouse (20)

Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloud
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@Myntra
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheap
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
High performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodbHigh performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodb
 
20140120 presto meetup_en
20140120 presto meetup_en20140120 presto meetup_en
20140120 presto meetup_en
 

More from Gruter

스타트업사례로 본 로그 데이터분석 : Tajo on AWS
스타트업사례로 본 로그 데이터분석 : Tajo on AWS스타트업사례로 본 로그 데이터분석 : Tajo on AWS
스타트업사례로 본 로그 데이터분석 : Tajo on AWSGruter
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondGruter
 
 

 

 

 

Introduction to Apache Tajo: Future of Data Warehouse

  • 1. Introduction to Apache Tajo: Future of Data Warehouse Jihoon Son / Gruter Inc.
  • 2. I am ● Jihoon Son (@jihoonson) ○ Ph.D. at Korea Univ. ○ Tajo project co-founder ○ Committer and PMC member of Apache Tajo ○ Research engineer at Gruter ○ LinkedIn ■ https://www.linkedin.com/in/jihoonson
  • 3. Today's Topic: Tajo ● What is Tajo? ○ Tajo / tάːzo / 타조 ○ Ostrich in Korean ■ Fastest two-legged animal in the world
  • 4. Today's Topic: Tajo ● What is Apache Tajo? ○ Our Ostrich can do SQL processing on big data! ■ SQL-on-Hadoop system ■ Apache top-level project
  • 5. Maybe You Think ... SQL-on-Hadoop? Boring..
  • 6. This Ostrich is Different!
  • 10. SQL-on-Hadoop Systems: Long-running ETL jobs ● Requirements ○ Stable query execution ■ Fault tolerance ● Can avoid query resubmission ○ Adaptation to a dynamic environment ■ Available resources, unpredictable delays, ...
  • 11. SQL-on-Hadoop Systems: Low-latency interactive analysis ● Requirements ○ Fast query execution ■ Several query execution techniques ■ In-memory processing
  • 12. Tajo is designed for Both Workloads ● Long-running ETL jobs ● Low-latency interactive analysis
  • 13. Who is using Tajo?
  • 14. Use Cases: SK Telecom ● Data warehousing & analysis ○ 1st telco in South Korea ■ 40 TB/day compressed data (2014)
  • 15. SK Telecom: Before Tajo [Diagram: Operational Systems (Marketing, Sales, ERP, SCM), Integration Layer (ODS, Staging Area, Data Vault), and Data Warehouse (Data Marts, Strategic Marts), connected by ETL; platforms labeled Hadoop and MPP DBMS]
  • 16. SK Telecom: After Tajo [Diagram: same layout as slide 15, without the Hadoop and MPP DBMS labels]
  • 17. SK Telecom: After Tajo [Diagram: same layout as slide 16] ● Long-running ETL jobs ● Ad-hoc analysis
  • 18. Use Cases: SK Telecom ● Significantly reduced ETL & analysis time ○ Daily analysis becomes possible ○ More exploratory analysis becomes possible with the remaining resources
  • 19. Use Cases: Bluehole Studio ● Game log analysis ○ Finding principal causes of service-quality deficiencies
  • 20. Use Cases: Bluehole Studio ● Tajo on EMR
  • 21. Use Cases: Bluehole Studio ● Their first log analysis system ○ Easy and rapid deployment of Tajo ○ Gentle learning curve thanks to standard SQL ● Immediate action becomes possible for user complaints and hidden bugs
  • 22. Use Cases: Melon ● Data discovery ○ Music streaming service (26 million users) ○ Analysis of purchase history for target marketing ● Significantly reduced analysis time ○ Faster analysis by replacing Hive with Tajo ○ More analysis becomes possible
  • 23. So, why should you use Tajo?
  • 27. So, why should you use Tajo? ● Easy to use ○ ANSI-SQL standard compliance (2003) ■ CTAS, window functions, ... ○ Mature SQL features ■ Most existing queries can be executed without modification ○ Various data format support ■ Text, JSON, ORC, Parquet, …
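
The two SQL features named on this slide, CTAS and window functions, can be illustrated with a small sketch. It does not talk to Tajo: Python's built-in `sqlite3` module (SQLite 3.25 or later is assumed for window functions) stands in as the SQL engine, and the `purchases` table and its rows are hypothetical.

```python
import sqlite3

# SQLite stands in for Tajo here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (user_id INTEGER, item TEXT, amount REAL);
    INSERT INTO purchases VALUES
        (1, 'album', 12.0), (1, 'single', 2.0),
        (2, 'album', 10.0), (2, 'album', 15.0), (2, 'single', 1.5);
""")

# CTAS (CREATE TABLE AS SELECT): materialize a query result as a new table.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(amount) AS total
    FROM purchases
    GROUP BY user_id
""")
totals = conn.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id"
).fetchall()

# Window function: rank each purchase by amount within its user's partition.
ranked = conn.execute("""
    SELECT user_id, item, amount,
           RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rnk
    FROM purchases
    ORDER BY user_id, rnk
""").fetchall()

for row in ranked:
    print(row)
```

Because both statements are standard SQL, the same query text is the kind of query the slide says can usually be moved between engines without modification; engine-specific details such as storage clauses are outside this sketch.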
  • 28. So, why should you use Tajo? ● Optimized performance
  • 29. So, why should you use Tajo? ● Optimized performance ○ Optimized code ■ Optimized I/O performance ● Nearly max I/O performance (~120MB/s) per disk ■ Off-heap data processing ● Mitigating GC overhead
  • 30. So, why should you use Tajo? ● Optimized performance ○ Cost-based query plan optimization ■ Join ordering ■ Best algorithm selection ● According to input size ■ Progressive optimization ● Further optimizes the query plan during query execution ● Especially effective for long-running queries ■ => Efficient star schema processing
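
To make "join ordering" and "best algorithm selection according to input size" concrete, here is a toy sketch of cost-based planning. It is not Tajo's optimizer: the cost model (sum of intermediate result sizes over left-deep plans), the table names, row counts, selectivities, and the broadcast threshold are all assumptions chosen for illustration.

```python
from itertools import permutations

# Hypothetical statistics; none of these numbers come from the slides.
ROWS = {"big": 1_000_000, "mid": 10_000, "small": 100}
SELECTIVITY = {
    frozenset({"big", "mid"}): 1e-4,
    frozenset({"mid", "small"}): 1e-2,
}
BROADCAST_LIMIT = 10_000  # assumed row threshold for choosing a broadcast join


def join_size(joined, left_rows, table):
    """Estimate output rows when `table` is joined onto the current result."""
    sel = 1.0
    for t in joined:
        # Pairs without a join predicate keep selectivity 1.0 (cross product).
        sel *= SELECTIVITY.get(frozenset({t, table}), 1.0)
    return left_rows * ROWS[table] * sel


def plan_cost(order):
    """Cost of a left-deep plan: the sum of its intermediate result sizes."""
    joined, rows, cost = [order[0]], ROWS[order[0]], 0.0
    for table in order[1:]:
        rows = join_size(joined, rows, table)
        cost += rows
        joined.append(table)
    return cost


def best_order(tables):
    """Exhaustively pick the cheapest join order (fine for a few tables)."""
    return min(permutations(tables), key=plan_cost)


def pick_algorithm(smaller_input_rows):
    """Choose a join algorithm from the size of the smaller input."""
    return "broadcast" if smaller_input_rows <= BROADCAST_LIMIT else "shuffle-hash"


order = best_order(["big", "mid", "small"])
print(order, plan_cost(order))
print(pick_algorithm(ROWS["small"]), pick_algorithm(ROWS["big"]))
```

Under these assumed statistics the cheapest plans join the two smaller tables first and bring in the large table last, which is the intuition behind efficient star schema processing: keep intermediate results small before touching the fact table.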
  • 31. So, why should you use Tajo? ● Various storage type support
  • 32. So, why should you use Tajo? ● Various storage type support
  • 33. Logical Data Warehouse with Tajo [Diagram labels: Application, Global view, DBMS, NoSQL, Cloud storage, On-premise storage]
  • 34. Logical Data Warehouse with Tajo [Diagram labels: Application, Global view, DBMS, NoSQL, Cloud storage, On-premise storage] ● Fast delivery ● Easy maintenance ● Simple data flow
  • 35. How fast is Tajo?
  • 36. Evaluation on Cloud Environment ● Google Cloud Platform ○ Instance type: n1-standard-8 ■ 8 cores, 30GB RAM
  • 37. Target Systems ● Hive (0.12) ○ Baseline performance ○ Default configuration provided by GCP ■ Uses all of the CPU and memory ● Tajo (0.11.0) ○ Default configuration provided by GCP ■ Uses all of the CPU and memory
  • 38. Target Systems ● Spark-SQL (1.5.0) ○ Default configuration provided by GCP ■ Uses all of the CPU and memory ■ Tungsten enabled by default ○ spark.sql.shuffle.partitions is adjusted for better performance
  • 39. TPC-DS ● Data ○ 24 tables ■ Plain text format ■ Stored on Google Cloud Storage ● Queries ○ Those that can be executed on every system without modification ■ Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
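
The Hive rewrite mentioned on this slide amounts to turning an implicit join (comma-separated `FROM` list with the join condition in `WHERE`) into an explicit `JOIN ... ON`. The sketch below shows that the two forms return the same result. It uses Python's `sqlite3` as a stand-in engine; the table and column names follow TPC-DS, but the rows are made up and the query is not one of the benchmark queries.

```python
import sqlite3

# SQLite stands in for the benchmarked engines; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_sales (ss_item_sk INTEGER, ss_quantity INTEGER);
    CREATE TABLE item (i_item_sk INTEGER, i_category TEXT);
    INSERT INTO store_sales VALUES (1, 3), (2, 5), (1, 4);
    INSERT INTO item VALUES (1, 'Books'), (2, 'Music');
""")

# Implicit join: tables listed with commas, join condition in WHERE.
implicit_sql = """
    SELECT i_category, SUM(ss_quantity)
    FROM store_sales, item
    WHERE ss_item_sk = i_item_sk
    GROUP BY i_category
    ORDER BY i_category
"""

# Explicit join: the same condition moved into JOIN ... ON.
explicit_sql = """
    SELECT i_category, SUM(ss_quantity)
    FROM store_sales JOIN item ON ss_item_sk = i_item_sk
    GROUP BY i_category
    ORDER BY i_category
"""

implicit_rows = conn.execute(implicit_sql).fetchall()
explicit_rows = conn.execute(explicit_sql).fetchall()
print(implicit_rows == explicit_rows, explicit_rows)
```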
  • 40. SF 1000, 50 instances [chart]
  • 41. SF 1000, 50 instances [chart]
  • 42. SF 1000, 50 instances [chart; annotation: Cannot be run on 1TB]
  • 43. SF 10000, 50 instances [chart]
  • 44. SF 10000, 50 instances [chart]
  • 46. Simple Demo on EMR ● Using TPC-H data set, but ○ Lineitem table is stored on HDFS ○ Orders table is stored on PostgreSQL ○ Other tables are stored on S3
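
The idea behind the demo, one SQL statement joining tables that live in different stores, can be imitated in miniature. This is only an analogy, not Tajo's federation mechanism: two separate in-memory SQLite databases stand in for HDFS and PostgreSQL, the schema name `pg` is hypothetical, and the rows are invented. Column names follow TPC-H.

```python
import sqlite3

# Two independent SQLite databases play the roles of two storage systems.
conn = sqlite3.connect(":memory:")                # stand-in for HDFS
conn.execute("ATTACH DATABASE ':memory:' AS pg")  # stand-in for PostgreSQL

conn.executescript("""
    CREATE TABLE main.lineitem (l_orderkey INTEGER, l_extendedprice REAL);
    CREATE TABLE pg.orders (o_orderkey INTEGER, o_orderstatus TEXT);
    INSERT INTO main.lineitem VALUES (1, 100.0), (1, 50.0), (2, 70.0);
    INSERT INTO pg.orders VALUES (1, 'F'), (2, 'O');
""")

# One query spans both stores; the engine resolves each table by its schema.
rows = conn.execute("""
    SELECT o.o_orderstatus, SUM(l.l_extendedprice)
    FROM main.lineitem AS l
    JOIN pg.orders AS o ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderstatus
    ORDER BY o.o_orderstatus
""").fetchall()
print(rows)
```

The user-facing point is the same as in the demo: the query author addresses tables by name and does not need to move data between systems before joining it.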
  • 47. Apache Tajo ● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis ● Is very fast ● Supports query federation on diverse data sources
  • 48. Get Involved! ● We are recruiting contributors! ● General ○ http://tajo.apache.org/ ● Getting Started ○ http://tajo.apache.org/docs/current/getting_started.html ● Downloads ○ http://tajo.apache.org/downloads.html ● Issue tracker ○ http://issues.apache.org/jira/browse/TAJO ● Join the mailing list ○ dev-subscribe@tajo.apache.org ○ issues-subscribe@tajo.apache.org
  • 49. Useful Links ● EMR bootstrap ○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo ● How to set up Tajo on EMR ○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/