2. I am
● Jihoon Son (@jihoonson)
○ Ph.D at Korea Univ.
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ Linkedin
■ https://www.linkedin.com/in/jihoonson
2
3. Today's Topic: Tajo
● What is Tajo?
○ Tajo / tάːzo / 타조
○ Ostrich in Korean
■ Fastest two-legged animal in
the world
3
4. Today's Topic: Tajo
● What is Apache Tajo?
○ Our Ostrich can do SQL
processing on big data!
■ SQL-on-Hadoop system
■ Apache Top-level project
4
18. Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ More exploratory analysis is newly available
with remaining resources
18
19. Use Cases: Bluehole Studio
● Game log analysis
○ Finding principal
causes of service-
quality deficiencies
19
21. Use Cases: Bluehole Studio
● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Low learning curve with SQL standard
● Immediate action becomes possible for
user complaints and hidden bugs
21
22. Use Cases: Melon
● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for target
marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible
22
25. So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
25
26. So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without
modification
26
27. So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without
modification
○ Various data format support
■ Text, JSON, Orc, Parquet, …
27
29. So, Why should you use Tajo?
● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Nearly max I/O performance (~120MB/s) per disk
■ Off-heap data processing
● Mitigating GC overhead
29
30. So, Why should you use Tajo?
● Optimized performance
○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimize the query plan during query execution
● Especially excellent for long running queries
■ => Efficient start schema processing
30
31. So, Why should you use Tajo?
● Various storage type support
31
32. So, Why should you use Tajo?
● Various storage type support
32
33. Logical Data Warehouse with Tajo
33
Global view
Application DBMS NoSQL
Cloud
storage
On-premise
storage
34. Logical Data Warehouse with Tajo
34
Global view
Application DBMS NoSQL
Cloud
storage
On-premise
storage
● Fast delivery
● Easy maintenance
● Simple data flow
36. Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 core, 30GB RAM
36
37. Target Systems
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Use the whole cpu and memory
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Use the whole cpu and memory
37
38. Target Systems
● Spark-SQL (1.5.0)
○ Default configuration provided by GCP
■ Use the whole cpu and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions is
adjusted for better performance
38
39. TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Query
○ Which can be executed on every system
without modifications
■ For Hive, 0.12 doesn't support implicit join, so
every query had to be changed
39
46. Simple Demo on EMR
46
● Using TPC-H data set, but
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
47. Apache Tajo
● Is excellent for both long-running ETL jobs
and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data
sources
47
48. Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org
48
49. Useful Links
49
● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-
actions/tree/master/tajo
● How to setup Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-
tajo-cluster-on-amazon-emr/