1. TREASURE DATA
USER DEFINED PARTITIONING
A New Partitioning Strategy Accelerating CDP Workloads
Kai Sasaki
Software Engineer at Treasure Data
2. ABOUT ME
- Kai Sasaki (@Lewuathe)
- Software Engineer at Treasure Data since 2015, working in the Query Engine Team (managing Hive and Presto in Treasure Data)
- Contributor to Hadoop, Spark, and Presto
3. TOPICS
PlazmaDB
PlazmaDB is the metadata storage for all log data in
Treasure Data. It supports import, export, INSERT
INTO, CREATE TABLE, DELETE, etc. on top of the
PostgreSQL transaction mechanism.
Time Index Partitioning
Partitions log data by the time the log was generated,
which is stored in the "time" column in Treasure Data.
This lets the engine skip reading unnecessary partitions.
User Defined Partitioning
(New!)
In addition to the "time" column, any column can be
used as a partitioning key. This provides a more
flexible partitioning strategy that fits CDP workloads.
5. PRESTO IN TREASURE DATA
• Multiple clusters, each with 50 to 60 workers
• Presto 0.188
Stats (at the end of 2017)
• 4.3+ million queries / month
• 400 trillion records / month
• 6+ PB / month
6. HIVE AND PRESTO ON PLAZMADB
[Diagram: data flows from Bulk Import, Fluentd, and the Mobile SDK into PlazmaDB (backed by Amazon S3); Presto and Hive run SQL and CDP workloads on top of it.]
9. PROBLEM
• Time index partitioning is efficient only when a "time" predicate is specified. Predicates on other columns cause a full scan, which can degrade performance.
• The number of records in a partition depends heavily on the table type and how the user writes data.

SELECT COUNT(1) FROM table WHERE user_id = 1;

     id       data_set_id  first_index_key  last_index_key  record_count  path
P1:  3065124  100          1412323028       1412385139      1             abcdefg-1234567-abcdefg-1234567
P2:  3065125  100          1412323030       1412324030      1             abcdefg-1234567-abcdefg-9182841
P3:  3065126  100          1412327028       1412328028      1             abcdefg-1234567-abcdefg-5818231
P4:  3065127  200          1412325011       1412326001      101021        abcdefg-1234567-abcdefg-7271828
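The pruning behavior behind this problem can be sketched in Python: partitions are indexed only by their time range, so a predicate on "time" narrows the scan, while a predicate on user_id gives the planner nothing to prune with. This is an illustrative sketch, not the actual PlazmaDB schema or code.

```python
# Illustrative time-index pruning sketch (names and data are made up).
partitions = [
    # (first_index_key, last_index_key, path)
    (1412323028, 1412385139, "p1"),
    (1412323030, 1412324030, "p2"),
    (1412327028, 1412328028, "p3"),
    (1412325011, 1412326001, "p4"),
]

def prune_by_time(partitions, t_min, t_max):
    """Keep only partitions whose time range overlaps [t_min, t_max]."""
    return [p for p in partitions if p[0] <= t_max and p[1] >= t_min]

# A predicate on "time" prunes the partition list down...
matching = prune_by_time(partitions, 1412327000, 1412328000)

# ...but a predicate on another column (e.g. user_id) gives no time bounds,
# so every partition must be scanned: a full scan.
```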
11. USER DEFINED PARTITIONING
• Users can specify a partitioning strategy that fits their usage by choosing a partitioning key column and a maximum time range.
[Diagram: each 1h time range along the "time" axis is further split into buckets v1, v2, v3 by the values of partitioning key column c1.]
12. USER DEFINED PARTITIONING
[Diagram: a query such as "... WHERE c1 = 'v1' AND time = ..." reads only the v1 bucket within the matching 1h time range.]
14. USER DEFINED PARTITIONING
• Set the user-defined configuration: the number of buckets, the hash function, and the partitioning key.
• CREATE TABLE via Presto or Hive: insert data partitioned by the configured partitioning key.
• Read the data from the UDP table: the UDP table is visible via both Presto and Hive.
15. USER DEFINED CONFIGURATION
• We need to configure which columns are used as the partitioning key and the number of buckets. This is a per-table configuration set by each user.

     user_table_id  columns                                      bucket_count  partition_function
T1:  141849         [["o_orderkey","long"]]                      32            hash
T2:  141850         [["user_id","long"]]                         32            hash
T3:  141910         [["item_id","long"]]                         16            hash
T4:  151242         [["region_id","long"],["device_id","long"]]  256           hash
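The bucket for a row could be derived roughly as follows. This is a minimal sketch assuming a simple hash-mod scheme; the actual hash behind the `hash` partition_function is internal to Plazma, and `zlib.crc32` here is only a stand-in deterministic hash.

```python
import zlib

def bucket_number(key, bucket_count):
    # Same key value always maps to the same bucket, so all rows for one
    # user_id (for example) land in exactly one of bucket_count buckets.
    # zlib.crc32 is an illustrative stand-in for the configured hash function.
    return zlib.crc32(str(key).encode()) % bucket_count
```

With bucket_count = 32 (as for T1/T2 above), every row shares its bucket with all other rows carrying the same key value, which is what later makes predicate pruning and colocated joins possible.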
16. CREATE UDP TABLE VIA PRESTO
• Presto and Hive support CREATE TABLE / INSERT INTO on UDP tables.

CREATE TABLE udp_customer
WITH (
  bucketed_on = array['customer_id'],
  bucket_count = 128
)
AS SELECT * FROM normal_customer;
17. CREATE UDP TABLE VIA PRESTO
• Override ConnectorPageSink to write MPC1 files based on the user-defined partitioning key.
[Diagram: each incoming Page flows into PlazmaPageSink; a PartitionedMPCWriter routes rows to per-bucket writers (b1, b2, b3, ...), each a TimeRangeMPCWriter that splits rows into 1h time ranges backed by a BufferedMPCWriter.]
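The routing idea in the diagram can be sketched in Python. The writer names below mirror the diagram, but the logic is illustrative, not the real implementation: rows are first assigned to a bucket by the partitioning key, then split into 1h time ranges, where each (bucket, hour) chunk corresponds to one MPC1 file.

```python
import zlib
from collections import defaultdict

BUCKET_COUNT = 4      # illustrative; the real count comes from the table config
ONE_HOUR = 3600

def route_rows(rows):
    """rows: iterable of (time, key, payload).
    Returns {(bucket, hour_start): [payload, ...]} -- one entry per chunk
    that a buffered writer would flush as an MPC1 file."""
    chunks = defaultdict(list)
    for time, key, payload in rows:
        bucket = zlib.crc32(str(key).encode()) % BUCKET_COUNT  # bucket routing
        hour_start = time - time % ONE_HOUR                    # 1h time range
        chunks[(bucket, hour_start)].append(payload)           # buffer per chunk
    return chunks
```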
19. CREATE UDP TABLE VIA PRESTO
• A new bucket_number column is added to the partition record in PlazmaDB.

     id       data_set_id  first_index_key  last_index_key  record_count  path                             bucket_number
P1:  3065124  187250       1412323028       1412385139      109           abcdefg-1234567-abcdefg-1234567  1
P2:  3065125  187250       1412323030       1412324030      209           abcdefg-1234567-abcdefg-9182841  2
P3:  3065126  187250       1412327028       1412328028      31            abcdefg-1234567-abcdefg-5818231  3
P4:  3065127  187250       1412325011       1412326001      102           abcdefg-1234567-abcdefg-7271828  2
P5:  3065128  281254       1412324214       1412325210      987           abcdefg-1234567-abcdefg-6717284  16
P6:  3065129  281254       1412325123       1412329800      541           abcdefg-1234567-abcdefg-5717274  14
20. READ DATA FROM UDP TABLE
ConnectorSplitManager#getSplits
returns data source splits to be read by Presto
cluster.
Decide target bucket from constraint
Constraint specifies the range should be read from
the table. ConnectorSplitManager asks PlazmaDB to
get the partitions on the target bucket.
Override Presto Connector to data source
Presto provides a plugin mechanism to connect any
data source flexibly. The connector provides the
information about metadata and location of real data
source, UDFs.
Receive constraint as TupleDomain
TupleDomain is created from query plan and
passed through TableLayout which is available
in ConnectorSplitManager
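The bucket-pruning step can be sketched as follows; `target_buckets` and the hash are illustrative, not the actual connector code. When the TupleDomain pins the partitioning key to a discrete set of values, only the buckets those values hash to need to be fetched from PlazmaDB.

```python
import zlib

BUCKET_COUNT = 32   # from the table's user-defined configuration

def target_buckets(allowed_values):
    """allowed_values: discrete values the constraint permits for the
    partitioning key, or None if the key is unconstrained."""
    if allowed_values is None:
        # No usable predicate on the partitioning key: read every bucket.
        return set(range(BUCKET_COUNT))
    # Each allowed value hashes to exactly one bucket; duplicates collapse.
    return {zlib.crc32(str(v).encode()) % BUCKET_COUNT for v in allowed_values}
```

The resulting set is what ends up in the "WHERE bucket_number IN (...)" filter sent to PlazmaDB.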
21. READ DATA FROM UDP TABLE
SplitManager
PlazmaDB
TableLayout
SQL
constraint
Map<ColumnHandle, Domain>
Distribute PageSource
… WHERE bucker_number in () …
24. COLOCATED JOIN
[Diagram: in a distributed join, matching rows (l1/r1, l2/r2, l3/r3) of the left and right tables must be shuffled across nodes before joining; in a colocated join, both tables are bucketed on the join key, so matching buckets already reside together and each pair joins locally without a shuffle.]
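A toy sketch of why colocation removes the shuffle, assuming both sides use the same bucket count and hash (all names here are illustrative): rows with equal keys land in the same bucket on both sides, so each bucket pair can be joined entirely locally.

```python
import zlib
from collections import defaultdict

BUCKETS = 3

def bucket(key):
    # Stand-in hash; both tables must use the same function and bucket count.
    return zlib.crc32(str(key).encode()) % BUCKETS

def bucketize(rows):
    """Group (key, value) rows by bucket, as the writers did at insert time."""
    out = defaultdict(list)
    for k, v in rows:
        out[bucket(k)].append((k, v))
    return out

def colocated_join(left, right):
    """Join bucket-by-bucket: equal keys share a bucket on both sides,
    so no cross-bucket (i.e. cross-node) data movement is needed."""
    lb, rb = bucketize(left), bucketize(right)
    joined = []
    for b in range(BUCKETS):
        rmap = defaultdict(list)
        for k, rv in rb.get(b, []):
            rmap[k].append(rv)
        for k, lv in lb.get(b, []):
            for rv in rmap.get(k, []):
                joined.append((k, lv, rv))
    return joined
```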
25. PERFORMANCE COMPARISON
• SQLs on TPC-H (scale factor = 1000)
[Chart: elapsed time (0 to 80 sec) for the between, mod_predicate, and count_distinct queries, comparing NORMAL and UDP tables.]
26. USER DEFINED PARTITIONING
[Diagram: with a predicate only on time ("... WHERE time = ..."), every bucket (v1, v2, v3) of column c1 within the matching 1h time range must still be read; time index pruning alone applies.]
27. FUTURE WORKS
• Maintaining an efficient partitioning structure
• Developing a Stella job to rearrange the partitioning schema flexibly using Presto resources
• Supporting UDP tables from various kinds of pipelines (streaming import, etc.)
• Documentation