4. Agenda
- Who is Treasure Data
- What is distributed data analysis?
- What kinds of challenges do we have?
- Our approach
- Columnar Storage
- Partitioning
- Repartitioning
25. Partition Size?
• Partition file size significantly affects performance
• 1,000,000 records / file
• 256MB / file
• But the right size depends on the workload
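As a rough illustration of how the two figures above interact, the sketch below estimates how many files a partition would be split into. Reading both figures as per-file upper bounds and spreading records evenly across files are assumptions, and the function name `files_needed` is hypothetical.

```python
import math

# Figures taken from the slide; treating them as per-file upper bounds is an assumption.
MAX_RECORDS_PER_FILE = 1_000_000
MAX_BYTES_PER_FILE = 256 * 1024 * 1024  # 256MB


def files_needed(total_records: int, total_bytes: int) -> int:
    """Smallest file count that keeps every file under both limits,
    assuming records (and bytes) are spread evenly across files."""
    by_records = math.ceil(total_records / MAX_RECORDS_PER_FILE)
    by_bytes = math.ceil(total_bytes / MAX_BYTES_PER_FILE)
    # Whichever limit is hit first decides the count; always write at least one file.
    return max(1, by_records, by_bytes)
```

With narrow rows the record limit tends to decide the file count, and with wide rows the byte limit does, which is one way the right size ends up depending on the workload.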
34. Stella Connector
• Repartitioning & UDP are designed as a Presto connector
• Makes use of Presto's high scalability and reliability for such a heavy workload
35. Stella Connector
CREATE TABLE remerged
WITH (max_file_size = '256MB', max_time_range = '48h') AS
SELECT *
FROM partition.sources
WHERE table_schema = 'tpch_s1'
  AND table_name = 'lineitem'
  AND TD_TIME_RANGE(time, '1998-10-11', '1998-10-20')
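The slides do not show how the connector applies `max_file_size` and `max_time_range`. One plausible reading is that small source files are packed into merged files that stay under both limits. The sketch below shows that idea as a simple greedy pass. It is not the Stella implementation, and `SourceFile` and `plan_merge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SourceFile:
    start: int  # earliest record time in the file, unix seconds
    end: int    # latest record time in the file, unix seconds
    size: int   # file size in bytes


def plan_merge(files, max_file_size=256 * 1024**2, max_time_range=48 * 3600):
    """Greedily pack time-sorted files into groups whose total size and
    covered time span stay within the given limits (assumed semantics)."""
    groups = []
    current, cur_size, cur_start, cur_end = [], 0, None, None
    for f in sorted(files, key=lambda f: f.start):
        # Span the group would cover if this file were added to it.
        new_start = f.start if cur_start is None else cur_start
        new_end = f.end if cur_end is None else max(cur_end, f.end)
        fits = (cur_size + f.size <= max_file_size
                and new_end - new_start <= max_time_range)
        if current and not fits:
            # Close the current group and start a new one with this file.
            groups.append(current)
            current, cur_size = [], 0
            new_start, new_end = f.start, f.end
        # A single file that exceeds a limit on its own still gets its own group.
        current.append(f)
        cur_size += f.size
        cur_start, cur_end = new_start, new_end
    if current:
        groups.append(current)
    return groups
```

Each returned group stands for one output file of the merged table. A real connector would also have to rewrite the columnar data and update partition metadata, which this sketch leaves out.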