5. 2015/7/22 5www.transwarp.io confidential
• 2014年10月,公司在纽约召开的Strata+Hadoop World大会上发
布了Transwarp Data Hub 3.4新版本Hadoop发行版软件
– 这次Strata是近年来规模最大的大数据盛会,有5500多人参加这次
大会,130多家厂商参展,门票在开会前就售罄。这么大规模的盛
会标志着hadoop已经真正成为大数据处理技术的主流地位。这也
是星环首次在美国-大数据的大本营-发布大数据最新产品
– 在这次大会上公司发布了最新的性能数据,相对于Cloudera
Impala,性能更好,SQL支持更完整
• 2014年12月,公司在北京召开的年度大数据大会BDTC上,发布了
TDH4.0最新版本,全面升级了各组件的功能,同时全方位提升了平台
性能
• 2015年4月16日,公司在中国数据库大会上发布TDH4.1版本,更好
地支持数据仓库应用。同时宣布在6月底发布TOS1.0版本,进入
Hadoop on Docker时代。
近期公司大事记
CCTV采访DTCC专访Databricks CEO
Ion Stoica
HBase PMC Chairman
Michael Stack
2014/12 2015/04
Cloudera Founder
Mike Olson
Strata Conference
New York 2014/10
Tony Baer, Big Data Analyst at NY
Cloudera Founder
Mike Olson
Tony Baer, BigData Analyst @NewYork
16. 2015/7/22 16www.transwarp.io confidential
– Slice
– Dice
– Rollup
– Drill Up
– Drill Down
– Pivot
交互式OLAP分析:Distributed Cube
Holodesk – A Columnar Store on SSD cache layer
Executor
Inceptor
Server
Executor Executor Executor
Columnar Store API
Cube(D
1
,D
2
,
D
3
)
INDEX
ColumnD
1
INDEX
ColumnD
2
INDEX
ColumnD
3
INDEX
ColumnM
1
Cube(D
1
,D
2
),
(D
2
,D
3
),(D
1
,
D
3
)
Columnar Store API
Cube(D
1
,D
2
,
D
3
)
INDEX
ColumnD
1
INDEX
ColumnD
2
INDEX
ColumnD3
INDEX
ColumnM
1
Cube(D
1
,D
2
),
(D
2
,D
3
),(D
1
,
D
3
)
Columnar Store API
Cube(D
1
,D
2
,
D
3
)
INDEX
ColumnD
1
INDEX
ColumnD
2
INDEX
ColumnD
3
INDEX
ColumnM
1
Cube(D
1
,D
2
),
(D
2
,D
3
),(D
1
,
D
3
)
Columnar Store API
Cube(D
1
,D
2
,
D
3
)
INDEX
ColumnD
1
INDEX
ColumnD
2
INDEX
ColumnD
3
INDEX
ColumnM
1
Cube(D
1
,D
2
),
(D
2
,D
3
),(D
1
,
D
3
)
如何定义一个Cube?
Cube Size
256KB固定大小
ZK Cluster
• Cube on Transwarp Holodesk
• Cube是OLAP分析的常用技术
create table store_sales tblproperties(
‘cache’=‘ram’,
‘holodesk.dimensions’=‘product,
cities, time’
) as select * from store_sales;
计算下沉到存储层
Compute and filters
pushed down to storage
layer
17. 2015/7/22 17www.transwarp.io confidential
0.9
9.8 12.4 12.1 14.0
1.3
8.8
12.7
20.2
43.3
58.9
86.6
136.1
1.4
55.2 56.5
0
20
40
60
80
100
120
140
160
1 2 3 4 5 6 7 8
执行时间(秒)
w/ cube w/o cube
Holodesk Cube带来的性能加速
Operation SQL query
q1 count select count(*) from store_sales
q2 measure select sum(ss_sales_price) from store_sales
q3 aggregation
select sum(ss_sales_price) from store_sales group by
ss_customer_sk
q4 drill down
select sum(ss_sales_price) from store_sales group by
ss_sold_date_sk
q5 drill down
select sum(ss_sales_price) from store_sales group by
ss_customer_sk, ss_sold_date_sk
q6 slice
select sum(ss_sales_price) from store_sales_r where
ss_customer_sk=5000 group by
ss_customer_sk,ss_sold_date_sk
q7 dice
select sum(ss_sales_price) from store_sales where
ss_sold_date_sk between 2450629 and 2451816 group by
ss_customer_sk
q8 pivot
select sum(ss_sales_price) from store_sales where
ss_customer_sk > 5000 and ss_sold_date_sk between
2450629 and 2451816 group by
ss_customer_sk,ss_sold_date_sk
40亿条记录
共500GB驻留内存
4台两路普通服务器
每台服务器
内存:256GB
CPU:E5-2620v2
网络:万兆网络
18. 2015/7/22 18www.transwarp.io confidential
为SSD设计专有格式 - Holodesk
1 W A
2 X B
3 Y C
4 Z D
5 O E
6 P F
7 Q G
8 R H
Holodesk – A Columnar Store on SSD cache layer
Spark
1 W A
GLOBAL
INDEX
2 X B
Dictionary
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
3 Y C
4 Z D
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
Dictionary
5 O E
6 P F
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
Dictionary
7 Q G
8 R H
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
BITMAP
INDEX
FILTER
Dictionary
HDFS Storage Layer
HDFS Text or ORC or Parquet Files
Memory Tier
SSD Tier
• HDFS Storage Tier – 让应用程序来选择存储层
– Memory as storage tier
– SSD Storage Tier
• 但是,现有的Text以及行列混合(ORC or Parquet)等文件格式都
不足以利用SSD的高性能。
Executor
Spark
Context
Executor Executor Executor
Columnar Store APIColumnar Store APIColumnar Store APIColumnar Store API
File System API
CREATE TABLE t1
TBLPROPERTIES(
"cache"=“SSD”,
“filters”=“hashbucket(360):c1”
) AS
SELECT *
FROM src
DISTRIBUTE BY c1;
• Off-Heap
• Columnar store
• Secondary index
• Table format/access
• SSD as cache
ZK Cluster
20. 2015/7/22 20www.transwarp.io confidential
Cost Based Optimizer
20
Table A
100M
Records
kurt
mary
john
smith
622523454095243
622550042034568
622544334568763
622534878982324
v_name Card_id
1
2
…
…
9999999
10000000
No.
Table B
100M records
JOIN ON
A.card_id=B.card_id
Cost based
optimizer
Table size
Immediate
result size
Data skew
Value
distribution
selectivity
Map Join
Lookup Join
Hash Join
Query Plan
Common Join
Co-Group Join
21. 2015/7/22 21www.transwarp.io confidential
与数据可视化工具良好对接
在数据可视化的过程中Spark扩展支持大量的可视化及报表生成工具,如 Tableau,SAP
Business Objects, Oracle Business Intelligence等,使得基于大数据分析的商业决策更
易被理解和接受,从而将大数据的潜在价值最大化。
业务人员通过简单的拖拽既可定制个性化报表,跳过了数据准备的工作环节。
22. 2015/7/22 22www.transwarp.io confidential
对R语言的完整支持
R package
from Transwarp
R – SQL
Interface
from Transwarp
Tables
Distributed Columnar
Store on SSD
Statistics Library
Machine Learning
Library
Files
Hadoop Distributed File
System
R – Spark
Interface
from Transwarp
Spark RDD
Resilient Distributed
Dataset in Memory
Call parallelized algorithms
Call SQL
call sequential algorithm for distributed dataset
算法名称 TDH MLlib
Min/Max YES YES
Mean/Variance YES YES
Normalization YES YES
Standard scaling YES YES
Correlation YES YES
Histogram YES NO
Bining YES NO
Percentile YES NO
Median YES NO
Boxplot YES NO
Screen YES NO
Cardnality YES NO
Logistic Regression YES YES
Naive Bayes YES YES
SVM YES YES
KMeans YES YES
Collaborative Filtering YES YES
Linear Regression YES YES
Ridge Regression YES YES
Lasso Regression YES YES
GLM YES NO
DecisionTree YES YES
Apriori YES NO
Asocciation rules YES NO
Gradient Boosted Trees YES NO
Random Forest YES NO
Deep Learning YES NO
R Studio
32. 2015/7/22 32www.transwarp.io confidential
Typical Case: Streaming in Intelligent Transportation System
Inceptor
Hyperbase
Real time database
• Real time picture
serving for road
segments
• Legacy applications
• Real time road condition
• Average speed estimation
• Regulation Check
… …
• Traffic info online serving
• Traffic pattern mining
Streaming Cluster
Inceptor
Kafka
Distributed
Message
Queue
Result tables stored in hyperbase
Deployed for ITS of Shandong Province in China
End to end latency is < 2 seconds,
streaming cluster with >30 nodes
Message
cluster with >10
nodes
>30 nodes
30 million events/day,
10000 events/second in rush
hours