20. RDDs maintain lineage information that can be used to reconstruct lost partitions
Ex:
val messages = textFile(...).filter(_.startsWith("ERROR"))
val result = messages.map(_.split('\t')(2))
Lineage: HDFS File -> filter (func = _.startsWith(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD
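The lineage idea above can be sketched in plain Scala, without Spark. This is a toy model under stated assumptions, not Spark's implementation: the names Dataset, Source, Derived and LineageSketch are hypothetical, and the sample log lines are made up for illustration. Each derived dataset stores only its parent and a transformation, so any partition can be rebuilt on demand from the stable source.

```scala
// Toy model of lineage-based recovery. NOT Spark's implementation;
// every name in this object is hypothetical.
object LineageSketch {
  // A dataset can compute any of its partitions from its lineage.
  sealed trait Dataset[A] {
    def numPartitions: Int
    def compute(partition: Int): Seq[A]
    def lineage: String
    def filter(p: A => Boolean): Dataset[A] =
      Derived[A, A](this, _.filter(p), "filter")
    def map[B](f: A => B): Dataset[B] =
      Derived[A, B](this, _.map(f), "map")
  }

  // Stable input, standing in for an HDFS file split into blocks.
  final case class Source[A](blocks: Vector[Seq[A]]) extends Dataset[A] {
    def numPartitions: Int = blocks.size
    def compute(partition: Int): Seq[A] = blocks(partition)
    def lineage: String = "source"
  }

  // A derived dataset keeps its parent and a function, not the data itself.
  final case class Derived[A, B](
      parent: Dataset[A],
      transform: Seq[A] => Seq[B],
      name: String
  ) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[B] = transform(parent.compute(partition))
    def lineage: String = s"${parent.lineage} -> $name"
  }

  // Made-up sample data: two partitions of tab-separated log lines.
  val logLines: Dataset[String] = Source(Vector(
    Seq("ERROR\tdisk\tfull", "INFO\tnet\tup"),
    Seq("ERROR\tcpu\thot")
  ))
  val messages: Dataset[String] = logLines.filter(_.startsWith("ERROR"))
  val result: Dataset[String] = messages.map(_.split('\t')(2))

  // "Recovering" a lost partition means recomputing it from the source.
  def recover(partition: Int): Seq[String] = result.compute(partition)

  def main(args: Array[String]): Unit = {
    println(result.lineage)
    println(recover(1))
  }
}
```

Only the requested partition is recomputed here; the other partitions are never touched, which is the property that makes lineage cheaper than replicating the data.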
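27.
// Streaming word count with StreamingContext
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
val ssc = new StreamingContext(conf, Seconds(1))    // 1-second batches
val lines = ssc.textFileStream(args(1))
val words = lines.flatMap(_.split(" "))
val result = words.map(x => (x, 1)).reduceByKey(_ + _)
result.print()          // DStreams have no collect(); print() shows each batch
ssc.start()
ssc.awaitTermination()  // keep the job running until it is stopped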

// Batch word count with SparkContext
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
val sc = new SparkContext(conf)
val lines = sc.textFile(args(1))
val words = lines.flatMap(_.split(" "))
val result = words.map(x => (x, 1)).reduceByKey(_ + _).collect()
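To see what this pipeline computes without a cluster, the same logic can be sketched on ordinary Scala collections. This is an illustration only: LocalWordCount and wordCount are hypothetical names, and groupBy followed by a per-group reduce stands in for reduceByKey, which Scala collections do not provide.

```scala
// Word count on local Scala collections; a stand-in for the RDD version.
// The object and method names are hypothetical.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val words = lines.flatMap(_.split(" "))
    words
      .map(x => (x, 1))   // pair each word with a count of 1
      .groupBy(_._1)      // collect the pairs for each word
      .map { case (word, pairs) =>
        (word, pairs.map(_._2).reduce(_ + _)) // sum the counts per word
      }
  }

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or", "not to be")))
}
```

The shape of the code is the same as in the Spark versions: flatMap to words, map to pairs, then combine the counts per key.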
28.
- Hive-like interface (JDBC Service / CLI)
- Both Hive QL and Simple SQL dialects are supported
- DDL is 100% compatible with the Hive Metastore
- Hive QL aims to be 100% compatible with Hive DML
[Figure: Spark SQL architecture. Components shown: Data Analyst, User Application, CLI, JDBC Service, SQL API, Hive QL, Simple SQL, Catalyst, Hive Metastore, Simple Catalog, Spark Execution Operators, Spark Core.]
29.
- First released in Spark 1.0 (May 2014)
- Initial commit by Michael Armbrust and Reynold Xin of Databricks
30.
- MLlib: machine learning algorithm library
  - Initial contribution from AMPLab, UC Berkeley
  - Shipped with Spark since version 0.8 (Sep 2013)
- Data types
  - Dense
  - Sparse (since 1.0)
    - Many real-world datasets are sparse
- Algorithms
  - Classification / Regression / Collaborative filtering / Clustering / Decomposition
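The difference between the dense and sparse data types can be sketched in plain Scala. This is a minimal illustration, not MLlib's code: VectorSketch, Sparse and toSparse are hypothetical names. In MLlib itself the corresponding constructors are Vectors.dense and Vectors.sparse(size, indices, values); the sketch below only mirrors that index/value layout.

```scala
// Minimal sparse-vector sketch; names are hypothetical, not MLlib's classes.
object VectorSketch {
  // Sparse layout: store only non-zero entries as parallel index/value arrays.
  final class Sparse(val size: Int, val indices: Array[Int], val values: Array[Double]) {
    // Number of entries actually held in memory.
    def storedEntries: Int = values.length

    // Expand back to a dense array, filling the gaps with zeros.
    def toDense: Array[Double] = {
      val out = new Array[Double](size)
      indices.zip(values).foreach { case (i, v) => out(i) = v }
      out
    }

    // Dot product with a dense weight array touches only the non-zero entries.
    def dot(weights: Array[Double]): Double =
      indices.zip(values).map { case (i, v) => v * weights(i) }.sum
  }

  // Convert a dense array by dropping its zero entries.
  def toSparse(dense: Array[Double]): Sparse = {
    val nonZero = dense.zipWithIndex.filter(_._1 != 0.0)
    new Sparse(dense.length, nonZero.map(_._2), nonZero.map(_._1))
  }

  def main(args: Array[String]): Unit = {
    val sparse = toSparse(Array(1.0, 0.0, 0.0, 3.0))
    println(s"stored ${sparse.storedEntries} of ${sparse.size} entries")
  }
}
```

When most entries are zero, both storage and operations such as the dot product scale with the number of non-zero entries rather than with the vector length.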
34.
- Used to run and test Spark programs interactively
  - Convenient for quickly testing parts of a program's logic
- Built on the Scala REPL
  - REPL: read-eval-print loop
  - Extensions:
    - Modified wrapper code generation so that each line typed has references to objects for its dependencies
    - Distributes generated classes over the network
35.
36.
- Pluggable shuffle interface
  - Hash -> Sort
    - Memory/performance, etc.
- Improved data transfer mechanism
  - Pluggable
  - Employs Netty
- Others
  - PySpark / JDBC server / Dynamic metrics …
37.
- Core
  - Pluggable storage interface
    - To support various storage types: SSD, HDFS cache, etc.
- Spark SQL
  - Support for more data sources
    - (Cassandra, MongoDB), RDBMS (SAP/Vertica/Oracle)
  - Performance optimization (code gen, faster joins, etc.)
  - Syntax enhancements (towards SQL92)
- GraphX
  - Move GraphX out of "Alpha"
- Stability and scalability
38.
- Better YARN integration
  - Security
  - Dynamic resource adjustment
- More algorithms for MLlib
  - 15+ as of June
  - Should double quickly
- Spark Streaming
  - Streaming SQL / more data sources, etc.