Sharing Spark at the invitation of OCF and OSSF
If you have any interest in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF)
Please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for providing the venue
If you would like to work in a great working environment
Feel free to contact CLBC at http://clbc.tw/
8. Opinion from Cloudera
• The leading candidate for “successor to MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.
• From http://0rz.tw/y3OfM
9. What is Spark
• From UC Berkeley AMP Lab
• The most active big data open source project since Hadoop
15. Hadoop vs Spark
[Diagram: the two stacks side by side. Hadoop: Hive on MapReduce; Spark: Shark/Spark SQL on the Spark engine. Both sit on YARN over HDFS.]
16. Spark vs Hadoop
• Spark runs on YARN, on Mesos, or in standalone mode
• Spark’s main concept is based on MapReduce
• Spark can read from
• HDFS: data locality
• HBase
• Cassandra
23. Python Word Count

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Access data via the Spark API; process it with Python lambdas.
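To see what each step of the pipeline does without a Spark cluster, the same flatMap / map / reduceByKey logic can be sketched in plain Python (the sample lines here are made up for illustration):

```python
from collections import defaultdict

# Plain-Python sketch of the word-count pipeline above; a list of
# strings stands in for the HDFS text file.
lines = ["to be or not", "to be"]

# flatMap: split every line into words, flattening into one list
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same three steps run partition by partition across the cluster, with reduceByKey shuffling equal keys to the same node.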
24. What is Spark
• Apache Spark™ is a fast and general engine for large-scale data processing
27. PageRank
[Diagram: PageRank on a four-node graph (a, b, c, d) over three iterations, each producing a Tmp result that feeds the next Rank. All ranks start at 1.0; after the 2nd iteration a = 1.85 and the other ranks are 1.0, 0.58, 0.58; after the 3rd iteration a = 1.31 and the other ranks are 1.72, 0.58, 0.39.]
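The iteration in the diagram can be sketched as a power-iteration loop in plain Python. The slide's exact link structure is not recoverable, so the edges below are an assumption; the update rule (0.15 + 0.85 × incoming contributions) is the one Spark's PageRank example uses:

```python
# Minimal PageRank power iteration in plain Python.
# The 4-node link structure here is hypothetical.
links = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["a"],
    "d": ["a"],
}
ranks = {node: 1.0 for node in links}  # every page starts at rank 1.0

for _ in range(3):  # three iterations, as in the diagram
    contribs = {node: 0.0 for node in links}
    for node, outgoing in links.items():
        # each page splits its rank evenly among its outgoing links
        for dest in outgoing:
            contribs[dest] += ranks[node] / len(outgoing)
    # damped update, as in Spark's PageRank example
    ranks = {node: 0.15 + 0.85 * c for node, c in contribs.items()}

print({n: round(r, 2) for n, r in ranks.items()})
```

Each loop body maps cleanly onto Spark operations (a join of links with ranks, a flatMap for contributions, a reduceByKey to sum them), which is why the Tmp results between iterations matter so much.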
28. HDFS is 100x slower than memory
[Diagram: MapReduce writes each iteration's temporary output back to HDFS: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → … → Iter N. Spark keeps it in memory: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → … → Iter N.]
29. PageRank on 1 billion URL records:
First iteration (HDFS) takes 200 sec
2nd iteration (mem) takes 7.4 sec
3rd iteration (mem) takes 7.7 sec
34. RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster,
stored in RAM or on Disk
• Built through parallel transformations
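The RDD idea of a collection built through chained transformations, evaluated only when an action is called, can be sketched with a toy class. The names (ToyRDD, collect) mimic Spark's API but this is an illustration, not Spark's implementation, and it has no partitioning or fault tolerance:

```python
# Toy sketch of the RDD concept: transformations are lazy (they only
# build generators); the action collect() forces evaluation.
class ToyRDD:
    def __init__(self, data):
        self._data = data  # in Spark this would be partitioned across a cluster

    def map(self, f):
        return ToyRDD(f(x) for x in self._data)        # lazy: no work yet

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))  # also lazy

    def collect(self):
        return list(self._data)  # the action triggers the whole chain

result = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x > 1).collect()
print(result)  # → [4, 9, 16]
```

Laziness is what lets Spark plan a whole chain of transformations before touching the data, and keep intermediate results in RAM instead of materializing each step.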
67. Bridge people together
• Scala : Engineer
• Java : Engineer
• Python : Data Scientist , Engineer
• R : Data Scientist , Data Analyst
• SQL : Data Analyst
68. Yahoo EC team
[Diagram: the team's Data Platform. Inputs: Traffic Data and Transaction Data. Components: Filtered Data (HDFS), a Data Mart (Oracle), an ML Model (Spark), and BI Reports (MSTR), with Shark connecting them.]
77. Background Knowledge
• Tweets are stored in real time into a SQL database
• Spark MLlib uses Wikipedia data to train a TF-IDF model
• Spark SQL selects tweets and filters them with the TF-IDF model
• A live BI report is generated
78. Code

val wiki = sql("select text from wiki")
val model = new TFIDF()
model.train(wiki)
registerFunction("similarity", model.similarity _)

select tweet from tweet where similarity(tweet, "$search") > 0.01
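The similarity function registered above boils down to TF-IDF weighting plus cosine similarity. A self-contained sketch of that math in plain Python (the tiny corpus and the tweet text here are made up; Spark MLlib's actual TF-IDF API differs):

```python
import math
from collections import Counter

# Illustrative two-document corpus standing in for the Wikipedia data.
corpus = [
    "spark is a cluster computing engine",
    "wikipedia is a free encyclopedia",
]

def tfidf(doc, corpus):
    """TF-IDF weights for one document, with smoothed document frequency."""
    words = doc.split()
    tf = Counter(words)
    n = len(corpus)
    return {
        w: (tf[w] / len(words))
           * math.log((n + 1) / (1 + sum(w in d.split() for d in corpus)))
        for w in tf
    }

def similarity(a, b, corpus):
    """Cosine similarity between the TF-IDF vectors of two texts."""
    va, vb = tfidf(a, corpus), tfidf(b, corpus)
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

tweet = "spark cluster computing is fast"
# The > 0.01 threshold is the filter condition from the slide's SQL.
print(similarity(tweet, corpus[0], corpus) > 0.01)
```

In the pipeline above this comparison runs inside Spark SQL as the registered UDF, scoring each stored tweet against the search text.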