6. Data model
Spark
• RDD for batch, DStream for streaming
• Explicit caching semantics
• Two sets of APIs
Dataflow
• PCollection for both batch and streaming
• Windowed and timestamped values
• One unified API
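The contrast above can be sketched in code. This is an illustrative fragment, not runnable on its own: it assumes Spark's `SparkContext`/`StreamingContext` and Scio's `ScioContext` are in scope, and the exact Scio method names are approximations.

```scala
// Spark: two distinct types, one per mode (sketch only)
val batch: RDD[String] = sparkContext.textFile("gs://bucket/logs/*")
val stream: DStream[String] = streamingContext.socketTextStream("localhost", 9999)

// Dataflow/Scio: one SCollection type for both modes; values carry timestamps,
// and windowing is what turns batch-style code into streaming code
val events: SCollection[String] = sc.pubsubTopic("projects/my-project/topics/events")
events
  .withFixedWindows(Duration.standardMinutes(1)) // windowed, timestamped values
  .countByValue
```

The same transform chain works unchanged on a bounded (batch) or unbounded (streaming) SCollection, which is the point of the unified API.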
7. Execution
Spark
• Driver and executors
• Dynamic execution from driver
• Transforms and actions
Dataflow
• No master
• Static execution planning
• Transforms only, no actions
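A small sketch of the transforms-vs-actions distinction, assuming Spark and Scio APIs (the `parse` function and bucket paths are hypothetical):

```scala
// Spark: an action such as take() runs a job immediately and returns results
// to the driver, which can then branch on them (dynamic execution)
val sample = rdd.map(parse).filter(_.valid).take(10)

// Dataflow: transforms only; the full pipeline is planned statically and
// submitted as one job, so there is no driver-side branching on results
sc.textFile("gs://bucket/input")
  .map(parse)
  .filter(_.valid)
  .saveAsTextFile("gs://bucket/output") // a sink transform, not an action
```

Because nothing executes until the whole pipeline is submitted, Dataflow can optimize the entire graph up front, at the cost of Spark-style interactive, result-driven control flow.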
9. Why not Scalding on GCE
Pros
• Community
Twitter, eBay, Etsy, Stripe, LinkedIn, …
• Stable and proven
10. Why not Scalding on GCE
Cons
• Hadoop cluster operations
• Multi-tenancy
resource contention and utilization
• No streaming mode (Summingbird?)
11. Why not Spark on GCE
Pros
• Batch, streaming, interactive and SQL
• MLlib, GraphX
• Scala, Python, and R support
• Zeppelin, spark-notebook, Hue
12. Why not Spark on GCE
Cons
• Hard to tune and scale
• Cluster lifecycle management
13. Why Dataflow with Scala
Dataflow
• Hosted solution, no operations
• Ecosystem
GCS, BigQuery, PubSub, Bigtable, …
• Unified batch and streaming model
14. Why Dataflow with Scala
Scala
• High level DSL
easy transition for developers
• Reusable and composable code via FP
• Numerical libraries: Breeze, Algebird
18. WordCount
Almost identical to Spark version
val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")
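For comparison, a sketch of the equivalent Spark version (assuming a configured `SparkContext`). Note one real difference hiding behind the similarity: Spark's `countByValue` is an action that returns a local `Map` to the driver, so a distributed count uses `reduceByKey` instead.

```scala
val sc = new SparkContext(conf)
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .map((_, 1L))
  .reduceByKey(_ + _)  // distributed count; Spark's countByValue would collect to the driver
  .saveAsTextFile("wordcount.txt")
```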
19. PageRank
def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
20. Spotify Running
60 million tracks
30M users × 10 tempo buckets × 25 tracks
Audio: tempo, energy, time signature ...
Metadata: genres, categories, …
Latent vectors from collaborative filtering
25. Personalized new releases
• Pre-computed weekly on Hadoop
(on-premise cluster)
• 100GB recommendations
from HDFS to Bigtable in US+EU
• 250GB Bloom filters from Bigtable to HDFS
• 200 LOC
26. User conversion analysis
• For marketing and campaigning strategies
• Track user transitions through products
• Aggregated for simulation and projection
• 150GB BigQuery in and out