Talk at Scala Up North Jul 21 2017
We will tell Spotify's story with Scala big data, our journey to migrate our entire data infrastructure to Google Cloud, and how Justin Bieber contributed to breaking it. We'll cover Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, Algebird, Chill and shapeless. There'll also be a live coding demo.
6. Hadoop at Spotify
● On-premise → Amazon EMR → On-premise
● ~2,500 nodes, largest in EU
● 100PB+ Disk, 100TB+ RAM
● 60TB+ per day log ingestion
● 20K+ jobs per day
7. Data Processing
● Luigi, Python M/R, circa 2011
● Scalding, Spark, circa 2013
● 200+ Scala users, 1K+ unique jobs
● Storm for real-time
● Hive for ad-hoc analysis
10. Top Tracks in Vancouver Jun 2017
● 30 date partitioned tables, 60TB
● 1 metadata table, 418GB
● 94.2s, 4.82TB processed
Track Artist Count
Despacito - Remix Luis Fonsi 349188
I'm the One DJ Khaled 214863
HUMBLE. Kendrick Lamar 200690
Unforgettable French Montana 192141
XO TOUR Llif3 Lil Uzi Vert 148546
Slide Calvin Harris 134236
Attention Charlie Puth 131514
Strip That Down Liam Payne 125191
2U David Guetta 124998
11. Top Bieber Tracks Jun 2017
● 30 date partitioned tables, 60TB
● 1 metadata table, 418GB
● 81.8s, 2.92TB processed
Track Count
Love Yourself 28400153
Sorry 27269157
What Do You Mean? 20047452
Company 11794409
Purpose 8314381
I'll Show You 6627706
Baby 6161099
Boyfriend 5854955
The Feeling 5415162
As Long As You Love Me 4738529
15. The Evolution of Apache Beam
[Diagram: Google internal systems (MapReduce, Bigtable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) evolving into Apache Beam and Google Cloud Dataflow]
16. What is Apache Beam?
1. The Beam Programming Model
2. SDKs for writing Beam pipelines -- Java & Python
3. Runners for existing distributed processing backends
○ Apache Flink (thanks to data Artisans)
○ Apache Spark (thanks to Cloudera and PayPal)
○ Google Cloud Dataflow (fully managed service)
○ Local runner for testing
17. The Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
18. Customizing What Where When How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
23. Beam and Scala
● High level DSL
● Familiarity with Scalding, Spark and Flink
● Functional programming natural fit for data
● Numerical libraries - Breeze, Algebird
● Macros & shapeless for code generation
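The Algebird point above boils down to one idea: aggregation over a dataset reduces to an associative combine, which is exactly what a semigroup provides and why it parallelizes well. A minimal sketch in plain Scala, with no Algebird dependency (names and types are illustrative):

```scala
// Illustrative semigroup-style combine: (count, sum) pairs merge
// associatively, so partial aggregates computed on different workers
// can be merged in any grouping order.
def combine(a: (Long, Double), b: (Long, Double)): (Long, Double) =
  (a._1 + b._1, a._2 + b._2)

val parts = Seq(2.0, 3.5, 4.5).map(x => (1L, x))
val (count, sum) = parts.reduce(combine)
val mean = sum / count
```

Algebird packages this pattern as `Semigroup[T]`/`Monoid[T]` instances for many types, which is what makes `sumByKey`-style aggregations compose cleanly in a distributed pipeline.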
31. Type safe BigQuery
Macro generated case classes, schemas and converters
@BigQueryType.fromQuery("SELECT id, name FROM [users] WHERE ...")
class User // look mom no code!
sc.typedBigQuery[User]().map(u => (u.id, u.name))
@BigQueryType.toTable
case class Score(id: String, score: Double)
data.map(kv => Score(kv._1, kv._2)).saveAsTypedBigQuery("table")
32. BigQuery + Scio
● Best of both worlds
● BigQuery for slicing and dicing huge datasets
● Scala for custom logic
● Seamless integration
33. REPL
$ scio-repl
Welcome to
                _____
 ________________(_)_____
 __  ___/  ___/_  /_  __ \
 _(__  )/ /__ _  / / /_/ /
 /____/ \___/ /_/  \____/   version 0.3.4
Using Scala version 2.11.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
Using 'scio-test' as your BigQuery project.
BigQuery client available as 'bq'
Scio context available as 'sc'
scio> _
Available in github.com/spotify/homebrew-public
34. Other goodies
● DAG and source code visualization
● BigQuery caching, legacy & SQL 2011 support
● HDFS, Protobuf, TensorFlow, Bigtable, Elasticsearch & Cassandra I/O
● Join optimizations - hash, skewed, sparse
● Job metrics
● DistCache, job orchestration
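Of the join optimizations mentioned above, the hash join is the simplest: replicate a small right-hand side to every worker so the large side never needs to shuffle. A collection-based sketch of the idea in plain Scala (not the actual Scio operator):

```scala
// Sketch of a map-side (hash) join: materialize the small side as an
// in-memory map, then stream the large side against it -- no shuffle needed.
def hashJoin[K, V, W](large: Seq[(K, V)], small: Seq[(K, W)]): Seq[(K, (V, W))] = {
  val lookup: Map[K, W] = small.toMap
  large.flatMap { case (k, v) => lookup.get(k).map(w => (k, (v, w))) }
}

val joined = hashJoin(Seq("a" -> 1, "b" -> 2), Seq("a" -> "x"))
```

In a real pipeline the small side is distributed as a side input; the trade-off is that it must fit in each worker's memory.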
36. Adoption
● At Spotify
○ 200+ users, 700+ unique production pipelines (up from ~70 ten months ago)
○ Most new to Scala and Scio
○ Both batch and streaming jobs
● Externally
○ ~10 companies, several fairly large ones
41. Listening history
● user x track x [day/week/month/all time]
● 300B elements
● 800 n1-highmem-32 workers
● 32 cores, 208GB RAM each
● 240TB in - Bigtable
● 90TB out - GCS
42. BigDiffy
● Pairwise field-level statistical diff
● Diff 2 SCollection[T] given keyFn: T => String
● T: Avro, BigQuery JSON, Protobuf
● Leaf field Δ - numeric, string (Levenshtein), vector (Cosine)
● Δ statistics - min, max, μ, σ, etc.
● Non-deterministic fields
○ ignore field
○ treat "repeated" field (List) as unordered list (Set)
Part of github.com/spotify/ratatool
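The string delta mentioned above uses Levenshtein (edit) distance, which can be computed with the classic dynamic program. A compact sketch, not BigDiffy's actual implementation:

```scala
// Classic DP for Levenshtein distance: dp(i)(j) is the minimum number of
// single-character edits to turn the first i chars of a into the first
// j chars of b.
def levenshtein(a: String, b: String): Int = {
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1,   // deletion
                                 dp(i)(j - 1) + 1),  // insertion
                        dp(i - 1)(j - 1) + cost)     // substitution
  }
  dp(a.length)(b.length)
}
```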
43. Dataset Diff
● Diff stats
○ Global: # of SAME, DIFF, MISSING LHS/RHS
○ Key: key → SAME, DIFF, MISSING LHS/RHS
○ Field: field → min, max, μ, σ, etc.
● Use cases
○ Validating pipeline migration
○ Sanity checking ML models
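The global counts above fall out directly from the per-key delta types. A toy sketch over an in-memory collection (the type names mirror the slides; this is illustrative, not BigDiffy's code):

```scala
// Illustrative tally of global diff stats from (key, deltaType) pairs.
sealed trait DeltaType
case object SAME extends DeltaType
case object DIFFERENT extends DeltaType
case object MISSING_LHS extends DeltaType
case object MISSING_RHS extends DeltaType

def globalStats(results: Seq[(String, DeltaType)]): Map[DeltaType, Int] =
  results.groupBy(_._2).map { case (dt, xs) => dt -> xs.size }
```

In the actual pipeline the same tally would be a `countByValue`-style aggregation over the keyed delta results rather than an in-memory `groupBy`.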
44. Pairwise field-level deltas
val lKeyed = lhs.keyBy(keyFn)
val rKeyed = rhs.keyBy(keyFn)
val deltas = (lKeyed outerJoin rKeyed).map { case (k, (lOpt, rOpt)) =>
  (lOpt, rOpt) match {
    case (Some(l), Some(r)) =>
      val ds = diffy(l, r) // Seq[Delta]
      val dt = if (ds.isEmpty) SAME else DIFFERENT
      (k, (ds, dt))
    case (_, _) =>
      val dt = if (lOpt.isDefined) MISSING_RHS else MISSING_LHS
      (k, (Nil, dt))
  }
}
48. Transforming features
case class Record(d1: Double, d2: Option[Double], d3: Double, s: Seq[String])

val spec = FeatureSpec.of[Record]
  .required(_.d1)(Binarizer("bin", threshold = 0.5))
  .optional(_.d2)(Bucketizer("bucket", Array(0.0, 10.0, 20.0)))
  .required(_.d3)(StandardScaling("std"))
  .required(_.d3)(QuantileDiscretizer("q", 4))
  .required(_.s)(NHotEncoder("nhot"))

val f = spec.extract(records)
f.featureNames // human readable names
f.featureValues[DenseVector[Double]] // output format
f.featureSettings // settings can be reloaded
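As a rough sketch of what a bucketizing transform like the one above produces (an assumption about the exact semantics, not the library's implementation): a value is one-hot encoded by which split interval it falls into.

```scala
// Sketch (assumed semantics): one-hot encode which [lo, hi) interval of
// the given splits a value falls into; out-of-range values yield an
// all-zero vector.
def bucketize(x: Double, splits: Array[Double]): Array[Double] = {
  val vec = Array.fill(splits.length - 1)(0.0)
  val i = splits.sliding(2).indexWhere { w => x >= w(0) && x < w(1) }
  if (i >= 0) vec(i) = 1.0
  vec
}
```

So with splits `Array(0.0, 10.0, 20.0)` each input contributes two feature columns, one per bucket.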
49. shapeless-datatype
● Case class ↔ Avro, BigQuery, Datastore & TensorFlow types
● Compile time and extendable type mapping
● Generic mapper between case class types
● Type and lens based record matcher
https://github.com/nevillelyh/shapeless-datatype
50. protobuf-generic
● Generic Protobuf manipulation without compiled classes
● Similar to Avro GenericRecord
● Protobuf type T → JSON schema
● Bytes ↔ JSON with JSON schema
● Command line tool to inspect binary files
https://github.com/nevillelyh/protobuf-generic
51. Other uses
● AB testing
○ Statistical analysis with bootstrap
and DimSum
● Monetization
○ Ads targeting
○ User conversion analysis
● User understanding
○ Diversity
○ Session analysis
○ Behavior analysis
● Home page ranking
● Audio fingerprint analysis
56. Serialization
● Data ser/de
○ Scalding, Spark and Storm use Kryo and Twitter Chill
○ Dataflow/Beam requires an explicit Coder[T]
○ Sometimes inferable via Guava TypeToken
○ ClassTag to the rescue, with fallback to Kryo/Chill
● Lambda ser/de
○ Custom ClosureCleaner, Serializable and @transient lazy val
○ Scala 2.12, Java 8 lambda issues #10232 #10233
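The ClassTag fallback above can be sketched as follows (the names are illustrative, not Scio's actual internals): inspect the runtime class carried by the implicit ClassTag to pick a specific coder, and fall back to a generic Kryo/Chill-style coder for everything else.

```scala
import scala.reflect.ClassTag

// Illustrative sketch: resolve a coder name from a ClassTag, falling back
// to a generic Kryo-based coder when no specific match exists.
def coderFor[T](implicit ct: ClassTag[T]): String =
  ct.runtimeClass match {
    case c if c == classOf[String] => "StringUtf8Coder"
    case c if c == java.lang.Integer.TYPE || c == classOf[java.lang.Integer] =>
      "VarIntCoder"
    case _ => "KryoAtomicCoder" // generic fallback via Kryo/Chill
  }
```

The key point is that the ClassTag is captured implicitly at the call site, so the framework can make this decision without the user writing a Coder[T] by hand.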
57. REPL
● Spark REPL transports lambda bytecode via HTTP
● Dataflow requires job jar for execution (masterless)
● Custom class loader and ILoop
● Interpreted classes → job jar → job submission
● SCollection[T]#closeAndCollect(): Iterator[T] to mimic Spark actions
58. Macros and IntelliJ IDEA
● IntelliJ IDEA does not see macro expanded classes
https://youtrack.jetbrains.com/issue/SCL-8834
● @BigQueryType.{fromTable, fromQuery}
class MyRecord
● Scio IDEA plugin
https://github.com/spotify/scio-idea-plugin