This document provides an overview of Scala data pipelines at Spotify. It discusses:
- The speaker's background and Spotify's scale with over 75 million active users.
- Spotify's music recommendation systems including Discover Weekly and personalized radio.
- How Scala and frameworks like Scalding, Spark, and Crunch are used to build data pipelines for tasks like joins, aggregations, and machine learning algorithms.
- Techniques for optimizing pipelines including distributed caching, bloom filters, and Parquet for efficient storage and querying of large datasets.
- The speaker's success in migrating over 300 jobs from Python to Scala and growing the team of engineers building Scala pipelines at Spotify.
2. Who am I?
‣ Spotify NYC since 2011
‣ Formerly Yahoo! Search
‣ Music recommendations
‣ Data infrastructure
‣ Scala since 2013
3. Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day
4. Music recommendation @ Spotify
• Discover Weekly
• Radio
• Related Artists
• Discover Page
6. A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
                                  CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations…
Grouped[K, +V]::reduce[U >: V](fn: (U, U) => U)
Scalding: reduce with fn which must be associative and commutative…
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function…
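The shared contract behind all three signatures can be shown without any of the frameworks. Below is a plain-Scala sketch (names like `ReduceByKeyDemo` are made up for illustration) of per-key reduction with a function that must be associative, which is what lets each framework apply it map-side before the shuffle:

```scala
// Plain-Scala stand-in for combineValues / reduce / reduceByKey: group
// pairs by key, then fold each key's values with an associative fn.
object ReduceByKeyDemo {
  def reduceByKey[K, V](pairs: Seq[(K, V)])(fn: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      // Safe only if fn is associative: the framework is free to
      // pre-combine partial groups on the map side.
      k -> kvs.map(_._2).reduce(fn)
    }
}
```

For example, `ReduceByKeyDemo.reduceByKey(Seq(("rock", 1), ("pop", 2), ("rock", 3)))(_ + _)` yields rock -> 4 and pop -> 2 regardless of how the values were partitioned, which is exactly the property the three APIs require.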
8. One more teaser
Linear equation in Alternating Least Squares (ALS) matrix factorization:
xᵤ = (YᵀY + Yᵀ(Cᵘ − I)Y)⁻¹ Yᵀ Cᵘ p(u)
vectors.map { case (_, v) => v * v }.reduce(_ + _) // YtY
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
.map { case (_, (r, op)) =>
(solveKey(r), op * (r.rating * alpha))
}.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
  .map { case (_, (r, v)) =>
    val Cui = r.rating * alpha + 1
    val pui = if (r.rating > 0.0) 1.0 else 0.0
    (solveKey(r), v * (Cui * pui))
  }.reduceByKey(_ + _)
http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
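The YᵀY term above is the sum of the outer products of the rows of Y, which is what the `v * v` line computes when `v` is a library vector type. A hand-rolled plain-Scala sketch (no Breeze/Spark; `YtYDemo` and its helpers are made up for illustration) of the same decomposition on tiny 2-d rows:

```scala
// YtY = sum over rows y_i of (y_i outer y_i), computed row by row so it
// parallelizes as a map (outer product) + reduce (matrix addition).
object YtYDemo {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Outer product of a row vector with itself: a d x d matrix.
  def outer(v: Vec): Mat =
    Array.tabulate(v.length, v.length)((i, j) => v(i) * v(j))

  // Element-wise matrix addition (assumes same square shape).
  def add(a: Mat, b: Mat): Mat =
    Array.tabulate(a.length, a.length)((i, j) => a(i)(j) + b(i)(j))

  def ytY(rows: Seq[Vec]): Mat = rows.map(outer).reduce(add)
}
```

With rows (1, 2) and (3, 4) this gives ((10, 14), (14, 20)): each row contributes its outer product, and the reduce is the same associative `_ + _` as in the pipeline code.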
9. Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off
19. Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = _ // (track, user)
val tgp: SparkeyManager = _ // tgp replicated to all mappers
streams
  .map { case (track, user) =>
    (user, tgp.get(track).split(",").toSet)
  }
  .group
  .sum
https://github.com/spotify/sparkey
SparkeyManager wraps DistributedCacheFile
20. Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers,
  e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)…
  into MultiJoin(a, b, c, …)
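What makes a cogroup "more flexible" than a join can be shown without Scalding: a cogroup hands you all values for a key from both sides and leaves the per-key logic free-form, while a join fixes it to the cross product. A plain-Scala sketch (`CoGroupDemo` is a made-up name for illustration):

```scala
// Cogroup: for each key, collect the values from both inputs.
// Join: a cogroup whose per-key logic is the cross product.
object CoGroupDemo {
  def cogroup[K, A, B](left: Seq[(K, A)],
                       right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val l = left.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val r = right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (l.keySet ++ r.keySet).map { k =>
      k -> ((l.getOrElse(k, Seq.empty), r.getOrElse(k, Seq.empty)))
    }.toMap
  }

  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    cogroup(left, right).toSeq.flatMap { case (k, (as, bs)) =>
      for (a <- as; b <- bs) yield (k, (a, b))
    }
}
```

In the distributed setting the grouping step is the shuffle; anything expressed on the `(Seq[A], Seq[B])` pair (outer joins, set differences, custom aggregates) runs on the reducers, which is why some ops force everything there.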
21. Distributed cache
• Faster with off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere with YARN
• E.g. 64GB nodes with 48GB for containers (no cgroup)
• 12 × 2GB containers each with 2GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1GB) or fall back to joins…
22. Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
24. Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similar tracks
• 40-dimension feature vectors for tracks
• Compute cosine similarity between all pairs
• O(n) lookup per user where n ≈ 30M
• Try that with 50M users × 10 seed tracks each
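The brute-force baseline being dismissed here is simple to state: score every track against the seed vector and keep the best. A plain-Scala sketch with tiny 3-d vectors standing in for the 40-d ones (`CosineDemo` and the track names are made up for illustration):

```scala
// Brute-force similar-track lookup: cosine similarity against every
// candidate (the O(n) scan), then take the top k.
object CosineDemo {
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    def norm(v: Array[Double]) = math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  def topK(seed: Array[Double],
           tracks: Map[String, Array[Double]],
           k: Int): Seq[String] =
    tracks.toSeq
      .map { case (id, v) => (id, cosine(seed, v)) } // O(n) over all tracks
      .sortBy(-_._2)
      .take(k)
      .map(_._1)
}
```

With n ≈ 30M tracks this scan is exactly what does not scale per user, which motivates the approximate index on the next slide.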
25. ANNOY - cheat by approximation
• Approximate Nearest Neighbor Oh Yeah
• Random projections and binary tree search
• Build index on single machine
• Load in mappers via distributed cache
• O(log n) lookup
https://github.com/spotify/annoy
https://github.com/spotify/annoy-java
27. Filtering candidates
• Users don't like seeing artists/albums/tracks they already know
• But may forget what they listened to long ago
• 50M users × thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user
28. Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSV of artist/album/track ids
• Bloom filters
29. Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score′ = score + previous × 0.99
• half-life = log(0.5) / log(0.99) ≈ 69 days
• Cut off at top 2000
• Items that users might remember seeing recently
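The decay arithmetic is worth spelling out: multiplying yesterday's score by 0.99 each day means an untouched item's weight halves after log(0.5)/log(0.99) ≈ 69 days. A minimal sketch (`DecayDemo` is a made-up name for illustration):

```scala
// Daily decayed score: score' = today's score + previous * 0.99.
// After d days with no new activity the old weight is 0.99^d, so the
// half-life solves 0.99^d = 0.5, i.e. d = log(0.5) / log(0.99).
object DecayDemo {
  val decay = 0.99

  def update(todayScore: Double, previous: Double): Double =
    todayScore + previous * decay

  def halfLifeDays: Double = math.log(0.5) / math.log(decay)
}
```

The 69-day half-life is what lets the daily top-2000 cutoff keep items "users might remember seeing recently" while old items fade out and get purged.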
30. Bloom filters
• Probabilistic data structure
• Encodes a set of items with m bits and k hash functions
• No false negatives
• Tunable false positive probability
• Size proportional to capacity & FP probability
• Let's build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
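The monoid structure on the last bullet is the whole trick, and it can be sketched in a few lines of plain Scala. This toy mirrors the shape of Algebird's BloomFilterMonoid (zero = all bits clear, plus = bitwise OR) but the two-line hash scheme is made up for the sketch; Algebird uses proper hash families and a compact bit vector:

```scala
// Toy Bloom filter: m bits (as a Set of set-bit indices), k hashes.
// Inserting sets k bits; membership checks that all k bits are set,
// so members are never missed (no false negatives).
case class Bloom(m: Int, k: Int, bits: Set[Int] = Set.empty) {
  private def hashes(x: String): Seq[Int] =
    (0 until k).map { i =>
      val h = (x.hashCode + i * 0x9e3779b9) % m
      if (h < 0) h + m else h
    }
  def +(x: String): Bloom = copy(bits = bits ++ hashes(x))   // insert
  def |(that: Bloom): Bloom = copy(bits = bits ++ that.bits) // monoid plus
  def mightContain(x: String): Boolean = hashes(x).forall(bits)
}
```

Because `|` (bitwise OR) gives the same filter as inserting all items into one filter, per-day filters can be summed in a pipeline `.sum` to get the per-user lifetime filter, which is what makes daily incremental updates cheap.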
31. Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users:
• # items << capacity → wasting space
• # items > capacity → high FP rate
32. Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
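The growth mechanism can be sketched in plain Scala: keep a list of fixed-capacity stages and, when the newest stage fills, prepend a 10× larger one. This toy (`ScalableBloomDemo` is a made-up name) omits the per-stage tightening of FP probability that real scalable Bloom filters apply:

```scala
// Toy scalable Bloom filter: a list of standard BF stages of growing
// capacity; inserts go to the newest stage, lookups probe all stages.
object ScalableBloomDemo {
  case class Stage(capacity: Int, count: Int = 0, bits: Set[Int] = Set.empty) {
    private val m = capacity * 10 // ~10 bits per expected item
    private def hashes(x: String): Seq[Int] =
      (0 until 3).map { i =>
        val h = (x.hashCode + i * 0x9e3779b9) % m
        if (h < 0) h + m else h
      }
    def add(x: String): Stage = copy(count = count + 1, bits = bits ++ hashes(x))
    def contains(x: String): Boolean = hashes(x).forall(bits)
    def full: Boolean = count >= capacity
  }

  case class SBF(stages: List[Stage]) {
    def +(x: String): SBF = {
      val ss = if (stages.head.full) Stage(stages.head.capacity * 10) :: stages
               else stages
      SBF(ss.head.add(x) :: ss.tail)
    }
    // Probing every stage is the lookup overhead the slide mentions.
    def mightContain(x: String): Boolean = stages.exists(_.contains(x))
  }

  def empty(capacity: Int): SBF = SBF(List(Stage(capacity)))
}
```

Most users never overflow the first stage, so they pay for one small filter; power users accumulate several stages, which is where the serialization and lookup overhead comes from.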
33–36. Scalable Bloom Filter (animation)
(Figure: as items stream in, each BF fills up in turn and a larger one is appended: n=1k → 10k → 100k → 1m.)
37. Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
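The opportunistic variant trades build cost for lookup cost: insert every item into all candidate filters, then keep only the smallest adequate one. A plain-Scala sketch (`OpportunisticBloomDemo` is a made-up name for illustration):

```scala
// Toy opportunistic Bloom filter: build candidates of every capacity in
// parallel (expensive), keep the smallest whose capacity exceeds the
// number of items actually inserted (cheap to store and look up).
object OpportunisticBloomDemo {
  case class BF(capacity: Int, bits: Set[Int] = Set.empty) {
    private val m = capacity * 10 // ~10 bits per expected item
    private def hashes(x: String): Seq[Int] =
      (0 until 3).map { i =>
        val h = (x.hashCode + i * 0x9e3779b9) % m
        if (h < 0) h + m else h
      }
    def add(x: String): BF = copy(bits = bits ++ hashes(x))
    def contains(x: String): Boolean = hashes(x).forall(bits)
  }

  def build(items: Seq[String], capacities: Seq[Int]): BF = {
    val candidates = capacities.sorted.map(c => items.foldLeft(BF(c))(_ add _))
    candidates.find(_.capacity > items.size).getOrElse(candidates.last)
  }
}
```

Unlike the scalable variant, the final result is a single standard BF sized to the user's actual item count, so a lookup probes one filter instead of a stage list.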
38–53. Opportunistic Bloom Filter (animation)
(Figure: items are inserted into all candidate BFs in parallel, starting at n=1k; at the end only the smallest filter with capacity > items inserted is kept.)
60. Track metadata
• Label dump → content ingestion
• Third-party track genres, e.g. Gracenote
• Audio attributes, e.g. tempo, key, time signature
• Cultural data, e.g. popularity, tags
• Latent vectors from collaborative filtering
• Many sources for album, artist, user metadata too
61. Multiple data sources
• Big joins
• Complex dependencies
• Wide rows with few columns accessed
• Wasting I/O
62. Apache Parquet
• Pre-join sources into mega-datasets
• Store as Parquet columnar storage
• Column projection
• Predicate pushdown
• Avro within Scalding pipelines
63. Projection
pipe.map(a => (a.getName, a.getAmount))
versus
Parquet.project[Account]("name", "amount")
• Strings → unsafe and error prone
• No IDE auto-completion → finger injury
• my_fancy_field_name → .getMyFancyFieldName
• Hard to migrate existing code