
# Scala Data Pipelines @ Spotify

Talk at Big Data Scala, Aug 18, 2015


### Scala Data Pipelines @ Spotify

1. Scala Data Pipelines @ Spotify Neville Li @sinisa_lyh
2. Who am I? ‣ Spotify NYC since 2011 ‣ Formerly Yahoo! Search ‣ Music recommendations ‣ Data infrastructure ‣ Scala since 2013
3. Spotify in numbers • Started in 2006, 58 markets • 75M+ active users, 20M+ paying • 30M+ songs, 20K new per day • 1.5 billion playlists • 1 TB logs per day • 1200+ node Hadoop cluster • 10K+ Hadoop jobs per day
4. Music recommendation @ Spotify • Discover Weekly • Radio • Related Artists • Discover Page
5. Recommendation systems
6. A little teaser: the same primitive in three frameworks.
   Crunch: PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn, CombineFn<K,V> reduceFn) "CombineFns are used to represent the associative operations…"
   Scalding: Grouped[K, +V]::reduce[U >: V](fn: (U, U) => U) "reduce with fn which must be associative and commutative…"
   Spark: PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V) "Merge the values for each key using an associative reduce function…"
7. Monoid! Enables map-side reduce. Actually it's a semigroup (see the sketch below).
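A semigroup is just an associative plus, and that associativity is what lets all three frameworks combine partial results on the map side before the shuffle. A minimal sketch with Algebird; the MaxMin type and its instance are illustrative, not from the talk:

```scala
import com.twitter.algebird.Semigroup

// Illustrative value type: track the max and min of a stream of ints.
case class MaxMin(max: Int, min: Int)

implicit val maxMinSemigroup: Semigroup[MaxMin] = new Semigroup[MaxMin] {
  // Associative: (a + b) + c == a + (b + c), so partial results can be
  // computed per mapper and merged on the reducers in any order.
  def plus(a: MaxMin, b: MaxMin): MaxMin =
    MaxMin(math.max(a.max, b.max), math.min(a.min, b.min))
}

// With the implicit in scope, Scalding's Grouped#sum (and Algebird's
// Semigroup.plus) can pre-aggregate map-side:
val partial = Semigroup.plus(MaxMin(3, 1), MaxMin(7, 2)) // MaxMin(7, 1)
```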
8. One more teaser. The linear equation in Alternating Least Squares (ALS) matrix factorization:
   $x_u = (Y^T Y + Y^T (C^u - I) Y)^{-1} Y^T C^u p(u)$
   vectors.map { case (_, v) => v * v }.reduce(_ + _) // YtY
   ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
     .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }
     .reduceByKey(_ + _)
   ratings.keyBy(fixedKey).join(vectors) // YtCupu
     .map { case (_, (r, v)) =>
       val (Cui, pui) = (r.rating * alpha + 1, if (r.rating > 0.0) 1.0 else 0.0)
       (solveKey(r), v * (Cui * pui))
     }.reduceByKey(_ + _)
   http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
9. Success story • Mid 2013: 100+ Python Luigi M/R jobs, few tests • 10+ new hires since, most fresh grads • Few with Java experience, none with Scala • Now: 300+ Scalding jobs, 400+ tests • More ad-hoc jobs untracked • Spark also taking off
10. First 10 months [chart]
11. Activity over time
12. Guess how many jobs written by yours truly?
13. Performance vs. Agility https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
14. Let's dive into something technical
15. To join or not to join?
   val streams: TypedPipe[(String, String)] = _ // (track, user)
   val tgp: TypedPipe[(String, String)] = _ // (track, genre)
   streams
     .join(tgp)
     .values // (user, genre)
     .group
     .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
16. Hash join
   val streams: TypedPipe[(String, String)] = _ // (track, user)
   val tgp: TypedPipe[(String, String)] = _ // (track, genre)
   streams
     .hashJoin(tgp.forceToDisk) // tgp replicated to all mappers
     .values // (user, genre)
     .group
     .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
17. CoGroup
   val streams: TypedPipe[(String, String)] = _ // (track, user)
   val tgp: TypedPipe[(String, String)] = _ // (track, genre)
   streams
     .cogroup(tgp) { case (_, users, genres) =>
       users.map((_, genres.toSet))
     } // (track, (user, genres))
     .values // (user, genres)
     .group
     .reduce(_ ++ _) // map-side reduce!
18. CoGroup
   val streams: TypedPipe[(String, String)] = _ // (track, user)
   val tgp: TypedPipe[(String, String)] = _ // (track, genre)
   streams
     .cogroup(tgp) { case (_, users, genres) =>
       users.map((_, genres.toSet))
     } // (track, (user, genres))
     .values // (user, genres)
     .group
     .sum // SetMonoid[Set[T]] from Algebird
   * sum[U >: V](implicit sg: Semigroup[U])
19. Key-value file as distributed cache
   val streams: TypedPipe[(String, String)] = _ // (track, user)
   val tgp: SparkeyManager = _ // tgp replicated to all mappers
   streams
     .map { case (track, user) =>
       (user, tgp.get(track).split(",").toSet)
     }
     .group
     .sum
   https://github.com/spotify/sparkey
   SparkeyManager wraps DistributedCacheFile (see the sketch below)
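A hedged sketch of the underlying sparkey-java calls (SparkeyManager itself is Spotify-internal wiring; the file name is made up, and method names follow the spotify/sparkey-java README):

```scala
import java.io.File
import com.spotify.sparkey.Sparkey

object SparkeySketch {
  def main(args: Array[String]): Unit = {
    val indexFile = new File("tgp.spi") // Sparkey stores an index (.spi) plus a log (.spl)

    // Build the key-value store once, e.g. in a dedicated indexing job.
    val writer = Sparkey.createNew(indexFile)
    writer.put("track1", "rock,metal")
    writer.put("track2", "pop")
    writer.flush()
    writer.writeHash() // builds the hash index used for random lookups
    writer.close()

    // Each mapper opens the replicated, memory-mapped files read-only.
    val reader = Sparkey.open(indexFile)
    val genres = Option(reader.getAsString("track1")).map(_.split(",").toSet)
    println(genres)
    reader.close()
  }
}
```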
20. Joins and CoGroups • Require shuffle and reduce step • Some ops force everything to reducers, e.g. mapGroup, mapValueStream • CoGroup more flexible for complex logic • Scalding flattens a.join(b).join(c)… into MultiJoin(a, b, c, …)
21. Distributed cache • Faster with off-heap binary files • Building cache = more wiring • Memory mapping may interfere with YARN • E.g. 64GB nodes with 48GB for containers (no cgroups) • 12 × 2GB containers each with 2GB JVM heap + mmap cache • OOM and swap! • Keep files small (< 1GB) or fall back to joins…
22. Analyze your jobs • Concurrent Driven • Visualize job execution • Workflow optimization • Bottlenecks • Data skew
23. Not enough math?
24. Recommending tracks • User listened to Rammstein - Du Hast • Recommend 10 similar tracks • 40-dimension feature vectors for tracks • Compute cosine similarity between all pairs • O(n) lookup per user, where n ≈ 30M tracks • Try that with 50M users × 10 seed tracks each (brute-force sketch below)
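For scale, here is a minimal sketch of the brute-force O(n) lookup the slide argues against; names and data layout are illustrative:

```scala
// Cosine similarity between two dense feature vectors.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = (a zip b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

// Top-k most similar tracks to a seed: scans all ~30M candidates per lookup.
def topK(seed: Array[Double],
         candidates: Map[String, Array[Double]], // trackId -> 40-dim vector
         k: Int = 10): Seq[(String, Double)] =
  candidates.toSeq
    .map { case (id, v) => (id, cosine(seed, v)) }
    .sortBy(-_._2)
    .take(k)
```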
25. ANNOY - cheat by approximation • Approximate Nearest Neighbor Oh Yeah • Random projections and binary tree search • Build index on a single machine • Load in mappers via distributed cache • O(log n) lookup (usage sketch below) https://github.com/spotify/annoy https://github.com/spotify/annoy-java
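A hedged sketch of querying a prebuilt index from annoy-java; the constructor and method names follow the project README of the time, and the file name and item id are made up:

```scala
import com.spotify.annoy.ANNIndex

object AnnoySketch {
  def main(args: Array[String]): Unit = {
    // Memory-mapped index built offline (e.g. with the Python annoy package)
    // and shipped to mappers via the distributed cache.
    val index = new ANNIndex(40, "tracks.40d.ann") // 40-dimension vectors

    val seed = index.getItemVector(12345)      // feature vector of a seed track
    val neighbors = index.getNearest(seed, 10) // approximate top 10, ~O(log n)
    println(neighbors)
  }
}
```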
26. ANN Benchmark https://github.com/erikbern/ann-benchmarks
27. Filtering candidates • Users don't like seeing artists/albums/tracks they already know • But may forget what they listened to long ago • 50M users × thousands of items each • Over 5 years of streaming logs • Need to update daily • Need to purge old items per user
28. Options • Aggregate all logs daily • Aggregate last x days daily • CSV of artist/album/track ids • Bloom filters
29. Decayed value with cutoff • Compute new user-item score daily • Weighted on context, e.g. radio, search, playlist • score' = score + previous × 0.99 • half-life = log(0.5) / log(0.99) ≈ 69 days • Cut off at top 2000 • Items that users might remember seeing recently (sketch below)
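A minimal sketch of one day's update for a single user under this scheme; types and names are illustrative:

```scala
case class ItemScore(item: String, score: Double)

val decay = 0.99  // per-day decay; with no new plays a score halves in ~69 days
val cutoff = 2000 // keep only the top items per user

// score' = today's weighted score + previous * 0.99, then truncate.
def dailyUpdate(previous: Seq[ItemScore],
                today: Map[String, Double]): Seq[ItemScore] = {
  val decayed = previous.map(is => is.item -> is.score * decay).toMap
  (decayed.keySet ++ today.keySet).toSeq
    .map(i => ItemScore(i, today.getOrElse(i, 0.0) + decayed.getOrElse(i, 0.0)))
    .sortBy(-_.score)
    .take(cutoff)
}
```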
30. Bloom filters • Probabilistic data structure • Encodes a set of items with m bits and k hash functions • No false negatives • Tunable false positive probability • Size proportional to capacity & FP probability • Let's build one per user-{artists,albums,tracks} • Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR (sketch below)
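A hedged sketch with Algebird's Bloom filter monoid; the sizing parameters and keys are illustrative, and the API shape follows Algebird's BloomFilter, where zero is the empty filter and plus is bitwise OR, so per-day filters can be summed with map-side combining:

```scala
import com.twitter.algebird.BloomFilter

// Monoid sized for ~10k items at 1% false positive probability.
val bfMonoid = BloomFilter(numEntries = 10000, fpProb = 0.01)

// One filter per user per day, summed across days.
val monday    = bfMonoid.create("track:a", "track:b")
val tuesday   = bfMonoid.create("track:c")
val seenSoFar = bfMonoid.plus(monday, tuesday)

// No false negatives; false positives bounded by fpProb.
val maybeSeen = seenSoFar.contains("track:a") // ApproximateBoolean(true, ...)
```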
31. Size versus max items & FP prob • User-item distribution is uneven • Assuming same setting for all users • # items << capacity → wasting space • # items > capacity → high FP rate
32. Scalable Bloom Filter • Growing sequence of standard BFs • Increasing capacity and tighter FP probability • Most users have few BFs • Power users have many • Serialization and lookup overhead (sketch after the diagram)
33.–36. [Diagram, built up over four slides: each item hashes into a sequence of BFs of capacity n=1k, 10k, 100k, 1M; once a filter fills up, new items go into the next, larger one.]
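A hedged sketch of the scalable Bloom filter idea (after Almeida et al.); the SimpleBF trait stands in for any fixed-size Bloom filter implementation:

```scala
// Minimal interface for a fixed-size Bloom filter.
trait SimpleBF {
  def add(item: String): Unit
  def mightContain(item: String): Boolean
  def size: Int     // items inserted so far
  def capacity: Int // items it was sized for
}

class ScalableBF(newBF: (Int, Double) => SimpleBF,
                 initialCapacity: Int = 1000, // the n=1k first stage
                 growth: Int = 10,            // 1k -> 10k -> 100k -> 1M
                 fpProb: Double = 0.03,
                 tightening: Double = 0.5) {
  private var filters = List(newBF(initialCapacity, fpProb))

  def add(item: String): Unit = {
    val head = filters.head
    if (head.size >= head.capacity) // current stage full: append a bigger one
      filters ::= newBF(head.capacity * growth,
                        fpProb * math.pow(tightening, filters.size))
    filters.head.add(item)
  }

  // An item may live in any stage, so lookup cost grows with the stage count:
  // the serialization and lookup overhead called out above.
  def mightContain(item: String): Boolean = filters.exists(_.mightContain(item))
}
```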
37. Opportunistic Bloom Filter • Building n BFs of increasing capacity in parallel • Up to << N max possible items • Keep smallest one with capacity > items inserted • Expensive to build • Cheap to store and lookup (sketch below)
38. [Diagram: the parallel filters, keeping the smallest adequate one, here n=1k.]
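And a matching sketch of the opportunistic variant, reusing the SimpleBF trait from the previous sketch; capacities are illustrative:

```scala
// Build several filters of increasing capacity in parallel; at the end keep
// only the smallest one whose capacity exceeds the items actually inserted.
class OpportunisticBF(newBF: Int => SimpleBF,
                      capacities: Seq[Int] = Seq(1000, 10000, 100000, 1000000)) {
  private val filters = capacities.sorted.map(newBF)
  private var inserted = 0

  def add(item: String): Unit = {
    filters.foreach(_.add(item)) // expensive to build: every filter is updated
    inserted += 1
  }

  // Cheap to store and look up: only one right-sized filter survives.
  def result: SimpleBF =
    filters.find(_.capacity > inserted).getOrElse(filters.last)
}
```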
