Scala Data Pipelines @ Spotify
Neville Li
@sinisa_lyh
Who am I?
‣ Spotify NYC since 2011
‣ Formerly Yahoo! Search
‣ Music recommendations
‣ Data infrastructure
‣ Scala since 2013
Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day
Music recommendation @ Spotify
• Discover Weekly
• Radio
• Related Artists
• Discover Page
Recommendation systems
A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn, CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations…
Grouped[K, +V]::reduce[U >: V](fn: (U, U) => U)
Scalding: reduce with fn which must be associative and commutative…
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function…
Monoid!
enables map side reduce
Actually it’s a semigroup
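Why a semigroup is enough for a map-side reduce, as a rough sketch on plain Scala collections. The `Semigroup` trait, `reduceByKey` and `mapSideReduce` here are hypothetical stand-ins written for this example, not the Algebird, Scalding or Spark APIs.

```scala
object SemigroupSketch {
  // Hypothetical minimal type class: one associative operation, no zero needed.
  trait Semigroup[T] { def plus(a: T, b: T): T }
  object Semigroup {
    implicit val intAdd: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
  }

  // Reduce the values of each key with the associative operation.
  def reduceByKey[K, V](kvs: Seq[(K, V)])(implicit sg: Semigroup[V]): Map[K, V] =
    kvs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(sg.plus) }

  // Simulated map-side reduce: each "mapper" pre-reduces its own partition,
  // then the "reducer" merges the partial results with the same operation.
  def mapSideReduce[K, V](partitions: Seq[Seq[(K, V)]])(implicit sg: Semigroup[V]): Map[K, V] =
    reduceByKey(partitions.flatMap(p => reduceByKey(p).toSeq))
}
```

Because `plus` is associative, pre-reducing each partition gives the same answer as reducing everything in one place, which is what lets the framework shrink the data before the shuffle.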
One more teaser
Linear equation in Alternating Least Squares (ALS) matrix factorization
$x_u = (Y^T Y + Y^T (C^u - I) Y)^{-1} Y^T C^u p(u)$
vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
  .map { case (_, (r, op)) =>
    (solveKey(r), op * (r.rating * alpha))
  }.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
  .map { case (_, (r, v)) =>
    // Cui must be bound before pui can refer to it
    val Cui = r.rating * alpha + 1
    val pui = if (Cui > 0.0) 1.0 else 0.0
    (solveKey(r), v * (Cui * pui))
  }.reduceByKey(_ + _)
http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off
First 10 months
……
Activity over time
Guess how many jobs
written by yours truly?
Performance vs. Agility
https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
Let’s dive into something technical
To join or not to join?
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .join(tgp)
  .values // (user, genre)
  .group
  .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
Hash join
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .hashJoin(tgp.forceToDisk) // tgp replicated to all mappers
  .values // (user, genre)
  .group
  .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
CoGroup
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .cogroup(tgp) { case (_, users, genres) =>
    users.map((_, genres.toSet))
  } // (track, (user, genres))
  .values // (user, genres)
  .group
  .reduce(_ ++ _) // map-side reduce!
CoGroup
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .cogroup(tgp) { case (_, users, genres) =>
    users.map((_, genres.toSet))
  } // (track, (user, genres))
  .values // (user, genres)
  .group
  .sum // SetMonoid[T], a Monoid[Set[T]], from Algebird
* sum[U >: V](implicit sg: Semigroup[U])
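The CoGroup logic above can be tried locally on plain collections. This is only a sketch: `cogroup` and `userGenres` are hypothetical helpers defined here and do not use Scalding, so nothing about shuffles or map-side behavior is modeled.

```scala
object CoGroupSketch {
  // Local stand-in for cogroup: for each key, hand both value lists to `f`.
  def cogroup[K, A, B, R](left: Seq[(K, A)], right: Seq[(K, B)])
                         (f: (K, Seq[A], Seq[B]) => Seq[R]): Seq[(K, R)] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).toSeq.flatMap { k =>
      val as = l.getOrElse(k, Seq.empty).map(_._2)
      val bs = r.getOrElse(k, Seq.empty).map(_._2)
      f(k, as, bs).map(k -> _)
    }
  }

  // streams: (track, user), tgp: (track, genre) => user -> set of genres
  def userGenres(streams: Seq[(String, String)],
                 tgp: Seq[(String, String)]): Map[String, Set[String]] =
    cogroup[String, String, String, (String, Set[String])](streams, tgp) { (_, users, genres) =>
      val genreSet: Set[String] = genres.toSet // built once per track
      users.map(u => (u, genreSet))
    }
      .map(_._2)        // drop the track key: (user, genres)
      .groupBy(_._1)
      .map { case (user, gs) => user -> gs.map(_._2).reduce(_ ++ _) } // set union
}
```

Building the genre set once per track, rather than once per (user, genre) pair, is what keeps the per-user values small enough to merge with a plain set union.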
Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = ??? // (track gid, user)
val tgp: SparkeyManager = ??? // tgp replicated to all mappers
streams
  .map { case (track, user) =>
    (user, tgp.get(track).split(",").toSet)
  }
  .group
  .sum
https://github.com/spotify/sparkey
SparkeyManager wraps DistributedCacheFile
Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers, e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)… into MultiJoin(a, b, c, …)
Distributed cache
• Faster with off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere with YARN
• E.g. 64GB nodes with 48GB for containers (no cgroup)
• 12 × 2GB containers each with 2GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1GB) or fall back to joins…
Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
Not enough math?
Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similar tracks
• 40-dimensional feature vectors for tracks
• Compute cosine similarity between all pairs
• O(n) lookup per user where n ≈ 30m
• Try that with 50m users * 10 seed tracks each
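A brute-force version of the lookup, as a minimal sketch on in-memory data. `CosineSketch` and its `topK` helper are hypothetical names for this example. With the numbers above, this approach means roughly 50m × 10 × 30m ≈ 1.5 × 10^16 similarity computations.

```scala
object CosineSketch {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Cosine similarity: dot product divided by the product of the norms.
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  // Brute force: score every candidate track, O(n) per query.
  def topK(query: Array[Double], tracks: Map[String, Array[Double]], k: Int): Seq[String] =
    tracks.toSeq
      .map { case (id, v) => (id, cosine(query, v)) }
      .sortBy(-_._2)
      .take(k)
      .map(_._1)
}
```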
ANNOY - cheat by approximation
• Approximate Nearest Neighbor Oh Yeah
• Random projections and binary tree search
• Build index on single machine
• Load in mappers via distributed cache
• O(log n) lookup
https://github.com/spotify/annoy
https://github.com/spotify/annoy-java
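The idea behind random projections plus binary tree search, as a simplified sketch. This is not Annoy's implementation: it builds a single tree and scans a single leaf, and every name in `RpTreeSketch` is hypothetical. A degenerate split simply becomes a leaf here, where a real index would handle it more carefully.

```scala
import scala.util.Random

object RpTreeSketch {
  type Vec = Array[Double]

  sealed trait Node
  case class Leaf(points: Vector[(Int, Vec)]) extends Node
  case class Split(normal: Vec, offset: Double, left: Node, right: Node) extends Node

  def dot(a: Vec, b: Vec): Double = a.indices.map(i => a(i) * b(i)).sum
  def dist2(a: Vec, b: Vec): Double = a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Which side of the hyperplane {x : normal . x = offset} a point falls on.
  private def goesRight(normal: Vec, offset: Double, p: Vec): Boolean =
    dot(normal, p) >= offset

  def build(points: Vector[(Int, Vec)], leafSize: Int, rnd: Random): Node =
    if (points.size <= leafSize) Leaf(points)
    else {
      // Pick two random points and split by the hyperplane equidistant from both.
      val a = points(rnd.nextInt(points.size))._2
      val b = points(rnd.nextInt(points.size))._2
      val normal = a.indices.map(i => a(i) - b(i)).toArray
      val offset = a.indices.map(i => normal(i) * (a(i) + b(i)) / 2).sum
      val (right, left) = points.partition { case (_, p) => goesRight(normal, offset, p) }
      if (left.isEmpty || right.isEmpty) Leaf(points) // degenerate split, stop here
      else Split(normal, offset, build(left, leafSize, rnd), build(right, leafSize, rnd))
    }

  // Descend to one leaf, then scan only the points in that leaf.
  def query(node: Node, q: Vec): Option[Int] = node match {
    case Leaf(ps) =>
      if (ps.isEmpty) None else Some(ps.minBy { case (_, p) => dist2(p, q) }._1)
    case Split(n, o, l, r) =>
      query(if (goesRight(n, o, q)) r else l, q)
  }
}
```

The answer is approximate: the true nearest neighbor may sit on the other side of a split, which is why practical indexes search several trees and more than one leaf.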
ANN Benchmark
https://github.com/erikbern/ann-benchmarks
Filtering candidates
• Users don’t like seeing artists/albums/tracks they already know
• But may forget what they listened to long ago
• 50m * thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user
Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSV of artist/album/track ids
• Bloom filters
Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score’ = score + previous * 0.99
• half life = $\log_{0.99} 0.5 \approx 69$ days
• Cut off at top 2000
• Items that users might remember seeing recently
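The daily update and the half-life arithmetic, as a small sketch on in-memory maps. `DecaySketch`, `halfLife` and `update` are hypothetical names. The decay factor 0.99 and the cutoff of 2000 from the slide are passed in as parameters.

```scala
object DecaySketch {
  // Days until a score decays to half: log base `decay` of 0.5.
  def halfLife(decay: Double): Double = math.log(0.5) / math.log(decay)

  // One daily update: today's score plus the decayed previous score,
  // keeping only the `cutoff` highest-scoring items.
  def update(previous: Map[String, Double], today: Map[String, Double],
             decay: Double, cutoff: Int): Map[String, Double] = {
    val merged = (previous.keySet ++ today.keySet).map { item =>
      item -> (today.getOrElse(item, 0.0) + previous.getOrElse(item, 0.0) * decay)
    }
    merged.toSeq.sortBy(-_._2).take(cutoff).toMap
  }
}
```

For example, `DecaySketch.halfLife(0.99)` evaluates to about 68.97, which rounds to the 69 days quoted above.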
Bloom filters
• Probabilistic data structure
• Encoding set of items with m bits and k hash functions
• No false negatives
• Tunable false positive probability
• Size grows linearly with capacity and logarithmically with 1 / FP probability
• Let’s build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
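A toy Bloom filter with the monoid structure described above (zero = all bits unset, plus = bitwise OR). This is a sketch, not Algebird's BloomFilterMonoid: the `Bloom` case class, its methods, and the choice of seeded MurmurHash3 as the k hash functions are assumptions made for illustration.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  // m bits, k hash functions; the BitSet holds the indices of the set bits.
  case class Bloom(m: Int, k: Int, bits: BitSet) {
    // k bit positions for an item, one per hash seed.
    private def indices(item: String): Seq[Int] =
      (0 until k).map(seed => java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), m))

    def add(item: String): Bloom = copy(bits = bits ++ indices(item))

    // False means definitely absent; true means possibly present.
    def mightContain(item: String): Boolean = indices(item).forall(bits.contains)

    // Monoid plus: bitwise OR of two filters with identical (m, k).
    def plus(that: Bloom): Bloom = {
      require(m == that.m && k == that.k, "filters must share m and k")
      copy(bits = bits | that.bits)
    }
  }

  // Monoid zero: all bits unset.
  def zero(m: Int, k: Int): Bloom = Bloom(m, k, BitSet.empty)
}
```

Since `plus` is associative and `zero` is its identity, per-user filters built in different partitions can be merged in any grouping.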
Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users
• # items << capacity → wasting space
• # items > capacity → high FP rate
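The trade-off can be made concrete with the standard Bloom filter sizing formulas: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hash functions, and an expected false positive rate of (1 - e^(-k·items/m))^k. The sketch below uses hypothetical helper names, and the capacity of 10k with a 1% target is chosen only as an example.

```scala
object BloomSizing {
  // Optimal number of bits for capacity n and target false positive probability p.
  def optimalBits(n: Long, p: Double): Long =
    math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong

  // Optimal number of hash functions for m bits and capacity n.
  def optimalHashes(m: Long, n: Long): Int =
    math.max(1, math.round(m.toDouble / n * math.log(2)).toInt)

  // Expected false positive rate after inserting `items` elements.
  def fpRate(m: Long, k: Int, items: Long): Double =
    math.pow(1 - math.exp(-k.toDouble * items / m), k)
}
```

With a filter sized for 10k items at 1%, inserting 100k items pushes the expected false positive rate above 90%, while inserting only 1k items leaves almost all of the bits unused.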
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
[Figure: an item inserted into a growing sequence of BFs, n=1k → n=10k → n=100k → n=1m; each earlier filter is marked full before the next one is added]
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and lookup
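One way to read the opportunistic approach, as a sketch under simplifying assumptions: every candidate filter receives every item, and the smallest filter whose capacity exceeds the number of distinct items is kept. The `Bloom` class, the 10 bits per item and the 7 hash functions are illustrative choices, not values from the talk.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object OpportunisticSketch {
  // Minimal Bloom filter; bitsPerItem and k are illustrative constants.
  case class Bloom(capacity: Int, bits: BitSet = BitSet.empty) {
    private val bitsPerItem = 10
    private val k = 7
    private val m = capacity * bitsPerItem
    private def indices(item: String): Seq[Int] =
      (0 until k).map(seed => java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), m))
    def add(item: String): Bloom = copy(bits = bits ++ indices(item))
    def mightContain(item: String): Boolean = indices(item).forall(bits.contains)
  }

  // Insert every item into all candidate filters (the expensive part), then
  // keep the smallest filter whose capacity exceeds the distinct item count.
  def build(items: Seq[String], capacities: Seq[Int]): Option[Bloom] = {
    val empty = capacities.sorted.map(c => Bloom(c))
    val filled = items.foldLeft(empty) { (bfs, item) => bfs.map(_.add(item)) }
    val n = items.distinct.size
    filled.find(_.capacity > n)
  }
}
```

Only the one surviving filter is stored and queried, which is where the cheap storage and lookup come from. The build cost is paid once for every candidate size.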
[Figure: four BFs (n=1k, 10k, 100k, 1m) filled in parallel as items arrive; fill levels go from 80% / 8% / 0.8% / 0.08% to 100% (full) / 70% / 7% / 0.7%, then 100% / 100% / 60% for the first three]
Track metadata
• Label dump → content ingestion
• Third party track genres, e.g. GraceNote
• Audio attributes, e.g. tempo, key, time signature
• Cultural data, e.g. popularity, tags
• Latent vectors from collaborative filtering
• Many sources for album, artist, user metadata too
Multiple data sources
• Big joins
• Complex dependencies
• Wide rows with few columns accessed
• Wasting I/O
Apache Parquet
• Pre-join sources into mega-datasets
• Store as Parquet columnar storage
• Column projection
• Predicate pushdown
• Avro within Scalding pipelines
Projection
pipe.map(a => (a.getName, a.getAmount))
versus
Parquet.project[Account]("name", "amount")
• Strings → unsafe and error prone
• No IDE auto-completion → finger injury
• my_fancy_field_name → .getMyFancyFieldName
• Hard to migrate existing code
Predicate
pipe.filter(a => a.getName == "Neville" && a.getAmount > 100)
versus
FilterApi.and(
  FilterApi.eq(FilterApi.binaryColumn("name"), Binary.fromString("Neville")),
  FilterApi.gt(FilterApi.floatColumn("amount"), 100f.asInstanceOf[java.lang.Float]))
Macro to the rescue
Code → AST → (pattern matching) → (recursion) → (quasi-quotes) → Code
Projection[Account](_.getName, _.getAmount)
Predicate[Account](x => x.getName == "Neville" && x.getAmount > 100)
https://github.com/nevillelyh/parquet-avro-extra
http://www.lyh.me/slides/macros.html
What else?
‣ Analytics
‣ Ads targeting, prediction
‣ Metadata quality
‣ Zeppelin
‣ More cool stuff in the works