Scala Data Pipelines @ Spotify
Neville Li
@sinisa_lyh
Who am I?
‣ Spotify NYC since 2011
‣ Formerly Yahoo! Search
‣ Music recommendations
‣ Data infrastructure
‣ Scala since 2013
Spotify in numbers
• Started in 2006, 58 markets
• 75M+ active users, 20M+ paying
• 30M+ songs, 20K new per day
• 1.5 billion playlists
• 1 TB logs per day
• 1200+ node Hadoop cluster
• 10K+ Hadoop jobs per day
Music recommendation @ Spotify
• Discover Weekly
• Radio
• Related Artists
• Discover Page
Recommendation systems
A little teaser
PGroupedTable<K,V>::combineValues(CombineFn<K,V> combineFn,
                                  CombineFn<K,V> reduceFn)
Crunch: CombineFns are used to represent the associative operations…
Grouped[K, +V]::reduce[U >: V](fn: (U, U) => U)
Scalding: reduce with fn which must be associative and commutative…
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
Spark: Merge the values for each key using an associative reduce function…
Monoid!
Enables map-side reduce
Actually it's a semigroup
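Why associativity is enough for a map-side reduce can be sketched in plain Scala. This is only an illustration: `Semigroup`, `combine` and `reduceByKey` below are hypothetical names written for this example, not the Crunch, Scalding, Spark or Algebird APIs.

```scala
object MapSideReduce {
  // A semigroup is just an associative binary operation; no identity element is needed
  // because only non-empty groups of values are ever reduced.
  trait Semigroup[V] { def plus(a: V, b: V): V }

  val intSum: Semigroup[Int] = new Semigroup[Int] {
    def plus(a: Int, b: Int): Int = a + b
  }

  // "Map side": collapse one partition to at most one value per key.
  def combine[K, V](partition: Seq[(K, V)], sg: Semigroup[V]): Map[K, V] =
    partition.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(sg.plus(_, v)).getOrElse(v))
    }

  // "Reduce side": merge the partial results coming from all partitions.
  def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]], sg: Semigroup[V]): Map[K, V] =
    combine(partitions.flatMap(p => combine(p, sg).toSeq), sg)
}
```

Because `plus` is associative, pre-combining inside each partition gives the same answer as sending every value to the reducer, while shuffling far less data.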
One more teaser
Linear equation in Alternating Least Squares (ALS) matrix factorization
$x_u = (Y^T Y + Y^T (C^u - I) Y)^{-1} Y^T C^u p(u)$
vectors.map { case (id, v) => (id, v * v) }.map(_._2).reduce(_ + _) // YtY
ratings.keyBy(fixedKey).join(outerProducts) // YtCuIY
  .map { case (_, (r, op)) =>
    (solveKey(r), op * (r.rating * alpha))
  }.reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors) // YtCupu
  .map { case (_, (r, v)) =>
    // Defined separately: a tuple pattern cannot refer to Cui on its own right-hand side
    val Cui = r.rating * alpha + 1
    val pui = if (Cui > 0.0) 1.0 else 0.0
    (solveKey(r), v * (Cui * pui))
  }.reduceByKey(_ + _)
http://www.slideshare.net/MrChrisJohnson/scala-data-pipelines-for-music-recommendations
Success story
• Mid 2013: 100+ Python Luigi M/R jobs, few tests
• 10+ new hires since, most fresh grads
• Few with Java experience, none with Scala
• Now: 300+ Scalding jobs, 400+ tests
• More ad-hoc jobs untracked
• Spark also taking off
First 10 months
……
Activity over time
Guess how many jobs written by yours truly?
Performance vs. Agility
https://nicholassterling.wordpress.com/2012/11/16/scala-performance/
Let's dive into something technical
To join or not to join?
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .join(tgp)
  .values // (user, genre)
  .group
  .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
Hash join
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .hashJoin(tgp.forceToDisk) // tgp replicated to all mappers
  .values // (user, genre)
  .group
  .mapValueStream(vs => Iterator(vs.toSet)) // reducer-only
CoGroup
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .cogroup(tgp) { case (_, users, genres) =>
    users.map((_, genres.toSet))
  } // (track, (user, genres))
  .values // (user, genres)
  .group
  .reduce(_ ++ _) // map-side reduce!
CoGroup
val streams: TypedPipe[(String, String)] = ??? // (track, user)
val tgp: TypedPipe[(String, String)] = ??? // (track, genre)
streams
  .cogroup(tgp) { case (_, users, genres) =>
    users.map((_, genres.toSet))
  } // (track, (user, genres))
  .values // (user, genres)
  .group
  .sum // SetMonoid[Set[T]] from Algebird
* sum[U >: V](implicit sg: Semigroup[U])
Key-value file as distributed cache
val streams: TypedPipe[(String, String)] = ??? // (gid, user)
val tgp: SparkeyManager = ??? // tgp replicated to all mappers
streams
  .map { case (track, user) =>
    (user, tgp.get(track).split(",").toSet)
  }
  .group
  .sum
https://github.com/spotify/sparkey
SparkeyManager wraps DistributedCacheFile
Joins and CoGroups
• Require shuffle and reduce step
• Some ops force everything to reducers, e.g. mapGroup, mapValueStream
• CoGroup more flexible for complex logic
• Scalding flattens a.join(b).join(c)… into MultiJoin(a, b, c, …)
Distributed cache
• Faster with off-heap binary files
• Building cache = more wiring
• Memory mapping may interfere with YARN
• E.g. 64GB nodes with 48GB for containers (no cgroup)
• 12 × 2GB containers each with 2GB JVM heap + mmap cache
• OOM and swap!
• Keep files small (< 1GB) or fall back to joins…
Analyze your jobs
• Concurrent Driven
• Visualize job execution
• Workflow optimization
• Bottlenecks
• Data skew
Not enough math?
Recommending tracks
• User listened to Rammstein - Du Hast
• Recommend 10 similar tracks
• 40 dimension feature vectors for tracks
• Compute cosine similarity between all pairs
• O(n) lookup per user where n ≈ 30m
• Try that with 50m users * 10 seed tracks each
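The brute-force approach described above can be sketched as follows. The object and method names are hypothetical, and a real job would use 40-dimensional vectors rather than the tiny ones in the usage example. At the stated scale this means roughly 50m × 10 × 30m = 1.5 × 10^16 similarity computations.

```scala
object BruteForceSimilar {
  // Cosine similarity between two equal-length, non-zero vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.indices.map(i => a(i) * b(i)).sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Score every one of the n candidate tracks against the seed,
  // then sort by descending similarity and keep the k best ids.
  def topK(seed: Array[Double], tracks: Map[String, Array[Double]], k: Int): Seq[String] =
    tracks.toSeq
      .map { case (id, v) => (id, cosine(seed, v)) }
      .sortBy { case (_, score) => -score }
      .take(k)
      .map(_._1)
}
```

For example, `BruteForceSimilar.topK(Array(1.0, 0.0), Map("a" -> Array(1.0, 0.0), "b" -> Array(0.0, 1.0), "c" -> Array(0.9, 0.1)), 2)` ranks `a` first and `c` second.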
ANNOY - cheat by approximation
• Approximate Nearest Neighbor Oh Yeah
• Random projections and binary tree search
• Build index on single machine
• Load in mappers via distributed cache
• O(log n) lookup
https://github.com/spotify/annoy
https://github.com/spotify/annoy-java
ANN Benchmark
https://github.com/erikbern/ann-benchmarks
Filtering candidates
• Users don't like seeing artists/albums/tracks they already know
• But may forget what they listened to long ago
• 50m * thousands of items each
• Over 5 years of streaming logs
• Need to update daily
• Need to purge old items per user
Options
• Aggregate all logs daily
• Aggregate last x days daily
• CSV of artist/album/track ids
• Bloom filters
Decayed value with cutoff
• Compute new user-item score daily
• Weighted on context, e.g. radio, search, playlist
• score′ = score + previous * 0.99
• half life = $\log_{0.99} 0.5 \approx 69$ days
• Cut off at top 2000
• Items that users might remember seeing recently
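The update rule and half-life above can be checked with a few lines of Scala. The names here are hypothetical, and the per-context weighting of the daily score is assumed to have been applied already.

```scala
object DecayedScore {
  val decay = 0.99

  // One daily update: today's score plus the decayed running total.
  def update(previous: Double, today: Double): Double = today + previous * decay

  // Days until a score that is never refreshed halves: log base `decay` of 0.5.
  val halfLifeDays: Double = math.log(0.5) / math.log(decay)

  // Keep only the n highest-scoring items for one user.
  def cutoff(scores: Map[String, Double], n: Int): Map[String, Double] =
    scores.toSeq.sortBy(-_._2).take(n).toMap
}
```

`halfLifeDays` evaluates to just under 69, matching the slide, and `cutoff(scores, 2000)` corresponds to the top-2000 rule.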
Bloom filters
• Probabilistic data structure
• Encoding set of items with m bits and k hash functions
• No false negative
• Tunable false positive probability
• Size proportional to capacity & FP probability
• Let's build one per user-{artists,albums,tracks}
• Algebird BloomFilterMonoid: z = all zero bits, + = bitwise OR
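A minimal sketch of the idea, using only the JDK `BitSet` and the standard library's `MurmurHash3`. It is not Algebird's `BloomFilterMonoid`; `BF`, `add`, `union` and `zero` are hypothetical names chosen to mirror the "zero = all zero bits, plus = bitwise OR" description.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  // Immutable Bloom filter over strings: m bits, k hash functions.
  final case class BF(m: Int, k: Int, bits: BitSet) {
    // k bit positions for an item, one per seeded hash.
    private def positions(item: String): Seq[Int] =
      (0 until k).map(seed => Math.floorMod(MurmurHash3.stringHash(item, seed), m))

    // Returns a new filter with the item's k bits set.
    def add(item: String): BF = {
      val updated = bits.clone().asInstanceOf[BitSet]
      positions(item).foreach(i => updated.set(i))
      BF(m, k, updated)
    }

    // Monoid plus: bitwise OR of two filters that share m and k.
    def union(that: BF): BF = {
      require(m == that.m && k == that.k, "filters must share m and k")
      val merged = bits.clone().asInstanceOf[BitSet]
      merged.or(that.bits)
      BF(m, k, merged)
    }

    // May report a false positive, never a false negative.
    def mightContain(item: String): Boolean = positions(item).forall(i => bits.get(i))
  }

  // Monoid zero: all bits cleared.
  def zero(m: Int, k: Int): BF = BF(m, k, new BitSet(m))
}
```

Since `union` is associative with `zero` as identity, per-day or per-partition filters for the same user can be merged in any grouping, which is what makes the structure fit a `sum` over grouped values.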
Size versus max items & FP prob
• User-item distribution is uneven
• Assuming same setting for all users
• # items << capacity → wasting space
• # items > capacity → high FP rate
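The trade-off can be made concrete with the standard Bloom filter sizing formulas. The sketch below uses hypothetical function names and the textbook approximations, not any particular library's sizing code.

```scala
object BloomSizing {
  // Bits needed for n items at false-positive probability p: m = -n ln p / (ln 2)^2
  def numBits(n: Long, p: Double): Long =
    math.ceil(-n * math.log(p) / (math.log(2) * math.log(2))).toLong

  // Hash count that minimizes the FP rate: k = (m / n) ln 2, at least 1.
  def numHashes(n: Long, m: Long): Int =
    math.max(1, math.round(m.toDouble / n * math.log(2)).toInt)

  // Approximate FP rate after inserting `items` elements: (1 - e^(-k * items / m))^k
  def fpRate(items: Long, m: Long, k: Int): Double =
    math.pow(1 - math.exp(-k.toDouble * items / m), k)
}
```

A filter sized for 1,000 items at 1% needs about 9.6K bits and 7 hashes. Inserting 10,000 items into that same filter pushes the approximate FP rate above 99%, while a user with only 100 items is paying for bits that stay almost entirely empty.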
Scalable Bloom Filter
• Growing sequence of standard BFs
• Increasing capacity and tighter FP probability
• Most users have few BFs
• Power users have many
• Serialization and lookup overhead
Figure: filters with n=1k, n=10k, n=100k and n=1m are added one after another; once the first three are full, new items go into the last.
Opportunistic Bloom Filter
• Building n BFs of increasing capacity in parallel
• Up to << N max possible items
• Keep smallest one with capacity > items inserted
• Expensive to build
• Cheap to store and look up
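The selection rule can be sketched as below. Everything here is an assumption made for illustration: the names are hypothetical, each filter uses a fixed 10 bits per item of capacity with 7 hashes, and the item count is exact rather than estimated.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

object OpportunisticSketch {
  // Mutable Bloom filter sized for `capacity` items (assumed 10 bits per item, 7 hashes).
  final class Filter(val capacity: Int) {
    private val m = capacity * 10
    private val bits = new BitSet(m)
    private def positions(item: String) =
      (0 until 7).map(seed => Math.floorMod(MurmurHash3.stringHash(item, seed), m))
    def add(item: String): Unit = positions(item).foreach(i => bits.set(i))
    def mightContain(item: String): Boolean = positions(item).forall(i => bits.get(i))
  }

  // Insert every item into all candidate filters, then keep the smallest one
  // whose capacity exceeds the number of distinct items (None if all overflow).
  def build(items: Seq[String], capacities: Seq[Int]): Option[Filter] = {
    val candidates = capacities.sorted.map(c => new Filter(c))
    val distinct = items.distinct
    distinct.foreach(item => candidates.foreach(_.add(item)))
    candidates.find(_.capacity > distinct.size)
  }
}
```

Building costs one insert per item per candidate, which is the "expensive to build" part. Only the single chosen filter is stored, so a lookup touches one filter instead of a chain of them.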
Figure: how full each filter is as items are inserted (successive frames)
n=1k    n=10k   n=100k  n=1m
80%     8%      0.8%    0.08%
100%    70%     7%      0.7%    (n=1k full)
100%    100%    60%     …
Track metadata
• Label dump → content ingestion
• Third party track genres, e.g. GraceNote
• Audio attributes, e.g. tempo, key, time signature
• Cultural data, e.g. popularity, tags
• Latent vectors from collaborative filtering
• Many sources for album, artist, user metadata too
Multiple data sources
• Big joins
• Complex dependencies
• Wide rows with few columns accessed
• Wasting I/O
Apache Parquet
• Pre-join sources into mega-datasets
• Store as Parquet columnar storage
• Column projection
• Predicate pushdown
• Avro within Scalding pipelines
Projection
pipe.map(a => (a.getName, a.getAmount))
versus
Parquet.project[Account]("name", "amount")
• Strings → unsafe and error prone
• No IDE auto-completion → finger injury
• my_fancy_field_name → .getMyFancyFieldName
• Hard to migrate existing code
Predicate
pipe.filter(a => a.getName == "Neville" && a.getAmount > 100)
versus
FilterApi.and(
  FilterApi.eq(FilterApi.binaryColumn("name"), Binary.fromString("Neville")),
  FilterApi.gt(FilterApi.floatColumn("amount"), 100f.asInstanceOf[java.lang.Float]))
Macro to the rescue
Code → AST → (pattern matching) → (recursion) → (quasi-quotes) → Code
Projection[Account](_.getName, _.getAmount)
Predicate[Account](x => x.getName == "Neville" && x.getAmount > 100)
https://github.com/nevillelyh/parquet-avro-extra
http://www.lyh.me/slides/macros.html
What else?
‣ Analytics
‣ Ads targeting, prediction
‣ Metadata quality
‣ Zeppelin
‣ More cool stuff in the works