Adam Kawa
Data Engineer @ Spotify
(Big) Data At Spotify
At Spotify,
important questions are being
asked all the time!
Some of these questions
are “relatively easy”
to answer…
1. How many times has Coldplay been streamed this month?
2. Who was the most popular artist in NYC last week?
3. How many times was “Get Lucky” streamed during its first 24 hours?
Labels, Licensors, Partners, Advertisers
■ Very granular reports are required
- Divided by gender, age, location and more
■ We have been delivering various reports from day 1
- Too much data for traditional solutions
Reporting
QUIZ!
Question
Who was the most frequently streamed female artist in 2013?
Answer?
A) Katy Perry
B) Lady Gaga
C) Madonna
D) Rihanna
Popular Artists
■ The Most Popular Male Artist - Macklemore
■ The Most Popular Band - Imagine Dragons
■ The Most Popular Track - “Can't Hold Us”
Popular Artists In 2013
■ Users love local artists!
- Berlin - Sido
- London - Coldplay
- Singapore - Vanessa-Mae
- Stockholm - Avicii
- NYC listens to Jay-Z 88% more than the rest of the world
- Stockholm listens to ABBA 110% more than the rest of the world
Popular Artists In 2013
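A figure such as "88% more than the rest of the world" can be read as a relative difference between an artist's share of streams in one city and the same artist's share everywhere else. The slides do not say how the number was computed, so the sketch below is only one plausible reading; the function name and all counts are made up for illustration.

```python
def relative_lift(city_artist, city_total, world_artist, world_total):
    """Percent by which an artist's share of streams in a city exceeds
    the artist's share in the rest of the world (assumed definition)."""
    city_share = city_artist / city_total
    # "Rest of the world" excludes the city itself.
    rest_share = (world_artist - city_artist) / (world_total - city_total)
    return (city_share / rest_share - 1.0) * 100.0


# Hypothetical stream counts, not Spotify data.
lift = relative_lift(city_artist=188, city_total=10_000,
                     world_artist=1_188, world_total=110_000)
print(f"{lift:.0f}% more than the rest of the world")
```

Comparing shares rather than raw counts keeps a large city from looking like a fan base just because it streams more of everything.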
Question
What was the most “viral” track in 2013?
Answer
“Get Lucky” by Daft Punk feat. Pharrell Williams
Popular Tracks
Artist Analytics – Daft Punk
“Get Lucky” was released on April 19th, 2013.
Around 5x more streams when comparing the day “before” with the day “after” “Get Lucky”.
What happened that day?
“Random Access Memories” was released on May 17th, 2013.
■ August 9, 1963 – February 11, 2012
Artist Analytics – Whitney Houston
■ One of the most popular Polish rock bands ever
What happened?
Information about the band's retirement was announced...
Artist Analytics – Budka Suflera
1. What was the number of daily active users (DAU) yesterday?
2. How many users have signed up this week?
3. In which country should Spotify launch next?
Management And Investors
■ Analyzing growth
- Number of active users, streamed songs, sign-ups and more
- Where to launch Spotify next
■ Company KPIs
Business Analytics
However,
some of the questions are
really tricky to answer!
1. What song to stream to Jay-Z when he wakes up?
2. Is Adam Kawa bored with Timbuktu today?
3. How to encourage Jeff to go for the Premium Account?
Data Scientists, Researchers
■ Recommendations
- Powering features like Discover, Radio
- “Perfect music for every moment ♪♫ ♬ ♯”
■ Classification of songs and playlists by genre or mood
■ Top lists per country
Product Features
■ Overall, in 2013
- Best Hangover Cure - “The Lazy Song”
- Best Song To Get Over An Ex - “Someone Like You”
- Best Party Starter - “Levels”
- Best Driving Song - “Bohemian Rhapsody”
- Best Work Out Song - “Eye of the Tiger”
Perfect Music For Every Moment
1. Is this button nicer than the previous one?
2. How to personalize the messages displayed to users?
3. How should the results of search be displayed?
Designers, Feature Owners
■ A/B Test
- Come up with promising “look-and-feels” and run A/B tests
■ Explicit feedback from users
- But users usually do not like to rate things
- But users usually do not like to customize things
Designers, Feature Owners
■ Sign-up Button On Facebook
A/B Test Use Case
Sign-up button on the landing page
Layouts of sign-up button
A – Control Group (50%)
B – Test Group (50%)
Which one performed better?
Many more sign-ups!
■ “Only 10% are likely to cause a true uplift” - Google after 12K tests
- Be able to iterate fast!
■ “80% of the time, we are wrong about what consumers want”
- The truth is in data!
A/B Tests
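Deciding whether the test group really produced "many more sign-ups" usually involves a significance check. The slides do not say which statistic was used; the sketch below assumes a standard two-proportion z-test on sign-up rates, and the visitor and sign-up counts are invented.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion
    rate between control A and test B, using a pooled standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical 50/50 split: 20,000 visitors per group.
z, p = two_proportion_z_test(conv_a=1000, n_a=20_000, conv_b=1150, n_b=20_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the uplift is unlikely to be noise, which matters when only a minority of tests produce a true uplift.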
In the past,
we guesstimated a bit
(common sense, intuition,
gut feeling, observations,
inspirations)
Isn't it inspired
by the Windows
Start menu button? ;)
“KÖP!” means “BUY!”
Today,
we make data-driven decisions
To make data-driven decisions,
data and data infrastructure
are required (among other things)
■ Over 6 million paying subscribers
■ Over 24 million MAU (monthly active users)
■ 1.5 billion playlists created so far
■ Available in 55 countries
■ Over 20 million songs
■ 4.5 billion hours streamed in 2013
Users At Spotify
■ Data generated by users and for users!
- 1.5 TB of compressed data from users per day
- 64 TB of data generated in Hadoop each day (triplicated)
(Big) Data At Spotify
■ Apache Hadoop YARN
■ Many other systems including:
- Kafka, Luigi, Cassandra, PostgreSQL in production
- Giraph, Tez, Spark under evaluation
Data Infrastructure At Spotify
■ Probably the largest commercial Hadoop cluster in Europe!
- 694 heterogeneous nodes
- 12.63 PB of data used
- ~7,000 jobs each day
Apache Hadoop
■ Used for “off-line” processing
- When Hadoop is down, Spotify still plays music!
- When Hadoop is down, Data Analysts play FIFA, table tennis or … run queries locally
■ We mostly analyze logs from users' activity
Apache Hadoop
■ Get insights to offer a better product
- “More data usually beats better algorithms”
■ Get insights to make better decisions
- Avoid “guesstimates”
■ Gain a competitive advantage
- More companies have started offering music streaming
What Does Hadoop Allow Us To Do?
■ We use multiple tools and languages
- Hive is very popular among our data analysts
- Crunch for core pipeline jobs
- Some legacy code in Hadoop Streaming with Python
- A number of Pig and Java MapReduce jobs
- Avro as storage format (but we are starting to consider columnar formats)
How Do We Use Hadoop?
■ Primarily used to transport logs
- from multiple servers
- to a central location for storage and analysis
■ A better fit for us than Flume
- We got higher throughput with Kafka
■ We added more features to Kafka
- End-to-end delivery
- Encryption
Apache Kafka
■ A scalable and distributed key-value store
■ Provides fast read-write access for many small pieces of data
- We use it for playlists, user profiles, popularity count
■ Was a better fit for us than HBase
- The NameNode (NN) was a single point of failure (SPOF) at that time
Apache Cassandra
■ Allows us to build complex pipelines of batch jobs
■ Handles dependency resolution, workflow management, visualization and more
■ Our alternative to Oozie and Azkaban
- Spotify, Foursquare, Bitly and more contribute
Luigi
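Dependency resolution in a batch pipeline boils down to running every job only after the jobs it depends on. The sketch below illustrates that idea with the Python standard library's `graphlib` (Python 3.9+); it is not Luigi's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (names are made up).
dependencies = {
    "clean_logs": {"import_logs"},
    "top_lists": {"clean_logs"},
    "label_reports": {"clean_logs"},
    "dashboard": {"top_lists", "label_reports"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A workflow manager adds scheduling, retries, and visualization on top of this ordering; a cyclic dependency would raise `graphlib.CycleError` here.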
We still use them!
■ Powering features that require transaction support, integrity constraints
- e.g. ordering Spotify gift-cards
■ Semi-aggregated data for dashboards
■ Semi-aggregated data for quick analysis
RDBMS
March 2013
Tricky questions were asked!
1. How many servers do you need to buy to survive one year?
2. If we agree, what will you do to use them efficiently?
3. If we agree, do not come back to us this year, OK?
Finance Department
■ Partially responsible for answering these questions!
■ One of the Data Engineers who
- takes care of the 694-node Hadoop-YARN cluster
- implements and troubleshoots users' jobs
- works in a team with Josh, Marcin, Rafal, Fabian and Wouter
■ Hadoop instructor for almost 2 years
■ Co-organizer of Warsaw and Stockholm HUGs
■ Blogger at HakunaMapData.com
Adam Kawa
■ Latency analysis
- milliseconds to wait for music after pressing the “Play” button
■ Capacity planning
- servers, bandwidth, data-center space and more
Operational Metrics
■ Hadoop provides tons of metrics, logs and files
■ They can be analyzed by … Hadoop
Operational Metrics For Hadoop
■ This knowledge can be useful to learn how to
- measure how fast our HDFS is growing
- calculate the empirical retention policy for datasets
- optimize the scheduler
- benchmark the cluster
- and more
What Hadoop Can Tell About Itself
Let's see
a couple of examples
5,000 TB of data
created before
October 1, 2013
Could we
archive data accessed
before this day?
■ You can analyze the FsImage file to learn how fast your HDFS grows
■ You can even correlate this data with
- number of DAU
- total size of logs generated by users
- activity of users e.g. hours streamed
- number of queries / day run by analysts
Advanced HDFS Capacity Planning
■ You can also use the “trend” feature in Ganglia
Simplified HDFS Capacity Planning
If we do
NOTHING, we will
fill the cluster in
September...
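A "when do we fill the cluster" estimate can be approximated by fitting a straight line to daily HDFS usage and extrapolating to the cluster capacity. The sketch below assumes linear growth and uses invented numbers; it does not reproduce the Ganglia trend feature or the FsImage analysis from the talk.

```python
def days_until_full(used_tb_by_day, capacity_tb):
    """Fit a least-squares line to daily usage samples (at least two) and
    return the number of days after the last sample until capacity is
    reached, or None if usage is not growing."""
    n = len(used_tb_by_day)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_tb_by_day) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_tb_by_day))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    # Days are counted from the most recent sample (index n - 1).
    return max(0.0, full_day - (n - 1))


# Hypothetical samples: 10 TB of growth per day, 200 TB capacity.
print(days_until_full([100, 110, 120, 130], capacity_tb=200))
```

Correlating usage with drivers such as DAU or log volume would need a richer model than this single trend line.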
What will we do
to survive longer than September?
■ We introduced an automatic retention policy
- An owner of the dataset specifies a retention period
- If needed, a retention period can be calculated empirically
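The slides do not describe how a retention period is calculated empirically. One possible approach, sketched here purely as an assumption, is to look at how old the data was each time it was read and keep data long enough to cover almost all of those reads. The function name, the 99% threshold, and the sample are hypothetical.

```python
def empirical_retention_days(access_ages_days, coverage_percent=99):
    """Smallest data age (in days) that covers `coverage_percent` percent
    of observed accesses. `coverage_percent` is an integer in 1..100."""
    if not access_ages_days:
        raise ValueError("need at least one observed access")
    if not 0 < coverage_percent <= 100:
        raise ValueError("coverage_percent must be in 1..100")
    ages = sorted(access_ages_days)
    # Integer ceiling of coverage_percent% of the sample size.
    k = (coverage_percent * len(ages) + 99) // 100
    return ages[k - 1]


# Hypothetical access log: most reads hit fresh data, one read hits old data.
sample_ages = [1] * 90 + [7] * 9 + [400]
print(empirical_retention_days(sample_ages))  # prints 7
```

With this sample, keeping data for 7 days would have served 99 of the 100 observed reads; the dataset owner can still override the result.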
We continuously improve
our MapReduce jobs
■ We schedule some jobs each hour, day or week e.g.:
- Top lists for each country
- Reports for the labels, partners, advertisers
Idea
■ Use job statistics from the previous executions of a job
- to optimize the current execution of this job
- to learn about the history of performance of a given job
Recurring MapReduce Jobs
Even a perfect manual setting
may eventually become outdated
when an input dataset grows!
■ A tiny PoC ;)
■ The average task time set to 10 minutes (inspired by LinkedIn)
■ It should help in extreme cases: very short and long living tasks
type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_1 | 4826  | 25       | 46sec        | 1hrs, 52mins, 14sec | 2hrs, 52mins, 16sec
new_1 | 391   | 294      | 4mins, 46sec | 8mins, 24sec        | 23mins, 12sec

type  | # map | # reduce | avg map time | avg reduce time | job execution time
old_2 | 4936  | 800      | 7mins, 30sec | 22mins, 18sec   | 5hrs, 20mins, 1sec
new_2 | 4936  | 1893     | 8mins, 52sec | 7mins, 35sec    | 1hrs, 18mins, 29sec
MapReduce Jobs Autotuning
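The PoC code is not shown in the slides. A minimal sketch of the idea, assuming the total work of the previous run is simply redistributed so that each task takes about 10 minutes, could look like this (the function name is hypothetical):

```python
import math

TARGET_TASK_SECONDS = 10 * 60  # the 10-minute average task time from the slide


def suggest_task_count(prev_tasks, prev_avg_seconds,
                       target_seconds=TARGET_TASK_SECONDS):
    """Suggest a task count for the next run from the previous run's stats."""
    total_work_seconds = prev_tasks * prev_avg_seconds
    return max(1, math.ceil(total_work_seconds / target_seconds))


# Statistics of job old_1 from the table above.
print(suggest_task_count(4826, 46))                      # maps: prints 370
print(suggest_task_count(25, 1 * 3600 + 52 * 60 + 14))   # reduces: prints 281
```

These suggestions land in the same ballpark as the new_1 run (391 maps, 294 reducers), although the actual PoC evidently used a somewhat different calculation. In practice the number of maps is usually influenced indirectly, through the input split size, rather than set directly.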
■ We make data-driven decisions to improve our product
■ Scalable and open-source projects allow us to do that
■ Hadoop, Cassandra, Kafka need love and care
- And passionate people who give it to them
■ Hadoop is like a salutary virus
- It quickly spreads across people and projects
Summary
Questions?
BONUS!
One Question:
What could happen after some time of simultaneous
development of MapReduce jobs,
maintenance of a large cluster,
and listening to perfect music for every moment?
A Possible Answer:
You may discover Hadoop in the lyrics of many popular songs!
Check out spotify.com/jobs or @Spotifyjobs
for more information
kawaa@spotify.com
Check out my blog: HakunaMapData.com
Want to join the band?
Thank you!

More Related Content

What's hot

Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ SpotifyOscar Carlsson
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyNeville Li
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at SpotifyRohan Agrawal
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Erik Bernhardsson
 
Spotify: Dominating Music In Real Time
Spotify: Dominating Music In Real TimeSpotify: Dominating Music In Real Time
Spotify: Dominating Music In Real TimeLHBS
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at SpotifyNeville Li
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at SpotifyErik Bernhardsson
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyChris Johnson
 
Apple presentation.ppt
Apple presentation.pptApple presentation.ppt
Apple presentation.pptRakesh Kumar
 
Airbyte - Series-B deck
Airbyte - Series-B deckAirbyte - Series-B deck
Airbyte - Series-B deckAirbyte
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at ScaleMounia Lalmas-Roelleke
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experienceMounia Lalmas-Roelleke
 
Product School - Spotify presentation
Product School - Spotify presentationProduct School - Spotify presentation
Product School - Spotify presentationSuleiman Younossi
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Esh Vckay
 
BI and Dashboarding Best Practices
 BI and Dashboarding Best Practices BI and Dashboarding Best Practices
BI and Dashboarding Best PracticesRocket Software
 

What's hot (20)

Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
Big data and machine learning @ Spotify
Big data and machine learning @ SpotifyBig data and machine learning @ Spotify
Big data and machine learning @ Spotify
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ Spotify
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at Spotify
 
Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014Music recommendations @ MLConf 2014
Music recommendations @ MLConf 2014
 
Spotify: Dominating Music In Real Time
Spotify: Dominating Music In Real TimeSpotify: Dominating Music In Real Time
Spotify: Dominating Music In Real Time
 
Storm at Spotify
Storm at SpotifyStorm at Spotify
Storm at Spotify
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
Recommending and searching @ Spotify
Recommending and searching @ SpotifyRecommending and searching @ Spotify
Recommending and searching @ Spotify
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at Spotify
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
Apple presentation.ppt
Apple presentation.pptApple presentation.ppt
Apple presentation.ppt
 
Airbyte - Series-B deck
Airbyte - Series-B deckAirbyte - Series-B deck
Airbyte - Series-B deck
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at Scale
 
Personalizing the listening experience
Personalizing the listening experiencePersonalizing the listening experience
Personalizing the listening experience
 
Product School - Spotify presentation
Product School - Spotify presentationProduct School - Spotify presentation
Product School - Spotify presentation
 
Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.Music Personalization : Real time Platforms.
Music Personalization : Real time Platforms.
 
BI and Dashboarding Best Practices
 BI and Dashboarding Best Practices BI and Dashboarding Best Practices
BI and Dashboarding Best Practices
 

Viewers also liked

Spotify architecture - Pressing play
Spotify architecture - Pressing playSpotify architecture - Pressing play
Spotify architecture - Pressing playNiklas Gustavsson
 
Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify TheFamily
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 
Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Peter Antman
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
A Spotify Presentation - Case studies
A Spotify Presentation - Case studiesA Spotify Presentation - Case studies
A Spotify Presentation - Case studiesEmily Wilkinson
 
Africa DevOps Day 2015
Africa DevOps Day 2015Africa DevOps Day 2015
Africa DevOps Day 2015Danielle Jabin
 
A/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at SpotifyA/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at SpotifyDanielle Jabin
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeAdam Kawa
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Adam Kawa
 
Making Better Mistakes Tomorrow
Making Better Mistakes TomorrowMaking Better Mistakes Tomorrow
Making Better Mistakes TomorrowDanielle Jabin
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationAdam Kawa
 
How spotify makes product
How spotify makes productHow spotify makes product
How spotify makes productAli Sarrafi
 
How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...
How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...
How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...Kevin Goldsmith
 
Spotify for Brands
Spotify for BrandsSpotify for Brands
Spotify for BrandsDT
 

Viewers also liked (19)

Spotify: Data center & Backend buildout
Spotify: Data center & Backend buildoutSpotify: Data center & Backend buildout
Spotify: Data center & Backend buildout
 
Spotify architecture - Pressing play
Spotify architecture - Pressing playSpotify architecture - Pressing play
Spotify architecture - Pressing play
 
Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify Activation: From thinking to tweaking it, how we do it at Spotify
Activation: From thinking to tweaking it, how we do it at Spotify
 
Scaling Operations At Spotify
Scaling Operations At SpotifyScaling Operations At Spotify
Scaling Operations At Spotify
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
The Spotify Playbook
The Spotify Playbook The Spotify Playbook
The Spotify Playbook
 
Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
A Spotify Presentation - Case studies
A Spotify Presentation - Case studiesA Spotify Presentation - Case studies
A Spotify Presentation - Case studies
 
Africa DevOps Day 2015
Africa DevOps Day 2015Africa DevOps Day 2015
Africa DevOps Day 2015
 
A/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at SpotifyA/B Testing Pitfalls and Lessons Learned at Spotify
A/B Testing Pitfalls and Lessons Learned at Spotify
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013)
 
Making Better Mistakes Tomorrow
Making Better Mistakes TomorrowMaking Better Mistakes Tomorrow
Making Better Mistakes Tomorrow
 
DevOps at Spotify: There and Back Again
DevOps at Spotify: There and Back AgainDevOps at Spotify: There and Back Again
DevOps at Spotify: There and Back Again
 
Apache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS FederationApache Hadoop YARN, NameNode HA, HDFS Federation
Apache Hadoop YARN, NameNode HA, HDFS Federation
 
How spotify makes product
How spotify makes productHow spotify makes product
How spotify makes product
 
How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...
How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...
How Spotify Builds Products (Organization. Architecture, Autonomy, Accountabi...
 
Spotify for Brands
Spotify for BrandsSpotify for Brands
Spotify for Brands
 

Similar to Big Data At Spotify

Last.fm API workshop - Stockholm
Last.fm API workshop - StockholmLast.fm API workshop - Stockholm
Last.fm API workshop - StockholmMatthew Ogle
 
Drupal case study: ABC Dig Music
Drupal case study: ABC Dig MusicDrupal case study: ABC Dig Music
Drupal case study: ABC Dig MusicDavid Peterson
 
API-Driven Development with OpenAPI Specification Testing
API-Driven Development with OpenAPI Specification TestingAPI-Driven Development with OpenAPI Specification Testing
API-Driven Development with OpenAPI Specification TestingNordic APIs
 
How to build desktop apps that help your web app succeed
How to build desktop apps that help your web app succeedHow to build desktop apps that help your web app succeed
How to build desktop apps that help your web app succeedMatthew Ogle
 
Ell Podcast
Ell PodcastEll Podcast
Ell PodcastLori Roe
 
QCon SP - recommended for you
QCon SP - recommended for youQCon SP - recommended for you
QCon SP - recommended for youTatiana Al-Chueyr
 
Last.fm - Lessons from building the World's largest social music platform
Last.fm - Lessons from building the World's largest social music platform Last.fm - Lessons from building the World's largest social music platform
Last.fm - Lessons from building the World's largest social music platform randomfromtheweb
 
So, What Does a Data Scientist do?
So, What Does a Data Scientist do?So, What Does a Data Scientist do?
So, What Does a Data Scientist do?Jameel Syed
 
All Hail the Committee! Typo 2016 Talk Berlin
All Hail the Committee! Typo 2016 Talk BerlinAll Hail the Committee! Typo 2016 Talk Berlin
All Hail the Committee! Typo 2016 Talk BerlinPrarthana Johnson
 
A Data Scientist in the Music Industry
A Data Scientist in the Music IndustryA Data Scientist in the Music Industry
A Data Scientist in the Music IndustryData Science London
 
Podcasting 201: My First Episode
Podcasting 201: My First EpisodePodcasting 201: My First Episode
Podcasting 201: My First EpisodeHeather Marie Wells
 
Robert Kaye, Musicbrainz : Capturing and sharing the data
Robert Kaye, Musicbrainz  : Capturing and sharing the data Robert Kaye, Musicbrainz  : Capturing and sharing the data
Robert Kaye, Musicbrainz : Capturing and sharing the data MME 4.5 / Music 4.5 / 2Pears
 
Toward a Free Press: An Online Publisher's Toolkit
Toward a Free Press: An Online Publisher's ToolkitToward a Free Press: An Online Publisher's Toolkit
Toward a Free Press: An Online Publisher's ToolkitChristopher Spencer
 
SplunkLive! London 2016 - Shazam
SplunkLive! London 2016 - ShazamSplunkLive! London 2016 - Shazam
SplunkLive! London 2016 - ShazamSplunk
 
Deezer - Big data as a streaming service
Deezer - Big data as a streaming serviceDeezer - Big data as a streaming service
Deezer - Big data as a streaming serviceJulie Knibbe
 
Spotify's Ad Targeting Infrastructure: Achieving Real-time Personalization fo...
Spotify's Ad Targeting Infrastructure: Achieving Real-time Personalization fo...Spotify's Ad Targeting Infrastructure: Achieving Real-time Personalization fo...
Spotify's Ad Targeting Infrastructure: Achieving Real-time Personalization fo...Hakka Labs
 
Real time ads personalization @ Spotify
Real time ads personalization @ SpotifyReal time ads personalization @ Spotify
Real time ads personalization @ SpotifyKinshuk Mishra
 

Similar to Big Data At Spotify (20)

Last.fm API workshop - Stockholm
Last.fm API workshop - StockholmLast.fm API workshop - Stockholm
Last.fm API workshop - Stockholm
 
Drupal case study: ABC Dig Music
Drupal case study: ABC Dig MusicDrupal case study: ABC Dig Music
Drupal case study: ABC Dig Music
 
API-Driven Development with OpenAPI Specification Testing
API-Driven Development with OpenAPI Specification TestingAPI-Driven Development with OpenAPI Specification Testing
API-Driven Development with OpenAPI Specification Testing
 
How to build desktop apps that help your web app succeed
Big Data At Spotify

  • 1. Adam Kawa Data Engineer @ Spotify (Big) Data At Spotify
  • 2.
  • 3. At Spotify, important questions are being asked all the time!
  • 4. Some of these questions are “relatively easy” to answer…
  • 5. 1. How many times has Coldplay been streamed this month? 2. Who was the most popular artist in NYC last week? 3. How many times was “Get Lucky” streamed during first 24h? Labels, Licensor, Partners, Advertisers
  • 6. ■■ Very granular reports are required -- Divided by gender, age, location and more ■■ We have been delivering various reports from day 1 -- Too much data for traditional solutions Reporting
  • 8. Question Who was the most frequently streamed female artist in 2013? Answer? A) Katy Perry B) Lady Gaga C) Madonna D) Rihanna Popular Artists
  • 9. Question Who was the most frequently streamed female artist in 2013? Popular Artists
  • 10. ■■ The Most Popular Male Artist - Macklemore ■■ The Most Popular Band - Imagine Dragons ■■ The Most Popular Track - “Can't Hold Us” Popular Artists In 2013
  • 11. ■■ Users love local artists! -- Berlin - Sido -- London - Coldplay -- Singapore – Vanessa-Mae -- Stockholm - Avicii Popular Artists In 2013
  • 12. ■■ Users love local artists! -- NYC listens to Jay-Z 88% more than the rest of the world -- Stockholm listens to ABBA 110% more than the rest of the world Popular Artists In 2013
  • 13. Question What was the most “viral” track in 2013? Popular Tracks
  • 14. Question What was the most “viral” track in 2013? Answer “Get Lucky” by Daft Punk feat. Pharrell Williams Popular Tracks
  • 15. Artist Analytics – Daft Punk “Get Lucky” was released on April 19th, 2013.
  • 16. Artist Analytics – Daft Punk Around 5x more streams comparing a day “before” and “after” “Get Lucky”
  • 17. Artist Analytics – Daft Punk What happened that day?
  • 18. Artist Analytics – Daft Punk “Random Access Memories” was released on May 17th, 2013.
  • 19. ■■ 09.08.63 – 11.02.2012 Artist Analytics – Whitney Houston
  • 20. ■■ One of the most popular Polish rock bands ever Artist Analytics – Budka Suflera What happened?
  • 21. ■■ One of the most popular Polish rock bands ever Artist Analytics – Budka Suflera Information about the retirement was announced...
  • 22. 1. What was the number of daily active users (DAU) yesterday? 2. How many users have signed up this week? 3. Which country to launch Spotify next? Management And Investors
  • 23. ■■ Analyzing growth -- Number of active users, streamed songs, sign-ups and more -- Where to launch Spotify next ■■ Company KPIs Business Analytics
  • 24. However, some of the questions are really tricky to answer!
  • 25. 1. What song to stream to Jay-Z when he wakes up? 2. Is Adam Kawa bored with Timbuktu today? 3. How to encourage Jeff to go for the Premium Account? Data Scientists, Researchers
  • 26. ■■ Recommendations -- Powering features like Discover, Radio -- “Perfect music for every moment ♪♫ ♬ ♯” ■■ Classification of songs and playlists by genre or mood ■■ Top lists per country Product Features
  • 27. ■■ Overall, in 2013 -- Best Hangover Cure - “The Lazy Song” -- Best Song To Get Over An Ex - “Someone Like You” -- Best Party Starter - “Levels” -- Best Driving Song – “Bohemian Rhapsody” -- Best Work Out Song - “Eye of the Tiger” Perfect Music For Every Moment
  • 28. 1. Is this button nicer than the previous one? 2. How to personalize the messages displayed to users? 3. How should the results of search be displayed? Designers, Feature's Owners
  • 29. ■■ A/B Test -- Come up with promising “look-and-feels” and do A/B tests ■■ Explicit feedback from users -- But users usually do not like to rate things -- But users usually do not like to customize things Designers, Feature's Owners
  • 30. ■■ Sign-up Button On Facebook A/B Test Use Case Sign-up button on the landing page
  • 31. Sign-up Button On Facebook Layouts of sign-up button B – Test Group (50%) A – Control Group (50%)
  • 32. Sign-up Button On Facebook Which one performed better? B – Test Group (50%) A – Control Group (50%) Layouts of sign-up button
  • 33. Sign-up Button On Facebook Layouts of sign-up button Many more sign-ups! A – Control Group (50%) B – Test Group (50%)
  • 34. ■■ “Only 10% are likely to cause a true uplift” - Google after 12K tests -- Be able to iterate fast! ■■ “80% of the time, we are wrong about what consumers want” -- The truth is in data! A/B Tests
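
One common way to answer “Which one performed better?” for a 50/50 split like the sign-up button test is a two-proportion z-test. The sketch below is only an illustration and is not taken from the slides: the function name and the sign-up counts are hypothetical, and the slides do not say which statistical test was used.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and test (B) groups.

    Returns (z, p_value) for a two-sided test with a pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("conversion rate is 0% or 100% in both groups")
    z = (p_b - p_a) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100 of 1000 sign-ups in A, 150 of 1000 in B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests that the difference between the two layouts is unlikely to be noise; with identical rates in both groups the function returns z = 0 and p = 1.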
  • 35. In the past, we guesstimated a bit (common sense, intuition, gut feeling, observations, inspirations)
  • 36. Isn't it inspired by the Windows Start menu button? ;) “KöP!” means “BUY!”
  • 38.
  • 39. To make data-driven decisions, data and data infrastructure are required (among others)
  • 40. ■■ Over 6 million paying subscribers ■■ Over 24 million MAU (monthly active users) ■■ 1.5 billion playlists created so far ■■ Available in 55 countries ■■ Over 20 million songs ■■ 4.5 billion hours streamed in 2013 Users At Spotify
  • 41. ■■ Data generated by users and for users! -- 1.5 TB of compressed data from users per day -- 64 TB of data generated in Hadoop each day (triplicated) (Big) Data At Spotify
  • 42. ■■ Apache Hadoop YARN ■■ Many other systems including: -- Kafka, Luigi, Cassandra, PostgreSQL in production -- Giraph, Tez, Spark in the evaluation mode Data Infrastructure At Spotify
  • 43. ■■ Probably the largest commercial Hadoop cluster in Europe! -- 694 heterogeneous nodes -- 12.63 PB of data used -- ~7,000 jobs each day Apache Hadoop
  • 44. ■■ Used for “off-line” processing -- When Hadoop is down, Spotify still plays music! -- When Hadoop is down, Data Analysts play FIFA, table tennis or … run queries locally ■■ We mostly analyze logs from users' activity Apache Hadoop
  • 45. ■■ Get insights to offer a better product -- “More data usually beats better algorithms” ■■ Get insights to make better decisions -- Avoid “guesstimates” ■■ Gain a competitive advantage -- More companies have started offering music streaming What Does Hadoop Allow Us To Do?
  • 46. ■■ We use multiple tools and languages -- Hive is very popular among our data analysts -- Crunch for core pipeline jobs -- Some legacy code in Hadoop Streaming with Python -- A number of Pig, Java MapReduce jobs -- Avro as storage format (but we are starting to consider columnar formats) How Do We Use Hadoop?
  • 47. ■■ Primarily used to transport logs -- from multiple servers -- to a central location for storage and analysis ■■ A better fit for us than Flume -- We got higher throughput with Kafka ■■ We added more features to Kafka -- End-to-end delivery -- Encryption Apache Kafka
  • 48. ■■ A scalable and distributed key-value store ■■ Provides fast read-write access for many small pieces of data -- We use it for playlists, user profiles, popularity count ■■ Was a better fit for us than HBase -- The NameNode (NN) was the single point of failure (SPOF) at that time Apache Cassandra
  • 49. ■■ Allows us to build complex pipelines of batch jobs ■■ Handles dependency resolution, workflow management, visualization and more ■■ Our alternative to Oozie and Azkaban -- Spotify, Foursquare, Bitly and more contribute Luigi
  • 50. We still use them! ■■ Powering features that require transactions support, integrity constraints -- e.g. ordering Spotify gift-cards ■■ Semi-aggregated data for dashboards ■■ Semi-aggregated data for quick analysis RDBMS
  • 52. 1. How many servers do you need to buy to survive one year? 2. If we agree, what will you do to use them efficiently? 3. If we agree, do not come back to us this year, OK? Finance Department
  • 53. ■ Partially responsible for answering these questions! ■ One of Data Engineers who - takes care of 694-node Hadoop-YARN cluster - implements and troubleshoots users' jobs - works in a team with Josh, Marcin, Rafal, Fabian and Wouter ■ Hadoop instructor for almost 2 years ■ Co-organizer of Warsaw and Stockholm HUGs ■ Blogger at HakunaMapData.com Adam Kawa
  • 54. ■■ Latency analysis - msec to wait for music after pressing the “Play” button ■■ Capacity planning - servers, bandwidth, data-center space and more Operational Metrics
  • 55. ■■ Hadoop provides tons of metrics, logs and files ■■ They can be analyzed by … Hadoop Operational Metrics For Hadoop
  • 56. ■ This knowledge can be useful to learn how to - measure how fast our HDFS is growing - calculate the empirical retention policy for datasets - optimize the scheduler - benchmark the cluster - and more What Hadoop Can Tell About Itself
  • 57. Let's see a couple of examples
  • 58. 5,000 TB of data created before October 1, 2013
  • 59.
  • 60. Could we archive data accessed before this day?
  • 61. ■ You can analyze FsImage file to learn how fast you grow ■ You can even correlate this data with - number of DAU - total size of logs generated by users - activity of users e.g. hours streamed - number of queries / day run by analysts Advanced HDFS Capacity Planning
  • 62. ■ You can also use ''trend feature'' in Ganglia Simplified HDFS Capacity Planning If we do NOTHING, we will fill the cluster in September...
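
The “if we do nothing” projection can be approximated by fitting a straight line to periodic HDFS usage samples and extrapolating to the cluster capacity. This is a minimal sketch that assumes roughly linear growth; the function names, the sample figures and the capacity are hypothetical and do not come from an FsImage or from Ganglia.

```python
import math
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_full_date(samples, capacity_tb):
    """samples: list of (date, used_tb) pairs, oldest first.

    Returns the date on which usage is projected to reach capacity_tb,
    or None if usage is not growing.
    """
    start = samples[0][0]
    xs = [(day - start).days for day, _ in samples]
    ys = [used for _, used in samples]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    days_until_full = (capacity_tb - intercept) / slope
    return start + timedelta(days=math.ceil(days_until_full))


# Hypothetical samples: usage grows by 1 TB per day from 100 TB.
samples = [
    (date(2014, 1, 1), 100.0),
    (date(2014, 1, 11), 110.0),
    (date(2014, 1, 21), 120.0),
]
print(projected_full_date(samples, capacity_tb=200.0))
```

The same fit can be repeated against other series mentioned above, such as DAU or hours streamed, to see which one tracks storage growth most closely.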
  • 63. What will we do to survive longer than September?
  • 64. ■ We introduced an automatic retention policy - An owner of the dataset specifies a retention period - If needed, a retention period can be calculated empirically
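
The slides do not describe how a retention period is calculated empirically. One possible approach, shown here purely as an assumption, is to record how old the data was each time it was read and to pick the age that covers most of those reads. The function name and the sample ages are hypothetical.

```python
import math


def empirical_retention_days(access_ages_days, coverage=0.95):
    """Smallest age (in days) such that at least `coverage` of the
    observed reads touched data no older than that age."""
    if not access_ages_days:
        raise ValueError("no access observations")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(access_ages_days)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]


# Hypothetical ages (days between creation and read) for one dataset.
ages = [1, 1, 2, 3, 5, 8, 13, 21, 34, 400]
print(empirical_retention_days(ages, coverage=0.5))
```

Choosing a coverage below 1.0 keeps a single very late read (the 400-day outlier here) from forcing the whole dataset to be kept for that long.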
  • 65. We continuously improve our MapReduce jobs
  • 66. ■ We schedule some jobs each hour, day or week e.g.: - Top lists for each country - Reports for the labels, partners, advertisers Idea ■ Use job statistics from the previous executions of a job - to optimize the current execution of this job - to learn about the history of performance of a given job Recurring MapReduce Jobs Even perfect manual setting may become eventually outdated when an input dataset grows!
  • 67. ■ A tiny PoC ;) ■ The average task time set to 10 minutes (inspired by LinkedIn) ■ It should help in extreme cases: very short and long living tasks MapReduce Jobs Autotuning
        type    # map   # reduce   avg map time    avg reduce time        job execution time
        old_1   4826    25         46sec           1hrs, 52mins, 14sec    2hrs, 52mins, 16sec
        new_1   391     294        4mins, 46sec    8mins, 24sec           23mins, 12sec

        type    # map   # reduce   avg map time    avg reduce time        job execution time
        old_2   4936    800        7mins, 30sec    22mins, 18sec          5hrs, 20mins, 1sec
        new_2   4936    1893       8mins, 52sec    7mins, 35sec           1hrs, 18mins, 29sec
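
The idea of reusing statistics from a previous run can be sketched as follows: take the total task time observed last time and choose a task count that brings the average task close to the 10-minute target. This is a simplified illustration and not the PoC itself; the function name is hypothetical, and the formula only roughly approximates the task counts reported for new_1 above.

```python
TARGET_TASK_SECONDS = 10 * 60  # 10-minute average task time, as on the slide


def suggest_task_count(prev_tasks, prev_avg_task_seconds,
                       target_seconds=TARGET_TASK_SECONDS):
    """Suggest a task count so that the same total work, split evenly,
    gives tasks of about target_seconds each. Inputs are whole seconds."""
    total_seconds = prev_tasks * prev_avg_task_seconds
    # Integer ceiling division; always schedule at least one task.
    return max(1, -(-total_seconds // target_seconds))


# old_1 from the figures above: 25 reduces averaging 1h 52m 14s (6734 s),
# and 4826 maps averaging 46 s.
print(suggest_task_count(25, 6734))
print(suggest_task_count(4826, 46))
```

Because the suggestion is recomputed from the latest run, it keeps adjusting as the input dataset grows, which is the weakness of a fixed manual setting noted on the previous slide.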
  • 68. ■ We make data-driven decisions to improve our product ■ Scalable and open-source projects allow us to do that ■ Hadoop, Cassandra, Kafka need love and care - And passionate people who give it to them ■ Hadoop is like a salutary virus - It quickly spreads across people and projects Summary
  • 71. One Question: What could happen after some time of simultaneous development of MapReduce jobs, maintenance of a large cluster, and listening to perfect music for every moment?
  • 72. A Possible Answer: You may discover Hadoop in the lyrics of many popular songs!
  • 73.
  • 74. Check out spotify.com/jobs or @Spotifyjobs for more information kawaa@spotify.com Check out my blog: HakunaMapData.com Want to join the band?