2. I’m Danielle Jabin
• Data Engineer in the Stockholm office
• A/B testing infrastructure
• California born & raised
• If I can survive a Swedish winter, so can you!
• Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
5. Big Data
• 40 million Monthly Active Users
• 20+ million tracks
• 1.5 TB of compressed data from users per day
• 64 TB of data generated in Hadoop each day (including a replication factor of 3)
As of June 9, 2014
20. Kafka
• High-volume pub-sub system
• “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
21. Kafka
• Robust and scalable solution for collection of logs
• Fast data transfer
• Low CPU overhead
• Built-in partitioning, replication, and fault-tolerance
• Consumers can pull data at different rates
• Able to handle extremely high volumes
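The topic/consumer-group model above can be sketched in a few lines. This is a toy in-memory stand-in for a Kafka broker, not the Kafka client API; the class and method names are illustrative only. It shows how per-group offsets let consumers pull from the same topic at different rates:

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory model of Kafka's topic log: producers append
    messages, and each consumer group pulls at its own pace by
    tracking its own offset into the log."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only log
        self.offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        # Producers only ever append; existing messages are immutable.
        self.topics[topic].append(message)

    def consume(self, group, topic, max_messages=10):
        # Each group advances independently, so a slow consumer
        # never holds back a fast one reading the same topic.
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(group, topic)] += len(batch)
        return batch

broker = MiniBroker()
broker.publish("plays", "track:1")
broker.publish("plays", "track:2")
print(broker.consume("dashboards", "plays"))        # both messages at once
print(broker.consume("hadoop-etl", "plays", 1))     # same topic, one at a time
```

A real broker adds partitioning, replication, and durable storage on top of this append-and-offset core, which is what the bullets above refer to.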
23. Hadoop
• Process and store massive amounts of unstructured data across a distributed cluster
• Grew from one 37-node cluster to 690 nodes today
• 28 PB of storage
• The largest Hadoop cluster in Europe
24. Hadoop
• Entering the land of optimizations
• Data retention policy
• Move to JVM-based languages
• MapReduce languages
• Moving to Crunch, JVM-based, for speed and scalability
• Python with Hadoop Streaming, Java, Hive, Pig, Scala
• Sprunch: Scala wrapper for Crunch, open sourced by Spotify
• Luigi: Spotify’s open-sourced job scheduler, written in Python
• Simple and easy way to chain jobs
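Hadoop Streaming, mentioned above, runs ordinary scripts as the map and reduce phases, reading and writing tab-separated `key\tvalue` lines. Below is a minimal word-count sketch in that style; the function names are illustrative and it runs locally, with a `sorted()` call standing in for Hadoop’s sort/shuffle step:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit "word\t1" for every word in the input lines.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts mapper output by key before the reduce phase,
    # so identical words arrive as one contiguous run.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

lines = ["to be or not to be"]
shuffled = sorted(mapper(lines))      # stands in for Hadoop's sort/shuffle
print(list(reducer(shuffled)))        # [('be', 2), ... as "word\tcount" lines]
```

In a real job the same two functions would read stdin and write stdout inside `hadoop jar hadoop-streaming.jar`; JVM-based options like Crunch avoid the per-record serialization cost of this stdin/stdout boundary, which is the speed argument in the bullets above.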
26. Databases
• Aggregates from Hadoop put into PostgreSQL or Cassandra
• Sqoop for bulk transfer between Hadoop and relational databases
• Core data can be used and manipulated for various needs
• Ad hoc queries
• Dashboards
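The flow in this slide can be sketched end to end: aggregates land in a relational database, and dashboards or analysts run ad hoc SQL against them. SQLite stands in for PostgreSQL here, and the table name and columns are illustrative, not Spotify’s actual schema:

```python
import sqlite3

# SQLite stands in for PostgreSQL; daily_plays is a hypothetical aggregate table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_plays (day TEXT, country TEXT, plays INTEGER)")

# In production these rows would arrive from Hadoop via a tool like Sqoop;
# here we insert a few pre-computed aggregates by hand.
rows = [("2014-06-09", "SE", 1200),
        ("2014-06-09", "US", 5400),
        ("2014-06-10", "SE", 1350)]
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", rows)

# An ad hoc query of the kind a dashboard might run.
total = conn.execute(
    "SELECT country, SUM(plays) FROM daily_plays "
    "GROUP BY country ORDER BY country"
).fetchall()
print(total)  # [('SE', 2550), ('US', 5400)]
```

Keeping only small, pre-aggregated tables in the relational store is what makes these interactive queries cheap; the raw event data stays in Hadoop.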