June 12, 2014
Danielle Jabin
Data Engineer, A/B Testing
Data at Spotify
I’m Danielle Jabin
•  Data Engineer in the Stockholm office
•  A/B testing infrastructure
•  California born & raised
•  If I can survive a Swedish winter, so can you!
•  Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
Over 40 million active users
As of June 9, 2014	
  
Access to more than 20 million songs
As of June 9, 2014	
  
Big Data
•  40 million Monthly Active Users
•  20+ million tracks
•  1.5 TB of compressed data from users per day
•  64 TB of data generated in Hadoop each day (including a replication factor of 3)
As of June 9, 2014	
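A rough reading of the last two bullets, as a sketch only. It assumes the 64 TB figure counts every replica, so the unique data is one third of it; the variable names are illustrative, and the ratio compares compressed input with Hadoop output, so treat it as an order of magnitude rather than a measured value.

```python
# Figures taken from the slide (as of June 9, 2014).
HADOOP_TB_PER_DAY = 64      # includes replication
REPLICATION_FACTOR = 3
USER_TB_PER_DAY = 1.5       # compressed data arriving from users

# Assumption: every block is stored REPLICATION_FACTOR times,
# so unique data is the total divided by the factor.
unique_tb_per_day = HADOOP_TB_PER_DAY / REPLICATION_FACTOR

# Unique Hadoop output produced per TB of compressed user input.
expansion = unique_tb_per_day / USER_TB_PER_DAY

print(f"Unique data per day: {unique_tb_per_day:.1f} TB")
print(f"Unique output per TB of compressed input: {expansion:.1f}x")
```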
  
So how much data is that?
Let’s compare: 64 TB
•  293,203,072 books (200 pages or 240,000 characters)
•  16,777,216 MP3 files (with 4MB average file size)
•  22,369,600 images (with 3MB average file size)
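The comparison can be checked with a few lines of arithmetic. This sketch assumes binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and one byte per character; under those assumptions the MP3 figure is reproduced exactly, and the book and image counts land within rounding of the numbers on the slide.

```python
TB = 2 ** 40   # assumed binary terabyte
MB = 2 ** 20   # assumed binary megabyte

total_bytes = 64 * TB

# One book is taken as 240,000 characters at one byte each (assumption).
books = total_bytes // 240_000
mp3s = total_bytes // (4 * MB)     # 4 MB average MP3
images = total_bytes // (3 * MB)   # 3 MB average image

print(f"books:  {books:,}")
print(f"mp3s:   {mp3s:,}")
print(f"images: {images:,}")
```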
That’s a lot of selfies
How do we use this data?
Use Cases
•  Reporting
•  Business Analytics
•  Operational Analytics
•  Product Features
Reporting
•  Reporting to labels, licensors, partners, and advertisers
•  We support our partners
Business Analytics
•  Analyzing growth, user behavior, sign-up funnels, etc.
•  Company KPIs
•  NPS analysis
Operational Metrics
•  Root cause analysis
•  Latency analysis
•  Better capacity planning (servers, people, bandwidth)
Product Features
•  Discover and Radio
•  Top lists
•  Personalized recommendations
•  A/B Testing
How do we collect this data?
The three pillars of our Data Infrastructure:
•  Kafka: Collection
•  Hadoop: Processing
•  Databases: Analytics/Visualization
This is Dave. Data Engineer at Spotify by day…
…chiptune DJ Demoscene Time Machine by night.
Let’s listen to Dave’s song
Kafka
•  High volume pub-sub system
•  “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
Kafka
•  Robust and scalable solution for collection of logs
•  Fast data transfer
•  Low CPU overhead
•  Built-in partitioning, replication, and fault-tolerance
•  Consumers can pull data at different rates
•  Able to handle extremely high volumes
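The publish/subscribe idea on these two slides, including consumers pulling at different rates, can be illustrated with a toy in-memory model. This is only a sketch: `ToyBroker`, `ToyConsumer`, and the `play-events` topic are hypothetical names, and none of this is the Kafka client API. It shows one append-only log per topic, with each consumer keeping its own read offset.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a pub-sub log; not the Kafka client API."""

    def __init__(self):
        # topic name -> append-only list of messages
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def fetch(self, topic, offset, max_messages):
        """Return up to max_messages starting at the given offset."""
        return self.topics[topic][offset:offset + max_messages]


class ToyConsumer:
    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0   # each consumer tracks its own position

    def poll(self, max_messages):
        batch = self.broker.fetch(self.topic, self.offset, max_messages)
        self.offset += len(batch)
        return batch


broker = ToyBroker()
for i in range(5):
    broker.publish("play-events", f"event-{i}")

fast = ToyConsumer(broker, "play-events")
slow = ToyConsumer(broker, "play-events")

fast_batch = fast.poll(max_messages=5)   # reads everything at once
slow_batch = slow.poll(max_messages=2)   # reads a small batch, catches up later
print(fast_batch)
print(slow_batch)
```

Because the broker only appends and consumers only advance their own offsets, a slow consumer never holds back a fast one.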
Other people listened too!
Hadoop
•  Process and store massive amounts of unstructured data
across a distributed cluster
•  One cluster, grown from 37 nodes to 690 nodes today
•  28 PB of storage
•  The largest Hadoop cluster in Europe
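To make the processing model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that Hadoop distributes across a cluster. The log format (`user_id<TAB>track_id`) and the sample lines are hypothetical, and nothing here touches Hadoop itself.

```python
from collections import defaultdict

# Hypothetical log lines: "<user_id>\t<track_id>"
log_lines = [
    "u1\ttrackA",
    "u2\ttrackB",
    "u1\ttrackA",
    "u3\ttrackA",
]


def map_phase(line):
    """Emit one (track, 1) pair per play event."""
    _user, track = line.split("\t")
    yield track, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts for one track."""
    return key, sum(values)


mapped = [pair for line in log_lines for pair in map_phase(line)]
play_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(play_counts)
```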
Hadoop
•  Entering the land of optimizations
•  Data retention policy
•  Move to JVM-based languages
•  MapReduce languages
•  Moving to Crunch, JVM-based, for speed and scalability
•  Python with Hadoop Streaming, Java, Hive, Pig, Scala
•  Sprunch: Crunch wrapper for Scala, open sourced by Spotify
•  Spotify open-sourced scheduler, Luigi, written in Python
•  Simple and easy way to chain jobs
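Chaining jobs means each job declares what it depends on, and a runner executes upstream jobs first. The sketch below shows that idea with plain Python classes. It is not Luigi's API: the `Task` base class, `build` function, and the three example tasks are hypothetical, and a real scheduler would also track outputs on disk and handle failures.

```python
class Task:
    """Toy task: subclasses list upstream tasks and do their work in run()."""

    done = False

    def requires(self):
        return []

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run upstream tasks first, then the task itself, skipping finished ones."""
    if task.done:
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    task.done = True
    log.append(type(task).__name__)


class CollectLogs(Task):
    def run(self):
        self.rows = ["trackA", "trackB", "trackA"]   # hypothetical input


class CountPlays(Task):
    def __init__(self, collect):
        self.collect = collect

    def requires(self):
        return [self.collect]

    def run(self):
        self.counts = {}
        for track in self.collect.rows:
            self.counts[track] = self.counts.get(track, 0) + 1


class TopTrack(Task):
    def __init__(self, count):
        self.count = count

    def requires(self):
        return [self.count]

    def run(self):
        counts = self.count.counts
        self.top = max(counts, key=counts.get)


collect = CollectLogs()
count = CountPlays(collect)
top = TopTrack(count)

run_order = []
build(top, run_order)   # asking for the last job pulls in the whole chain
print(run_order)
print(top.top)
```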
What if we want to know more?
Databases
•  Aggregates from Hadoop put into PostgreSQL or Cassandra
•  Sqoop
•  Core data can be used and manipulated for various needs
•  Ad hoc queries
•  Dashboards
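As an illustration of the aggregate-then-query pattern, the sketch below loads a handful of made-up daily aggregates into a relational table and runs one ad hoc query. SQLite (from the Python standard library) stands in for PostgreSQL so the example is self-contained; the table name, columns, and values are all hypothetical.

```python
import sqlite3

# Hypothetical daily aggregates, shaped like the output of a Hadoop job.
aggregates = [
    ("2014-06-08", "SE", 120),
    ("2014-06-08", "US", 300),
    ("2014-06-09", "SE", 150),
    ("2014-06-09", "US", 280),
]

# SQLite stands in for PostgreSQL here (assumption for a runnable example).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_streams (day TEXT, country TEXT, streams INTEGER)"
)
conn.executemany("INSERT INTO daily_streams VALUES (?, ?, ?)", aggregates)

# An ad hoc query: total streams per country, largest first.
totals = conn.execute(
    "SELECT country, SUM(streams) AS total "
    "FROM daily_streams GROUP BY country ORDER BY total DESC"
).fetchall()
print(totals)
```

Keeping only aggregates in the database keeps queries fast, while the raw events stay in Hadoop.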
Questions?
A/B testing questions? Find me!
Thank you!

