SlideShare a Scribd company logo
1 of 56
Download to read offline
The Evolution of
Hadoop at Spotify
Through Failures and Pain
Josh Baer (jbx@spotify.com)
Rafal Wojdyla (rav@spotify.com) 1
Note: Our views are our own and don't necessarily represent those of Spotify.
2
• Growing Pains (2009-2012)
• Gaining Focus (2013 - 2014)
• The Future (2015+)
Overview
3
Our First Major
Hadoop Bug
4
Cluster 1.0
What is Spotify?
• Music Streaming Service
• Browse and Discover Millions of
Songs, Artists and Albums
• Launched in October 2008
• December 2014:
• 60 Million Monthly Users
• 15 Million Paid Subscribers
5
What is Spotify?
• Data Infrastructure:
• 1300 Hadoop Nodes
• 42 PB Storage
• 20 TB data ingested via Kafka/day
• 200 TB generated by Hadoop/day
6
7
select artist_id, count(1)
from user_activities
where play_seconds > 30
group by artist_id;
7
select artist_id, count(1)
from user_activities
where play_seconds > 30
group by artist_id;
7
0	
  *	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_import.jar	
  
15	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_listeners.jar	
  
30	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐analytics	
  hadoop	
  jar	
  user_funnel_hourly.jar	
  
*	
  1	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  daily_aggregate.jar	
  
*	
  2	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  calculate_royalties.jar	
  
*/2	
  22	
  *	
  *	
  *	
  spotify-­‐radio	
  	
  	
  	
  	
  hadoop	
  jar	
  generate_radio.jar	
  
8
0	
  *	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_import.jar	
  
15	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  hourly_listeners.jar	
  
30	
  *	
  *	
  *	
  *	
  	
  	
  spotify-­‐analytics	
  hadoop	
  jar	
  user_funnel_hourly.jar	
  
*	
  1	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  daily_aggregate.jar	
  
*	
  2	
  *	
  *	
  *	
  	
  	
  	
  spotify-­‐core	
  	
  	
  	
  	
  	
  hadoop	
  jar	
  calculate_royalties.jar	
  
*/2	
  22	
  *	
  *	
  *	
  spotify-­‐radio	
  	
  	
  	
  	
  hadoop	
  jar	
  generate_radio.jar	
  
8
9
Handles the ‘plumbing’ for Hadoop jobs
https://github.com/spotify/luigi
10
10
1111
To the Cloud!
1111
To the Cloud!
1111
12
#	
  sudo	
  addgroup	
  hadoop	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  hdfs	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  yarn	
  
#	
  cp	
  /tmp/configs/*.xml	
  /etc/hadoop/conf/	
  
#	
  apt-­‐get	
  update	
  
…	
  
[hdfs@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐hdfs-­‐datanode	
  
…	
  
[yarn@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐yarn-­‐nodemanager	
  
12
#	
  sudo	
  addgroup	
  hadoop	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  hdfs	
  
#	
  sudo	
  adduser	
  —ingroup	
  hadoop	
  yarn	
  
#	
  cp	
  /tmp/configs/*.xml	
  /etc/hadoop/conf/	
  
#	
  apt-­‐get	
  update	
  
…	
  
[hdfs@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐hdfs-­‐datanode	
  
…	
  
[yarn@sj-­‐hadoop-­‐b20	
  ~]	
  $	
  apt-­‐get	
  install	
  hadoop-­‐yarn-­‐nodemanager	
  
13
Automated
Config Management
(via Puppet)
14
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data	
  
Found	
  3	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  lake	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  pond	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  ocean	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data/lake	
  
Found	
  1	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  1321451	
  2015-­‐01-­‐01	
  12:00	
  boats.txt	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐cat	
  /data/lake/boats.txt	
  
…
14
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data	
  
Found	
  3	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  lake	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  pond	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  0	
  2015-­‐01-­‐01	
  12:00	
  ocean	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐ls	
  /data/lake	
  
Found	
  1	
  items	
  
drwxr-­‐xr-­‐x	
  	
  	
  -­‐	
  hdfs	
  hadoop	
  	
  	
  	
  	
  1321451	
  2015-­‐01-­‐01	
  12:00	
  boats.txt	
  
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  hdfs	
  dfs	
  -­‐cat	
  /data/lake/boats.txt	
  
…
15
$	
  time	
  for	
  i	
  in	
  {1..100};	
  do	
  hadoop	
  fs	
  -­‐ls	
  /	
  >	
  /dev/null;	
  done	
  
real	
   3m32.014s	
  
user	
   6m15.891s	
  
sys	
  	
  	
  	
  0m18.821s	
  
$	
  time	
  for	
  i	
  in	
  {1..100};	
  do	
  snakebite	
  ls	
  /	
  >	
  /dev/null;	
  done	
  
real	
   0m34.760s	
  
user	
   0m29.962s	
  
sys	
  	
  	
  	
  0m4.512s	
  
16
17
Gaining Focus
(2013-2014)
18
• In 2013, expanded to 200 nodes
• Hadoop critical
• Needed a team totally focused on it
• Created a ‘squad’ with two missions:
• Migrate to a new distribution with Yarn
• Make Hadoop reliable
Forming a team
19
19
Hadoop ownerless
19
Hadoop ownerless
Squad
19
Hadoop ownerless Upgrades
Squad
19
Hadoop ownerless Upgrades Getting there
Squad
20
• Alert on service level problems (i.e. no jobs running)
• Keep your alarm channel clean. Beware of alert fatigue.
Alerting
21
Uhh ohh…..
I think I made a mistake
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  snakebite	
  rm	
  -­‐R	
  /team/disco/	
  CF/test-­‐10/	
  
22
Goodbye Data (1PB)
[data-­‐sci@sj-­‐edge-­‐a1	
  ~]	
  $	
  snakebite	
  rm	
  -­‐R	
  /team/disco/	
  CF/test-­‐10/	
  
22
OK:	
  Deleted	
  /team/disco	
  
23
• “Sit on your hands before you type” - Wouter de Bie
• Users will always want to retain data!
• Remove superusers from ‘edgenodes’
• Moving to trash = client-side implementation
Lessons Learned
24
The Wild Wild West
25
• Same hardware profile as production cluster
• Similar configuration
• Staging environment
• Reliable
Pre-Production Cluster
26
Automated Testing
27
28
Moving Data
29
• Features:
• Data discovery
• Lineage
• Lifecycle management
• More
• We use it for data movement
• Uses Oozie behind the scenes
Apache Falcon
• Most of our jobs were Hadoop (python) Streaming
• Lots of failures, slow performance
• Had to find a better way….
30
Improving Performance
31
• Investigated several frameworks
• Selected Crunch:
• Real types!
• Higher level API
• Easier to test
• Better performance #JVM_FTW
*Dave Whiting’s analysis of systems: http://thewit.ch/scalding_crunchy_pig
Improving Performance
32
33
Let’s Review
34
The Future
(2015+)
35
Note: Spotify users is based on publicly released numbers only
36
Explosive Growth
• Increased Spotify Users
• Increased use cases
• Increased Engineers
37
http://everynoise.com/engenremap.html
38
Scaling machines: easy
Scaling people: hard
39
User Feedback:
Automate IT!
40
Data Management
41
• Data-discovery tool
• Luigi Integration
• Find and browse datasets
• View schemas
• Trace lineage
• Open-source plans? :-(
Raynor
42
Two Takeaways
• Automate Everything
• More time to play FIFA build cool tools
• Listen to your users
• Fail fast, don’t be afraid to scrap work
43
Join the Band!
Engineers wanted in NYC & Stockholm
http://spotify.com/jobs

More Related Content

What's hot

Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ SpotifyNikhil Tibrewal
 
How data drives spotify
How data drives spotifyHow data drives spotify
How data drives spotifyAli Sarrafi
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyChris Johnson
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with SparkChris Johnson
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At SpotifyVidhya Murali
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with SparkChris Johnson
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoveryKarthik Murugesan
 
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ NetflixWhoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ NetflixDataWorks Summit
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyNeville Li
 
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...confluent
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyChris Johnson
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at SpotifyRohan Agrawal
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyVidhya Murali
 
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...DataWorks Summit
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at SpotifyErik Bernhardsson
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyVidhya Murali
 
Interactive Analytics in Human Time
Interactive Analytics in Human TimeInteractive Analytics in Human Time
Interactive Analytics in Human TimeDataWorks Summit
 

What's hot (20)

Playlist Recommendations @ Spotify
Playlist Recommendations @ SpotifyPlaylist Recommendations @ Spotify
Playlist Recommendations @ Spotify
 
How data drives spotify
How data drives spotifyHow data drives spotify
How data drives spotify
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with Spark
 
Music Personalization At Spotify
Music Personalization At SpotifyMusic Personalization At Spotify
Music Personalization At Spotify
 
Music Recommendations at Scale with Spark
Music Recommendations at Scale with SparkMusic Recommendations at Scale with Spark
Music Recommendations at Scale with Spark
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music Discovery
 
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ NetflixWhoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
Whoops, The Numbers Are Wrong! Scaling Data Quality @ Netflix
 
Scala Data Pipelines @ Spotify
Scala Data Pipelines @ SpotifyScala Data Pipelines @ Spotify
Scala Data Pipelines @ Spotify
 
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3...
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 
Personalized Playlists at Spotify
Personalized Playlists at SpotifyPersonalized Playlists at Spotify
Personalized Playlists at Spotify
 
Building Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at SpotifyBuilding Data Pipelines for Music Recommendations at Spotify
Building Data Pipelines for Music Recommendations at Spotify
 
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
Interactive real-time dashboards on data streams using Kafka, Druid, and Supe...
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Distributed "Web Scale" Systems
Distributed "Web Scale" SystemsDistributed "Web Scale" Systems
Distributed "Web Scale" Systems
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at Spotify
 
CF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At SpotifyCF Models for Music Recommendations At Spotify
CF Models for Music Recommendations At Spotify
 
Interactive Analytics in Human Time
Interactive Analytics in Human TimeInteractive Analytics in Human Time
Interactive Analytics in Human Time
 

Viewers also liked

Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning Aurion Learning
 
We’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOAWe’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOAJon Collins
 
A thousand fronts: on the architectures I like
A thousand fronts: on the architectures I likeA thousand fronts: on the architectures I like
A thousand fronts: on the architectures I likeDavide Tommaso Ferrando
 
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)360i
 
4.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-20164.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-2016Meghan Garnett
 
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.Hubbub Media
 
Big Idea: FIction & Non-Fiction
Big Idea: FIction & Non-FictionBig Idea: FIction & Non-Fiction
Big Idea: FIction & Non-FictionAngela Maiers
 
Spotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda ArchitectureSpotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda ArchitectureEsh Vckay
 
When Brands Attack
When Brands AttackWhen Brands Attack
When Brands AttackDirk Herbert
 
Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )Valeriu Margescu
 
exploring architecture and music
exploring architecture and musicexploring architecture and music
exploring architecture and musicmichielmoyaert
 
Structural Project 1 Group 1
Structural Project 1 Group 1Structural Project 1 Group 1
Structural Project 1 Group 1guestb2749c7
 
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...Davide Tommaso Ferrando
 
Architectural structures world Wide
Architectural structures   world WideArchitectural structures   world Wide
Architectural structures world WideSagun Rakibe
 
Notaable Buildings Around The World
Notaable Buildings Around The WorldNotaable Buildings Around The World
Notaable Buildings Around The WorldFakhrul Afifi
 

Viewers also liked (20)

Postmodernism
PostmodernismPostmodernism
Postmodernism
 
Barnes and Noble
Barnes and NobleBarnes and Noble
Barnes and Noble
 
Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning Moving into movies - using video in E-Learning
Moving into movies - using video in E-Learning
 
We’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOAWe’ve created a monster! Truth and fiction in SOA
We’ve created a monster! Truth and fiction in SOA
 
A thousand fronts: on the architectures I like
A thousand fronts: on the architectures I likeA thousand fronts: on the architectures I like
A thousand fronts: on the architectures I like
 
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
360i Idea Safari: The Hunt of the Mysterious BIG IDEA (Presented at Cannes 2012)
 
Math music and architecture
Math music and architectureMath music and architecture
Math music and architecture
 
4.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-20164.4 mb portfolio print 2012-2016
4.4 mb portfolio print 2012-2016
 
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
Story, Sci-Fi & Transmedia to develop Corporate Technology Strategies.
 
Big Idea: FIction & Non-Fiction
Big Idea: FIction & Non-FictionBig Idea: FIction & Non-Fiction
Big Idea: FIction & Non-Fiction
 
Hamburg 2012
Hamburg 2012Hamburg 2012
Hamburg 2012
 
Spotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda ArchitectureSpotify's Music Recommendations Lambda Architecture
Spotify's Music Recommendations Lambda Architecture
 
When Brands Attack
When Brands AttackWhen Brands Attack
When Brands Attack
 
Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )Nature, Architecture And Music (V M )
Nature, Architecture And Music (V M )
 
exploring architecture and music
exploring architecture and musicexploring architecture and music
exploring architecture and music
 
Structural Project 1 Group 1
Structural Project 1 Group 1Structural Project 1 Group 1
Structural Project 1 Group 1
 
Urban Design
Urban DesignUrban Design
Urban Design
 
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
But Today We Collect Bullshit: Architecture and Storytelling in the Age of So...
 
Architectural structures world Wide
Architectural structures   world WideArchitectural structures   world Wide
Architectural structures world Wide
 
Notaable Buildings Around The World
Notaable Buildings Around The WorldNotaable Buildings Around The World
Notaable Buildings Around The World
 

Similar to The Evolution of Hadoop at Spotify - Through Failures and Pain

Unleash your cluster with YARN
Unleash your cluster with YARNUnleash your cluster with YARN
Unleash your cluster with YARNFerran Galí Reniu
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Ferran Galí Reniu
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...DataStax
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Toshihiro Suzuki
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoopfvanvollenhoven
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an exampleNikita Kesharwani
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Adam Kawa
 
RMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using CrawlzillaRMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using CrawlzillaJazz Yao-Tsung Wang
 
Druid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdfDruid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdfHimanshuGupta936
 
Why Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container TechnologyWhy Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container TechnologySagi Brody
 

Similar to The Evolution of Hadoop at Spotify - Through Failures and Pain (20)

Unleash your cluster with YARN
Unleash your cluster with YARNUnleash your cluster with YARN
Unleash your cluster with YARN
 
Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)Yarn by default (Spark on YARN)
Yarn by default (Spark on YARN)
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
Hecuba2: Cassandra Operations Made Easy (Radovan Zvoncek, Spotify) | C* Summi...
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Hadoop bootcamp getting started
Hadoop bootcamp getting startedHadoop bootcamp getting started
Hadoop bootcamp getting started
 
Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤Amebaサービスのログ解析基盤
Amebaサービスのログ解析基盤
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
20080529dublinpt1
20080529dublinpt120080529dublinpt1
20080529dublinpt1
 
Backups
BackupsBackups
Backups
 
Big Data
Big DataBig Data
Big Data
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
RMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using CrawlzillaRMLL 2013 : Build Your Personal Search Engine using Crawlzilla
RMLL 2013 : Build Your Personal Search Engine using Crawlzilla
 
Druid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdfDruid at Strata Conf NY 2016.pdf
Druid at Strata Conf NY 2016.pdf
 
#WeSpeakLinux Session
#WeSpeakLinux Session#WeSpeakLinux Session
#WeSpeakLinux Session
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Why Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container TechnologyWhy Managed Service Providers Should Embrace Container Technology
Why Managed Service Providers Should Embrace Container Technology
 

Recently uploaded

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 

Recently uploaded (20)

Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

The Evolution of Hadoop at Spotify - Through Failures and Pain

  • 1. The Evolution of Hadoop at Spotify Through Failures and Pain Josh Baer (jbx@spotify.com) Rafal Wojdyla (rav@spotify.com) 1 Note: Our views are our own and don't necessarily represent those of Spotify.
  • 2. 2 • Growing Pains (2009-2012) • Gaining Focus (2013 - 2014) • The Future (2015+) Overview
  • 5. What is Spotify? • Music Streaming Service • Browse and Discover Millions of Songs, Artists and Albums • Launched in October 2008 • December 2014: • 60 Million Monthly Users • 15 Million Paid Subscribers 5
  • 6. What is Spotify? • Data Infrastructure: • 1300 Hadoop Nodes • 42 PB Storage • 20 TB data ingested via Kafka/day • 200 TB generated by Hadoop/day 6
  • 7. 7 select artist_id, count(1) from user_activities where play_seconds > 30 group by artist_id;
  • 8. 7 select artist_id, count(1) from user_activities where play_seconds > 30 group by artist_id;
  • 9. 7
  • 10. 0  *  *  *  *        spotify-­‐core            hadoop  jar  hourly_import.jar   15  *  *  *  *      spotify-­‐core            hadoop  jar  hourly_listeners.jar   30  *  *  *  *      spotify-­‐analytics  hadoop  jar  user_funnel_hourly.jar   *  1  *  *  *        spotify-­‐core            hadoop  jar  daily_aggregate.jar   *  2  *  *  *        spotify-­‐core            hadoop  jar  calculate_royalties.jar   */2  22  *  *  *  spotify-­‐radio          hadoop  jar  generate_radio.jar   8
  • 11. 0  *  *  *  *        spotify-­‐core            hadoop  jar  hourly_import.jar   15  *  *  *  *      spotify-­‐core            hadoop  jar  hourly_listeners.jar   30  *  *  *  *      spotify-­‐analytics  hadoop  jar  user_funnel_hourly.jar   *  1  *  *  *        spotify-­‐core            hadoop  jar  daily_aggregate.jar   *  2  *  *  *        spotify-­‐core            hadoop  jar  calculate_royalties.jar   */2  22  *  *  *  spotify-­‐radio          hadoop  jar  generate_radio.jar   8
  • 12. 9 Handles the ‘plumbing’ for Hadoop jobs https://github.com/spotify/luigi
  • 13. 10
  • 14. 10
  • 15. 1111
  • 18. 12 #  sudo  addgroup  hadoop   #  sudo  adduser  —ingroup  hadoop  hdfs   #  sudo  adduser  —ingroup  hadoop  yarn   #  cp  /tmp/configs/*.xml  /etc/hadoop/conf/   #  apt-­‐get  update   …   [hdfs@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐hdfs-­‐datanode   …   [yarn@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐yarn-­‐nodemanager  
  • 19. 12 #  sudo  addgroup  hadoop   #  sudo  adduser  —ingroup  hadoop  hdfs   #  sudo  adduser  —ingroup  hadoop  yarn   #  cp  /tmp/configs/*.xml  /etc/hadoop/conf/   #  apt-­‐get  update   …   [hdfs@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐hdfs-­‐datanode   …   [yarn@sj-­‐hadoop-­‐b20  ~]  $  apt-­‐get  install  hadoop-­‐yarn-­‐nodemanager  
  • 21. 14 [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data   Found  3  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  lake   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  pond   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  ocean   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data/lake   Found  1  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          1321451  2015-­‐01-­‐01  12:00  boats.txt   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐cat  /data/lake/boats.txt   …
  • 22. 14 [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data   Found  3  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  lake   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  pond   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          0  2015-­‐01-­‐01  12:00  ocean   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐ls  /data/lake   Found  1  items   drwxr-­‐xr-­‐x      -­‐  hdfs  hadoop          1321451  2015-­‐01-­‐01  12:00  boats.txt   [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  hdfs  dfs  -­‐cat  /data/lake/boats.txt   …
  • 23. 15
  • 24. $  time  for  i  in  {1..100};  do  hadoop  fs  -­‐ls  /  >  /dev/null;  done   real   3m32.014s   user   6m15.891s   sys        0m18.821s   $  time  for  i  in  {1..100};  do  snakebite  ls  /  >  /dev/null;  done   real   0m34.760s   user   0m29.962s   sys        0m4.512s   16
  • 26. 18 • In 2013, expanded to 200 nodes • Hadoop critical • Needed a team totally focused on it • Created a ‘squad’ with two missions: • Migrate to a new distribution with Yarn • Make Hadoop reliable Forming a team
  • 27. 19
  • 31. 19 Hadoop ownerless Upgrades Getting there Squad
  • 32. 20 • Alert on service level problems (i.e. no jobs running) • Keep your alarm channel clean. Beware of alert fatigue. Alerting
  • 33. 21 Uhh ohh….. I think I made a mistake
  • 34. [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  snakebite  rm  -­‐R  /team/disco/  CF/test-­‐10/   22
  • 35. Goodbye Data (1PB) [data-­‐sci@sj-­‐edge-­‐a1  ~]  $  snakebite  rm  -­‐R  /team/disco/  CF/test-­‐10/   22 OK:  Deleted  /team/disco  
  • 36. 23 • “Sit on your hands before you type” - Wouter de Bie • Users will always want to retain data! • Remove superusers from ‘edgenodes’ • Moving to trash = client-side implementation Lessons Learned
  • 38. 25 • Same hardware profile as production cluster • Similar configuration • Staging environment • Reliable Pre-Production Cluster
  • 40. 27
  • 42. 29 • Features: • Data discovery • Lineage • Lifecycle management • More • We use it for data movement • Uses Oozie behind the scenes Apache Falcon
  • 43. • Most of our jobs were Hadoop (python) Streaming • Lots of failures, slow performance • Had to find a better way…. 30 Improving Performance
  • 44. 31 • Investigated several frameworks • Selected Crunch: • Real types! • Higher level API • Easier to test • Better performance #JVM_FTW *Dave Whiting’s analysis of systems: http://thewit.ch/scalding_crunchy_pig Improving Performance
  • 45. 32
  • 48. 35 Note: Spotify users is based on publicly released numbers only
  • 49. 36 Explosive Growth • Increased Spotify Users • Increased use cases • Increased Engineers
  • 54. 41 • Data-discovery tool • Luigi Integration • Find and browse datasets • View schemas • Trace lineage • Open-source plans? :-( Raynor
  • 55. 42 Two Takeaways • Automate Everything • More time to play FIFA build cool tools • Listen to your users • Fail fast, don’t be afraid to scrap work
  • 56. 43 Join the Band! Engineers wanted in NYC & Stockholm http://spotify.com/jobs