Adam Kawa
Data Engineer @ Spotify
Hadoop Operations
Powered By …
Hadoop
1. How many times has Coldplay been streamed this
month?
2. How many times was “Get Lucky” streamed during
first 24h?
3. Who was the most popular artist in NYC last week?
Labels, Advertisers, Partners
1. What song to recommend Jay-Z when he wakes
up?
2. Is Adam Kawa bored with Coldplay today?
3. How to get Arun to subscribe to Spotify
Premium?
Data Scientists
(Big) Data At Spotify
■ Data generated by +24M monthly active users
and for users!
- 2.2 TB of compressed data from users per day
- 64 TB of data generated in Hadoop each day
(triplicated)
Data Infrastructure At Spotify
■ Apache Hadoop YARN
■ Many other systems including
- Kafka, Cassandra, Storm, Luigi in production
- Giraph, Tez, Spark in the evaluation mode
■ Probably the largest commercial Hadoop cluster in
Europe!
- 694 heterogeneous nodes
- 14.25 PB of data consumed
- ~12,000 jobs each day
Apache Hadoop
March 2013
Tricky questions were asked!
1. How many servers do you need to buy to survive
one year?
2. What will you do to use them efficiently?
3. If we agree, don’t come back to us this year! OK?
Finance Department
■ One of the Data Engineers responsible for answering
these questions!
Adam Kawa
■ Examples of how to analyze various metrics, logs
and files
- generated by Hadoop
- using Hadoop
- to understand Hadoop
- to avoid guesstimates!
The Topic Of This Talk
■ This knowledge can be useful to
- measure how fast HDFS is growing
- define an empirical retention policy
- measure the performance of jobs
- optimize the scheduler
- and more
What To Use It For
1. Analyzing HDFS
2. Analyzing MapReduce and YARN
Agenda
HDFS
Garbage Collection On
The NameNode
“ We don’t have any full GC pauses on the NN.
Our GC stops the NN for less than 100 msec,
on average!
:) ”
Adam Kawa @ Hadoop User Mailing List
December 16th, 2013
“ Today, between 12:05 and 13:00
we had 5 full GC pauses on the NN.
They stopped the NN for 34min47sec in total!
:( ”
Adam Kawa @ Spotify office, Stockholm
January 13th, 2014
What happened
between 12:05 and 13:00?
The NameNode was receiving the block reports from
all the DataNodes
Quick Answer!
1. We started the NN when the DNs were running
2. 502 DNs immediately registered to the NN
■ Within 1.2 sec (based on logs from the DNs)
3. 502 DNs started sending the block reports
■ dfs.blockreport.initialDelay = 30 minutes
■ 17 block reports per minute (on average)
■ +831K blocks in each block report (on average)
4. This generated a high memory pressure on the NN
■ The NN ran into Full GC !!!
Detailed Answer
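The figures above can be cross-checked with a little arithmetic. This is only a sketch: it assumes the first block reports are spread evenly across the initial delay window and treats "+831K" as a lower bound of 831,000 blocks per report.

```python
# Figures quoted on the slide above.
datanodes = 502
initial_delay_minutes = 30       # dfs.blockreport.initialDelay, as quoted
blocks_per_report = 831_000      # "+831K", treated here as a lower bound

# Assumption: reports arrive evenly spread over the delay window.
reports_per_minute = datanodes / initial_delay_minutes
total_block_entries = datanodes * blocks_per_report

print(f"{reports_per_minute:.1f} block reports per minute")
print(f"{total_block_entries / 1e6:.0f}M block entries to process in total")
```

Roughly 17 reports per minute, each carrying hundreds of thousands of entries, is consistent with the memory pressure described in step 4.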
Hadoop told us everything!
■ Enable GC logging for the NameNode
■ Visualize it with e.g. GCViewer
■ Analyze memory usage patterns, GC pauses,
misconfiguration
Collecting The GC Stats
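Besides a viewer, the GC log can be summarized with a few lines of Python. The sketch below assumes the classic HotSpot log shape produced by flags such as -XX:+PrintGCDetails on older JVMs; the two sample lines are invented for illustration and the pattern would need adjusting to whatever your JVM actually writes.

```python
import re

# Assumed line shape: "... [Full GC ... [Times: user=.. sys=.., real=N.NN secs]"
FULL_GC = re.compile(r"\[Full GC.*?real=(\d+\.\d+) secs\]")

def full_gc_pauses(lines):
    """Return the wall-clock duration (seconds) of every Full GC line."""
    pauses = []
    for line in lines:
        match = FULL_GC.search(line)
        if match:
            pauses.append(float(match.group(1)))
    return pauses

# Hypothetical sample lines, not taken from the cluster described in the talk.
sample = [
    "12.345: [GC [ParNew: 1K->1K(2K), 0.0100000 secs] 5K->4K(9K), 0.0110000 secs] "
    "[Times: user=0.04 sys=0.00, real=0.01 secs]",
    "99.000: [Full GC [CMS: 9K->8K(9K), 339.0000000 secs] 9K->8K(9K), 339.1000000 secs] "
    "[Times: user=330.00 sys=1.00, real=339.10 secs]",
]
pauses = full_gc_pauses(sample)
print(len(pauses), "full GC pause(s),", sum(pauses), "seconds in total")
```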
[Chart: heap used by the NN over time (blue line). Annotations, in order: loading FsImage; start replaying edit logs; first block report processed; 25 block reports processed; 131 block reports processed; 5min 39sec of Full GC; 40 block reports processed; next Full GC; next Full GC.]
CMS collector starts
at 98.5% of heap…
We fixed that !
What happened in HDFS
between mid-December 2013
and mid-January 2014?
HDFS
HDFS Metadata
■ A persistent checkpoint of HDFS metadata
■ It contains information about files + directories
■ A binary file
HDFS FsImage File
■ Converts the content of FsImage to text formats
- e.g. a tab-separated file or XML
■ The output is easily analyzed with any tool
- e.g. Pig, Hive
HDFS Offline Image Viewer
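Once the FsImage has been converted to a tab-separated file, even plain Python is enough for a first look. The sketch below assumes one row per file, with the path in column 0 and the file size in bytes in column 6; those positions are an assumption, so check them against the output of your own Hadoop version and filter out directory rows first.

```python
import csv
import io
from collections import defaultdict

PATH_COL, SIZE_COL = 0, 6   # assumed column positions; verify on your output

def summarize(tsv_text):
    """Return {top-level dir: (file count, average file size in bytes)}."""
    totals = defaultdict(lambda: [0, 0])
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        path, size = row[PATH_COL], int(row[SIZE_COL])
        top = "/" + path.strip("/").split("/")[0]
        totals[top][0] += 1
        totals[top][1] += size
    return {d: (n, total / n) for d, (n, total) in totals.items()}

# Hypothetical rows; "-" stands in for columns this sketch does not read.
sample = (
    "/app-logs/u1/a.log\t3\t-\t-\t-\t1\t200\n"
    "/app-logs/u2/b.log\t3\t-\t-\t-\t1\t400\n"
    "/data/part-0\t3\t-\t-\t-\t4\t1000\n"
)
for directory, (count, avg) in summarize(sample).items():
    print(directory, count, avg)
```

A small average file size combined with a large file count under one directory is the kind of pattern this view makes visible.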
50% of the data
was created during the last 3
months
Anything interesting?
1. NO data added that day
2. Many more files added after
the migration to YARN
Where
did
the small files
come from?
■ An interactive
visualization of
data in HDFS
Twitter's
HDFS-DU
/app-logs
avg. file size = 253 KB
no. of dirs = 595K
no. of files = 60.6M
■ Statistics broken down by user/group name
■ Candidates for duplicate datasets
■ Inefficient MapReduce jobs
- Small files
- Skewed files
More Uses Of FsImage File
■ You can analyze FsImage to learn how fast HDFS
grows
■ You can combine it with “external” datasets
- number of daily/monthly active users
- total size of logs generated by users
- number of queries / day run by data analysts
Advanced HDFS Capacity Planning
■ You can also use the “trend button” in Ganglia
Simplified HDFS Capacity Planning
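A trend line of this kind can also be computed by hand. The following is a minimal sketch, assuming one usage sample per day and linear growth; the function name and the sample numbers are made up for illustration, and at least two samples are required.

```python
def days_until_full(used_tb_by_day, capacity_tb):
    """Fit a least-squares line to daily usage and extrapolate to capacity.

    Returns the number of days after the last sample, or None if usage is
    not growing.
    """
    n = len(used_tb_by_day)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_tb_by_day) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_tb_by_day))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity_tb - intercept) / slope
    return full_day - (n - 1)

# Hypothetical usage in TB on four consecutive days, 200 TB of capacity.
usage = [100, 110, 120, 130]
print(days_until_full(usage, 200))
```

Real growth is rarely linear, so an estimate like this is best treated as a first guess to refine with the "external" datasets mentioned for the advanced approach.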
If we do
NOTHING,
we might fill
the cluster
in
September
...
What will we do
to survive longer
than September?
HDFS
Retention
Question
How many days after creation is a dataset no longer
accessed?
Possible Solution
■ You can use modification_time and access_time
from FsImage
Empirical Retention Policy
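As a rough illustration of the idea, the two timestamps can be turned into a "days in use" figure per file. This sketch assumes the timestamps arrive as text in the form "YYYY-MM-DD HH:MM" and uses modification_time as a stand-in for creation time; both the format and the sample records are assumptions.

```python
from datetime import datetime

def days_in_use(modification_time, access_time, fmt="%Y-%m-%d %H:%M"):
    """Whole days between the last modification and the last access of a file."""
    modified = datetime.strptime(modification_time, fmt)
    accessed = datetime.strptime(access_time, fmt)
    return max((accessed - modified).days, 0)

def share_last_accessed_within(records, days):
    """Fraction of files whose last access came within `days` of modification."""
    ages = [days_in_use(m, a) for m, a in records]
    return sum(age <= days for age in ages) / len(ages)

# Hypothetical (modification_time, access_time) pairs.
records = [
    ("2014-01-01 10:00", "2014-01-01 10:30"),
    ("2014-01-01 10:00", "2014-01-05 09:00"),
    ("2013-01-01 10:00", "2014-01-10 12:00"),
    ("2014-01-02 08:00", "2014-01-02 08:00"),
]
print(share_last_accessed_within(records, 7))
```

Grouping the same figure by dataset type would give the kind of per-category facts a retention policy can be based on.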
■ Logs and core datasets are accessed even many
years after creation
■ Many reports are not accessed even an hour after
creation
■ Most intermediate datasets are needed for less than a
week
■ 10% of data has not been accessed for a year
Our Retention Facts
HDFS
Hot Datasets
■ Some files/directories will be accessed more often
than others e.g.:
- fresh logs, core datasets, dictionary files
Idea
■ To process it faster, increase
its replication factor while it’s “hot”
■ To save disk space, decrease
its replication factor when it becomes “cold”
Hot Dataset
How to find them?
■ Logs all filesystem access requests sent to the NN
■ Easy to parse and aggregate
- a tab-separated line for each request
HDFS Audit Log
2014-01-18 15:16:12,023
INFO FSNamesystem.audit: allowed=true
ugi=kawaa (auth:SIMPLE) ip=/10.254.28.4
cmd=open
src=/metadata/artist/2013-11-27/part-00061.avro
dst=null perm=null
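A line like the one above can be parsed and aggregated in a few lines. This is a sketch rather than a complete parser: it assumes every field has the key=value shape shown in the sample, and it counts successful open requests per parent directory as one possible definition of "hot".

```python
import re
from collections import Counter

# A value runs until the next "key=" token or the end of the line, which
# keeps "kawaa (auth:SIMPLE)" together as one value.
FIELD = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")

def parse_audit_line(line):
    """Turn the key=value part of one audit log line into a dict."""
    return dict(FIELD.findall(line.strip()))

def hot_directories(lines, top=3):
    """Count allowed `open` requests per parent directory of `src`."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("cmd") == "open" and fields.get("allowed") == "true":
            counts[fields["src"].rsplit("/", 1)[0]] += 1
    return counts.most_common(top)

line = (
    "2014-01-18 15:16:12,023 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=kawaa (auth:SIMPLE)\tip=/10.254.28.4\tcmd=open\t"
    "src=/metadata/artist/2013-11-27/part-00061.avro\tdst=null\tperm=null"
)
print(parse_audit_line(line)["ugi"])
print(hot_directories([line, line]))
```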
■ JAR files stored in HDFS and used by Pig scripts
■ A dictionary file with metadata about log messages
■ Core datasets: playlists, users, top tracks
Our Hot Datasets
YARN
MapReduce Jobs
Autotuning
■ There are jobs that we schedule regularly
- e.g. top lists for each country
Idea
■ Before submitting it next time, use statistics from the
previous executions of a job
- To learn about its historical performance
- To tweak its configuration settings
Recurring MapReduce Jobs
We implemented
■ A pre-execution hook that automatically sets
- Maximum size of an input split
- Number of Reduce tasks
■ More settings can be tweaked
- Memory
- Combiner
Jobs Autotuning
■ Here, the goal is that a task runs approx. 10 min, on
average
- Inspired by LinkedIn at Hadoop Summit 2013
- Helpful in extreme cases (short/long running tasks)
A Small PoC ;)
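The calculation behind such a hook might look like the sketch below. It is not the hook described in the talk: the function names are hypothetical, and it assumes task duration scales linearly with the amount of input, which is only an approximation.

```python
import math

TARGET_SECONDS = 10 * 60   # goal from the slide: tasks of roughly 10 minutes

def suggest_split_bytes(prev_split_bytes, prev_avg_map_seconds):
    """Scale the split size so an average map task takes ~TARGET_SECONDS."""
    return int(prev_split_bytes * TARGET_SECONDS / prev_avg_map_seconds)

def suggest_reducers(prev_reducers, prev_avg_reduce_seconds):
    """Scale the reducer count so an average reduce task takes ~TARGET_SECONDS."""
    total_reduce_seconds = prev_reducers * prev_avg_reduce_seconds
    return max(1, math.ceil(total_reduce_seconds / TARGET_SECONDS))

# Hypothetical history: 128 MB splits took 1 minute per map task, and
# 50 reducers ran for 30 minutes each.
print(suggest_split_bytes(128 * 1024 * 1024, 60))
print(suggest_reducers(50, 1800))
```

The results would then be passed to the job as its maximum input split size and number of reduce tasks before submission.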
Another Example - Job Optimized Over Time
Even perfect manual settings
may become outdated
when an input dataset grows!
YARN
MapReduce Statistics
■ Extracts the statistics from historical MapReduce jobs
- Supports MRv1 and YARN
■ Stores them as Avro files
- Enables easy analysis using e.g. Pig and Hive
■ Similar projects
- Replephant, hRaven
Zlatanitor = Zlatan + Monitor
Zlatanitor
[Chart legend: Low / Medium / High]
A Slow Node
- 40% lower throughput than the average
- NIC negotiated 100MbE instead of 1GbE
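Given per-node throughput figures extracted from job statistics, a node like this can be flagged automatically. The sketch below is an assumption about how one might do it: the node names and numbers are invented, and the 40% threshold simply mirrors the example above.

```python
from statistics import mean

def slow_nodes(throughput_by_node, threshold=0.40):
    """Return nodes whose throughput is at least `threshold` below the mean."""
    average = mean(throughput_by_node.values())
    cutoff = average * (1 - threshold)
    return sorted(node for node, mbps in throughput_by_node.items() if mbps <= cutoff)

# Hypothetical average task throughput per node, in MB/s.
sample = {"node-a": 100.0, "node-b": 105.0, "node-c": 95.0, "node-d": 20.0}
print(slow_nodes(sample))
```

Note that the slow node itself pulls the mean down; with many nodes the effect is small, but a median would be more robust on small clusters.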
According to Facebook
■ “Small percentage of machines are responsible for
large percentage of failures”
- Worse performance
- More alerts
- More manual intervention
Repeat Offenders
Adding nodes to the cluster
increases performance.
Sometimes, removing (crappy) nodes
does too !
Fixing
slow and failing
tasks as well !
YARN
Application Logs
■ In YARN, application logs can be moved to HDFS
- They are stored as TFiles … :(
- Small and many of them!
Location Of Application Logs
■ Frequent exceptions and bugs
- Just looking at the last line of stderr shows a lot!
■ Possible optimizations
- Memory and size of map input buffer
What Might Be Checked
a) AttributeError: 'int' object has no attribute 'iteritems'
b) ValueError: invalid literal for int() with base 10: 'spotify'
c) ValueError: Expecting , delimiter: line 1 column 3257 (char 3257)
d) ImportError: No module named db_statistics
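Counting which exception types end the stderr of failed tasks takes very little code. This sketch assumes the stderr contents have already been extracted from the aggregated logs into plain strings, and that the last non-empty line has the usual Python "ExceptionName: message" shape seen in the examples above.

```python
from collections import Counter

def exception_type(stderr_text):
    """Return the exception class named on the last non-empty stderr line."""
    lines = [line.strip() for line in stderr_text.splitlines() if line.strip()]
    if not lines:
        return None
    return lines[-1].split(":", 1)[0]

def count_exceptions(stderr_texts):
    """Count how often each exception type ends a task's stderr."""
    return Counter(t for t in map(exception_type, stderr_texts) if t)

stderr_texts = [
    "Traceback (most recent call last):\n  ...\n"
    "AttributeError: 'int' object has no attribute 'iteritems'\n",
    "ValueError: invalid literal for int() with base 10: 'spotify'\n",
    "ValueError: Expecting , delimiter: line 1 column 3257 (char 3257)\n",
    "ImportError: No module named db_statistics\n",
    "",
]
print(count_exceptions(stderr_texts).most_common())
```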
YARN
The Capacity Scheduler
■ We specified capacities and elasticity based on a
combination of
- “some” data
- intuition
- desire to shape future usage (!)
Our Initial Capacities
■ Basic information available on the Scheduler Web UI
■ Take print-screens!
- Otherwise, you will lose the history of what you saw :(
Overutilization And Underutilization
■ Capacity Scheduler exposes these metrics via JMX
■ Ganglia does NOT display the metrics related to
utilization of queues (by default)
Visualizing Utilization Of Queue
■ It collects JMX metrics from Java processes
■ It can send metrics to multiple destinations
- Graphite, cacti/rrdtool, Ganglia
- tab-separated text file
- STDOUT
- and more
Jmxtrans
■ Our Production queue often borrows resources
- Usually from the Queue3 and Queue4 queues
Overutilization And Underutilization
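Once queue metrics are being collected, borrowing can be quantified instead of read off a graph. The sketch below assumes samples of (queue name, used capacity as a percentage of the queue's configured share), where values above 100 mean the queue is using borrowed resources; the input format and numbers are hypothetical.

```python
from collections import defaultdict

def utilization_summary(samples):
    """Return {queue: (mean used %, fraction of samples above 100 %)}."""
    by_queue = defaultdict(list)
    for queue, used_pct in samples:
        by_queue[queue].append(used_pct)
    return {
        queue: (sum(values) / len(values),
                sum(pct > 100 for pct in values) / len(values))
        for queue, values in by_queue.items()
    }

# Hypothetical samples.
samples = [("production", 150), ("production", 90), ("queue3", 20), ("queue3", 40)]
for queue, (mean_pct, borrowing_share) in utilization_summary(samples).items():
    print(queue, mean_pct, borrowing_share)
```

A queue that borrows most of the time, next to queues that rarely reach their share, suggests the configured capacities no longer match actual usage.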
The Best Time For The Downtime?
Three Crowns
Three Crowns = Sweden
BONUS
Some Cool Stuff
From The Community
■ Aggregates and visualizes Hadoop cluster
utilization across users
LinkedIn's White Elephant
■ Collects run-time statistics from MR jobs
- Stores them in HBase
■ Does not provide built-in visualization layer
- The picture below comes from Twitter's blog
Twitter's hRaven
That’s all!
■ Analyzing Hadoop is also a “business” problem
- Save money
- Iterate faster
- Avoid downtimes
Summary
Thank you!
■ To my awesome colleagues for great technical
review:
Piotr Krewski, Josh Baer, Rafal Wojdyla,
Anna Dackiewicz, Magnus Runesson, Gustav Landén,
Guido Urdaneta, Uldis Barbans
More Thanks
Questions?
Check out spotify.com/jobs or
@Spotifyjobs for more information
kawaa@spotify.com
Check out my blog: HakunaMapData.com
Want to join the band?
Backup
■ Tricky question!
■ Use production jobs that represent your workload
■ Use a metric that is independent from size of data
that you process
■ Optimize one setting at a time
Benchmarking
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)

Designing IA for AI - Information Architecture Conference 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)

  • 1. Adam Kawa Data Engineer @ Spotify Hadoop Operations Powered By … Hadoop
  • 2. 1. How many times has Coldplay been streamed this month? 2. How many times was “Get Lucky” streamed during first 24h? 3. Who was the most popular artist in NYC last week? Labels, Advertisers, Partners
  • 3. 1. What song to recommend Jay-Z when he wakes up? 2. Is Adam Kawa bored with Coldplay today? 3. How to get Arun to subscribe to Spotify Premium? Data Scientists
  • 4.
  • 5. (Big) Data At Spotify ■ Data generated by +24M monthly active users and for users! - 2.2 TB of compressed data from users per day - 64 TB of data generated in Hadoop each day (triplicated)
  • 6. Data Infrastructure At Spotify ■ Apache Hadoop YARN ■ Many other systems including - Kafka, Cassandra, Storm, Luigi in production - Giraph, Tez, Spark in the evaluation mode
  • 7. ■ Probably the largest commercial Hadoop cluster in Europe! - 694 heterogeneous nodes - 14.25 PB of data consumed - ~12,000 jobs each day Apache Hadoop
  • 9. 1. How many servers do you need to buy to survive one year? 2. What will you do to use them efficiently? 3. If we agree, don’t come back to us this year! OK? Finance Department
  • 10. ■ One of the Data Engineers responsible for answering these questions! Adam Kawa
  • 11. ■ Examples of how to analyze various metrics, logs and files - generated by Hadoop - using Hadoop - to understand Hadoop - to avoid guesstimates! The Topic Of This Talk
  • 12. ■ This knowledge can be useful to - measure how fast HDFS is growing - define an empirical retention policy - measure the performance of jobs - optimize the scheduler - and more What To Use It For
  • 13. 1. Analyzing HDFS 2. Analyzing MapReduce and YARN Agenda
  • 15. “ We don’t have any full GC pauses on the NN. Our GC stops the NN for less than 100 msec, on average! :) ” Adam Kawa @ Hadoop User Mailing List December 16th, 2013
  • 16. “ Today, between 12:05 and 13:00 we had 5 full GC pauses on the NN. They stopped the NN for 34min47sec in total! :( ” Adam Kawa @ Spotify office, Stockholm January 13th, 2014
  • 18. The NameNode was receiving the block reports from all the DataNodes Quick Answer!
  • 19. 1. We started the NN when the DNs were running Detailed Answer
  • 20. 1. We started the NN when the DNs were running 2. 502 DNs immediately registered to the NN ■ Within 1.2 sec (based on logs from the DNs) Detailed Answer
  • 21. 1. We started the NN when the DNs were running 2. 502 DNs immediately registered to the NN ■ Within 1.2 sec (based on logs from the DNs) 3. 502 DNs started sending the block reports ■ dfs.blockreport.initialDelay = 30 minutes ■ 17 block reports per minute (on average) ■ +831K blocks in each block report (on average) Detailed Answer
  • 22. 1. We started the NN when the DNs were running 2. 502 DNs immediately registered to the NN ■ Within 1.2 sec (based on logs from the DNs) 3. 502 DNs started sending the block reports ■ dfs.blockreport.initialDelay = 30 minutes ■ 17 block reports per minute (on average) ■ +831K blocks in each block report (on average) 4. This generated a high memory pressure on the NN ■ The NN ran into Full GC !!! Detailed Answer
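The figures on this slide can be cross-checked with a little arithmetic. The sketch below only uses the numbers quoted above; the variable names are illustrative and not taken from the talk.

```python
# Figures quoted on the slide (averages are approximate).
datanodes = 502
reports_per_minute = 17        # block reports received per minute, on average
blocks_per_report = 831_000    # "+831K" blocks in each report, on average

# How long it takes for every DataNode to deliver its first block report.
minutes_to_receive_all = datanodes / reports_per_minute

# How many block entries the NameNode has to process.
blocks_per_minute = reports_per_minute * blocks_per_report
total_blocks_reported = datanodes * blocks_per_report

print(f"all reports received after ~{minutes_to_receive_all:.1f} min")
print(f"{blocks_per_minute:,} block entries per minute")
print(f"{total_blocks_reported:,} block entries in total")
```

About 29.5 minutes is consistent with the reports being spread over the 30-minute dfs.blockreport.initialDelay window, during which the NN had to process roughly 14 million block entries per minute.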
  • 23. Hadoop told us everything!
  • 24. ■ Enable GC logging for the NameNode ■ Visualize the logs with e.g. GCViewer ■ Analyze memory usage patterns, GC pauses, misconfiguration Collecting The GC Stats
  • 25. Time
  • 26. This blue line shows the heap used by the NN
  • 36. CMS collector starts at 98.5% of heap… We fixed that !
  • 37. What happened in HDFS between mid-December 2013 and mid-January 2014?
  • 39. ■ A persistent checkpoint of HDFS metadata ■ It contains information about files + directories ■ A binary file HDFS FsImage File
  • 40. ■ Converts the content of FsImage to text formats - e.g. a tab-separated file or XML ■ Output is easily analyzed by any tools - e.g. Pig, Hive HDFS Offline Image Viewer
  • 41. 50% of the data was created during the last 3 months
  • 42.
  • 44. 1. NO data added that day 2. Many more files added after
  • 46.
  • 48. ■ An interactive visualization of data in HDFS Twitter's HDFS-DU /app-logs avg. file size = 253 KB no. of dirs = 595K no. of files = 60.6M
  • 49. ■ Statistics broken down by user/group name ■ Candidates for duplicate datasets ■ Inefficient MapReduce jobs - Small files - Skewed files More Uses Of FsImage File
  • 50. ■ You can analyze FsImage to learn how fast HDFS grows ■ You can combine it with “external” datasets - number of daily/monthly active users - total size of logs generated by users - number of queries / day run by data analysts Advanced HDFS Capacity Planning
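One way to measure growth from a text dump of the FsImage is to sum file sizes per month of modification_time. This is a minimal sketch, not the analysis used in the talk: the three-column layout (path, modification time, file size) and the "YYYY-MM-DD HH:MM" timestamp format are assumptions, so check the column positions against the actual Offline Image Viewer output before using it.

```python
from collections import defaultdict

def bytes_per_month(lines, mtime_col=1, size_col=2):
    """Sum file sizes by month of modification_time.

    `lines` are tab-separated records. The column indices are hypothetical
    defaults and must match the real dump.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        month = fields[mtime_col][:7]          # "2014-01-18 15:16" -> "2014-01"
        totals[month] += int(fields[size_col])
    return dict(totals)

# Made-up records, for illustration only.
sample = [
    "/logs/a\t2013-12-30 10:00\t100",
    "/logs/b\t2014-01-02 11:30\t250",
    "/logs/c\t2014-01-15 09:45\t50",
]
growth = bytes_per_month(sample)
print(growth)
```

Modification time is only a proxy for creation time, and files that were deleted in the meantime no longer appear in the FsImage, so the result describes surviving data rather than everything ever written.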
  • 51. ■ You can also use the “trend” button in Ganglia Simplified HDFS Capacity Planning If we do NOTHING, we might fill the cluster in September ...
  • 52. What will we do to survive longer than September?
  • 54. Question How many days after creation is a dataset no longer accessed? Retention Policy
  • 55. Question How many days after creation is a dataset no longer accessed? Possible Solution ■ You can use modification_time and access_time from FsImage Empirical Retention Policy
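The idea can be sketched as follows: for each file, take the gap between modification_time and access_time, then ask what share of files were last read within a given number of days. The function names and the "YYYY-MM-DD HH:MM" timestamp format are assumptions made for this example; they are not from the talk.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"   # assumed timestamp format of the FsImage text dump

def days_in_use(modification_time, access_time):
    """Whole days between the last modification and the last access."""
    delta = datetime.strptime(access_time, FMT) - datetime.strptime(modification_time, FMT)
    return max(delta.days, 0)

def share_idle_after(records, days):
    """Fraction of files whose last access came within `days` of modification."""
    if not records:
        return 0.0
    idle = sum(1 for mtime, atime in records if days_in_use(mtime, atime) <= days)
    return idle / len(records)

# Made-up (modification_time, access_time) pairs, for illustration only.
sample = [
    ("2014-01-01 00:00", "2014-01-01 00:30"),   # read once, right after creation
    ("2014-01-01 00:00", "2014-01-05 12:00"),
    ("2013-01-01 00:00", "2014-01-10 00:00"),   # still read a year later
    ("2014-01-02 00:00", "2014-01-02 08:00"),
]
print(share_idle_after(sample, days=7))
```

Weighting each file by its size instead of counting files would show how much disk space a retention period could free.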
  • 56. ■ Logs and core datasets are accessed even many years after creation ■ Many reports are not accessed even an hour after creation ■ Most intermediate datasets are needed for less than a week ■ 10% of data has not been accessed for a year Our Retention Facts
  • 58. ■ Some files/directories will be accessed more often than others e.g.: - fresh logs, core datasets, dictionary files Idea ■ To process it faster, increase its replication factor while it’s “hot” ■ To save disk space, decrease its replication factor when it becomes “cold” Hot Dataset
  • 59. How to find them?
  • 60. ■ Logs all filesystem access requests sent to the NN ■ Easy to parse and aggregate - a tab-separated line for each request HDFS Audit Log 2014-01-18 15:16:12,023 INFO FSNamesystem.audit: allowed=true ugi=kawaa (auth:SIMPLE) ip=/10.254.28.4 cmd=open src=/metadata/artist/2013-11-27/part-00061.avro dst=null perm=null
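Because every request is a line of key=value fields, finding frequently opened datasets takes only a few lines of code. The sketch below parses the sample line from the slide; treating the first two path components as the "dataset" is an assumption made for this example.

```python
import re
from collections import Counter

FIELD = re.compile(r"(\w+)=(\S+)")   # key=value pairs such as cmd=open

def parse_audit_line(line):
    """Return the key=value fields of one audit log line as a dict."""
    return dict(FIELD.findall(line))

def hot_datasets(lines, depth=2):
    """Count 'open' requests per dataset (the first `depth` path components)."""
    counts = Counter()
    for line in lines:
        fields = parse_audit_line(line)
        if fields.get("cmd") == "open":
            parts = fields["src"].split("/")
            counts["/".join(parts[:depth + 1])] += 1
    return counts

line = ("2014-01-18 15:16:12,023 INFO FSNamesystem.audit: allowed=true "
        "ugi=kawaa (auth:SIMPLE) ip=/10.254.28.4 cmd=open "
        "src=/metadata/artist/2013-11-27/part-00061.avro dst=null perm=null")
print(hot_datasets([line]).most_common(1))
```

On a real cluster the same aggregation would more naturally run as a Pig or Hive job over the audit logs.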
  • 61. ■ JAR files stored in HDFS and used by Pig scripts ■ A dictionary file with metadata about log messages ■ Core datasets: playlists, users, top tracks Our Hot Datasets
  • 63. ■ There are jobs that we schedule regularly - e.g. top lists for each country Idea ■ Before submitting it next time, use statistics from the previous executions of a job - To learn about its historical performance - To tweak its configuration settings Recurring MapReduce Jobs
  • 64. We implemented ■ A pre-execution hook that automatically sets - Maximum size of an input split - Number of Reduce tasks ■ More settings can be tweaked - Memory - Combiner Jobs Autotuning
  • 65. ■ Here, the goal is that a task runs approx. 10 min, on average - Inspired by LinkedIn at Hadoop Summit 2013 - Helpful in extreme cases (short/long running tasks) A Small PoC ;)
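A pre-execution hook of this kind could derive the next maximum split size from the average map task duration of previous runs. The sketch below is not Spotify's implementation: it assumes that task duration scales roughly linearly with split size, and the function name and the bounds are hypothetical.

```python
TARGET_TASK_SECONDS = 10 * 60   # the goal from the slide: ~10 min per task

MIB = 1024 * 1024

def suggest_max_split_bytes(current_split_bytes, avg_map_seconds,
                            min_bytes=64 * MIB, max_bytes=1024 * MIB):
    """Scale the split size so an average map task takes about 10 minutes.

    Assumes duration is proportional to split size. The result is clamped
    to [min_bytes, max_bytes]; both bounds are illustrative.
    """
    if avg_map_seconds <= 0:
        raise ValueError("avg_map_seconds must be positive")
    scaled = current_split_bytes * TARGET_TASK_SECONDS / avg_map_seconds
    return int(min(max(scaled, min_bytes), max_bytes))

# Tasks averaged 2 minutes with 128 MiB splits: suggest 5x larger splits.
print(suggest_max_split_bytes(128 * MIB, avg_map_seconds=120) // MIB)
```

The clamping covers the extreme cases mentioned above: very short tasks cannot push the split size beyond the upper bound, and very long ones cannot shrink it below the lower bound.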
  • 66. Another Example - Job Optimized Over Time
  • 67. Even perfect manual settings may become outdated when an input dataset grows!
  • 69. ■ Extracts the statistics from historical MapReduce jobs - Supports MRv1 and YARN ■ Stores them as Avro files - Enables easy analysis using e.g. Pig and Hive ■ Similar projects - Replephant, hRaven Zlatanitor = Zlatan + Monitor Zlatanitor
  • 71. A Slow Node - 40% lower throughput than the average Low Medium High
  • 72. NIC negotiated 100MbE instead of 1GbE Low Medium High
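Slow nodes like this one can be flagged automatically once per-node throughput has been extracted from the job statistics. This is a minimal sketch; the input dictionary, the node names and the 30% threshold are assumptions made for this example.

```python
from statistics import mean

def slow_nodes(throughput_by_node, max_drop=0.3):
    """Return nodes whose throughput is more than `max_drop` below the average."""
    average = mean(throughput_by_node.values())
    threshold = (1 - max_drop) * average
    return sorted(node for node, value in throughput_by_node.items()
                  if value < threshold)

# Made-up throughput figures (e.g. MB/s per task), for illustration only.
sample = {"node1": 100, "node2": 105, "node3": 95, "node4": 55}
print(slow_nodes(sample))
```

A very slow node drags the average down with it, so comparing against the median may be more robust on small clusters.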
  • 73. According to Facebook ■ ”Small percentage of machines are responsible for large percentage of failures” - Worse performance - More alerts - More manual intervention Repeat Offenders
  • 74. Adding nodes to the cluster increases performance. Sometimes, removing (crappy) nodes does too !
  • 77. ■ YARN - can be moved to HDFS - They are stored as TFiles … :( - Small and many of them! Location Of Application Logs
  • 78. ■ Frequent exceptions and bugs - Just looking at the last line of stderr shows a lot! ■ Possible optimizations - Memory and size of map input buffer What Might Be Checked a) AttributeError: 'int' object has no attribute 'iteritems' b) ValueError: invalid literal for int() with base 10: 'spotify' c) ValueError: Expecting , delimiter: line 1 column 3257 (char 3257) d) ImportError: No module named db_statistics
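Counting the exception type on the last line of each task's stderr gives a quick picture of the most frequent failures. The sketch below assumes the stderr contents have already been read into strings; how they are extracted from the aggregated application logs is not shown.

```python
from collections import Counter

def last_line(stderr_text):
    """Return the last non-empty line of a stderr dump, or '' if there is none."""
    lines = [line.strip() for line in stderr_text.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def error_types(stderr_texts):
    """Count exception types, taken as the text before the first ':'."""
    counts = Counter()
    for text in stderr_texts:
        line = last_line(text)
        if line:
            counts[line.split(":", 1)[0]] += 1
    return counts

# Last lines taken from the slide; the surrounding text is made up.
sample = [
    "Traceback (most recent call last):\nValueError: invalid literal for int() with base 10: 'spotify'",
    "ImportError: No module named db_statistics\n",
    "some output\nValueError: Expecting , delimiter: line 1 column 3257 (char 3257)\n\n",
]
print(error_types(sample).most_common())
```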
  • 80. ■ We specified capacities and elasticity based on a combination of - “some” data - intuition - desire to shape future usage (!) Our Initial Capacities
  • 81. ■ Basic information available on the Scheduler Web UI ■ Take screenshots! - Otherwise, you will lose the history of what you saw :( Overutilization And Underutilization
  • 82. ■ Capacity Scheduler exposes these metrics via JMX ■ Ganglia does NOT display the metrics related to utilization of queues (by default) Visualizing Utilization Of Queue
  • 83. ■ It collects JMX metrics from Java processes ■ It can send metrics to multiple destinations - Graphite, cacti/rrdtool, Ganglia - tab-separated text file - STDOUT - and more Jmxtrans
  • 84. ■ Our Production queue often borrows resources - Usually from the Queue3 and Queue4 queues Overutilization And Underutilization
  • 85.
  • 86. The Best Time For The Downtime?
  • 88. Three Crowns = Sweden
  • 89. BONUS Some Cool Stuff From The Community
  • 90. ■ Aggregates and visualizes Hadoop cluster utilization across users LinkedIn's White Elephant
  • 91. ■ Collects run-time statistics from MR jobs - Stores them in HBase ■ Does not provide built-in visualization layer - The picture below comes from Twitter's blog Twitter's hRaven
  • 93. ■ Analyzing Hadoop is also a “business” problem - Save money - Iterate faster - Avoid downtimes Summary
  • 95. ■ To my awesome colleagues for great technical review: Piotr Krewski, Josh Baer, Rafal Wojdyla, Anna Dackiewicz, Magnus Runesson, Gustav Landén, Guido Urdaneta, Uldis Barbans More Thanks
  • 97. Check out spotify.com/jobs or @Spotifyjobs for more information kawaa@spotify.com Check out my blog: HakunaMapData.com Want to join the band?
  • 99. ■ Tricky question! ■ Use production jobs that represent your workload ■ Use a metric that is independent of the size of the data that you process ■ Optimize one setting at a time Benchmarking