The state of Spark in the cloud
Nicolas Poggi
May 2017
Outline
1. Intro to BSC and ALOJA
2. Motivation and background
3. BigBench and PaaS
4. Sequential tests 1GB – 1TB
1. Data scales
2. Cost
5. Concurrency tests
6. Summary
Barcelona Supercomputing Center (BSC)
• Spanish national supercomputing center with 22 years of history in:
• Computer architecture, networking and distributed systems research
• Based at BarcelonaTech University (UPC)
• Large ongoing life science computational projects
• Prominent body of research activity around Hadoop
• 2008-2013: SLA Adaptive Scheduler, Accelerators, Locality Awareness, Performance Management. 7 publications
• 2013-Present: Cost-efficient upcoming Big Data architectures (ALOJA). 8+ publications
ALOJA: towards cost-effective Big Data
• Research project for automating characterization and optimization of Big Data deployments
• Open source Benchmarking-to-Insights platform and tools
• Largest Big Data public repository (70,000+ jobs)
• Community collaboration with industry and academia
http://aloja.bsc.es
Diagram: Big Data Benchmarking, Online Repository, Web / ML Analytics
Platform-as-a-Service Spark
• Cloud-based managed Hadoop services
• Ready to use Spark, Hive, …
• Simplified management
• Deploys in minutes, on-demand, elastic
• You select the instance type and the number of processing nodes
• Decoupled compute and storage
• Pay-as-you-go pricing model
• Optimized for general purpose
• Fine-tuned to the cloud provider architecture
Motivation
• 2016 SQL-on-Hadoop paper and presentations
• Focused on Hive, as SparkSQL was not ready to use
• Different versions (1.3, 1.5, 1.6)
• Some in preview mode
• Not carefully tuned
• Used the TPC-H SQL-only benchmark
• Early 2017: BigBench work on Hive and Spark, testing more than SQL
• FOSDEM and HadoopSummit EU presentations
• New code available this month for MLlib2 compatibility
• Goal: evaluate the current out-of-the-box experience of Spark in PaaS cloud
• Readiness, scalability, price, and performance
Surveyed Hadoop/Hive PaaS services
• Amazon Elastic Map Reduce (EMR)
• Released: Apr 2009
• OS: Amazon Linux AMI (RHEL-like)
• SW stack: EMR 5.5.0
• Spark 2.1.0 and Hive 2.1
• Google Cloud DataProc (CDP)
• Released: Feb 2016
• OS: Debian GNU/Linux 8.4
• SW stack: Preview version Spark 2.1.0
• V 1.1 with Spark 2.0.2
• Both with Hive 2.1
• Azure HDInsight (HDI)
• Released: Oct 2013
• OS: Windows Server and Ubuntu 16.04
• SW stack: HDP 2.6 based
• Spark 2.1.0 and 1.6.3
• Hive 1.2
• Target deployment:
• 16 data nodes with 8 cores each
• Master node with 16 cores
• Decoupled storage only
• Object store / elastic stores
VM instances and characteristics
Amazon Elastic Map Reduce (EMR)
• 16x M4.2xlarge (datanodes)
• 8-core, 32GB RAM
• 1x M4.4xlarge (master)
• 16-core, 64 GB RAM
• Storage: 2x EBS GP2 volumes
• Price/hr: $10.96 (billed by the hour)
Azure HDInsight (HDI)
• 16x D4v2 (datanodes)
• 8-core, 28GB RAM
• 2x D14v2 (master)
• 16-core, 112GB RAM
• Storage: WASB (Azure Blob Store)
• Price/hr: $20.68 (billed by the minute)
Google Cloud DataProc (CDP)
• 16x n1-standard-8 (datanodes)
• 8-core, 30GB RAM
• 1x n1-standard-16 (master)
• 16-core, 60GB RAM
• Storage: GCS (Google Cloud Storage)
• Price/hr: $10.38 (billed by the minute)
Disclaimer: snapshot of the out-of-the-box price and performance during May 2017. Performance and especially costs change often. We use non-discounted pricing. I/O costs are complex to estimate for a single benchmark. Costs are computed using per-second billing.
What is BigBench (TPCx-BB)?
• End-to-end application level benchmark specification
• result of many years of collaboration of industry and academia
• Covers most Big Data Analytical properties (3Vs)
• Covers 30 business use cases for a retailer company
• Defines data scale factors: 1GB to PBs
BigBench history
• 2012: Launched at WBDB
• 2013: Published at SIGMOD
• 2014: First implementation on GitHub
• 2016 (Feb): Standardized by TPC
• 2016 (Nov): TPCx-BB Version 1.2
• 2017 (May): Spark MLlib v2 compatibility (under testing)
BigBench use cases and process overview
• 30 business use cases covering:
• Merchandising,
• Pricing Optimization
• Product Return
• Customers...
• Implementation resulted in:
• 14 Declarative queries (SQL)
• 7 with Natural Language Processing
• 4 with data preprocessing with M/R jobs
• 5 with Machine Learning jobs
Benchmark process:
1. Data generation
2. Data loading
3. Power test
4. Throughput test 1
5. Data refresh
6. Throughput test 2
Result: BB queries / hour
BigBench v1.2 – Reference Implementation
Stack layers (from the diagram):
• Machine Learning: Mahout, ML Custom, Spark MLlib
• SQL Engine: Hive, Spark SQL
• Table Metastore: Hive Metastore
• Execution Engine: MapReduce, Tez, Spark (over Yarn)
• Filesystem: HDFS
Combination options:
• Hive + MapReduce + Mahout
• Hive + MapReduce + Spark_MLlib
• v1 and v2
• Hive + Tez + Mahout
• Hive + Tez + Spark_MLlib
• Spark SQL + Mahout
• Spark SQL + Spark_MLlib v1
• Spark 2 SQL + Mahout
• Spark 2 SQL + Spark_MLlib
• v1 and v2
• (also Hive-on-Spark… etc)
Previous results: M/R vs Tez and Mahout vs. MLlib v1
Average of three executions using 100 GB Scale Factor
Chart series: M/R, Tez, Mahout, MLlib v1 (annotated differences: 3x and 2x)
Sequential Spark 2.1 runs
Queries 1-30 on Spark 2.1 (power runs)
Per provider and combined
Query 1 Query 2 …. Query 30
BB 1GB-1TB: Spark 2.1 - Cloud Dataproc (CDP)
Notes:
• Chart shows the time increase as we move from 1GB to 1TB in data scale (x-axis: scale factor)
• From 1 to 100GB, there is less than a 2x increase in time for 100x more data
• Indicating over-provisioning
• From 10GB to 1TB, the increase is 3x in time
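The over-provisioning observation can be made concrete by comparing how much the run time grows with how much the data grows. The following is a minimal sketch; the run times are hypothetical and chosen only to mirror the "less than 2x time for 100x data" note, since the slides give ratios rather than raw seconds.

```python
def scaling_ratio(time_small, time_large, size_small, size_large):
    """Return (time growth, data growth, time growth per unit of data growth)."""
    time_growth = time_large / time_small
    data_growth = size_large / size_small
    return time_growth, data_growth, time_growth / data_growth


# Hypothetical run times in seconds for the 1GB and 100GB scale factors.
time_growth, data_growth, efficiency = scaling_ratio(
    time_small=1000.0, time_large=1900.0, size_small=1, size_large=100
)
print(f"time x{time_growth:.1f} for data x{data_growth:.0f} (ratio {efficiency:.3f})")
```

A ratio far below 1 suggests that fixed overheads dominate at the small scales, which is what over-provisioning looks like; a ratio near 1 would mean time grows roughly linearly with data.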
BB 1GB-1TB: Spark 2.1 – Elastic Map Reduce (EMR)
Notes:
• Chart shows the time increase as we move from 1GB to 1TB in data scale for EMR (x-axis: scale factor)
• From 1 to 100GB, there is less than a 2x increase in time for 100x more data
• Indicating over-provisioning
• From 10GB to 1TB, the increase is 4x in time, while CDP was only 3x
• The M/R jobs take a higher proportion of the run
BB 1GB-1TB: Spark 2.1 – HDInsight (HDI)
Notes:
• Chart shows the time increase as we move from 1GB to 1TB in data scale for HDI (x-axis: scale factor)
• From 1 to 10GB, there is only a 5% increase in time
• From 10GB to 1TB, the increase is 2.5x in time, less than the other providers
BB 1GB-1TB: Spark 2.1 – All providers
Notes:
• Chart shows the time increase as we move from 1GB to 1TB in data scale for all providers
• EMR is the fastest up to 100GB
• At 1TB, HDI is the fastest and EMR the slowest
• EMR has the largest increase in the M/R queries
Chart annotations: "Fastest EMR"; "Slowest at 1TB"
Errors…
• Everything was run out-of-the-box, except for:
• Q14 and Q17 require cross joins to be enabled in Spark v2
• At 10TB:
• spark.sql.broadcastTimeout (default 300) had to be increased in HDI
• Timeout in seconds for the broadcast wait time in broadcast joins
• At 1TB, memory issues:
• Queries 3, 4, 8
• TimSort java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate
• Queries 2 and 30
• 17/05/15 16:57:46 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
• Configs
• spark.yarn.driver.memoryOverhead and spark.yarn.executor.memoryOverhead
• spark.executor.memory
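As an illustration of how such overrides could be passed to a job, here is a minimal sketch that turns a dictionary of Spark properties into spark-submit arguments. The property names are Spark 2.1 settings; the values and the helper name `spark_submit_args` are assumptions for illustration only, not the settings used in these runs.

```python
def spark_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only; the slides do not state the tuned values.
overrides = {
    "spark.sql.crossJoin.enabled": "true",         # cross joins for Q14 and Q17
    "spark.sql.broadcastTimeout": "1200",          # seconds; the default is 300
    "spark.yarn.executor.memoryOverhead": "1024",  # MB of off-heap headroom
}
command = spark_submit_args(overrides)
print(" ".join(command))
```

The same properties could equally be placed in spark-defaults.conf; the command-line form is shown only because it keeps the change local to one run.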
Versions and Spark config
                             EMR                CDP                HDI
Java version                 OpenJDK 1.8.0_121  OpenJDK 1.8.0_121  OpenJDK 1.8.0_131
Spark version                2.1.0              2.1                2.1.0.2.6.0.2-76
Driver memory                5G                 5G                 5G
Executor memory              5G                 10G                4G
Executor cores               4                  4                  3
Executor instances           Dynamic            Dynamic            20
dynamicAllocation enabled    TRUE               TRUE               FALSE
Executor memoryOverhead      Default (384MB)    1,117 MB           384 MB
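The executor settings above determine the size of the YARN container each executor requests: executor memory plus memoryOverhead. The sketch below assumes the Spark 2.1 on YARN default for the overhead (10% of executor memory, with a 384 MB minimum) when no explicit value is set; the function name is hypothetical, and YARN schedulers may additionally round requests up to their allocation increment.

```python
def yarn_container_mb(executor_memory_mb, overhead_mb=None):
    """Executor heap plus memoryOverhead, in MB.

    When no overhead is given, assume the Spark 2.1 on YARN default:
    10% of executor memory, with a minimum of 384 MB.
    """
    if overhead_mb is None:
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb


emr_limit = yarn_container_mb(5 * 1024)                   # default overhead
cdp_limit = yarn_container_mb(10 * 1024, overhead_mb=1117)
hdi_limit = yarn_container_mb(4 * 1024, overhead_mb=384)
print(emr_limit / 1024, cdp_limit, hdi_limit)
```

Under this assumption a 5G executor gets a 5,632 MB (5.5 GB) container, which is consistent with the "5.6 GB of 5.5 GB physical memory used" message in the error log. It also suggests that the effective default overhead for a 5G executor would be 512 MB rather than the 384 MB minimum.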
BB 1TB M/R-only: Spark 2.1 – All providers
Notes:
• When zooming in by query, we can see that query 2 is the slowest on EMR
• While on CDP and HDI it is within proportion
BB 1TB Q2: Spark 2.1 – CPU Util % EMR and HDI
Notes:
• Job was CPU bound. Log showed:
• WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
• Solution: increased memory for executors; time was lowered from 6,417s to 1,501s (roughly 4.3x faster)
Q2: Find the top 30 products that are mostly viewed together with a given product in the online store
CREATE TEMPORARY FUNCTION makePairs AS 'io.bigdatabenchmark.v1.queries.udf.PairwiseUDTF';
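To show why Q2 is heavy on memory and CPU, the pairing logic can be sketched in a few lines: every session is expanded into all pairs of viewed products, and the pairs containing the given product are counted. This is an illustrative sketch only, not the benchmark's UDTF; the function names and the sample sessions are assumptions.

```python
from collections import Counter
from itertools import combinations


def make_pairs(items):
    """Return every unordered pair of distinct products viewed in one session."""
    return combinations(sorted(set(items)), 2)


def top_viewed_together(sessions, product, limit=30):
    """Count how often each other product shares a session with `product`."""
    counts = Counter()
    for items in sessions:
        for first, second in make_pairs(items):
            if first == product:
                counts[second] += 1
            elif second == product:
                counts[first] += 1
    return counts.most_common(limit)


# Hypothetical click sessions, each a list of viewed product ids.
sessions = [[1, 2, 3], [1, 2], [2, 4], [1, 2, 3, 3]]
result = top_viewed_together(sessions, product=1, limit=2)
print(result)
```

The number of pairs grows quadratically with session length, which is one plausible reason this query stresses executors at the 1TB scale.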
10TB is the limit…
and a Price / perf comparison
BigBench 10TB SQL-only: All providers
Notes:
• At 10TB, only the SQL part ran correctly in Spark
• EMR got the fastest results
• The rest still needs tuning to complete
• But it is reaching the limit of the cluster / PaaS config
BB 1GB-1TB: Spark 2.1 – Cost/Performance overview
Notes:
• Chart shows an execution time-to-cost plot
• Costs calculated to the second (not to the billing fractions)
• EMR is the cheapest to run for all sizes, and also the fastest up to 100GB
• CDP is second, but at 1TB HDI becomes more cost-effective
Faster and cheaper
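The difference between costs "calculated to the second" and costs at the billing fraction can be sketched as follows. The hourly prices and billing granularities are the ones listed in the VM instances slide; the run time reuses the tuned Q2 figure purely as an example, and provider-specific minimum charges are ignored.

```python
import math

PRICE_PER_HOUR = {"EMR": 10.96, "HDI": 20.68, "CDP": 10.38}  # USD per cluster hour
BILLING_UNIT_S = {"EMR": 3600, "HDI": 60, "CDP": 60}         # billed by hour or minute


def cost_per_second(provider, runtime_s):
    """Cost when the run time is charged to the second."""
    return PRICE_PER_HOUR[provider] * runtime_s / 3600


def cost_billed(provider, runtime_s):
    """Cost when the run time is rounded up to the provider's billing unit."""
    unit = BILLING_UNIT_S[provider]
    billed_s = math.ceil(runtime_s / unit) * unit
    return PRICE_PER_HOUR[provider] * billed_s / 3600


runtime_s = 1501  # example run time in seconds
for provider in PRICE_PER_HOUR:
    print(
        provider,
        round(cost_per_second(provider, runtime_s), 2),
        round(cost_billed(provider, runtime_s), 2),
    )
```

For short runs the hourly billing unit dominates: a 1,501s run costs a full hour on an hourly-billed cluster, which is why the comparison in the slides uses per-second costs to stay independent of billing granularity.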
Other Spark comparisons:
2.0.2 vs 2.1.0
1.6.3 vs 2.1.0
MLlib v1 vs v2
Hive vs. Spark
BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP)
Notes:
• Spark 2.1 is a bit faster at small scales, slower at 100 GB and 1 TB on the UDF/NLP queries
Chart annotations: 2.1 faster up to 100GB; slower at 1TB
BigBench 1GB-1TB: Spark 1.6.3 with MLlib 1 vs. 2.1.0 with MLlib 2 (HDI)
Notes:
• Spark 2.1 is always faster than 1.6.3 in HDI
• MLlib 2, using dataframes instead of RDDs, is only slightly faster than v1
BigBench 10GB and 1TB: Hive (+MLlib2) vs. Spark 2.1
Notes:
• Hive is faster on both HDI and EMR at 10GB, and slower on CDP
• CDP shows a scalability problem at 1TB with Hive
• As it doesn't enable Tez by default
• This was observed in a previous study as well
• At 1TB both CDP and HDI are faster than Hive (HDI)
Chart panels: 10 GB, 1 TB
Chart annotation: Hive much slower on CDP (doesn't enable Tez by default)
Concurrency runs (throughput)
2 to 32 parallel streams
BigBench 1-32 streams Spark 2.1 1GB scale
Notes:
• From 16 streams on, the bottleneck is the CPU utilization on the master
• HDI is faster at concurrency
• But it also showed the worst number (variability)
Chart annotation: high variability in HDI
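The structure of a throughput run, several streams each executing the query set in parallel, can be sketched with a thread pool. This is a minimal sketch under stated assumptions: `run_query` is a hypothetical stand-in for submitting one BigBench query, and the real benchmark driver is not shown in the slides.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_stream(stream_id, queries, run_query):
    """Run all queries of one stream sequentially; return per-query durations."""
    durations = []
    for query in queries:
        start = time.perf_counter()
        run_query(stream_id, query)
        durations.append(time.perf_counter() - start)
    return durations


def throughput_test(num_streams, queries, run_query):
    """Run num_streams streams in parallel; return (completed queries, elapsed s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        futures = [
            pool.submit(run_stream, stream_id, queries, run_query)
            for stream_id in range(num_streams)
        ]
        per_stream = [future.result() for future in futures]
    elapsed = time.perf_counter() - start
    return sum(len(durations) for durations in per_stream), elapsed


def fake_query(stream_id, query):
    """Hypothetical placeholder for submitting a query to the cluster."""
    time.sleep(0.001)


completed, elapsed = throughput_test(4, list(range(1, 31)), fake_query)
print(f"{completed} queries in {elapsed:.2f}s")
```

Keeping the per-query durations for each stream makes it possible to look at variability across streams, not only at the total elapsed time.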
Conclusions
• All providers have up-to-date (2.1.0) and well-tuned versions of Spark
• They could run BigBench up to 1TB on a medium-sized cluster
• [Almost] out-of-the-box
• Performance is similar among providers for similar cluster types and disk configs
• Differences according to scale (and pricing)
• Spark 2.1.0 is faster than previous versions
• Also MLlib 2 with dataframes
• But improvements are within the 30% range
• Hive (+ Tez + MLlib) is still slightly faster than Spark at lower scales
• But very similar at larger scales
• And using mainly Spark simplifies the pipeline
• BigBench has been useful to stress a cluster with different workloads
• Highlights config problems fast and stresses scale limits
• Helpful for tuning the clusters
• And yes, Spark is now production ready and performant in PaaS in the cloud
Future work / WiP
• Compare Hive versions 1 and 2
• HDI still on v1
• Test LLAP with different settings
• Variability study for Spark workloads in the cloud
• Fix 10TB runs to complete results
• Compare to on-prem runs
• Optimizations
• Test G1 GC
• Fat vs. thin executors configs
Resources and references
BigBench and ALOJA
• BigBench Spark 2 branch (thanks Christoph and Michael from bankmark.de):
• https://github.com/carabolic/Big-Data-Benchmark-for-Big-Bench/tree/spark2
• Original BigBench implementation repository
• https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
• ALOJA benchmarking platform
• https://github.com/Aloja/aloja
• http://aloja.bsc.es/publications
• ALOJA fork of BigBench (adds support for HDI and fixes Spark)
• https://github.com/Aloja/Big-Data-Benchmark-for-Big-Bench
• The State of SQL-on-Hadoop in the Cloud – N. Poggi et al.
• https://doi.org/10.1109/BigData.2016.7840751
Big Data Benchmarking
• Big Data Benchmarking Community (BDBC) mailing list
• (~200 members from ~80 organizations)
• http://clds.sdsc.edu/bdbc/community
• Workshop on Big Data Benchmarking (WBDB)
• http://clds.sdsc.edu/bdbc/workshops
• SPEC Research Big Data working group
• http://research.spec.org/working-groups/big-data-working-group.html
• Benchmarking slides and video:
• Benchmarking Hadoop:
• https://www.slideshare.net/ni_po/benchmarking-hadoop
• Michael Frank on Big Data benchmarking
• http://www.tele-task.de/archive/podcast/20430/
• Tilmann Rabl Big Data Benchmarking Tutorial
• http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
Thanks, questions?
Follow up / feedback : Nicolas.Poggi@bsc.es
Twitter: ni_po
The state of Spark in the Cloud

More Related Content

What's hot

Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouMetrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Databricks
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Databricks
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 

What's hot (20)

A short introduction to Spark and its benefits
A short introduction to Spark and its benefitsA short introduction to Spark and its benefits
A short introduction to Spark and its benefits
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
 
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye ZhouMetrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
Metrics-Driven Tuning of Apache Spark at Scale with Edwina Lu and Ye Zhou
 
Elastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent MemoryElastify Cloud-Native Spark Application with Persistent Memory
Elastify Cloud-Native Spark Application with Persistent Memory
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache Tez
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin SeyfeSOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark Jobs
 
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan ZhuBuilding a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various cloudsPGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
 

Similar to The state of Spark in the cloud

The State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiThe State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas Poggi
Spark Summit
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
Chester Chen
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
Chester Chen
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4
Gaurav "GP" Pal
 

Similar to The state of Spark in the cloud (20)

The State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas PoggiThe State of Spark in the Cloud with Nicolas Poggi
The State of Spark in the Cloud with Nicolas Poggi
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems...
 
Benchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public cloudsBenchmarking your cloud performance with top 4 global public clouds
Benchmarking your cloud performance with top 4 global public clouds
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at Lyft
 
Accelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cacheAccelerating hbase with nvme and bucket cache
Accelerating hbase with nvme and bucket cache
 
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and ChefDevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
DevOps for ETL processing at scale with MongoDB, Solr, AWS and Chef
 
stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4stackArmor presentation for DevOpsDC ver 4
stackArmor presentation for DevOpsDC ver 4
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 

More from Nicolas Poggi

More from Nicolas Poggi (6)

Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsBenchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA Constraints
 
Correctness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLCorrectness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQL
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]
 
The case for Hadoop performance
The case for Hadoop performanceThe case for Hadoop performance
The case for Hadoop performance
 

Recently uploaded

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
JoseMangaJr1
 

Recently uploaded (20)

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 

The state of Spark in the cloud

  • 1. The state of in the cloud Nicolas Poggi May 2017 ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ /_/
  • 2. Outline 1. Intro to BSC and ALOJA 2. Motivation and background 3. BigBench and PaaS 4. Sequential tests 1GB – 1TB 1. Data scales 2. Cost 5. Concurrency tests 6. Summary 2
  • 3. Barcelona Supercomputing Center (BSC) • Spanish national supercomputing center 22 years history in: • Computer Architecture, networking and distributed systems research • Based at BarcelonaTech University (UPC) • Large ongoing life science computational projects • Prominent body of research activity around Hadoop • 2008-2013: SLA Adaptive Scheduler, Accelerators, Locality Awareness, Performance Management. 7 publications • 2013-Present: Cost-efficient upcoming Big Data architectures (ALOJA) 8+ publications
  • 4. ALOJA: towards cost-effective Big Data • Research project for automating characterization and optimization of Big Data deployments • Open source Benchmarking-to-Insights platform and tools • Largest Big Data public repository (70,000+ jobs) • Community collaboration with industry and academia http://aloja.bsc.es Big Data Benchmarking Online Repository Web / ML Analytics
  • 5. Platform-as-a-Service Spark • Cloud-based managed Hadoop services • Ready-to-use Spark, Hive, … • Simplified management • Deploys in minutes, on-demand, elastic • You select the instance and the number of processing nodes • Decoupled compute and storage • Pay-as-you-go pricing model • Optimized for general purpose • Fine-tuned to the cloud provider architecture
  • 6. Motivation • 2016 SQL-on-Hadoop paper and presentations • Focused on Hive, due to SparkSQL not being ready to use • Different versions (1.3, 1.5, 1.6) • Some in preview mode • Not carefully tuned • Used TPC-H SQL-only benchmark • Early 2017, BigBench on Hive and Spark work testing more than SQL • FOSDEM and HadoopSummit EU presentations • New code available this month for MLlib2 compatibility • Goal: evaluate the current out-of-the-box experience of Spark in PaaS cloud • Readiness, scalability, price, and performance
  • 7. Surveyed Hadoop/Hive PaaS services • Amazon Elastic Map Reduce (EMR) • Released: Apr 2009 • OS: Amazon Linux AMI (RHEL-like) • SW stack: EMR 5.5.0 • Spark 2.1.0 and Hive 2.1 • Google Cloud DataProc (CDP) • Released: Feb 2016 • OS: Debian GNU/Linux 8.4 • SW stack: Preview version Spark 2.1.0 • V 1.1 with Spark 2.0.2 • Both with Hive 2.1 • Azure HDInsight (HDI) • Released: Oct 2013 • OS: Windows Server and Ubuntu 16.04 • SW stack: HDP 2.6 based • Spark 2.1.0 and 1.6.3 • Hive 1.2 • Target deployment: • 16 data nodes with 8-cores each • Master node with 16-cores • Decoupled storage only • Object store / elastic stores
  • 8. VM instances and characteristics • Amazon Elastic Map Reduce (EMR) • 16x M4.2xlarge (datanodes) • 8-core, 32GB RAM • 1x M4.4xlarge (master) • 16-core, 64GB RAM • Storage: 2x EBS GP2 volumes • Price/hr: $10.96 (billed by the hour) • Azure HDInsight (HDI) • 16x D4v2 (datanodes) • 8-core, 28GB RAM • 2x D14v2 (master) • 16-core, 112GB RAM • Storage: WASB (Azure Blob Store) • Price/hr: $20.68 (billed by the minute) • Google Cloud DataProc (CDP) • 16x n1-standard-8 (datanodes) • 8-core, 30GB RAM • 1x n1-standard-16 (master) • 16-core, 60GB RAM • Storage: GCS • Price/hr: $10.38 (billed by the minute) • Disclaimer: snapshot of the out-of-the-box price and performance during May 2017. Performance and especially costs change often. We use non-discounted pricing, calculated with per-second billing. I/O costs are complex to estimate for a single benchmark.
  • 9. What is BigBench (TPCx-BB)? • End-to-end application-level benchmark specification • Result of many years of collaboration between industry and academia • Covers most Big Data analytical properties (3Vs) • Covers 30 business use cases for a retailer company • Defines data scale factors: 1GB to PBs • BigBench history: • 2012: Launched at WBDB • 2013: Published at SIGMOD • 2014: First implementation on GitHub • 2016: Standardized by TPC (Feb) • 2016: TPCx-BB Version 1.2 (Nov) • 2017: Spark MLlib v2 compatibility (under testing, May)
  • 10. BigBench use cases and process overview • 30 business use cases covering: • Merchandising • Pricing optimization • Product return • Customers... • Implementation resulted in: • 14 declarative queries (SQL) • 7 with Natural Language Processing • 4 with data preprocessing with M/R jobs • 5 with Machine Learning jobs • Process: 1. Data generation 2. Data loading 3. Power test 4. Throughput test 1 5. Data refresh 6. Throughput test 2 • Result: BB queries / hour
  • 11. BigBench v1.2 – Reference Implementation • Stack layers (from diagram): • Machine Learning: Mahout, ML Custom, Spark MLlib • SQL Engine: Hive, Spark SQL • Table Metastore: Hive Metastore • Execution Engine: MapReduce, Tez, Spark (on Yarn) • Filesystem: HDFS • Combination options: • Hive + MapReduce + Mahout • Hive + MapReduce + Spark_MLlib • v1 and v2 • Hive + Tez + Mahout • Hive + Tez + Spark_MLlib • Spark SQL + Mahout • Spark SQL + Spark_MLlib v1 • Spark 2 SQL + Mahout • Spark 2 SQL + Spark_MLlib • v1 and v2 • (also Hive-on-Spark… etc)
  • 12. Previous results: M/R vs. Tez and Mahout vs. MLlib v1 • Average of three executions using 100 GB Scale Factor • Chart series: M/R, Tez, Mahout, MLlib v1 • Chart annotations: 3x, 2x
  • 13. Sequential Spark 2.1 runs • Queries 1-30 on Spark 2.1 (power runs) • Per provider and combined • Query 1, Query 2, …, Query 30 • Welcome to Spark version 2.1.0
  • 14. BB 1GB-1TB: Spark 2.1 - Cloud Dataproc (CDP) Notes: • Chart shows the time increase as we move from 1GB to 1TB in data scale (x-axis: scale factor) • From 1 to 100GB, there is less than a 2x increase in time for 100x more data, indicating over-provisioning • From 10GB to 1TB, the increase is 3x in time
  • 15. BB 1GB-1TB: Spark 2.1 – Elastic Map Reduce (EMR) Notes: • Chart shows the time increase as we move from 1GB to 1TB in data scale for EMR (x-axis: scale factor) • From 1 to 100GB, there is less than a 2x increase in time for 100x more data, indicating over-provisioning • From 10GB to 1TB, the increase is 4x in time, while for CDP it was only 3x • The M/R jobs take a higher proportion of the run
  • 16. BB 1GB-1TB: Spark 2.1 – HDInsight (HDI) Notes: • Chart shows the time increase as we move from 1GB to 1TB in data scale for HDI (x-axis: scale factor) • From 1 to 10GB, there is only a 5% increase in time • From 10GB to 1TB, the increase is 2.5x in time, less than the other providers
  • 17. BB 1GB-1TB: Spark 2.1 – All providers Notes: • Chart shows the time increase as we move from 1GB to 1TB in data scale for all providers • EMR is the fastest up to 100GB • At 1TB, HDI is the fastest and EMR the slowest • EMR has the largest increase in M/R queries
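The scaling notes on slides 14 to 17 compare how much the run time grows against how much the data grows. A minimal sketch of that comparison follows. The function name and the timings are hypothetical and are not measurements from the talk.

```python
def growth(times_by_scale_gb, from_gb, to_gb):
    """Return (data growth, time growth) between two scale factors."""
    data_growth = to_gb / from_gb
    time_growth = times_by_scale_gb[to_gb] / times_by_scale_gb[from_gb]
    return data_growth, time_growth


# Hypothetical run times in seconds per scale factor (GB); NOT values from the slides.
example_times = {1: 1000.0, 10: 1100.0, 100: 1900.0, 1000: 3300.0}

data_x, time_x = growth(example_times, 1, 100)
# A time growth far below the data growth suggests the cluster is
# over-provisioned for the smaller scale factors.
print(f"{data_x:.0f}x more data took {time_x:.1f}x more time")
```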
  • 18. Errors… • Everything was run out-of-the-box, except for: • Q14 and Q17 require cross joins to be enabled in Spark v2 • At 10TB, spark.sql.broadcastTimeout (default 300) had to be increased in HDI • Timeout in seconds for the broadcast wait time in broadcast joins • At 1TB, memory issues: • Queries 3, 4, 8 • TimSort java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate • Queries 2 and 30 • 17/05/15 16:57:46 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. • Configs • spark.yarn.driver and executor memoryOverhead • spark.yarn.executor.memory
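The settings named on slide 18 can be passed to a job as `spark-submit --conf key=value` pairs. The sketch below only assembles those arguments. The values are illustrative assumptions, not the ones used in the talk, and `spark.executor.memory` is assumed here as the standard key for executor heap size. `to_submit_args` is a hypothetical helper.

```python
# Illustrative overrides for Spark 2.x on YARN. Values are assumptions,
# not the settings used for the runs in the slides.
overrides = {
    "spark.sql.crossJoin.enabled": "true",         # allow cross-join queries on Spark 2.x
    "spark.sql.broadcastTimeout": "1200",          # seconds; the default is 300
    "spark.executor.memory": "8g",                 # executor heap (assumed key, see lead-in)
    "spark.yarn.executor.memoryOverhead": "1024",  # off-heap headroom, in MiB
    "spark.yarn.driver.memoryOverhead": "1024",    # same, for the driver
}


def to_submit_args(conf):
    """Turn a config dict into a flat list of spark-submit arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(["spark-submit"] + to_submit_args(overrides)))
```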
  • 19. Versions and Spark config (values listed as EMR / CDP / HDI)
      • Java version: OpenJDK 1.8.0_121 / OpenJDK 1.8.0_121 / OpenJDK 1.8.0_131
      • Spark version: 2.1.0 / 2.1 / 2.1.0.2.6.0.2-76
      • Driver memory: 5G / 5G / 5G
      • Executor memory: 5G / 10G / 4G
      • Executor cores: 4 / 4 / 3
      • Executor instances: Dynamic / Dynamic / 20
      • dynamicAllocation enabled: TRUE / TRUE / FALSE
      • Executor memoryOverhead: Default (384MB) / 1,117 MB / 384 MB
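The memory figures above determine the size of the YARN container requested per executor: heap plus memory overhead. The sketch below assumes the Spark 2.x on YARN default rule for the overhead (10% of executor memory, with a 384 MB floor) when no explicit value is set. Under that assumption a 5G executor asks for 5.5 GB, which is consistent with the limit reported in the YARN warning quoted on slide 18. The function names are hypothetical, and YARN's own rounding of container sizes is ignored.

```python
MIN_OVERHEAD_MB = 384  # floor of the default overhead rule (assumed Spark 2.x behaviour)


def default_overhead_mb(executor_memory_mb):
    """Default overhead: 10% of executor memory, but at least 384 MB."""
    return max(MIN_OVERHEAD_MB, executor_memory_mb // 10)


def container_request_mb(executor_memory_mb, overhead_mb=None):
    """Executor heap plus overhead, i.e. the memory YARN enforces as the limit."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb


# Executor memory and overhead taken from the table above (1G = 1024 MB).
print("EMR:", container_request_mb(5 * 1024) / 1024, "GB")         # default overhead
print("CDP:", container_request_mb(10 * 1024, 1117) / 1024, "GB")  # explicit 1,117 MB
print("HDI:", container_request_mb(4 * 1024, 384) / 1024, "GB")    # explicit 384 MB
```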
  • 20. BB 1TB M/R-only: Spark 2.1 – All providers Notes: • When zooming in by query, we can see that query 2 is the slowest on EMR • While on CDP and HDI it is within proportions
  • 21. BB 1TB Q2: Spark 2.1 – CPU Util % EMR and HDI Notes: • Job was CPU bound. Log showed: • WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. • Solution: increased memory for executors, and time was lowered from 6,417s to 1,501s • Q2: Find the top 30 products that are mostly viewed together with a given product in the online store • CREATE TEMPORARY FUNCTION makePairs AS 'io.bigdatabenchmark.v1.queries.udf.PairwiseUDTF';
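To illustrate the kind of work Q2 describes, here is a minimal in-memory sketch of counting products viewed together and ranking the partners of one product. It is not the benchmark's `PairwiseUDTF` or its query logic. The `top_coviewed` function and the sample sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def top_coviewed(sessions, product, n=30):
    """Rank the products most often seen in the same session as `product`."""
    pair_counts = Counter()
    for items in sessions:
        # A session with k distinct products emits k*(k-1)/2 pairs, so the
        # intermediate data grows quadratically with session length.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1

    related = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            related[b] += count
        elif b == product:
            related[a] += count
    return related.most_common(n)


# Hypothetical click sessions: each list holds the product ids viewed together.
sessions = [[1, 2, 3], [1, 2], [2, 3], [1, 4]]
print(top_coviewed(sessions, product=1))
```

The quadratic pair expansion is one plausible reason a query of this shape is sensitive to executor memory at large scale factors.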
  • 22. The sky 10TB is the limit… and a price/performance comparison • Average of three executions of 100 GB Scale Factor
  • 23. BigBench 10TB SQL-only: All providers Notes: • At 10TB, only the SQL part ran correctly in Spark • EMR got the fastest results • The rest still needs tuning to complete • But reaching the limit of the cluster / PaaS config
  • 24. BB 1GB-1TB: Spark 2.1 – Cost/Performance overview Notes: • Chart shows an execution time-to-cost plot (lower left is faster and cheaper) • Costs calculated to the second (not to the billing fractions) • EMR is the cheapest to run for all sizes, and also the fastest up to 100GB • CDP is second, but at 1TB, HDI becomes more cost-effective
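The cost basis described above, to the second rather than to the provider's billing fraction, can be sketched as follows. The hourly prices and billing granularities come from slide 8. The function names are hypothetical, and the run times used in the example are the two Q2 timings from slide 21, applied here purely as illustrative inputs.

```python
import math

# Cluster price per hour and billing granularity, as listed on slide 8 (May 2017).
PRICE_PER_HOUR = {"EMR": 10.96, "HDI": 20.68, "CDP": 10.38}
BILLING_UNIT_S = {"EMR": 3600, "HDI": 60, "CDP": 60}


def cost_to_the_second(provider, seconds):
    """Cost of a run prorated to the second, the basis used for the chart."""
    return PRICE_PER_HOUR[provider] * seconds / 3600.0


def billed_cost(provider, seconds):
    """Cost after rounding the run time up to the provider's billing unit."""
    unit = BILLING_UNIT_S[provider]
    billed_seconds = math.ceil(seconds / unit) * unit
    return PRICE_PER_HOUR[provider] * billed_seconds / 3600.0


for seconds in (6417, 1501):
    print(seconds, "s on EMR:",
          round(cost_to_the_second("EMR", seconds), 2), "USD prorated,",
          round(billed_cost("EMR", seconds), 2), "USD billed")
```

Prorating to the second keeps the comparison independent of each provider's billing granularity, at the price of understating what a short run on an hourly-billed service would actually cost.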
  • 25. Other Spark comparisons: • 2.0.2 vs 2.1.0 • 1.6.3 vs 2.1.0 • MLlib v1 vs v2 • Hive vs. Spark • Average of three executions of 100 GB Scale Factor
  • 26. BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP) Notes: • Spark 2.1 is a bit faster at small scales, slower at 100 GB and 1 TB on the UDF/NLP queries • Chart annotations: 2.1 faster up to 100GB; slower at 1TB
  • 27. BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0 MLlib 1 vs 2.1 MLlib 2 (HDI) Notes: • Spark 2.1 is always faster than 1.6.3 in HDI • MLlib 2, using DataFrames over RDDs, is only slightly faster than v1
  • 28. BigBench 10GB and 1TB: Hive (+MLlib2) vs. Spark 2.1 Notes: • Hive is faster in both HDI and EMR at 10GB, slower in CDP • CDP shows a scalability problem at 1TB with Hive, as it doesn't enable Tez by default • This was observed in the previous study as well • At 1TB both CDP and HDI are faster than Hive (HDI) • Chart panels: 10 GB, 1 TB • Chart annotation: Hive much slower on CDP (doesn't enable Tez by default)
  • 29. Concurrency runs (throughput) • 2 to 32 parallel streams
  • 30. BigBench 1-32 streams Spark 2.1 1GB scale Notes: • From 16 streams on, the bottleneck is the CPU utilization on the master • HDI is faster at concurrency, but also showed the worst variability • Chart annotation: High variability in HDI
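Two quantities summarize a concurrency run like the one above: how many queries complete per hour across all streams, and how much repeated runs vary. The sketch below is a simplified illustration, not the official TPCx-BB metric. It assumes each stream runs all 30 queries. The function names and the sample timings are hypothetical.

```python
import statistics

QUERIES_PER_STREAM = 30  # assumption: every stream runs the 30 BigBench queries


def queries_per_hour(streams, elapsed_seconds):
    """Queries completed per hour when `streams` run in parallel."""
    return streams * QUERIES_PER_STREAM * 3600.0 / elapsed_seconds


def coefficient_of_variation(run_times):
    """Sample standard deviation relative to the mean; higher means less stable."""
    return statistics.stdev(run_times) / statistics.mean(run_times)


# Hypothetical elapsed times in seconds for three repetitions of a 16-stream run.
repetitions = [5200.0, 6100.0, 4900.0]
print(round(queries_per_hour(16, statistics.mean(repetitions)), 1), "queries/hour")
print(round(coefficient_of_variation(repetitions), 3), "coefficient of variation")
```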
  • 31. Conclusions • All providers have up-to-date (2.1.0) and well-tuned versions of Spark • They could run BigBench up to 1TB on a medium-sized cluster • [Almost] out-of-the-box • Performance is similar among providers for similar cluster types and disk configs • Differences according to scale (and pricing) • Spark 2.1.0 is faster than previous versions • Also MLlib 2 with DataFrames • But improvements are within the 30% range • Hive (+Tez + MLlib) is still slightly faster than Spark at lower scales • But very similar at larger scales • And using mainly Spark simplifies the pipeline • BigBench has been useful to stress a cluster with different workloads • Highlights config problems fast and stresses scale limits • Helpful for tuning the clusters • And yes, Spark is now production-ready and performant in PaaS in the cloud
  • 32. Future work / WiP • Compare Hive versions 1 and 2 • HDI still on v1 • Test LLAP with different settings • Variability study for spark workloads in the cloud • Fix 10TB runs to complete results • Compare to on-prem runs • optimizations • Test G1 GC • Fat vs. thin executors configs
  • 33. Resources and references BigBench and ALOJA • BigBench Spark 2 branch (thanks Christoph and Michael from bankmark.de): • https://github.com/carabolic/Big-Data-Benchmark-for-Big-Bench/tree/spark2 • Original BigBench implementation repository • https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench • ALOJA benchmarking platform • https://github.com/Aloja/aloja • http://aloja.bsc.es/publications • ALOJA fork of BigBench (adds support for HDI and fixes Spark) • https://github.com/Aloja/Big-Data-Benchmark-for-Big-Bench • The State of SQL-on-Hadoop in the Cloud – N. Poggi et al. • https://doi.org/10.1109/BigData.2016.7840751 Big Data Benchmarking • Big Data Benchmarking Community (BDBC) mailing list • (~200 members from ~80 organizations) • http://clds.sdsc.edu/bdbc/community • Workshop Big Data Benchmarking (WBDB) • http://clds.sdsc.edu/bdbc/workshops • SPEC Research Big Data working group • http://research.spec.org/working-groups/big-data-working-group.html • Benchmarking slides and video: • Benchmarking Hadoop: • https://www.slideshare.net/ni_po/benchmarking-hadoop • Michael Frank on Big Data benchmarking • http://www.tele-task.de/archive/podcast/20430/ • Tilmann Rabl Big Data Benchmarking Tutorial • http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl
  • 34. Thanks, questions? Follow up / feedback : Nicolas.Poggi@bsc.es Twitter: ni_po The state of Spark in the Cloud