How Spark Beat Hadoop @ 100 TB Sort
Advanced Apache Spark Meetup
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Power of data. Simplicity of design. Speed of innovation. IBM | spark.tc
Meetup Housekeeping
Announcements
Deepak Srinivasan
Big Commerce
Steve Beier
IBM Spark Tech Center
Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
Last Meetup (End-to-End Data Pipeline)
Presented `Flux Capacitor`
End-to-End Data Pipeline in a Box!
Real-time, Advanced Analytics
Machine Learning
Recommendations
Github
github.com/fluxcapacitor
Docker
hub.docker.com/r/fluxcapacitor
Since Last Meetup (End-to-End Data Pipeline)
Meetup Statistics
Total Spark Experts: ~850 (+100%)
Mean RSVPs per Meetup: 268
Mean Attendance Percentage: ~60% of RSVPs
Donations: $15 (Thank you so much, but please keep your $!)
Github Statistics (github.com/fluxcapacitor)
18 forks, 13 clones, ~1300 views
Docker Statistics (hub.docker.com/r/fluxcapacitor)
~1600 downloads
Recent Events
Replay of Last SF Meetup in Mtn View @ BaseCRM
Presented Flux Capacitor End-to-End Data Pipeline
(Scala + Big Data) By The Bay Conference
Workshop and 2 Talks
Trained ~100 on End-to-End Data Pipeline
Galvanize Workshop
Trained ~30 on End-to-End Data Pipeline
Upcoming USA Events
IBM Hackathon @ Galvanize (Sept 18th – Sept 21st)
Advanced Apache Spark Meetup@DataStax (Sept 21st)
Spark-Cassandra Spark SQL+DataFrame Connector
Cassandra Summit Talk (Sept 22nd – Sept 24th)
Real-time End-to-End Data Pipeline w/ Cassandra
Strata New York (Sept 29th - Oct 1st)
Upcoming European Events
Dublin Spark Meetup Talk (Oct 15th)
Barcelona Spark Meetup Talk (Oct ?)
Madrid Spark Meetup Talk (Oct ?)
Amsterdam Spark Meetup (Oct 27th)
Spark Summit Amsterdam (Oct 27th – Oct 29th)
Brussels Spark Meetup Talk (Oct 30th)
Spark and the Daytona GraySort Challenge
sortbenchmark.org
sortbenchmark.org/ApacheSpark2014.pdf
Themes of this Talk: Mechanical Sympathy
Seek Once, Scan Sequentially
CPU Cache Locality, Memory Hierarchy are Key
Go Off-Heap Whenever Possible
Customize Data Structures for your Workload
What is the Daytona GraySort Challenge?
Key Metric
Throughput of sorting 100 TB of 100-byte records, each with a 10-byte key
Total time includes launching app and writing output file
Daytona
App must be general purpose
Gray
Named after Jim Gray
Daytona GraySort Challenge: Input and Resources
Input
Records are 100 bytes in length
First 10 bytes are random key
Input generator: `ordinal.com/gensort.html`
28,000 fixed-size partitions for 100 TB sort
250,000 fixed-size partitions for 1 PB sort
1 partition = 1 HDFS block = 1 node = no partial read I/O
Hardware and Runtime Resources
Commercially available and off-the-shelf
Unmodified, no over/under-clocking
Generates 500TB of disk I/O, 200TB network I/O
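The input numbers above imply the record counts and partition sizes below. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes); `input_stats` is a hypothetical helper, not part of the benchmark tooling.

```python
# Back-of-the-envelope sizes for the benchmark input (decimal units assumed).
RECORD_BYTES = 100          # from the slide: each record is 100 bytes
TB = 10**12                 # assumption: 1 TB = 10^12 bytes
PB = 10**15                 # assumption: 1 PB = 10^15 bytes


def input_stats(total_bytes, partitions):
    """Return (number of records, bytes per partition) for a fixed-size split."""
    records = total_bytes // RECORD_BYTES
    bytes_per_partition = total_bytes / partitions
    return records, bytes_per_partition


records_100tb, part_100tb = input_stats(100 * TB, 28_000)
records_1pb, part_1pb = input_stats(1 * PB, 250_000)

print(f"100 TB: {records_100tb:.2e} records, {part_100tb / 1e9:.2f} GB per partition")
print(f"1 PB:   {records_1pb:.2e} records, {part_1pb / 1e9:.2f} GB per partition")
```

Under these assumptions each partition holds roughly 3.6 GB (100 TB) or 4 GB (1 PB), which is why each one maps to a single large HDFS block.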
Daytona GraySort Challenge: Rules
Must sort to/from OS files in secondary storage
No raw disk since I/O subsystem is being tested
File and device striping (RAID 0) are encouraged
Output file(s) must have correct key order
Daytona GraySort Challenge: Task Scheduling
Types of Data Locality
PROCESS_LOCAL
NODE_LOCAL
RACK_LOCAL
ANY
(Levels above are listed from most to least local)
Delay Scheduling
`spark.locality.wait.node`: time to wait before falling back to the next, less-local level
Set to infinite to never fall back, forcing NODE_LOCAL
Straggling Executor JVMs naturally fade away on each run
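A rough sketch of how delay scheduling falls back through the locality levels. It is a simplified model, not Spark's scheduler code: `allowed_level` and the `waits_s` list are hypothetical names, and the 3-second waits are only illustrative.

```python
# Locality levels from the slide, best first.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_level(waited_s, waits_s):
    """Return the worst locality level the scheduler may use after waiting.

    waits_s[i] is how long to hold out at LEVELS[i] before falling back to
    LEVELS[i + 1]; float("inf") means "never fall back past this level".
    """
    level = 0
    remaining = waited_s
    for wait in waits_s:
        if remaining < wait:
            break
        remaining -= wait
        level += 1
    return LEVELS[level]


# Illustrative finite waits: the task eventually runs anywhere.
print(allowed_level(100, [3, 3, 3]))                  # ANY
# An infinite wait at the node level pins tasks to NODE_LOCAL or better.
print(allowed_level(10**6, [3, float("inf"), 3]))     # NODE_LOCAL
```

With an infinite node-level wait the task never degrades to RACK_LOCAL or ANY, which is the effect the benchmark configuration relies on.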
Daytona GraySort Challenge: Winning Results
On-disk only, in-memory caching disabled!
[Results table: 100 TB sort on EC2 (i2.8xlarge), 28,000 partitions; 1 PB sort on EC2 (i2.8xlarge), 250,000 partitions (!!)]
Daytona GraySort Challenge: EC2 Configuration
206 EC2 Worker nodes, 1 Master node
i2.8xlarge
32 vCPUs, Intel Xeon E5-2670 @ 2.5 GHz
244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO, request merging, no reordering
3 GB/s mixed read/write disk I/O
Deployed within Placement Group/VPC
Enhanced Networking
Single Root I/O Virtualization (SR-IOV): extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
Daytona GraySort Challenge: Winning Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disabled in-memory caching -- all on-disk!
HDFS 2.4.1 short-circuit local reads, 2x replication
Writes flushed after every run (5 runs for 28,000 partitions)
Netty 4.0.23.Final with native epoll
Speculative Execution disabled: `spark.speculation`=false
Force NODE_LOCAL: `spark.locality.wait.node`=Infinite
Force Netty Off-Heap: `spark.shuffle.io.preferDirectBuffers`
Spilling disabled: `spark.shuffle.spill`=false
All compression disabled
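As a sketch, some of the settings above can be collected in one place and rendered as `spark-submit` arguments. The keys come from this deck; values marked "assumed" are illustrative stand-ins (the accepted time format depends on the Spark version), and `to_submit_args` is a hypothetical helper.

```python
# Settings named in this deck; values marked "assumed" are illustrative.
WINNING_CONF = {
    "spark.speculation": "false",
    "spark.locality.wait.node": "1000000s",          # assumed stand-in for "Infinite"
    "spark.shuffle.io.preferDirectBuffers": "true",  # assumed value; slide names only the key
    "spark.shuffle.compress": "false",
    "spark.shuffle.spill.compress": "false",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_submit_args(WINNING_CONF)))
```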
Daytona GraySort Challenge: Partitioning
Range Partitioning (vs. Hash Partitioning)
Take advantage of sequential key space
Similar keys grouped together within a partition
Ranges defined by sampling 79 values per partition
Driver sorts samples and defines range boundaries
Sampling took ~10 seconds for 28,000 partitions
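The sampling approach above can be sketched in plain Python. This is not Spark's `RangePartitioner`; the helper names, the tiny input, and the even spacing of split points are assumptions made for illustration.

```python
import bisect
import random


def range_boundaries(partitions, samples_per_partition, num_ranges, rng):
    """Sample keys from every input partition and pick range split points."""
    samples = []
    for part in partitions:
        k = min(samples_per_partition, len(part))
        samples.extend(rng.sample(part, k))
    samples.sort()  # the driver sorts the samples
    # Pick num_ranges - 1 evenly spaced samples as range boundaries.
    step = len(samples) / num_ranges
    return [samples[int(step * i)] for i in range(1, num_ranges)]


def range_of(key, boundaries):
    """Index of the output range that should receive this key."""
    return bisect.bisect_right(boundaries, key)


rng = random.Random(42)
# Hypothetical input: 8 partitions of 1,000 random 10-byte keys each.
inputs = [
    [bytes(rng.getrandbits(8) for _ in range(10)) for _ in range(1000)]
    for _ in range(8)
]
bounds = range_boundaries(inputs, samples_per_partition=79, num_ranges=4, rng=rng)
counts = [0] * 4
for part in inputs:
    for key in part:
        counts[range_of(key, bounds)] += 1
print(counts)  # roughly balanced, not exactly equal
```

Because the boundaries are sorted, every key in range i is less than or equal to every key in range i + 1, so sorting each range independently yields a globally sorted output.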
Daytona GraySort Challenge: Why Bother?
Sorting relies heavily on shuffle, I/O subsystem
Shuffle is major bottleneck in big data processing
Large number of partitions can exhaust OS resources
Shuffle optimization benefits all high-level libraries
Goal is to saturate network controller on all nodes
~125 MB/s (1 Gbps Ethernet), 1.25 GB/s (10 Gbps Ethernet)
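The link-speed figures follow from dividing bits by 8. The sketch below also estimates a lower bound on shuffle time; it assumes every byte crosses the network once and every link stays saturated, which real jobs will not quite achieve. Both function names are hypothetical.

```python
def link_bytes_per_second(gigabits_per_second):
    """Convert a link speed in Gbit/s to bytes/s (8 bits per byte, decimal units)."""
    return gigabits_per_second * 1e9 / 8


def shuffle_seconds(total_bytes, nodes, gigabits_per_second):
    """Lower bound on shuffle time if every node's link is fully saturated."""
    per_node = total_bytes / nodes
    return per_node / link_bytes_per_second(gigabits_per_second)


print(link_bytes_per_second(1) / 1e6)    # 125.0 (MB/s)
print(link_bytes_per_second(10) / 1e9)   # 1.25 (GB/s)
# 100 TB spread over the 206 workers from the EC2 slide, 10 Gbit/s links:
print(round(shuffle_seconds(100e12, 206, 10)))   # 388 (seconds)
```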
Daytona GraySort Challenge: Per Node Results
Mappers: 3 GB/s/node disk I/O (8 x 800 GB SSD)
Reducers: 1.1 GB/s/node network I/O (10 Gbps link)
Quick Shuffle Refresher
Shuffle Overview
All to All, Cartesian Product Operation
[Figure: least useful example I could find]
Spark Shuffle Overview
[Figure: most confusing example I could find]
Stages are Defined by Shuffle Boundaries
Shuffle Intermediate Data: Spill to Disk
Intermediate shuffle data stored in memory
Spill to Disk
`spark.shuffle.spill`=true
`spark.shuffle.memoryFraction`=fraction of the JVM heap shared by all shuffle buffers
Competes with `spark.storage.memoryFraction`
Bump this up from default!! Will help Spark SQL, too.
Skipped Stages
Reuse intermediate shuffle data found on reducer
DAG for that partition can be truncated
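The spill-to-disk idea from this slide, sketched in miniature: buffer records in memory, write a sorted run when the buffer fills, and merge the runs at the end. `SpillingSorter` is a hypothetical class, lists stand in for spill files, and a record count stands in for a memory budget.

```python
import heapq


class SpillingSorter:
    """Buffer records in memory; sort and spill a run when the buffer is full."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.buffer = []
        self.spills = []  # each spill is a sorted run (a list stands in for a file)

    def insert(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        self.spills.append(sorted(self.buffer))
        self.buffer = []

    def sorted_records(self):
        """Merge all spilled runs with whatever is still in memory."""
        return list(heapq.merge(*self.spills, sorted(self.buffer)))


sorter = SpillingSorter(max_in_memory=3)
for value in [9, 4, 7, 1, 8, 2, 6]:
    sorter.insert(value)
print(len(sorter.spills))        # 2
print(sorter.sorted_records())   # [1, 2, 4, 6, 7, 8, 9]
```

A larger in-memory budget means fewer spilled runs to merge, which is the motivation for raising the shuffle memory fraction.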
Shuffle Intermediate Data: Compression
`spark.shuffle.compress`
Compress outputs (mapper)
`spark.shuffle.spill.compress`
Compress spills (reducer)
`spark.io.compression.codec`
LZF: Most workloads (default before Spark 1.1)
Snappy: LARGE workloads (less memory required to compress; default since Spark 1.1)
Spark Shuffle Operations
join
distinct
cogroup
coalesce
repartition
sortByKey
groupByKey
reduceByKey
aggregateByKey
Spark Shuffle Managers
`spark.shuffle.manager` = {
`hash` <10,000 Reducers
Output file determined by hashing the key of (K,V) pair
Each mapper creates an output buffer/file per reducer
Leads to M*R number of output buffers/files per shuffle
`sort` >= 10,000 Reducers
Default since Spark 1.2
Wins Daytona GraySort Challenge w/ 250,000 reducers!!
`tungsten-sort` (Future Meetup!)
}
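To see why the hash manager stops scaling, compare the file counts the two managers produce. The job size below is hypothetical (as many mappers as reducers, borrowing the 100 TB partition count); the functions just restate the formulas from these slides.

```python
def hash_shuffle_files(mappers, reducers):
    """Hash shuffle: each mapper writes one file per reducer."""
    return mappers * reducers


def sort_shuffle_files(mappers):
    """Sort shuffle as described in this talk: one sorted output file per mapper."""
    return mappers


# Hypothetical job with as many mappers as reducers.
m = r = 28_000
print(hash_shuffle_files(m, r))   # 784000000
print(sort_shuffle_files(m))      # 28000
```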
Shuffle Managers
Hash Shuffle Manager
M*R num open files per shuffle; M=num mappers, R=num reducers
Mapper Opens 1 File per Partition/Reducer
[Diagram: HDFS (2x repl) on both the input and output side]
Sort Shuffle Manager
Hold Tight!
Tungsten-Sort Shuffle Manager
Future Meetup!!
Shuffle Performance Tuning
Hash Shuffle Manager (no longer default)
`spark.shuffle.consolidateFiles`: mapper output files
`o.a.s.shuffle.FileShuffleBlockResolver`
Intermediate Files
Increase `spark.shuffle.file.buffer`: reduce seeks & sys calls
Increase `spark.reducer.maxSizeInFlight` if memory allows
Use smaller number of larger workers to reduce total files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
`spark.sql.autoBroadcastJoinThreshold`
Shuffle Configuration
Documentation
spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Prefix
spark.shuffle
Winning Optimizations
Deployed across Spark 1.1 and 1.2
Daytona GraySort Challenge: Winning Optimizations
CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment
Optimized Sort Algorithm: Elements of (K, V) Pairs
Reduce Network Overhead: Async Netty, epoll
Reduce OS Resource Utilization: Sort Shuffle
CPU-Cache Locality: (Key, Pointer-to-Record)
AlphaSort paper ~1995
Chris Nyberg and Jim Gray
Naïve
List (Pointer-to-Record)
Requires Key to be dereferenced for comparison
AlphaSort
List (Key, Pointer-to-Record)
Key is directly available for comparison
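The two layouts can be contrasted with a small sketch. Python cannot show the cache effect itself, only the data-structure difference; the records and the `compare_by_deref` helper are hypothetical.

```python
from functools import cmp_to_key

# Hypothetical records: a 10-byte key followed by a 90-byte payload.
records = [
    b"kiwi......" + b"x" * 90,
    b"apple....." + b"y" * 90,
    b"mango....." + b"z" * 90,
]


def compare_by_deref(a, b):
    """Naive comparison: follow both pointers into the records to read the keys."""
    ka, kb = records[a][:10], records[b][:10]
    return (ka > kb) - (ka < kb)


# Naive: the sort array holds pointers only.
naive = sorted(range(len(records)), key=cmp_to_key(compare_by_deref))

# AlphaSort style: copy each key next to its pointer once, then sort the pairs
# without touching the records again.
pairs = [(rec[:10], ptr) for ptr, rec in enumerate(records)]
pairs.sort()
alphasort = [ptr for _key, ptr in pairs]

print(naive)       # [1, 0, 2]
print(alphasort)   # [1, 0, 2]
```

Both produce the same order; the second keeps everything a comparison needs inside the array being sorted.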
CPU-Cache Locality: Cache Alignment
Key(10 bytes) + Pointer(4 bytes*) = 14 bytes
*4 bytes when using compressed OOPs (<32 GB heap)
Not a power of two in size, so not CPU-cache friendly
Cache Alignment Options
① Add Padding (2 bytes)
Key(10 bytes) + Pad(2 bytes) + Pointer(4 bytes)=16 bytes
② (Key-Prefix, Pointer-to-Record)
Performance depends on key distribution
Key-Prefix (4 bytes) + Pointer (4 bytes)=8 bytes
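Option ② can be sketched as 8-byte entries that are compared by prefix first and only fall back to the full key on a tie. The records are hypothetical, the "pointer" is just a list index, and `entry`/`compare` are illustrative helpers rather than Spark code.

```python
import struct
from functools import cmp_to_key

# Hypothetical records: 10-byte key + payload; the "pointer" is the list index.
records = [b"aaaa-zzzzz" + b".", b"aaaa-bbbbb" + b".", b"aaab-ccccc" + b"."]


def entry(ptr):
    """Pack a 4-byte key prefix and a 4-byte pointer into one 8-byte entry."""
    prefix = struct.unpack(">I", records[ptr][:4])[0]
    return struct.pack(">II", prefix, ptr)


def compare(e1, e2):
    p1, ptr1 = struct.unpack(">II", e1)
    p2, ptr2 = struct.unpack(">II", e2)
    if p1 != p2:                       # common case: the prefix decides
        return -1 if p1 < p2 else 1
    k1, k2 = records[ptr1][:10], records[ptr2][:10]   # tie: dereference full keys
    return (k1 > k2) - (k1 < k2)


entries = sorted((entry(i) for i in range(len(records))), key=cmp_to_key(compare))
order = [struct.unpack(">II", e)[1] for e in entries]
print(len(entries[0]), order)   # 8 [1, 0, 2]
```

When many keys share a prefix, as records 0 and 1 do here, more comparisons take the slow path, which is how the key distribution affects performance.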
CPU-Cache Locality: Performance Comparison
Optimized Sort Algorithm: Elements of (K, V) Pairs
`o.a.s.util.collection.TimSort`
Based on JDK 1.7 TimSort
Performs best on partially-sorted datasets
Optimized for elements of (K,V) pairs
Sorts implementations of SortDataFormat (e.g. KVArraySortDataFormat)
`o.a.s.util.collection.AppendOnlyMap`
Open addressing hash, quadratic probing
Array of [(key0, value0), (key1, value1)]
Good memory locality
Keys never removed, values only appended
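A minimal sketch of an append-only, open-addressing map with a flat key/value array and quadratic probing. It illustrates the layout described above and is not Spark's `AppendOnlyMap`: the class name, the 70% growth threshold, and the initial capacity are assumptions.

```python
class FlatAppendOnlyMap:
    """Open-addressing hash map stored in one flat array: [k0, v0, k1, v1, ...]."""

    _EMPTY = object()

    def __init__(self, capacity=8):
        self.capacity = capacity              # kept a power of two
        self.size = 0
        self.data = [self._EMPTY] * (2 * capacity)

    def _slot(self, key):
        mask = self.capacity - 1
        pos = hash(key) & mask
        step = 1
        while True:
            current = self.data[2 * pos]
            if current is self._EMPTY or current == key:
                return pos
            pos = (pos + step) & mask         # offsets 1, 3, 6, 10, ... grow quadratically
            step += 1

    def update(self, key, value):
        pos = self._slot(key)
        if self.data[2 * pos] is self._EMPTY:
            self.size += 1
        self.data[2 * pos] = key
        self.data[2 * pos + 1] = value
        if self.size * 10 > self.capacity * 7:   # grow at 70% load (assumed threshold)
            self._grow()

    def get(self, key, default=None):
        pos = self._slot(key)
        return default if self.data[2 * pos] is self._EMPTY else self.data[2 * pos + 1]

    def _grow(self):
        old = self.data
        self.capacity *= 2
        self.size = 0
        self.data = [self._EMPTY] * (2 * self.capacity)
        for i in range(0, len(old), 2):
            if old[i] is not self._EMPTY:
                self.update(old[i], old[i + 1])


counts = FlatAppendOnlyMap()
for word in "a b a c b a".split():
    counts.update(word, counts.get(word, 0) + 1)
print(counts.get("a"), counts.get("b"), counts.get("c"))   # 3 2 1
```

There is no delete operation, so a probe can stop at the first empty slot without tombstones, and each key sits next to its value in memory.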
Reduce Network Overhead: Async Netty, epoll
New Network Module based on Async Netty
Replaces old java.nio, low-level, socket-based code
Zero-copy epoll uses kernel-space between disk & network
Custom memory management reduces GC pauses
`spark.shuffle.blockTransferService`=netty
Spark-Netty Performance Tuning
`spark.shuffle.io.numConnectionsPerPeer`
Increase to saturate hosts with multiple disks
`spark.shuffle.io.preferDirectBuffers`
On or Off-heap (Off-heap is default)
Apache Spark Jira: SPARK-2468
Reduce OS Resource Utilization: Sort Shuffle
M open files per shuffle; M = num of mappers
`spark.shuffle.sort.bypassMergeThreshold`
Mapper Merge Sorts Partitions into 1 Master File
Indexed by Partition Range Offsets
Reducers seek and scan from range offset of Master File on Mapper
TimSort (RAM), Merge Sort (Disk)
Bonus! SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
[Diagram: HDFS (2x repl) input, one Master File per mapper, HDFS (2x repl) output]
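The master-file idea can be sketched with an in-memory stand-in for the file: the mapper sorts its records by partition, writes them back to back, and records one offset per partition so a reducer can seek once and scan. The function names and the toy 2-byte records are hypothetical, not Spark's on-disk format.

```python
import io


def write_map_output(records, num_partitions, partition_of):
    """Sort one mapper's records by partition and write a single file plus offsets."""
    ordered = sorted(records, key=lambda rec: (partition_of(rec), rec))
    data = io.BytesIO()
    offsets = [0] * (num_partitions + 1)   # offsets[p]..offsets[p+1] is partition p
    current = 0
    for rec in ordered:
        p = partition_of(rec)
        while current < p:                 # close every partition before p
            current += 1
            offsets[current] = data.tell()
        data.write(rec)
    while current < num_partitions:        # close the remaining partitions
        current += 1
        offsets[current] = data.tell()
    return data.getvalue(), offsets


def read_partition(data, offsets, p):
    """Reducer side: seek once to the partition's offset, then scan sequentially."""
    stream = io.BytesIO(data)
    stream.seek(offsets[p])
    return stream.read(offsets[p + 1] - offsets[p])


# Toy records; the first byte picks the partition ('a' -> 0, 'b' -> 1, ...).
master, index = write_map_output(
    [b"c1", b"a2", b"b3", b"a1"], 4, lambda rec: rec[0] - ord("a")
)
print(master, index)                     # b'a1a2b3c1' [0, 4, 6, 8, 8]
print(read_partition(master, index, 0))  # b'a1a2'
```

One file and one small index per mapper replace the one-file-per-reducer layout of the hash manager.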
External Shuffle Service: Separate JVM Process
Takes over when Spark Executor is in GC or dies
Use new Netty-based Network Module
Required for YARN dynamic allocation
Node Manager serves files
Apache Spark Jira: SPARK-3796
Next Steps
Project Tungsten
Project Tungsten: CPU and Memory Optimizations
Daytona GraySort Optimizations: Disk, Network
Tungsten Optimizations: CPU, Memory
Custom Memory Management
Eliminates JVM object and GC overhead
More Cache-aware Data Structs and Algos
`o.a.s.unsafe.map.BytesToBytesMap` vs. j.u.HashM
Code Generation (default in 1.5)
Generate bytecode from overall query plan
Thank you!
Special thanks to Big Commerce!!
IBM Spark Tech Center is Hiring!
Nice people only, please!! 
Sign up for our newsletter at
To Be Continued…
Relevant Links
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
Power of data. Simplicity of design. Speed of innovation.
IBM Spark

More Related Content

What's hot

Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesIlya Ganelin
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...Databricks
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Chris Fregly
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...Databricks
 
Spark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiDatabricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with SparkChris Fregly
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIRyuji Tamagawa
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanDatabricks
 

What's hot (20)

Frustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFramesFrustration-Reduced PySpark: Data engineering with DataFrames
Frustration-Reduced PySpark: Data engineering with DataFrames
 
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
From DataFrames to Tungsten: A Peek into Spark's Future @ Spark Summit San Fr...
 
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
 
Project Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare MetalProject Tungsten: Bringing Spark Closer to Bare Metal
Project Tungsten: Bringing Spark Closer to Bare Metal
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
PySaprk
PySaprkPySaprk
PySaprk
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Up and running with pyspark
Up and running with pysparkUp and running with pyspark
Up and running with pyspark
 
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
From HelloWorld to Configurable and Reusable Apache Spark Applications in Sca...
 
Spark tutorial
Spark tutorialSpark tutorial
Spark tutorial
 
Spark r under the hood with Hossein Falaki
Spark r under the hood with Hossein FalakiSpark r under the hood with Hossein Falaki
Spark r under the hood with Hossein Falaki
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Performant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame APIPerformant data processing with PySpark, SparkR and DataFrame API
Performant data processing with PySpark, SparkR and DataFrame API
 
A Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen FanA Developer’s View into Spark's Memory Model with Wenchen Fan
A Developer’s View into Spark's Memory Model with Wenchen Fan
 

Viewers also liked

Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginningsDaniel Leon
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedinYukti Kaura
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015Databricks
 
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathExtreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathSpark Summit
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksData Con LA
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitSpark Summit
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Databricks
 

Viewers also liked (20)

Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedin
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathExtreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 

Similar to Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySort Challenge

Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5
Stockholm Spark Meetup Nov 23 2015 Spark After Dark 1.5Chris Fregly
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Toronto Spark Meetup Dec 14 2015
Toronto Spark Meetup Dec 14 2015Chris Fregly
 
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...
Advanced Apache Spark Meetup Spark SQL + DataFrames + Catalyst Optimizer + Da...Chris Fregly
 
Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Sydney Spark Meetup Dec 08, 2015
Sydney Spark Meetup Dec 08, 2015Chris Fregly
 
Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Copenhagen Spark Meetup Nov 25, 2015
Copenhagen Spark Meetup Nov 25, 2015Chris Fregly
 
Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Melbourne Spark Meetup Dec 09 2015
Melbourne Spark Meetup Dec 09 2015Chris Fregly
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
Under The Hood Of A Shard-Per-Core Database Architecture
Under The Hood Of A Shard-Per-Core Database ArchitectureUnder The Hood Of A Shard-Per-Core Database Architecture
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySort Challenge

  • 1. How Spark Beat Hadoop @ 100 TB Sort Advanced Apache Spark Meetup Chris Fregly, Principal Data Solutions Engineer IBM Spark Technology Center Power of data. Simplicity of design. Speed of innovation. IBM | spark.tc
  • 3. IBM | spark.tc Announcements Deepak Srinivasan Big Commerce Steve Beier IBM Spark Tech Center
  • 4. IBM | spark.tc Who am I? Streaming Platform Engineer Streaming Data Engineer Netflix Open Source Committer Data Solutions Engineer Apache Contributor Principal Data Solutions Engineer IBM Technology Center
  • 5. IBM | spark.tc Last Meetup (End-to-End Data Pipeline) Presented `Flux Capacitor` End-to-End Data Pipeline in a Box! Real-time, Advanced Analytics Machine Learning Recommendations Github github.com/fluxcapacitor Docker hub.docker.com/r/fluxcapacitor
  • 6. IBM | spark.tc Since Last Meetup (End-to-End Data Pipeline) Meetup Statistics Total Spark Experts: ~850 (+100%) Mean RSVPs per Meetup: 268 Mean Attendance Percentage: ~60% of RSVPs Donations: $15 (Thank you so much, but please keep your $!) GitHub Statistics (github.com/fluxcapacitor) 18 forks, 13 clones, ~1300 views Docker Statistics (hub.docker.com/r/fluxcapacitor) ~1600 downloads
  • 7. IBM | spark.tc Recent Events Replay of Last SF Meetup in Mtn View @ BaseCRM Presented Flux Capacitor End-to-End Data Pipeline (Scala + Big Data) By The Bay Conference Workshop and 2 Talks Trained ~100 on End-to-End Data Pipeline Galvanize Workshop Trained ~30 on End-to-End Data Pipeline
  • 8. IBM | spark.tc Upcoming USA Events IBM Hackathon @ Galvanize (Sept 18th – Sept 21st) Advanced Apache Spark Meetup@DataStax (Sept 21st) Spark-Cassandra Spark SQL+DataFrame Connector Cassandra Summit Talk (Sept 22nd – Sept 24th) Real-time End-to-End Data Pipeline w/ Cassandra Strata New York (Sept 29th - Oct 1st)
  • 9. IBM | spark.tc Upcoming European Events Dublin Spark Meetup Talk (Oct 15th) Barcelona Spark Meetup Talk (Oct ?) Madrid Spark Meetup Talk (Oct ?) Amsterdam Spark Meetup (Oct 27th) Spark Summit Amsterdam (Oct 27th – Oct 29th) Brussels Spark Meetup Talk (Oct 30th)
  • 10. Spark and the Daytona GraySort Challenge sortbenchmark.org sortbenchmark.org/ApacheSpark2014.pdf
  • 11. IBM | spark.tc Themes of this Talk: Mechanical Sympathy Seek Once, Scan Sequentially CPU Cache Locality, Memory Hierarchy are Key Go Off-Heap Whenever Possible Customize Data Structures for your Workload
  • 12. IBM | spark.tc What is the Daytona GraySort Challenge? Key Metric Throughput of sorting 100TB of 100 byte data, 10 byte key Total time includes launching app and writing output file Daytona App must be general purpose Gray Named after Jim Gray
  • 13. IBM | spark.tc Daytona GraySort Challenge: Input and Resources Input Records are 100 bytes in length First 10 bytes are random key Input generator: `ordinal.com/gensort.html` 28,000 fixed-size partitions for 100 TB sort 250,000 fixed-size partitions for 1 PB sort 1 partition = 1 HDFS block = 1 node = no partial read I/O Hardware and Runtime Resources Commercially available and off-the-shelf Unmodified, no over/under-clocking Generates 500TB of disk I/O, 200TB network I/O
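The record layout on this slide can be sketched in a few lines of Python. This is only an illustration: `make_record` is a hypothetical stand-in for gensort, the record count is tiny, and the size arithmetic assumes 1 TB = 10^12 bytes.

```python
import os

RECORD_SIZE = 100  # bytes per record, as stated on the slide
KEY_SIZE = 10      # the first 10 bytes are the sort key


def make_record() -> bytes:
    """Hypothetical stand-in for gensort: random key plus filler payload."""
    key = os.urandom(KEY_SIZE)
    payload = b"x" * (RECORD_SIZE - KEY_SIZE)
    return key + payload


def sort_records(records):
    """Sort records by their 10-byte key prefix (byte-wise comparison)."""
    return sorted(records, key=lambda r: r[:KEY_SIZE])


records = [make_record() for _ in range(1000)]
ordered = sort_records(records)

# Back-of-the-envelope sizes, assuming 1 TB = 10**12 bytes.
total_bytes = 100 * 10**12
num_records = total_bytes // RECORD_SIZE    # records in a 100 TB input
bytes_per_partition = total_bytes / 28_000  # roughly 3.6 GB per partition
```

Under these assumptions, a 100 TB input holds one trillion records, and each of the 28,000 partitions carries a few gigabytes.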
  • 14. IBM | spark.tc Daytona GraySort Challenge: Rules Must sort to/from OS files in secondary storage No raw disk since I/O subsystem is being tested File and device striping (RAID 0) are encouraged Output file(s) must have correct key order
  • 15. IBM | spark.tc Daytona GraySort Challenge: Task Scheduling Types of Data Locality PROCESS_LOCAL NODE_LOCAL RACK_LOCAL ANY Delay Scheduling `spark.locality.wait.node`: time to wait for next shitty level Set to infinite to reduce shittiness, force NODE_LOCAL Straggling Executor JVMs naturally fade away on each run Increasing Level of Shittiness
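A simplified model of delay scheduling may help here. The sketch below is not Spark's scheduler code; `allowed_locality` and its arguments are hypothetical names, and it assumes a task simply waits a fixed time at each locality level before falling back to the next one. The 3-second waits reflect Spark's default `spark.locality.wait`.

```python
import math

# Locality levels from best to worst, as listed on the slide.
LOCALITY_LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]


def allowed_locality(waited_s, waits_s):
    """Return the loosest locality level accepted after waiting `waited_s` seconds.

    waits_s[i] is how long to wait at level i before falling back to level i + 1
    (a simplified model; the real scheduler resets timers as tasks launch).
    """
    level = 0
    remaining = waited_s
    while level < len(LOCALITY_LEVELS) - 1 and remaining >= waits_s[level]:
        remaining -= waits_s[level]
        level += 1
    return LOCALITY_LEVELS[level]


default_waits = [3, 3, 3]               # seconds per level
forced_node_local = [3, math.inf, 3]    # infinite wait at the NODE_LOCAL level

after_default = allowed_locality(10, default_waits)
after_forced = allowed_locality(10**9, forced_node_local)
```

With an infinite wait at the node level, the model never falls back past NODE_LOCAL no matter how long a task waits, which is the effect the slide describes.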
  • 16. IBM | spark.tc Daytona GraySort Challenge: Winning Results On-disk only, in-memory caching disabled! EC2 (i2.8xlarge) EC2 (i2.8xlarge) 28,000 partitions 250,000 partitions (!!)
  • 17. IBM | spark.tc Daytona GraySort Challenge: EC2 Configuration 206 EC2 Worker nodes, 1 Master node i2.8xlarge 32 Intel Xeon CPU E5-2670 @ 2.5 GHz 244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4 NOOP I/O scheduler: FIFO, request merging, no reordering 3 Gbps mixed read/write disk I/O Deployed within Placement Group/VPC Enhanced Networking Single Root I/O Virtualization (SR-IOV): extension of PCIe 10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
  • 18. IBM | spark.tc Daytona GraySort Challenge: Winning Configuration Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17 Disabled in-memory caching -- all on-disk! HDFS 2.4.1 short-circuit local reads, 2x replication Writes flushed after every run (5 runs for 28,000 partitions) Netty 4.0.23.Final with native epoll Speculative Execution disabled: `spark.speculation`=false Force NODE_LOCAL: `spark.locality.wait.node`=Infinite Force Netty Off-Heap: `spark.shuffle.io.preferDirectBuffers` Spilling disabled: `spark.shuffle.spill`=true All compression disabled
  • 19. IBM | spark.tc Daytona GraySort Challenge: Partitioning Range Partitioning (vs. Hash Partitioning) Take advantage of sequential key space Similar keys grouped together within a partition Ranges defined by sampling 79 values per partition Driver sorts samples and defines range boundaries Sampling took ~10 seconds for 28,000 partitions
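The sampling approach on this slide can be sketched roughly as follows. This is a minimal illustration, not Spark's `RangePartitioner`: the function names are hypothetical, keys are plain integers, and the split points are chosen by even spacing over the sorted samples.

```python
import bisect
import random


def range_boundaries(partitions, num_ranges, samples_per_partition=79, seed=0):
    """Sample keys from every input partition, sort the samples on the
    'driver', and pick num_ranges - 1 evenly spaced split points."""
    rng = random.Random(seed)
    samples = []
    for part in partitions:
        k = min(samples_per_partition, len(part))
        samples.extend(rng.sample(part, k))
    samples.sort()
    step = len(samples) / num_ranges
    return [samples[int(step * i)] for i in range(1, num_ranges)]


def partition_for(key, boundaries):
    """Binary-search the boundaries to find the output range for a key."""
    return bisect.bisect_right(boundaries, key)


data_rng = random.Random(42)
partitions = [[data_rng.randrange(10**6) for _ in range(1000)] for _ in range(8)]
boundaries = range_boundaries(partitions, num_ranges=4)
assignments = [partition_for(key, boundaries) for part in partitions for key in part]
```

Because the boundaries are sorted, every key in output range i is less than or equal to every key in range i + 1, so sorting each range independently yields a globally ordered result.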
  • 20. IBM | spark.tc Daytona GraySort Challenge: Why Bother? Sorting relies heavily on shuffle, I/O subsystem Shuffle is major bottleneck in big data processing Large number of partitions can exhaust OS resources Shuffle optimization benefits all high-level libraries Goal is to saturate network controller on all nodes ~125 MB/s (1 Gb Ethernet), 1.25 GB/s (10 Gb Ethernet)
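The saturation figures follow from dividing the line rate in bits by eight. The sketch below reproduces that arithmetic and derives a rough lower bound on shuffle time from the 200 TB of network I/O and 206 worker nodes mentioned on the earlier slides. The helper names are hypothetical, and the bound assumes perfectly even traffic and full line rate on every node.

```python
def link_bytes_per_second(gigabits_per_second: float) -> float:
    """Convert a link speed in Gbps to bytes per second (1 Gb = 10**9 bits)."""
    return gigabits_per_second * 10**9 / 8


one_gbe = link_bytes_per_second(1)    # 125,000,000 bytes/s
ten_gbe = link_bytes_per_second(10)   # 1,250,000,000 bytes/s


def min_shuffle_seconds(total_bytes, num_nodes, gigabits_per_second):
    """Idealized lower bound: all nodes receive at line rate the whole time."""
    return total_bytes / (num_nodes * link_bytes_per_second(gigabits_per_second))


lower_bound = min_shuffle_seconds(200 * 10**12, 206, 10)
```

Under these idealized assumptions the network transfer alone takes on the order of 13 minutes, which is why keeping every network controller busy matters.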
  • 21. IBM | spark.tc Daytona GraySort Challenge: Per Node Results Mappers: 3 Gbps/node disk I/O (8x800 SSD) Reducers: 1.1 Gbps/node network I/O (10Gbps)
  • 23. IBM | spark.tc Shuffle Overview All to All, Cartesian Product Operation Least -> Useful Example I Could Find ->
  • 24. IBM | spark.tc Spark Shuffle Overview Most -> Confusing Example I Could Find -> Stages are Defined by Shuffle Boundaries
  • 25. IBM | spark.tc Shuffle Intermediate Data: Spill to Disk Intermediate shuffle data stored in memory Spill to Disk `spark.shuffle.spill`=true `spark.shuffle.memoryFraction`=% of all shuffle buffers Competes with `spark.storage.memoryFraction` Bump this up from default!! Will help Spark SQL, too. Skipped Stages Reuse intermediate shuffle data found on reducer DAG for that partition can be truncated
  • 26. IBM | spark.tc Shuffle Intermediate Data: Compression `spark.shuffle.compress` Compress outputs (mapper) `spark.shuffle.spill.compress` Compress spills (reducer) `spark.io.compression.codec` LZF: Most workloads (new default for Spark) Snappy: LARGE workloads (less memory required to compress)
  • 27. IBM | spark.tc Spark Shuffle Operations join distinct cogroup coalesce repartition sortByKey groupByKey reduceByKey aggregateByKey
  • 28. IBM | spark.tc Spark Shuffle Managers `spark.shuffle.manager` = { `hash` <10,000 Reducers Output file determined by hashing the key of (K,V) pair Each mapper creates an output buffer/file per reducer Leads to M*R number of output buffers/files per shuffle `sort` >= 10,000 Reducers Default since Spark 1.2 Wins Daytona GraySort Challenge w/ 250,000 reducers!! `tungsten-sort` (Future Meetup!) }
  • 30. IBM | spark.tc Hash Shuffle Manager M*R num open files per shuffle; M=num mappers R=num reducers Mapper Opens 1 File per Partition/Reducer HDFS (2x repl) HDFS (2x repl)
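A quick sketch of why the M*R file count matters. The function names and the example sizes are hypothetical; the sort-shuffle count covers data files only and ignores the small index file written next to each one.

```python
def hash_shuffle_files(num_mappers, num_reducers):
    """Hash shuffle: every mapper writes one file per reducer."""
    return num_mappers * num_reducers


def sort_shuffle_files(num_mappers):
    """Sort shuffle: every mapper writes a single data file."""
    return num_mappers


if __name__ == "__main__":
    m, r = 1_000, 1_000  # illustrative sizes, not taken from the benchmark
    print("hash shuffle files:", hash_shuffle_files(m, r))
    print("sort shuffle files:", sort_shuffle_files(m))
```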
  • 31. IBM | spark.tc Sort Shuffle Manager Hold Tight!
  • 32. IBM | spark.tc Tungsten-Sort Shuffle Manager Future Meetup!!
  • 33. IBM | spark.tc Shuffle Performance Tuning Hash Shuffle Manager (no longer default) `spark.shuffle.consolidateFiles`: mapper output files `o.a.s.shuffle.FileShuffleBlockResolver` Intermediate Files Increase `spark.shuffle.file.buffer`: reduce seeks & sys calls Increase `spark.reducer.maxSizeInFlight` if memory allows Use smaller number of larger workers to reduce total files SQL: BroadcastHashJoin vs. ShuffledHashJoin `spark.sql.autoBroadcastJoinThreshold`
  • 34. IBM | spark.tc Shuffle Configuration Documentation spark.apache.org/docs/latest/configuration.html#shuffle-behavior Prefix spark.shuffle
  • 36. IBM | spark.tc Daytona GraySort Challenge: Winning Optimizations CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment Optimized Sort Algorithm: Elements of (K, V) Pairs Reduce Network Overhead: Async Netty, epoll Reduce OS Resource Utilization: Sort Shuffle
  • 37. IBM | spark.tc CPU-Cache Locality: (Key, Pointer-to-Record) AlphaSort paper ~1995 Chris Nyberg and Jim Gray Naïve List (Pointer-to-Record) Requires Key to be dereferenced for comparison AlphaSort List (Key, Pointer-to-Record) Key is directly available for comparison
  • 38. IBM | spark.tc CPU-Cache Locality: Cache Alignment Key(10 bytes) + Pointer(4 bytes*) = 14 bytes *4 bytes when using compressed OOPs (<32 GB heap) Not a power of 2 in size, not CPU-cache friendly Cache Alignment Options ① Add Padding (2 bytes) Key(10 bytes) + Pad(2 bytes) + Pointer(4 bytes) = 16 bytes ② (Key-Prefix, Pointer-to-Record) Perf affected by key distribution Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
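Option ② can be sketched in Python as follows. This only illustrates the layout and the comparison logic; Python gives none of the CPU-cache benefit, and the names `ENTRY`, `make_entry` and `sorted_indices` are hypothetical. Each entry packs a 4-byte key prefix and a 4-byte record index into 8 bytes. The comparator looks at the prefix first and dereferences the full record only when two prefixes tie, which is why performance depends on the key distribution.

```python
import functools
import struct

# 4-byte unsigned key prefix + 4-byte unsigned record index = 8 bytes per entry.
ENTRY = struct.Struct(">II")


def make_entry(key, index):
    """Pack the first 4 bytes of a key (zero-padded) with the record's index."""
    prefix = int.from_bytes(key[:4].ljust(4, b"\x00"), "big")
    return ENTRY.pack(prefix, index)


def sorted_indices(records):
    """Return record indices in key order; records is a list of (key_bytes, value)."""
    entries = [make_entry(key, i) for i, (key, _value) in enumerate(records)]

    def compare(a, b):
        prefix_a, index_a = ENTRY.unpack(a)
        prefix_b, index_b = ENTRY.unpack(b)
        if prefix_a != prefix_b:
            return -1 if prefix_a < prefix_b else 1
        # Prefix tie: fall back to the full keys stored in the records.
        key_a, key_b = records[index_a][0], records[index_b][0]
        return (key_a > key_b) - (key_a < key_b)

    entries.sort(key=functools.cmp_to_key(compare))
    return [ENTRY.unpack(entry)[1] for entry in entries]


if __name__ == "__main__":
    recs = [(b"bbbbbbbbbb", "r0"), (b"aaaaaaaaab", "r1"), (b"aaaaaaaaaa", "r2")]
    print("entry size in bytes:", ENTRY.size)
    print("sorted record order:", sorted_indices(recs))
```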
  • 39. IBM | spark.tc CPU-Cache Locality: Performance Comparison
  • 40. IBM | spark.tc Optimized Sort Algorithm: Elements of (K, V) Pairs `o.a.s.util.collection.TimSort` Based on JDK 1.7 TimSort Performs best on partially-sorted datasets Optimized for elements of (K,V) pairs Sorts impl of SortDataFormat (i.e. KVArraySortDataFormat) `o.a.s.util.collection.AppendOnlyMap` Open addressing hash, quadratic probing Array of [(key0, value0), (key1, value1)] Good memory locality Keys never removed, values only append (^2 Probing)
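A rough Python sketch of the data structure described on this slide: an open-addressing hash map that keeps keys and values side by side in one flat array, probes with a growing step, and never removes keys. This is not Spark's `AppendOnlyMap` code. The class name, the 0.7 load factor, the starting capacity, and the restriction that `None` cannot be a key are all assumptions made for illustration.

```python
class AppendOnlyMapSketch:
    """Open-addressing hash map stored as one flat array [k0, v0, k1, v1, ...]."""

    def __init__(self, initial_capacity=64):
        if initial_capacity & (initial_capacity - 1) != 0:
            raise ValueError("capacity must be a power of 2")
        self._capacity = initial_capacity
        self._data = [None] * (2 * initial_capacity)
        self._size = 0

    def change_value(self, key, update):
        """Store update(had_value, old_value) under key; keys are never removed."""
        mask = self._capacity - 1
        pos, delta = hash(key) & mask, 1
        while True:
            current = self._data[2 * pos]
            if current is None:  # empty slot: insert a new key
                self._data[2 * pos] = key
                self._data[2 * pos + 1] = update(False, None)
                self._size += 1
                if self._size > 0.7 * self._capacity:
                    self._grow()
                return
            if current == key:  # existing key: update the value in place
                self._data[2 * pos + 1] = update(True, self._data[2 * pos + 1])
                return
            pos = (pos + delta) & mask  # step grows by one on each collision
            delta += 1

    def _grow(self):
        """Double the capacity and re-insert every key/value pair."""
        old = self._data
        self._capacity *= 2
        self._data = [None] * (2 * self._capacity)
        mask = self._capacity - 1
        for i in range(0, len(old), 2):
            if old[i] is None:
                continue
            pos, delta = hash(old[i]) & mask, 1
            while self._data[2 * pos] is not None:
                pos = (pos + delta) & mask
                delta += 1
            self._data[2 * pos] = old[i]
            self._data[2 * pos + 1] = old[i + 1]

    def items(self):
        d = self._data
        return [(d[i], d[i + 1]) for i in range(0, len(d), 2) if d[i] is not None]


if __name__ == "__main__":
    counts = AppendOnlyMapSketch(initial_capacity=4)
    for word in "a b a c b a d e f g".split():
        counts.change_value(word, lambda had, old: old + 1 if had else 1)
    print(sorted(counts.items()))
```

Keeping each key next to its value means a lookup that finds the key usually has the value in the same cache line, which is the memory-locality point the slide makes.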
  • 41. IBM | spark.tc Reduce Network Overhead: Async Netty, epoll New Network Module based on Async Netty Replaces old java.nio, low-level, socket-based code Zero-copy epoll uses kernel-space between disk & network Custom memory management reduces GC pauses `spark.shuffle.blockTransferService`=netty Spark-Netty Performance Tuning `spark.shuffle.io.numConnectionsPerPeer` Increase to saturate hosts with multiple disks `spark.shuffle.io.preferDirectBuffers` On or Off-heap (Off-heap is default) Apache Spark Jira SPARK-2468
  • 42. IBM | spark.tc Reduce OS Resource Utilization: Sort Shuffle M open files per shuffle; M = num of mappers `spark.shuffle.sort.bypassMergeThreshold` TimSort (RAM) Merge Sort (Disk) Mapper Merge Sorts Partitions into 1 Master File Indexed by Partition Range Offsets Reducers seek and scan from range offset of Master File on Mapper SPARK-2926: Replace TimSort w/Merge Sort (Memory) HDFS (2x repl)
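The "one master file plus partition offsets" idea can be sketched with an in-memory byte buffer standing in for the mapper's output file. This is a simplified illustration, not Spark's sort shuffle writer: the function names, the tab-separated record encoding, and sorting by (partition, key) in a single in-memory pass are assumptions.

```python
import io


def write_map_output(records, num_partitions, partition_of):
    """Sort one mapper's records by partition id, write them to a single
    in-memory 'master file', and return its bytes plus an offset index.
    offsets[p] and offsets[p + 1] delimit the bytes of partition p."""
    ordered = sorted(records, key=lambda kv: (partition_of(kv[0]), kv[0]))
    data = io.BytesIO()
    offsets = [0] * (num_partitions + 1)
    current = 0
    for key, value in ordered:
        pid = partition_of(key)
        while current < pid:  # close out partitions that end before this record
            current += 1
            offsets[current] = data.tell()
        data.write(f"{key}\t{value}\n".encode())
    while current < num_partitions:  # trailing (possibly empty) partitions
        current += 1
        offsets[current] = data.tell()
    return data.getvalue(), offsets


def read_partition(data, offsets, pid):
    """Reducer side: seek to the partition's offset and scan only its bytes."""
    chunk = data[offsets[pid]:offsets[pid + 1]]
    return [tuple(line.split("\t")) for line in chunk.decode().splitlines()]


if __name__ == "__main__":
    recs = [("d", "4"), ("a", "1"), ("c", "3"), ("b", "2")]
    blob, index = write_map_output(recs, 3, lambda k: 0 if k < "c" else 1)
    print("offsets:", index)
    print("partition 1:", read_partition(blob, index, 1))
```

Each mapper holds one output file open regardless of the number of reducers, which is how the open-file count drops from M*R to M.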
  • 44. IBM | spark.tc External Shuffle Service: Separate JVM Process Takes over when Spark Executor is in GC or dies Use new Netty-based Network Module Required for YARN dynamic allocation Node Manager serves files Apache Spark Jira: SPARK-3796
  • 46. IBM | spark.tc Project Tungsten: CPU and Memory Optimizations Disk Network CPU Memory Daytona GraySort Optimizations Tungsten Optimizations Custom Memory Management Eliminates JVM object and GC overhead More Cache-aware Data Structs and Algos `o.a.s.unsafe.map.BytesToBytesMap` vs. j.u.HashM Code Generation (default in 1.5) Generate bytecode from overall query plan
  • 47. Thank you! Special thanks to Big Commerce!! IBM Spark Tech Center is Hiring! Nice people only, please!!  IBM | spark.tc Sign up for our newsletter at To Be Continued…
  • 48. IBM | spark.tc Relevant Links http://sortbenchmark.org/ApacheSpark2014.pdf https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html http://0x0fff.com/spark-architecture-shuffle/ http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
  • 49. Power of data. Simplicity of design. Speed of innovation. IBM Spark