1. How Spark Beat Hadoop @ 100 TB Sort
Advanced Apache Spark Meetup
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Power of data. Simplicity of design. Speed of innovation.
IBM | spark.tc
4. IBM | spark.tc
Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer
IBM Spark Technology Center
5. IBM | spark.tc
Last Meetup (End-to-End Data Pipeline)
Presented `Flux Capacitor`
End-to-End Data Pipeline in a Box!
Real-time, Advanced Analytics
Machine Learning
Recommendations
Github
github.com/fluxcapacitor
Docker
hub.docker.com/r/fluxcapacitor
6. IBM | spark.tc
Since Last Meetup (End-to-End Data Pipeline)
Meetup Statistics
Total Spark Experts: ~850 (+100%)
Mean RSVPs per Meetup: 268
Mean Attendance Percentage: ~60% of RSVPs
Donations: $15 (Thank you so much, but please keep your $!)
Github Statistics (github.com/fluxcapacitor)
18 forks, 13 clones, ~1300 views
Docker Statistics (hub.docker.com/r/fluxcapacitor)
~1600 downloads
7. IBM | spark.tc
Recent Events
Replay of Last SF Meetup in Mtn View @ BaseCRM
Presented Flux Capacitor End-to-End Data Pipeline
(Scala + Big Data) By The Bay Conference
Workshop and 2 Talks
Trained ~100 on End-to-End Data Pipeline
Galvanize Workshop
Trained ~30 on End-to-End Data Pipeline
8. IBM | spark.tc
Upcoming USA Events
IBM Hackathon @ Galvanize (Sept 18th – Sept 21st)
Advanced Apache Spark Meetup@DataStax (Sept 21st)
Spark-Cassandra Spark SQL+DataFrame Connector
Cassandra Summit Talk (Sept 22nd – Sept 24th)
Real-time End-to-End Data Pipeline w/ Cassandra
Strata New York (Sept 29th - Oct 1st)
9. IBM | spark.tc
Upcoming European Events
Dublin Spark Meetup Talk (Oct 15th)
Barcelona Spark Meetup Talk (Oct ?)
Madrid Spark Meetup Talk (Oct ?)
Amsterdam Spark Meetup (Oct 27th)
Spark Summit Amsterdam (Oct 27th – Oct 29th)
Brussels Spark Meetup Talk (Oct 30th)
10. Spark and the Daytona GraySort Challenge
sortbenchmark.org
sortbenchmark.org/ApacheSpark2014.pdf
11. IBM | spark.tc
Themes of this Talk: Mechanical Sympathy
Seek Once, Scan Sequentially
CPU Cache Locality, Memory Hierarchy are Key
Go Off-Heap Whenever Possible
Customize Data Structures for your Workload
12. IBM | spark.tc
What is the Daytona GraySort Challenge?
Key Metric
Throughput of sorting 100 TB of 100-byte records with 10-byte keys
Total time includes launching app and writing output file
Daytona
App must be general purpose
Gray
Named after Jim Gray
13. IBM | spark.tc
Daytona GraySort Challenge: Input and Resources
Input
Records are 100 bytes in length
First 10 bytes are random key
Input generator: `ordinal.com/gensort.html`
28,000 fixed-size partitions for 100 TB sort
250,000 fixed-size partitions for 1 PB sort
1 partition = 1 HDFS block = 1 node = no partial read I/O
Hardware and Runtime Resources
Commercially available and off-the-shelf
Unmodified, no over/under-clocking
Generates 500TB of disk I/O, 200TB network I/O
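As a quick illustration of that record layout, here is a minimal Scala sketch that splits one gensort record into its key and payload (the helper `parseRecord` is hypothetical, not part of the benchmark code):

```scala
// Each gensort record is exactly 100 bytes: a 10-byte random key
// followed by a 90-byte payload. Hypothetical helper for illustration.
def parseRecord(record: Array[Byte]): (Array[Byte], Array[Byte]) = {
  require(record.length == 100, "gensort records are fixed at 100 bytes")
  val key     = java.util.Arrays.copyOfRange(record, 0, 10)   // sort key
  val payload = java.util.Arrays.copyOfRange(record, 10, 100) // rest of record
  (key, payload)
}
```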
14. IBM | spark.tc
Daytona GraySort Challenge: Rules
Must sort to/from OS files in secondary storage
No raw disk since I/O subsystem is being tested
File and device striping (RAID 0) are encouraged
Output file(s) must have correct key order
15. IBM | spark.tc
Daytona GraySort Challenge: Task Scheduling
Types of Data Locality (ordered best to worst)
PROCESS_LOCAL
NODE_LOCAL
RACK_LOCAL
ANY
Delay Scheduling
`spark.locality.wait.node`: time to wait before falling back to the next (worse) locality level
Set to infinite to force NODE_LOCAL and prevent locality from degrading
Straggling Executor JVMs naturally fade away on each run
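A hedged sketch of the delay-scheduling knob described above (the wait value is illustrative; the deck only says it was set effectively to infinite):

```scala
import org.apache.spark.SparkConf

// Wait (effectively) forever for a NODE_LOCAL slot instead of degrading
// to RACK_LOCAL or ANY. In Spark 1.x the value is in milliseconds; the
// number below is illustrative, standing in for "infinite".
val conf = new SparkConf()
  .set("spark.locality.wait.node", "86400000") // 24h, ~infinite for one run
```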
17. IBM | spark.tc
Daytona GraySort Challenge: EC2 Configuration
206 EC2 Worker nodes, 1 Master node
i2.8xlarge
32 Intel Xeon E5-2670 vCPUs @ 2.5 GHz
244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO, request merging, no reordering
3 Gbps mixed read/write disk I/O
Deployed within Placement Group/VPC
Enhanced Networking
Single Root I/O Virtualization (SR-IOV): extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
18. IBM | spark.tc
Daytona GraySort Challenge: Winning Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disabled in-memory caching -- all on-disk!
HDFS 2.4.1 short-circuit local reads, 2x replication
Writes flushed after every run (5 runs for 28,000 partitions)
Netty 4.0.23.Final with native epoll
Speculative Execution disabled: `spark.speculation`=false
Force NODE_LOCAL: `spark.locality.wait.node`=Infinite
Force Netty Off-Heap: `spark.shuffle.io.preferDirectBuffers`=true
Spilling disabled: `spark.shuffle.spill`=false
All compression disabled
19. IBM | spark.tc
Daytona GraySort Challenge: Partitioning
Range Partitioning (vs. Hash Partitioning)
Take advantage of sequential key space
Similar keys grouped together within a partition
Ranges defined by sampling 79 values per partition
Driver sorts samples and defines range boundaries
Sampling took ~10 seconds for 28,000 partitions
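A minimal sketch of range partitioning with Spark's `RangePartitioner` (partition count from the 100 TB run; the unsigned byte-wise `Ordering` for keys is assumed to be supplied by the caller):

```scala
import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.RDD

// Keys are the 10-byte prefixes of each record. RangePartitioner samples
// keys from the input, sorts the sample on the driver, and picks range
// boundaries so similar keys group together within a partition.
def rangeSort(records: RDD[(Array[Byte], Array[Byte])])
             (implicit ord: Ordering[Array[Byte]]): RDD[(Array[Byte], Array[Byte])] = {
  val partitioner = new RangePartitioner(28000, records) // 28,000 partitions @ 100 TB
  records.repartitionAndSortWithinPartitions(partitioner)
}
```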
20. IBM | spark.tc
Daytona GraySort Challenge: Why Bother?
Sorting relies heavily on shuffle, I/O subsystem
Shuffle is major bottleneck in big data processing
Large number of partitions can exhaust OS resources
Shuffle optimization benefits all high-level libraries
Goal is to saturate network controller on all nodes
~125 MB/s (1 Gbps Ethernet), ~1.25 GB/s (10 Gbps Ethernet)
21. IBM | spark.tc
Daytona GraySort Challenge: Per Node Results
Mappers: 3 Gbps/node disk I/O (8x800 SSD)
Reducers: 1.1 Gbps/node network I/O (10Gbps)
23. IBM | spark.tc
Shuffle Overview
All to All, Cartesian Product Operation
[Diagram: the least useful shuffle example I could find]
24. IBM | spark.tc
Spark Shuffle Overview
[Diagram: the most confusing shuffle example I could find]
Stages are Defined by Shuffle Boundaries
25. IBM | spark.tc
Shuffle Intermediate Data: Spill to Disk
Intermediate shuffle data stored in memory
Spill to Disk
`spark.shuffle.spill`=true
`spark.shuffle.memoryFraction`: fraction of heap used for shuffle buffers
Competes with `spark.storage.memoryFraction`
Bump this up from default!! Will help Spark SQL, too.
Skipped Stages
Reuse intermediate shuffle data found on reducer
DAG for that partition can be truncated
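A hedged configuration sketch for the spill settings above (the 0.4 values are illustrative; only the direction of the change is from the deck):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")         // allow spilling shuffle data to disk
  .set("spark.shuffle.memoryFraction", "0.4") // bumped up from the 0.2 default
  .set("spark.storage.memoryFraction", "0.4") // competes with the shuffle fraction
```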
26. IBM | spark.tc
Shuffle Intermediate Data: Compression
`spark.shuffle.compress`
Compress outputs (mapper)
`spark.shuffle.spill.compress`
Compress spills (reducer)
`spark.io.compression.codec`
LZF: Most workloads (new default for Spark)
Snappy: LARGE workloads (less memory required to compress)
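A sketch of the corresponding compression settings (codec choice per the workload guidance above):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")       // compress mapper outputs
  .set("spark.shuffle.spill.compress", "true") // compress reducer spills
  .set("spark.io.compression.codec", "lzf")    // or "snappy" for LARGE workloads
```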
28. IBM | spark.tc
Spark Shuffle Managers
`spark.shuffle.manager` = {
`hash`: < 10,000 reducers
Output file determined by hashing the key of the (K,V) pair
Each mapper creates an output buffer/file per reducer
Leads to M*R output buffers/files per shuffle
`sort`: >= 10,000 reducers
Default since Spark 1.2
Won the Daytona GraySort Challenge w/ 250,000 reducers!!
`tungsten-sort` (Future Meetup!)
}
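Selecting the shuffle manager is a one-line setting; a minimal sketch:

```scala
import org.apache.spark.SparkConf

// "sort" (default since 1.2) keeps one sorted, indexed output file per
// mapper; "hash" creates one file per (mapper, reducer) pair, i.e. M*R.
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")
```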
36. IBM | spark.tc
Daytona GraySort Challenge: Winning Optimizations
CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment
Optimized Sort Algorithm: Elements of (K, V) Pairs
Reduce Network Overhead: Async Netty, epoll
Reduce OS Resource Utilization: Sort Shuffle
37. IBM | spark.tc
CPU-Cache Locality: (Key, Pointer-to-Record)
AlphaSort paper ~1995
Chris Nyberg and Jim Gray
Naïve
List (Pointer-to-Record)
Requires Key to be dereferenced for comparison
AlphaSort
List (Key, Pointer-to-Record)
Key is directly available for comparison
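A minimal Scala sketch of the two layouts (names are illustrative): sorting inline (key, index) entries keeps comparisons in cache, while sorting bare indices dereferences the record buffer on every compare:

```scala
// Records live in one flat buffer; we sort a compact side-array of
// (key, index) entries instead of the records themselves.
case class KeyRef(key: Long, recordIndex: Int)

// AlphaSort: the key is inline, so no dereference per comparison.
def alphaSort(refs: Array[KeyRef]): Array[KeyRef] =
  refs.sortBy(_.key)

// Naive: each compare reaches into the record buffer via keyOf,
// thrashing the CPU cache.
def naiveSort(indices: Array[Int], keyOf: Int => Long): Array[Int] =
  indices.sortBy(keyOf)
```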
38. IBM | spark.tc
CPU-Cache Locality: Cache Alignment
Key(10 bytes) + Pointer(4 bytes*) = 14 bytes
*4 bytes when using compressed OOPS (<32 GB heap)
14 bytes is not a power of two, so entries are not CPU-cache friendly
Cache Alignment Options
① Add Padding (2 bytes)
Key(10 bytes) + Pad(2 bytes) + Pointer(4 bytes)=16 bytes
② (Key-Prefix, Pointer-to-Record)
Perf affected by key distro
Key-Prefix (4 bytes) + Pointer (4 bytes)=8 bytes
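A hedged sketch of option ②, with hypothetical helper names: compare the 4-byte prefixes first and fall back to the full 10-byte key only on a tie:

```scala
// Pack the first 4 bytes of the 10-byte key into an Int prefix.
def keyPrefix(key: Array[Byte]): Int =
  ((key(0) & 0xFF) << 24) | ((key(1) & 0xFF) << 16) |
  ((key(2) & 0xFF) << 8)  |  (key(3) & 0xFF)

// Unsigned, byte-wise lexicographic compare of full keys (tiebreak path).
def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
  var i = 0
  while (i < a.length && i < b.length) {
    val d = (a(i) & 0xFF) - (b(i) & 0xFF)
    if (d != 0) return d
    i += 1
  }
  a.length - b.length
}

// Compare the cache-resident prefixes first; touch the full keys only on
// a tie. Skewed key distributions produce more ties, which is why perf
// depends on the key distribution.
def comparePrefixed(p1: Int, k1: Array[Byte], p2: Int, k2: Array[Byte]): Int = {
  val byPrefix = java.lang.Long.compare(p1.toLong & 0xFFFFFFFFL,
                                        p2.toLong & 0xFFFFFFFFL)
  if (byPrefix != 0) byPrefix else compareKeys(k1, k2)
}
```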
40. IBM | spark.tc
Optimized Sort Algorithm: Elements of (K, V) Pairs
`o.a.s.util.collection.TimSort`
Based on JDK 1.7 TimSort
Performs best on partially-sorted datasets
Optimized for elements of (K,V) pairs
Sorts implementations of SortDataFormat (e.g., KVArraySortDataFormat)
`o.a.s.util.collection.AppendOnlyMap`
Open addressing hash, quadratic probing
Array of [(key0, value0), (key1, value1)]
Good memory locality
Keys never removed, values only appended
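An illustrative sketch (not the actual Spark source) of an append-only, open-addressing map with quadratic probing and the interleaved key/value layout described above:

```scala
// Keys and values interleave in one flat array: [k0, v0, k1, v1, ...].
// Open addressing keeps lookups in contiguous memory; keys are never
// removed, and values are only appended or updated in place.
final class AppendOnlyMapSketch[K, V](capacity: Int) {
  private val data = new Array[AnyRef](2 * capacity)
  private val mask = capacity - 1 // capacity must be a power of two

  def update(key: K, value: V): Unit = {
    var pos = key.hashCode & mask
    var delta = 1
    while (true) {
      val cur = data(2 * pos)
      if (cur == null) {                       // empty slot: append
        data(2 * pos) = key.asInstanceOf[AnyRef]
        data(2 * pos + 1) = value.asInstanceOf[AnyRef]
        return
      } else if (cur == key) {                 // existing key: update value
        data(2 * pos + 1) = value.asInstanceOf[AnyRef]
        return
      }
      pos = (pos + delta) & mask               // quadratic probing
      delta += 1
    }
  }
}
```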
41. IBM | spark.tc
Reduce Network Overhead: Async Netty, epoll
New Network Module based on Async Netty
Replaces old java.nio, low-level, socket-based code
Zero-copy epoll uses kernel-space between disk & network
Custom memory management reduces GC pauses
`spark.shuffle.blockTransferService`=netty
Spark-Netty Performance Tuning
`spark.shuffle.io.numConnectionsPerPeer`
Increase to saturate hosts with multiple disks
`spark.shuffle.io.preferDirectBuffers`
On or Off-heap (Off-heap is default)
Apache Spark Jira
SPARK-2468
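A configuration sketch pulling these knobs together (the connection count is illustrative; config key names as given in this deck):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "netty")  // async Netty module
  .set("spark.shuffle.io.numConnectionsPerPeer", "8")  // illustrative: saturate multi-disk hosts
  .set("spark.shuffle.io.preferDirectBuffers", "true") // off-heap Netty buffers
```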
42. IBM | spark.tc
Reduce OS Resource Utilization: Sort Shuffle
M open files per shuffle; M = num of mappers
`spark.shuffle.sort.bypassMergeThreshold`
Mapper merge-sorts its partitions into 1 master file, indexed by partition range offsets
TimSort (RAM) on the mapper, merge sort (disk) to build the master file, output to HDFS (2x replication)
Reducers seek and scan from the range offset of the master file on the mapper
SPARK-2926: Replace TimSort w/ Merge Sort (Memory)
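A hedged sketch of the reducer side of this layout: given an offset index, one seek plus one sequential scan reads a partition's byte range (file layout and names are illustrative, not Spark's exact shuffle format):

```scala
import java.io.RandomAccessFile

// The index holds one byte offset per partition; partition i's bytes
// live in [offsets(i), offsets(i + 1)) of the master data file.
def readPartition(dataPath: String, offsets: Array[Long], partition: Int): Array[Byte] = {
  val file = new RandomAccessFile(dataPath, "r")
  try {
    val start  = offsets(partition)
    val length = (offsets(partition + 1) - start).toInt
    val buf = new Array[Byte](length)
    file.seek(start)    // seek once...
    file.readFully(buf) // ...then scan sequentially
    buf
  } finally file.close()
}
```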
44. IBM | spark.tc
External Shuffle Service: Separate JVM Process
Takes over when Spark Executor is in GC or dies
Use new Netty-based Network Module
Required for YARN dynamic allocation
Node Manager serves files
Apache Spark Jira: SPARK-3796
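Enabling the external shuffle service is configuration-only; a minimal sketch (dynamic allocation shown because the service is required for it):

```scala
import org.apache.spark.SparkConf

// Serve shuffle files from a separate JVM process so map output survives
// executor GC pauses and executor loss.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true") // requires the service on YARN
```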
46. IBM | spark.tc
Project Tungsten: CPU and Memory Optimizations
Daytona GraySort optimizations targeted Disk and Network
Tungsten optimizations target CPU and Memory
Custom Memory Management
Eliminates JVM object and GC overhead
More Cache-aware Data Structs and Algos
`o.a.s.unsafe.map.BytesToBytesMap` vs. `j.u.HashMap`
Code Generation (default in 1.5)
Generate bytecode from overall query plan
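A minimal sketch of the off-heap idea behind Tungsten's custom memory management, using `sun.misc.Unsafe` directly (illustrative only; Spark wraps this pattern internally rather than exposing it like this):

```scala
// Grab the Unsafe singleton via reflection (it is not public API).
val unsafeField = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[sun.misc.Unsafe]

// Raw off-heap memory: no JVM object headers, invisible to the GC.
val addr = unsafe.allocateMemory(1 << 20) // 1 MB off-heap
unsafe.putLong(addr, 42L)                 // write a long at the base address
assert(unsafe.getLong(addr) == 42L)       // read it back
unsafe.freeMemory(addr)                   // manual lifecycle: must free
```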
47. Thank you!
Special thanks to Big Commerce!!
IBM Spark Tech Center is Hiring!
Nice people only, please!!
IBM | spark.tc
Sign up for our newsletter at
To Be Continued…
48. IBM | spark.tc
Relevant Links
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
49. Power of data. Simplicity of design. Speed of innovation.
IBM Spark Technology Center