1) The document discusses Spark's memory model and how data is stored and processed in Spark's executors in a memory-efficient way.
2) Spark allocates memory in variable-length pages (with off-heap support) and frees them immediately to simplify the implementation and improve performance.
3) Spark stores data in a compact binary format rather than Java objects to reduce space overhead and improve efficiency of data processing and serialization.
4) Spark employs cache-aware data structures for sorting and hashing to improve data locality and reduce CPU cache misses, improving performance.
2. About Me
• Software Engineer @
• Apache Spark Committer
• One of the most active Spark contributors
4. About Databricks
TEAM
Started the Spark project (now Apache Spark) at UC Berkeley in 2009
MISSION
Make Big Data Simple
PRODUCT
Unified Analytics Platform
6. Memory Model inside Executor
[Diagram: data is kept in a compact binary internal format; operators request memory; the Cache Manager and Memory Manager allocate memory pages]
7. Memory Model inside Executor (diagram repeated)
8. Memory Allocation
• Allocation happens at page granularity.
• Off-heap supported!
• Pages are not fixed-size, but have a lower and an upper bound.
• No pooling: pages are freed as soon as they hold no data.
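The allocation policy above can be sketched as follows. This is a hypothetical illustration, not Spark's actual `TaskMemoryManager` API: the class name, bound constants, and use of a direct `ByteBuffer` for off-heap memory are all assumptions chosen to show page-granularity allocation with a size range.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of page-granularity allocation with lower/upper size
// bounds (the constants are illustrative, not Spark's real values).
class PageAllocator {
    static final int MIN_PAGE = 1 << 20;   // assumed lower bound: 1 MB
    static final int MAX_PAGE = 1 << 26;   // assumed upper bound: 64 MB

    // Allocate a page large enough for `required` bytes, clamped to the bounds.
    ByteBuffer allocatePage(int required) {
        if (required > MAX_PAGE) {
            // corresponds to "cannot handle a super-big single record"
            throw new IllegalArgumentException("record larger than max page size");
        }
        int size = Math.max(required, MIN_PAGE);
        // off-heap allocation; with no pooling, the memory is released as soon
        // as the buffer is reclaimed, letting the OS reuse it for file buffers
        return ByteBuffer.allocateDirect(size);
    }
}
```

Because pages are at least `MIN_PAGE` bytes, a record smaller than the lower bound never spans two pages, which is the simplification the next slide's pros list refers to.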
9. Why variable-length pages and no pooling?
• Pros:
• simplifies the implementation (no single record spans multiple pages).
• memory is freed immediately, so the OS can reuse it for file buffers, etc.
• Cons:
• cannot handle a super-big single record (very rare in practice).
• fragmentation for records bigger than the page-size lower bound (the lower bound is several megabytes, so this is also rare).
• overhead in allocation (most malloc algorithms handle this well).
10. Memory Model inside Executor (diagram repeated)
11. Java-Objects-Based Row Format
The row (123, "data", "bricks") becomes an Array holding BoxedInteger(123), String("data"), String("bricks"):
• 5+ objects per row
• high space overhead
• slow value access
• expensive hashCode()
12. Data objects? No!
• It is hard to monitor and control memory usage when we have a lot of objects.
• Garbage collection will be the killer.
• High serialization cost when transferring data inside the cluster.
16. Efficient Binary Format
“bricks”0x0 123 32 40 “data”4 6
null tracking
(123, “data”, “bricks”)
offset and length of
offset and length of
data
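A minimal sketch of this layout, assuming a simplified stand-in for Spark's real UnsafeRow code (the class and method names here are invented): a null-tracking word, one 8-byte slot per field where an int is stored inline and a string slot packs (offset, length) into a single long, followed by the 8-byte-aligned variable-length region.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hedged sketch of a binary row format; not Spark's actual implementation.
class BinaryRowSketch {
    static ByteBuffer writeRow(int i, String s1, String s2) {
        byte[] b1 = s1.getBytes(StandardCharsets.UTF_8);
        byte[] b2 = s2.getBytes(StandardCharsets.UTF_8);
        int a1 = (b1.length + 7) & ~7;       // pad variable data to 8-byte words
        int a2 = (b2.length + 7) & ~7;
        int fixed = 8 + 3 * 8;               // null-tracking word + 3 fixed slots
        ByteBuffer buf = ByteBuffer.allocate(fixed + a1 + a2);
        buf.putLong(0L);                     // null tracking: no nulls
        buf.putLong(i);                      // field 0: int stored inline
        buf.putLong(((long) fixed << 32) | b1.length);        // field 1: (offset, length)
        buf.putLong(((long) (fixed + a1) << 32) | b2.length); // field 2: (offset, length)
        buf.put(b1);
        buf.position(fixed + a1);            // skip alignment padding
        buf.put(b2);
        return buf;
    }

    // Read a string field without deserializing the whole row: follow the
    // (offset, length) stored in the field's fixed-width slot.
    static String readString(ByteBuffer row, int ordinal) {
        long slot = row.getLong(8 + ordinal * 8);
        int offset = (int) (slot >>> 32);
        int len = (int) slot;
        byte[] out = new byte[len];
        for (int k = 0; k < len; k++) out[k] = row.get(offset + k);
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

With three fields, the fixed region is 32 bytes, so "data" lands at offset 32 with length 4 and, after padding, "bricks" at offset 40 with length 6, matching the layout on the slide.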
17. Memory Model inside Executor (diagram repeated)
38. Naive Hash Map
Each lookup needs many pointer dereferences and key comparisons when hash collisions happen, and jumps between 2 memory regions: bad cache locality!
44. Cache-aware Hash Map
Each lookup mostly needs only one pointer dereference and key comparison (full-hash collisions are rare), and accesses data mostly in a single memory region: better cache locality!
45. Recap: Cache-aware data structures
How to improve cache locality:
• store a key prefix with the pointer (sort).
• store the key's full hash with the pointer (hash map).
Store extra information to keep memory accesses within a single region.
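The hash-map half of this recap can be illustrated with a small sketch. This is a hedged, invented example (not Spark's `BytesToBytesMap`): an open-addressing table whose slot array packs the key's full 32-bit hash next to a 32-bit index into a side array, so a lookup scans one flat array and only chases the "pointer" when the full hash matches.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged illustration of a cache-aware hash map (names invented).
class CacheAwareMap {
    // Each slot packs (full hash << 32) | (key index + 1); 0 means empty.
    private final long[] slots = new long[64];
    private final List<String> keys = new ArrayList<>();

    void put(String key) {
        int h = key.hashCode();
        int pos = (h & 0x7fffffff) % slots.length;
        while (slots[pos] != 0) pos = (pos + 1) % slots.length;  // linear probing
        keys.add(key);
        slots[pos] = ((long) h << 32) | (keys.size() & 0xffffffffL);
    }

    boolean contains(String key) {
        int h = key.hashCode();
        int pos = (h & 0x7fffffff) % slots.length;
        while (slots[pos] != 0) {
            // Compare full hashes first: this stays inside the slot array,
            // so most mismatches cost no extra memory-region jump.
            if ((int) (slots[pos] >>> 32) == h) {
                int idx = (int) slots[pos] - 1;  // full-hash match: verify the key
                if (keys.get(idx).equals(key)) return true;
            }
            pos = (pos + 1) % slots.length;
        }
        return false;
    }
}
```

The payoff is exactly the slide's claim: since full-hash collisions between different keys are rare, almost every lookup dereferences the side array at most once.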
46. Memory Model inside Executor (diagram repeated)
47. Future Work
• SPARK-19489: Stable serialization format for external &
native code integration.
• SPARK-15689: Data source API v2
• SPARK-15687: Columnar execution engine.
48. Try Apache Spark in Databricks!
UNIFIED ANALYTICS PLATFORM
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark, optimized for the cloud
• Caching and optimization layer (DBIO)
• Enterprise security (DBES)
Try for free today: databricks.com
First of all, Spark is a distributed engine, so let's look at memory usage at the cluster level.
A typical Spark cluster has one master node and several worker nodes.
The entry point of a Spark application is the driver process; it can run on any node.
When the driver launches, it requests executors from the master.
The master asks the workers to launch executors for the driver.
Each executor is a JVM process with a memory manager and a thread pool to run tasks.
Distributing resources across Spark applications is out of scope for this talk;
this talk focuses on memory management inside executors.
Let's zoom in!
Keep data as binary and operate on binary. Execution and storage are the 2 major memory consumers in Spark. The memory manager only manages part of the total available memory.
16 GB upper bound
Why use an internal format? Why not just use objects?
Some more reasons why Spark doesn't use data objects.
How to operate on binary data directly.
The whole data flow looks efficient, but in Spark we keep asking ourselves: can it run faster?
Shall we re-implement all operators in Spark to be CPU-cache friendly? That's a lot of work! Instead, we focus on only 2 algorithms, because...
Many fancy algorithms/systems are fundamentally based on sort and hash; improving these 2 algorithms benefits a lot of operators in Spark.
Most sort algorithms work by picking 2 records, comparing and swapping them,
then repeating that again and again.
The idea behind this algorithm is that we can usually compare 2 records using just their first several bytes.
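A minimal sketch of that idea, with invented names and not Spark's actual sorter: each record gets an 8-byte key prefix stored alongside its index, and the comparator falls back to the full record only when the prefixes tie. The prefix array is compact and scanned together, which is where the cache-locality win comes from.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hedged sketch of prefix sorting (illustrative, not Spark's implementation).
class PrefixSortSketch {
    // Pack the first 8 bytes of the key into a long, zero-padded,
    // so that unsigned long comparison matches byte-lexicographic order.
    static long prefixOf(String s) {
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        long p = 0;
        for (int i = 0; i < 8; i++) {
            p = (p << 8) | (i < b.length ? (b[i] & 0xffL) : 0);
        }
        return p;
    }

    static String[] sort(String[] records) {
        long[] prefixes = new long[records.length];
        Integer[] idx = new Integer[records.length];
        for (int i = 0; i < records.length; i++) {
            prefixes[i] = prefixOf(records[i]);
            idx[i] = i;
        }
        Arrays.sort(idx, (a, b) -> {
            int c = Long.compareUnsigned(prefixes[a], prefixes[b]);
            // only fetch the full records when the 8-byte prefixes tie
            return c != 0 ? c : records[a].compareTo(records[b]);
        });
        String[] out = new String[records.length];
        for (int i = 0; i < records.length; i++) out[i] = records[idx[i]];
        return out;
    }
}
```

For ASCII keys that differ within their first 8 bytes, the prefix comparison alone decides the order and the records themselves are never touched.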
Collision is common.
For a hash table, collisions happen frequently when there are many entries.
The idea behind this algorithm is that although collisions are common in a hash table, full-hash-value collisions are very rare; we can usually identify a key by its full hash value.
The data flow is efficient, but the beginning of it, the data source, may not be. The data source is a public API, RDD[Row].