2. AGENDA
1. Introduction to RAPIDS Accelerator for Spark
2. RAPIDS Accelerator for Spark Performance
3. GPU Acceleration combined with Alluxio
3. GROWTH IN REQUIREMENT FOR DATA PROCESSING
[Timeline figure, 2000 to 2030: Hadoop Era, Spark Era (Spark 2.0 on CPUs), Spark GPU Era (GPU Accelerated Spark 3.0)]
“These contributions lead to faster data pipelines,
model training and scoring for more breakthroughs and
insights with Apache Spark 3.0 and Databricks.”
Matei Zaharia, creator of Apache Spark and chief
technologist at Databricks
4. SPARK 3.0 ON NVIDIA GPUs
Accelerate data science pipelines without code changes
Faster Execution Time
• Accelerate data preparation
• Quickly move to next stages of the pipeline
• Focus on most-critical activities
Streamline Analytics to AI
• Orchestrate end-to-end pipelines
• From ETL to model training to visualization
• Same infrastructure for Spark and ML/DL frameworks
Reduced Infrastructure Costs
• Complete jobs faster with less hardware
• Save on-prem and in the cloud
• Do more with less
5. NVIDIA BRINGS GPU ACCELERATION TO APACHE SPARK
Seamless integration with Spark 3.0
Features
• Use existing (unmodified) customer code
• Spark features that are not GPU enabled run transparently on the CPU
Initial Release - GPU Acceleration of:
• Spark DataFrames
• Spark SQL
• ML/DL training frameworks
6. RAPIDS ACCELERATOR FOR APACHE SPARK
DISTRIBUTED SCALE-OUT SPARK APPLICATIONS
APACHE SPARK CORE
RAPIDS Accelerator for Spark
Spark SQL API / DataFrame API
    if gpu_enabled(operation, data_type)
        call-out to RAPIDS
    else
        execute standard Spark operation
    JNI bindings: Mapping From Java/Scala to C++
    RAPIDS libcudf (C++ Libraries)
    CUDA
Spark Shuffle
    ● Custom Implementation of Spark Shuffle
    ● Optimized to use RDMA and GPU-to-GPU direct communication
    JNI bindings: Mapping From Java/Scala to C++
    UCX Libraries
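The per-operation fallback in the diagram can be sketched in a few lines of Python. This is only an illustration of the idea: the support table and the `plan` helper are hypothetical, and the real plugin makes this decision inside Spark's query planner rather than with a lookup like this.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical support table: (operation, data type) pairs that have a GPU
# implementation. These entries are illustrative, not the plugin's real list.
GPU_SUPPORTED: FrozenSet[Tuple[str, str]] = frozenset({
    ("sort", "int"),
    ("sort", "string"),
    ("sum", "int"),
    ("sum", "double"),
})


def gpu_enabled(operation: str, data_type: str) -> bool:
    """Return True when the operation/type pair has a GPU implementation."""
    return (operation, data_type) in GPU_SUPPORTED


def plan(steps: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Tag each (operation, data_type) step with the device it would run on.

    Unsupported steps fall back to the CPU instead of failing, which is why
    existing code can run unmodified.
    """
    return [
        (op, dtype, "GPU" if gpu_enabled(op, dtype) else "CPU")
        for op, dtype in steps
    ]


for op, dtype, device in plan([("sort", "int"), ("sum", "decimal")]):
    print(f"{op}({dtype}) -> {device}")
```

A query mixing supported and unsupported steps still completes; only the unsupported steps stay on the CPU.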
7. GAME-CHANGING PERFORMANCE GAINS
7x Performance Boost
90% Cost Savings
on Databricks
Opens new possibilities for AI-driven services in Adobe Experience Cloud
“We’re seeing significantly faster performance with
NVIDIA-accelerated Spark 3.0 compared to running
Spark on CPUs. With these game-changing GPU
performance gains, new possibilities open up for
enhancing AI-driven features in our full suite of
Adobe Experience Cloud apps.”
— William Yan, Senior Director of Machine Learning at Adobe
8. RAPIDS ACCELERATOR ECOSYSTEM MOMENTUM
• Databricks Machine Learning Runtime: Available Now
• Google Cloud Dataproc: Available Now
• Apache Spark 3.0 Community Release: Available Now
• Amazon EMR: Available Now
• Cloudera CDP: Available in Jun’21
9. Benchmark Environment – EGX
EGX / NVIDIA Certified OEM servers
Nodes: 8
CPU: 2 x AMD EPYC 7452 (64 cores/128 threads)
GPU: 2 x NVIDIA Ampere A100, PCIe, 250W, 40GB
RAM: 0.5 TB
Storage: 4 x 7.68 TB Gen4 U.2 NVMe
Networking: 1 x Mellanox CX-6 Single Port HDR100 QSFP56
Cost w/o GPU: ~$42,000 per w/ bulk discount
Cost w/ GPU: ~$71,000 per w/ bulk discount
Software: HDFS (Hadoop 3.2.1), Spark 3.0.2 (standalone)
10. SPARK SQL QUERIES – EGX CLUSTER
Based on 97 queries from the NVIDIA Decision Support (NDS) benchmark (3 TB dataset without decimals)*
The GPU is 3.21x faster, with a cost ratio of 0.52 (the GPU run cost 52% as much as the CPU run).
Queries 14b and 72 were removed because of failures.
*The NVIDIA Decision Support (NDS) benchmark is derived from the TPC-DS benchmark and is used for internal performance testing. Results from NDS are not comparable to TPC-DS.
11. UCX ON VS OFF
GPU + UCX shuffle is 1.23x faster than the GPU alone.
Queries 67 and 72 were removed because of failures.
12. SPARK SQL QUERIES – EGX CLUSTER
GPU + UCX Shuffle is 4.15x faster than the CPU, with a cost ratio of 0.41.
Queries 14a, 67, and 72 were removed because of failures.
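The slides do not say how the cost ratio is computed. One plausible reading, assumed here, is that cost scales with hardware price times runtime, so the ratio is the price ratio divided by the speedup. The sketch below applies that assumption to the approximate prices from the benchmark environment slide; the function name is illustrative.

```python
def cost_ratio(cpu_cost: float, gpu_cost: float, speedup: float) -> float:
    """Relative cost of the GPU run versus the CPU run of the same job.

    Assumption (not stated in the slides): cost = hardware price x runtime,
    so ratio = (gpu_cost / cpu_cost) / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (gpu_cost / cpu_cost) / speedup


# Approximate prices from the benchmark environment slide.
COST_WITHOUT_GPU = 42_000
COST_WITH_GPU = 71_000

for label, speedup in [("GPU vs CPU", 3.21), ("GPU + UCX vs CPU", 4.15)]:
    ratio = cost_ratio(COST_WITHOUT_GPU, COST_WITH_GPU, speedup)
    print(f"{label}: cost ratio {ratio:.2f}")
```

With the rounded prices this gives roughly 0.53 and 0.41, within about a percentage point of the reported 0.52 and 0.41, which suggests the assumption is close to what was actually used.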
14. WHY SOME GPU QUERIES FAILED
GPU Memory Limitations/Spilling
Operation: Sort
Problem: In cases of data skew, the amount of data being sorted can exceed limits of the hardware/software.
Solution: A modified external batch sort is implemented in the working branch for the 0.5 release.
Operation: Window*
Problem: In cases of data skew the amount of data in the window operation can exceed the limits of the hardware/software.
Solution: Implement a chunked rank optimization. GitHub issue #1859
Operation: Join
Problem: Worst case join output row count is left.rows * right.rows
Solution: Materialize the output of a join in chunks. GitHub issue #1629
Solution: Have conditional filters run as a part of the join. GitHub issue #288
* Actually, this is for rank which we don’t support yet but plan to in the 0.5 release
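To make the join row concrete, here is a minimal pure-Python sketch of the two ideas listed for it: emit the join output in bounded chunks, and apply the filter condition while joining rather than afterwards. It is not the plugin's implementation (which works on GPU columnar batches); the function names and the `condition` parameter are illustrative.

```python
from itertools import islice, product
from typing import Callable, Iterator, List, Optional, Sequence, Tuple


def worst_case_rows(left_rows: int, right_rows: int) -> int:
    """Upper bound on join output size: every left row matches every right row."""
    return left_rows * right_rows


def chunked_join(
    left: Sequence,
    right: Sequence,
    chunk_rows: int,
    condition: Optional[Callable[[object, object], bool]] = None,
) -> Iterator[List[Tuple]]:
    """Yield joined row pairs in lists of at most chunk_rows.

    The pair stream is generated lazily, so only one output chunk is held
    at a time. If a condition is given, it is applied to each pair before
    chunking, so rejected pairs never occupy space in a chunk.
    """
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    pairs = product(left, right)
    if condition is not None:
        pairs = (pair for pair in pairs if condition(*pair))
    while True:
        chunk = list(islice(pairs, chunk_rows))
        if not chunk:
            return
        yield chunk


left, right = [1, 2, 3], [2, 3, 4]
print(worst_case_rows(len(left), len(right)))
for chunk in chunked_join(left, right, chunk_rows=4, condition=lambda a, b: a < b):
    print(chunk)
```

Without the condition, the chunks together hold exactly left.rows * right.rows pairs, but memory use stays bounded by the chunk size.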
15. WHY IS THE GPU SLOWER FOR SOME QUERIES?
• Failed Queries
• Small Data Sizes (spark.sql.adaptive.advisoryPartitionSizeInBytes=1G)
    • Q28, Q44, and Q67
• Less computation overlap (spark.rapids.sql.concurrentGpuTasks=1)
• Host/Device Memory Transfers
    • All of them
• Cache Consistency on Reductions/Very Small Aggregate Results
    • Q88
• Lack of GPU support, with much less CPU parallelism to fall back on
    • Q44, Q49, and Q67
16. ALLUXIO CONFIGURATION
- Co-locate the Alluxio worker nodes with Spark worker nodes to ensure short-circuit reads and writes.
- Size the cache according to the working set.
- Choose the right cache medium (SSD or system memory).
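As a rough illustration of the sizing advice, the sketch below splits a working set evenly across co-located workers and picks a medium based on whether that share fits in spare memory. The even split, the function names, and the 128 GB spare-memory figure are assumptions for illustration only, not Alluxio guidance.

```python
def per_worker_cache_gb(working_set_gb: float, num_workers: int) -> float:
    """Cache each worker needs if the working set is spread evenly.

    Assumption: one Alluxio worker per Spark worker node, even distribution.
    """
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return working_set_gb / num_workers


def choose_cache_medium(cache_gb: float, spare_memory_gb: float) -> str:
    """Pick system memory when the cache fits in spare RAM, otherwise SSD."""
    return "system memory" if cache_gb <= spare_memory_gb else "SSD"


# Example: treat the whole 3 TB benchmark dataset as the working set on the
# 8-node cluster; 128 GB of spare RAM per node is a hypothetical figure.
share = per_worker_cache_gb(3000, 8)
print(f"{share:.0f} GB per worker -> {choose_cache_medium(share, 128)}")
```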
17. RAPIDS ACCELERATOR CONFIGURATION
spark.rapids.sql.enabled is the master switch for GPU acceleration
spark.rapids.sql.explain enables logging of operations that are not accelerated
- Set to NOT_ON_GPU to print only incompatible ops
spark.rapids.sql.concurrentGpuTasks controls the number of concurrent tasks per GPU
- Set to a value between 2 and 4, with 2 typically providing the most benefit
spark.rapids.memory.pinnedPool.size significantly improves performance of data transfers between the GPU and host memory
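The settings above can be collected into `--conf` arguments for spark-submit, sketched below. Two things are not from this slide: the `spark.plugins=com.nvidia.spark.SQLPlugin` entry, which is how the plugin is loaded, and the 2G pinned pool size, which is a placeholder to tune for your hosts. The helper names are illustrative.

```python
from typing import Dict, List


def rapids_conf(concurrent_gpu_tasks: int = 2, pinned_pool_size: str = "2G") -> Dict[str, str]:
    """Build the Spark settings discussed on the configuration slide.

    pinned_pool_size defaults to a placeholder value (an assumption);
    size it for the host memory actually available.
    """
    if concurrent_gpu_tasks < 1:
        raise ValueError("concurrent_gpu_tasks must be at least 1")
    return {
        # Loads the RAPIDS Accelerator plugin (not listed on the slide).
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Log only the operations that could not be placed on the GPU.
        "spark.rapids.sql.explain": "NOT_ON_GPU",
        "spark.rapids.sql.concurrentGpuTasks": str(concurrent_gpu_tasks),
        "spark.rapids.memory.pinnedPool.size": pinned_pool_size,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a settings dict into spark-submit '--conf key=value' arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(["spark-submit"] + to_submit_args(rapids_conf())))
```

Starting with NOT_ON_GPU logging makes it easy to see which parts of an existing job fall back to the CPU before tuning the concurrency and pinned pool values.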
18. WILL MY SPARK WORKLOAD ACCELERATE WITHOUT CHANGES?
If I know my Spark workload characteristics...
Data Pipeline Use Cases
Accelerates Well on GPUs:
● Data Mining, Analytics and BI
● Batch processing and writing large datasets to a Data Warehouse
● Data extraction, aggregation and feature preparation for ML Training & Inference
Not for GPUs:
● Real-time Streaming Analytics/AI pipeline
● Online Transaction Processing (OLTP)
● Data Pipeline with custom code
Technical Characteristics
Accelerates Well on GPUs:
● Batch processing of GB+ data sets
● Parquet, ORC, CSV data formats
● HDFS, S3-compatible, or V2 data sources
● DataFrame/SQL (join, agg, sort, window), Selected Hive & Scala UDFs
Not for GPUs:
● Stream processing
● Spark RDD, MLLib, Dataset, GraphX, Streaming libraries
If I am unsure...
Use the Log-Analysis Tool
● Review Spark history logs from existing CPU jobs
● Understand how much of the workloads could execute on GPUs
● Get tips on optional code optimizations for GPUs
[Figure: Apache Spark component stack (Apache Spark Core, Catalyst Query Optimizer, Spark Streaming, Spark SQL, Spark DataFrames, Spark Datasets, RDD, Spark Shuffle, Spark MLLib, GraphX), with each component marked as CPU Only, GPU Aware, GPU Accelerated, or Partially GPU Accelerated]
19. Summary
RAPIDS Accelerator for Spark unlocks GPU acceleration for Spark DataFrames, Spark SQL, and ML/DL frameworks such as XGBoost (with more coming).
Alluxio is a high-performance data orchestration system for GPU compute.
Spark & GPUs on Alluxio optimizes for performance and cost on cloud-scale datasets.
Spark 3 & GPUs on Databricks, EMR, and Dataproc are available today.
Try it yourself:
https://nvidia.github.io/spark-rapids/Getting-Started/
Developer Blog:
Accelerating Analytics and AI with Alluxio and NVIDIA GPUs
GTC Talks:
Enabling Data Orchestration with RAPIDS Accelerator [S32746]
Accelerating Apache Spark Shuffle with UCX [S31822]
Tuning GPU Network and Memory Usage in Apache Spark [S31566]
Running Large-Scale ETL Benchmarks with GPU-Accelerated Apache Spark [S31846]
... and more