In this session we present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA's highly parallel computing capability to accelerate Spark SQL queries and, because FPGAs are more power-efficient than CPUs, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units that perform basic substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we decompose a complex SQL query into basic operations and feed each one into an Engine Unit according to its pattern. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present performance benchmark results comparing queries run with the FPGA-based Spark SQL acceleration architecture on Xeon E5 plus FPGA against the same queries run with Spark SQL on Xeon E5 alone, with 10X ~ 100X improvement, and we will demonstrate one SQL query workload from a real customer.
6. What is an FPGA?
• Field Programmable Gate Array
‒ Configurable Logic Blocks (CLB)
‒ Embedded Memory
‒ Digital signal processing (DSP) blocks
‒ I/O pads
‒ Hard IP (PCIe, DDR, GigE, etc.)
7. Why FPGA?
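[Figure: a required function y = f(a, b, c) implemented by a programmed LUT; the inputs a, b, c select one stored bit through a MUX to produce y]
Truth Table (the y column is the content of the programmed LUT)
a b c y
0 0 0 1
0 0 1 0
0 1 0 1
0 1 1 1
1 0 0 1
1 0 1 0
1 1 0 1
1 1 1 1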
‒ Reconfigurable architecture
A CLB consists of LUTs. A LUT is a RAM with a data width of 1 bit.
Its contents are programmed at power-up.
‒ Low power and high energy efficiency compared with CPU/GPU
‒ Extreme degree of customization; well positioned for high performance while providing flexibility
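To make the LUT idea concrete, here is a minimal Scala sketch that models the 3-input LUT from the truth table as an 8-entry, 1-bit-wide memory addressed by a, b, c. The names `LutSketch`, `lut`, and `eval` are illustrative assumptions and do not come from any FPGA toolchain; the stored bits are the y column of the truth table.

```scala
object LutSketch {
  // Contents of the programmed LUT: y for abc = 000, 001, ..., 111 (from the truth table).
  val lut: Array[Int] = Array(1, 0, 1, 1, 1, 0, 1, 1)

  // The inputs form the read address (a is assumed to be the most significant bit);
  // the MUX simply selects the stored bit at that address.
  def eval(a: Int, b: Int, c: Int): Int = lut((a << 2) | (b << 1) | c)
}
```

For example, `LutSketch.eval(1, 0, 1)` reads address 5 and returns 0. Writing different bits into `lut` changes the implemented function without changing the structure, which is the sense in which the fabric is reconfigurable.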
11. End User Programming Interfaces
[Figure: block diagram of a CPU and an FPGA, with UPI/PCIe and HSSI links.
CPU: User Application; FPGA Runtime Software (Accelerator Abstraction Layer)
FPGA: Infrastructure IP (UPI, PCIe*, HSSI, FPGA Management); FPGA IP (Acceleration Function Unit)
Intel-provided infrastructure: FPGA Runtime Software, Infrastructure IP
User-developed, application-specific functions: User Application, FPGA IP
Interfaces: user software interface, core cache interface
Legend: highlighted = new blocks that simplify code development]
Intel® Confidential
12. Traditional FPGA Development Approach
[Figure: two development flows. In both, the application runs on the CPU on top of the AAL software, and the FPGA holds a blue bitstream plus a green (AFU) bitstream.
HDL Programming: C → SW Compiler → exe; HDL → Syn. / PAR → AFU Bitstream; AFU Simulation Environment (ASE) from Intel; AAL from Intel
OpenCL Programming: Host → SW Compiler → exe; Kernels → OpenCL Compiler → AFU Bitstream; OpenCL Emulator; Altera® Quartus Prime Pro; OpenCL BSP]
14. Workload Introduction
The test case comes from a customer. It uses a SQL query to compute accounting summaries by
USER_ID over a big table. The SQL query contains heavy expression evaluation.
Accounting Big Table:
TIME_ID MBUSER_ID OPER_TID SUM_TIMES CHARGE1 …
20140407 2700007679977 5B013363363w 3 0 …
20140407 2704012998344 31011G13iG0 48 57180 …
20140407 2704040114238 31Q11512ZT0 1 180 …
20140407 2700007012466 31011G13iG0 8 52320 …
20140407 2700001523491 1T0311G80610ydH10G00 2 0 …
20140407 2700000765632 310103015G0 1 30 …
20140407 2700007800325 4562210021 1 0 …
…
1.6x10^8 rows, 38 columns
The SQL queries summarize customers' consumption characteristics using billing data.
5 GB in Parquet format stored on HDFS, 160 million rows.
15. Workload code snippet
Function Count
Max 13
Sum 155
Substr 329
Case 133
Implicit Data type cast (String to Double) n/a
Total 630
// Prepare
val parquet = spark.read.parquet("/mnt/nvme/inputParquet/")
parquet.createOrReplaceTempView("inputTable")
// Query
A very long SQL statement that makes intensive use of built-in functions.
16. SQL Query Physical Execution Plan
The plan has two stages separated by a shuffle (data crosses the network). The map stage contains the file scan, projection, and
partial aggregation, while the reduce stage performs further aggregation by merging the partial aggregation results.
Stage 1 (Map)
• File Scan: read data from the source.
• Projection: expression evaluation consumes most CPU cycles.
• Partial Aggregation: aggregate per partition.
Shuffle
Stage 2 (Reduce)
• Full Aggregation: a tiny task that consumes few CPU cycles.
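The two-stage plan can be sketched with plain Scala collections, without Spark. This is only an illustration of partial aggregation per partition followed by a merge; `TwoStageAggregation`, `partialSum`, and `merge` are hypothetical names, and the (key, amount) pairs stand in for a user id and a charge column.

```scala
object TwoStageAggregation {
  // Stage 1 (map): aggregate each partition independently into key -> partial sum.
  def partialSum(partition: Seq[(Long, Long)]): Map[Long, Long] =
    partition.groupBy(_._1).map { case (key, rows) => key -> rows.map(_._2).sum }

  // Stage 2 (reduce): after the shuffle, merge the partial sums that share a key.
  def merge(partials: Seq[Map[Long, Long]]): Map[Long, Long] =
    partials.flatten.groupBy(_._1).map { case (key, sums) => key -> sums.map(_._2).sum }
}
```

The reduce side only sees one small map per partition rather than every row, which is consistent with the slide's remark that the full aggregation is a tiny task.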
18. Benchmark H/W Setup
Profiling and performance evaluation were done on a single server.
• MCP (Skylake-FPGA Multi-Chip Package)
o CPU
Intel Xeon Skylake-P, 2 sockets x 14 cores @ 2.8 GHz, 56 hyper-threads
o FPGA
1 x Arria 10 GX, 427,200 ALMs, 8 MB RAM (10AX115U3F45E2SG)
o DMA Channels
1 x UPI (80 Gbps)
• Memory
384 GB, DDR4 @ 2133 MHz
• Disk
1 x Intel SSD P3700, 1.6 TB, SR: 2800 MB/s, SW: 1900 MB/s, RR: 450K IOPS, RW: 150K IOPS
19. Baseline Profile - CPU, The Bottleneck
• PAT (Performance Analysis Tool) shows that the CPU is heavily utilized (54 of 56 virtual cores are assigned to
Spark). The total query execution time is 85 seconds.
• The reduce stage does very simple aggregation and takes little CPU time (~1s).
Note: Measurement starts from the 2nd run (the 1st run warms up the Linux file system cache), so there is generally no disk
access.
*For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
20. Baseline Profile - CPU, The Bottleneck, Contd.
• The VisualVM CPU breakdown of a map task shows that projection consumes 66.7% of CPU time.
23. Arch Overview - S/W Stack
[Figure: software stack across three layers (JVM, Native, HW). Components: Spark; Spark FPGA Adaptor (InternalRow to FPGA Batch, FPGA Batch to InternalRow, FPGA Project); FPGA Java Wrapper; Java Native Interface (JNI); FPGA Driver (Pattern Configure, DMA Configure, Huge Page Memory Pool, Computation Starter, Monitor); Accelerator Abstraction Layer (AAL); FPGA with DMA RX/TX (RTL) and SQL Engine (RTL)]
• Spark FPGA Adaptor
‒ Identifies the expressions in the projection and exports them as FPGA SQL engine instructions
‒ Converts data between Spark Internal Rows and FPGA Batches
• FPGA Driver
‒ Configures the SQL Engine patterns according to the instructions from the Spark FPGA Adaptor
‒ Triggers the FPGA computation and collects the results
‒ Manages huge-page memory
‒ Configures the DMA channel between main memory and the FPGA
• AAL
‒ FPGA runtime library
‒ Low-level API for the FPGA Driver
• FPGA SQL Engine (RTL)
‒ Configurable SQL expression pattern units
‒ DMA RX: the FPGA reads input data from main memory
‒ DMA TX: the FPGA writes results to main memory
24. Arch Overview - S/W Stack, Contd.
[Figure: data flow and control flow between the Spark FPGA Adaptor (InternalRow to FPGA Batch, FPGA Batch to InternalRow, FPGA Project) and the FPGA Java Wrapper, in the following steps]
1. Get HugePage wrapped in DirectByteBuffer
2. Data Conversion (Internal Rows to FPGAInputBatch)
3. Engine Configuration, Start
4. Input for Computation
5. Collect computation result
6. Data Conversion (FPGAOutputBatch to Internal Rows)
7. Free HugePage wrapped in DirectByteBuffer
[Figure: memory layouts. Internal Row: null bit set (1 bit/field), values (8 bytes/field), variable length portion. FPGA Input Batch and FPGA Output Batch: fixed-width fields such as 4 bytes (TIME_ID), 8 bytes (MBUSER_ID), 12 bytes (ACC_NBR), ……, plus padding (64 bytes and 4 bytes in the figure) for 4xCL alignment]
• Internal Row
Spark's representation of one record; flexible enough to represent fixed- and variable-length fields.
• FPGA Input Batch
For memory and computation efficiency, fields are placed sequentially in physical memory.
• FPGA Output Batch
Similar to the FPGA Input Batch.
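As a rough illustration of the row-to-batch conversion, the Scala sketch below packs fixed-width fields back to back into a direct `ByteBuffer`. Only the field widths (4, 8, and 12 bytes) come from the slide. The `Record` type, the little-endian byte order, the reading of "4xCL" as four 64-byte cache lines, and padding the whole batch rather than each row are all assumptions, and an ordinary direct buffer stands in for the huge-page buffer.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical record type; field names and widths follow the slide's figure.
final case class Record(timeId: Int, mbUserId: Long, accNbr: String)

object BatchLayout {
  val AccNbrBytes = 12                  // 12 bytes (ACC_NBR)
  val RowBytes    = 4 + 8 + AccNbrBytes // TIME_ID + MBUSER_ID + ACC_NBR
  val Alignment   = 4 * 64              // ASSUMPTION: "4xCL" = 4 cache lines of 64 bytes

  // Round n up to the next multiple of Alignment.
  def aligned(n: Int): Int = (n + Alignment - 1) / Alignment * Alignment

  // Write all rows sequentially; the tail of the buffer is zero padding.
  def pack(rows: Seq[Record]): ByteBuffer = {
    val buf = ByteBuffer.allocateDirect(aligned(rows.size * RowBytes))
    buf.order(ByteOrder.LITTLE_ENDIAN) // ASSUMPTION: byte order expected by the consumer
    rows.foreach { r =>
      buf.putInt(r.timeId)
      buf.putLong(r.mbUserId)
      val raw = r.accNbr.getBytes(StandardCharsets.US_ASCII)
      buf.put(java.util.Arrays.copyOf(raw, AccNbrBytes)) // truncate or zero-pad to 12 bytes
    }
    buf.rewind()
    buf
  }

  // Read `count` rows back using absolute offsets.
  def unpack(buf: ByteBuffer, count: Int): Seq[Record] =
    (0 until count).map { i =>
      val base = i * RowBytes
      val acc  = Array.tabulate[Byte](AccNbrBytes)(j => buf.get(base + 12 + j))
      val text = new String(acc, StandardCharsets.US_ASCII).takeWhile(_ != 0.toChar)
      Record(buf.getInt(base), buf.getLong(base + 4), text)
    }
}
```

With one row, `BatchLayout.pack` allocates 256 bytes: 24 bytes of data followed by padding. Keeping every field at a fixed offset is what lets a hardware reader locate a field without parsing the row.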
25. Arch Overview - Engine Pipeline, Data Flow
[Figure: on the CPU side, Spark and the FPGA Adapter & Driver read from the Data Source and fill the Input Buffers; on the FPGA side, DMA RX feeds several parallel pipelines of chained Engine Units (169 Levels Pipeline) and DMA TX fills the Output Buffers. Data flow and control flow (Pattern Configure, Computation Control) are drawn separately]
• Engine Pipeline
The Spark FPGA SQL Engine is designed as pipelines of Engine Units. Each Engine Unit performs a single computation; different Engine Units are
assembled together (configured by Spark) to perform a complex computation and operate as a pipeline. Many pipelines (say N
pipelines) can be constructed to perform N computations in parallel, so that N records can be consumed in a single FPGA cycle.
• Data Flow
Spark pulls data from the Data Source, converts it into the format the FPGA requires, and puts it into the InputBuffer array.
The FPGA then reads the input data via DMA RX and feeds it into the Engine Pipelines. The results of the Engine Pipelines are written into the OutBuffer
array via DMA TX. Finally, Spark converts the data back into the format Spark SQL needs.
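The chaining idea can be modelled in software as function composition. The Scala sketch below is a simplified analogy rather than the RTL design: each hypothetical Engine Unit is a single-step function, a pipeline is their composition, and `run` simulates N parallel pipelines by consuming up to N records per simulated cycle. All names here are assumptions made for illustration.

```scala
object EnginePipelineSketch {
  // Hypothetical Engine Units, each performing one basic operation.
  def substr(start: Int, len: Int): String => String =
    s => s.slice(start - 1, start - 1 + len) // SQL-style 1-based start position
  def isIn(values: Set[String]): String => Boolean =
    s => values.contains(s)
  val toFlag: Boolean => Int = b => if (b) 1 else 0

  // Chaining units yields one pipeline: the output of a unit feeds the next unit.
  val pipeline: String => Int =
    substr(1, 1).andThen(isIn(Set("1", "7"))).andThen(toFlag)

  // Simulate n parallel pipelines: each "cycle" consumes up to n records.
  // Returns the results in input order and the number of simulated cycles.
  def run(records: Seq[String], n: Int): (Seq[Int], Int) = {
    val cycles = records.grouped(n).toSeq
    (cycles.flatMap(_.map(pipeline)), cycles.size)
  }
}
```

Running five records through two pipelines takes three simulated cycles; doubling the number of pipelines roughly halves the cycle count, which is the throughput argument made in the Engine Pipeline bullet.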
26. Arch Overview – SQL Engine Micro Architecture
• Every SQL Expression Evaluation engine is configurable.
• Each engine contains at most four pattern engines. Input data is fed into the pattern engines in parallel,
and the final result is the combination of the pattern engine results.
Pattern Engine 1 is configured to evaluate the SQL expression Substr(oper_tid, 1, 1) IN (‘1’, ‘7’)
Pattern Engine 2 is configured to evaluate the SQL expression Substr(oper_tid, 2, 1) IN (‘o’)
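A small Scala model of one such engine is sketched below. The slide does not say how the pattern engine results are combined, so the combiner is left as a parameter, and using AND or OR in the usage note is purely an assumption. `Pattern` and `SqlEngine` are hypothetical names.

```scala
object PatternEngineSketch {
  // Hypothetical pattern configuration: Substr(input, start, len) IN (values).
  final case class Pattern(start: Int, len: Int, values: Set[String]) {
    def eval(input: String): Boolean =
      values.contains(input.slice(start - 1, start - 1 + len)) // 1-based start
  }

  // One engine holds at most four pattern engines (per the slide).
  // `combine` is an ASSUMPTION: the slide does not specify the combination rule.
  final case class SqlEngine(patterns: Seq[Pattern], combine: Seq[Boolean] => Boolean) {
    require(patterns.size <= 4, "an engine contains at most four pattern engines")

    // Every pattern sees the same input; the results are then combined.
    def eval(input: String): Boolean = combine(patterns.map(_.eval(input)))
  }
}
```

With `Pattern(1, 1, Set("1", "7"))` and `Pattern(2, 1, Set("o"))`, an engine built with `_.forall(identity)` accepts only values such as `"1o..."`, while one built with `_.exists(identity)` accepts any value matching either pattern.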
28. Performance Comparison - FPGA vs Baseline
• The FPGA-accelerated version significantly reduced the total execution time in the end-to-end benchmark, from 86
seconds (baseline) to 44 seconds.
Speedup Ratio: 86s/44s => ~2X
[Figure: execution time. Baseline: 86s; FPGA: 44s]
29. Performance Comparison - FPGA vs Baseline, Contd.
• The FPGA-accelerated version reduced the CPU time spent on expression evaluation in the Map stage
from 66.7% (baseline) to less than 6.6%.
[Figure: CPU breakdown. Projection in baseline: 66.7%; projection with FPGA: less than 6.6%]
31. Future Work
• Fully Configurable FPGA SQL Acceleration Engine
‒ In this PoC, we identified the SQL expression patterns manually in
the frontend and configured them into the FPGA SQL Engine units at
runtime. However, the FPGA SQL engines are limited to
some typical expression patterns, and arbitrary SQL
expression combinations are not supported yet.
• More Operators Support
‒ SQL expression evaluation in projection is the first step; other
typical operators such as Aggregation/Sort/Join can probably also
be offloaded to the FPGA.
• The CPU can also perform expression evaluation when FPGA
resources are fully occupied.
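One possible shape for the CPU fallback is sketched below in Scala. This is not part of the PoC. The `Dispatcher` class, the slot-counting model, and the two evaluation functions are hypothetical and only show the control flow: use the FPGA when a slot is free, otherwise evaluate on the CPU.

```scala
import java.util.concurrent.Semaphore

// Hypothetical dispatcher; `fpgaSlots` models how many FPGA computations may run at once.
final class Dispatcher[A, B](fpgaSlots: Int, onFpga: A => B, onCpu: A => B) {
  private val slots = new Semaphore(fpgaSlots)

  // Returns the result together with the name of the device that produced it.
  def evaluate(input: A): (B, String) =
    if (slots.tryAcquire()) {
      // A slot was free: run on the FPGA path and always give the slot back.
      try { (onFpga(input), "fpga") }
      finally { slots.release() }
    } else {
      // FPGA fully occupied: fall back to CPU evaluation without blocking.
      (onCpu(input), "cpu")
    }
}
```

`tryAcquire()` never blocks, so a busy FPGA does not stall the Spark task. Both paths must compute the same expression for the fallback to be transparent to the query.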