GPU&SSD Accelerates PostgreSQL
- Challenge towards query processing throughput 10GB/s -
HeteroDB,Inc
Chief Architect & President
KaiGai Kohei <kaigai@heterodb.com>
About HeteroDB
▌Corporate Profile
 Name: HeteroDB,Inc
 Location: Shinagawa, Tokyo, Japan
 Established: 4th-Jul-2017
 Businesses: Development & sales of GPU/SSD-accelerated SQL database products;
consulting service for SQL tuning with GPU
 Stock share: 100% held by the founders
▌Founder’s Profile
KaiGai Kohei (Chief Architect & co-founder; President)
He has more than 10 years of experience developing the Linux kernel and the PostgreSQL
database in their developer communities, contributing various major features such as
SELinux support in PostgreSQL, update/join support for FDW, the custom-scan interface, and so on.
In 2012 he launched the PG-Strom project, which transparently accelerates SQL workloads
using GPUs; since then he has focused on analytic SQL acceleration with a heterogeneous
computing architecture.
Awards: “Genius Programmer” by METI/IPA (2006); OSS Contributor award by OSSPF (2014).
Kashiwagi Takehiko (Chief Sales Engineer & co-founder; Vice-president)
He was a researcher at NEC laboratories, working on networking technology, in-memory columnar
storage, and GPGPU acceleration of database workloads. He also worked with KaiGai as a
PG-Strom project member when both were NEC employees.
He is now responsible for market / customer development at HeteroDB, in addition to running
his own networking-solutions company (Parallel-Networks,Inc).
Our core technology – PG-Strom
▌Features
 Utilization of the GPU’s computing capability to off-load SQL workloads; parallel execution by thousands of cores.
 Auto-generation of GPU programs from the supplied SQL, with just-in-time compilation; enables transparent acceleration.
 Open Source Software
▌Functionalities
 GPU parallel execution for WHERE-clause, JOIN, GROUP BY
 User-defined GPU functions (PL/CUDA)
▌Target usage
 Batch, summary / reporting
 In-database analytics / machine-learning
[Diagram] PG-Strom: an extension of PostgreSQL for acceleration by GPU. It plugs in
between the query optimizer and the query executor, alongside the SQL parser and
storage manager; the application issues the same SQL query over the same data schema.
GPU Characteristics
An ultra-parallel computing processor with thousands of cores and hundreds of GB/s of memory bandwidth on a single chip.
CPU: like a passenger car; easy to use, but with limited transportation capacity.
GPU: like a high-speed railway; a little troublesome to get on and off, but capable of mass transportation.

Model                Intel Xeon E5-2699v4     NVIDIA Tesla P40
# of transistors     7.2 billion              12 billion
# of cores           22 (functional)          3840 (simple)
Performance (FP32)   1.2 TFLOPS (with AVX2)   12 TFLOPS
Memory size          max 1.5TB (DDR4)         24GB (GDDR5)
Memory bandwidth     76.8GB/s                 347GB/s
TDP                  145W                     250W
Example of GPU calculation – Reduction algorithm
[Diagram] Calculation of the total sum of an array by GPU: Σi=0...N-1 item[i].
Pairs of elements of item[0]...item[15] are added in parallel step by step
(step.1 ... step.4), so the total sum of items[] is obtained in log2(N) steps,
using hardware-supported inter-core synchronization.
The same logic is used internally to implement aggregate functions:
SELECT count(X), sum(Y), avg(Z) FROM my_table;
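Below is a minimal, self-contained sketch of such a tree-based reduction in CUDA C;
it illustrates the algorithm only and is not PG-Strom's actual kernel (all names are ours).

#include <cstdio>
#include <cuda_runtime.h>

/* Tree-based reduction: each step halves the number of active threads,
 * so N items are summed in log2(N) steps (blockDim.x must be a power of 2). */
__global__ void block_sum(const float *item, float *result, int n)
{
    extern __shared__ float buf[];
    int tid = threadIdx.x;

    buf[tid] = (tid < n) ? item[tid] : 0.0f;
    __syncthreads();                         /* hardware-supported barrier */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];   /* pairwise partial sums */
        __syncthreads();                     /* synchronize between steps */
    }
    if (tid == 0)
        *result = buf[0];                    /* thread 0 holds the total */
}

int main(void)
{
    const int n = 16;
    float h_item[n], h_result, *d_item, *d_result;
    for (int i = 0; i < n; i++)
        h_item[i] = 1.0f;                    /* expected total: 16.0 */
    cudaMalloc(&d_item, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_item, h_item, n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<1, n, n * sizeof(float)>>>(d_item, d_result, n);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.1f (4 steps for 16 items)\n", h_result);
    return 0;
}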
Is SQL workload similar to image processing?
[Diagram] Image data is an int/float[] array; by transposition, a table with columns
(ID, X, Y, Z) is also a set of value arrays, so a query like
SELECT * FROM my_table WHERE X BETWEEN 40 AND 60
can be evaluated with the same kind of parallel execution.
GPU’s advantage: the same operation applied to a massive amount of data.
Problems PG-Strom is trying to solve (2015)
[Diagram: data (small) fits within physical RAM]
 Primitives of PG-Strom
✓ Heavy SQL (JOIN, GROUP BY)
✓ Data size less than physical RAM
Problems PG-Strom is trying to solve (2017)
[Diagram: two growth directions from the in-RAM case of 2015; more computing-intensive
workloads, and more I/O-intensive workloads where data (large) exceeds RAM]
 Primitives of PG-Strom (as before)
✓ Heavy SQL (JOIN, GROUP BY)
✓ Data size less than physical RAM
 PL/CUDA (for more computing-intensive workloads)
✓ In-database analytics and machine-learning
✓ Drug discovery, anomaly detection
 SSD-to-GPU Direct SQL Execution (for more I/O-intensive workloads)
✓ Fast I/O at nearly the wired speed of the hardware
✓ Large batch, reporting, and analytics
It’s the second time for me to give a talk at DB Tech Showcase!
DBTS2014: GPGPU Accelerates PostgreSQL
DBTS2017: GPU & SSD Accelerates PostgreSQL
Utilization of new devices follows the expansion of the application domain.
PG-Strom Architecture
PostgreSQL Internals
[Diagram: SQL query → SQL Parser → parse-tree → Query Optimizer → execution plan →
Query Executor → query results, on top of the Storage Manager, Transaction Manager,
and IPC (lock, latch, shmem) infrastructure]
 SQL Parser: parses the supplied SQL and transforms it into an internal data structure
(parse-tree); detects syntax errors.
 Query Optimizer: estimates costs based on the statistics and constructs the optimal
query execution plan.
 Query Executor: runs the workloads (Scan, Join, Sort, ...) according to the query
execution plan, utilizing PostgreSQL’s internal infrastructure such as the storage manager.
How PostgreSQL constructs a query execution plan (1/2)
[Plan tree] Scan t0 and Scan t1 feed Join(t0,t1); Scan t2 then feeds Join((t0,t1),t2),
followed by GROUP BY cat, ORDER BY score, and LIMIT 100.
How PostgreSQL constructs a query execution plan (2/2)
Multiple algorithms compete for each step, and the planner chooses by the “cost”.
[Diagram] Scan t0 (statistics: nrows 1.2M, width 80, index: none) joined with Scan t1
(statistics: nrows 4000, width 120, index: t1.id). Candidates for Join(t0,t1):
  HashJoin             cost=4000    (built-in execution path of PostgreSQL)
  MergeJoin            cost=12000   (built-in)
  NestLoop             cost=99999   (built-in)
  Parallel Hash Join   cost=3000    (built-in; since PostgreSQL v9.6)
  GpuJoin              cost=2500    (proposed by the extension via CustomScan,
                                     available since PostgreSQL v9.5)  ← WINNER!
Interactions between PostgreSQL and PG-Strom with CustomScan
As long as consistent results are produced, the implementation is flexible.
[Plan] GroupAgg (key: cat) sits on top of CustomScan (GpuJoin), which reads from
SeqScan on t0 and SeqScan on t1. The CustomScan node supplies callbacks:
(*BeginCustomScan)(...), (*ExecCustomScan)(...), (*EndCustomScan)(...), and so on.
ExecInitGpuJoin(...)
 Initializes the GPU context
 Kicks asynchronous JIT compilation of the auto-generated GPU program
ExecGpuJoin(...)
 Reads records from t0 and t1, and copies them to the DMA buffer
 Kicks asynchronous GPU tasks
 Fetches results from the completed GPU tasks, then passes them to the next step (GroupAgg)
ExecEndGpuJoin(...)
 Waits for completion of the asynchronous tasks (if any)
 Releases the GPU resources
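For reference, the glue on the PostgreSQL side is a table of callbacks registered
through the CustomScan API (nodes/extensible.h). A minimal sketch follows; the table
layout uses PostgreSQL's CustomExecMethods, while the GpuJoin function bodies are
placeholders, not PG-Strom's actual code.

#include "postgres.h"
#include "nodes/extensible.h"

/* Placeholder callbacks: the real ones initialize the GPU context,
 * enqueue asynchronous GPU tasks, and release GPU resources. */
static void ExecInitGpuJoin(CustomScanState *node, EState *estate, int eflags);
static TupleTableSlot *ExecGpuJoin(CustomScanState *node);
static void ExecEndGpuJoin(CustomScanState *node);
static void ExecReScanGpuJoin(CustomScanState *node);

/* Callback table that plugs "GpuJoin" into the PostgreSQL executor */
static CustomExecMethods gpujoin_exec_methods = {
    .CustomName       = "GpuJoin",
    .BeginCustomScan  = ExecInitGpuJoin,
    .ExecCustomScan   = ExecGpuJoin,
    .EndCustomScan    = ExecEndGpuJoin,
    .ReScanCustomScan = ExecReScanGpuJoin,
};

As long as ExecGpuJoin() returns the same tuples a CPU join would, the executor does
not care that the work was done on a GPU.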
SQL to GPU code auto-generation (example of a WHERE-clause)
E.g., the numeric formula in the WHERE-clause is transformed into CUDA C code on demand:
QUERY: SELECT cat, count(*), avg(x) FROM t0
        WHERE x BETWEEN y AND y + 20.0 GROUP BY cat;

STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
                    kern_data_store *kds,
                    size_t kds_index)
{
    pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
    pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
    pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
    return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                 pgfn_float8le(kcxt, KVAR_3,
                               pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
}

KPARAM_1 and KVAR_3/KVAR_4 are references to the input data (the constant 20.0 and the
columns x and y). The generated source is built by the run-time compiler, then executed
in parallel on the GPU.
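One way to implement such on-demand compilation is NVIDIA's NVRTC runtime-compilation
library; the sketch below shows the general flow under that assumption (PG-Strom's
actual build pipeline may differ), with an illustrative generated kernel.

#include <nvrtc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Source string standing in for PG-Strom's auto-generated code */
    const char *src =
        "extern \"C\" __global__ void qual_eval(const double *x, const double *y,\n"
        "                                       int *match, int n)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n)\n"
        "        match[i] = (x[i] >= y[i] && x[i] <= y[i] + 20.0);\n"
        "}\n";
    nvrtcProgram prog;
    size_t ptx_len;
    char *ptx;

    nvrtcCreateProgram(&prog, src, "qual_eval.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL);     /* just-in-time compile to PTX */
    nvrtcGetPTXSize(prog, &ptx_len);
    ptx = (char *) malloc(ptx_len);
    nvrtcGetPTX(prog, ptx);                 /* then load via cuModuleLoadData() */
    printf("generated %zu bytes of PTX\n", ptx_len);
    nvrtcDestroyProgram(&prog);
    free(ptx);
    return 0;
}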
A simple micro-benchmark
▌Test Query:
SELECT cat, count(*), avg(x)
  FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
 GROUP BY cat;
✓ t0 contains 100M rows; t1...t8 contain 100K rows each (like a star schema)

Query response time [sec] by number of tables joined:

  # of tables joined    2      3      4      5      6      7      8      9
  PostgreSQL v9.6     8.48  13.23  18.28  23.42  28.88  34.50  40.77  47.16
  PG-Strom 2.0devel   5.00   5.46   5.91   6.45   7.17   8.07   9.22  10.21

Environment: CPU Xeon E5-2650v4, GPU Tesla P40, RAM 128GB, OS CentOS 7.3,
DB PostgreSQL 9.6.2 + PG-Strom 2.0devel
Problems PG-Strom is trying to solve
[Diagram: data (small) within RAM vs. data (large) beyond it; more computing-intensive
workloads vs. more I/O-intensive workloads]
How do we move onto the I/O-intensive domain?
SSD-to-GPU Direct SQL Execution
~Re-definition of the GPU’s role~
SSD-to-GPU Direct SQL Execution
[Diagram] Large PostgreSQL tables on the NVMe-SSD are sent over the PCIe bus to the
GPU by SSD-to-GPU P2P DMA (NVMe-Strom driver); the existing data-flow (large I/O size)
would instead move the PostgreSQL data blocks through CPU/RAM.
PostgreSQL data blocks on the SSD are loaded to the GPU using peer-to-peer DMA, and
SQL workloads (WHERE-clause, JOIN, GROUP BY) are executed on the GPU to reduce the data
size prior to CPU/RAM loading. This pre-processing of records according to the SQL
reduces the amount of data to be loaded and processed by the CPU.
Element technology ① GPUDirect RDMA (1/2)
▌P2P data-transfer technology between the GPU and other PCIe devices, bypassing the CPU
 Originally designed for multi-node MPI over Infiniband
 Provides Linux kernel driver infrastructure for other PCIe devices, including NVMe-SSDs
(Figure: Copyright (c) NVIDIA corporation, 2015)
Element technology ① GPUDirect RDMA (2/2)
[Diagram] GPUDirect RDMA maps GPU device memory into the physical address space of the
host system (the PCIe BAR1 area, e.g. at 0xe0000000), alongside RAM and PCIe devices
such as the NVMe-SSD and the Infiniband HBA.
Once a “physical address of GPU device memory” appears, we can use it as the source or
destination address of DMA with PCIe devices, e.g.:
DMA Request: SRC = 1200th sector, LEN = 40 sectors, DST = 0xe0200000
Element technology ② NVMe-SSD
Major vendors are releasing SSD products that support the NVMe specification:

              Intel SSD      Intel SSD      Samsung       HGST              Seagate
              750 Series     P3608          PM1725        Ultrastar SN260   Nytro XP7200
Interface     PCIe 3.0 x4    PCIe 3.0 x8    PCIe 3.0 x8   PCIe 3.0 x8       PCIe 3.0 x16
Capacity      400GB, 800GB,  1.6TB, 3.2TB,  3.2TB, 6.4TB  1.6TB, 3.2TB,     3.8TB, 7.7TB
              1.2TB          4.0TB                        6.4TB
Form Factor   HHHL           HHHL           HHHL          HHHL              FHHL
Max SeqRead   2.5GB/s        5.0GB/s        6.0GB/s       6.1GB/s           10.0GB/s
Max SeqWrite  1.2GB/s        3.0GB/s        2.0GB/s       2.2GB/s           3.6GB/s
Rand Read     460K IOPS      850K IOPS      1000K IOPS    1200K IOPS        940K IOPS
Rand Write    290K IOPS      50K IOPS       120K IOPS     200K IOPS         160K IOPS
Warranty      5 years        5 years        5 years       5 years           5 years
MTBF          1.2M hours     1.0M hours     2.0M hours    2.0M hours        2.0M hours
Target        consumer       enterprise / data-center
Benchmark configuration (1/2)
 Server: Supermicro 5018GR-T
 CPU: Intel Xeon E5-2650v4
 RAM: 128GB (16GB ECC DDR4-2166 x8)
 GPU: NVIDIA Tesla P40
 SSD: HGST Ultrastar SN260 (7.68TB)
 HDD: SATA 1TB + 3TB x2 (7200rpm)
 OS: CentOS 7.3, CUDA 8.0 + nvidia driver 375.51, NVMe-Strom driver
 DB: PostgreSQL 9.6.4 + PG-Strom v2.0devel
 Data: Star Schema Benchmark (SF=401, 351GB)
Benchmark configuration (2/2) – Key devices

Model              HGST Ultrastar SN260
Interface          PCIe 3.0 x8 (NVMe 1.2)
Capacity           7.68TB
Form Factor        HHHL
Sequential Read    6,170MB/s
Sequential Write   2,200MB/s
Random Read        1200K IOPS
Random Write       75K IOPS

Model              NVIDIA Tesla P40
GPU Architecture   NVIDIA Pascal
Interface          PCIe 3.0 x16
Form Factor        FHFL, Dual-Slot
# of CUDA cores    3840
Performance (FP32) 12 TFLOPS
GPU Memory         24GB (GDDR5)
GPU Memory Band    346GB/s
TDP                250W

SPECIAL THANKS TO: HGST / NVIDIA
Benchmark results (1/2)
 We ran the queries Q1-Q4 below against the lineorder table (351GB).
 Throughput is defined as (351 * 1024) / (query response time [sec]).
Q1... SELECT count(*) FROM lineorder;
Q2... SELECT count(*),sum(lo_revenue),sum(lo_supplycost) FROM lineorder;
Q3... SELECT count(*) FROM lineorder GROUP BY lo_orderpriority;
Q4... SELECT count(*),sum(lo_revenue),sum(lo_supplycost)
      FROM lineorder GROUP BY lo_shipmode;
※ max_parallel_workers_per_gather is 24 for PostgreSQL v9.6, 4 for PG-Strom.

Query execution throughput [MB/s]:

                            Q1       Q2       Q3       Q4
  PostgreSQL v9.6 + SN260   889.05   859.31   875.69   842.04
  PG-Strom v2.0 + SN260     5988.0   5989.9   5988.8   5979.6
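For example, PG-Strom’s 5,988MB/s on Q1 corresponds to (351 × 1024)MB / 5988MB/s ≈ 60
seconds of query response time, against roughly 404 seconds for PostgreSQL’s 889MB/s.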
Benchmark results (2/2)
 A full scan of the 351GB table on a system with 128GB of RAM, so the storage
obviously limits the query execution performance.
 PG-Strom could pull the SSD’s sequential-read performance close to its catalog spec.
 PostgreSQL has non-negligible overhead around I/O.
[Chart: time-series storage read throughput (MB/s) during Q4, measured with iostat;
PG-Strom v2.0devel + SN260 sustains close to 6000MB/s, while PostgreSQL v9.6 + SN260
stays under 1000MB/s.]
[kaigai@saba ~]$ iostat -cdm 1
:
Device:       tps   MB_read/s   MB_wrtn/s   MB_read   MB_wrtn
sda          6.00        0.19        0.00         0         0
sdb          0.00        0.00        0.00         0         0
sdc          0.00        0.00        0.00         0         0
nvme0n1   24059.00     5928.41        0.00      5928         0
:
Considerations on the hardware configuration

① SSD and GPU are connected via a PCIe switch: OK
[CPU – PCIe switch (PLX) – SSD + GPU]
Pros: maximum performance. HeteroDB has tested this configuration; up to 9.5GB/s was
observed in raw I/O with a Xeon E5-2650v4.
Cons: only limited hardware supports this configuration.

② A single CPU/IOH is involved in the SSD/GPU path: Workable
[CPU – SSD + GPU attached directly]
Pros: easy to obtain the supported hardware.
Cons: top performance is less than the PLX configuration.

③ The SSD/GPU path traverses the QPI/HT link: Not Supported
[SSD under one CPU, GPU under the other, traffic crossing QPI]
Very slow: only several dozen to a few hundred MB/s of throughput was observed in our
test. It should not be used.
Pros: nothing. Cons: slow or not workable.
Considerations on I/O performance of PostgreSQL
① Too-frequent system call invocations because of the small block size
② Multiple buffer copies across the block layer / filesystem
[Diagram] A read(2) from PostgreSQL (shared buffer) goes through the filesystem (page
cache) and the block device (DMA buffer): request to block read → DMA request → DMA
completion → memcpy to page cache → memcpy to shared buffer → summary processing.
NVMe-SSDs dramatically improve device latency, so these buffer copies and system call
invocations become the dominant overhead.
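To illustrate the copy overhead in isolation (this is not PG-Strom code), the sketch
below reads with O_DIRECT so that the DMA lands directly in an aligned user buffer,
skipping the page-cache copy shown above; the file name and chunk size are arbitrary.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t chunk = 8 << 20;            /* 8MB, a large I/O unit */
    void *buf;
    int fd;
    ssize_t nread;

    /* O_DIRECT requires the buffer (and offset) to be sector-aligned */
    if (argc < 2 || posix_memalign(&buf, 4096, chunk) != 0)
        return 1;
    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0)
        return 1;
    nread = pread(fd, buf, chunk, 0);        /* DMA straight to buf */
    printf("read %zd bytes, no page-cache copy\n", nread);
    close(fd);
    free(buf);
    return 0;
}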
NVMe-Strom Software Stack
[Diagram] In user space, PostgreSQL / PG-Strom allocates GPU device memory with
cuMemAlloc(), then talks to the NVMe-Strom kernel driver through /proc/nvme-strom via
ioctl(2); regular buffered access still goes through read(2) to the VFS (ext4/xfs) and
its page cache. Given a file offset, the driver resolves the block number, checks
whether the block already exists in the page cache, and otherwise issues a DMA request
to the NVMe SSD driver for SSD-to-GPU peer-to-peer DMA into the GPU device memory
mapped by the nvidia driver.
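The user-space side of that flow might look like the sketch below; note that the
request structure and the ioctl command number are hypothetical stand-ins, not the
NVMe-Strom driver's real ABI.

#include <cuda.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request layout; the real NVMe-Strom interface differs */
struct ssd2gpu_dma_request {
    int            file_desc;     /* source file on the NVMe-SSD   */
    unsigned long  file_offset;   /* offset of the blocks to load  */
    unsigned long  length;        /* number of bytes to transfer   */
    unsigned long  gpu_devptr;    /* destination GPU device memory */
};
#define SSD2GPU_DMA _IOW('N', 1, struct ssd2gpu_dma_request)  /* hypothetical */

int load_blocks_to_gpu(int data_fd, CUdeviceptr dbuf,
                       unsigned long offset, unsigned long length)
{
    /* dbuf was allocated with cuMemAlloc() and mapped for P2P DMA;
     * ask the driver to DMA file blocks straight from SSD to GPU. */
    struct ssd2gpu_dma_request req = {
        data_fd, offset, length, (unsigned long) dbuf
    };
    int strom_fd = open("/proc/nvme-strom", O_RDONLY);
    int rc = (strom_fd < 0) ? -1 : ioctl(strom_fd, SSD2GPU_DMA, &req);
    if (strom_fd >= 0)
        close(strom_fd);
    return rc;
}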
Consistency with the disk cache
 Challenges:
① If source blocks are kept in the disk cache, they may be dirty, so the blocks on
storage are not up-to-date.
② It is inefficient to issue RAM-to-GPU DMA with a small unit size (8kB).
 Solution: write back the cached blocks (8kB unit size) to a userspace buffer, then
kick one bulk RAM-to-GPU DMA at once using the CUDA APIs.
✓ The processing cost is equivalent to read(2) + cuMemcpyHtoDAsync().
✓ Re-ordering of blocks in GPU memory does not affect the correctness of the operation.
[Diagram] Of the file blocks BLK-100...BLK-109 (BLCKSZ = 8KB each), the cached ones
(BLK-101, 104, 105, 108) are packed into a userspace DMA buffer in RAM and sent by
cuMemcpyHtoDAsync(); the uncached ones (BLK-100, 102, 103, 106, 107, 109) are
transferred by SSD-to-GPU P2P DMA. Both land in the same GPU device memory region,
with a transfer size per request of BLCKSZ * NChunks.
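A sketch of the write-back path for cached blocks is shown below; cuMemcpyHtoDAsync()
is the real CUDA driver API, while the buffer handling around it is simplified for
illustration.

#include <cuda.h>
#include <string.h>

#define BLCKSZ 8192   /* PostgreSQL block size */

/* Pack the cached 8kB blocks densely into one userspace staging buffer,
 * then issue a single bulk RAM-to-GPU DMA instead of many tiny ones.
 * Their order in GPU memory does not matter for correctness. */
void send_cached_blocks(const void **cached_blk, int nchunks,
                        char *staging, CUdeviceptr dbuf, CUstream stream)
{
    int i;

    for (i = 0; i < nchunks; i++)
        memcpy(staging + (size_t) i * BLCKSZ, cached_blk[i], BLCKSZ);
    /* one asynchronous transfer of BLCKSZ * NChunks bytes */
    cuMemcpyHtoDAsync(dbuf, staging, (size_t) nchunks * BLCKSZ, stream);
}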
Challenge to 10GB/s of query execution throughput

Why challenge 10GB/s with a single node?
NOT because it’s there.
 A DWH appliance of the previous generation offers 8.6GB/s or more of performance and
costs tens of millions of JPY; at system-replacement time after 5 years, how much will
the successor cost?
 A small Hadoop cluster offers similar capacity, but administration of a multi-node
system is usually troublesome. What would the choice be if equivalent performance were
available with a single node and SQL?
 HeteroServer 120GS (planned for Mar-2018): enhanced OLAP capability with GPU/SSD,
performance equivalent to the previous-generation DWH appliance at about 1/10 of the
system cost, and DBMS functionality identical to PostgreSQL, with its widespread and
long-standing user base.
(Ref) GTC2017 chose “SSD-to-GPU direct SQL” as a top-5 finalist
Our R&D was chosen as a top-5 finalist among the 138 posters from all over the world at
the GPU Technology Conference 2017 (8th-11th May, San Jose).
(Sadly, it took second place in the voting...)
As of May-2017, the measured throughput was 4.5GB/s with a 3x SSD configuration;
theoretically it should be 6.6GB/s. Why?

Investigation using the visual profiler
The time consumed by the sub-kernels launched via dynamic parallelism was too small
(less than 10% of the GPU time).
Investigation using the visual profiler (cont.)
Concurrency of the GPU kernels for SQL processing was reduced by the synchronization
of the GPU sub-kernels.
Improvement by reduction of in-GPU synchronization points
GPU resource usage got back to a reasonable level, and the storage limits the
performance again: the GPUs are capable enough to process 10GB of data per second.
Towards 10GB/s query processing throughput
▌Reduction of in-GPU kernel synchronization points
 GpuScan and GpuPreAgg have already been reworked.
 Work on GpuJoin is in progress.
▌Improvement of PG-Strom’s GPU task scheduler
 Eliminating dynamic parallelism and utilizing unified memory reduces unnecessary
context switches by the external GPU servers.
▌GpuScan + GpuJoin + GpuPreAgg combined kernel
 Little CPU interaction in typical analytics / reporting workloads.
 Execution of JOIN/GROUP BY at the wired speed of the table scan.
▌Optimization for md-raid0 (striping)
 To supply a data stream with 10GB/s of bandwidth, driver performance must be improved
for multi-NVMe-SSD configurations.
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (1/5)
SELECT cat, count(*), avg(x)
  FROM t0 JOIN t1 ON t0.id = t1.id
 WHERE y like ‘%abc%’
 GROUP BY cat;
[Diagram] Each SQL stage maps to a GPU operation: SCAN (WHERE y like ‘%abc%’) → GpuScan;
JOIN (t0 JOIN t1 ON t0.id = t1.id) → GpuJoin; GROUP BY cat with aggregation
(count(*), avg(x)) → GpuPreAgg + Agg, which produces the results.
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (2/5)
A simple replacement of the existing logic leads to ping-ponging of data transfers
between CPU and GPU:
[Diagram] Storage → DMA buffer → GpuScan kernel → DMA buffer → GpuJoin kernel →
DMA buffer → GpuPreAgg kernel → Agg (PostgreSQL) → results; every kernel boundary
crosses the CPU/GPU border.
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (3/5)
Evaluation of the WHERE-clause is simple enough, and GpuJoin/GpuPreAgg already support
a combined GPU kernel:
[Diagram] Storage → DMA buffer → combined GpuScan + GpuJoin kernel → GpuPreAgg kernel →
DMA buffer → Agg (PostgreSQL) → results.
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (4/5)
Extreme data reduction on the GPU side by GROUP BY / aggregation, without any
interaction with the CPU:
[Diagram] Storage → DMA buffer (data size = large) → combined GpuScan + GpuJoin +
GpuPreAgg kernel → DMA buffer (data size = small) → Agg (PostgreSQL) → results.
GpuScan + GpuJoin + GpuPreAgg Combined Kernel (5/5)
Only 1.2kB was written back to the CPU side after the 10GB table scan.
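As a toy illustration of why so little comes back (not PG-Strom's actual combined
kernel), the fused kernel below evaluates the WHERE-clause and the per-group
aggregation in a single launch, so only the small per-group buffers, not the scanned
rows, are written back toward the CPU. The column layout and group count are made up
for the example; atomicAdd() on double requires a Pascal-class GPU such as the Tesla P40.

#include <cuda_runtime.h>

#define NCATS 16   /* number of GROUP BY groups (illustrative) */

__global__ void scan_agg_fused(const int *cat, const double *x,
                               const double *y, int nrows,
                               unsigned long long *grp_count,
                               double *grp_sum_x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nrows)
        return;
    /* SCAN stage: evaluate the WHERE-clause */
    if (x[i] < y[i] || x[i] > y[i] + 20.0)
        return;
    /* GROUP BY stage: per-group aggregation, all inside the GPU */
    int g = cat[i] % NCATS;
    atomicAdd(&grp_count[g], 1ULL);   /* count(*)           */
    atomicAdd(&grp_sum_x[g], x[i]);   /* sum(x), for avg(x) */
}
/* Only 2 * NCATS aggregate values return to the CPU, regardless of
 * how many gigabytes of rows were scanned. */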
Towards PG-Strom v2.0
Dec-2017: PG-Strom v2.0 beta
Mar-2018: PG-Strom v2.0 GA, HeteroServer 120GS
Open Source Activity (new features development):
 SSD-to-GPU Direct SQL Execution and related features
 Rework of the GPU task scheduler
 On-GPU data store (as a foreign table)
 Columnar cache mechanism
 In-database analytics and machine-learning library
QA & Release:
 Test / stabilization
 Packaging
 Documentation
Business Activity: early-adopter development
Idea of The Intelligent Storage
~Beyond the hardware limitation~
Idea of the intelligent storage (1/2)
Row-to-column translation on the SSD enables write-by-row, read-as-column:
[Diagram] The OLTP world writes transactional data in row format; Row/Column
translation logic on the SSD transforms the data format, so the OLAP world reads
analytic data in column format, picking up only the referenced data. SQL execution on
top of the column data pulls out the maximum performance of the GPU and returns
pre-processed records (much smaller).
Why the GPU prefers a columnar storage format
▌Row format (transactional data): random memory access
 More memory transactions, lower usage ratio of the memory bus: with a 256-bit memory
transaction, only 32bit x 1 = 32 bits are valid (12.5% usage ratio).
▌Column format (analytic data): coalesced memory access
 The least number of memory transactions, maximum usage ratio of the memory bus:
32bit x 8 = 256 bits of a 256-bit memory transaction are valid (100% usage ratio).
Transactional (row) data leads to inefficient memory-bus usage by the GPU cores.
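The difference can be seen in two access patterns; the sketch below is illustrative
(the record layout is made up), contrasting a strided read of one field inside a row
with a coalesced read of a column array.

/* Row format: adjacent threads read values 32 bytes apart, so most of
 * each 256-bit memory transaction is wasted (~12.5% useful). */
struct row { float id, x, y, z, a, b, c, d; };   /* 32 bytes per record */

__global__ void sum_x_row(const struct row *tbl, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, tbl[i].x);   /* strided access */
}

/* Column format: adjacent threads read adjacent 32-bit values, so the
 * whole memory transaction is useful (100%): coalesced access. */
__global__ void sum_x_col(const float *col_x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, col_x[i]);   /* coalesced access */
}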
Idea of the intelligent storage (2/2)
[Diagram] Large PostgreSQL tables on a “SQL-aware” NVMe SSD, connected to the GPU over
the PCIe bus by SSD-to-GPU P2P DMA; WHERE-clause, JOIN, and GROUP BY run on the GPU
before the data blocks reach PostgreSQL.
 Write as row data: keeps the performance of the usual transactional workloads, and
simultaneously allows real-time analytics / reporting.
 Row-to-column transformation: extracting only the referenced attributes pulls out an
“effective bandwidth” beyond the PCIe limitation.
 “Ultra” parallel SQL execution by GPU: the columnar data format, the optimal one for
the GPU, pulls out the maximum computing capability for SQL processing.
THANK YOU!
contacts
e-mail: kaigai@heterodb.com
tw: @kkaigai

More Related Content

What's hot

GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)Kohei KaiGai
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStromKohei KaiGai
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_ProcessingKohei KaiGai
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGISKohei KaiGai
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_IndexKohei KaiGai
 
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multiKohei KaiGai
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwJan Holčapek
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdwKohei KaiGai
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU deviceKohei KaiGai
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...Equnix Business Solutions
 
20181210 - PGconf.ASIA Unconference
20181210 - PGconf.ASIA Unconference20181210 - PGconf.ASIA Unconference
20181210 - PGconf.ASIA UnconferenceKohei KaiGai
 
FDW-based Sharding Update and Future
FDW-based Sharding Update and FutureFDW-based Sharding Update and Future
FDW-based Sharding Update and FutureMasahiko Sawada
 
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count20210928_pgunconf_hll_count
20210928_pgunconf_hll_countKohei KaiGai
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Ural-PDC
 
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020John Zedlewski
 

What's hot (20)

GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing20201006_PGconf_Online_Large_Data_Processing
20201006_PGconf_Online_Large_Data_Processing
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS20201128_OSC_Fukuoka_Online_GPUPostGIS
20201128_OSC_Fukuoka_Online_GPUPostGIS
 
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index20210301_PGconf_Online_GPU_PostGIS_GiST_Index
20210301_PGconf_Online_GPU_PostGIS_GiST_Index
 
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU device
 
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
PGConf.ASIA 2019 Bali - AppOS: PostgreSQL Extension for Scalable File I/O - K...
 
20181210 - PGconf.ASIA Unconference
20181210 - PGconf.ASIA Unconference20181210 - PGconf.ASIA Unconference
20181210 - PGconf.ASIA Unconference
 
FDW-based Sharding Update and Future
FDW-based Sharding Update and FutureFDW-based Sharding Update and Future
FDW-based Sharding Update and Future
 
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
 

Similar to GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughput 10GB/s -

20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQLKohei KaiGai
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...Equnix Business Solutions
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_ENKohei KaiGai
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.
 
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드문기 박
 
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red_Hat_Storage
 
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various cloudsPGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various cloudsPGConf APAC
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Kohei KaiGai
 
PostgreSQL 10; Long Awaited Enterprise Solutions
PostgreSQL 10; Long Awaited Enterprise SolutionsPostgreSQL 10; Long Awaited Enterprise Solutions
PostgreSQL 10; Long Awaited Enterprise SolutionsJulyanto SUTANDANG
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 
QCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AIQCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AILex Yu
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsAltoros
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...E-Commerce Brasil
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfDLow6
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDatabricks
 
BlazingSQL + RAPIDS AI at GTC San Jose 2019
BlazingSQL + RAPIDS AI at GTC San Jose 2019BlazingSQL + RAPIDS AI at GTC San Jose 2019
BlazingSQL + RAPIDS AI at GTC San Jose 2019Rodrigo Aramburu
 
BlazingSQL & Graphistry - Netflow Demo
BlazingSQL & Graphistry - Netflow DemoBlazingSQL & Graphistry - Netflow Demo
BlazingSQL & Graphistry - Netflow DemoRodrigo Aramburu
 

Similar to GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughput 10GB/s - (20)

20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드
SQream-GPU가속 초거대 정형데이타 분석용 SQL DB-제품소개-박문기@메가존클라우드
 
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
 
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various cloudsPGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
 
PostgreSQL 10; Long Awaited Enterprise Solutions
PostgreSQL 10; Long Awaited Enterprise SolutionsPostgreSQL 10; Long Awaited Enterprise Solutions
PostgreSQL 10; Long Awaited Enterprise Solutions
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
QCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AIQCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AI
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
BlazingSQL + RAPIDS AI at GTC San Jose 2019
BlazingSQL + RAPIDS AI at GTC San Jose 2019BlazingSQL + RAPIDS AI at GTC San Jose 2019
BlazingSQL + RAPIDS AI at GTC San Jose 2019
 
BlazingSQL & Graphistry - Netflow Demo
BlazingSQL & Graphistry - Netflow DemoBlazingSQL & Graphistry - Netflow Demo
BlazingSQL & Graphistry - Netflow Demo
 

More from Kohei KaiGai

20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_HistoryKohei KaiGai
 
20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_APIKohei KaiGai
 
20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrowKohei KaiGai
 
20210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.020210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.0Kohei KaiGai
 
20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCacheKohei KaiGai
 
20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGISKohei KaiGai
 
20200828_OSCKyoto_Online
20200828_OSCKyoto_Online20200828_OSCKyoto_Online
20200828_OSCKyoto_OnlineKohei KaiGai
 
20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdwKohei KaiGai
 
20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_FdwKohei KaiGai
 
20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_TokyoKohei KaiGai
 
20191115-PGconf.Japan
20191115-PGconf.Japan20191115-PGconf.Japan
20191115-PGconf.JapanKohei KaiGai
 
20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_BetaKohei KaiGai
 
20190925_DBTS_PGStrom
20190925_DBTS_PGStrom20190925_DBTS_PGStrom
20190925_DBTS_PGStromKohei KaiGai
 
20190516_DLC10_PGStrom
20190516_DLC10_PGStrom20190516_DLC10_PGStrom
20190516_DLC10_PGStromKohei KaiGai
 
20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdwKohei KaiGai
 
20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_FdwKohei KaiGai
 
20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LTKohei KaiGai
 
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigDataKohei KaiGai
 
20180920_DBTS_PGStrom_JP
20180920_DBTS_PGStrom_JP20180920_DBTS_PGStrom_JP
20180920_DBTS_PGStrom_JPKohei KaiGai
 

More from Kohei KaiGai (19)

20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History20221116_DBTS_PGStrom_History
20221116_DBTS_PGStrom_History
 
20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API20221111_JPUG_CustomScan_API
20221111_JPUG_CustomScan_API
 
20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow20211112_jpugcon_gpu_and_arrow
20211112_jpugcon_gpu_and_arrow
 
20210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.020210731_OSC_Kyoto_PGStrom3.0
20210731_OSC_Kyoto_PGStrom3.0
 
20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache20210511_PGStrom_GpuCache
20210511_PGStrom_GpuCache
 
20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS20201113_PGconf_Japan_GPU_PostGIS
20201113_PGconf_Japan_GPU_PostGIS
 
20200828_OSCKyoto_Online
20200828_OSCKyoto_Online20200828_OSCKyoto_Online
20200828_OSCKyoto_Online
 
20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw20200806_PGStrom_PostGIS_GstoreFdw
20200806_PGStrom_PostGIS_GstoreFdw
 
20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw20200424_Writable_Arrow_Fdw
20200424_Writable_Arrow_Fdw
 
20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo20191211_Apache_Arrow_Meetup_Tokyo
20191211_Apache_Arrow_Meetup_Tokyo
 
20191115-PGconf.Japan
20191115-PGconf.Japan20191115-PGconf.Japan
20191115-PGconf.Japan
 
20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta20190926_Try_RHEL8_NVMEoF_Beta
20190926_Try_RHEL8_NVMEoF_Beta
 
20190925_DBTS_PGStrom
20190925_DBTS_PGStrom20190925_DBTS_PGStrom
20190925_DBTS_PGStrom
 
20190516_DLC10_PGStrom
20190516_DLC10_PGStrom20190516_DLC10_PGStrom
20190516_DLC10_PGStrom
 
20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw20190418_PGStrom_on_ArrowFdw
20190418_PGStrom_on_ArrowFdw
 
20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw20190314 PGStrom Arrow_Fdw
20190314 PGStrom Arrow_Fdw
 
20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT20181212 - PGconf.ASIA - LT
20181212 - PGconf.ASIA - LT
 
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
20181211 - PGconf.ASIA - NVMESSD&GPU for BigData
 
20180920_DBTS_PGStrom_JP
20180920_DBTS_PGStrom_JP20180920_DBTS_PGStrom_JP
20180920_DBTS_PGStrom_JP
 

Recently uploaded

Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxPurva Nikam
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 

Recently uploaded (20)

Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptx
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 

GPU/SSD Accelerates PostgreSQL - challenge towards query processing throughput 10GB/s -

  • 6. Is a SQL workload similar to image processing? Image data is an int/float[] array; a table with columns (ID, X, Y, Z) is likewise a uniform array of records — e.g. SELECT * FROM my_table WHERE X BETWEEN 40 AND 60 applies the same test to every row in parallel. GPU’s advantage: the same operation applied to a massive amount of data. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -6
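To make the analogy concrete, here is a minimal CUDA sketch (not PG-Strom's generated code) of how such a WHERE-clause maps onto the GPU: one thread per row, every thread applying the same predicate, exactly like a per-pixel image filter.

    // One thread evaluates "X BETWEEN 40 AND 60" for one row; adjacent
    // threads process adjacent rows, just like pixels of an image.
    __global__ void eval_where(const float *x, bool *match, int nrows)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nrows)
            match[i] = (x[i] >= 40.0f && x[i] <= 60.0f);
    }

    // launch example: eval_where<<<(nrows + 255) / 256, 256>>>(d_x, d_match, nrows);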
  • 7. Problems PG-Strom is trying to solve (2015): data (small) resident in RAM.  Primitives of PG-Strom ✓ heavy SQL (JOIN, GROUP BY) ✓ data size less than physical RAM. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -7
  • 8. Problems PG-Strom is trying to solve (2017): extending from RAM-resident data (small) to data (large) — more computing-intensive and more I/O-intensive workloads.  PL/CUDA ✓ in-database analytics and machine learning ✓ drug discovery, anomaly detection  SSD-to-GPU Direct SQL Execution ✓ fast I/O at nearly the wire speed of the hardware ✓ large batch, reporting, and analytics  Primitives of PG-Strom ✓ heavy SQL (JOIN, GROUP BY) ✓ data size less than physical RAM. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -8
  • 9. It’s the second time for me to give a talk at DB Tech Showcase! DBTS2014: GPGPU Accelerates PostgreSQL → DBTS2017: GPU & SSD Accelerates PostgreSQL — utilization of new devices according to expansion of the application domain. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -9
  • 11. PostgreSQL Internals. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -11
    Pipeline: SQL query → SQL Parser → (parse-tree) → Query Optimizer → (execution plan) → Query Executor → query results, on top of the Storage Manager, Transaction Manager, and IPC (lock, latch, shmem).
     Parser: parses the supplied SQL and transforms it into an internal data structure (parse-tree); detects syntax errors.
     Optimizer: estimates costs based on the statistics; constructs the optimal query execution plan.
     Executor: runs the workloads (Scan, Join, Sort, ...) according to the query execution plan, using PostgreSQL's internal infrastructure such as the storage manager.
  • 12. How PostgreSQL constructs a query execution plan (1/2): Scan t0, Scan t1, Scan t2 → Join t0,t1 → Join (t0,t1),t2 → GROUP BY cat → ORDER BY score → LIMIT 100. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -12
  • 13. How PostgreSQL constructs a query execution plan (2/2). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -13
    Scan t0 (statistics: nrows 1.2M, width 80, index none) + Scan t1 (statistics: nrows 4000, width 120, index t1.id) → Join t0,t1.
    Candidate paths: HashJoin cost=4000, MergeJoin cost=12000, NestLoop cost=99999 (built-in execution paths of PostgreSQL); Parallel Hash Join cost=3000 (since PostgreSQL v9.6); GpuJoin cost=2500, proposed by an extension (possible since PostgreSQL v9.5) — WINNER!
    Multiple algorithms compete, and the plan is chosen by its estimated “cost”.
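For orientation only, a hedged C sketch of how an extension can enter this competition through PostgreSQL's set_join_pathlist_hook: it adds a CustomPath carrying its own cost estimate, and the planner keeps whichever path is cheapest. The cost numbers and gpujoin_path_methods are hypothetical placeholders; PG-Strom's real cost model is far more involved.

    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/extensible.h"
    #include "optimizer/pathnode.h"
    #include "optimizer/paths.h"

    PG_MODULE_MAGIC;

    extern const CustomPathMethods gpujoin_path_methods;  /* assumed defined elsewhere */

    static void
    gpujoin_add_join_path(PlannerInfo *root, RelOptInfo *joinrel,
                          RelOptInfo *outerrel, RelOptInfo *innerrel,
                          JoinType jointype, JoinPathExtraData *extra)
    {
        CustomPath *cpath = makeNode(CustomPath);

        cpath->path.pathtype = T_CustomScan;
        cpath->path.parent = joinrel;
        cpath->path.startup_cost = 100.0;   /* hypothetical cost estimates */
        cpath->path.total_cost = 2500.0;    /* must beat HashJoin's 4000 to win */
        cpath->methods = &gpujoin_path_methods;

        add_path(joinrel, &cpath->path);    /* competes with the built-in paths */
    }

    void
    _PG_init(void)
    {
        set_join_pathlist_hook = gpujoin_add_join_path;  /* available since v9.5 */
    }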
  • 14. Interactions between PostgreSQL and PG-Strom via CustomScan. As long as it produces consistent results, the implementation is flexible. Plan: GroupAgg (key: cat) ← CustomScan (GpuJoin) ← SeqScan on t0 / SeqScan on t1. (*BeginCustomScan) → ExecInitGpuJoin(...): initialize the GPU context; kick asynchronous JIT compilation of the auto-generated GPU program. (*ExecCustomScan) → ExecGpuJoin(...): read records from t0 and t1 and copy them to the DMA buffer; kick asynchronous GPU tasks; fetch results from the completed GPU tasks, then pass them to the next step (GroupAgg). (*EndCustomScan) → ExecEndGpuJoin(...): wait for completion of the asynchronous tasks (if any); release GPU resources. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -14
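A minimal, hedged sketch of how those three callbacks plug into the CustomScan executor interface; the bodies only mark where the GPU work described above would go, and ReScan and the other optional callbacks are omitted.

    #include "postgres.h"
    #include "nodes/execnodes.h"
    #include "nodes/extensible.h"

    static void
    ExecInitGpuJoin(CustomScanState *node, EState *estate, int eflags)
    {
        /* initialize the GPU context; kick asynchronous JIT compilation
         * of the auto-generated GPU program */
    }

    static TupleTableSlot *
    ExecGpuJoin(CustomScanState *node)
    {
        /* read records from t0/t1 into the DMA buffer, kick asynchronous
         * GPU tasks, and return tuples of completed tasks to the upper
         * node (GroupAgg); returning NULL signals end of scan */
        return NULL;
    }

    static void
    ExecEndGpuJoin(CustomScanState *node)
    {
        /* wait for completion of in-flight asynchronous tasks (if any),
         * then release GPU resources */
    }

    static const CustomExecMethods gpujoin_exec_methods = {
        .CustomName      = "GpuJoin",
        .BeginCustomScan = ExecInitGpuJoin,
        .ExecCustomScan  = ExecGpuJoin,
        .EndCustomScan   = ExecEndGpuJoin,
        /* ReScanCustomScan and other optional callbacks omitted */
    };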
  • 15. SQL to GPU code auto-generation (example of a WHERE-clause). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -15
    E.g.) On-demand transformation of the numeric formula in a WHERE-clause into CUDA C source code, which is then built by the run-time compiler and executed in parallel.
    QUERY:
        SELECT cat, count(*), avg(x)
          FROM t0
         WHERE x BETWEEN y AND y + 20.0
         GROUP BY cat;
    Auto-generated CUDA C (the SQL expression, with references to the input data):
        STATIC_FUNCTION(bool)
        gpupreagg_qual_eval(kern_context *kcxt,
                            kern_data_store *kds,
                            size_t kds_index)
        {
            pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
            pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
            pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);

            return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
                         pgfn_float8le(kcxt, KVAR_3,
                                       pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
        }
  • 16. A simple micro-benchmark. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -16
    ▌Test Query:
        SELECT cat, count(*), avg(x)
          FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
         GROUP BY cat;
    ✓ t0 contains 100M rows; t1...t8 contain 100K rows each (like a star schema)

    Query response time [sec] — PG-Strom microbenchmark with JOIN/GROUP BY:
        # of tables joined     2      3      4      5      6      7      8      9
        PostgreSQL v9.6     8.48  13.23  18.28  23.42  28.88  34.50  40.77  47.16
        PG-Strom 2.0devel   5.00   5.46   5.91   6.45   7.17   8.07   9.22  10.21

    Environment — CPU: Xeon E5-2650v4, GPU: Tesla P40, RAM: 128GB, OS: CentOS 7.3, DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel
  • 17. Problems PG-Strom is trying to solve: from RAM-resident data (small) toward data (large) — from computing-intensive workloads toward I/O-intensive workloads. How do we move into this domain? DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -17
  • 18. ~Re-definition of the GPU’s role~ SSD-to-GPU Direct SQL Execution
  • 19. SSD-to-GPU Direct SQL Execution. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -19
    Large PostgreSQL tables on NVMe-SSD → SSD-to-GPU P2P DMA over the PCIe bus (NVMe-Strom driver) → WHERE-clause / JOIN / GROUP BY on the GPU → PostgreSQL. Pre-processing of the records according to the SQL reduces the amount of data to be loaded and processed by the CPU, in contrast to the existing data flow (large I/O size). In short: load PostgreSQL data blocks from SSD directly to GPU using peer-to-peer DMA, execute SQL workloads on the GPU, and reduce the data size before it is loaded into CPU/RAM.
  • 20. Element technology① GPUDirect RDMA (1/2) ▌P2P data-transfer technology between GPU and other PCIe devices, bypassing the CPU  Originally designed for multi-node MPI over InfiniBand  Provides Linux kernel-driver infrastructure for other PCIe devices, including NVMe-SSDs. Copyright (c) NVIDIA corporation, 2015. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -20
  • 21. Element technology① GPUDirect RDMA (2/2). GPUDirect RDMA enables GPU device memory to be mapped into the physical address space of the host system (the PCIe BAR1 area, e.g. at 0xe0000000, alongside RAM at 0xf0000000, the NVMe-SSD, an InfiniBand HBA, and other PCIe devices). Once a “physical address of GPU device memory” exists, we can use it as the source or destination address of DMA with PCIe devices — e.g. DMA Request: SRC: 1200th sector, LEN: 40 sectors, DST: 0xe0200000. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -21
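A rough kernel-side sketch of how such a mapping is obtained, assuming the nvidia_p2p_get_pages() interface that NVIDIA documents for GPUDirect RDMA kernel drivers; error handling and the actual NVMe command construction are omitted, and the exact signature should be checked against the NVIDIA driver headers.

    /* Sketch: pin a GPU memory region and obtain its physical pages so
     * their addresses can serve as the DMA destination of an NVMe read. */
    #include <linux/types.h>
    #include "nv-p2p.h"   /* NVIDIA GPUDirect RDMA kernel interface */

    static void gpu_pages_free_cb(void *data)
    {
        /* called by the nvidia driver if the mapping is torn down */
    }

    static int map_gpu_memory_for_dma(u64 gpu_vaddr, u64 length)
    {
        struct nvidia_p2p_page_table *page_table;
        int rc;

        rc = nvidia_p2p_get_pages(0, 0, gpu_vaddr, length,
                                  &page_table, gpu_pages_free_cb, NULL);
        if (rc)
            return rc;

        /* page_table->pages[i]->physical_address now points into the GPU's
         * PCIe BAR1 window; hand it to the NVMe driver as the DMA target,
         * e.g. the DST of "read 40 sectors from sector 1200" above. */
        return 0;
    }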
  • 22. Element technology② NVMe-SSD — major vendors ship SSD products that support the NVMe specification. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -22

                    Intel SSD 750      Intel SSD P3608    Samsung PM1725   HGST Ultrastar SN260  Seagate Nytro XP7200
    Interface       PCIe 3.0 x4        PCIe 3.0 x8        PCIe 3.0 x8      PCIe 3.0 x8           PCIe 3.0 x16
    Capacity        400GB/800GB/1.2TB  1.6TB/3.2TB/4.0TB  3.2TB/6.4TB      1.6TB/3.2TB/6.4TB     3.8TB/7.7TB
    Form Factor     HHHL               HHHL               HHHL             HHHL                  FHHL
    Max SeqRead     2.5GB/s            5.0GB/s            6.0GB/s          6.1GB/s               10.0GB/s
    Max SeqWrite    1.2GB/s            3.0GB/s            2.0GB/s          2.2GB/s               3.6GB/s
    Rand Read       460K IOPS          850K IOPS          1000K IOPS       1200K IOPS            940K IOPS
    Rand Write      290K IOPS          50K IOPS           120K IOPS        200K IOPS             160K IOPS
    Warranty        5 years            5 years            5 years          5 years               5 years
    MTBF            1.2M hours         1.0M hours         2.0M hours       2.0M hours            2.0M hours
    Target          consumer           enterprise / data-center (all others)
  • 23. Benchmark configuration (1/2). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -23
     Server: Supermicro 5018GR-T
     CPU: Intel Xeon E5-2650v4
     RAM: 128GB (16GB ECC DDR4-2166 x8)
     GPU: NVIDIA Tesla P40
     SSD: HGST SN260 (7.68TB)
     HDD: SATA 1TB + 3TB x2 (7200rpm)
     OS: CentOS 7.3, CUDA 8.0 + nvidia driver 375.51, NVMe-Strom driver
     DB: PostgreSQL 9.6.4 + PG-Strom v2.0devel
     Data: Star Schema Benchmark (SF=401, 351GB)
  • 24. Benchmark configuration (2/2) – key devices. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -24
    Model: HGST Ultrastar SN260
        Interface: PCIe 3.0 x8 (NVMe 1.2)
        Capacity: 7.68TB
        Form Factor: HHHL
        Sequential Read: 6,170MB/s
        Sequential Write: 2,200MB/s
        Random Read: 1200K IOPS
        Random Write: 75K IOPS
    Model: NVIDIA Tesla P40
        GPU Architecture: NVIDIA Pascal
        Interface: PCIe 3.0 x16
        Form Factor: FHFL, Dual-Slot
        # of CUDA cores: 3840
        Performance (FP32): 12 TFLOPS
        GPU Memory: 24GB (GDDR5)
        GPU Memory Bandwidth: 346GB/s
        TDP: 250W
    Special thanks to the device vendors.
  • 25. Benchmark results (1/2). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -25
     We ran queries Q1-Q4 below against the lineorder table (351GB).
     Throughput is defined as (351 * 1024) / (query response time [sec]).
        Q1 ... SELECT count(*) FROM lineorder;
        Q2 ... SELECT count(*),sum(lo_revenue),sum(lo_supplycost) FROM lineorder;
        Q3 ... SELECT count(*) FROM lineorder GROUP BY lo_orderpriority;
        Q4 ... SELECT count(*),sum(lo_revenue),sum(lo_supplycost) FROM lineorder GROUP BY lo_shipmode;
     ※ max_parallel_workers_per_gather is 24 for PostgreSQL v9.6, 4 for PG-Strom.

    Query execution throughput [MB/s]:
                                   Q1       Q2       Q3       Q4
        PostgreSQL v9.6 + SN260  889.05   859.31   875.69   842.04
        PG-Strom v2.0 + SN260   5988.0   5989.9   5988.8   5979.6
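As a sanity check on that definition: 351GB is 351 * 1024 = 359,424MB, so PG-Strom's roughly 5,988MB/s corresponds to a response time of about 359,424 / 5,988 ≈ 60 seconds per query, while PostgreSQL's ~880MB/s corresponds to roughly 400 seconds.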
  • 26. Benchmark results (2/2). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -26
     A full scan of the 351GB table on a system with 128GB of RAM, so the storage obviously limits query execution performance.
     PG-Strom pulled sequential-read performance out of the SSD close to its catalog spec.
     PostgreSQL shows non-negligible overhead around I/O.

    [Figure: time series of storage read throughput [MB/s] over the elapsed query time for Q4, measured with iostat; PG-Strom v2.0devel + SN260 vs. PostgreSQL v9.6 + SN260]

    [kaigai@saba ~]$ iostat -cdm 1
      :
    Device:      tps       MB_read/s  MB_wrtn/s  MB_read  MB_wrtn
    sda          6.00      0.19       0.00       0        0
    sdb          0.00      0.00       0.00       0        0
    sdc          0.00      0.00       0.00       0        0
    nvme0n1      24059.00  5928.41    0.00       5928     0
      :
  • 27. Considerations on the hardware configuration. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -27
    ① SSD and GPU connected under a PCIe switch (CPU - PLX - SSD/GPU): OK.
       HeteroDB has tested this configuration; up to 9.5GB/s was observed in raw I/O with a Xeon E5-2650v4.
       Pros: maximum performance. Cons: only limited hardware supports this configuration.
    ② A single CPU/IOH in the SSD/GPU path: workable.
       Pros: easy to obtain supported hardware. Cons: top performance is lower than with a PLX switch.
    ③ The SSD/GPU path traverses a QPI/HT link: not supported.
       Very slow — only several dozen to a few hundred MB/s was observed in our test; it should not be used.
       Pros: nothing. Cons: slow or not workable.
  • 28. Considerations on the I/O performance of PostgreSQL: ① too-frequent system-call invocations because of the small block size; ② multiple buffer copies across the block layer and filesystem. Data path: NVMe-SSD → DMA into the block device's DMA buffer → memcpy to the filesystem page cache → memcpy to PostgreSQL's shared buffer → summary processing; each read(2) triggers a block-read request, a DMA completion, and both copies. NVMe-SSD dramatically improves device latency, so the buffer copies and system-call invocations now dominate. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -28
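For contrast, a schematic (not PostgreSQL source) of this traditional read path in plain C: one read(2) per 8kB block means one system call, plus the two memcpy hops above, for every block of a 351GB scan. The file path is a hypothetical example of a PostgreSQL heap file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BLCKSZ 8192   /* PostgreSQL's default block size */

    int main(void)
    {
        char buf[BLCKSZ];
        long nblocks = 0;
        int  fd = open("base/16384/16385", O_RDONLY);  /* hypothetical heap file */

        /* each iteration: one system call, one copy into the page cache,
         * one copy into user space -- before any SQL processing happens */
        while (read(fd, buf, BLCKSZ) == (ssize_t)BLCKSZ)
            nblocks++;

        printf("%ld blocks read => %ld system calls\n", nblocks, nblocks);
        close(fd);
        return 0;
    }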
  • 29. NVMe-Strom Software Stack DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -29 pg-strom NVMe-Strom driver VFS (ext4/xfs) Page Cache NVMe SSD Driver nvidia driver GPU device memory PostgreSQL cuMemAlloc() /proc/nvme-strom read(2) User Space Kernel Space
  • 30. NVMe-Strom Software Stack DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -30 pg-strom NVMe-Strom driver VFS (ext4/xfs) Page Cache NVMe SSD Driver nvidia driver GPU device memory GPU device memory PostgreSQL cuMemAlloc() /proc/nvme-strom ioctl(2) read(2) User Space Kernel Space
  • 31. NVMe-Strom Software Stack. PostgreSQL / pg-strom allocates GPU device memory via cuMemAlloc(), then issues ioctl(2) on /proc/nvme-strom with a file offset. The NVMe-Strom driver resolves the offset to a block number through the VFS (ext4/xfs), checks whether the block exists in the page cache, and otherwise builds a DMA request for the NVMe SSD driver, which performs the SSD-to-GPU peer-to-peer DMA into the GPU device memory mapped by the nvidia driver; cached blocks still go through read(2) in user space. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -31
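A hedged user-space sketch of that call sequence. The ioctl command STROM_IOCTL__MEMCPY_SSD2GPU and the strom_dma_ssd2gpu struct are hypothetical stand-ins — the real NVMe-Strom ABI is not shown in the slides — while cuMemAlloc() and /proc/nvme-strom come from the diagram above.

    #include <cuda.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct strom_dma_ssd2gpu {            /* hypothetical request layout */
        int         file_desc;            /* data file on the NVMe-SSD */
        long        file_offset;          /* where the blocks to read begin */
        long        length;               /* number of bytes to transfer */
        CUdeviceptr gpu_dest;             /* destination GPU device memory */
    };
    #define STROM_IOCTL__MEMCPY_SSD2GPU _IOWR('S', 1, struct strom_dma_ssd2gpu)

    int ssd2gpu_read(int data_fd, long offset, long length, CUdeviceptr dbuf)
    {
        struct strom_dma_ssd2gpu req = {
            .file_desc   = data_fd,
            .file_offset = offset,
            .length      = length,
            .gpu_dest    = dbuf,
        };
        int strom_fd = open("/proc/nvme-strom", O_RDONLY);
        int rc = ioctl(strom_fd, STROM_IOCTL__MEMCPY_SSD2GPU, &req);

        close(strom_fd);
        return rc;   /* on success the P2P DMA lands directly in GPU memory */
    }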
  • 32. Consistency with the disk cache. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -32
     Challenges: ① if source blocks are kept in the disk cache, they may be dirty, so the blocks on storage are not up-to-date; ② it is inefficient to issue RAM-to-GPU DMA with a small unit size (8kB).
     Solution: write back the cached blocks (8kB unit size) to a userspace buffer, then kick one bulk RAM-to-GPU DMA using CUDA APIs.
    ✓ Processing penalty equivalent to read(2) + cuMemcpyHtoDAsync().
    ✓ Reordering of blocks in GPU memory does not affect the correctness of the operation.
    Example: of blocks BLK-100..BLK-109 (BLCKSZ = 8kB), the uncached ones go via SSD-to-GPU P2P DMA, while the cached ones (BLK-101, 104, 105, 108) are staged into the userspace DMA buffer and transferred with cuMemcpyHtoDAsync; the transfer size per request is BLCKSZ * NChunks.
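A minimal host-side sketch of the fallback path for cached blocks, assuming they have already been collected from the page cache; cuMemAllocHost() and cuMemcpyHtoDAsync() are real CUDA driver API calls, the rest is illustrative.

    #include <cuda.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* Gather cached 8kB blocks into one pinned staging buffer, then issue
     * a single bulk RAM-to-GPU DMA instead of many tiny 8kB transfers. */
    void flush_cached_blocks(CUdeviceptr dgpu, CUstream stream,
                             const void **cached_blocks, int nblocks)
    {
        void *staging;

        cuMemAllocHost(&staging, (size_t)nblocks * BLCKSZ);   /* pinned RAM */
        for (int i = 0; i < nblocks; i++)
            memcpy((char *)staging + (size_t)i * BLCKSZ,
                   cached_blocks[i], BLCKSZ);

        /* one asynchronous transfer; block order on the GPU is irrelevant
         * because each block is self-describing */
        cuMemcpyHtoDAsync(dgpu, staging, (size_t)nblocks * BLCKSZ, stream);
    }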
  • 33. Challenge to 10GB/s of query execution throughput
  • 34. Why challenge 10GB/s with a single node? DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -34 NOT because it’s there.
  • 35. Why challenge 10GB/s with a single node? A previous-generation DWH appliance: performance 8.6GB/s, cost in the tens of millions of JPY — and when the system is replaced after 5 years, how much will the successor cost? A small Hadoop cluster: administration of a multi-node system is usually troublesome. What would you choose if equivalent performance were available with a single node and plain SQL?  Enhanced OLAP capability with GPU/SSD: performance equivalent to the previous-generation DWH appliance  1/10 the system cost  DBMS functionality identical to PostgreSQL, which has a widespread and long-standing user base. HeteroServer 120GS (Mar-2018, planned). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -35
  • 36. (Ref) GTC2017 chose “SSD-to-GPU Direct SQL” as a top-5 finalist. Our R&D was chosen as a top-5 finalist among the 138 posters from all over the world (sadly, it took second place in the voting...). GPU Technology Conference 2017, 8th~11th May, San Jose. As of May-2017, measured throughput was 4.5GB/s with an SSD x3 configuration, against a theoretical 6.6GB/s. Why? DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -36
  • 37. Investigation using the visual profiler: time consumed by the sub-kernels launched via dynamic parallelism was too small (less than 10%). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -37
  • 38. Investigation using the visual profiler: concurrency of the GPU kernels for SQL processing was reduced by synchronization on the GPU sub-kernels. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -38
  • 39. Improvement by reducing in-GPU synchronization points: GPU resource usage is back at a reasonable level, so storage limits the performance again — the GPUs themselves are capable of processing 10GB of data per second. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -39
  • 40. Towards 10GB/s query processing throughput ▌Reduction of in-GPU kernel synchronization points  GpuScan and GpuPreAgg have already been reworked.  Work on GpuJoin is in progress. ▌Improvement of PG-Strom’s GPU task scheduler  Eliminating dynamic parallelism and using unified memory reduces unnecessary context switches caused by external GPU servers. ▌GpuScan + GpuJoin + GpuPreAgg combined kernel  Little CPU interaction on typical analytics / reporting workloads.  Execution of JOIN/GROUP BY at the wire speed of the table scan. ▌Optimization for md-raid0 (striping)  To supply a 10GB/s data stream, driver performance must be improved for multi-NVMe-SSD configurations. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -40
  • 41. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (1/5). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -41
        SELECT cat, count(*), avg(x)
          FROM t0 JOIN t1 ON t0.id = t1.id
         WHERE y like ‘%abc%’
         GROUP BY cat;
    Decomposition: SCAN = “WHERE y like ‘%abc%’” (GpuScan); JOIN = “t0 JOIN t1 ON t0.id = t1.id” (GpuJoin); GROUP BY / Aggregation = “count(*), avg(x) GROUP BY cat” (Agg + GpuPreAgg) → Results.
  • 42. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (2/5): a naive replacement of the existing logic — GpuScan, GpuJoin, and GpuPreAgg kernels each with their own DMA buffer — leads to ping-ponging of data transfers between CPU and GPU before Agg (PostgreSQL) produces the results. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -42
  • 43. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (3/5): evaluation of the WHERE-clause is simple enough, and GpuJoin/GpuPreAgg already support a combined GPU kernel — so GpuScan + GpuJoin are merged first, leaving a single intermediate DMA buffer before GpuPreAgg and Agg (PostgreSQL). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -43
  • 44. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (4/5): extreme data reduction on the GPU side by GROUP BY / aggregation, without any interaction with the CPU — the data entering the combined kernel is large, what leaves it is small. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -44
  • 45. GpuScan + GpuJoin + GpuPreAgg Combined Kernel (5/5): only 1.2kB was written back to the CPU side after a 10GB table scan. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -45
  • 46. Towards PG-Strom v2.0: Dec-2017 — PG-Strom v2.0 beta; Mar-2018 — PG-Strom v2.0 GA and HeteroServer 120GS. New-feature development (open-source activity):  SSD-to-GPU Direct SQL Execution and related features  rework of the GPU task scheduler  on-GPU data store (as a foreign table)  columnar cache mechanism  in-database analytics and machine-learning library. QA & release:  testing / stabilization  packaging  documentation. Business activity: early-adopter development. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -46
  • 47. ~Beyond the hardware limitation~ Idea of The Intelligent Storage
  • 48. Idea of the intelligent storage (1/2): bridging the OLTP world and the OLAP world. Row-to-column translation on the SSD enables write-by-row, read-as-column: transactional data is written in row format; the RowCol translation logic on the SSD transforms the data format and picks up only the referenced data, so analytic reads see column-format, pre-processed records (much smaller). SQL execution on top of column data pulls out the maximum performance of the GPU. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -48
  • 49. Why GPU prefers the columnar storage format. DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -49
    ▌Row format (transactional data) — random memory access: more memory transactions, low usage ratio of the memory bus.
    ▌Column format (analytic data) — coalesced memory access: the fewest memory transactions, maximum usage ratio of the memory bus.
    With a 256-bit memory transaction: in column format, 32bit x 8 = 256bit of the transaction are valid (100% usage ratio); in row format, only 32bit of the 256bit transaction are valid (12.5% usage ratio). Transactional (row) data leads to inefficient memory-bus usage by the GPU cores.
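To see the difference in code, a small self-contained CUDA sketch (not PG-Strom's kernels): adjacent threads reading from a column array touch consecutive addresses, which the hardware coalesces into a single memory transaction, whereas reading one field out of an array of row structs strides through memory and wastes most of each transaction.

    struct row { int id; float x, y, z; };   /* row format: 16 bytes per record */

    /* Column format: thread i reads x[i]; a warp's loads are consecutive
     * 32bit words, so they coalesce into full-width memory transactions. */
    __global__ void sum_x_columnar(const float *x, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(out, x[i]);
    }

    /* Row format: thread i reads rows[i].x; loads are 16 bytes apart, so
     * each memory transaction carries mostly unreferenced bytes. */
    __global__ void sum_x_rowwise(const struct row *rows, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(out, rows[i].x);
    }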
  • 50. Idea of the intelligent storage (2/2). DB Tech Showcase 2017 - GPU/SSD Accelerates PostgreSQL -50
    Large PostgreSQL tables on a “SQL-aware” NVMe-SSD → SSD-to-GPU P2P DMA over the PCIe bus → WHERE-clause / JOIN / GROUP BY on the GPU → PostgreSQL.
    Row-to-column transformation: extracting only the referenced attributes makes it possible to pull out an “effective bandwidth” beyond the PCIe limitation.
    “Ultra-parallel” SQL execution by GPU: the columnar data format, the optimal one for GPUs, pulls out the maximum computing capability for SQL processing.
    Write as row data: keeps the performance of usual transactional workloads, and simultaneously allows real-time analytics / reporting.