1. Performing SQL with SSD-to-GPU P2P Transfer
かぴばらの旦那 / Herr.Wasserschwein
<kaigai@kaigai.gr.jp>
2. The PG-Strom Project
Feedback from the PG-Strom v1.0 development
[Diagram: PG-Strom is a PostgreSQL extension that hooks the query optimizer and query executor, sitting between the application, the SQL parser, the storage manager, and the GPU.]
• Computing-intensive workloads
  (statistics, science, marketing, etc.)
  → addressed by PL/CUDA + Matrix-Array
• I/O-intensive workloads
  (DWH, ETL, reporting, etc.)
  → addressed by SSD-to-GPU P2P DMA
3. The PG-Strom Project
Architecture of an x86 server
[Diagram: RAM attaches directly to the CPU; the NVMe-SSD (PCIe x4~x8) and the GPU (PCIe x16) sit on the PCI bus; other slow devices hang off the PCH.]
4. The PG-Strom Project
Architecture of an x86 server
[Diagram: a normal I/O READ first copies a disk block from the NVMe-SSD into a disk buffer in RAM. Catalog spec of such SSDs: 2GB~6GB/s.]
5. The PG-Strom Project
Architecture of an x86 server
[Diagram: the data then travels from the disk buffer in RAM back across the PCI bus to the GPU, so the same block crosses the PCIe bus twice.]
6. The PG-Strom Project
What I want to do
[Diagram: load the disk block from the NVMe-SSD directly into the GPU over the PCI bus, and return only the result buffer to RAM, bypassing the disk buffer entirely.]
7. The PG-Strom Project
What I want to do
[Diagram: large PostgreSQL tables stream from the NVMe-SSD straight into the GPU, where the WHERE clause, JOIN (against small inner tables) and GROUP BY run, making the data size much smaller before it ever reaches RAM.]
8. The PG-Strom Project
What I want to do
[Same diagram: the GPU-side WHERE clause, JOIN and GROUP BY path described above is already implemented as of this talk.]
9. The PG-Strom Project
Element technology: GPUDirect RDMA by NVIDIA
An API to map GPU device memory into the physical address space of the host system.
This allows GPU device memory to be specified as the destination address of a DMA transfer from storage.
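As a rough illustration, a kernel module can pin a GPU buffer and obtain its physical pages through this API roughly as sketched below. This is a minimal sketch based on NVIDIA's public nv-p2p.h kernel interface, not the actual NVMe-Strom source; the function and callback names (map_gpu_buffer, my_free_callback) are made up for illustration.

  #include <nv-p2p.h>   /* GPUDirect RDMA kernel interface, ships with the nvidia driver */

  /* Called back by the nvidia driver if the GPU mapping is revoked;
   * a real driver must stop in-flight DMA and release the page table here. */
  static void my_free_callback(void *data)
  {
  }

  /* Pin a range of GPU device memory and obtain its physical pages. */
  static int map_gpu_buffer(uint64_t gpu_vaddr, uint64_t length,
                            struct nvidia_p2p_page_table **p2p_pgtbl)
  {
      /* p2p_token and va_space are legacy parameters; 0 is the usual value */
      int rc = nvidia_p2p_get_pages(0, 0, gpu_vaddr, length,
                                    p2p_pgtbl, my_free_callback, NULL);
      if (rc)
          return rc;
      /*
       * (*p2p_pgtbl)->pages[i]->physical_address now holds host physical
       * addresses of the GPU memory, usable as DMA destinations for the
       * NVMe SSD's read commands.
       */
      return 0;
  }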
10. The PG-Strom Project
NVMe-Strom Driver
[Diagram: the baseline software stack. In user space, PostgreSQL with pg-strom; in kernel space, the NVMe-Strom module exposes /proc/nvme-strom, alongside the VFS, the page cache, the NVMe SSD driver, and the nvidia driver. The normal path loads data with read(2) through the VFS and page cache.]
11. The PG-Strom Project
NVMe-Strom Driver
[Diagram, step 1: pg-strom allocates GPU device memory with cuMemAlloc(), backed by the nvidia driver.]
12. The PG-Strom Project
NVMe-Strom Driver
[Diagram, step 2: pg-strom maps that GPU device memory via ioctl(2) on /proc/nvme-strom, making it visible to NVMe-Strom in kernel space.]
13. The PG-Strom Project
NVMe-Strom Driver
[Diagram, step 3: pg-strom supplies a file offset; NVMe-Strom translates it through the VFS into the underlying block number on the SSD.]
14. The PG-Strom Project
NVMe-Strom Driver
[Diagram, step 4: NVMe-Strom issues a DMA request for those blocks to the NVMe SSD driver.]
15. The PG-Strom Project
NVMe-Strom Driver
[Diagram, step 5: the NVMe SSD executes the SSD-to-GPU peer-to-peer DMA, writing the blocks directly into the mapped GPU device memory.]
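Put together, the user-space side of the five steps above could be driven by a call sequence like the sketch below. The ioctl command numbers and the strom_dma_request layout are hypothetical stand-ins, not the real NVMe-Strom ABI; only open(2), ioctl(2) and the CUDA driver API are real interfaces here, and real code would check every return value.

  #include <cuda.h>
  #include <fcntl.h>
  #include <sys/ioctl.h>

  struct strom_dma_request {          /* hypothetical request layout */
      unsigned long gpu_mem_handle;   /* mapped GPU buffer            */
      int           file_desc;        /* data file on the NVMe-SSD    */
      long          file_offset;      /* where to start reading       */
      long          length;           /* number of bytes to transfer  */
  };

  /* hypothetical ioctl commands; the real ABI is defined by nvme-strom */
  #define STROM_IOCTL_MAP_GPU_MEMORY  _IOWR('S', 1, unsigned long)
  #define STROM_IOCTL_MEMCPY_SSD2GPU  _IOW('S', 2, struct strom_dma_request)

  /* Assumes cuInit() was called and a CUDA context is current. */
  static void ssd2gpu_read(int data_fd, long offset, long length)
  {
      int strom_fd = open("/proc/nvme-strom", O_RDONLY);
      CUdeviceptr dbuf;

      /* step 1: allocate GPU device memory */
      cuMemAlloc(&dbuf, length);

      /* step 2: let NVMe-Strom map the GPU buffer into kernel space */
      unsigned long handle = (unsigned long)dbuf;
      ioctl(strom_fd, STROM_IOCTL_MAP_GPU_MEMORY, &handle);

      /* steps 3-5: the driver translates the file offset to block numbers,
       * issues the DMA request, and the SSD writes directly to the GPU */
      struct strom_dma_request req = {
          .gpu_mem_handle = handle,
          .file_desc      = data_fd,
          .file_offset    = offset,
          .length         = length,
      };
      ioctl(strom_fd, STROM_IOCTL_MEMCPY_SSD2GPU, &req);
  }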
16. The PG-Strom Project
Raw I/O Performance
Six 32MB buffers were used; an asynchronous DMA was kicked each time a buffer became empty.
Environment:
  CPU: Xeon E5-2670 v3, RAM: 64GB
  Intel SSD 750 (400GB; PCIe x4)
  NVIDIA Tesla K20c (2496 cores; 706MHz, 5GB GDDR5; 208GB/s)
  OS: CentOS 7 (3.10.0-327.18.2.el7.x86_64), Filesystem: Ext4
[Chart annotation: the measured throughput reaches the SSD's catalog spec.]
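The buffering strategy is a plain multi-buffer pipeline. The self-contained sketch below shows only the scheduling idea; the stub functions kick_ssd2gpu_dma() and dma_done() are assumptions standing in for the NVMe-Strom ioctl(2) calls, and the file size is assumed to be a multiple of the buffer size.

  #include <stdio.h>

  #define NBUFFERS 6                 /* six buffers ...      */
  #define BUFSZ    (32L << 20)       /* ... of 32MB each     */

  /* stubs standing in for the real NVMe-Strom DMA kick and completion check */
  static void kick_ssd2gpu_dma(int buf, long offset)
  {
      printf("buffer %d: async DMA kicked at offset %ld\n", buf, offset);
  }
  static int dma_done(int buf) { return 1; }  /* pretend instant completion */

  int main(void)
  {
      long file_size = 16 * BUFSZ;   /* demo: 512MB, a multiple of BUFSZ */
      long offset = 0, done = 0;
      int  inflight[NBUFFERS] = {0};

      /* fill the pipeline: one in-flight DMA per buffer */
      for (int i = 0; i < NBUFFERS && offset < file_size; i++) {
          kick_ssd2gpu_dma(i, offset);
          inflight[i] = 1;
          offset += BUFSZ;
      }
      /* whenever a buffer becomes empty, immediately kick the next DMA */
      while (done < file_size) {
          for (int i = 0; i < NBUFFERS; i++) {
              if (!inflight[i] || !dma_done(i))
                  continue;
              inflight[i] = 0;
              done += BUFSZ;
              if (offset < file_size) {
                  kick_ssd2gpu_dma(i, offset);
                  inflight[i] = 1;
                  offset += BUFSZ;
              }
          }
      }
      return 0;
  }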
17. The PG-Strom Project
NVMe-SSD used for this measurement
  Capacity | Seq Read 128KB | Seq Write 128KB | Random Read 4KB | Random Write 4KB | Interface
  ---------+----------------+-----------------+-----------------+------------------+------------
  400GB    | 2,200MB/s      | 900MB/s         | 430,000 IOPS    | 230,000 IOPS     | PCIe 3.0 x4
  800GB    | 2,100MB/s      | 800MB/s         | 420,000 IOPS    | 210,000 IOPS     | PCIe 3.0 x4
  1.2TB    | 2,500MB/s      | 1,200MB/s       | 460,000 IOPS    | 290,000 IOPS     | PCIe 3.0 x4
Besides these, operation on a Samsung PM1725 NVMe SSD (1.6TB, 6GB/s) has been reported,
with raw SSD-to-GPU I/O reaching 5634MB/s:
https://github.com/kaigai/nvme-kmod/issues/1
18. The PG-Strom Project
SQL Scan Performance
▌What this measurement tells us
Performance limit of the existing storage layer:
  64GB / 140sec = 468MB/s, i.e. about 20% extra cost on top of the raw-I/O throughput (587MB/s)
Improvement by NVMe-Strom:
  64GB / 43sec = 1524MB/s
19. The PG-Strom Project
Queries used for the measurement
CREATE TABLE t_64g (id   int not null,
                    x    float not null,
                    y    float not null,
                    z    float not null,
                    memo text);
INSERT INTO t_64g (SELECT x, random()*1000, random()*1000,
                          random()*1000, md5(x::text)
                     FROM generate_series(1,700000000) x);

postgres=# \d+
                     List of relations
 Schema | Name  | Type  | Owner  |  Size  | Description
--------+-------+-------+--------+--------+-------------
 public | t     | table | kaigai | 965 MB |
 public | t_64g | table | kaigai | 66 GB  |

Query-1) Scan query with a simple WHERE-clause
  SELECT * FROM t WHERE x BETWEEN y-2 AND y+2;
Query-2) Scan query with a complicated WHERE-clause
  SELECT * FROM t_64g WHERE sqrt((x-200)^2 + (y-300)^2 +
                                 (z-400)^2) < 10;
Query-3) Scan query with text matching
  SELECT * FROM t WHERE memo LIKE '%abcd%';
20. The PG-Strom Project
Development Roadmap
① NVMe-Strom driver: the basic functionality
  • Host mapping of GPU device memory, and issuing P2P DMA requests for SSD-to-GPU transfer
② PG-Strom: integration of GpuScan + NVMe-Strom
  • Support of CPU+GPU hybrid parallelism in PostgreSQL v9.6, and peer-to-peer data loading by NVMe-Strom
③ PG-Strom: JOIN/GROUP BY support
  • Support of the new optimizer in PostgreSQL v9.6, and integration with GpuScan for simple scans
  (We are here!!)
④ NVMe-Strom driver: RAID-0/1 support
  • Striping READ on RAID-0/1 volumes
⑤ Quality improvement and stabilization
  • Test, test, test, debug
⑥ PG-Strom v2.0!! (2017/2Q~3Q)
21. The PG-Strom Project
Target for PG-Strom v2.0
Aiming at up to 20GB/s data processing capability on a single node.
[Diagram: two pairs of dual NVMe-SSDs (PCIe x8, 5.0GB/s per pair) feed two GPUs (PCIe x16, ~10GB/s each).]
• Dual NVMe-SSD + RAID0/1 support
• Loading SSD blocks onto the GPU at 10GB/s throughput
• GPU parallel processing by thousands of cores