20210301_PGconf_Online_GPU_PostGIS_GiST_Index

GPU version of PostGIS and GiST-Index
～A new horizon of geospatial data analytics～
HeteroDB,Inc
Chief Architect & CEO
KaiGai Kohei <kaigai@heterodb.com>

about Us
 KaiGai Kohei (海外浩平)
 Chief Architect & CEO of HeteroDB
 Contributor of PostgreSQL (2006-)
 Primary Developer of PG-Strom (2012-)
 Interested in: Big-data, GPU, NVME/PMEM, ...
about Myself
about HeteroDB
 Established: 4th-Jul-2017
 Location: Shinagawa, Tokyo, Japan
 Businesses:
✓ Development of high-performance data-processing
software on top of heterogeneous architecture.
✓ Technology consulting service on GPU&DB area.
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
2
PG-Strom

What is PG-Strom?
3
 Adds custom-scan/join/aggregate path with GPU execution
 SQL to GPU binary code generation & Just-in-time compilation
 Designed for IoT/M2M grade log-data processing
 Closely connected NVME-SSD and GPU using P2P data transfer [I/O acceleration]
 Geospatial analytics with GPU-version of PostGIS and GiST-index support
PG-Strom: An extension of PostgreSQL to pull out maximum capability of
GPU and NVME for processing of terabytes scale data
App
GPU
off-loading
for IoT/Big-Data
for ML/Analytics
PG-Strom’s Features
➢ JIT of GPU code with SQL &
Transparent SQL acceleration
➢ SSD-to-GPU Direct SQL
➢ Columnar Store (Arrow_Fdw)
➢ Asymmetric Partition-wise
JOIN/GROUP BY
➢ GPU Memory Store
➢ PostGIS support
➢ GiST-Index on GPU
➢ BRIN-Index support
➢ NVME-over-Fabric support
➢ Data-frame exchange for
Python scripts
NEW

GPU’s overview
4
Over 10years history in HPC, then massive popularization in Machine-Learning
NVIDIA Tesla V100
Super Computer
(TITEC; TSUBAME3.0) Computer Graphics Machine-Learning
More than thousands cores and TB/s grade memory bandwidth on a chip
for highly computing intensive workloads.
Simulation

Target: Fast data search of mobile devices
▌Mobile devices
 Location (Longitude, Latitude) is updated very frequently.
 Its log data often contains (device_id, timestamp, location (point), other attributes)
▌Area definitions
 Relatively small items, and almost static data.
 Polygon often has very complicated form, thus heavy “collision detection”.
▌Purpose
 Area marketing, Logistics analytics, Advertisement delivery, Emergency Alert, etc…
Latest Location
(Point)
Area definition
(Polygon)
Mobile device
Extract mobile devices
within the target areas
5
GPS

GPU-version PostGIS
6

about PostGIS (1/2)
 Extension to add geometry type, functions and operators.
 First release at 2005, then contentious development over 15 years.
 More than 400 functions / operators
 R-tree on GiST-index framework
 Parallel query execution
© GAIA RESOURCES © OSGeo © KTGIS.net
7

about PostGIS (2/2）
8
St_Distance(a,b)
St_Crosses(a,b)
St_Contains(a,b)

PostGIS’s optimization (1/2)
▌Bounding Box
 The least rectangle that contains a polygon (that is often very complicated).
 (x1,y1) - (x2,y2) form by FP32 (= 16bytes)
 PostGIS assigns a bounding-box when it stores geometry values
 It allows to skip heavy operations prior to geolocational operations
obviously
disjointed
Tokyo
Germany
France
9

Why GPU is capable for SQL acceleration?
11
▌GPU’s characteristics
 SIMT (Single-Instruction Multiple-Threads) architecture
 More than thousands processor cores and 1.0TB/s grade memory band
➔ Designed for “same operations on large number of values”
▌SQL characteristics
 “same operations on large number of values”,
like evaluation of WHERE-clause, JOIN or GROUP BY.
SQL runs WHERE, JOIN, GROUP BY on very large number of rows
Thousands processor units on GPU
evaluates thousands rows in parallel
CPU Parallel
GPU Parallel

Performing PostGIS functions on GPU (1/2)
postgres=# ¥d _gistest
Table "public._gistest"
Column | Type | Collation | Nullable | Default
--------+----------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('_gistest_id_seq'::regclass)
a | geometry | | |
b | geometry | | |
postgres=# explain verbose select * from _gistest where st_contains(a,b);
QUERY PLAN
------------------------------------------------------------------------------------
Custom Scan (GpuScan) on public._gistest (cost=4251.50..4251.50 rows=1 width=196)
Output: id, a, b
GPU Filter: st_contains(_gistest.a, _gistest.b)
GPU Preference: None
Kernel Source: /var/lib/pgdata/pgsql_tmp/pgsql_tmp_strom_21028.4.gpu
Kernel Binary: /var/lib/pgdata/pgsql_tmp/pgsql_tmp_strom_21028.5.ptx
(6 rows)
12

Performing PostGIS functions on GPU (2/2)
$ less /var/lib/pgdata/pgsql_tmp/pgsql_tmp_strom_21028.4.gpu
:
#include "cuda_postgis.h"
#include "cuda_gpuscan.h"
DEVICE_FUNCTION(cl_bool)
gpuscan_quals_eval(kern_context *kcxt,
kern_data_store *kds,
ItemPointerData *t_self,
HeapTupleHeaderData *htup)
{
void *addr __attribute__((unused));
pg_geometry_t KVAR_2;
pg_geometry_t KVAR_3;
assert(htup != NULL);
EXTRACT_HEAP_TUPLE_BEGIN(addr, kds, htup);
EXTRACT_HEAP_TUPLE_NEXT(addr);
pg_datum_ref(kcxt,KVAR_2,addr); // pg_geometry_t
EXTRACT_HEAP_TUPLE_NEXT(addr);
pg_datum_ref(kcxt,KVAR_3,addr); // pg_geometry_t
EXTRACT_HEAP_TUPLE_END();
return EVAL(pgfn_st_contains(kcxt, KVAR_2, KVAR_3));
}
:
13
Load the geometry
values from column-A
and column-B
Calls GPU-revision of
st_contains() for each thread
GPU code automatically
generated for evaluation
of WHERE-clause

Basic performance (1/2)
--- GpuScan (GPU-rev PostGIS)
=# SELECT count(*) FROM ft
WHERE st_contains('polygon ((10 10,90 10,90 12,12 12,12 88,90 88,90 90,¥
10 90,10 10))’, st_makepoint(x,y));
count
--------
236610
(1 row)
Time: 44.680 ms
--- Vanilla PostGIS
=# SET pg_strom.enabled = off;
SET
=# SELECT count(*) FROM tt
WHERE st_contains('polygon ((10 10,90 10,90 12,12 12,12 88,90 88,90 90,¥
10 90,10 10))', st_makepoint(x,y));
count
--------
236610
(1 row)
Time: 622.314 ms
Count number of points in the specified
area from 5 million points.
14

Basic performance (2/2)
15
(100,100)
(0,0)
(90,90)
(90,10)
(90,12)
(12,12)
(12,88) (90,88)
(10,90)
(10,10)

GPU-version of PostGIS
 geometry st_makepoint(float8, float8[, float8[, float8]])
 float8 st_distance(geometry,geometry)
 bool st_dwithin(geometry,geometry,float8)
 bool st_contains(geometry,geometry)
 bool st_crosses(geometry,geometry)
 text st_relate(geometry,geometry)
 ...and more in the future version
Current Status of the supported functions
16

Extract Points within Polygon
17
Number of Polygons x Points combination is too large even for GPU
Latest Location
(Point)
Area definition
(Polygon)
Mobile device
GPS
100K-10M devices
100 - 100K polygons
10M-1T
Combinations?

GiST-Index on GPU
18

How GiST-Index (R-tree) works
▌How GiST-Index (R-tree) works
✓ R1 is a rectangle [(Xmin,Ymin) – (Xmax,Ymax)] that contains all of (R3, R4, R5) and their pointers
✓ R4 is a rectangle that contains all of (R11,R12) and their pointers
✓ R12 is a rectangle that contains the target geometry and its ItemPointer
✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched.
✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree.
(xmin,ymin)
(xmax,ymax)
19
Search Key (Lon,Lat)
〇〇

(xmin,ymin)
(xmax,ymax)
20
〇〇
× 〇 ×

(xmin,ymin)
(xmax,ymax)
21
〇〇
× 〇 ×
〇
×

(xmin,ymin)
(xmax,ymax)
22
〇〇
× 〇 ×
〇
×
× ×

GPU version of GiST-Index
23
▌Overview
 PG-Strom may utilize GiST-Index for joining a table with area definition (polygons; small)
and a table with locational data (points; large).
 Both the area definition table and its index are loaded onto GPU on GpuJoin.
 GpuJoin looks at the GiST-index first for rough pruning, using bounding-box
 Then, evaluate the “collision detection” with polygon value on the table
 Above operations by thousands cores of GPU in parallel,
so we expected its search performance is better, but...
collision detection of Polygons x Points as a part of GpuJoin
GiST-Index (R-tree)
Area Definitions
(Polygon)
A table with
Location Data
(Points)
Thousands
threads search
R-tree index
in parallel

Background) GPU’s Thread Scheduling
25
 GPU groups its processor cores by Streaming Multiprocessor(SM)
✓ Cores, Registers and L1 cache (shared memory) are managed per streaming-multiprocessor
 A SM can run up to 1024 threads simultaneously on the shared 64 cores/SM
✓ Like a very large scale hyper-threading in CPU
 Threads are scheduled per warp (32threads).
called SIMT (Single-Instruction Multiple-Thread) architecture
✓ Uniformed workloads, like matrix operations, fully utilizes the processor cores
✓ If a particular thread tends to consume large cycles, other threads in a warp must wait.
GPU Block Diagram
(Tesla V100; 80SMs)
Streaming Multiprocessors
(64CUDA cores/SM)
●●●…●●●
●●●…●●●
●●●…●●●
●●●…●●●
●●●…●●●
●●●…●●●
Thread Group
(1～1024 Threads)
Warp (32 threads)

Background) Internal design of GpuJoin
26
SELECT * FROM A, B
WHERE A.id = B.id;
Table-A
512 threads fetches
512 rows at once.
Table-B
● ● ● ● ● ●
〇 × 〇 × 〇 ×
GpuHashJoin / GpuNestLoop
N = __syncthreads_count(...)
Write out the JOIN results (if any)
thread-0 allocates result buffer for N-items
Fetch next frame,
and repeat the
above steps.
GpuHashJoin
• Hash calculation
• Search Hash table
• Evaluation of JOIN condition
GpuNestLoop
• Evaluation of JOIN condition
➔little differences in processing
cycles between the threads
References

Issues of index-search in GPU
27
Large variance of thread’s processing time, that leads synchronization wait!
Table-A
● ● ● ● ● ●
Reference
Index-A
× × × ×
×
〇
N = __syncthreads_count(...)
Reference
Walk down to the leaf
node of R-tree, but no
matched item
No matched items,
by just looking at
the root node
of R-tree
Walk down to the leaf
node of R-tree, and
evaluate JOIN conditions
Other GPU cores must be idle by the
completion of the longest operations.

Naive implementation of GPU GiST-Index Search (2020-Sep)
28
Little utilization ratio of GPU cores, due to inter-core synchronization
● ● ● ● ● ● ●
●
Every thread loads a row from the input buffer,
and extract the key for GiST-Index.
Hit on the GiST-Index?
Search the GiST-Index by the key
Is the JOIN-condition true?
Evaluation of the JOIN-condition
write out JOIN-results (if found)
nitems = __syncthreads_count(found);
Repeat
found=true
found=false
No
No
A thread that found a matched entry can
block other 511 threads in the same
thread-group until completion of the
evaluation of JOIN-condition.
Very low efficiency of GPU core usage.

A new more optimal implementation (2020-Nov)
29
Minimization of the synchronization point, to pull up utilization ratio
● ● ● ● ● ● ●
●
Every thread loads a row from the input buffer,
and extract the key for GiST-Index.
Search the GiST-Index by the key
Hit on the GiST-Index?
Allocation of the temporary buffer,
to save the pointers that hit GiST-Index above.
Consumption of temporary
buffer exceeds 512 items.
__syncthreads()
Evaluation of the JOIN-condition
Is the JOIN-condition true?
No
No
nitems = __syncthreads_count(found)
Write out JOIN results
Repeat
found=false
found=true
As long as temporary buffer has space,
threads continue to fetch rows and
search GiST-Index.
➔ Then, evaluate JOIN-conditions
by all the threads at once.

EXPLAIN of the simple test (1/2) - CPU version
postgres=# EXPLAIN (analyze, costs off)
GROUP BY n03_001,n03_004;
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize GroupAggregate (actual time=30709.855..30710.080 rows=63 loops=1)
Group Key: j.n03_001, j.n03_004
-> Gather Merge (actual time=30709.838..30732.270 rows=244 loops=1)
Workers Planned: 4
Workers Launched: 3
-> Partial GroupAggregate (actual time=30687.466..30687.572 rows=61 loops=4)
Group Key: j.n03_001, j.n03_004
-> Sort (actual time=30687.452..30687.475 rows=638 loops=4)
Sort Key: j.n03_001, j.n03_004
Sort Method: quicksort Memory: 73kB
-> Nested Loop (actual time=71.496..30686.278 rows=638 loops=4)
-> Parallel Seq Scan on geopoint p (actual time=0.012..207.553 rows=2500000 loops=4)
-> Index Scan using geo_japan_geom_idx on geo_japan j ¥
(actual time=0.012..0.012 rows=0 loops=10000000)
Index Cond: (geom ~ st_makepoint(p.x, p.y))
Filter: (((n03_001)::text ~~ ‘Tokyo’::text) AND
st_contains(geom, st_makepoint(p.x, p.y)))
Rows Removed by Filter: 0
Planning Time: 0.156 ms
Execution Time: 30732.422 ms
(21 rows)
31

EXPLAIN of the simple test (2/2) - GPU version
postgres=# EXPLAIN (analyze, costs off)
GROUP BY n03_001,n03_004;
QUERY PLAN
--------------------------------------------------------------------------------------------
GroupAggregate (actual time=329.118..329.139 rows=63 loops=1)
Group Key: j.n03_001, j.n03_004
-> Sort (actual time=329.107..329.110 rows=63 loops=1)
Sort Key: j.n03_001, j.n03_004
Sort Method: quicksort Memory: 29kB
-> Custom Scan (GpuPreAgg) (actual time=328.902..328.911 rows=63 loops=1)
Reduction: Local
Combined GpuJoin: enabled
-> Custom Scan (GpuJoin) on fgeopoint p (never executed)
Outer Scan: fgeopoint p (never executed)
Depth 1: GpuGiSTJoin(plan nrows: 10000000...60840000, actual nrows: 10000000...2553)
HeapSize: 7841.91KB (estimated: 3113.70KB), IndexSize: 13.28MB
IndexFilter: (j.geom ~ st_makepoint(p.x, p.y)) on geo_japan_geom_idx
Rows Fetched by Index: 4952
JoinQuals: st_contains(j.geom, st_makepoint(p.x, p.y))
-> Seq Scan on geo_japan j (actual time=0.164..17.723 rows=6173 loops=1)
Filter: ((n03_001)::text ~~ ‘Tokyo’::text)
Rows Removed by Filter: 112726
Planning Time: 0.344 ms
Execution Time: 340.415 ms
(20 rows)
Portion executed on GPU
32

Conclusion (1/2)
33
▌GPU-version PostGIS
 Enhancement for parallel execution of PostGIS functions
 Also support GiST-Index (R-tree) of geometry type
 100 times faster results to pickup “points in area” type workloads.
▌Expected Use scenarios
 Area marketing analytics
 Real-time advertisement delivery
 Push event notifications, etc...
➔ GPU+PostGIS allows to run “computing intensive” workloads on your workstation or
cloud instance as like you are usually doing.
▌Resources
 GitHub: https://github.com/heterodb/pg-strom
 Document: http://heterodb.github.io/pg-strom/ja/
 Contact: Tw: @kkaigai / ✉ kaigai@heterodb.com

Conclusion (2/2)
34
▌Resources
 GitHub: https://github.com/heterodb/pg-strom
 Document: http://heterodb.github.io/pg-strom/ja/
 Contact: Tw: @kkaigai / ✉ kaigai@heterodb.com
PG-Strom project welcomes your participation. Please contact us.

20210301_PGconf_Online_GPU_PostGIS_GiST_Index

20210301_PGconf_Online_GPU_PostGIS_GiST_Index

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 20210301_PGconf_Online_GPU_PostGIS_GiST_Index

Similar to 20210301_PGconf_Online_GPU_PostGIS_GiST_Index (20)

More from Kohei KaiGai

More from Kohei KaiGai (20)

Recently uploaded

Recently uploaded (20)

20210301_PGconf_Online_GPU_PostGIS_GiST_Index