1. GPU version of PostGIS and GiST-Index
~A new horizon of geospatial data analytics~
HeteroDB,Inc
Chief Architect & CEO
KaiGai Kohei <kaigai@heterodb.com>
2. about Us
KaiGai Kohei (海外浩平)
Chief Architect & CEO of HeteroDB
Contributor of PostgreSQL (2006-)
Primary Developer of PG-Strom (2012-)
Interested in: Big-data, GPU, NVME/PMEM, ...
about Myself
about HeteroDB
Established: 4th-Jul-2017
Location: Shinagawa, Tokyo, Japan
Businesses:
✓ Development of high-performance data-processing
software on top of heterogeneous architecture.
✓ Technology consulting service on GPU&DB area.
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
2
PG-Strom
3. What is PG-Strom?
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
3
Adds custom-scan/join/aggregate path with GPU execution
SQL to GPU binary code generation & Just-in-time compilation
Designed for IoT/M2M grade log-data processing
Closely connected NVME-SSD and GPU using P2P data transfer [I/O acceleration]
Geospatial analytics with GPU-version of PostGIS and GiST-index support
PG-Strom: An extension of PostgreSQL to pull out maximum capability of
GPU and NVME for processing of terabytes scale data
App
GPU
off-loading
for IoT/Big-Data
for ML/Analytics
PG-Strom’s Features
➢ JIT of GPU code with SQL &
Transparent SQL acceleration
➢ SSD-to-GPU Direct SQL
➢ Columnar Store (Arrow_Fdw)
➢ Asymmetric Partition-wise
JOIN/GROUP BY
➢ GPU Memory Store
➢ PostGIS support
➢ GiST-Index on GPU
➢ BRIN-Index support
➢ NVME-over-Fabric support
➢ Data-frame exchange for
Python scripts
NEW
4. GPU’s overview
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
4
Over 10years history in HPC, then massive popularization in Machine-Learning
NVIDIA Tesla V100
Super Computer
(TITEC; TSUBAME3.0) Computer Graphics Machine-Learning
More than thousands cores and TB/s grade memory bandwidth on a chip
for highly computing intensive workloads.
Simulation
5. Target: Fast data search of mobile devices
▌Mobile devices
Location (Longitude, Latitude) is updated very frequently.
Its log data often contains (device_id, timestamp, location (point), other attributes)
▌Area definitions
Relatively small items, and almost static data.
Polygon often has very complicated form, thus heavy “collision detection”.
▌Purpose
Area marketing, Logistics analytics, Advertisement delivery, Emergency Alert, etc…
Latest Location
(Point)
Area definition
(Polygon)
Mobile device
Extract mobile devices
within the target areas
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
5
GPS
9. PostGIS’s optimization (1/2)
▌Bounding Box
The least rectangle that contains a polygon (that is often very complicated).
(x1,y1) - (x2,y2) form by FP32 (= 16bytes)
PostGIS assigns a bounding-box when it stores geometry values
It allows to skip heavy operations prior to geolocational operations
obviously
disjointed
Tokyo
Germany
France
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
9
10. PostGIS’s optimization (2/2)
▌R-Tree on GiST-Index
GiST (Generalized Search Tree) - A framework of index at PostgreSQL
PostGIS implements R-tree for Geometry type
Check for contains (‘@’ operator), overlaps (‘&&’ operator)
Works efficiently to check relationship between very large number of points
and many polygons.
# Data published by Geospatial Information Authority of Japan
$ shp2pgsql N03-20_200101.shp | psql gistest
gistest=# ¥d+
List of relations
Schema | Name | Type | Owner | Size |
--------+-------------+----------+--------+------------+
public | geo_japan | table | kaigai | 243 MB |
gistest=# ¥di+
List of relations
Schema | Name | Type | Owner | Table | Size |
--------+--------------------+-------+--------+-----------+---------+
public | geo_japan_pkey | index | kaigai | geo_japan | 2616 kB |
public | geo_japan_geom_idx | index | kaigai | geo_japan | 14 MB |
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
10
11. Why GPU is capable for SQL acceleration?
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
11
▌GPU’s characteristics
SIMT (Single-Instruction Multiple-Threads) architecture
More than thousands processor cores and 1.0TB/s grade memory band
➔ Designed for “same operations on large number of values”
▌SQL characteristics
“same operations on large number of values”,
like evaluation of WHERE-clause, JOIN or GROUP BY.
SQL runs WHERE, JOIN, GROUP BY on very large number of rows
Thousands processor units on GPU
evaluates thousands rows in parallel
CPU Parallel
GPU Parallel
12. Performing PostGIS functions on GPU (1/2)
postgres=# ¥d _gistest
Table "public._gistest"
Column | Type | Collation | Nullable | Default
--------+----------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('_gistest_id_seq'::regclass)
a | geometry | | |
b | geometry | | |
postgres=# explain verbose select * from _gistest where st_contains(a,b);
QUERY PLAN
------------------------------------------------------------------------------------
Custom Scan (GpuScan) on public._gistest (cost=4251.50..4251.50 rows=1 width=196)
Output: id, a, b
GPU Filter: st_contains(_gistest.a, _gistest.b)
GPU Preference: None
Kernel Source: /var/lib/pgdata/pgsql_tmp/pgsql_tmp_strom_21028.4.gpu
Kernel Binary: /var/lib/pgdata/pgsql_tmp/pgsql_tmp_strom_21028.5.ptx
(6 rows)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
12
13. Performing PostGIS functions on GPU (2/2)
$ less /var/lib/pgdata/pgsql_tmp/pgsql_tmp_strom_21028.4.gpu
:
#include "cuda_postgis.h"
#include "cuda_gpuscan.h"
DEVICE_FUNCTION(cl_bool)
gpuscan_quals_eval(kern_context *kcxt,
kern_data_store *kds,
ItemPointerData *t_self,
HeapTupleHeaderData *htup)
{
void *addr __attribute__((unused));
pg_geometry_t KVAR_2;
pg_geometry_t KVAR_3;
assert(htup != NULL);
EXTRACT_HEAP_TUPLE_BEGIN(addr, kds, htup);
EXTRACT_HEAP_TUPLE_NEXT(addr);
pg_datum_ref(kcxt,KVAR_2,addr); // pg_geometry_t
EXTRACT_HEAP_TUPLE_NEXT(addr);
pg_datum_ref(kcxt,KVAR_3,addr); // pg_geometry_t
EXTRACT_HEAP_TUPLE_END();
return EVAL(pgfn_st_contains(kcxt, KVAR_2, KVAR_3));
}
:
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
13
Load the geometry
values from column-A
and column-B
Calls GPU-revision of
st_contains() for each thread
GPU code automatically
generated for evaluation
of WHERE-clause
14. Basic performance (1/2)
--- GpuScan (GPU-rev PostGIS)
=# SELECT count(*) FROM ft
WHERE st_contains('polygon ((10 10,90 10,90 12,12 12,12 88,90 88,90 90,¥
10 90,10 10))’, st_makepoint(x,y));
count
--------
236610
(1 row)
Time: 44.680 ms
--- Vanilla PostGIS
=# SET pg_strom.enabled = off;
SET
=# SELECT count(*) FROM tt
WHERE st_contains('polygon ((10 10,90 10,90 12,12 12,12 88,90 88,90 90,¥
10 90,10 10))', st_makepoint(x,y));
count
--------
236610
(1 row)
Time: 622.314 ms
Count number of points in the specified
area from 5 million points.
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
14
15. Basic performance (2/2)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
15
(100,100)
(0,0)
(90,90)
(90,10)
(90,12)
(12,12)
(12,88) (90,88)
(10,90)
(10,10)
16. GPU-version of PostGIS
geometry st_makepoint(float8, float8[, float8[, float8]])
float8 st_distance(geometry,geometry)
bool st_dwithin(geometry,geometry,float8)
bool st_contains(geometry,geometry)
bool st_crosses(geometry,geometry)
text st_relate(geometry,geometry)
...and more in the future version
Current Status of the supported functions
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
16
17. Extract Points within Polygon
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
17
Number of Polygons x Points combination is too large even for GPU
Latest Location
(Point)
Area definition
(Polygon)
Mobile device
GPS
100K-10M devices
100 - 100K polygons
10M-1T
Combinations?
19. How GiST-Index (R-tree) works
▌How GiST-Index (R-tree) works
✓ R1 is a rectangle [(Xmin,Ymin) – (Xmax,Ymax)] that contains all of (R3, R4, R5) and their pointers
✓ R4 is a rectangle that contains all of (R11,R12) and their pointers
✓ R12 is a rectangle that contains the target geometry and its ItemPointer
✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched.
✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree.
(xmin,ymin)
(xmax,ymax)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
19
Search Key (Lon,Lat)
〇 〇
20. How GiST-Index (R-tree) works
▌How GiST-Index (R-tree) works
✓ R1 is a rectangle [(Xmin,Ymin) – (Xmax,Ymax)] that contains all of (R3, R4, R5) and their pointers
✓ R4 is a rectangle that contains all of (R11,R12) and their pointers
✓ R12 is a rectangle that contains the target geometry and its ItemPointer
✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched.
✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree.
(xmin,ymin)
(xmax,ymax)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
20
Search Key (Lon,Lat)
〇 〇
× 〇 ×
21. How GiST-Index (R-tree) works
▌How GiST-Index (R-tree) works
✓ R1 is a rectangle [(Xmin,Ymin) – (Xmax,Ymax)] that contains all of (R3, R4, R5) and their pointers
✓ R4 is a rectangle that contains all of (R11,R12) and their pointers
✓ R12 is a rectangle that contains the target geometry and its ItemPointer
✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched.
✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree.
(xmin,ymin)
(xmax,ymax)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
21
Search Key (Lon,Lat)
〇 〇
× 〇 ×
〇
×
22. How GiST-Index (R-tree) works
▌How GiST-Index (R-tree) works
✓ R1 is a rectangle [(Xmin,Ymin) – (Xmax,Ymax)] that contains all of (R3, R4, R5) and their pointers
✓ R4 is a rectangle that contains all of (R11,R12) and their pointers
✓ R12 is a rectangle that contains the target geometry and its ItemPointer
✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched.
✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree.
(xmin,ymin)
(xmax,ymax)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
22
Search Key (Lon,Lat)
〇 〇
× 〇 ×
〇
×
× ×
23. GPU version of GiST-Index
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
23
▌Overview
PG-Strom may utilize GiST-Index for joining a table with area definition (polygons; small)
and a table with locational data (points; large).
Both the area definition table and its index are loaded onto GPU on GpuJoin.
GpuJoin looks at the GiST-index first for rough pruning, using bounding-box
Then, evaluate the “collision detection” with polygon value on the table
Above operations by thousands cores of GPU in parallel,
so we expected its search performance is better, but...
collision detection of Polygons x Points as a part of GpuJoin
GiST-Index (R-tree)
Area Definitions
(Polygon)
A table with
Location Data
(Points)
Thousands
threads search
R-tree index
in parallel
24. Simple Test: Random points and St_Contains (2020-Sep)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
24
(123.0, 20.0)
(154.2, 46.2)
SELECT n03_001,n03_004,count(*)
FROM geo_japan j, geopoint p
WHERE st_contains(j.geom, st_makepoint(x,y))
AND j.n03_001 like ‘Tokyo’
GROUP BY n03_001,n03_004;
10M of randomly
generated
geolocational
points
n03_001 | n03_004 | count
---------+-------------+-------
Tokyo | Akiruno | 105
Tokyo | Miyake | 76
Tokyo | Mitaka | 17
Tokyo | Setagaya-ku | 67
Tokyo | Chuo-ku | 12
Tokyo | Nakano-ku | 18
Tokyo | Hachijo | 105
: : :
Tokyo | Toshima-ku | 14
Tokyo | Adachi-ku | 55
Tokyo | Aogashima | 7
Tokyo | Ome | 117
(63 rows)
CPU-rev: 30.539s
GPU-rev: 33.841s (Slow!)
Area definition data
by the Geospatial Information
Authority of Japan
25. Background) GPU’s Thread Scheduling
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
25
GPU groups its processor cores by Streaming Multiprocessor(SM)
✓ Cores, Registers and L1 cache (shared memory) are managed per streaming-multiprocessor
A SM can run up to 1024 threads simultaneously on the shared 64 cores/SM
✓ Like a very large scale hyper-threading in CPU
Threads are scheduled per warp (32threads).
called SIMT (Single-Instruction Multiple-Thread) architecture
✓ Uniformed workloads, like matrix operations, fully utilizes the processor cores
✓ If a particular thread tends to consume large cycles, other threads in a warp must wait.
GPU Block Diagram
(Tesla V100; 80SMs)
Streaming Multiprocessors
(64CUDA cores/SM)
●●●…●●●
●●●…●●●
●●●…●●●
●●●…●●●
●●●…●●●
●●●…●●●
Thread Group
(1~1024 Threads)
Warp (32 threads)
26. Background) Internal design of GpuJoin
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
26
SELECT * FROM A, B
WHERE A.id = B.id;
Table-A
512 threads fetches
512 rows at once.
Table-B
● ● ● ● ● ●
〇 × 〇 × 〇 ×
GpuHashJoin / GpuNestLoop
N = __syncthreads_count(...)
Write out the JOIN results (if any)
thread-0 allocates result buffer for N-items
Fetch next frame,
and repeat the
above steps.
GpuHashJoin
• Hash calculation
• Search Hash table
• Evaluation of JOIN condition
GpuNestLoop
• Evaluation of JOIN condition
➔little differences in processing
cycles between the threads
References
27. Issues of index-search in GPU
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
27
Large variance of thread’s processing time, that leads synchronization wait!
Table-A
● ● ● ● ● ●
Reference
Index-A
× × × ×
×
〇
N = __syncthreads_count(...)
Reference
Walk down to the leaf
node of R-tree, but no
matched item
No matched items,
by just looking at
the root node
of R-tree
Walk down to the leaf
node of R-tree, and
evaluate JOIN conditions
Other GPU cores must be idle by the
completion of the longest operations.
28. Naive implementation of GPU GiST-Index Search (2020-Sep)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
28
Little utilization ratio of GPU cores, due to inter-core synchronization
● ● ● ● ● ● ●
●
Every thread loads a row from the input buffer,
and extract the key for GiST-Index.
Hit on the GiST-Index?
Search the GiST-Index by the key
Is the JOIN-condition true?
Evaluation of the JOIN-condition
write out JOIN-results (if found)
nitems = __syncthreads_count(found);
Repeat
found=true
found=false
No
No
A thread that found a matched entry can
block other 511 threads in the same
thread-group until completion of the
evaluation of JOIN-condition.
Very low efficiency of GPU core usage.
29. A new more optimal implementation (2020-Nov)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
29
Minimization of the synchronization point, to pull up utilization ratio
● ● ● ● ● ● ●
●
Every thread loads a row from the input buffer,
and extract the key for GiST-Index.
Search the GiST-Index by the key
Hit on the GiST-Index?
Allocation of the temporary buffer,
to save the pointers that hit GiST-Index above.
Consumption of temporary
buffer exceeds 512 items.
__syncthreads()
Evaluation of the JOIN-condition
Is the JOIN-condition true?
No
No
nitems = __syncthreads_count(found)
Write out JOIN results
Repeat
found=false
found=true
As long as temporary buffer has space,
threads continue to fetch rows and
search GiST-Index.
➔ Then, evaluate JOIN-conditions
by all the threads at once.
30. Simple Test: Random points and St_Contains (2020-Nov; the latest)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
30
(123.0, 20.0)
(154.2, 46.2)
SELECT n03_001,n03_004,count(*)
FROM geo_japan j, geopoint p
WHERE st_contains(j.geom, st_makepoint(x,y))
AND j.n03_001 like ‘Tokyo’
GROUP BY n03_001,n03_004;
10M of randomly
generated
geolocational
points
n03_001 | n03_004 | count
---------+-------------+-------
Tokyo | Akiruno | 105
Tokyo | Miyake | 76
Tokyo | Mitaka | 17
Tokyo | Setagaya-ku | 67
Tokyo | Chuo-ku | 12
Tokyo | Nakano-ku | 18
Tokyo | Hachijo | 105
: : :
Tokyo | Toshima-ku | 14
Tokyo | Adachi-ku | 55
Tokyo | Aogashima | 7
Tokyo | Ome | 117
(63 rows)
CPU-rev: 30.539s
GPU-rev: 0.316s
Area definition data
by the Geospatial Information
Authority of Japan
100 times
faster!!
31. EXPLAIN of the simple test (1/2) - CPU version
postgres=# EXPLAIN (analyze, costs off)
SELECT n03_001,n03_004,count(*)
FROM geo_japan j, geopoint p
WHERE st_contains(j.geom, st_makepoint(x,y))
AND j.n03_001 like ‘Tokyo’
GROUP BY n03_001,n03_004;
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize GroupAggregate (actual time=30709.855..30710.080 rows=63 loops=1)
Group Key: j.n03_001, j.n03_004
-> Gather Merge (actual time=30709.838..30732.270 rows=244 loops=1)
Workers Planned: 4
Workers Launched: 3
-> Partial GroupAggregate (actual time=30687.466..30687.572 rows=61 loops=4)
Group Key: j.n03_001, j.n03_004
-> Sort (actual time=30687.452..30687.475 rows=638 loops=4)
Sort Key: j.n03_001, j.n03_004
Sort Method: quicksort Memory: 73kB
-> Nested Loop (actual time=71.496..30686.278 rows=638 loops=4)
-> Parallel Seq Scan on geopoint p (actual time=0.012..207.553 rows=2500000 loops=4)
-> Index Scan using geo_japan_geom_idx on geo_japan j ¥
(actual time=0.012..0.012 rows=0 loops=10000000)
Index Cond: (geom ~ st_makepoint(p.x, p.y))
Filter: (((n03_001)::text ~~ ‘Tokyo’::text) AND
st_contains(geom, st_makepoint(p.x, p.y)))
Rows Removed by Filter: 0
Planning Time: 0.156 ms
Execution Time: 30732.422 ms
(21 rows)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
31
32. EXPLAIN of the simple test (2/2) - GPU version
postgres=# EXPLAIN (analyze, costs off)
SELECT n03_001,n03_004,count(*)
FROM geo_japan j, geopoint p
WHERE st_contains(j.geom, st_makepoint(x,y))
AND j.n03_001 like ‘Tokyo’
GROUP BY n03_001,n03_004;
QUERY PLAN
--------------------------------------------------------------------------------------------
GroupAggregate (actual time=329.118..329.139 rows=63 loops=1)
Group Key: j.n03_001, j.n03_004
-> Sort (actual time=329.107..329.110 rows=63 loops=1)
Sort Key: j.n03_001, j.n03_004
Sort Method: quicksort Memory: 29kB
-> Custom Scan (GpuPreAgg) (actual time=328.902..328.911 rows=63 loops=1)
Reduction: Local
Combined GpuJoin: enabled
-> Custom Scan (GpuJoin) on fgeopoint p (never executed)
Outer Scan: fgeopoint p (never executed)
Depth 1: GpuGiSTJoin(plan nrows: 10000000...60840000, actual nrows: 10000000...2553)
HeapSize: 7841.91KB (estimated: 3113.70KB), IndexSize: 13.28MB
IndexFilter: (j.geom ~ st_makepoint(p.x, p.y)) on geo_japan_geom_idx
Rows Fetched by Index: 4952
JoinQuals: st_contains(j.geom, st_makepoint(p.x, p.y))
-> Seq Scan on geo_japan j (actual time=0.164..17.723 rows=6173 loops=1)
Filter: ((n03_001)::text ~~ ‘Tokyo’::text)
Rows Removed by Filter: 112726
Planning Time: 0.344 ms
Execution Time: 340.415 ms
(20 rows)
Portion executed on GPU
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
32
33. Conclusion (1/2)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
33
▌GPU-version PostGIS
Enhancement for parallel execution of PostGIS functions
Also support GiST-Index (R-tree) of geometry type
100 times faster results to pickup “points in area” type workloads.
▌Expected Use scenarios
Area marketing analytics
Real-time advertisement delivery
Push event notifications, etc...
➔ GPU+PostGIS allows to run “computing intensive” workloads on your workstation or
cloud instance as like you are usually doing.
▌Resources
GitHub: https://github.com/heterodb/pg-strom
Document: http://heterodb.github.io/pg-strom/ja/
Contact: Tw: @kkaigai / ✉ kaigai@heterodb.com
34. Conclusion (2/2)
PGconf.online 2021 - GPU version of PostGIS and GiST-Index
34
▌Resources
GitHub: https://github.com/heterodb/pg-strom
Document: http://heterodb.github.io/pg-strom/ja/
Contact: Tw: @kkaigai / ✉ kaigai@heterodb.com
PG-Strom project welcomes your participation. Please contact us.