GPDB v6: Massive Postgres Power for Analytics

© Copyright 2019 Pivotal Software, Inc. All rights Reserved.
Ivan Novick
@NovickGreenplum
March 2019
Present & Future of Greenplum Database
A massively parallel Postgres Database

Greenplum Database v5
Mission Critical Analytical Database Platform

GPDB v5: Mission Critical Analytical Database Platform
Well rounded and proven feature set:
● Proven in Mission Critical Use Cases
● ORCA Optimizer
● Resource Groups & PGBouncer for Concurrency
● In-Database Analytics
● External Data Federation Ecosystem
● Pivotal Greenplum Command Center 4.x
● Updated Backup and Migration Tooling
“Pivotal Greenplum is often used in mission-critical use cases, where downtime is not well-tolerated.”
-- Gartner MQ 2019

Greenplum Database V6
Massive Postgres Power

GPDB v6: Massive Postgres Power
What if Greenplum were a Superset and not a subset of Postgres?
● Postgres 9.4 merged
● WAL Replication
● Row Level Locking for Updates/Deletes
● Foreign Data Wrapper API
● PG Extensions: e.g. pgaudit
● Recursive CTE
● JSON, JSONB, FTS, GIN Index
“Customers frequently called out the open-source alignment with PostgreSQL as a strong and cost-effective positive”
-- Gartner MQ 2019

GPDB v6: OLTP Performance with Greenplum
Up to 50x Performance gain on pgbench in early testing
● Greenplum has always been ACID with Transaction semantics
● Many Analytical Systems Require a Mix of Analytical and OLTP Queries
● Remove Table Lock on Updates & Deletes
● Distributed Deadlock Detector introduced
● Concurrent OLTP Operations allowed
“Customers frequently called out the open-source alignment with PostgreSQL as a strong and cost-effective positive”
-- Gartner MQ 2019

V6: Big Data Features
#ScaleMatters
● Online Expansion w/ Jump Consistent Hash
● Star-Schema DW with Replicated Tables
● Join Aggregate Query Perf with Eager Aggregation Optimizations
● zStandard compression
“Reference customers for Pivotal praised the overall performance and scalability of Pivotal Greenplum”
-- Gartner MQ 2019

GP v5 Expand Example
Detailed Call Records Example, Distributed by Call ID
Before GPEXPAND (3 segments):
● Segment 1: Call id 1, 4, 7, 10
● Segment 2: Call id 2, 5, 8, 11
● Segment 3: Call id 3, 6, 9, 12
After GPEXPAND (4 segments):
● Segment 1: Call id 1, 5, 9
● Segment 2: Call id 2, 6, 10
● Segment 3: Call id 3, 7, 11
● Segment 4: Call id 4, 8, 12
GPEXPAND: RESHUFFLE ALL

GP v6 Online Expand w/ Jump Consistent Hash
Detailed Call Records Example, Distributed by Call ID
Before GPEXPAND (3 segments):
● Segment 1: Call id 1, 4, 7, 10
● Segment 2: Call id 2, 5, 8, 11
● Segment 3: Call id 3, 6, 9, 12
After GPEXPAND (4 segments):
● Segment 1: Call id 1, 4, 7
● Segment 2: Call id 2, 5, 8
● Segment 3: Call id 3, 6, 9
● Segment 4: Call id 10, 11, 12
GPEXPAND: MINIMAL DATA MOVEMENT
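
The difference between the two expand examples can be sketched in Python. The function below follows the published jump consistent hash algorithm (Lamping and Veach); whether Greenplum's internal implementation matches it line for line is an assumption, and `modulo_place` and `moved_keys` are hypothetical helper names used only to illustrate the "reshuffle all" versus "minimal data movement" contrast.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a non-negative integer key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= 0xFFFFFFFFFFFFFFFF  # treat the key as an unsigned 64-bit value
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, as in the published algorithm
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def modulo_place(key: int, num_buckets: int) -> int:
    """Naive placement: changing the bucket count remaps most keys."""
    return key % num_buckets


def moved_keys(place, keys, old_n, new_n):
    """Keys whose segment changes when the cluster grows from old_n to new_n."""
    return [k for k in keys if place(k, old_n) != place(k, new_n)]


keys = range(12000)
moved_mod = moved_keys(modulo_place, keys, 3, 4)
moved_jump = moved_keys(jump_consistent_hash, keys, 3, 4)
print("modulo moved:", len(moved_mod))
print("jump hash moved:", len(moved_jump))
```

With jump consistent hash, every key that moves goes to the newly added segment, and in expectation about one key in four moves when growing from 3 to 4 segments. Modulo placement moves three keys in four for the same expansion.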

GP v6 Replicated Tables
Distributed Fact Table:
● SEGMENT 1: Call 1, Caller 1; Call 5, Caller 2; Call 9, Caller 1; Call 13, Caller 3
● SEGMENT 2: Call 2, Caller 1; Call 6, Caller 3; Call 10, Caller 3; Call 14, Caller 3
● SEGMENT 3: Call 3, Caller 3; Call 7, Caller 3; Call 11, Caller 1; Call 15, Caller 1
● SEGMENT 4: Call 4, Caller 2; Call 8, Caller 3; Call 12, Caller 1; Call 16, Caller 1
Replicated Dimension Table:
● Every segment (SEGMENT 1 to SEGMENT 4) holds CallerID 1, CallerID 2, CallerID 3
JOIN: fact rows are joined with the dimension rows on each segment
CREATE TABLE CallerUser (callerid int, attribute text) DISTRIBUTED REPLICATED;
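
As a rough illustration of why a replicated dimension table helps a star-schema join, the Python sketch below models the four segments from the slide. The fact rows are taken from the diagram; the attribute values and the `local_join` helper are hypothetical, and the real planner behavior in Greenplum is more involved than this model.

```python
# Fact rows (call_id, caller_id) as placed on the four segments in the slide.
fact_by_segment = {
    1: [(1, 1), (5, 2), (9, 1), (13, 3)],
    2: [(2, 1), (6, 3), (10, 3), (14, 3)],
    3: [(3, 3), (7, 3), (11, 1), (15, 1)],
    4: [(4, 2), (8, 3), (12, 1), (16, 1)],
}

# Dimension table keyed by caller id; the attribute values are placeholders.
caller_dim = {1: "attr-1", 2: "attr-2", 3: "attr-3"}

# DISTRIBUTED REPLICATED: every segment keeps a full copy of the dimension.
dim_by_segment = {seg: dict(caller_dim) for seg in fact_by_segment}


def local_join(seg):
    """Join a segment's fact rows using only data stored on that segment."""
    dim = dim_by_segment[seg]
    return [(call, caller, dim[caller]) for call, caller in fact_by_segment[seg]]


# Union of the per-segment joins, with no rows shipped between segments.
joined = sorted(row for seg in fact_by_segment for row in local_join(seg))

# Reference result: join over all fact rows gathered in one place.
all_facts = [row for rows in fact_by_segment.values() for row in rows]
reference = sorted((call, caller, caller_dim[caller]) for call, caller in all_facts)

print(joined == reference)
```

Because each segment already has every dimension row, the per-segment joins together produce the same result as a global join, without redistributing either table.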

Eager-Agg Optimization in GPDB v6
create table foo (j1 int, g1 int, s1 int);
insert into foo select i%10000, i %1000, i from generate_series(1,100000000) i;
● 10,000 unique join values (j1)
● 1,000 unique grouping values (g1)
create table bar (j2 int, g2 int, s2 int);
insert into bar select i%100, i %1000, i from generate_series(1,100000) i;
● 1,000 unique grouping values (g2)
● 100 unique join values (j2)
Query:
select sum(s1)
from foo, bar
where j1 = j2 and s1%2 = 0
group by g1, g2;
Greenplum v5: 63.8 seconds
Greenplum v6: 7.4 seconds
~9X Improvement

Aggregate Queries over Join GPDB v5
● Find the loss per line item for all returned items
● Join the line items to the orders
● Group them by store and compute the aggregate loss
Straightforward translation of the query into the query plan
If each order has a large number of line items, the join results can be quite large and expensive
Query plan (root at top):
γ O_STORE (SUM(L_LOSS))
  ⨝ (L_ORDERKEY = O_ORDERKEY)
    π L_LOSS: L_EXTENDEDPRICE * (1 - L_DISCOUNT)
      σ (L_RETURNFLAG = "R")
        LINEITEM
    ORDERS

Eager Agg Optimization GPDB v6
● Find the loss of revenue for each order
● Join the aggregated view with table ORDERS
● Compute the total loss for each store
● Benefit: Inner group-by reduces the number of rows fed to the join
[Yan95] W. P. Yan and P. Larson, "Eager Aggregation and Lazy Aggregation", VLDB 1995
Query plan (root at top):
γ O_STORE (SUM(L_ORDERLOSS))
  ⨝ (L_ORDERKEY = O_ORDERKEY)
    γ L_ORDERKEY (L_ORDERLOSS: SUM(L_LOSS))
      π L_LOSS: L_EXTENDEDPRICE * (1 - L_DISCOUNT)
        σ (L_RETURNFLAG = "R")
          LINEITEM
    ORDERS
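
The equivalence of the two plans can be checked on a toy data set. The sketch below uses SQLite through Python purely as a stand-in engine, with hypothetical rows and a hand-written rewrite; in GPDB v6 the optimizer is described as applying this transformation itself. The rewrite is safe here because O_ORDERKEY is unique in ORDERS, so the join does not duplicate pre-aggregated rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (o_orderkey INTEGER PRIMARY KEY, o_store TEXT);
CREATE TABLE lineitem (l_orderkey INTEGER, l_extendedprice REAL,
                       l_discount REAL, l_returnflag TEXT);
INSERT INTO orders VALUES (1, 'A'), (2, 'A'), (3, 'B');
INSERT INTO lineitem VALUES
  (1, 100.0, 0.5, 'R'), (1, 200.0, 0.25, 'R'), (1, 50.0, 0.0, 'N'),
  (2, 80.0, 0.5, 'R'),
  (3, 40.0, 0.25, 'R'), (3, 60.0, 0.5, 'R');
""")

# Plan from the v5 slide: join every returned line item, then aggregate.
lazy = conn.execute("""
SELECT o_store, SUM(l_extendedprice * (1 - l_discount))
FROM lineitem JOIN orders ON l_orderkey = o_orderkey
WHERE l_returnflag = 'R'
GROUP BY o_store ORDER BY o_store
""").fetchall()

# Eager aggregation: pre-aggregate per order, then join and aggregate per store.
eager = conn.execute("""
SELECT o_store, SUM(l_orderloss)
FROM (SELECT l_orderkey,
             SUM(l_extendedprice * (1 - l_discount)) AS l_orderloss
      FROM lineitem
      WHERE l_returnflag = 'R'
      GROUP BY l_orderkey) AS agg
JOIN orders ON agg.l_orderkey = o_orderkey
GROUP BY o_store ORDER BY o_store
""").fetchall()

# Rows entering the join under each plan.
rows_lazy = conn.execute(
    "SELECT COUNT(*) FROM lineitem WHERE l_returnflag = 'R'").fetchone()[0]
rows_eager = conn.execute(
    "SELECT COUNT(DISTINCT l_orderkey) FROM lineitem "
    "WHERE l_returnflag = 'R'").fetchone()[0]

print(lazy == eager, rows_lazy, rows_eager)
```

Both queries return the same per-store totals, while the pre-aggregated form passes one row per order into the join instead of one row per returned line item.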

GPDB v6 zStd Compression
Same or more for less
● Open Source
● Lower CPU Cycles with same or better compression
● Originated at Facebook
CREATE TABLE call_data_records(callid int4, calldetails json)
WITH (appendonly=true, compresstype=zstd, orientation=column)
DISTRIBUTED BY (callid);

Containerized Greenplum w/ GPDB v6
● GP embedded in containers for portability and dependency management
● Containers managed by Kubernetes for higher availability and elasticity
● Kubernetes operator used for automation
[Diagram: an Operator applying AUTOMATION to Containers running in pods]

Greenplum Database V7
BEYOND THE CLUSTER

GPDB v7: Beyond the Cluster
We have all this Postgres infrastructure in GPDB v6; now let's use it
● Postgres 9.6 target
● DB Snapshots / Backup
● Streaming Replication
● Log Shipping and Reconciliation
● Greenplum as a source for Kafka
● Greenplum as a source for CDC Tools
● Greenplum to Greenplum Inter Cluster Queries
“You do this and you can beat Oracle”
-- US Federal Customer, 2018

GPDB v7: Thought Leadership in Database AI
Define Artificial Intelligence. Does it make sense to integrate intelligence into an analytical platform?
● 2019 Apache MADlib is focused on Deep Learning and GPU processing
● 2019 Pivotal’s GPText Solution will add more cognitive intelligence of human language
● Combine with existing functions: PostGIS Geospatial; Apache MADlib Machine Learning & Graph; Python, R libraries, SQL at scale
● This is a platform for modern AI!
“With the Apache MADlib analytics libraries, Pivotal Greenplum has capable in-database analytics that allow for predictive modeling and ML to be applied to relational data.”
-- Gartner MQ 2019

“Greenplum Database, soar with us to new heights”
