6. Greenplum uses PXF as a federated query engine to access external heterogeneous data.
7. Platform Extension Framework (PXF)
● Tabular view of heterogeneous data
● Built-in connectors for various data sources and formats
● Pluggable framework
● Parallel, high-throughput data access
● Open source
● Reads and writes external data
9. Q: How can I access sales data residing in an S3 bucket, stored in Parquet format?

Define a Greenplum external table:

CREATE EXTERNAL TABLE sales
(cust int, sku text, amount decimal, date date)
LOCATION ('pxf://s3-bucket/2018/sales/?PROFILE=s3:parquet&SERVER=s3_sales')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

The LOCATION URL names the path to the data, the profile (data store and format), and the server configuration to use.
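Once defined, the external table is queried like any other Greenplum table; PXF fetches the Parquet data from S3 at query time. A hypothetical query (the column filter and grouping are illustrative, not from the slides):

```sql
-- Aggregate the external S3/Parquet data in place, no loading required.
SELECT sku, sum(amount) AS total
FROM sales
WHERE date >= '2018-01-01'
GROUP BY sku;
```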
10. How can we scale performance when querying remote data?
11. Performance - Predicate Pushdown

SELECT item, amount FROM orders
WHERE state = 'CA'

[Diagram: the master extracts the predicate {state='CA'}; each segment's PXF instance (here with JDBC against a row-oriented storage format) forwards it to the external system, so partitions for NY, NJ, etc. are skipped and only CA data is read.]

● Predicate information is pushed to the external system
● External engines can apply the predicates in their own queries (e.g. JDBC)
● No filtering within PXF itself
● Partition pruning (e.g. Hive)
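As a sketch of pushdown with the JDBC connector (the table, server name, and remote schema here are assumptions, not from the slides), an external table over a remote orders table lets PXF forward the WHERE clause instead of pulling every row:

```sql
-- Hypothetical JDBC-backed external table over a remote "orders" table.
CREATE EXTERNAL TABLE orders_remote
(item text, amount decimal, state text)
LOCATION ('pxf://public.orders?PROFILE=Jdbc&SERVER=pg_orders')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- PXF pushes the predicate into the remote database's query,
-- so only rows with state = 'CA' cross the network.
SELECT item, amount FROM orders_remote WHERE state = 'CA';
```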
12. Performance - Column Projection

SELECT item, amount FROM orders
WHERE state = 'CA'

[Diagram: the master passes columns (item, amount), predicates (state=CA), and aggregates (count) as metadata to the segments; PXF (here with Hive/ORC, a columnar storage format) reads only the projected columns, skipping date and the rest of each row.]

● Propagates column projection metadata to external systems
● Supported for JDBC, Parquet & ORC
● Reduces network I/O
● Reduces remote disk I/O
● Improved performance for aggregate queries
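For columnar formats, an aggregate query shows the benefit most clearly (a sketch; the orders external table definition is assumed): only the columns the query touches are read from remote storage.

```sql
-- Against an ORC- or Parquet-backed external table, only the "state"
-- column is read to answer this; other columns stay on remote disk.
SELECT state, count(*) FROM orders GROUP BY state;
```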
14. Use Case: Multi-temperature data querying
● Storage based on operational requirements
● Can I work with data created a few seconds ago? (HOT data: in-memory database)
● Can I run a report on data from a few days ago? (WARM data: RDBMS)
● Can I inspect data archived months or years ago? (COLD data: data lake)
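One way to stitch the temperatures together (a sketch; all table names are hypothetical) is a view that unions a local Greenplum table of hot data with a PXF external table over the archived, cold data:

```sql
-- Cold data stays in the data lake; hot data lives in Greenplum.
CREATE EXTERNAL TABLE orders_archive
(item text, amount decimal, order_date date)
LOCATION ('pxf://data-lake/orders/archive?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- A single view spans both temperatures transparently.
CREATE VIEW all_orders AS
SELECT item, amount, order_date FROM orders_recent    -- hot, local table
UNION ALL
SELECT item, amount, order_date FROM orders_archive;  -- cold, external
```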
15. Use Case: Elastic scaling with Greenplum
● Greenplum on K8s for elastic compute
● Elastic storage with S3/Azure/Google
● Ability to separate compute from storage
● On-demand data warehouses
16. Use Case: Access heterogeneous data on multiple clouds
● Different cloud providers based on business requirements
● Low-cost storage
● No storage admin
● Data doesn't need to be copied
17. Use Case: Access heterogeneous data on multiple clouds

SELECT * FROM historical_orders o, product_catalog p
WHERE o.product_id = p.product_id

[Diagram: Historical_Orders, Historical_Invoices, and Product_Catalog tables spread across s3-bucket-orders and s3-bucket-price. An admin migrates the data from s3-bucket-orders to Azure Blob Storage, and the same join query runs unchanged before and after the migration.]
18. Use Case: Access heterogeneous data on multiple clouds

Historical_orders table data on S3:

CREATE EXTERNAL TABLE historical_orders
(item int, amount money)
LOCATION ('pxf://s3-bucket-orders/path?PROFILE=s3:parquet&SERVER=s3_orders')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

Historical_orders table data now on Azure Data Lake:

CREATE EXTERNAL TABLE historical_orders
(item int, amount money)
LOCATION ('pxf://my.azuredatalakestore.net/path?PROFILE=adl:parquet&SERVER=azure')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
19. Summary

Greenplum embraces the modern data landscape:
● Scale and manage compute independently from storage
● Federate queries across heterogeneous data sources
● Cloud agnostic

Data is available for analytics with Greenplum no matter its form and where it resides!
21. Greenplum External Table

Define an external table with the following:
● the schema of the external data
● the protocol (pxf)
● the location of the data in the external system
● the profile identifying the specific connector
● the compression codec of the data
● the format of the external data

CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
( col_name data_type [, ...] | LIKE other_table )
LOCATION ('pxf://<path to data>?
  PROFILE=[<profile_name>|<data_store:data_type>]
  [&COMPRESSION_CODEC=[snappy|gzip|lzo|bzip2]]
  [&<CUSTOM_OPTION>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]'

Example: a CSV file on HDFS

cust, sku, amount, date
1234, ABC, $9.90, 4/01
1235, CDE, $8.80, 3/30

CREATE EXTERNAL TABLE sales
(cust int, sku text, amount decimal, date date)
LOCATION ('pxf:///2018/sales.csv?PROFILE=hdfs:text')
FORMAT 'TEXT'
22. PXF Multi Server

PXF supports accessing multiple external datastores simultaneously:
● a server identifies an external datastore
● staging directory servers/ under ${PXF_CONF}
● relevant configuration files live under servers/{server_name}/
  ○ HDFS: core-site.xml, hdfs-site.xml, ...
  ○ S3: s3-site.xml containing access properties

CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
( col_name data_type [, ...] | LIKE other_table )
LOCATION ('pxf://<path to data>?
  PROFILE=<data_store:data_type>&
  SERVER=<server_name>')

Example:

CREATE EXTERNAL TABLE sales
(cust int, sku text, amount decimal, date date)
LOCATION ('pxf://s3-bucket-sales/2018/sales.csv?PROFILE=s3:text&SERVER=s3_sales')
FORMAT 'TEXT'

cust, sku, amount, date
1234, ABC, $9.90, 4/01
1235, CDE, $8.80, 3/30
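The s3-site.xml for a server such as s3_sales might look like this minimal sketch (the property names follow the Hadoop s3a convention that PXF's S3 connector relies on; the credential values are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Placed in ${PXF_CONF}/servers/s3_sales/s3-site.xml -->
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_AWS_ACCESS_KEY_ID</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
    </property>
</configuration>
```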
23. Performance in PXF
● Parallel access to data
● Predicate pushdown
● Column projection

SELECT item, amount FROM orders
WHERE state = 'CA'
(column projection: item, amount; predicate pushdown: state = 'CA')