2. The Fundamental Theorem of Software Engineering
"We can solve any problem by introducing an extra level of
indirection."
— a remark attributed by Butler Lampson to the late David J. Wheeler; the name "Fundamental Theorem of Software Engineering" was coined by Andrew Koenig
Question: What provides indirection for Databases?
3. Background
Greenplum History
● Started from PostgreSQL 7.4; went live in 2005
● Merged PostgreSQL changes through 8.2, then forked
● Product evolved, company acquired
● Open Source in 2015
● Greenplum getting closer to latest PostgreSQL
with every new merged version
○ PG 8.3 is merged (GPDB version 5)
○ PG 9.4 is WIP (GPDB version 6)
Heimdall History
● Founded in 2014
● Advanced Partner with Pivotal
● AWS Competency Partner
● Database Vendor Neutral:
○ Postgres, SQL Server, MySQL, JDBC data
sources
4. Pivotal Greenplum
Powerful, Postgres-based MPP and multi-cloud analytics on petabyte-scale data
Challenges
• Legacy scale-up DBs are
expensive to operate
• Hadoop doesn’t fit low-latency,
iterative analytics with high
user concurrency
• Multiple environments with
messy, disjointed structured
and unstructured data
Greenplum Delivers
• Multi-cloud, Open-source,
analytics data platform
• Massively parallel processing
with machine learning and ANSI
SQL compliance
• Unify and query structured and
unstructured data from native,
HDFS, and cloud storage -
including text, spatial, and graph
data
Benefits
• Scales linearly with hardware for
optimal cost and performance
• Faster workflow; train models in
parallel, publish to DB for rapid
parallel scoring
• Analyze more types of data more
quickly for faster, deeper
insights
5. Massively Parallel Postgres Architecture: &lt;Postgres in Parallel&gt;
[Diagram: Pivotal Greenplum at the center of a massively parallel data warehouse, fed by Hadoop and public cloud data lakes; features shown include high speed of ingestion, high speed of processing, massively parallel data load from external sources, massively parallel analytical processing, in-DB predictive analytics, predefined and programmatic libraries, and GPText]
6. Heimdall Architecture
[Diagram: the Heimdall Data driver/proxy sits between the application and the database – either embedded in each application server as a driver/proxy, or deployed as a standalone database proxy shared by the application servers]
Features:
● SQL Auto-Caching, with Auto-Invalidation and Auto-Cache Refresh
● Automated Failover
● Load Balancing
● Read/Write Split
● Batch Processing
● OLTP/OLAP Routing
● Query Triggers
● Query Analytics, Transformation, & Firewall
● Connection Pooling
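As a sketch of the read/write split and OLTP/OLAP routing ideas above, a proxy can classify each statement by its leading keyword and pick an endpoint accordingly. This is illustrative only – the endpoint names are hypothetical placeholders, not Heimdall configuration:

```python
# Minimal sketch of read/write split routing: classify each SQL
# statement by its leading keyword. Endpoint names are hypothetical.
READ_ENDPOINT = "replica:5432"    # hypothetical read endpoint
WRITE_ENDPOINT = "primary:5432"   # hypothetical writable master

WRITE_KEYWORDS = {"insert", "update", "delete",
                  "create", "alter", "drop", "truncate"}

def route(sql: str) -> str:
    """Return the endpoint a proxy would send this statement to."""
    first_word = sql.lstrip().split(None, 1)[0].lower()
    return WRITE_ENDPOINT if first_word in WRITE_KEYWORDS else READ_ENDPOINT
```

A real router would also consider transactions in progress and query cost, but keyword classification is the core of the split.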
7. OLAP vs. OLTP
Analytics-based (OLAP)
● High latency; many reads, fewer writes
● Bulk ETL operations, complex queries
● Calculates results across very large datasets
● Most are purpose-built to scale (and expand) to many nodes, with replication and HA built in
● Optimizer should evaluate the best plan using statistics for more complex analytical queries
● SLAs are not sub-second or sub-minute
● Caching or materialized views typically are not leveraged, due to the inherently deep/wide nature of analytical queries
Transactional-based (OLTP)
● Low latency, memory-intensive operations
● Singleton ETL operations, including DML
● Typically targeted data retrieval
● Scaling is limited and expensive – a single node for OLTP purposes (Postgres)
● Optimizer does not need to be as intelligent, as most queries are single-threaded
● SLAs are typically sub-second
● Caching is utilized heavily to meet SLAs
8. OLTP happens - on Analytical shared-nothing
systems
Many applications are ported from Oracle and similar systems
● Greenplum will open and spawn many threads based on the query type
● Singleton operations take up unnecessary threads, exhausting finite RAM/CPU resources
● Pooling agents can alleviate pressure on the Master – but throughput will be affected by the number of resources used and the operation type
● Small, quick queries are not cached, resulting in re-reads (lookup/dimension tables, etc.)
● Historically, applications needed to be re-written to utilize batch loading operations – expensive!
● When combined, these workloads are referred to as HTAP, "Hybrid Transactional/Analytical Processing"
9. HTAP Use Case: Dell, Inc.
Problem: Legacy Apps with Singleton DML
(Insert/Update/Delete)
● Existing infrastructure supported applications performing single inserts/updates/deletes in volume
● Greenplum’s MPP Design has slow commit times for Singleton Inserts
● Customer desired to support DML without a redesign
Solution: Heimdall Auto-Batching into Greenplum
● DML operations are isolated and batched by Heimdall
● Commits are performed over many operations, reducing overhead
● Exceptions are tracked by Heimdall for later analysis
Result: DML Performance Increased by 20x, Meeting
Requirements
10. Asynchronous Batch Processing
[Diagram: application DML requests (#1–#4) enter a queue; with a batch size of 4, Heimdall drains the queue and issues each batch as a single transaction:]
START TRANSACTION;
DML 1;
DML 2;
DML 3;
DML 4;
COMMIT;
Exceptions are logged, removed from the batch, and the transaction is restarted.
Benefits:
• Lower CPU overhead due to fewer commits
• Improved application response time
• Improved DML scale
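The flow above can be sketched in a few lines. This is an illustrative batcher, not Heimdall's implementation, assuming a DB-API style connection whose context manager commits on success and rolls back on error:

```python
# Illustrative DML batcher: queue singleton DML, flush each batch in
# one transaction, and track failing statements for later analysis.
class DmlBatcher:
    def __init__(self, conn, batch_size=4):
        self.conn = conn
        self.batch_size = batch_size
        self.queue = []
        self.failed = []          # exceptions tracked for later analysis

    def add(self, sql, params=None):
        self.queue.append((sql, params))
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.queue:
            return
        batch, self.queue = self.queue, []
        try:
            with self.conn:       # one commit for the whole batch
                cur = self.conn.cursor()
                for sql, params in batch:
                    cur.execute(sql, params or ())
        except Exception:
            # Restart: retry statements one by one, so a single bad
            # row does not discard the whole batch.
            for sql, params in batch:
                try:
                    with self.conn:
                        self.conn.cursor().execute(sql, params or ())
                except Exception as exc:
                    self.failed.append((sql, exc))
```

With a batch size of 4, four application-level commits collapse into one, which is where the reduced commit overhead comes from.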
12. Customer Example: CATL
Problem: Slow Report Generation in Tableau
● Each report contained up to 30 queries, taking 30 seconds each
● Data was updated every two hours
● Reports viewed at random intervals by management
Solution: Heimdall Auto-Refresh Caching into GemFire
● Redundant queries were learned by Heimdall
● Via a stored procedure run after each data load, Heimdall invalidates the modified tables
● The query cache was refreshed from Greenplum into GemFire by Heimdall
Result: Average Report Generation Went From 17s to 3s
13. Auto-Refresh Caching
[Diagram: the application's initial request and response pass through Heimdall's query tracker to the data source, populating the caches (L1+L2); later requests are served with the cached result. After a bulk data upload, an invalidation stored procedure (or trigger) raises an invalidation event: the affected cache entries are invalidated, the tracked queries are reissued, and the cache is re-populated via refresh requests and responses]
Auto-Refresh targets finite query-pattern environments, i.e. reporting and dashboard interfaces.
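The same flow, reduced to an illustrative in-process sketch (the class and method names here are hypothetical, not Heimdall's API): the cache records which tables each query reads, and an invalidation event re-executes those queries instead of merely evicting them, so the cache is warm before the next report runs:

```python
# Illustrative auto-refresh cache: entries are keyed by SQL text and
# tagged with the tables they read; invalidation reissues the queries.
class QueryCache:
    def __init__(self, execute):
        self.execute = execute    # function: sql -> result (the database)
        self.cache = {}           # sql -> cached result
        self.tables = {}          # sql -> set of tables the query reads

    def query(self, sql, reads):
        if sql not in self.cache:
            self.cache[sql] = self.execute(sql)   # populate on first use
            self.tables[sql] = set(reads)
        return self.cache[sql]

    def invalidate(self, changed_tables):
        changed = set(changed_tables)
        for sql, reads in self.tables.items():
            if reads & changed:
                # Auto-refresh: reissue the query rather than just evict,
                # so later requests hit a warm cache.
                self.cache[sql] = self.execute(sql)
```

The stored procedure in the diagram plays the role of `invalidate()` here, naming the tables touched by the bulk load.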
14. Customer Example: Questis w/ Aurora for Postgres
Problem: Productizing an MVP (Minimum Viable Product)
● Development focused on features, not performance
● No cache layer had been implemented during MVP development
● In use, many redundant queries were being performed
Solution: Heimdall Caching Logic for Amazon ElastiCache
● Reduced database load by 90%
● Improved page generation time
● Auto-Invalidation gave peak cache efficiency without stale data
Result: MVP code was put into production without rewrites for caching, and met customer SLAs
15. Pivotal Greenplum: Learn More
● Find out more about Pivotal Greenplum and Heimdall at
○ https://pivotal.io/pivotal-greenplum
○ https://heimdalldata.com
● Or learn more about the open source Greenplum at
○ http://greenplum.org/
● Or give it a try:
○ Amazon AWS, Azure, Google GCP, or the Heimdall website
● Check for the Heimdall Q/A Deep Dive (Date TBD)