Building a multi-terabyte PostgreSQL instance as a high-volume, mission-critical operational datastore that replaced Oracle. Learn about solving real-life problems such as a near-catastrophic hardware failure at the terabyte scale.
8. Database Situation
• The problems:
• The database is growing.
• The OLTP and ODS/warehouse are too slow.
• A lot of application code against the OLTP system.
• Minimal application code against the ODS system.
• Oracle:
• Licensed per processor.
• Really, really, really expensive on a large scale.
• PostgreSQL:
• No licensing costs.
• Good support for complex queries.
12. Database Choices
• Must keep Oracle on OLTP
• Complex, Oracle-specific web application.
• Need more processors.
• ODS: Oracle not required.
• Complex queries from limited sources.
• Needs more space and power.
• Result:
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
17. PostgreSQL gotchas
• For an OLTP system that does thousands of updates per second, vacuuming is a hassle.
• No upgrades?!
• Less community experience with large databases.
• Replication features less evolved.
40. Choosing Solaris 10
• Switched to Solaris 10
• No crashes, better system-level tools.
• prstat, iostat, vmstat, smf, fault-management.
• ZFS
• snapshots (persistent), BLI backups.
• Excellent support for enterprise storage.
• DTrace.
• Free (too).
47. Oracle features we need
• Partitioning
• Statistics and Aggregations
• rank over partition, lead, lag, etc.
• Large selects (100GB)
• Autonomous transactions
• Replication from Oracle (to Oracle)
54. Partitioning
For large data sets:
pgods=# select count(1) from ods.ods_tblpick_super;
count
------------
1790994512
(1 row)
• Next biggest tables: 850m, 650m, 590m
• Allows us to cluster data over specific ranges (by date in our case)
• Simple, cheap archiving and removal of data.
• Can put ranges used less often in different tablespaces (slower, cheaper storage)
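Archiving and tiering a date range is a one-statement operation per partition; a minimal sketch, assuming a hypothetical child table for 2005 data and a tablespace named slow_sata on the cheaper storage:

```sql
-- move a rarely-queried range to slower, cheaper disks
-- (partition and tablespace names are hypothetical)
ALTER TABLE ods.ods_tblpick_super_2005 SET TABLESPACE slow_sata;

-- removing an expired range is just as cheap: drop the child table
DROP TABLE ods.ods_tblpick_super_2004;
```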
62. Partitioning PostgreSQL style
• PostgreSQL doesn’t support partitioning...
• It supports inheritance... (what’s this?)
• some crazy object-relation paradigm.
• We can use it to implement partitioning:
• One master table with no rows.
• Child tables that have our partition constraints.
• Rules on the master table for insert/update/delete.
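The scheme above can be sketched as follows; table and column names are hypothetical, but the shape (empty master, CHECK-constrained children, an insert rule per child) is the standard inheritance idiom:

```sql
-- master table: holds no rows itself
CREATE TABLE ods.pick_super (
    userid    integer,
    senddate  date,
    payload   text
);

-- one child per date range, carrying the partition constraint
CREATE TABLE ods.pick_super_200601 (
    CHECK (senddate >= '2006-01-01' AND senddate < '2006-02-01')
) INHERITS (ods.pick_super);

-- rule redirects inserts on the master into the matching child
CREATE RULE pick_super_ins_200601 AS
    ON INSERT TO ods.pick_super
    WHERE NEW.senddate >= '2006-01-01' AND NEW.senddate < '2006-02-01'
    DO INSTEAD INSERT INTO ods.pick_super_200601 VALUES (NEW.*);
```

Similar rules are needed for UPDATE and DELETE if those must also route through the master.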
70. Partitioning PostgreSQL realized
• Cheaply add new empty partitions
• Cheaply remove old partitions
• Migrate less-often-accessed partitions to slower
storage
• Different index strategies per partition
• PostgreSQL 8.1+ supports constraint exclusion on inherited tables.
• smarter planning
• smarter executing
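With the CHECK constraints in place, the 8.1+ planner can prune partitions at plan time; a sketch (constraint_exclusion is off by default in 8.1, and the master table name here is hypothetical):

```sql
SET constraint_exclusion = on;

-- the plan now skips every child whose CHECK constraint
-- contradicts the WHERE clause, scanning only one partition
EXPLAIN SELECT count(1)
  FROM ods.pick_super
 WHERE senddate >= '2006-01-01' AND senddate < '2006-02-01';
```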
73. RANK OVER PARTITION
• In Oracle:
select userid, email from (
select u.userid, u.email,
row_number() over
(partition by u.email order by userid desc) as position
from (...)) where position = 1
• In PostgreSQL:
FOR v_row IN select u.userid, u.email from (...) order by email, userid desc
LOOP
IF v_row.email != v_last_email THEN
RETURN NEXT v_row;
v_last_email := v_row.email;
v_rownum := v_rownum + 1;
END IF;
END LOOP;
With 8.4, we have windowing functions
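With 8.4's window functions, the Oracle form above runs nearly unchanged; PostgreSQL only insists on an alias for the subselect (shown with the same elided inner query as the slide):

```sql
SELECT userid, email FROM (
    SELECT u.userid, u.email,
           row_number() OVER
               (PARTITION BY u.email ORDER BY userid DESC) AS position
    FROM (...) u) ranked
WHERE position = 1;
```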
78. Large SELECTs
• Application code does:
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
• The width of these rows is about 2k
• 50 million row return set
• > 100 GB of data
79. The Large SELECT Problem
• libpq will buffer the entire result in memory.
• This affects language bindings (DBD::Pg).
• This is an utterly deficient default behavior.
• This can be avoided by using cursors
• Requires the app to be PostgreSQL specific.
• You open a cursor.
• Then FETCH the row count you desire.
82. Big SELECTs the Postgres way
The previous “big” query becomes:
DECLARE bigdump CURSOR FOR
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
Then, in a loop:
FETCH FORWARD 10000 FROM bigdump;
87. Autonomous Transactions
• In Oracle we have over 2000 custom stored procedures.
• During these procedures, we like to:
• COMMIT incrementally
Useful for long transactions (update/delete) that
need not be atomic -- incremental COMMITs.
• start a new top-level txn that can COMMIT
Useful for logging progress in a stored procedure so that you know how far you progressed and how long each step took, even if it rolls back.
91. PostgreSQL shortcoming
• PostgreSQL simply does not support autonomous transactions; to quote core developers, “that would be hard.”
• When in doubt, use brute force.
• Use pl/perl with DBD::Pg to connect to ourselves (a new backend) and execute a new top-level transaction.
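The same brute-force trick can also be sketched with the contrib dblink module instead of pl/perl: the statement runs on a second backend with its own top-level transaction, so it commits even if the caller rolls back. The connection string and log table here are hypothetical:

```sql
BEGIN;

-- executes in a separate connection and commits immediately,
-- independent of the surrounding transaction
SELECT dblink_exec('dbname=pgods',
    'INSERT INTO ods.proc_log (step, logged_at)
     VALUES (''step 1 done'', now())');

ROLLBACK;  -- the log row survives the rollback
```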
98. Replication
• Cross vendor database replication isn’t too difficult.
• Helps a lot when you can do it inside the database.
• Using dbi-link (based on pl/perl and DBI) we can.
• We can connect to any remote database.
• INSERT into local tables directly from remote SELECT statements. [snapshots]
• LOOP over remote SELECT statements and process them row-by-row. [replaying remote DML logs]
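The snapshot pattern can be sketched as below. Note this uses dblink syntax only as a stand-in for dbi-link's remote-query call (dbi-link's real interface wraps DBI handles and can reach Oracle, which dblink cannot); all connection and table names are hypothetical:

```sql
-- refresh a local snapshot table from a remote SELECT
TRUNCATE ods.ods_browsers;

INSERT INTO ods.ods_browsers
    SELECT * FROM dblink('remote_conn',
                         'SELECT userid, browser FROM browsers')
        AS t(userid integer, browser text);
```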
101. Replication (really)
• Through a combination of snapshotting and DML
replay we:
• replicate into over 2000 tables in PostgreSQL from Oracle
• snapshot replication of 200
• DML replay logs for 1800
• PostgreSQL to Oracle is a bit harder
• out-of-band export and imports
102. New Architecture
• Master: Sun v890 and Hitachi AMS + warm standby
running Oracle
(1TB)
• Logs: several custom boxes running MySQL instances
(2TB each)
• ODS BI: 2x Sun v40
running PostgreSQL 8.3
(6TB on Sun JBODs on ZFS each)
• ODS archive: 2x custom
running PostgreSQL 8.3
(14TB internal storage on ZFS each)
103. PostgreSQL is Lacking
• No upgrades (AYFKM).
• pg_dump is too intrusive.
• Poor system-level instrumentation.
• Poor methods to determine specific contention.
• It relies on the operating system’s filesystem cache (which makes PostgreSQL inconsistent across its supported OS base).
104. Enter Solaris
• Solaris is a UNIX from Sun Microsystems.
• Is it different than other UNIX/UNIX-like systems?
• Mostly it isn’t different (hence the term UNIX)
• It does have extremely strong ABI backward
compatibility.
• It’s stable and works well on large machines.
• Solaris 10 shakes things up a bit:
• DTrace
• ZFS
• Zones
105. Solaris / ZFS
• ZFS: the Zettabyte Filesystem.
• 2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem, 2^78 (256 ZiB) bytes in a pool, 2^64 devices/pool, 2^64 pools/system
• Extremely cheap differential backups.
• I have a 5 TB database, I need a backup!
• No rollback in your database? What is this? MySQL?
• No rollback in your filesystem?
• ZFS has snapshots, rollback, clone and promote.
• OMG! Life altering features.
• Caveat: ZFS is slower than alternatives, by about 10% with tuning.
106. Solaris / Zones
• Zones: Virtual Environments.
• Shared kernel.
• Can share filesystems.
• Segregated processes and privileges.
• No big deal for databases, right?
But Wait!
107. Solaris / ZFS + Zones = Magic Juju
https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh
• ZFS snapshot, clone, delegate to zone, boot and run.
• When done, halt zone, destroy clone.
• We get a point-in-time copy of our PostgreSQL database:
• read-write,
• low disk-space requirements,
• NO LOCKS! Welcome back pg_dump, you don’t suck (as much) anymore.
• Fast snapshot to usable copy time:
• On our 20 GB database: 1 minute.
• On our 1.2 TB database: 2 minutes.
108. ZFS: how I saved my soul.
• Database crash. Bad. 1.2 TB of data... busted.
The reason Robert Treat looks a bit older than he should.
• xlogs corrupted. catalog indexes corrupted.
• Fault? PostgreSQL bug? Bad memory? Who knows?
• Trial & error on a 1.2 TB data set is a cruel experience.
• In real life, most recovery actions are destructive actions.
• PostgreSQL is no different.
• Rollback to last checkpoint (ZFS), hack postgres code, try, fail, repeat.
109. Let DTrace open your eyes
• DTrace: Dynamic Tracing
• Dynamically instrument “stuff” in the system:
• system calls (like strace/truss/ktrace).
• process/scheduler activity (on/off cpu, semaphores, conditions).
• see signals sent and received.
• trace kernel functions, networking.
• watch I/O down to the disk.
• user-space processes, each function... each machine instruction!
• Add probes into apps where it makes sense to you.
110. Can you see what I see?
• There is EXPLAIN... when that isn’t enough...
• There is EXPLAIN ANALYZE... when that isn’t enough.
• There is DTrace.
; dtrace -q -n '
postgresql*:::statement-start
{
  self->query = copyinstr(arg0);
  self->ok = 1;
}
io:::start
/self->ok/
{
  @[self->query,
    args[0]->b_flags & B_READ ? "read" : "write",
    args[1]->dev_statname] = sum(args[0]->b_bcount);
}'
dtrace: description 'postgres*:::statement-start' matched 14 probes
^C
select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
read sd1 16384
select division, sum(amount), avg(amount) from ods.billings where txn_timestamp
between ‘2006-01-01 00:00:00’ and ‘2006-04-01 00:00:00’ group by division;
read sd2 71647232
117. Results
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
• Save $800k in license costs.
• Spend $100k in labor costs.
• Learn a lot.
118. Thanks!
• Thank you.
• http://omniti.com/does/postgresql
• We’re hiring, but only if you love:
• lots of data on lots of disks on lots of big boxes
• smart people
• hard problems
• more than one database technology (including PostgreSQL)
• responsibility