Building a multi-terabyte PostgreSQL instance as a high-volume, mission-critical operational datastore that replaced Oracle. Learn about solving real-life problems such as a near-catastrophic hardware failure at the terabyte scale.
8. Database Situation
• The problems:
• The database is growing.
• The OLTP and ODS/warehouse are too slow.
• A lot of application code against the OLTP system.
• Minimal application code against the ODS system.
• Oracle:
• Licensed per processor.
• Really, really, really expensive on a large scale.
• PostgreSQL:
• No licensing costs.
• Good support for complex queries.
12. Database Choices
• Must keep Oracle on OLTP
• Complex, Oracle-specific web application.
• Need more processors.
• ODS: Oracle not required.
• Complex queries from limited sources.
• Needs more space and power.
• Result:
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
17. PostgreSQL gotchas
• For an OLTP system that does thousands of updates per second, vacuuming is a hassle.
• No upgrades?!
• Less community experience with large databases.
• Replication features less evolved.
40. Choosing Solaris 10
• Switched to Solaris 10
• No crashes, better system-level tools.
• prstat, iostat, vmstat, smf, fault-management.
• ZFS
• snapshots (persistent), BLI backups.
• Excellent support for enterprise storage.
• DTrace.
• Free (too).
47. Oracle features we need
• Partitioning
• Statistics and Aggregations
• rank over partition, lead, lag, etc.
• Large selects (100GB)
• Autonomous transactions
• Replication from Oracle (to Oracle)
54. Partitioning
For large data sets:
pgods=# select count(1) from ods.ods_tblpick_super;
count
------------
1790994512
(1 row)
• Next biggest tables: 850m, 650m, 590m
• Allows us to cluster data over specific ranges (by date in our case)
• Simple, cheap archiving and removal of data.
• Can put ranges used less often in different tablespaces (slower, cheaper storage)
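Archiving and tiering a date range is a one-statement operation per partition; a minimal sketch, assuming a hypothetical child table for 2005 data and a tablespace named slow_sata on the cheaper storage:

```sql
-- move a rarely-queried range to slower, cheaper disks
-- (partition and tablespace names are hypothetical)
ALTER TABLE ods.ods_tblpick_super_2005 SET TABLESPACE slow_sata;

-- removing an expired range is just as cheap: drop the child table
DROP TABLE ods.ods_tblpick_super_2004;
```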
62. Partitioning PostgreSQL style
• PostgreSQL doesn’t support partitioning...
• It supports inheritance... (what’s this?)
• some crazy object-relation paradigm.
• We can use it to implement partitioning:
• One master table with no rows.
• Child tables that have our partition constraints.
• Rules on the master table for insert/update/delete.
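The scheme above can be sketched as follows; table and column names are hypothetical, but the shape (empty master, CHECK-constrained children, an insert rule per child) is the standard inheritance idiom:

```sql
-- master table: holds no rows itself
CREATE TABLE ods.pick_super (
    userid    integer,
    senddate  date,
    payload   text
);

-- one child per date range, carrying the partition constraint
CREATE TABLE ods.pick_super_200601 (
    CHECK (senddate >= '2006-01-01' AND senddate < '2006-02-01')
) INHERITS (ods.pick_super);

-- rule redirects inserts on the master into the matching child
CREATE RULE pick_super_ins_200601 AS
    ON INSERT TO ods.pick_super
    WHERE NEW.senddate >= '2006-01-01' AND NEW.senddate < '2006-02-01'
    DO INSTEAD INSERT INTO ods.pick_super_200601 VALUES (NEW.*);
```

Similar rules are needed for UPDATE and DELETE if those must also route through the master.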
70. Partitioning PostgreSQL realized
• Cheaply add new empty partitions
• Cheaply remove old partitions
• Migrate less-often-accessed partitions to slower
storage
• Different index strategies per partition
• PostgreSQL 8.1+ supports constraint exclusion on inherited tables.
• smarter planning
• smarter executing
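With the CHECK constraints in place, the 8.1+ planner can prune partitions at plan time; a sketch (constraint_exclusion is off by default in 8.1, and the master table name here is hypothetical):

```sql
SET constraint_exclusion = on;

-- the plan now skips every child whose CHECK constraint
-- contradicts the WHERE clause, scanning only one partition
EXPLAIN SELECT count(1)
  FROM ods.pick_super
 WHERE senddate >= '2006-01-01' AND senddate < '2006-02-01';
```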
73. RANK OVER PARTITION
• In Oracle:
select userid, email from (
select u.userid, u.email,
row_number() over
(partition by u.email order by userid desc) as position
from (...)) where position = 1
• In PostgreSQL:
FOR v_row IN select u.userid, u.email from (...) order by email, userid desc
LOOP
IF v_row.email != v_last_email THEN
RETURN NEXT v_row;
v_last_email := v_row.email;
v_rownum := v_rownum + 1;
END IF;
END LOOP;
With 8.4, we have windowing functions
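With 8.4's window functions, the Oracle form above runs nearly unchanged; PostgreSQL only insists on an alias for the subselect (shown with the same elided inner query as the slide):

```sql
SELECT userid, email FROM (
    SELECT u.userid, u.email,
           row_number() OVER
               (PARTITION BY u.email ORDER BY userid DESC) AS position
    FROM (...) u) ranked
WHERE position = 1;
```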
78. Large SELECTs
• Application code does:
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
• The width of these rows is about 2k
• 50 million row return set
• > 100 GB of data
79. The Large SELECT Problem
• libpq will buffer the entire result in memory.
• This affects language bindings (DBD::Pg).
• This is an utterly deficient default behavior.
• This can be avoided by using cursors
• Requires the app to be PostgreSQL specific.
• You open a cursor.
• Then FETCH the row count you desire.
82. Big SELECTs the Postgres way
The previous “big” query becomes:
DECLARE bigdump CURSOR FOR
select u.*, b.browser, m.lastmess
from ods.ods_users u,
ods.ods_browsers b,
( select userid, min(senddate) as senddate
from ods.ods_maillog
group by userid ) m,
ods.ods_maillog l
where u.userid = b.userid
and u.userid = m.userid
and u.userid = l.userid
and l.senddate = m.senddate;
Then, in a loop:
FETCH FORWARD 10000 FROM bigdump;
87. Autonomous Transactions
• In Oracle we have over 2000 custom stored procedures.
• During these procedures, we like to:
• COMMIT incrementally
Useful for long transactions (update/delete) that
need not be atomic -- incremental COMMITs.
• start a new top-level txn that can COMMIT
Useful for logging progress in a stored procedure so that you know how far you progressed and how long each step took, even if it rolls back.
91. PostgreSQL shortcoming
• PostgreSQL simply does not support autonomous transactions; to quote core developers, “that would be hard.”
• When in doubt, use brute force.
• Use pl/perl with DBD::Pg to connect to ourselves (a new backend) and execute a new top-level transaction.
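The same brute-force trick can also be sketched with the contrib dblink module instead of pl/perl: the statement runs on a second backend with its own top-level transaction, so it commits even if the caller rolls back. The connection string and log table here are hypothetical:

```sql
BEGIN;

-- executes in a separate connection and commits immediately,
-- independent of the surrounding transaction
SELECT dblink_exec('dbname=pgods',
    'INSERT INTO ods.proc_log (step, logged_at)
     VALUES (''step 1 done'', now())');

ROLLBACK;  -- the log row survives the rollback
```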
98. Replication
• Cross vendor database replication isn’t too difficult.
• Helps a lot when you can do it inside the database.
• Using dbi-link (based on pl/perl and DBI) we can.
• We can connect to any remote database.
• INSERT into local tables directly from remote SELECT statements. [snapshots]
• LOOP over remote SELECT statements and process them row-by-row. [replaying remote DML logs]
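The snapshot pattern can be sketched as below. Note this uses dblink syntax only as a stand-in for dbi-link's remote-query call (dbi-link's real interface wraps DBI handles and can reach Oracle, which dblink cannot); all connection and table names are hypothetical:

```sql
-- refresh a local snapshot table from a remote SELECT
TRUNCATE ods.ods_browsers;

INSERT INTO ods.ods_browsers
    SELECT * FROM dblink('remote_conn',
                         'SELECT userid, browser FROM browsers')
        AS t(userid integer, browser text);
```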
101. Replication (really)
• Through a combination of snapshotting and DML
replay we:
• replicate into over 2000 tables in PostgreSQL from Oracle
• snapshot replication of 200
• DML replay logs for 1800
• PostgreSQL to Oracle is a bit harder
• out-of-band export and imports
102. New Architecture
• Master: Sun v890 and Hitachi AMS + warm standby
running Oracle
(1TB)
• Logs: several custom boxes running MySQL instances
(2TB each)
• ODS BI: 2x Sun v40
running PostgreSQL 8.3
(6TB on Sun JBODs on ZFS each)
• ODS archive: 2x custom
running PostgreSQL 8.3
(14TB internal storage on ZFS each)
103. PostgreSQL is Lacking
• No upgrades (AYFKM).
• pg_dump is too intrusive.
• Poor system-level instrumentation.
• Poor methods to determine specific contention.
• It relies on the operating system’s filesystem cache (which makes PostgreSQL inconsistent across its supported OS base).
104. Enter Solaris
• Solaris is a UNIX from Sun Microsystems.
• Is it different than other UNIX/UNIX-like systems?
• Mostly it isn’t different (hence the term UNIX)
• It does have extremely strong ABI backward
compatibility.
• It’s stable and works well on large machines.
• Solaris 10 shakes things up a bit:
• DTrace
• ZFS
• Zones
105. Solaris / ZFS
• ZFS: the Zettabyte Filesystem.
• 2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem, 2^78 (256 ZiB) bytes in a pool, 2^64 devices/pool, 2^64 pools/system
• Extremely cheap differential backups.
• I have a 5 TB database, I need a backup!
• No rollback in your database? What is this? MySQL?
• No rollback in your filesystem?
• ZFS has snapshots, rollback, clone and promote.
• OMG! Life altering features.
• Caveat: ZFS is slower than alternatives, by about 10% with tuning.
106. Solaris / Zones
• Zones: Virtual Environments.
• Shared kernel.
• Can share filesystems.
• Segregated processes and privileges.
• No big deal for databases, right?
But Wait!
107. Solaris / ZFS + Zones = Magic Juju
https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh
• ZFS snapshot, clone, delegate to zone, boot and run.
• When done, halt zone, destroy clone.
• We get a point-in-time copy of our PostgreSQL database:
• read-write,
• low disk-space requirements,
• NO LOCKS! Welcome back pg_dump, you don’t suck (as much) anymore.
• Fast snapshot to usable copy time:
• On our 20 GB database: 1 minute.
• On our 1.2 TB database: 2 minutes.
108. ZFS: how I saved my soul.
• Database crash. Bad. 1.2 TB of data... busted.
The reason Robert Treat looks a bit older than he should.
• xlogs corrupted. catalog indexes corrupted.
• Fault? PostgreSQL bug? Bad memory? Who knows?
• Trial & error on a 1.2 TB data set is a cruel experience.
• In real life, most recovery actions are destructive actions.
• PostgreSQL is no different.
• Rollback to last checkpoint (ZFS), hack postgres code, try, fail, repeat.
109. Let DTrace open your eyes
• DTrace: Dynamic Tracing
• Dynamically instrument “stuff” in the system:
• system calls (like strace/truss/ktrace).
• process/scheduler activity (on/off cpu, semaphores, conditions).
• see signals sent and received.
• trace kernel functions, networking.
• watch I/O down to the disk.
• user-space processes, each function... each machine instruction!
• Add probes into apps where it makes sense to you.
110. Can you see what I see?
• There is EXPLAIN... when that isn’t enough...
• There is EXPLAIN ANALYZE... when that isn’t enough.
• There is DTrace.
; dtrace -q -n '
postgresql*:::statement-start
{
  self->query = copyinstr(arg0);
  self->ok = 1;
}
io:::start
/self->ok/
{
  @[self->query,
    args[0]->b_flags & B_READ ? "read" : "write",
    args[1]->dev_statname] = sum(args[0]->b_bcount);
}'
dtrace: description 'postgres*:::statement-start' matched 14 probes
^C
select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
read sd1 16384
select division, sum(amount), avg(amount) from ods.billings where txn_timestamp
between ‘2006-01-01 00:00:00’ and ‘2006-04-01 00:00:00’ group by division;
read sd2 71647232
117. Results
• Move ODS Oracle licenses to OLTP
• Run PostgreSQL on ODS
• Save $800k in license costs.
• Spend $100k in labor costs.
• Learn a lot.
118. Thanks!
• Thank you.
• http://omniti.com/does/postgresql
• We’re hiring, but only if you love:
• lots of data on lots of disks on lots of big boxes
• smart people
• hard problems
• more than one database technology (including PostgreSQL)
• responsibility