Big Bad PostgreSQL: A Case Study


Moving a “large,” “complicated,” and mission-critical data warehouse
from Oracle to PostgreSQL for cost control.



About the Speaker

 •   Principal @ OmniTI
 •   Open Source
     mod_backhand, spreadlogd, OpenSSH+SecurID, Daiquiri,
     Wackamole, libjlog, Spread, Reconnoiter, etc.
 •   Closed Source
     Message Systems MTA, Message Central
 •   Author
     Scalable Internet Architectures

 [Book cover image: Scalable Internet Architectures, Theo Schlossnagle (Developer’s Library)]
Overall Architecture


 [Architecture diagram; components as labeled:]

 •   OLTP (OLTP instance: drives the site): Oracle 8i, 0.5 TB Hitachi + 0.25 TB JBOD
 •   Warm spare (OLTP warm backup): Oracle 8i, 0.75 TB JBOD
 •   Datawarehouse: Oracle 8i, 0.5 TB Hitachi + 1.5 TB MTI
 •   Log Importer (log import and processing): MySQL log importer, 1.2 TB SATA RAID
 •   Data Exporter (bulk selects / data exports): MySQL 4.1, 1.2 TB IDE RAID
Database Situation

 •   The problems:
     • The database is growing.
     • The OLTP and ODS/warehouse are too slow.
     • A lot of application code against the OLTP system.
     • Minimal application code against the ODS system.
 •   Oracle:
     • Licensed per processor.
     • Really, really, really expensive on a large scale.
 •   PostgreSQL:
     • No licensing costs.
     • Good support for complex queries.
Database Choices



    •   Must keep Oracle on OLTP
      •   Complex, Oracle-specific web application.
      •   Need more processors.
    •   ODS: Oracle not required.
      •   Complex queries from limited sources.
      •   Needs more space and power.
    •   Result:
      •   Move ODS Oracle licenses to OLTP
      •   Run PostgreSQL on ODS
PostgreSQL gotchas




    •   For an OLTP system that does thousands of
        updates per second, vacuuming is a hassle.

    •   No upgrades?!

    •   Less community experience with large
        databases.

    •   Replication features less evolved.
PostgreSQL ♥ ODS




   •   Mostly inserts.

   •   Updates/Deletes controlled, not real-time.

   •   pl/perl (leverage DBI/DBD for remote
       database connectivity).

   •   Monster queries.

   •   Extensible.
Choosing Linux



    •   Popular, liked, good community support.

    •   Chronic problems:

        •   kernel panics

        •   filesystems remounting read-only

        •   filesystems don’t support snapshots

        •   LVM is clunky on enterprise storage

        •   20 outages in 4 months
Choosing Solaris 10

     •   Switched to Solaris 10

         •   No crashes, better system-level tools.

             •   prstat, iostat, vmstat, smf, fault-
                 management.

         •   ZFS

             •   snapshots (persistent), BLI (block-level incremental) backups.

         •   Excellent support for enterprise storage.

         •   DTrace.

         •   Free (too).
Oracle features we need




     •   Partitioning

     •   Statistics and Aggregations

         •   rank over partition, lead, lag, etc.

     •   Large selects (100GB)

     •   Autonomous transactions

     •   Replication from Oracle (to Oracle)
Partitioning



    For large data sets:
     pgods=# select count(1) from ods.ods_tblpick_super;
        count
     ------------
      1790994512
     (1 row)




     • Next biggest tables: 850m, 650m, 590m
     • Allows us to cluster data over specific ranges
       (by date in our case)
     • Simple, cheap archiving and removal of data.
     • Can put ranges used less often in different
        tablespaces (slower, cheaper storage)
Partitioning PostgreSQL style



  •   PostgreSQL doesn’t support partitioning...

  •   It supports inheritance... (what’s this?)

      •   some crazy object-relational paradigm.

  •   We can use it to implement partitioning:

      •   One master table with no rows.

      •   Child tables that have our partition constraints.

      •   Rules on the master table for insert/update/delete.
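
The three pieces listed above can be sketched roughly as follows. This is a minimal illustration and not the production schema from the talk: the table pick_log, its columns, and the monthly ranges are hypothetical names chosen for the example. Only the insert rules are shown; update and delete statements issued against the master already reach the children through inheritance.

```sql
-- Hypothetical names throughout; not the tables from the talk.

-- Master table: defines the shape, holds no rows itself.
CREATE TABLE pick_log (
    pick_id   bigint    NOT NULL,
    userid    integer   NOT NULL,
    pick_date timestamp NOT NULL
);

-- Child tables: one per date range, each with a CHECK constraint
-- stating exactly which rows it may hold.
CREATE TABLE pick_log_2006_01 (
    CHECK (pick_date >= '2006-01-01' AND pick_date < '2006-02-01')
) INHERITS (pick_log);

CREATE TABLE pick_log_2006_02 (
    CHECK (pick_date >= '2006-02-01' AND pick_date < '2006-03-01')
) INHERITS (pick_log);

-- Rules on the master redirect each insert to the matching child.
CREATE RULE pick_log_insert_2006_01 AS
    ON INSERT TO pick_log
    WHERE (NEW.pick_date >= '2006-01-01' AND NEW.pick_date < '2006-02-01')
    DO INSTEAD
    INSERT INTO pick_log_2006_01
    VALUES (NEW.pick_id, NEW.userid, NEW.pick_date);

CREATE RULE pick_log_insert_2006_02 AS
    ON INSERT TO pick_log
    WHERE (NEW.pick_date >= '2006-02-01' AND NEW.pick_date < '2006-03-01')
    DO INSTEAD
    INSERT INTO pick_log_2006_02
    VALUES (NEW.pick_id, NEW.userid, NEW.pick_date);

-- Applications keep writing to the master table.
INSERT INTO pick_log VALUES (1, 42, '2006-01-15 10:00:00');
INSERT INTO pick_log VALUES (2, 42, '2006-02-03 09:30:00');

-- A query on the master sees the rows stored in every child;
-- ONLY restricts the query to the master itself.
SELECT count(*) FROM pick_log;
SELECT count(*) FROM ONLY pick_log;
```

A row whose date matches none of the rules would land in the master table itself, so in practice a new partition and its rule need to exist before data for that range arrives.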
Partitioning PostgreSQL realized

  •   Cheaply add new empty partitions

  •   Cheaply remove old partitions

  •   Migrate less-often-accessed partitions to slower
      storage

  •   Different index strategies per partition

  •   PostgreSQL 8.1 and later support constraint exclusion on
      inherited tables.
      •   smarter planning

      •   smarter executing
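
Each of the operations above maps to an ordinary DDL statement on a child table. The sketch below assumes a hypothetical master table pick_log with monthly children; the names and the tablespace slow_disks are illustrative and do not come from the talk. NO INHERIT needs PostgreSQL 8.2 or later.

```sql
-- Hypothetical master and two monthly children (illustrative names).
CREATE TABLE pick_log (
    pick_id   bigint    NOT NULL,
    pick_date timestamp NOT NULL
);
CREATE TABLE pick_log_2006_01 (
    CHECK (pick_date >= '2006-01-01' AND pick_date < '2006-02-01')
) INHERITS (pick_log);
CREATE TABLE pick_log_2006_02 (
    CHECK (pick_date >= '2006-02-01' AND pick_date < '2006-03-01')
) INHERITS (pick_log);

-- Cheaply add a new empty partition: just another child table.
CREATE TABLE pick_log_2006_03 (
    CHECK (pick_date >= '2006-03-01' AND pick_date < '2006-04-01')
) INHERITS (pick_log);

-- Per-partition index strategy: index only the partitions that need it.
CREATE INDEX pick_log_2006_03_date_idx ON pick_log_2006_03 (pick_date);

-- Cheaply remove an old partition: detach it so it can be archived,
-- then drop it. No rows in the other partitions are touched.
ALTER TABLE pick_log_2006_01 NO INHERIT pick_log;
DROP TABLE pick_log_2006_01;

-- Move a rarely read partition to slower storage. Left commented out
-- because the tablespace slow_disks is assumed, not created here.
-- ALTER TABLE pick_log_2006_02 SET TABLESPACE slow_disks;

-- Let the planner skip children whose CHECK constraint excludes them.
SET constraint_exclusion = on;
EXPLAIN
SELECT count(*)
  FROM pick_log
 WHERE pick_date >= '2006-03-10' AND pick_date < '2006-03-20';
```

With constraint exclusion enabled, the plan should list only the master and the March child; with it off, every child is scanned.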
RANK OVER PARTITION



     • In Oracle:
  select userid, email from (
    select u.userid, u.email,
           row_number() over
             (partition by u.email order by userid desc) as position
      from (...) u) where position = 1



     • In PostgreSQL:
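  FOR v_row IN
    select u.userid, u.email from (...) u order by email, userid desc
  LOOP
    -- Emit only the first row of each email group (the highest userid).
    -- IS DISTINCT FROM also matches on the first pass, when
    -- v_last_email is still NULL; a plain != would not.
    IF v_row.email IS DISTINCT FROM v_last_email THEN
      RETURN NEXT v_row;
      v_last_email := v_row.email;
      v_rownum := v_rownum + 1;
    END IF;
  END LOOP;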

                     With 8.4, we have windowing functions
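
For comparison, here is roughly what the same query looks like with the window functions available from PostgreSQL 8.4 on. The table users_demo and its rows are hypothetical stand-ins for the elided subquery.

```sql
-- Hypothetical sample data standing in for the elided subquery.
CREATE TEMP TABLE users_demo (userid integer, email text);
INSERT INTO users_demo VALUES (1, 'a@example.com');
INSERT INTO users_demo VALUES (2, 'a@example.com');
INSERT INTO users_demo VALUES (3, 'b@example.com');

-- Keep one row per email: the one with the highest userid.
-- PostgreSQL requires an alias on a subquery in FROM ("ranked").
SELECT userid, email
  FROM (SELECT u.userid, u.email,
               row_number() OVER (PARTITION BY u.email
                                  ORDER BY u.userid DESC) AS rn
          FROM users_demo u) ranked
 WHERE rn = 1
 ORDER BY email;
-- Returns userid 2 for a@example.com and userid 3 for b@example.com.
```

The alias rn is used instead of position only to steer clear of the SQL keyword of the same name.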
Large SELECTs


   • Application code does:
  select u.*, b.browser, l.lastmess
    from ods.ods_users u,
         ods.ods_browsers b,
         ( select userid, min(senddate) as senddate
              from ods.ods_maillog
          group by userid ) m,
         ods.ods_maillog l
   where u.userid = b.userid
     and u.userid = m.userid
     and u.userid = l.userid
     and l.senddate = m.senddate;




     •   The width of these rows is about 2 KB

     •   50 million row return set

     •   > 100 GB of data
The Large SELECT Problem


   •   libpq will buffer the entire result in memory.

       •   This affects language bindings (DBD::Pg).

       •   This is an utterly deficient default behavior.

   •   This can be avoided by using cursors

       •   Requires the app to be PostgreSQL specific.

       •   You open a cursor.

       •   Then FETCH the row count you desire.
Big SELECTs the Postgres way



  The previous “big” query becomes:
   DECLARE bigdump CURSOR FOR
   select u.*, b.browser, l.lastmess
     from ods.ods_users u,
          ods.ods_browsers b,
          ( select userid, min(senddate) as senddate
               from ods.ods_maillog
           group by userid ) m,
          ods.ods_maillog l
    where u.userid = b.userid
      and u.userid = m.userid
      and u.userid = l.userid
      and l.senddate = m.senddate;


  Then, in a loop:
   FETCH FORWARD 10000 FROM bigdump;
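
A complete session around such a cursor might look like the sketch below. A cursor declared without WITH HOLD only exists inside a transaction block, so the DECLARE and every FETCH sit between BEGIN and COMMIT. The cursor name bigdump_demo and the generate_series query are stand-ins for the real join.

```sql
-- Cursors declared without WITH HOLD live only inside a transaction.
BEGIN;

-- generate_series stands in for the large join; any SELECT works here.
DECLARE bigdump_demo CURSOR FOR
    SELECT n AS userid FROM generate_series(1, 25000) AS g(n);

-- The client repeats FETCH until a batch comes back short or empty.
-- With 25000 rows these return 10000, 10000, and 5000 rows.
FETCH FORWARD 10000 FROM bigdump_demo;
FETCH FORWARD 10000 FROM bigdump_demo;
FETCH FORWARD 10000 FROM bigdump_demo;

CLOSE bigdump_demo;
COMMIT;
```

The client only ever holds one batch in memory, so the batch size trades round trips against memory use.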
Autonomous Transactions



 • In Oracle we have over 2000 custom stored procedures.
 • During these procedures, we like to:
   • COMMIT incrementally
     Useful for long transactions (update/delete) that
     need not be atomic -- incremental COMMITs.

   • start a new top-level txn that can COMMIT
     Useful for logging progress in a stored procedure so
     that you know how far you progressed and how long
     each step took even if it rolls back.
PostgreSQL shortcoming




    •   PostgreSQL simply does not support
        Autonomous transactions and to quote core
        developers “that would be hard.”

    •   When in doubt, use brute force.

    •   Use pl/perl to use DBD::Pg to connect to
        ourselves (a new backend) and execute a new
        top-level transaction.
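
The brute-force approach can be sketched as a small untrusted PL/Perl function. This illustrates the idea and is not the code used in the talk: the function log_progress, the table proc_log, and the connection string are all assumptions made for the example.

```sql
-- Hypothetical log table; its rows must survive a rollback of the caller.
CREATE TABLE proc_log (
    logged_at timestamp NOT NULL,
    message   text      NOT NULL
);

-- plperlu (untrusted PL/Perl) is required because the function loads DBI.
CREATE OR REPLACE FUNCTION log_progress(text) RETURNS void AS $$
    use DBI;
    my ($message) = @_;
    # A second connection is a separate backend with its own transaction.
    # The DSN is an assumption; adjust dbname, host, and credentials.
    my $dbh = DBI->connect('dbi:Pg:dbname=pgods', '', '',
                           { AutoCommit => 1, RaiseError => 1 });
    $dbh->do('INSERT INTO proc_log (logged_at, message) VALUES (now(), ?)',
             undef, $message);
    $dbh->disconnect;
    return;
$$ LANGUAGE plperlu;

-- Usage: the log row remains even though the outer transaction rolls back.
BEGIN;
SELECT log_progress('step 1 finished');
ROLLBACK;
SELECT message FROM proc_log;
```

Because the insert runs in a different backend, it cannot see uncommitted changes made by the calling transaction, and it will block if it needs a lock the caller already holds. Opening a connection per call is also expensive, so a real implementation would likely cache the handle.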
Replication


  • Cross-vendor database replication isn’t too difficult.
  • Helps a lot when you can do it inside the database.
  • Using dbi-link (based on pl/perl and DBI) we can.
    • We can connect to any remote database.
    • INSERT into local tables directly from remote
      SELECT statements.
      [snapshots]
    • LOOP over remote SELECT statements and
      process them row-by-row.
      [replaying remote DML logs]
Replication (really)



 •   Through a combination of snapshotting and DML
     replay we:

     •   replicate over 2000 tables from Oracle into
         PostgreSQL

         •   snapshot replication of 200

         •   DML replay logs for 1800

 •   PostgreSQL to Oracle is a bit harder

     •   out-of-band export and imports
New Architecture


  •   Master: Sun v890 and Hitachi AMS + warm standby
      running Oracle
      (1TB)

  •   Logs: several custom machines
      running MySQL instances
      (2TB each)

  •   ODS BI: 2x Sun v40
      running PostgreSQL 8.3
      (6TB on Sun JBODs on ZFS each)

  •   ODS archive: 2x custom
      running PostgreSQL 8.3
      (14TB internal storage on ZFS each)
PostgreSQL is Lacking



  •   No upgrades (AYFKM).

  •   pg_dump is too intrusive.

  •   Poor system-level instrumentation.

  •   Poor methods to determine specific contention.

  •   It relies on the operating system’s filesystem cache.
      (which makes PostgreSQL inconsistent across its
      supported OS base)
Enter Solaris

 •   Solaris is a UNIX from Sun Microsystems.

 •   Is it different than other UNIX/UNIX-like systems?

     •   Mostly it isn’t different (hence the term UNIX)

     •   It does have extremely strong ABI backward
         compatibility.

     •   It’s stable and works well on large machines.

 •   Solaris 10 shakes things up a bit:
     •   DTrace

     •   ZFS

     •   Zones
Solaris / ZFS


     •   ZFS: Zettabyte Filesystem.

         •   2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem,
             2^78 bytes (256 ZiB) in a pool, 2^64 devices/pool, 2^64 pools/system

     •   Extremely cheap differential backups.

         •   I have a 5 TB database, I need a backup!

     •   No rollback in your database? What is this? MySQL?

     •   No rollback in your filesystem?

         •   ZFS has snapshots, rollback, clone and promote.

         •   OMG! Life altering features.

     •   Caveat: ZFS is slower than alternatives, by about 10% with tuning.
Solaris / Zones



     •   Zones: Virtual Environments.

     •   Shared kernel.

     •   Can share filesystems.

     •   Segregated processes and privileges.

     •   No big deal for databases, right?


                          But Wait!
Solaris / ZFS + Zones = Magic Juju
    https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh

•   ZFS snapshot, clone, delegate to zone, boot and run.

•   When done, halt zone, destroy clone.

•   We get a point-in-time copy of our PostgreSQL database:

    •   read-write,

    •   low disk-space requirements,

    •   NO LOCKS! Welcome back pg_dump,
        you don’t suck (as much) anymore.

    •   Fast snapshot to usable copy time:

        •   On our 20 GB database: 1 minute.

        •   On our 1.2 TB database: 2 minutes.
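
The snapshot, clone, delegate, boot sequence above can be sketched as a command plan. This is an illustration under stated assumptions, not the contents of `clonedb_startclone.sh`: the dataset, clone, snapshot and zone names are hypothetical, the plan is only printed, and it assumes the data directory and WAL live in one dataset so that a single snapshot is crash-consistent.

```python
"""Dry-run sketch of the ZFS + Zones clone workflow; commands are built, not run."""
from typing import List

Command = List[str]


def start_clone_plan(dataset: str, snap: str, clone: str, zone: str) -> List[Command]:
    return [
        # 1. Freeze a point-in-time image of the live database dataset.
        ["zfs", "snapshot", f"{dataset}@{snap}"],
        # 2. Create a writable clone; it shares unchanged blocks with the origin,
        #    which is why the copy needs little extra disk space.
        ["zfs", "clone", f"{dataset}@{snap}", clone],
        # 3. Delegate the clone to the zone (takes effect at the next zone boot).
        ["zonecfg", "-z", zone, f"add dataset; set name={clone}; end"],
        # 4. Boot the zone; PostgreSQL inside it runs against the clone.
        ["zoneadm", "-z", zone, "boot"],
    ]


def stop_clone_plan(clone: str, zone: str) -> List[Command]:
    # "When done, halt zone, destroy clone."
    return [
        ["zoneadm", "-z", zone, "halt"],
        ["zfs", "destroy", clone],
    ]


if __name__ == "__main__":
    plan = start_clone_plan("tank/pgdata", "dump1", "tank/pgclone", "pgclone")
    for cmd in plan + stop_clone_plan("tank/pgclone", "pgclone"):
        print(" ".join(cmd))
```

Running `pg_dump` against the PostgreSQL instance inside the zone then takes no locks on the production database, since it never touches it.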
ZFS: how I saved my soul.

 •   Database crash. Bad. 1.2 TB of data... busted.
     The reason Robert Treat looks a bit older than he
     should.

 •   xlogs corrupted. catalog indexes corrupted.

 •   Fault? PostgreSQL bug? Bad memory? Who knows?

 •   Trial & error on a 1.2 TB data set is a cruel experience.

     •   In real life, most recovery actions are destructive
         actions.

     •   PostgreSQL is no different.

 •   Rollback to last checkpoint (ZFS), hack postgres code,
     try, fail, repeat.
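
The "rollback, try, fail, repeat" loop above can be sketched as follows. It is a hypothetical illustration, not the procedure actually used: `attempts` and `rollback` are placeholder callables, and in the real case `rollback` would be a ZFS rollback of the database dataset and each attempt a patched postgres binary or a manual repair step.

```python
"""Sketch: make destructive recovery attempts safe by rolling back after each failure."""
from typing import Callable, Iterable, Optional, Tuple


def recover(
    attempts: Iterable[Tuple[str, Callable[[], bool]]],
    rollback: Callable[[], None],
) -> Optional[str]:
    """Try each repair in order; return the name of the first that succeeds.

    After every failed attempt the data is restored to its pre-attempt state,
    so a destructive repair never poisons the next one.
    """
    for name, attempt in attempts:
        try:
            if attempt():
                return name
        except Exception:
            pass  # an attempt that blows up counts as a failure
        rollback()
    return None


if __name__ == "__main__":
    rollbacks = []
    winner = recover(
        [("reset xlog", lambda: False), ("rebuild catalog indexes", lambda: True)],
        lambda: rollbacks.append("rolled back"),
    )
    print(winner, len(rollbacks))
```

Without a filesystem-level rollback, each failed attempt on a 1.2 TB data set would mean restoring from backup before the next try.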
Let DTrace open your eyes

•   DTrace: Dynamic Tracing

•   Dynamically instrument “stuff” in the system:
    •   system calls (like strace/truss/ktrace).

    •   process/scheduler activity (on/off cpu, semaphores, conditions).

    •   see signals sent and received.

    •   trace kernel functions, networking.

    •   watch I/O down to the disk.

    •   user-space processes, each function... each machine instruction!

    •   Add probes into apps where it makes sense to you.
Can you see what I see?

     •   There is EXPLAIN... when that isn’t enough...

     •   There is EXPLAIN ANALYZE... when that isn’t enough.

     •   There is DTrace.

         ; dtrace -q -n '
         postgresql*:::statement-start
         {
            self->query = copyinstr(arg0);
            self->ok=1;
         }
         io:::start
         /self->ok/
         {
            @[self->query,
              args[0]->b_flags & B_READ ? "read" : "write",
              args[1]->dev_statname] = sum(args[0]->b_bcount);
         }'
         dtrace: description 'postgres*:::statement-start' matched 14 probes
         ^C

         select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
             read sd1 16384
         select division, sum(amount), avg(amount) from ods.billings where txn_timestamp
         between '2006-01-01 00:00:00' and '2006-04-01 00:00:00' group by division;
             read sd2 71647232
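
Output like the above is easier to compare once it is folded into one number per query. A minimal sketch, assuming the layout shown on this slide (the query text on one or more lines, followed by an indented `direction device bytes` line); the function name `bytes_per_query` is made up for illustration.

```python
"""Sketch: fold the DTrace aggregation output above into bytes per (query, direction, device)."""
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches lines such as "    read sd1 16384".
IO_LINE = re.compile(r"^\s*(read|write)\s+(\S+)\s+(\d+)\s*$")


def bytes_per_query(output: str) -> Dict[Tuple[str, str, str], int]:
    totals: Dict[Tuple[str, str, str], int] = defaultdict(int)
    pending: List[str] = []  # lines of the query text seen so far
    for line in output.splitlines():
        stripped = line.strip()
        # Skip blank lines, dtrace's own banner, and the interrupt marker.
        if not stripped or stripped.startswith("dtrace:") or stripped == "^C":
            continue
        match = IO_LINE.match(line)
        if match:
            direction, device, nbytes = match.groups()
            totals[(" ".join(pending), direction, device)] += int(nbytes)
            pending = []
        else:
            pending.append(stripped)
    return dict(totals)


if __name__ == "__main__":
    sample = """
    select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
        read sd1 16384
    """
    for (query, direction, device), nbytes in bytes_per_query(sample).items():
        print(f"{nbytes:>12} bytes {direction} on {device}: {query}")
```

Sorting the resulting dictionary by value gives a quick list of the queries responsible for the most physical I/O.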
OmniTI Labs / pgsoltools

       •    https://labs.omniti.com/trac/pgsoltools

           •    Where we stick our PostgreSQL on Solaris goodies...

           •    like pg_file_stress


             FILENAME/DBOBJECT                              READS                    WRITES
                                                  #   min    avg    max      #   min    avg   max
  alldata1__idx_remove_domain_external            1    12     12     12    398     0      0     0
  slowdata1__pg_rewrite                           1    12     12     12      0     0      0     0
  slowdata1__pg_class_oid_index                   1     0      0      0      0     0      0     0
  slowdata1__pg_attribute                         2     0      0      0      0     0      0     0
  alldata1__mv_users                              0     0      0      0      4     0      0     0
  slowdata1__pg_statistic                         1     0      0      0      0     0      0     0
  slowdata1__pg_index                             1     0      0      0      0     0      0     0
  slowdata1__pg_index_indexrelid_index            1     0      0      0      0     0      0     0
  alldata1__remove_domain_external                0     0      0      0    502     0      0     0
  alldata1__promo_15_tb_full_2                   19     0      0      0     11     0      0     0
  slowdata1__pg_class_relname_nsp_index           2     0      0      0      0     0      0     0
  alldata1__promo_177intaoltest_tb                0     0      0      0   1053     0      0     0
  slowdata1__pg_attribute_relid_attnum_index      2     0      0      0      0     0      0     0
  alldata1__promo_15_tb_full_2_pk                 2     0      0      0      0     0      0     0
  alldata1__all_mailable_2                     1403     0      0    423      0     0      0     0
  alldata1__mv_users_pkey                         0     0      0      0      4     0      0     0
Results




    •     Move ODS Oracle licenses to OLTP

    •     Run PostgreSQL on ODS

    •     Save $800k in license costs.

    •     Spend $100k in labor costs.

    •     Learn a lot.
Thanks!



    •   Thank you.

    •   http://omniti.com/does/postgresql

    •   We’re hiring, but only if you love:

        •   lots of data on lots of disks on lots of big boxes

        •   smart people

        •   hard problems

        •   more than one database technology (including PostgreSQL)

        •   responsibility

Big Bad PostgreSQL @ Percona

  • 1. Big Bad PostgreSQL: A Case Study Moving a “large,” “complicated,” and mission-critical datawarehouse from Oracle to PostgreSQL for cost control. 1
  • 2. About the Speaker • Principal @ OmniTI S32699X_Scalable_Internet.qxd 6/23/06 3:31 PM Page 1 Scalable Internet Architectures • Open Source Theo Schlossnagle Scalable Internet Architectures With an estimated one billion users worldwide, the Internet today is nothing less than a global subculture with immense diversity, incredible size, and wide geographic reach. With a relatively low barrier to entry, almost anyone can register a domain name today and potentially provide services to people around the entire world tomorrow. But easy entry to web-based commerce and services can be a double-edged sword. In such a market, it is typically much harder to gauge interest in advance, and the negative impact of unexpected customer traffic can turn out to be devastating for the unprepared. mod_backhand, spreadlogd, In Scalable Internet Architectures, renowned software engineer and architect Theo Schlossnagle outlines the steps and processes organizations can follow to build online services that can scale well with demand—both quickly and economically. By making intelligent decisions throughout the evolution of an architecture, scalability can be a matter Scalable Internet of engineering rather than redesign, costly purchasing, or black magic. OpenSSH+SecurID, Daiquiri, Filled with numerous examples, anecdotes, and lessons gleaned from the author’s years of experience building large-scale Internet services, Scalable Internet Architectures is both thought-provoking and instructional. Readers are challenged to understand first, before they Architectures start a large project, how what they are building will be used, so that from the beginning they can design for scalability those parts which need to scale. With the right approach, it should take no more effort to design and implement a solution that scales than it takes Wackamole, libjlog, Spread, to build something that will not—and if this is the case, Schlossnagle writes, respect yourself and build it right. 
Theo Schlossnagle is a principal at OmniTI Computer Consulting, where he provides expert consulting services related to scalable Internet architectures, database replication, Reconnoiter, etc. and email infrastructure. He is the creator of the Backhand Project and the Ecelerity MTA, and spends most of his time solving the scalability problems that arise in high performance and highly distributed systems. Internet/Programming Cover image © Digital Vision/Getty Images • Closed Source Schlossnagle Scalability $49.99 USA / $61.99 CAN / £35.99 Net UK Performance Security www.omniti.com DEVELOPER’S LIBRARY DEVELOPER’S www.developers-library.com LIBRARY Message Systems MTA, Message Central • Author Scalable Internet Architectures
  • 3. Overall Architecture OLTP instance: Oracle 8i drives the site 0.5 TB 0.25 TB Hitachi JBOD OLTP Log import and Oracle 8i processing Oracle 8i 0.75 TB JBOD MySQL log importer 0.5 TB 1.5 TB Hitachi MTI OLTP warm backup Warm spare 1.2 TB SATA Datawarehouse RAID Log Importer MySQL 4.1 1.2 TB IDE RAID Data Exporter bulk selects / data exports
  • 4. Overall Architecture OLTP instance: Oracle 8i drives the site 0.5 TB 0.25 TB Hitachi JBOD OLTP Log import and Oracle 8i processing Oracle 8i 0.75 TB JBOD MySQL log importer 0.5 TB 1.5 TB Hitachi MTI OLTP warm backup Warm spare 1.2 TB SATA Datawarehouse RAID Log Importer MySQL 4.1 1.2 TB IDE RAID Data Exporter bulk selects / data exports
  • 6. Database Situation • The problems: • The database is growing. • The OLTP and ODS/warehouse are too slow. • A lot of application code against the OLTP system. • Minimal application code against the ODS system.
  • 7. Database Situation • The problems: • The database is growing. • The OLTP and ODS/warehouse are too slow. • A lot of application code against the OLTP system. • Minimal application code against the ODS system. • Oracle: • Licensed per processor. • Really, really, really expensive on a large scale.
  • 8. Database Situation • The problems: • The database is growing. • The OLTP and ODS/warehouse are too slow. • A lot of application code against the OLTP system. • Minimal application code against the ODS system. • Oracle: • Licensed per processor. • Really, really, really expensive on a large scale. • PostgreSQL: • No licensing costs. • Good support for complex queries.
  • 12. Database Choices • Must keep Oracle on OLTP • Complex, Oracle-specific web application. • Need more processors. • ODS: Oracle not required. • Complex queries from limited sources. • Needs more space and power. • Result: • Move ODS Oracle licenses to OLTP • Run PostgreSQL on ODS
  • 17. PostgreSQL gotchas • For an OLTP system that does thousands of updates per second, vacuuming is a hassle. • No upgrades?! • Less community experience with large databases. • Replication features less evolved.
  • 23. PostgreSQL ♥ ODS • Mostly inserts. • Updates/Deletes controlled, not real-time. • pl/perl (leverage DBI/DBD for remote database connectivity). • Monster queries. • Extensible.
  • 31. Choosing Linux • Popular, liked, good community support. • Chronic problems: • kernel panics • filesystems remounting read-only • filesystems don’t support snapshots • LVM is clunky on enterprise storage • 20 outages in 4 months
  • 40. Choosing Solaris 10 • Switched to Solaris 10 • No crashes, better system-level tools. • prstat, iostat, vmstat, smf, fault-management. • ZFS • snapshots (persistent), BLI backups. • Excellent support for enterprise storage. • DTrace. • Free (too).
  • 47. Oracle features we need • Partitioning • Statistics and Aggregations • rank over partition, lead, lag, etc. • Large selects (100GB) • Autonomous transactions • Replication from Oracle (to Oracle)
  • 54. Partitioning For large data sets: pgods=# select count(1) from ods.ods_tblpick_super; count ------------ 1790994512 (1 row) • Next biggest tables: 850m, 650m, 590m • Allows us to cluster data over specific ranges (by date in our case) • Simple, cheap archiving and removal of data. • Can put ranges used less often in different tablespaces (slower, cheaper storage)
  • 62. Partitioning PostgreSQL style • PostgreSQL doesn’t support partitioning... • It supports inheritance... (what’s this?) • some crazy object-relational paradigm. • We can use it to implement partitioning: • One master table with no rows. • Child tables that have our partition constraints. • Rules on the master table for insert/update/delete.
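The three pieces above (an empty master, constrained children, and rules on the master) can be sketched by generating the DDL for one monthly partition. This is a minimal sketch, not the schema from the talk: the table `ods.events`, the column `event_date`, and the helper name `month_partition_ddl` are hypothetical, and Python is used only to build the statement text.

```python
from datetime import date
from typing import List


def month_partition_ddl(master: str, column: str, year: int, month: int) -> List[str]:
    """Build DDL for one monthly child table and the insert rule that feeds it.

    `master` and `column` are hypothetical names chosen for illustration.
    """
    start = date(year, month, 1)
    # Exclusive upper bound: first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    suffix = f"{start:%Y_%m}"
    child = f"{master}_{suffix}"
    # Rule names are unqualified, so drop any schema prefix.
    rule = f"{master.split('.')[-1]}_insert_{suffix}"

    def in_range(prefix: str = "") -> str:
        return (f"{prefix}{column} >= DATE '{start}' "
                f"AND {prefix}{column} < DATE '{end}'")

    return [
        # Child table: inherits the master's columns, adds the range constraint.
        f"CREATE TABLE {child} (CHECK ({in_range()})) INHERITS ({master});",
        # Rule on the master: redirect matching inserts into the child.
        f"CREATE RULE {rule} AS ON INSERT TO {master} "
        f"WHERE ({in_range('NEW.')}) "
        f"DO INSTEAD INSERT INTO {child} VALUES (NEW.*);",
    ]


for statement in month_partition_ddl("ods.events", "event_date", 2008, 12):
    print(statement)
```

Dropping an old month is then a plain DROP TABLE on the child, which is what makes archiving and removal cheap.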
  • 70. Partitioning PostgreSQL realized • Cheaply add new empty partitions • Cheaply remove old partitions • Migrate less-often-accessed partitions to slower storage • Different index strategies per partition • PostgreSQL 8.1 and later supports constraint checking (constraint exclusion) on inherited tables. • smarter planning • smarter executing
  • 73. RANK OVER PARTITION • In Oracle: select userid, email from ( select u.userid, u.email, row_number() over (partition by u.email order by userid desc) as position from (...)) where position = 1 • In PostgreSQL: FOR v_row IN select u.userid, u.email from (...) order by email, userid desc LOOP IF v_row.email IS DISTINCT FROM v_last_email THEN RETURN NEXT v_row; v_last_email := v_row.email; v_rownum := v_rownum + 1; END IF; END LOOP; • With 8.4, we have windowing functions
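The loop on this slide emulates "row_number() ... = 1" by scanning rows in sorted order and emitting a row only when the partition key changes. A minimal sketch of that same logic outside the database, assuming rows are `(userid, email)` tuples; the function name and sample data are hypothetical.

```python
def first_row_per_email(rows):
    """Return the row with the highest userid for each email address."""
    # Same ordering as the slide: ORDER BY email, userid DESC.
    ordered = sorted(rows, key=lambda r: (r[1], -r[0]))
    result = []
    # Sentinel that never equals a real email, so the first row is always kept.
    last_email = object()
    for userid, email in ordered:
        if email != last_email:
            result.append((userid, email))
            last_email = email
    return result


rows = [
    (1, "a@example.com"),
    (7, "b@example.com"),
    (3, "a@example.com"),
    (2, "b@example.com"),
]
winners = first_row_per_email(rows)
print(winners)
```

The sentinel plays the role that a null-safe comparison plays in the stored procedure: a plain inequality against an unset "last" value would never be true.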
  • 78. Large SELECTs • Application code does: select u.*, b.browser, l.lastmess from ods.ods_users u, ods.ods_browsers b, ( select userid, min(senddate) as senddate from ods.ods_maillog group by userid ) m, ods.ods_maillog l where u.userid = b.userid and u.userid = m.userid and u.userid = l.userid and l.senddate = m.senddate; • The width of these rows is about 2k • 50 million row return set • > 100 GB of data
  • 79. The Large SELECT Problem • libpq will buffer the entire result in memory. • This affects language bindings (DBD::Pg). • This is an utterly deficient default behavior. • This can be avoided by using cursors • Requires the app to be PostgreSQL specific. • You open a cursor. • Then FETCH the row count you desire.
  • 82. Big SELECTs the Postgres way The previous “big” query becomes (inside a transaction): BEGIN; DECLARE bigdump CURSOR FOR select u.*, b.browser, l.lastmess from ods.ods_users u, ods.ods_browsers b, ( select userid, min(senddate) as senddate from ods.ods_maillog group by userid ) m, ods.ods_maillog l where u.userid = b.userid and u.userid = m.userid and u.userid = l.userid and l.senddate = m.senddate; Then, in a loop until no rows come back: FETCH FORWARD 10000 FROM bigdump;
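On the application side, the fetch loop amounts to pulling fixed-size batches until an empty one comes back. A minimal sketch using Python's DB-API `fetchmany`; SQLite stands in for PostgreSQL only so the example is self-contained, and the `users` table and the helper name `fetch_in_batches` are hypothetical. Against PostgreSQL the driver would also have to use a server-side cursor (the DECLARE / FETCH pair above), otherwise the client still buffers the whole result.

```python
import sqlite3


def fetch_in_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the result set is exhausted."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users (userid) VALUES (?)",
                 [(i,) for i in range(25)])

cur = conn.execute("SELECT userid FROM users ORDER BY userid")
# Only one batch of rows is held by the application at a time.
batch_sizes = [len(batch) for batch in fetch_in_batches(cur, 10)]
print(batch_sizes)
```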
  • 87. Autonomous Transactions • In Oracle we have over 2000 custom stored procedures. • During these procedures, we like to: • COMMIT incrementally: useful for long transactions (update/delete) that need not be atomic -- incremental COMMITs. • start a new top-level txn that can COMMIT: useful for logging progress in a stored procedure so that you know how far you progressed and how long each step took even if it rolls back.
  • 91. PostgreSQL shortcoming • PostgreSQL simply does not support Autonomous transactions and to quote core developers “that would be hard.” • When in doubt, use brute force. • Use pl/perl to use DBD::Pg to connect to ourselves (a new backend) and execute a new top-level transaction.
  • 98. Replication • Cross vendor database replication isn’t too difficult. • Helps a lot when you can do it inside the database. • Using dbi-link (based on pl/perl and DBI) we can. • We can connect to any remote database. • INSERT into local tables directly from remote SELECT statements. [snapshots] • LOOP over remote SELECT statements and process them row-by-row. [replaying remote DML logs]
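The two modes named above, snapshot (INSERT from a remote SELECT) and DML replay (loop over a remote log row by row), can be sketched with two connections. This is an illustration under stated assumptions, not the dbi-link setup from the talk: two in-memory SQLite databases stand in for Oracle and PostgreSQL, and the `dml_log` layout (`seq`, `op`, `userid`, `email`) is hypothetical.

```python
import sqlite3

remote = sqlite3.connect(":memory:")  # stands in for the Oracle source
local = sqlite3.connect(":memory:")   # stands in for the PostgreSQL ODS

remote.executescript("""
    CREATE TABLE users (userid INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users VALUES (1, 'a@example.com');
    INSERT INTO users VALUES (2, 'b@example.com');
    -- Hypothetical change log; in this toy it holds changes made after the snapshot.
    CREATE TABLE dml_log (seq INTEGER PRIMARY KEY, op TEXT, userid INTEGER, email TEXT);
    INSERT INTO dml_log VALUES (1, 'I', 3, 'c@example.com');
    INSERT INTO dml_log VALUES (2, 'U', 1, 'a2@example.com');
    INSERT INTO dml_log VALUES (3, 'D', 2, NULL);
""")
local.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, email TEXT)")


def snapshot():
    """Replace the local table with the result of a remote SELECT."""
    local.execute("DELETE FROM users")
    local.executemany("INSERT INTO users VALUES (?, ?)",
                      remote.execute("SELECT userid, email FROM users"))


def replay():
    """Apply logged remote changes to the local table, one row at a time, in order."""
    log = remote.execute("SELECT op, userid, email FROM dml_log ORDER BY seq")
    for op, userid, email in log:
        if op == "I":
            local.execute("INSERT INTO users VALUES (?, ?)", (userid, email))
        elif op == "U":
            local.execute("UPDATE users SET email = ? WHERE userid = ?", (email, userid))
        elif op == "D":
            local.execute("DELETE FROM users WHERE userid = ?", (userid,))


snapshot()
replay()
final_rows = local.execute("SELECT userid, email FROM users ORDER BY userid").fetchall()
print(final_rows)
```

Snapshotting suits small or rarely changing tables; replay moves only the changes, which matters more as tables grow.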
  • 101. Replication (really) • Through a combination of snapshotting and DML replay we: • replicate into over 2000 tables in PostgreSQL from Oracle • snapshot replication of 200 • DML replay logs for 1800 • PostgreSQL to Oracle is a bit harder • out-of-band export and imports
  • 102. New Architecture • Master: Sun v890 and Hitachi AMS + warm standby running Oracle (1 TB) • Logs: several custom machines running MySQL instances (2 TB each) • ODS BI: 2x Sun v40 running PostgreSQL 8.3 (6 TB on Sun JBODs on ZFS each) • ODS archive: 2x custom machines running PostgreSQL 8.3 (14 TB internal storage on ZFS each)
  • 103. PostgreSQL is Lacking • No upgrades (AYFKM). • pg_dump is too intrusive. • Poor system-level instrumentation. • Poor methods to determine specific contention. • It relies on the operating system’s filesystem cache (which makes PostgreSQL inconsistent across its supported OS base).
  • 104. Enter Solaris • Solaris is a UNIX from Sun Microsystems. • Is it different than other UNIX/UNIX-like systems? • Mostly it isn’t different (hence the term UNIX) • It does have extremely strong ABI backward compatibility. • It’s stable and works well on large machines. • Solaris 10 shakes things up a bit: • DTrace • ZFS • Zones
  • 105. Solaris / ZFS • ZFS: Zettabyte Filesystem. • 2^64 snapshots, 2^48 files/directory, 2^64 bytes/filesystem, 2^78 bytes (256 ZiB) in a pool, 2^64 devices/pool, 2^64 pools/system • Extremely cheap differential backups. • I have a 5 TB database, I need a backup! • No rollback in your database? What is this? MySQL? • No rollback in your filesystem? • ZFS has snapshots, rollback, clone and promote. • OMG! Life altering features. • Caveat: ZFS is slower than alternatives, by about 10% with tuning.
  • 106. Solaris / Zones • Zones: Virtual Environments. • Shared kernel. • Can share filesystems. • Segregated processes and privileges. • No big deal for databases, right? But Wait!
  • 107. Solaris / ZFS + Zones = Magic Juju https://labs.omniti.com/trac/pgsoltools/browser/trunk/pitr_clone/clonedb_startclone.sh • ZFS snapshot, clone, delegate to zone, boot and run. • When done, halt zone, destroy clone. • We get a point-in-time copy of our PostgreSQL database: • read-write, • low disk-space requirements, • NO LOCKS! Welcome back pg_dump, you don’t suck (as much) anymore. • Fast snapshot to usable copy time: • On our 20 GB database: 1 minute. • On our 1.2 TB database: 2 minutes.
  • 108. ZFS: how I saved my soul. • Database crash. Bad. 1.2 TB of data... busted. The reason Robert Treat looks a bit older than he should. • xlogs corrupted. catalog indexes corrupted. • Fault? PostgreSQL bug? Bad memory? Who knows? • Trial & error on a 1.2 TB data set is a cruel experience. • In real life, most recovery actions are destructive actions. • PostgreSQL is no different. • Rollback to last checkpoint (ZFS), hack postgres code, try, fail, repeat.
  • 109. Let DTrace open your eyes • DTrace: Dynamic Tracing • Dynamically instrument “stuff” in the system: • system calls (like strace/truss/ktrace). • process/scheduler activity (on/off cpu, semaphores, conditions). • see signals sent and received. • trace kernel functions, networking. • watch I/O down to the disk. • user-space processes, each function... each machine instruction! • Add probes into apps where it makes sense to you.
  • 110. Can you see what I see? • There is EXPLAIN... when that isn’t enough... • There is EXPLAIN ANALYZE... when that isn’t enough. • There is DTrace.
    ; dtrace -q -n '
    postgresql*:::statement-start
    {
      self->query = copyinstr(arg0);
      self->ok = 1;
    }
    io:::start
    /self->ok/
    {
      @[self->query,
        args[0]->b_flags & B_READ ? "read" : "write",
        args[1]->dev_statname] = sum(args[0]->b_bcount);
    }'
    dtrace: description 'postgres*:::statement-start' matched 14 probes
    ^C
    select count(1) from c2w_ods.tblusers where zipcode between 10000 and 11000;
      read sd1 16384
    select division, sum(amount), avg(amount) from ods.billings where txn_timestamp between '2006-01-01 00:00:00' and '2006-04-01 00:00:00' group by division;
      read sd2 71647232
  • 111. OmniTI Labs / pgsoltools • https://labs.omniti.com/trac/pgsoltools • Where we stick our PostgreSQL on Solaris goodies... • like pg_file_stress
    FILENAME/DBOBJECT                           READS                 WRITES
                                                    #  min  avg  max      #  min  avg  max
    alldata1__idx_remove_domain_external            1   12   12   12    398    0    0    0
    slowdata1__pg_rewrite                           1   12   12   12      0    0    0    0
    slowdata1__pg_class_oid_index                   1    0    0    0      0    0    0    0
    slowdata1__pg_attribute                         2    0    0    0      0    0    0    0
    alldata1__mv_users                              0    0    0    0      4    0    0    0
    slowdata1__pg_statistic                         1    0    0    0      0    0    0    0
    slowdata1__pg_index                             1    0    0    0      0    0    0    0
    slowdata1__pg_index_indexrelid_index            1    0    0    0      0    0    0    0
    alldata1__remove_domain_external                0    0    0    0    502    0    0    0
    alldata1__promo_15_tb_full_2                   19    0    0    0     11    0    0    0
    slowdata1__pg_class_relname_nsp_index           2    0    0    0      0    0    0    0
    alldata1__promo_177intaoltest_tb                0    0    0    0   1053    0    0    0
    slowdata1__pg_attribute_relid_attnum_index      2    0    0    0      0    0    0    0
    alldata1__promo_15_tb_full_2_pk                 2    0    0    0      0    0    0    0
    alldata1__all_mailable_2                     1403    0    0  423      0    0    0    0
    alldata1__mv_users_pkey                         0    0    0    0      4    0    0    0
  • 117. Results • Move ODS Oracle licenses to OLTP • Run PostgreSQL on ODS • Save $800k in license costs. • Spend $100k in labor costs. • Learn a lot.
  • 118. Thanks! • Thank you. • http://omniti.com/does/postgresql • We’re hiring, but only if you love: • lots of data on lots of disks on lots of big boxes • smart people • hard problems • more than one database technology (including PostgreSQL) • responsibility

Editor's Notes