
Greenplum: A Pivotal Moment on Wall Street
Greenplum Summit at PostgresConf US 2018
Howard Goldberg, Head of Greenplum Engineering, Morgan Stanley


  1. Greenplum: A Pivotal Moment on Wall Street. Howard Goldberg, 2018
  2. GP Architecture and High Availability
     • Master server data protection: replicated logs for master server failures (standby master)
     • Segment server data protection: mirrored segments for server failures; RAID protection for drive failures (a health-check sketch follows below)
     • Network protection: dual 25 GigE
     • MS configuration: hot/hot dual-ETL configuration is a huge win; segment-host building block of 20 or 40 hosts using S/M/L building blocks
     [Diagram: master and standby master in front of a row of segment hosts]
     • Building blocks (per host, all with 25 GigE and 45 TB):
       - Small: 24 cores, 348 GB memory
       - Medium: 48 cores, 750 GB memory
       - Large: 56 cores, 1,500 GB memory
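     A quick way to verify the mirroring described above, as a sketch against the standard gp_segment_configuration catalog (present in GP 4.3/5.x):

       -- Each content ID should show one primary (role 'p') and one mirror ('m'),
       -- all with status 'u' (up); a 'd' means gprecoverseg is needed.
       SELECT content, role, preferred_role, status, hostname
       FROM gp_segment_configuration
       ORDER BY content, role;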
  3. Our Greenplum Journey
     • 4.2.4 - 4.2.8 (5 releases): 2014 - 2015, now EOL at MS
     • 4.3.2 - 4.3.22 (21+ releases): 2015 - 2017+, a release every ~2 months
     • 5.0 - 5.6 (5+ releases): 2017 - 2018+
     • Benefits from being on the latest release:
       - Outstanding issues are fixed in a very agile manner
       - We receive rollup fixes for issues not discovered at Morgan Stanley
       - Stability is improved
       - New releases provide enhanced functionality
       - The Orca optimizer is constantly improving
       - Experimental features such as gpbackup, recursive queries, and the gpload external-table check
  4. Greenplum By the Numbers
     [Charts: instance counts by version (test/qa/prod on 4.3.22+ vs. 5.6+); size in TB by instance (fdw, WM, capo/viper, Risk, MiFid, etsdb, Athena, ranging from 75 TB to 1,000 TB); PROD environment counts; PROD cumulative usable storage growing from 2 PB in 2014 to 40 PB in 2018]
     1. 24 prod instances
     2. 2 offerings: 20 or 40 hosts
     3. 600+ hosts, 13k+ cores, 81 PB of storage and growing
     4. 2.5 PB (or 25 PB raw, using a 10x compression ratio)
  5. Morgan Stanley Greenplum Use Cases
     • Deep archives: large loads and light query. Cheaper, faster archives with analytical capabilities! (Athena, Etsdb, PetsDB, MiFid, Oedm, ...)
     • SQL and in-database analytics: concurrent loads and high query volume with seconds-to-minutes SLAs (Risk, FDW)
     • Micro-batching: very high concurrent loads and queries with seconds-to-minutes SLAs (Viper)
     • Analytics workbench / predictive modeling: ad hoc loads and ad hoc queries with minutes-to-hours SLAs (Wealth Management)
  6. New Regulations
     • New government regulations are increasing data management controls and driving data retention
     • MiFID (Markets in Financial Instruments Directive): designed to offer greater protection for investors and inject more transparency into all asset classes, from equities to fixed income, exchange-traded funds, and foreign exchange. Institutions will have to report more information about most trades, including price and volume.
     • GDPR (General Data Protection Regulation): seeks to create a harmonized data-protection law framework across the EU and aims to give citizens back control of their personal data, while imposing strict rules on those hosting and 'processing' this data anywhere in the world. This will increase data retention requirements for our data stores.
  7. Lessons Learned (1 of 2)
     • Go BIG...
       - 50 TB minimum entry point
       - More MEMORY is better for everything!
     • Catalog bloat
       - Occurs because of MVCC and frequent object creation and destruction
       - Can significantly degrade performance for gpload and analyze utilities
       - Can cause indexes to become larger than their related tables; reindex when this happens
       - Mitigate: minimize external table creation (use gpload with the reuse_tables / staging options, or gpfdist); vacuum/analyze the catalog frequently - at MS we do so 12 times a day (a sketch follows below)
     • Number of files matters
       - File-count drivers: #segments (primaries + mirrors) * #range partitions * #columns in AO tables
       - One database at Morgan Stanley has 600k objects and 220 million files per host!
       - Effect: slower backups, restores, gprecoverseg, and space analysis
       - Mitigate: control the number of range partitions (avoid multi-level partitioning and decrease partition granularity for historical ranges); rebuild tables that have high concurrent insert or load activity
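     A minimal sketch of the catalog maintenance mentioned above; the table list and the arithmetic are illustrative, not the exact MS job:

       -- Vacuum/analyze the catalog tables that bloat most under frequent
       -- object creation/destruction (MS runs catalog maintenance ~12x/day).
       VACUUM ANALYZE pg_catalog.pg_class;
       VACUUM ANALYZE pg_catalog.pg_attribute;
       VACUUM ANALYZE pg_catalog.pg_type;
       VACUUM ANALYZE pg_catalog.pg_depend;

       -- File-count driver from the slide, with hypothetical numbers:
       -- (primaries + mirrors) * range partitions * AO columns, e.g.
       -- 128 segment files * 100 partitions * 200 columns = 2,560,000 files for one table.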
  8. Lessons Learned (2 of 2)
     • Vacuum user tables
       - User tables must be vacuumed to free dead rows and avoid TXID wraparound
       - WARNING: database "xxxx" must be vacuumed within xxxx transactions
       - Lower xid_warn_limit to raise the warning threshold: 2 billion - 500 million (xid_warn_limit) - 1 billion (xid_stop_limit) = warning when age >= 500 million
       - Sample function: SELECT * FROM msgp.fn_vacuumdb('vacuumage', 5, 'age=400000000,top=20'); (a monitoring sketch follows below)
     • Root partition maintenance
       - ANALYZE of the root partition, which ORCA requires, can be lengthy for very large tables
       - VACUUM is required on the root partition to reduce its age
       [Diagram: parent/root partition with Jan, Feb, and Mar child partitions]
     • Compression
       - Table organization: heap / append-optimized columnar / append-optimized row
       - Use the default storage GUC to save space: gpconfig -c gp_default_storage_options -v 'appendonly=true, orientation=column'
       - Observed compression versus another vendor:

         Object | Other Vendor | GP     | Comp Ratio | Saving
         DB     | 47 TB        | 19 TB  | 2.4x       | 28 TB
         Table  | 1.9 TB       | 228 GB | 8.3x       | 1.6 TB
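     msgp.fn_vacuumdb above is a Morgan Stanley in-house function; a hedged sketch of the same age monitoring using only standard catalog columns:

       -- List databases by TXID age; per the slide's arithmetic, the warning
       -- fires once age reaches ~500 million, so act well before that.
       SELECT datname, age(datfrozenxid) AS xid_age
       FROM pg_database
       ORDER BY xid_age DESC;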
  9. Where We Want To Go
     • Version management
       - GP 5.x for all new GP instances
       - Migrate all 4.3 instances to 5.x instances (requires cutover hardware)
       - GP 6.0 coming in 2019: more PG (9.x) features and a replacement for segment file replication
     • Optimizer
       - Make ORCA the default optimizer for 4.3.22+; ORCA is already the default for 5.x
       - Explore greater use of bitmap indexes with ORCA
     • WLM and utility management
       - Migrate from resource queues (RQs) to resource groups (RGs); RGs use cgroups (a sketch follows below)
       - Migrate from WLM 1.8.2 to the next-gen WLM in GP 5.x+
       - Use gpbackup and gprestore instead of gpcrondump and gpdbrestore: vast improvement in parallelism and catalog metadata access
     • Unix platform
       - Migrate RH6 to RH7: cgroups enhancements needed for WLM/RG
     • Separate compute from storage
       - Working with Pivotal on the Spark-to-GP parallel connector, which provides a burst compute layer
     • Encryption for data at rest
       - Use SED drives and explore use of pgcrypto
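     A sketch of two of the roadmap items above; the group name, role, and limits are hypothetical:

       -- Make ORCA the session default (gpconfig -c optimizer -v on sets it cluster-wide):
       SET optimizer = on;

       -- A resource group replacing a resource queue (GP 5.x syntax; limits are illustrative):
       CREATE RESOURCE GROUP rg_etl WITH (CONCURRENCY=10, CPU_RATE_LIMIT=20, MEMORY_LIMIT=20);
       ALTER ROLE etl_user RESOURCE GROUP rg_etl;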
