The ninja elephant
Scaling the analytics database in Transferwise
Federico Campoli
Transferwise
25th January 2017
Federico Campoli (Transferwise) The ninja elephant 25th January 2017 1 / 1
First rule about talks, don’t talk about the speaker
Born in 1972
Passionate about IT since 1982 mostly because of TRON movie
Joined the Oracle DBA secret society in 2004
Fell in love with PostgreSQL in 2006
Currently runs the Brighton PostgreSQL User group
Works at Transferwise as Data Engineer
We have an appointment, and we are late!
The Gordian Knot of analytics db
The data engineer started in July 2016
He was assigned a task that was not customer facing
However, the task was critical to the business
To solve the performance issues on the MySQL analytics database
Performance was poor even though the VM had considerable resources
And the data set was only medium sized
Tactical assessment
The existing database had the following configuration
MySQL 5.6 on InnoDB
InnoDB buffer pool size 60 GB
RAM available 70 GB
20 CPUs
600 GB used on disk
Analytic queries performed via Looker and Tableau
The main live MySQL schema replicated into the analytics database
Several schemas from the service database imported on a regular basis
One schema used for obfuscating PII and denormalising the heavy queries
The frog effect
If you drop a frog in a pot of boiling water, it will of course frantically try to
clamber out. But if you place it gently in a pot of tepid water and turn the heat
on low, it will be slowly boiled to death.
The performance issues worsened over a two-year span
The obfuscation was done via custom views
The data size on the MySQL master increased over time
Causing the optimiser to switch to materialisation when accessing the views
The analytics tools struggled even under normal load
In busy periods the database became almost unusable
Analysts were busy tuning existing queries rather than writing new ones
A new solution was needed
The eye of the storm
One size doesn’t fit all
It was clear that MySQL was no longer a good fit.
However, the new solution had to meet some specific requirements.
Data updated in almost real time from the live database
PII obfuscated for the analysts
PII available in clear for the power users
The system should be able to scale out for several years
Modern SQL for better analytics queries
May the best database win
The analysts’ team shortlisted a few solutions.
Each solution only partially covered the requirements.
Google BigQuery
Amazon RedShift
Snowflake
PostgreSQL
Shortlisting the shortlist
Google BigQuery and Amazon RedShift did not satisfy the analytics requirements
and were removed from the list.
Both PostgreSQL and Snowflake offered very good performance and modern SQL.
Neither of them offered a way to replicate from the MySQL system.
Straight into the cloud
Snowflake is a cloud based data warehouse service. It’s based on Amazon S3 and
comes in different sizes.
Their pricing system is very appealing and the preliminary tests showed Snowflake
outperforming PostgreSQL [1].
[1] PostgreSQL single machine vs cloud based parallel processing
Streaming copy
Using FiveTran, an impressive multi technology data pipeline, the data would flow
in real time from our production server to Snowflake.
Unfortunately there was just one little catch.
There was no support for obfuscation.
Customer comes first
At Transferwise we really care about the customer’s data security.
Our policy for PII data is that any personal information moving outside our
perimeter shall be obfuscated.
The third party extraction and replication for Snowflake required full read access to
our live systems, or at least to a database configured as a cascading replica.
The data would have had to be obfuscated before allowing the third party replicator
access.
Proactive development
The data engineer, foreseeing the issue, developed in his spare time a proof of
concept based on the replica tool pg_chameleon, which uses a Python library to
read the MySQL replica.
The tests on a small copy of the live database were successful.
The tool’s simple structure made it possible to add the obfuscation in real time with
minimal changes.
And the winner is...
In this scenario PostgreSQL would be the replicated and obfuscated data source
for FiveTran.
However, because the performance on PostgreSQL was quite good, and the
system has a good margin for scaling up, the decision was to keep the
analytics data behind our perimeter.
MySQL Replica in a nutshell
A quick look at the replication system
Let’s have a quick overview of how the MySQL replica works and how the
replicator interacts with it.
The following slides refer to pg_chameleon because the custom obfuscator
tool shares most of its concepts and code with pg_chameleon.
MySQL Replica
The MySQL replica protocol is logical
When MySQL is configured properly the RDBMS saves the data changed
into binary log files
The slave connects to the master and gets the replication data
The replication data is saved into the slave’s local relay logs
The local relay logs are replayed into the slave
A chameleon in the middle
pg_chameleon mimics a MySQL slave’s behaviour
Connects to the master and reads data changes
It stores the row images into a PostgreSQL table using the jsonb format
A PL/pgSQL function decodes the rows and replays the changes
PostgreSQL acts as relay log and replication slave
With an extra cool feature.
Initialises the PostgreSQL replica schema in just one command
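The relay log idea described on this slide can be illustrated with a minimal sketch. This is not the tool's code: the event layout (action, table, pk, row) and the function names are assumptions made for the example, and a plain Python dict stands in for the PostgreSQL tables.

```python
import json

def store_event(relay_log, action, table, pk, row=None):
    """Append one row image to the relay log, serialised as JSON text."""
    relay_log.append(json.dumps({"action": action, "table": table, "pk": pk, "row": row}))

def replay(relay_log, tables):
    """Decode the stored row images in order and apply them to the target tables."""
    for raw in relay_log:
        event = json.loads(raw)
        target = tables.setdefault(event["table"], {})
        if event["action"] in ("insert", "update"):
            target[event["pk"]] = event["row"]
        elif event["action"] == "delete":
            target.pop(event["pk"], None)
    return tables

# Hypothetical stream of changes captured from the master.
relay_log = []
store_event(relay_log, "insert", "users", 1, {"id": 1, "name": "alice"})
store_event(relay_log, "update", "users", 1, {"id": 1, "name": "alice b"})
store_event(relay_log, "insert", "users", 2, {"id": 2, "name": "bob"})
store_event(relay_log, "delete", "users", 2)
replica = replay(relay_log, {})
```

Storing the changes first and replaying them later keeps the capture step cheap, which is the role the slide assigns to the jsonb table.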
MySQL replica + pg_chameleon
Log formats
MySQL supports different formats for the binary logs.
The STATEMENT format. It logs the statements which are replayed on the
slave.
It seems the best solution for performance.
However, replaying queries with non-deterministic elements generates
inconsistent slaves (e.g. an insert with uuid).
The ROW format is deterministic. It logs the row image and the DDL queries.
This is the format required for pg_chameleon to work.
MIXED takes the best of both worlds. The master logs the statements unless
a non-deterministic element is used. In that case it logs the row image.
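The difference between replaying a statement and replaying a row image can be shown with a toy simulation. It is not MySQL code: the table is a Python list, the function names are made up for the example, and Python's uuid4 stands in for any non-deterministic SQL function.

```python
import uuid

def run_on_master(table):
    """Execute an insert with a non-deterministic value and return both log entries."""
    row = {"id": 1, "token": str(uuid.uuid4())}
    table.append(row)
    statement_entry = "INSERT INTO t (id, token) VALUES (1, uuid())"
    row_entry = dict(row)  # ROW format: the row image as written on the master
    return statement_entry, row_entry

def replay_statement(table, statement_entry):
    """STATEMENT replay: the slave evaluates uuid() again and gets a new value.

    The statement text is not parsed here; re-running it is simulated directly.
    """
    table.append({"id": 1, "token": str(uuid.uuid4())})

def replay_row(table, row_entry):
    """ROW replay: the slave stores the logged row image unchanged."""
    table.append(dict(row_entry))

master, slave_statement, slave_row = [], [], []
statement_entry, row_entry = run_on_master(master)
replay_statement(slave_statement, statement_entry)
replay_row(slave_row, row_entry)
```

After the run, slave_row holds the same token as the master, while slave_statement holds a different one.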
Maximum effort
Replica and obfuscation
The data engineer worked on pg_chameleon and built a minimum viable product.
The project was forked into a Transferwise-owned repository to add the
obfuscation capabilities and other specific functionality, such as the daily procedures
for the pre-aggregated schema.
Mighty morphing power elephant
The replica initialisation locks the MySQL tables in read only mode.
To avoid locking the main database for several hours, a secondary MySQL
replica is set up with local query logging enabled.
The cascading replica also made it possible to use the ROW binlog format, as the master
uses MIXED for performance reasons.
This is what awesome looks like!
A MySQL master is replicated into a MySQL slave
The slave’s data is copied and obfuscated using a PostgreSQL database!
Replica initialisation
The replica initialisation follows the same rules as any MySQL replica setup
Flush the tables with read lock
Get the master’s coordinates
Copy the data
Release the locks
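The four steps above can be sketched as follows. This is a sketch under assumptions, not the tool's implementation: init_replica and copy_data are hypothetical names, the cursor is assumed to be a DB-API cursor returning tuples, and RecordingCursor is only a stand-in so the example runs without a MySQL server. The SQL statements themselves are standard MySQL.

```python
def init_replica(cursor, copy_data):
    """Run the four initialisation steps and return the master's coordinates."""
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        cursor.execute("SHOW MASTER STATUS")
        status = cursor.fetchone()
        # The first two columns are the binlog file name and the position.
        coordinates = (status[0], status[1])
        copy_data()
    finally:
        # Release the locks even if the copy fails.
        cursor.execute("UNLOCK TABLES")
    return coordinates

class RecordingCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a MySQL server."""
    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)

    def fetchone(self):
        # Made-up coordinates: (binlog file, position).
        return ("mysql-bin.000042", 1337)

cursor = RecordingCursor()
copied = []
coordinates = init_replica(cursor, lambda: copied.append("data"))
```

The try/finally block is a design choice for the sketch: the read lock is the expensive part of the procedure, so it should be released on every code path.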
Tricky SQL
The data copy pulls the data out from mysql using the CSV format with a very
tricky SQL statement.
SELECT
    CASE
        WHEN data_type="enum"
        THEN
            SUBSTRING(COLUMN_TYPE,5)
    END AS enum_list,
    CASE
        WHEN
            data_type IN ('"""+"','".join(self.hexify)+"""')
        THEN
            concat('hex(',column_name,')')
        WHEN
            data_type IN ('bit')
        THEN
            concat('cast(`',column_name,'` AS unsigned)')
        ELSE
            concat('`',column_name,'`')
    END
    AS column_csv
FROM
    information_schema.COLUMNS
WHERE
    table_schema=%s
    AND table_name=%s
ORDER BY
    ordinal_position
;
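To make the CASE branches easier to follow, here is a minimal Python mirror of the same logic, followed by the kind of SELECT list it produces. The function names and the contents of HEXIFY are assumptions for illustration; the slide only shows that the list comes from self.hexify and that the real query runs against information_schema.

```python
# Assumed example of data types exported as hexadecimal text.
HEXIFY = ("blob", "binary", "varbinary")

def csv_column_expr(column_name, data_type, hexify=HEXIFY):
    """Return the SELECT expression used to export one column."""
    if data_type in hexify:
        return "hex({})".format(column_name)
    if data_type == "bit":
        return "cast(`{}` AS unsigned)".format(column_name)
    return "`{}`".format(column_name)

def build_select(table_name, columns, hexify=HEXIFY):
    """Assemble the export query from (column_name, data_type) pairs."""
    expressions = [csv_column_expr(name, dtype, hexify) for name, dtype in columns]
    return "SELECT {} FROM `{}`".format(", ".join(expressions), table_name)

query = build_select("users", [("id", "int"), ("avatar", "blob"), ("active", "bit")])
```

Binary columns are exported as hex text and bit columns as integers so that every value survives the trip through a CSV file.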
Fallback on failure
The CSV data is pulled out in slices in order to avoid memory overload.
The file is then pushed into PostgreSQL using the COPY command.
However...
COPY is fast but runs as a single transaction
One failure and the entire batch is rolled back
If this happens the procedure loads the same data using INSERT
statements
Which can be very slow
But at least only the problematic rows are discarded
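The fallback logic can be sketched independently of any database driver. In this hedged example load_batch, bulk_copy and insert_row are hypothetical names: the two callables stand in for a COPY of the whole slice and for a single-row INSERT, and a Python list plays the part of the target table.

```python
def load_batch(rows, bulk_copy, insert_row):
    """Load rows with one bulk call; on failure retry row by row.

    Returns the rows that had to be discarded.
    """
    try:
        bulk_copy(rows)
        return []
    except Exception:
        discarded = []
        for row in rows:
            try:
                insert_row(row)
            except Exception:
                discarded.append(row)
        return discarded

target = []

def valid(row):
    return isinstance(row["amount"], int)

def bulk_copy(rows):
    # All or nothing, like COPY inside a single transaction.
    if not all(valid(row) for row in rows):
        raise ValueError("bad row in batch")
    target.extend(rows)

def insert_row(row):
    if not valid(row):
        raise ValueError("bad row")
    target.append(row)

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": "oops"}, {"id": 3, "amount": 30}]
discarded = load_batch(batch, bulk_copy, insert_row)
```

Against a real PostgreSQL connection the failed COPY would also need a rollback before the inserts start, and each failing INSERT would have to be isolated, for example with autocommit or savepoints.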
Obfuscation when initialising
The obfuscation process is quite simple and uses the pgcrypto extension for hashing
with SHA-256.
When the replica is initialised the data is copied into the schema in clear
The table locks are released
The tables with PII are copied and obfuscated in a separate schema
The process builds the indices on the schemas with data in clear and
obfuscated
The tables without PII data are exposed to the normal users using simple
views
All the varchar fields in the obfuscated schema are converted to text fields
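A short example shows why the varchar fields are converted to text. It uses Python's hashlib instead of pgcrypto, so it only illustrates the hash itself; the obfuscate function and the sample value are made up for the example.

```python
import hashlib

def obfuscate(value):
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

hashed = obfuscate("jane.doe@example.com")
hashed_again = obfuscate("jane.doe@example.com")
hashed_other = obfuscate("john.doe@example.com")
```

The hex digest is always 64 characters long whatever the input length, so it would not fit in a shorter varchar column. The same input always gives the same digest, which keeps joins and grouping on the obfuscated values possible.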
Obfuscation on the fly
The obfuscation is also applied when the data is replicated.
The approach is very simple.
When a row image is captured the process checks if the table contains PII
data
In that case the process generates a second jsonb element with the PII data
obfuscated
The jsonb element carries the complete information about the destination
schema
The plpgSQL function executes the change on the schema in clear and the
schema with obfuscated data
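The steps above can be sketched as a function that turns one captured row image into one or two JSON elements. Everything named here is an assumption for illustration: PII_COLUMNS, the schema names and build_elements are not taken from the tool, and hashlib stands in for the hashing done in the database.

```python
import hashlib
import json

# Hypothetical configuration: table name -> columns holding PII.
PII_COLUMNS = {"users": ["email", "phone"]}

def sha256_hex(value):
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def build_elements(table, row, clear_schema="clear", obf_schema="obfuscated"):
    """Return the JSON elements to store for one captured row image."""
    elements = [json.dumps({"schema": clear_schema, "table": table, "row": row})]
    pii = PII_COLUMNS.get(table)
    if pii:
        # Hash only the PII columns; NULLs and other columns pass through.
        hidden = {
            key: (sha256_hex(value) if key in pii and value is not None else value)
            for key, value in row.items()
        }
        elements.append(json.dumps({"schema": obf_schema, "table": table, "row": hidden}))
    return elements

user_elements = build_elements("users", {"id": 7, "email": "jane@example.com", "phone": None})
currency_elements = build_elements("currencies", {"code": "GBP"})
```

Each element names its destination schema, so the function replaying the changes does not need to know which tables contain PII.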
The DDL. A real pain in the back
The DDL replica is possible with a little trick.
MySQL even in ROW format emits the DDL as statements
A regular expression traps the DDL like CREATE/DROP TABLE or ALTER
TABLE.
The mysql library gets the table’s metadata from the information schema
The metadata is used to build the DDL in the PostgreSQL dialect
This approach may not be elegant but is quite robust.
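A regular expression of the kind described above might look like the following sketch. The pattern and the trap_ddl function are assumptions, not the expression used by the tool, and the sketch does not handle schema-qualified table names.

```python
import re

# Matches CREATE/DROP/ALTER TABLE, an optional IF [NOT] EXISTS, and the table name.
DDL_PATTERN = re.compile(
    r"^\s*(CREATE\s+TABLE|DROP\s+TABLE|ALTER\s+TABLE)\s+"
    r"(?:IF\s+(?:NOT\s+)?EXISTS\s+)?`?(\w+)`?",
    re.IGNORECASE,
)

def trap_ddl(statement):
    """Return (command, table_name) for a supported DDL statement, else None."""
    match = DDL_PATTERN.match(statement)
    if match is None:
        return None
    # Normalise case and inner whitespace, e.g. "drop   table" -> "DROP TABLE".
    command = " ".join(match.group(1).upper().split())
    return command, match.group(2)

altered = trap_ddl("ALTER TABLE `users` ADD COLUMN age int")
created = trap_ddl("create table if not exists foo (id int)")
dropped = trap_ddl("drop   table IF EXISTS bar")
ignored = trap_ddl("BEGIN")
```

Only the command and the table name are extracted from the statement; as the slide explains, the column definitions are then read from the information schema rather than parsed from the DDL text.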
Timing
Query                      MySQL             PostgreSQL   PostgreSQL cached
Master procedure           20 hours          4 hours      N/A
Extracting sharing IBANs   didn’t complete   3 minutes    1 minute
Adyen notification         6 minutes         2 minutes    6 seconds
Resource comparison
Resource          MySQL    PostgreSQL
Storage size      940 GB   664 GB
Server CPUs       18       8
Server memory     68 GB    48 GB
Shared memory     50 GB    5 GB
Max connections   500      100
Advantages using PostgreSQL
Stronger security model
Better resource optimisation (See previous slide)
No invalid views
No performance issues with views
Complex analytics functions
Partitioning (thanks pg_pathman!)
BRIN indices
"Some code was optimised inside, but actually very little - maybe 10-20% was
improved. We’ll do more of that in the future, but not yet. The good thing is that
the performance gains we have can mostly be attributed just to PG vs MySQL. So
there’s a lot of scope to improve further."
Jeff McClelland - Growth Analyst, data guru
Lessons learned
init replica tune
The replica initialisation required several improvements.
The first init replica implementation didn’t complete.
The OOM killer killed the process when the memory usage was too high.
In order to speed up the replica, some large tables not required in the
analytics db were excluded from the init replica.
Some tables required a custom slice size because the row length triggered
the OOM killer again.
Estimating the total rows for the user’s feedback is faster, but the output can be
odd.
Using unbuffered cursors improves the speed and the memory usage.
However, even after fixing the memory issues the initial copy took 6 days.
Tuning the copy with unbuffered cursors and row number estimates
improved the initial copy, which now completes in 30 hours,
including the time required for the index build.
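Reading the data in slices can be sketched with the standard DB-API fetchmany call. The fetch_slices name is hypothetical and FakeCursor is only a stand-in so the example runs without a server. Which driver the tool used is not stated on the slides; with PyMySQL, for instance, an unbuffered cursor is available as pymysql.cursors.SSCursor.

```python
import itertools

def fetch_slices(cursor, slice_size):
    """Yield lists of at most slice_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(slice_size)
        if not rows:
            break
        yield rows

class FakeCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a server."""
    def __init__(self, rows):
        self._rows = iter(rows)

    def fetchmany(self, size):
        return list(itertools.islice(self._rows, size))

slices = list(fetch_slices(FakeCursor(range(7)), 3))
```

With an unbuffered cursor only one slice is held in memory at a time, so the slice size can be lowered for tables with very long rows.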
Strictness is an illusion. MySQL doubly so
MySQL’s lack of strictness is not a mystery.
The replica broke down several times because of the funny way NOT NULL is
managed by MySQL.
To prevent any further replica breakdown, the fields with NOT NULL added with
ALTER TABLE are always created as NULLable in PostgreSQL.
MySQL automatically truncates character strings at the varchar size. This
is a problem if the field is obfuscated on PostgreSQL because the hashed string
might not fit into the corresponding varchar field. Therefore all the character
varying fields on the obfuscated schema are converted to text.
I feel your lack of constraint disturbing
Rubbish data can be stored in MySQL without the DBMS raising any error.
When this happens the replicator traps the error when the change is replayed on
PostgreSQL and discards the problematic row.
The value is logged in the replica’s log, available for further actions.
Wrap up
Did you say hire?
WE ARE HIRING!
https://transferwise.com/jobs/
That’s all folks!
QUESTIONS?
Contacts and license
Twitter: 4thdoctor_scarf
Transferwise: https://transferwise.com/
Blog: http://www.pgdba.co.uk
Meetup: http://www.meetup.com/Brighton-PostgreSQL-Meetup/
This document is distributed under the terms of the Creative Commons
Boring legal stuff
The 4th doctor meme - source memecrunch.com
The eye, phantom playground, light end tunnel - Copyright Federico Campoli
The dolphin picture - Copyright artnoose
Deadpool Maximum Effort - source Deadpool Zoeiro
Deadpool Clap - source memegenerator
Federico Campoli (Transferwise) The ninja elephant 25th January 2017 51 / 1
The ninja elephant
Scaling the analytics database in Transferwise
Federico Campoli
Transferwise
25th January 2017
Federico Campoli (Transferwise) The ninja elephant 25th January 2017 52 / 1

More Related Content

What's hot

The hitchhiker's guide to PostgreSQL
The hitchhiker's guide to PostgreSQLThe hitchhiker's guide to PostgreSQL
The hitchhiker's guide to PostgreSQLFederico Campoli
 
PostgreSql query planning and tuning
PostgreSql query planning and tuningPostgreSql query planning and tuning
PostgreSql query planning and tuningFederico Campoli
 
Don't panic! - Postgres introduction
Don't panic! - Postgres introductionDon't panic! - Postgres introduction
Don't panic! - Postgres introductionFederico Campoli
 
Introduction to PostgreSQL
Introduction to PostgreSQLIntroduction to PostgreSQL
Introduction to PostgreSQLMark Wong
 
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensWhats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensCitus Data
 
JPA Week3 Entity Mapping / Hexagonal Architecture
JPA Week3 Entity Mapping / Hexagonal ArchitectureJPA Week3 Entity Mapping / Hexagonal Architecture
JPA Week3 Entity Mapping / Hexagonal ArchitectureCovenant Ko
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Jeremy Schneider
 
Velox: Models in Action
Velox: Models in ActionVelox: Models in Action
Velox: Models in ActionDan Crankshaw
 
GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음
GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음
GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음Covenant Ko
 
떠먹는 '오브젝트' Ch02 객체지향 프로그래밍
떠먹는 '오브젝트' Ch02 객체지향 프로그래밍떠먹는 '오브젝트' Ch02 객체지향 프로그래밍
떠먹는 '오브젝트' Ch02 객체지향 프로그래밍Covenant Ko
 
JPA 스터디 Week1 - 하이버네이트, 캐시
JPA 스터디 Week1 - 하이버네이트, 캐시JPA 스터디 Week1 - 하이버네이트, 캐시
JPA 스터디 Week1 - 하이버네이트, 캐시Covenant Ko
 
Tackling a 1 billion member social network
Tackling a 1 billion member social networkTackling a 1 billion member social network
Tackling a 1 billion member social networkArtur Bańkowski
 
Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Grant McAlister
 
Ireland OUG Meetup May 2017
Ireland OUG Meetup May 2017Ireland OUG Meetup May 2017
Ireland OUG Meetup May 2017Brendan Tierney
 
PostgreSQL, your NoSQL database
PostgreSQL, your NoSQL databasePostgreSQL, your NoSQL database
PostgreSQL, your NoSQL databaseReuven Lerner
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Storesandyseaborne
 
Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Chris Fregly
 
Pg 95 new capabilities
Pg 95 new capabilitiesPg 95 new capabilities
Pg 95 new capabilitiesJamey Hanson
 

What's hot (20)

The hitchhiker's guide to PostgreSQL
The hitchhiker's guide to PostgreSQLThe hitchhiker's guide to PostgreSQL
The hitchhiker's guide to PostgreSQL
 
Pg big fast ugly acid
Pg big fast ugly acidPg big fast ugly acid
Pg big fast ugly acid
 
Hitchikers guide handout
Hitchikers guide handoutHitchikers guide handout
Hitchikers guide handout
 
PostgreSql query planning and tuning
PostgreSql query planning and tuningPostgreSql query planning and tuning
PostgreSql query planning and tuning
 
Don't panic! - Postgres introduction
Don't panic! - Postgres introductionDon't panic! - Postgres introduction
Don't panic! - Postgres introduction
 
Introduction to PostgreSQL
Introduction to PostgreSQLIntroduction to PostgreSQL
Introduction to PostgreSQL
 
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig KerstiensWhats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
Whats wrong with postgres | PGConf EU 2019 | Craig Kerstiens
 
JPA Week3 Entity Mapping / Hexagonal Architecture
JPA Week3 Entity Mapping / Hexagonal ArchitectureJPA Week3 Entity Mapping / Hexagonal Architecture
JPA Week3 Entity Mapping / Hexagonal Architecture
 
Wait! What’s going on inside my database?
Wait! What’s going on inside my database?Wait! What’s going on inside my database?
Wait! What’s going on inside my database?
 
Velox: Models in Action
Velox: Models in ActionVelox: Models in Action
Velox: Models in Action
 
GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음
GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음
GREAT STEP 1. 테스트 코드를 향한 위대한 발걸음
 
떠먹는 '오브젝트' Ch02 객체지향 프로그래밍
떠먹는 '오브젝트' Ch02 객체지향 프로그래밍떠먹는 '오브젝트' Ch02 객체지향 프로그래밍
떠먹는 '오브젝트' Ch02 객체지향 프로그래밍
 
JPA 스터디 Week1 - 하이버네이트, 캐시
JPA 스터디 Week1 - 하이버네이트, 캐시JPA 스터디 Week1 - 하이버네이트, 캐시
JPA 스터디 Week1 - 하이버네이트, 캐시
 
Tackling a 1 billion member social network
Tackling a 1 billion member social networkTackling a 1 billion member social network
Tackling a 1 billion member social network
 
Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput Tuning PostgreSQL for High Write Throughput
Tuning PostgreSQL for High Write Throughput
 
Ireland OUG Meetup May 2017
Ireland OUG Meetup May 2017Ireland OUG Meetup May 2017
Ireland OUG Meetup May 2017
 
PostgreSQL, your NoSQL database
PostgreSQL, your NoSQL databasePostgreSQL, your NoSQL database
PostgreSQL, your NoSQL database
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Stores
 
Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016Atlanta Spark User Meetup 09 22 2016
Atlanta Spark User Meetup 09 22 2016
 
Pg 95 new capabilities
Pg 95 new capabilitiesPg 95 new capabilities
Pg 95 new capabilities
 

Viewers also liked

Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1Federico Campoli
 
Scaling transferwise
Scaling transferwiseScaling transferwise
Scaling transferwiseNilan Peiris
 
Product engineering @ TransferWise
Product engineering @ TransferWiseProduct engineering @ TransferWise
Product engineering @ TransferWiseAlvar Lumberg
 
MuCEM
MuCEMMuCEM
MuCEMOban_
 
Menopause nine by sanjana
Menopause nine by sanjana Menopause nine by sanjana
Menopause nine by sanjana sanjukpt92
 
PostgreSQL, The Big, The Fast and The Ugly
PostgreSQL, The Big, The Fast and The UglyPostgreSQL, The Big, The Fast and The Ugly
PostgreSQL, The Big, The Fast and The UglyFederico Campoli
 
Vacuum precision positioning systems brochure
Vacuum precision positioning systems brochureVacuum precision positioning systems brochure
Vacuum precision positioning systems brochureJohn Mike
 
SISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex Shop
SISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex ShopSISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex Shop
SISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex ShopMarcio Lampert
 

Viewers also liked (14)

Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1
 
Scaling transferwise
Scaling transferwiseScaling transferwise
Scaling transferwise
 
Product engineering @ TransferWise
Product engineering @ TransferWiseProduct engineering @ TransferWise
Product engineering @ TransferWise
 
Streaming replication
Streaming replicationStreaming replication
Streaming replication
 
Media planning
Media planningMedia planning
Media planning
 
Media planning f
Media planning fMedia planning f
Media planning f
 
MuCEM
MuCEMMuCEM
MuCEM
 
Nclex 5
Nclex 5Nclex 5
Nclex 5
 
Menopause nine by sanjana
Menopause nine by sanjana Menopause nine by sanjana
Menopause nine by sanjana
 
PostgreSQL, The Big, The Fast and The Ugly
PostgreSQL, The Big, The Fast and The UglyPostgreSQL, The Big, The Fast and The Ugly
PostgreSQL, The Big, The Fast and The Ugly
 
Vacuum precision positioning systems brochure
Vacuum precision positioning systems brochureVacuum precision positioning systems brochure
Vacuum precision positioning systems brochure
 
Cl teleflostrainers
Cl teleflostrainersCl teleflostrainers
Cl teleflostrainers
 
SISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex Shop
SISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex ShopSISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex Shop
SISTEMA DE IDENTIDADE VISUAL - SIV - CANNIBAL Sex Shop
 
καστορια
καστοριακαστορια
καστορια
 

Similar to The ninja elephant, scaling the analytics database in Transwerwise

pg_chameleon MySQL to PostgreSQL replica made easy
pg_chameleon  MySQL to PostgreSQL replica made easypg_chameleon  MySQL to PostgreSQL replica made easy
pg_chameleon MySQL to PostgreSQL replica made easyFederico Campoli
 
WTF is Modeling, Anyway!?
WTF is Modeling, Anyway!?WTF is Modeling, Anyway!?
WTF is Modeling, Anyway!?Neil Gunther
 
MySQL Community Meetup in China : Innovation driven by the Community
MySQL Community Meetup in China : Innovation driven by the CommunityMySQL Community Meetup in China : Innovation driven by the Community
MySQL Community Meetup in China : Innovation driven by the CommunityFrederic Descamps
 
NoSQL – Back to the Future or Yet Another DB Feature?
NoSQL – Back to the Future or Yet Another DB Feature?NoSQL – Back to the Future or Yet Another DB Feature?
NoSQL – Back to the Future or Yet Another DB Feature?Martin Scholl
 
Topic 12: NoSQL in Action
Topic 12: NoSQL in ActionTopic 12: NoSQL in Action
Topic 12: NoSQL in ActionZubair Nabi
 
Productionize spark structured streaming
Productionize spark structured streamingProductionize spark structured streaming
Productionize spark structured streamingIvan Kosianenko
 
20190817 coscup-oracle my sql innodb cluster sharing
20190817 coscup-oracle my sql innodb cluster sharing20190817 coscup-oracle my sql innodb cluster sharing
20190817 coscup-oracle my sql innodb cluster sharingIvan Ma
 
NoSQL in MySQL
NoSQL in MySQLNoSQL in MySQL
NoSQL in MySQLUlf Wendel
 
Evoloution of Ideas
Evoloution of IdeasEvoloution of Ideas
Evoloution of IdeasWooga
 
Why Wordnik went non-relational
Why Wordnik went non-relationalWhy Wordnik went non-relational
Why Wordnik went non-relationalTony Tam
 
Vote NO for MySQL
Vote NO for MySQLVote NO for MySQL
Vote NO for MySQLUlf Wendel
 
Oracle Active Data Guard 12cR2. Is it the best option?
Oracle Active Data Guard 12cR2. Is it the best option?Oracle Active Data Guard 12cR2. Is it the best option?
Oracle Active Data Guard 12cR2. Is it the best option?Ludovico Caldara
 
Scaling, Tuning and Maintaining the Monolith
Scaling, Tuning and Maintaining the MonolithScaling, Tuning and Maintaining the Monolith
Scaling, Tuning and Maintaining the MonolithRoss McFadyen
 
MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!MySQL Replication Performance Tuning for Fun and Profit!
MySQL Replication Performance Tuning for Fun and Profit!Vitor Oliveira
 
ECL-Watch: A Big Data Application Performance Tuning Tool in the HPCC Systems...
ECL-Watch: A Big Data Application Performance Tuning Tool in the HPCC Systems...ECL-Watch: A Big Data Application Performance Tuning Tool in the HPCC Systems...
ECL-Watch: A Big Data Application Performance Tuning Tool in the HPCC Systems...HPCC Systems
 
Drizzle Keynote from O'Reilly's MySQL's Conference
Drizzle Keynote from O'Reilly's MySQL's ConferenceDrizzle Keynote from O'Reilly's MySQL's Conference
Drizzle Keynote from O'Reilly's MySQL's ConferenceBrian Aker
 
MySQL InnoDB Cluster: High Availability Made Easy!
MySQL InnoDB Cluster: High Availability Made Easy!MySQL InnoDB Cluster: High Availability Made Easy!
MySQL InnoDB Cluster: High Availability Made Easy!Vittorio Cioe
 


The ninja elephant, scaling the analytics database in Transferwise

  • 1. The ninja elephant Scaling the analytics database in Transferwise Federico Campoli Transferwise 25th January 2017 Federico Campoli (Transferwise) The ninja elephant 25th January 2017 1 / 1
  • 2. First rule about talks, don’t talk about the speaker: Born in 1972. Passionate about IT since 1982, mostly because of the TRON movie. Joined the Oracle DBA secret society in 2004. Fell in love with PostgreSQL in 2006. Currently runs the Brighton PostgreSQL User Group. Works at Transferwise as a Data Engineer.
  • 3. Table of contents
  • 4. Table of contents
  • 5. We have an appointment, and we are late!
  • 6. The Gordian Knot of analytics db: The data engineer started in July 2016. He was involved in a task that was not customer facing. However, the task was critical to the business.
  • 7. The Gordian Knot of analytics db: The data engineer started in July 2016. He was involved in a task that was not customer facing. However, the task was critical to the business. The task was to solve the performance issues on the MySQL analytics database, which were bad even though considerable resources were assigned to the VM and the data set was medium sized.
  • 8. Tactical assessment: The existing database had the following configuration: MySQL 5.6 on InnoDB; InnoDB buffer size 60 GB; RAM available 70 GB; 20 CPUs; 600 GB used on disk. Analytic queries were performed via Looker and Tableau. The main live MySQL schema was replicated into the analytics database. Several schemas from the service database were imported on a regular basis. One schema was used for obfuscating PII and denormalising the heavy queries.
  • 9. The frog effect: If you drop a frog in a pot of boiling water, it will of course frantically try to clamber out. But if you place it gently in a pot of tepid water and turn up the heat, it will be slowly boiled to death.
  • 10. The frog effect: If you drop a frog in a pot of boiling water, it will of course frantically try to clamber out. But if you place it gently in a pot of tepid water and turn up the heat, it will be slowly boiled to death. The performance issues worsened over a two-year span. The obfuscation was done via custom views. The data size on the MySQL master increased over time, causing the optimiser to switch to materialisation when accessing the views. The analytics tools struggled even under normal load. In busy periods the database became almost unusable. Analysts were busy tuning existing queries rather than writing new ones. A new solution was needed.
  • 11. Table of contents
  • 12. The eye of the storm
  • 13. One size doesn’t fit all: It was clear that MySQL was no longer a good fit. However, the new solution had to meet some specific requirements: data updated in almost real time from the live database; PII obfuscated for the analysts; PII available in clear for the power users; a system able to scale out for several years; modern SQL for better analytics queries.
  • 14. May the best database win: The analysts team shortlisted a few solutions. Each solution partially covered the requirements: Google BigQuery, Amazon Redshift, Snowflake, PostgreSQL.
  • 15. Shortlisting the shortlist: Google BigQuery and Amazon Redshift did not satisfy the analytics requirements and were removed from the list. Both PostgreSQL and Snowflake offered very good performance and modern SQL. Neither of them offered a replication system from MySQL.
  • 16. Straight into the cloud: Snowflake is a cloud-based data warehouse service. It’s based on Amazon S3 and comes in different sizes. Their pricing system is very appealing and the preliminary tests showed Snowflake outperforming PostgreSQL (footnote 1: a single PostgreSQL machine versus cloud-based parallel processing).
  • 17. Streaming copy: Using FiveTran, an impressive multi-technology data pipeline, the data would flow in real time from our production server to Snowflake.
  • 18. Streaming copy: Using FiveTran, an impressive multi-technology data pipeline, the data would flow in real time from our production server to Snowflake. Unfortunately there was just one little catch: there was no support for obfuscation.
  • 19. Customer comes first: At Transferwise we really care about the customer’s data security. Our policy for PII data is that any personal information moving outside our perimeter shall be obfuscated. The third-party extraction and replica for Snowflake required full read access to our live systems, or at least to a database configured as a cascading replica. The data would have to be obfuscated before allowing the third-party replicator access.
  • 20. Proactive development: The data engineer, foreseeing the issue, developed in his spare time a proof of concept based on the replica tool pg_chameleon, which uses a Python library to read the MySQL replica. The tests on a small copy of the live database were successful. The tool’s simple structure allowed obfuscation to be added in real time with minimal changes.
  • 21. And the winner is...: In this scenario PostgreSQL would be the replicated and obfuscated data source for FiveTran. However, because the performance on PostgreSQL was quite good and the system had a good margin for scaling up, the decision was to keep the analytics data behind our perimeter.
  • 22. Table of contents
  • 23. MySQL Replica in a nutshell
  • 24. A quick look at the replication system: Let’s have a quick overview of how the MySQL replica works and how the replicator interacts with it. The following slides refer to pg_chameleon because the custom obfuscator tool shares most of its concepts and code with pg_chameleon.
  • 25. MySQL Replica: The MySQL replica protocol is logical. When MySQL is configured properly the RDBMS saves the changed data into binary log files. The slave connects to the master and gets the replication data. The replication data is saved into the slave’s local relay logs. The local relay logs are replayed into the slave.
  • 26. MySQL Replica
  • 27. A chameleon in the middle: pg_chameleon mimics a MySQL slave’s behaviour. It connects to the master and reads the data changes. It stores the row images into a PostgreSQL table using the jsonb format. A PL/pgSQL function decodes the rows and replays the changes.
  • 28. A chameleon in the middle: pg_chameleon mimics a MySQL slave’s behaviour. It connects to the master and reads the data changes. It stores the row images into a PostgreSQL table using the jsonb format. A PL/pgSQL function decodes the rows and replays the changes. PostgreSQL acts as relay log and replication slave. With an extra cool feature.
  • 29. A chameleon in the middle: pg_chameleon mimics a MySQL slave’s behaviour. It connects to the master and reads the data changes. It stores the row images into a PostgreSQL table using the jsonb format. A PL/pgSQL function decodes the rows and replays the changes. PostgreSQL acts as relay log and replication slave. With an extra cool feature: it initialises the PostgreSQL replica schema in just one command.
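The decode-and-replay step described in the slide above can be pictured with a small sketch. It is written in Python purely for illustration; the slides say the real replay happens inside a PL/pgSQL function. The layout of the stored row image (`schema`, `table`, `action`, `pkey`, `before`, `after`) and the name `build_statement` are assumptions made for this example, not pg_chameleon's actual format.

```python
import json


def build_statement(event):
    """Translate one decoded row image into (sql, parameters)."""
    table = '"{}"."{}"'.format(event["schema"], event["table"])
    keys = event["pkey"]
    if event["action"] == "insert":
        cols = list(event["after"])
        sql = "INSERT INTO {} ({}) VALUES ({})".format(
            table,
            ", ".join('"{}"'.format(c) for c in cols),
            ", ".join(["%s"] * len(cols)),
        )
        return sql, [event["after"][c] for c in cols]
    # Updates and deletes locate the row through its primary key values.
    where = " AND ".join('"{}" = %s'.format(k) for k in keys)
    if event["action"] == "update":
        cols = list(event["after"])
        sets = ", ".join('"{}" = %s'.format(c) for c in cols)
        params = [event["after"][c] for c in cols]
        params += [event["before"][k] for k in keys]
        return "UPDATE {} SET {} WHERE {}".format(table, sets, where), params
    if event["action"] == "delete":
        return (
            "DELETE FROM {} WHERE {}".format(table, where),
            [event["before"][k] for k in keys],
        )
    raise ValueError("unknown action: {}".format(event["action"]))


# A row image as it might be stored in a jsonb column (hypothetical layout).
stored = json.dumps({
    "schema": "analytics", "table": "users", "action": "update",
    "pkey": ["id"],
    "before": {"id": 7, "email": "old@example.com"},
    "after": {"id": 7, "email": "new@example.com"},
})
sql, params = build_statement(json.loads(stored))
print(sql)
print(params)
```

Keeping the values in a separate parameter list, rather than pasting them into the SQL text, leaves quoting to the database driver.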
  • 30. MySQL replica + pg_chameleon
  • 31. Log formats: MySQL supports different formats for the binary logs. The STATEMENT format logs the statements, which are replayed on the slave. It seems the best solution for performance. However, replaying queries with non-deterministic elements generates inconsistent slaves (e.g. an insert with uuid). The ROW format is deterministic. It logs the row image and the DDL queries. This is the format required for pg_chameleon to work. MIXED takes the best of both worlds. The master logs the statements unless a non-deterministic element is used. In that case it logs the row image.
  • 32. Table of contents
  • 33. Maximum effort
  • 34. Replica and obfuscation: The data engineer worked on pg_chameleon and built a minimum viable product. The project was forked into a Transferwise-owned repository to add the obfuscation capabilities and other specific functionality, such as the daily procedures for the pre-aggregated schema.
  • 35. Mighty morphing power elephant: The replica initialisation locks the MySQL tables in read-only mode. To avoid locking the main database for several hours, a secondary MySQL replica is set up with local query logging enabled. The cascading replica also made it possible to use the ROW binlog format, as the master uses MIXED for performance reasons.
  • 36. This is what awesome looks like! A MySQL master is replicated into a MySQL slave. The slave’s data is copied and obfuscated using a PostgreSQL database!
  • 38. Replica initialisation: The replica initialisation follows the same rules as any MySQL replica setup: flush the tables with read lock; get the master’s coordinates; copy the data; release the locks.
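The four steps listed in the slide above could be sketched as follows. This is a minimal illustration, not the tool's code: `init_replica` and `copy_tables` are hypothetical names, the cursor is assumed to follow the Python DB-API, and `FakeCursor` only stands in for a real MySQL connection so the example runs without a server.

```python
def init_replica(cursor, copy_tables):
    """Lock, record the binlog coordinates, copy, and always unlock."""
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        cursor.execute("SHOW MASTER STATUS")
        row = cursor.fetchone()
        # The first two columns are the binary log file and the position.
        coordinates = {"log_file": row[0], "log_pos": row[1]}
        copy_tables()
    finally:
        # Release the locks even if the copy fails half way through.
        cursor.execute("UNLOCK TABLES")
    return coordinates


class FakeCursor:
    """Records statements instead of talking to a MySQL server."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)

    def fetchone(self):
        return ("mysql-bin.000042", 1337)


cursor = FakeCursor()
coords = init_replica(cursor, copy_tables=lambda: None)
print(coords)
print(cursor.statements)
```

The coordinates are read while the lock is held, so they describe exactly the state of the data being copied; replication can later resume from that point.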
  • 39. Tricky SQL: The data copy pulls the data out of MySQL in CSV format using a very tricky SQL statement. The `"""+ ... +"""` fragment is Python string concatenation that splices the list of data types to hexify into the query.
      SELECT
          CASE
              WHEN data_type="enum"
              THEN SUBSTRING(COLUMN_TYPE,5)
          END AS enum_list,
          CASE
              WHEN data_type IN ('"""+"','".join(self.hexify)+"""')
              THEN concat('hex(',column_name,')')
              WHEN data_type IN ('bit')
              THEN concat('cast(`',column_name,'` AS unsigned)')
              ELSE concat('`',column_name,'`')
          END AS column_csv
      FROM information_schema.COLUMNS
      WHERE table_schema=%s AND table_name=%s
      ORDER BY ordinal_position;
  • 40. Fallback on failure: The CSV data is pulled out in slices in order to avoid memory overload. The file is then pushed into PostgreSQL using the COPY command. However, COPY is fast but runs as a single transaction: one failure and the entire batch is rolled back. If this happens the procedure loads the same data using INSERT statements, which can be very slow but at least discards only the problematic rows.
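The fallback described in the slide above follows a common pattern: try the fast all-or-nothing path, and only on failure retry row by row. The sketch below shows that pattern under stated assumptions. `load_slice`, `bulk_copy` and `insert_row` are hypothetical names, and an in-memory list stands in for the PostgreSQL table, so no COPY or INSERT is actually issued.

```python
def load_slice(rows, bulk_copy, insert_row):
    """Load a slice; return the rows that had to be discarded."""
    try:
        bulk_copy(rows)  # fast path, all or nothing
        return []
    except Exception:
        discarded = []
        for row in rows:  # slow path, one row at a time
            try:
                insert_row(row)
            except Exception:
                discarded.append(row)
        return discarded


# In-memory stand-ins for the target table and the two load paths.
target = []


def check(row):
    if row["amount"] is None:
        raise ValueError("null value in column amount")


def bulk_copy(rows):
    for row in rows:
        check(row)  # any failure aborts before anything is stored
    target.extend(rows)


def insert_row(row):
    check(row)
    target.append(row)


rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 5},
]
discarded = load_slice(rows, bulk_copy, insert_row)
print("loaded:", [r["id"] for r in target])
print("discarded:", [r["id"] for r in discarded])
```

Returning the discarded rows lets the caller log them for later inspection instead of losing them silently.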
  • 41. Obfuscation when initialising: The obfuscation process is quite simple and uses the pgcrypto extension for hashing with SHA-256. When the replica is initialised the data is copied into the schema in clear. The table locks are released. The tables with PII are copied and obfuscated in a separate schema. The process builds the indices on the schemas with data in clear and obfuscated. The tables without PII data are exposed to the normal users using simple views. All the varchar fields in the obfuscated schema are converted into text fields.
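To make the hashing step concrete, here is a minimal Python sketch using the standard `hashlib` module. In the deployment described by the slide the hashing is done inside PostgreSQL; this example only mirrors the idea, and the function names and the sample row are assumptions made for illustration.

```python
import hashlib


def obfuscate(value):
    """Hash one value with SHA-256; NULLs stay NULL."""
    if value is None:
        return None
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def obfuscate_row(row, pii_columns):
    """Return a copy of the row with only the PII columns hashed."""
    return {
        col: obfuscate(val) if col in pii_columns else val
        for col, val in row.items()
    }


row = {"id": 42, "email": "jane@example.com", "country": "EE"}
masked = obfuscate_row(row, pii_columns={"email"})
print(masked["id"], masked["country"], len(masked["email"]))  # prints: 42 EE 64
```

A SHA-256 digest in hexadecimal form is always 64 characters long, whatever the input length, which is why a short varchar column cannot be relied on to hold the hashed value.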
  • 42. Obfuscation on the fly: The obfuscation is also applied when the data is replicated. The approach is very simple. When a row image is captured the process checks whether the table contains PII data. In that case the process generates a second jsonb element with the PII data obfuscated. The jsonb element carries the complete information about the destination schema. The PL/pgSQL function executes the change on the schema in clear and on the schema with obfuscated data.
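The "second element" idea from the slide above can be sketched as a function that takes one captured row image and returns either one or two events, each tagged with its destination schema. Everything named here (`PII_COLUMNS`, `expand_event`, the schema names `sch_clear` and `sch_obf`, the event layout) is hypothetical and chosen only for the example.

```python
import hashlib

# Hypothetical map of table names to their PII columns.
PII_COLUMNS = {"users": {"email", "phone"}}


def sha256_hex(value):
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def expand_event(event, clear_schema="sch_clear", obf_schema="sch_obf"):
    """Return the clear event plus, for PII tables, an obfuscated twin."""
    events = [dict(event, schema=clear_schema)]
    pii = PII_COLUMNS.get(event["table"], set())
    if pii:
        masked = {
            col: sha256_hex(val) if col in pii and val is not None else val
            for col, val in event["row"].items()
        }
        events.append(dict(event, schema=obf_schema, row=masked))
    return events


event = {
    "table": "users",
    "action": "insert",
    "row": {"id": 7, "email": "jane@example.com", "phone": None},
}
for item in expand_event(event):
    print(item["schema"], item["row"])
```

Because each event names its own destination schema, the replay step can treat clear and obfuscated changes identically.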
  • 43. The DDL. A real pain in the back: DDL replication is possible with a little trick. MySQL, even in ROW format, emits the DDL as statements. A regular expression traps DDL such as CREATE/DROP TABLE or ALTER TABLE. The MySQL library gets the table’s metadata from the information schema. The metadata is used to build the DDL in the PostgreSQL dialect. This approach may not be elegant but is quite robust.
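A regular expression of the kind mentioned in the slide above might look like the sketch below. The pattern is an assumption written for this example, not the one used by the tool, and it deliberately ignores cases such as schema-qualified table names.

```python
import re

# Matches CREATE/DROP/ALTER TABLE and captures the (optionally backticked)
# table name. Schema-qualified names are not handled in this sketch.
DDL_PATTERN = re.compile(
    r"^\s*(?P<command>CREATE|DROP|ALTER)\s+TABLE\s+"
    r"(?:IF\s+(?:NOT\s+)?EXISTS\s+)?`?(?P<table>\w+)`?",
    re.IGNORECASE,
)


def trap_ddl(statement):
    """Return (command, table) for table DDL, or None for anything else."""
    match = DDL_PATTERN.match(statement)
    if match is None:
        return None
    return match.group("command").upper(), match.group("table")


print(trap_ddl("ALTER TABLE `users` ADD COLUMN nickname varchar(30)"))
print(trap_ddl("create table if not exists payments (id int)"))
print(trap_ddl("BEGIN"))
```

Only the command and the table name are extracted; as the slide explains, the column definitions are then read from the information schema rather than parsed out of the statement.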
  • 44. Timing
      Query                     | MySQL           | PostgreSQL | PostgreSQL cached
      Master procedure          | 20 hours        | 4 hours    | N/A
      Extracting sharing ibans  | didn’t complete | 3 minutes  | 1 minute
      Adyen notification        | 6 minutes       | 2 minutes  | 6 seconds
  • 45. Resource comparison
      Resource         | MySQL  | PostgreSQL
      Storage Size     | 940 GB | 664 GB
      Server CPUs      | 18     | 8
      Server Memory    | 68 GB  | 48 GB
      Shared Memory    | 50 GB  | 5 GB
      Max connections  | 500    | 100
  • 46. Advantages using PostgreSQL: stronger security model; better resource optimisation (see previous slide); no invalid views; no performance issues with views; complex analytics functions; partitioning (thanks pg_pathman!); BRIN indices.
  • 47. Advantages using PostgreSQL: stronger security model; better resource optimisation (see previous slide); no invalid views; no performance issues with views; complex analytics functions; partitioning (thanks pg_pathman!); BRIN indices. "some code was optimised inside, but actually very little - maybe 10-20% was improved. We’ll do more of that in the future, but not yet. The good thing is that the performance gains we have can mostly be attributed just to PG vs MySQL. So there’s a lot of scope to improve further." Jeff McClelland - Growth Analyst, data guru
  • 48. Table of contents
  • 49. Lessons learned
  • 50. init_replica tune: The replica initialisation required several improvements. The first init_replica implementation didn’t complete: the OOM killer killed the process when the memory usage was too high. In order to speed up the replica, some large tables not required in the analytics db were excluded from the init_replica. Some tables required a custom slice size because the row length triggered the OOM killer again. Estimating the total rows for the user’s feedback is faster but the output can be odd. Using unbuffered cursors improves the speed and the memory usage.
  • 51. init_replica tune: The replica initialisation required several improvements. The first init_replica implementation didn’t complete: the OOM killer killed the process when the memory usage was too high. In order to speed up the replica, some large tables not required in the analytics db were excluded from the init_replica. Some tables required a custom slice size because the row length triggered the OOM killer again. Estimating the total rows for the user’s feedback is faster but the output can be odd. Using unbuffered cursors improves the speed and the memory usage. However, even after fixing the memory issues the initial copy took 6 days. Tuning the copy speed with the unbuffered cursors and the row number estimates improved the initial copy speed, which now completes in 30 hours, including the time required for the index build.
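Reading in slices from an unbuffered cursor keeps only one slice in memory at a time. A minimal sketch, assuming a DB-API cursor that supports `fetchmany` (with PyMySQL, for example, an unbuffered cursor is available as `SSCursor`; the slides do not say which driver was used). `read_slices` is a hypothetical helper and `FakeCursor` replaces a real connection so the example runs on its own.

```python
import itertools


def read_slices(cursor, slice_size):
    """Yield lists of at most slice_size rows until the cursor is empty."""
    while True:
        rows = cursor.fetchmany(slice_size)
        if not rows:
            return
        yield rows


class FakeCursor:
    """Stand-in for an unbuffered DB-API cursor, used only for the demo."""

    def __init__(self, rows):
        self._rows = iter(rows)

    def fetchmany(self, size):
        return list(itertools.islice(self._rows, size))


slices = list(read_slices(FakeCursor(range(7)), slice_size=3))
print([len(s) for s in slices])  # prints: [3, 3, 1]
```

A per-table `slice_size` is the knob the slide refers to: tables with very long rows need a smaller value to stay under the memory limit.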
  • 52. Strictness is an illusion. MySQL doubly so: MySQL’s lack of strictness is not a mystery. The replica broke down several times because of the funny way NOT NULL is managed by MySQL. To prevent any further replica breakdown, fields with NOT NULL added via ALTER TABLE are always created as NULLable in PostgreSQL. MySQL truncates character strings at the varchar size automatically. This is a problem if the field is obfuscated on PostgreSQL because the hashed string might not fit into the corresponding varchar field. Therefore all the character varying fields in the obfuscated schema are converted to text.
  • 53. I find your lack of constraint disturbing: Rubbish data can be stored in MySQL without the DBMS raising any error. When this happens the replicator traps the error when the change is replayed on PostgreSQL and discards the problematic row. The value is logged in the replica’s log, available for further action.
  • 54. Table of contents
  • 55. Wrap up
  • 56. Did you say hire? WE ARE HIRING! https://transferwise.com/jobs/
  • 57. That’s all folks! QUESTIONS?
  • 58. Contacts and license: Twitter: 4thdoctor_scarf; Transferwise: https://transferwise.com/; Blog: http://www.pgdba.co.uk; Meetup: http://www.meetup.com/Brighton-PostgreSQL-Meetup/. This document is distributed under the terms of the Creative Commons
  • 59. Boring legal stuff: The 4th doctor meme - source memecrunch.com. The eye, phantom playground, light end tunnel - Copyright Federico Campoli. The dolphin picture - Copyright artnoose. Deadpool Maximum Effort - source Deadpool Zoeiro. Deadpool Clap - source memegenerator.
  • 60. The ninja elephant: Scaling the analytics database in Transferwise. Federico Campoli, Transferwise, 25th January 2017