Failover is the process of moving to a healthy standby component during a failure or maintenance event in order to preserve uptime. The quicker it can be done, the faster you can be back online. However, failover can be tricky for transactional database systems, as we strive to preserve data integrity - especially in asynchronous or semi-synchronous topologies. There are associated risks, from diverging datasets to loss of data. Failing over based on incorrect reasoning, e.g., missed heartbeats during a network partition, can also cause significant harm.
This webinar replay gives a detailed overview of what failover processes may look like in MySQL, MariaDB and PostgreSQL replication setups. We cover the dangers of the failover process and discuss the trade-offs between failover speed and data integrity. We look at how to shield applications from database failures with the help of proxies, and finally at how ClusterControl manages the failover process and how it can be configured for both assisted and automated failover.
So if you're looking to minimize downtime and meet your SLAs through an automated or semi-automated approach, then this webinar replay is for you!
AGENDA
- An introduction to failover - what, when, how
- in MySQL / MariaDB
- in PostgreSQL
- To automate or not to automate
- Understanding the failover process
- Orchestrating failover across the whole HA stack
- Difficult problems
- Network partitioning
- Missed heartbeats
- Split brain
- From assisted to fully automated failover with ClusterControl
- Demo
SPEAKER
Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard.
Webinar slides: How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL
1. krzysztof@severalnines.com
Copyright 2018 Severalnines AB
Presenter: Krzysztof Książek, Senior Support Engineer @Severalnines
How to Manage Replication Failover Processes for MySQL, MariaDB & PostgreSQL
December 11th, 2018
2. Copyright 2018 Severalnines AB
I'm JJ from the Severalnines Team and I'm your host for today's webinar!
Feel free to ask any questions in the Questions section of this application or via the Chat box.
You can also contact me directly via the chat box or via email: jj@severalnines.com during or after the webinar.
Your host & some logistics
5. Copyright 2019 Severalnines AB
About ClusterControl
# Free to Download
# Initial 30 Days Enterprise Trial
# Reverts to Free Community Edition
# Enterprise / Paid Versions Available
6. Copyright 2019 Severalnines AB
ClusterControl Automation & Management
Deployment (Free Community)
# Deploy a Cluster in Minutes
○ On-Prem
○ Cloud (AWS/Azure/Google) - paid
Monitoring (Free Community)
# Systems View with 1 sec Resolution
# Agentless via SSH, or agent-based with Prometheus
# DB / OS stats & Performance Advisors
# Configurable Dashboards
# Query Analyzer
# Real-time / historical
Management (Paid Features)
# Backup Management
# Upgrades & Patching
# Security & Compliance
# Operational Reports
# Automatic Recovery & Repair
# Performance Management
# Automatic Performance Advisors
10. Copyright 2018 Severalnines AB
• An introduction to failover - what, when, how
○ in MySQL / MariaDB
○ in PostgreSQL
• To automate or not to automate
• Understanding the failover process
• Orchestrating failover across the whole HA stack
• Difficult problems
○ Network partitioning
○ Missed heartbeats
○ Split brain
• From assisted to fully automated failover with ClusterControl
○ Demo
Agenda
12. Copyright 2018 Severalnines AB
• A switchover is the process of switching the master role to another server through a slave promotion
• A failover is the process of switching the master role to another server through a slave promotion when the old master is not available or its availability is limited
○ This is the worse scenario, as you cannot assume all the slaves are in sync
• Today we will focus on the failover process
An introduction to replication failover - what, when, how
13. Copyright 2018 Severalnines AB
• A failover is performed when the old master becomes unavailable. In both MySQL and PostgreSQL replication, writes have to be sent to the master, therefore its crash affects the whole cluster, making it unavailable
• Importantly, you should verify master connectivity from the point of view of the slaves
○ It may happen that the monitoring node cannot reach the master while the slaves are happily replicating from it
○ Failover should be triggered only if the master is indeed unreachable by both the application and the slaves
An introduction to replication failover - what, when, how
14. Copyright 2018 Severalnines AB
• After a master crash you end up with one or more slaves
• Verify that the master is indeed not reachable
• Decide which slave is the most up to date and pick it as the master candidate
• Ensure there are no errant transactions on the master candidate
• Collect missing data from the master (if possible) and replay it on the master candidate
• Reslave all remaining slaves off the new master (see the sketch below)
• Ensure, to the best of your abilities, that the old master will not be started again before it can be investigated
• Rebuild the old master as a slave using the data from the new master
Failover in MySQL
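A minimal sketch of the promotion and reslaving steps in a GTID-based setup might look as follows. Hostnames and credentials are hypothetical and error handling is omitted; this illustrates the commands involved, not a production-ready script.

```bash
#!/bin/bash
# Hedged sketch: promote the chosen slave and repoint the remaining ones,
# assuming GTID-based replication. Hostnames are hypothetical.
NEW_MASTER="db2.example.com"
OTHER_SLAVES="db3.example.com db4.example.com"

# On the master candidate: stop replication, clear its slave configuration,
# and open it for writes.
mysql -h "$NEW_MASTER" -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = 0;"

# On every remaining slave: repoint replication to the new master.
# MASTER_AUTO_POSITION=1 lets GTID work out the correct coordinates.
for slave in $OTHER_SLAVES; do
  mysql -h "$slave" -e "STOP SLAVE;
    CHANGE MASTER TO MASTER_HOST='$NEW_MASTER', MASTER_AUTO_POSITION=1;
    START SLAVE;"
done
```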
15. Copyright 2018 Severalnines AB
• After an active server crash you end up with one or more standby servers
• Verify that the active server is indeed not reachable
• Find the most advanced standby server
• Trigger the failover using either pg_ctl promote or the trigger_file (see the sketch below)
• Use pg_rewind on the remaining standby servers to bring them in sync with the new master
• Reslave the remaining standby servers to the new master
• Ensure, to the best of your abilities, that the old master will not be started again before it can be investigated
• Rebuild the old master as a slave using the data from the new master
Failover in PostgreSQL
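The promotion and rewind steps might look as follows; paths and hostnames are hypothetical, and the trigger_file mechanism applies to the recovery.conf-based setups (PostgreSQL up to version 11) that these slides discuss.

```bash
# Promote the chosen standby to become the new primary:
pg_ctl promote -D /var/lib/postgresql/data

# ...or, alternatively, create the trigger file configured in recovery.conf
# (e.g. trigger_file = '/tmp/pg.trigger'):
touch /tmp/pg.trigger

# On each remaining standby, rewind its data directory to the new primary's
# timeline before re-pointing replication (the standby must be stopped):
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server="host=new-primary.example.com user=replicator dbname=postgres"
```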
17. Copyright 2018 Severalnines AB
• As shown on the last two slides, a failover requires a number of steps to be performed
○ As usual, the more steps there are and the more complex they are, the higher the chance of human error
• Scripts can easily perform all the required tasks, run all the checks, and do it far faster and more reliably than a human can
• Scripts are only as smart as we write them, though. Humans tend to be more flexible and can handle unpredictable situations better
• Should we automate the failover or not? That's the question!
• Let's go through some pros and cons of automated failover
To automate or not to automate?
18. Copyright 2018 Severalnines AB
• Pros
○ Much faster reaction to the issue
○ Higher reliability for typical situations
○ When configured correctly, may handle the majority of cases properly
○ Reduces on-call burnout - even though you page your staff, it's not as critical given that the systems are up and running
• Cons
○ Limited situational awareness - does not understand the big picture (only what has been coded in)
○ Decisions made are not always correct
○ Requires intensive testing to ensure reliability
○ Has to be maintained (if it is your own script)
To automate or not to automate?
19. Copyright 2018 Severalnines AB
• The main differentiating factors are the reaction time and the lack of situational awareness
• Automated failover will be faster but may take actions a human operator would not take
• The logic can be improved, though, and safety features like whitelists/blacklists can be used in an attempt to reduce incorrect behaviour
• Better visibility can also be implemented:
○ Access tests through multiple hosts (slaves, proxies)
○ Utilising a clustering protocol like Raft or Paxos for network split detection
• Don't expect automated failover to handle 100% of the cases correctly, though
• A third way may also be applicable - assisted failover
○ Does everything automatically but is initiated by the user after an initial assessment
To automate or not to automate?
21. Copyright 2018 Severalnines AB
• Ensuring that the master is indeed down is critical
• You never want to run two writable masters at the same time!
• You may want to implement some sort of STONITH (Shoot The Other Node In The Head) to ensure the dead master stays dead
• You can leverage data from multiple sources - are the slaves replicating? Do the proxies see the master? (see the sketch below)
Understanding the failover process
Ensure that the master is indeed down
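As a hedged illustration of cross-checking the failure from the slaves' vantage point before declaring the master dead (hostnames hypothetical):

```bash
# Confirm the slaves have lost the master too, not just the monitoring host.
for slave in db2.example.com db3.example.com; do
  # Slave_IO_Running: No plus a connection error in Last_IO_Error on ALL
  # slaves is far stronger evidence than one failed ping from the monitor.
  mysql -h "$slave" -e "SHOW SLAVE STATUS\G" | \
    grep -E 'Slave_IO_Running|Last_IO_Error'
done
```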
22. Copyright 2018 Severalnines AB
• Picking the correct slave as the master candidate is critical
• You want to use the most advanced slave to avoid data loss
• You want to ensure there are no errant transactions (in a GTID setup; see the sketch below)
• You want to allow the slave to apply the events from its relay logs (as long as it does not take too long)
• You want to try to reach the master to see if there are non-replicated binary log events
○ A master failure does not always mean you cannot SSH to it and parse the binlogs for missing transactions
Understanding the failover process
Pick the correct slave as the master candidate
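A hedged sketch of an errant transaction check in a GTID setup: compare the candidate's executed GTID set against a sibling slave's, using MySQL's GTID_SUBTRACT() function. The sibling GTID set shown is a placeholder to be filled in from your topology.

```bash
# Run on the master candidate; paste in a sibling slave's gtid_executed.
# A non-empty result that is not attributable to the old master's UUID
# points at errant (locally written) transactions on the candidate.
mysql -e "SELECT GTID_SUBTRACT(
            @@GLOBAL.gtid_executed,
            'sibling-gtid-executed-set-here') AS errant_candidates\G"
```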
23. Copyright 2018 Severalnines AB
• Correct usage of whitelists and blacklists is critical (see the configuration sketch below)
• You may not want to promote just any slave that you have
○ Better to stay within the same datacenter to avoid a split-brain scenario with two masters
○ Better to stay within the same datastore version for compatibility reasons
○ Better to stay on the same hardware for performance reasons
• While executing a failover, use the standard procedures for marking masters and slaves
○ read_only and super_read_only = 0 or 1?
Understanding the failover process
Correct usage of whitelists and blacklists
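For instance, ClusterControl exposes whitelist/blacklist options in its cmon configuration. A hedged sketch with hypothetical IPs (option names as documented by Severalnines; verify against your version):

```
# /etc/cmon.d/cmon_<cluster_id>.cnf - hypothetical excerpt.
# Only hosts on the whitelist are considered for promotion:
replication_failover_whitelist=10.0.0.11,10.0.0.12
# ...or exclude specific hosts, e.g. a delayed or backup slave:
replication_failover_blacklist=10.0.1.21
```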
24. Copyright 2018 Severalnines AB
• The automated failover process can sometimes be augmented by pre- or post-failover actions
○ Do you want to perform some action when the master fails?
○ Do you need to reconfigure an application when a new master is promoted?
○ Do you want to remove the old master's entry from your Consul key/value store? (see the sketch below)
• Most of the main tools that handle failover also support pre- and post-failover actions
○ MHA
○ Orchestrator
○ ClusterControl
Understanding the failover process
Pre- and post-failover actions
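A hedged sketch of what a post-failover hook might do, using the Consul example from the slide. The calling convention is hypothetical - each tool (MHA, Orchestrator, ClusterControl) passes topology details to its hooks in its own format.

```bash
#!/bin/bash
# post_failover.sh <old_master> <new_master> - hypothetical convention.
OLD_MASTER="$1"
NEW_MASTER="$2"

# Drop the stale entry and publish the new master's address so that
# applications reading the key/value store pick up the change:
consul kv delete "db/masters/${OLD_MASTER}"
consul kv put "db/current-master" "${NEW_MASTER}"
```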
25. Copyright 2018 Severalnines AB
Orchestrating failover across the whole HA stack
26. Copyright 2018 Severalnines AB
• Databases do not exist in a vacuum; they are surrounded by other services that together create a highly available environment
• Proxies need a way to distinguish between the master and a slave (see the sketch below)
○ In PostgreSQL streaming replication this is typically the existence of a recovery.conf file
○ In MySQL it can be, for example, the value of read_only and super_read_only: 1 or 0
• When a failover is happening, you have to make sure you manage the variable's value correctly
○ You don't want load balancers to send traffic to your databases while the failover is in progress
Orchestrating failover across the whole HA stack
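As a hedged illustration of how a proxy can classify MySQL hosts by role - ProxySQL's monitor module, for example, polls the read_only flag to assign hosts to writer or reader hostgroups - the check boils down to:

```bash
# Role check as a proxy would perform it (hostname hypothetical):
mysql -h db1.example.com -e "SELECT @@GLOBAL.read_only"
# 0 -> treated as the writable master (writer hostgroup)
# 1 -> treated as a read-only slave (reader hostgroup)
```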
29. Copyright 2018 Severalnines AB
• All load balancers deployed by ClusterControl follow these rules
○ the recovery.conf file on PostgreSQL
○ the read_only value on MySQL
• ClusterControl ensures that the values in MySQL are set according to the stage of the process
○ in a switchover, the master is demoted through read_only=1; in a failover this cannot be done
○ still, read_only=1 is set in the MySQL configuration on all nodes to minimise the chance of the old master returning as a writable host
○ the new master is marked with read_only=0
• This process works, but it does not cover all situations
Orchestrating failover across the whole HA stack
31. Copyright 2018 Severalnines AB
• Networks can be unstable and packets may be lost in transfer
• Replication itself is robust and will work quite well even if there are network problems
• Health checks performed over the network also have to take such conditions into consideration (see the sketch below)
• Make sure you do not take any action based on just a single health check
• Make sure you do not take any action based on just a single host's point of view
• Expect network problems and try to understand their severity before any action is taken
Difficult problems - network issues
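A hedged sketch of a health check that tolerates transient network problems: require several consecutive failures, spaced out in time, before treating the master as down. Host and thresholds are hypothetical.

```bash
#!/bin/bash
# Declare failure only after MAX_FAILURES consecutive failed checks,
# spaced INTERVAL seconds apart - never on a single missed heartbeat.
MASTER="db1.example.com"
MAX_FAILURES=3
INTERVAL=5
failures=0

while [ "$failures" -lt "$MAX_FAILURES" ]; do
  if mysqladmin -h "$MASTER" --connect-timeout=2 ping >/dev/null 2>&1; then
    failures=0          # one success resets the count
  else
    failures=$((failures + 1))
  fi
  sleep "$INTERVAL"
done
echo "master unreachable after $MAX_FAILURES consecutive checks"
```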
32. Copyright 2018 Severalnines AB
• Every cluster type has its own problems. For MySQL and PostgreSQL replication, one of the biggest issues is the lack of cluster awareness and the lack of quorum support
• Replication clusters are prone to network split issues
• Automated topology detection by proxies can make things even more tricky
• There's no easy, standard way to avoid this problem
Difficult problems - network split
33. Copyright 2018 Severalnines AB
• A network split happens when there is no connectivity between one part of the cluster and the other
○ For example, the master cannot reach the slaves and the slaves cannot reach the master
• The master is unavailable, therefore the cluster cannot handle writes
○ A failover should be performed to restore the cluster's ability to handle traffic
• The master is still running, though - when the networks converge, two writable hosts will show up
• Standard topology detection logic will not be enough: two nodes will have read_only=0, or two nodes will not have the recovery.conf file (see the sketch below)
○ Without additional measures to ensure the old master won't get traffic, a split brain is imminent
Difficult problems - network split
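A hedged sketch of why topology detection alone fails here: a naive scan for the writable node can suddenly find two of them (hostnames hypothetical).

```bash
# Count hosts that report themselves writable; after a network split heals,
# this can legitimately return more than one - which a naive
# "the writable node is the master" rule cannot resolve on its own.
writable=0
for host in db1.example.com db2.example.com db3.example.com; do
  ro=$(mysql -h "$host" -N -e "SELECT @@GLOBAL.read_only" 2>/dev/null)
  [ "$ro" = "0" ] && writable=$((writable + 1))
done
[ "$writable" -gt 1 ] && echo "ALERT: $writable writable masters detected"
```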
34. Copyright 2018 Severalnines AB
• Split brain is a condition in which two writable nodes take traffic and, as a result, their data sets drift apart
• There is no easy way to recover from such a condition
○ Shut down the rogue master as soon as possible to minimise the data drift
○ Manual action will be required to converge the data sets
• Make sure that whatever solution you choose, it works
○ You can do better than GitHub!
Difficult problems - split brain
36. Copyright 2018 Severalnines AB
• There are numerous ways to reduce (but not eliminate) the probability and impact of your data being affected by network issues
• Collect as much data as possible about the state of the replication topology before any action is taken
○ Utilize multiple nodes as points of view on the topology
• Try to implement STONITH to reduce the chance of the old master coming back (see the sketch below)
○ Some kind of lights-out solution (iLO, for example) might work in a physical environment
○ Kill scripts (destroy a given virtual instance) may work in the cloud
• Modify the configuration of the proxies to remove the old master after it is deemed dead
• No solution will be 100% bulletproof
○ You may not be able to reach all the proxies, the node itself, or the cloud service to kill the master
Difficult problems - how to avoid them?
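A hedged sketch of a cloud kill script along the lines the slide suggests: forcibly stop the old master's instance so it cannot come back as a writable host. The instance ID is hypothetical; aws ec2 stop-instances is the standard AWS CLI call.

```bash
#!/bin/bash
# STONITH-style kill script for a cloud environment. As the slide notes,
# this can itself fail (cloud API unreachable), so treat it as one safety
# layer, not a guarantee.
OLD_MASTER_INSTANCE_ID="i-0123456789abcdef0"

aws ec2 stop-instances --force --instance-ids "$OLD_MASTER_INSTANCE_ID" \
  && echo "stop request issued for $OLD_MASTER_INSTANCE_ID" \
  || echo "WARNING: could not reach the cloud API - old master may come back"
```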