MySQL topology healing at OLA.


This talk elaborates on how to detect failures in and heal your MySQL topology with MySQL Orchestrator. It was delivered at the Mydbops Database Meetup on 27-04-2019 by Anil Yadav, Lead Database Engineer at OLA, and Krishna Ramanathan, Database Administrator III at OLA.


  1. MySQL Topology Healing: The Ola Story. Anil Yadav, Krishna R
  2. Motivation ● Uncertainties in the public cloud ● Business continuity ● Data consistency
  3. What We Had
  4. High Availability Objectives ● How much outage time can you tolerate? ● How reliable is crash detection? Can you tolerate false positives (premature failovers)? ● How reliable is failover? Where can it fail? ● How well does the solution work across data centers, on low- and high-latency networks? ● Can you afford data loss? To what extent?
  5. Possible Solutions
  6. MHA ● Pros ○ Adoption ○ Data healing ● Cons ○ Dormant community ○ Topology awareness ○ Compatibility with MaxScale
  7. MaxScale ● Pros ○ Already resident in our architecture ○ Pluggable ○ Backed by MariaDB ● Cons ○ Latency ○ Topology awareness ○ No community
  8. ProxySQL ● Pros ○ Feature-rich ○ Vibrant community ○ Backed by Percona ● Cons ○ Latency ○ Topology awareness
  9. The Chosen One ● MySQL Orchestrator ○ Pros ■ Adoption ■ Topology awareness ■ Large installations (Booking.com, GitHub) ○ Cons ■ Needs GTID or MaxScale for healing
  10. Building Blocks ● MySQL Orchestrator ● MaxScale binlog servers ● Semi-sync replication ● NVMe storage
  11. Orchestrator in Action ● Pre-failover process ● Healing ● Post-failover process
  12. orchestrator.conf.json (excerpt):

      "FailureDetectionPeriodBlockMinutes": 5,
      "RecoveryPeriodBlockSeconds": 1800,
      "RecoveryIgnoreHostnameFilters": ["slave"],
      "RecoverMasterClusterFilters": ["orch-master"],
      "RecoverIntermediateMasterClusterFilters": ["orch-master"],
      "OnFailureDetectionProcesses": [
        "echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}. We do not panic.' >> /usr/local/orchestrator/recovery.log",
        "/eni_modules/orch_sendmail.py 'Master {failedHost} detected for {failureType}'"
      ],
      "PreFailoverProcesses": [
        "echo 'Will recover from {failureType} on {failureCluster}. Failed host: {failedHost}' >> /usr/local/orchestrator/recovery.log",
        "/eni_modules/eni_detach.sh {failedHost} {failureType} >> /usr/local/orchestrator/recovery.log"
      ],
      "PostFailoverProcesses": [
        "echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}.' >> /usr/local/orchestrator/recovery.log"
      ],
      "PostUnsuccessfulFailoverProcesses": [],
      "PostMasterFailoverProcesses": [
        "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /usr/local/orchestrator/recovery.log",
        "/eni_modules/eni_attach.sh {failedHost} {successorHost} >> /usr/local/orchestrator/recovery.log"
      ],
      "PostIntermediateMasterFailoverProcesses": [
        "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /usr/local/orchestrator/recovery.log"
      ],
  13. Pre-Failover Process ● read_only is set on the MySQL master. ● The ENI is detached from the master via the AWS CLI, preventing split-brain. ● Remaining client connections are killed.
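The pre-failover ENI detach can be sketched as a small shell helper. The slides reference a hook called eni_detach.sh but do not show its contents, so the function name, the attachment-id argument, and the sample IDs below are illustrative assumptions, not OLA's actual script.

```shell
#!/bin/sh
# Hypothetical sketch of the pre-failover ENI detach step. The real
# eni_detach.sh is not shown in the slides; names and IDs here are assumed.

# Compose the AWS CLI call that detaches the failed master's ENI.
# --force is relevant because a dead instance cannot acknowledge the detach.
build_detach_cmd() {
    attachment_id="$1"
    echo "aws ec2 detach-network-interface --attachment-id $attachment_id --force"
}

# Before detaching, the hook would also set the old master read-only, e.g.:
#   mysql -h "$failed_host" -e 'SET GLOBAL read_only = ON'
# and then kill the remaining client connections.

build_detach_cmd "eni-attach-0123456789abcdef0"
```

With the ENI gone, application traffic can no longer reach the old master even if the instance later recovers, which is what closes the split-brain window.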
  14. Healing ● The binlog server that is furthest ahead is chosen. ● The other binlog servers are regrouped under it, making the topology consistent.
  15. Healing ● The new candidate master is chosen, honoring the "PromotionIgnoreHostnameFilters" setting, e.g. "PromotionIgnoreHostnameFilters": ["slave","lytic","backup"] ● The new master's binlog is flushed, and the binlog servers are repointed under it.
  16. Post-Failover Process ● The ENI is attached to the new master via the AWS CLI. ● Client connections can now be seen on the new master. ● This marks the end of the recovery process.
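The matching post-failover step can be sketched the same way. The slides reference a hook called eni_attach.sh but do not show its contents, so the function name, the --device-index value, and the sample IDs below are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the post-failover ENI attach step. The real
# eni_attach.sh is not shown in the slides; names and IDs here are assumed.

# Compose the AWS CLI call that attaches the floating ENI to the promoted
# master, so clients addressing the ENI's IP now land on the new master.
# --device-index 1 (a secondary interface) is an assumption.
build_attach_cmd() {
    eni_id="$1"
    instance_id="$2"
    echo "aws ec2 attach-network-interface --network-interface-id $eni_id --instance-id $instance_id --device-index 1"
}

build_attach_cmd "eni-0123456789abcdef0" "i-0123456789abcdef0"
```

Because clients follow the ENI rather than a hostname, no application-side reconfiguration is needed: once the attach completes, connections appear on the new master and recovery is done.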
  17. Challenges ● Orchestrator's upstream does not support MaxScale binlog servers, so we had to fall back to the earlier version: https://github.com/outbrain/orchestrator ● A master dead because of an EC2 failure could reach the state "checkAndRecoverUnreachableMasterWithStaleSlaves"; Orchestrator was patched to arrive at "checkAndRecoverDeadMaster" instead. ● Orchestrator's forced takeover was failing, so it was patched to follow the same path as a "DeadMaster". ● The forked branch with these changes: https://github.com/varunarora123/orchestrator
  18. 18. Demo
  19. Questions?
  20. We are expanding our team. Reach out to us at anil.yadav1@olacabs.com / krishna.r@olacabs.com