An issue of all slaves stop replication

Kentoku SHIBA

Abstract of the issue
- All slaves stop replication with “Got fatal error 1236 from master when reading data from binary log: 'unknown error reading log event on the master; the first event 'binlog.004215' at 8610479, the last event read from './binlog.004226' at 154, the last byte read from './binlog.004226' at 154.'”. The error is visible in the output of “SHOW SLAVE STATUS”.
- When you execute “START SLAVE”, the position advances slightly and replication stops again with the same error.
- Replication recovers when “START SLAVE” is executed on the slaves after a binary log rotation on the master.

Don’t panic. The details are explained below.

When a slave requests the latest binary log
The master references MYSQL_BIN_LOG::binlog_end_pos to check up to which position of the latest binary log it can send to the slave.
(Diagram: the slave sends a binary log request to the master; the master sends binary log events to the slave, bounded by MYSQL_BIN_LOG::binlog_end_pos.)

Behavior of COMMIT
With SYNC_BINLOG=1, the end position of the binlog is held in THD::m_trans_end_pos during the FLUSH stage, then copied to MYSQL_BIN_LOG::binlog_end_pos during the SYNC stage.
(Diagram, master: FLUSH stage, SYNC stage, COMMIT stage; THD::m_trans_end_pos belongs to the FLUSH stage and MYSQL_BIN_LOG::binlog_end_pos to the SYNC stage.)

Overview of the stages at COMMIT
- FLUSH stage: writes the events of transactions to the binlog. (Physical writes are not guaranteed at this stage.)
- SYNC stage: physically writes the binlog events.
- COMMIT stage: finalizes the COMMIT in each storage engine. (Transactions are in PREPARE status before this stage.)
The stages work independently: one transaction can be in the COMMIT stage while the next is in the SYNC stage and another is in the FLUSH stage.
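The overlap of the three stages can be pictured with a small simulation. This is only a sketch: pipeline_snapshots is a hypothetical helper, and it assumes a lock-step pipeline in which every transaction advances one stage per tick, which is a simplification of what the server actually does.

```python
STAGES = ["FLUSH", "SYNC", "COMMIT"]  # stage names as used on the slide


def pipeline_snapshots(transactions):
    """Return, for each tick, a dict mapping stage name to the transaction in it.

    Assumption: a lock-step pipeline, one stage per transaction per tick.
    """
    n = len(transactions)
    snapshots = []
    for tick in range(n + len(STAGES) - 1):
        snap = {}
        for offset, stage in enumerate(STAGES):
            index = tick - offset  # transaction that entered `offset` ticks ago
            if 0 <= index < n:
                snap[stage] = transactions[index]
        snapshots.append(snap)
    return snapshots


if __name__ == "__main__":
    for tick, snap in enumerate(pipeline_snapshots(["T1", "T2", "T3"])):
        print(tick, snap)
```

In the third snapshot all three stages are occupied at once, by three different transactions.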

The condition that causes this issue
This issue occurs when a binlog rotation happens before MYSQL_BIN_LOG::binlog_end_pos is updated in the SYNC stage.
(Diagram, master: FLUSH stage with THD::m_trans_end_pos, SYNC stage with MYSQL_BIN_LOG::binlog_end_pos, COMMIT stage; “Rotate a binlog” takes place before the SYNC stage updates MYSQL_BIN_LOG::binlog_end_pos.)
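One way to read this condition is that a position saved for the old file gets published for the new file. The toy model below sketches that reading only. ToyBinlog, HEADER_SIZE and race_demo are hypothetical names, not MySQL source code, and locking, group commit and the real bookkeeping are not modeled.

```python
from dataclasses import dataclass

# Assumption: a freshly rotated binlog starts at position 154,
# the position shown in the quoted error message.
HEADER_SIZE = 154


@dataclass
class ToyBinlog:
    file_no: int = 1
    written: int = HEADER_SIZE         # bytes really present in the current file
    binlog_end_pos: int = HEADER_SIZE  # position the sender trusts as readable

    def flush(self, nbytes):
        """FLUSH stage: append events and return the saved end position
        (stands in for THD::m_trans_end_pos)."""
        self.written += nbytes
        return self.written

    def sync(self, trans_end_pos):
        """SYNC stage: publish the saved position without checking the file."""
        self.binlog_end_pos = trans_end_pos

    def rotate(self):
        """Start a new file and reset both positions."""
        self.file_no += 1
        self.written = HEADER_SIZE
        self.binlog_end_pos = HEADER_SIZE

    def readable_position_is_valid(self):
        return self.binlog_end_pos <= self.written


def race_demo():
    log = ToyBinlog()
    saved = log.flush(1000)   # transaction reaches the FLUSH stage
    log.rotate()              # rotation slips in before the SYNC stage
    log.sync(saved)           # stale position is published for the new file
    broken = not log.readable_position_is_valid()
    log.rotate()              # a later rotation resets the position
    recovered = log.readable_position_is_valid()
    return broken, recovered
```

In this model the file itself is intact; only the advertised readable position is wrong until the next rotation.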

Behaviors after the issue occurs
- The binlog is not broken; only the readable position is wrong. So replication can be restarted by command: the master reads and sends binary log events that were written after the previous error position, and then replication stops with the same error.
- When the binlog is rotated again, MYSQL_BIN_LOG::binlog_end_pos is reset for the new binlog file. If no transaction is in the SYNC stage at that moment, the problem is gone.
- If a transaction updates tables that support XA transactions, MYSQL_BIN_LOG::m_prep_xids is incremented in the FLUSH stage. Binlog rotation then waits until it is decremented in the COMMIT stage, so this case does not cause the issue.
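The protection given by MYSQL_BIN_LOG::m_prep_xids can be sketched as a counter that holds rotation back. ToyRotationGate is a hypothetical class for illustration; the real server makes the rotation wait, whereas this sketch simply refuses it.

```python
class ToyRotationGate:
    """Toy counter standing in for MYSQL_BIN_LOG::m_prep_xids."""

    def __init__(self):
        self.prep_xids = 0
        self.rotations = 0

    def flush_stage(self, supports_xa):
        # Only transactions on XA-capable tables are counted.
        if supports_xa:
            self.prep_xids += 1

    def commit_stage(self, supports_xa):
        if supports_xa:
            self.prep_xids -= 1

    def try_rotate(self):
        """Rotate only when no counted transaction is between FLUSH and COMMIT."""
        if self.prep_xids > 0:
            return False  # the real server would wait here instead
        self.rotations += 1
        return True


gate = ToyRotationGate()
gate.flush_stage(supports_xa=True)
blocked = not gate.try_rotate()      # rotation is held back
gate.commit_stage(supports_xa=True)
allowed = gate.try_rotate()          # counter is back to zero
gate.flush_stage(supports_xa=False)  # e.g. a MyISAM update: not counted
unprotected = gate.try_rotate()      # nothing holds the rotation back
```

The last call shows why tables without XA support are exposed: nothing stops the rotation while such a transaction is still between FLUSH and COMMIT.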

Detailed conditions that cause the issue
Base conditions
- MySQL 5.7 or 8.0 (including Percona Server etc. Amazon RDS? Aurora? No source code available. MariaDB does not have this issue.)
- SYNC_BINLOG=1
- Binary logging is enabled
The binlog is rotated at the same time as one of the following actions. (This includes rotation by the FLUSH command.)
- Updating tables that do not support XA transactions, such as MyISAM and MEMORY.
- Executing DDL, except atomic DDL in MySQL 8.0. (“CREATE TABLE … SELECT …” is not included in atomic DDL.)
- Executing commands such as ANALYZE TABLE.
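The base conditions can be checked mechanically. The function below is a minimal sketch; meets_base_conditions is a hypothetical helper, and its arguments are assumed to hold the values of @@version, @@sync_binlog and @@log_bin read from the master. It says nothing about the timing condition (a rotation coinciding with one of the listed actions), which cannot be checked statically.

```python
def meets_base_conditions(version, sync_binlog, log_bin):
    """Return True when the slide's base conditions hold.

    version:     value of @@version, e.g. "5.7.30-log"
    sync_binlog: value of @@sync_binlog
    log_bin:     truthy when binary logging is enabled
    """
    if "mariadb" in version.lower():
        return False  # the slide states MariaDB does not have this issue
    affected_series = version.startswith(("5.7.", "8.0."))
    return affected_series and sync_binlog == 1 and bool(log_bin)


if __name__ == "__main__":
    print(meets_base_conditions("5.7.30-log", 1, True))
    print(meets_base_conditions("10.4.12-MariaDB", 1, True))
```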

The statements that cause the issue (No.1)
MySQL 5.7 & MySQL 8.0
ANALYZE TABLE, REPAIR TABLE, OPTIMIZE TABLE, FLUSH TABLES,
FLUSH PRIVILEGES, FLUSH ENGINE LOGS, FLUSH ERROR LOGS,
FLUSH GENERAL LOGS, FLUSH HOSTS, FLUSH OPTIMIZER_COSTS,
FLUSH RELAY LOGS, FLUSH SLOW LOGS,
FLUSH STATUS, FLUSH USER_RESOURCES,
ALTER INSTANCE,
CREATE TABLE ... SELECT ...
MySQL 5.7 & MySQL 8.0 (tables that do not support atomic DDL)
CREATE TABLE, DROP TABLE, ALTER TABLE, RENAME TABLE,
TRUNCATE, CREATE INDEX, DROP INDEX
MySQL 5.7 & MySQL 8.0 (tables that do not support XA transactions)
INSERT, UPDATE, DELETE, REPLACE, LOAD DATA

The statements that cause the issue (No.2)
MySQL 5.7
CREATE USER, DROP USER, RENAME USER, ALTER USER,
SET PASSWORD, GRANT, REVOKE (ALL),
CREATE DATABASE, DROP DATABASE, ALTER DATABASE,
CREATE VIEW, DROP VIEW,
CREATE TABLESPACE, DROP TABLESPACE, ALTER TABLESPACE,
CREATE FUNCTION, DROP FUNCTION, ALTER FUNCTION,
CREATE PROCEDURE, DROP PROCEDURE, ALTER PROCEDURE,
CREATE TRIGGER, DROP TRIGGER,
CREATE EVENT, DROP EVENT, ALTER EVENT,
FLUSH DES_KEY_FILE, FLUSH QUERY CACHE

Approaches for recovering from or avoiding the issue
The following approaches are possible. (In some cases other approaches are possible as well.)
1. Execute “FLUSH BINARY LOGS” on the master to rotate the binlog, then execute “START SLAVE (IO_THREAD)” on the slaves. If the same error appears on a slave, execute “FLUSH BINARY LOGS” and “START SLAVE” again.
2. Keep executing “START SLAVE (IO_THREAD)” until there are no more errors.
3. Change “SYNC_BINLOG” to a value other than 1. This requires care about data inconsistency in case of a master failure.
4. If you know in advance that the problem does not occur during normal operation, execute “FLUSH BINARY LOGS” before executing DDL etc. during maintenance, to reduce the possibility of a binary log rotation during DDL execution.
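Approach 1 amounts to a bounded retry loop. The sketch below assumes three caller-supplied callables, all hypothetical: flush_binary_logs runs “FLUSH BINARY LOGS” on the master, start_slave runs “START SLAVE” on a slave, and last_io_errno waits for the I/O thread to settle and returns Last_IO_Errno from “SHOW SLAVE STATUS”. How they connect to the servers is left open.

```python
FATAL_BINLOG_READ_ERROR = 1236  # error number quoted in the slide


def recover_replication(flush_binary_logs, start_slave, last_io_errno,
                        max_attempts=5):
    """Rotate the master's binlog and restart the slave until error 1236 is gone.

    Returns the number of attempts used, or None when max_attempts
    rotations were not enough.
    """
    for attempt in range(1, max_attempts + 1):
        flush_binary_logs()  # rotate the binlog on the master
        start_slave()        # restart replication on the slave
        if last_io_errno() != FATAL_BINLOG_READ_ERROR:
            return attempt
    return None
```

Bounding the attempts avoids rotating binary logs forever if the error has a different cause than the one described in these slides.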
