A Simple Way to Make MySQL Replication More Crash-safe
Here’s the problem that inspired me to make this change.
It started when one of our passive MySQL master database hosts restarted unexpectedly. The host came back up fairly quickly, and MySQL started up cleanly once InnoDB finished its crash recovery.
However, all of this master’s slave databases were in a failed state with the same error:
Apparently some replication events were lost from the end of the binary log when the host went down, so the slave was asking for a position that no longer existed. Since this was a passive master we didn’t lose any data (the relevant events were still on the active master and came through in the next binary log on the passive master), but replication from this master was temporarily broken.
I manually rolled all of the slaves forward to the appropriate binary log using
CHANGE MASTER, but they were behind for several minutes before I did that.
Digging through the configuration on the master I discovered that sync_binlog was disabled, as it is by default in MySQL 5.5. Here’s how that option is explained in the manual:
If the value of this variable is greater than 0, the MySQL server synchronizes its binary log to disk (using fdatasync()) after every sync_binlog writes to the binary log. There is one write to the binary log per statement if autocommit is enabled, and one write per transaction otherwise. The default value of sync_binlog is 0, which does no synchronizing to disk—in this case, the server relies on the operating system to flush the binary log’s contents from to time as for any other file. A value of 1 is the safest choice because in the event of a crash you lose at most one statement or transaction from the binary log. However, it is also the slowest choice (unless the disk has a battery-backed cache, which makes synchronization very fast).
That’s a pretty good description of the trade-offs involved, and given my write volume I figured it would be worth it for us to enable that option. I did some testing to verify that our masters can handle the extra IO overhead and then enabled those settings on all master databases where it was disabled. Now replication will be more likely to resume automatically after a master failure.
I also updated all of our slave databases to enable sync_relay_log and sync_relay_log_info so if a slave db restarts it’s more likely that replication will resume automatically. As expected, the manual enumerates the same “safest” and “slowest” arguments for sync_relay_log.
The manual is less clear about how to set sync_master_info and sync_relay_log_info. It recommends disabling sync_master_info and enabling sync_relay_log_info, even though both are disabled by default. And it doesn’t explain the reasoning behind those recommendations. I guess that’s left as an exercise for the reader :)
In my opinion, if you enable sync_relay_log it makes sense to enable sync_master_info and sync_relay_log_info, too.