fixing/restablishing mysql replication with rsync

One approach to fixing a broken MySQL replication setup…. using rsync.

So, with a standard MySQL master/slave (or master/replica if you’re more PC than me) sometimes the replication gets messed up (eg if the master is power cycled).

A dangerous fix is to try running these queries on the slave, which hopefully skips the troublesome query:

STOP SLAVE
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1
START SLAVE

But, needing to skip queries on the client almost certainly means your data is out of sync (as after all a query that fails on one node should fail on the other and a query that succeeds on one node should not fail on the other)…. So doing this should result in you needing to skip more and more…

mysqldump / reload variant

Most guides online suggest something like the following which has the benefit of not requiring the master node be stopped (but it does have to become read only while a dump takes place).

dump the master ( mysqldump –master-data=1 –all-databases > something.sql)
1. Note –master-data implies single transaction and lock-all-tables etc which may cause problems with MyISAM tables that are being written to.

And on the slave:

<import the something.sql dump file from the master>
START SLAVE

and then run ‘show slave status \G’ a few times …. and make sure everything’s looking good.

The problem with this is that dumping the entire database could take a long time (perhaps an hour or two, depending on hardware and database size) – during which time the database is likely to only available for read queries (so most web clients will block/timeout as they often use the database as a session store).

In my case I had a master server with about 6Gb in /var/lib/mysql and a few hundred databases. I was worried that dumping the database while holding a read lock would take a long time (unfortunately most of it’s data is not in InnoDB).

Therefore, I opted to modify the above a bit, and use rsync to synchronise the data between nodes. This had the advantage that I could have the two servers nearly identical before taking the master offline to perform the final sync (at which point I know they’re identical). It also saves me having to do the CHANGE MASTER …. stuff on the slave as it’s not necessary.

rsync variant

(For MySQL 5.1…)

So, on the master:

RESET MASTER

(just to delete all the now unnecessary binary log files on the master – depending on how long your replication has been broken, they might be taking up quite a bit of space!)

On the slave :

/etc/init.d/mysqld stop
rsync --delete --delete-excluded -e ssh --exclude=$MASTER_HOSTNAME-* --exclude=relay-log.info -arv root@$MASTER_IP:/var/lib/mysql/ /var/lib/mysql/

After you’ve got a basic initial rsync replica, it’s time to stop the master node so we know they’re identical.

On the master:

FLUSH TABLES WITH READ LOCK;
RESET MASTER;
/etc/init.d/mysqld stop (from a separate terminal)

Then on the slave, we just need to do the final sync :

Re-run the rsync command above (should be pretty quick this time)

Then on the master:

/etc/init.d/mysqld start

Then on the slave :

/etc/init.d/mysqld start

Running:

show slave status \G

shows it’s 0 seconds behind the master and synchronisation is working correctly.