Flite Culture

A Fail-fast First Test of MySQL Backup Integrity

This morning I was planning to check the integrity of a snapshot of a MySQL database that runs on EC2 using EBS, and I accidentally stumbled upon a fast way to quickly find certain types of problems with a backup.

My plan was to use CHECK TABLE on all of the tables to verify the integrity of the backup. The simplest way to run CHECK TABLES on an entire database is like this:

1
mysqlcheck -c --all-databases

Rather than using that command, I decided to generate a bunch of individual CHECK TABLE statements dynamically by running a query on the information_schema.tables table. Why do it that way? It gives me the flexibility to check a specific schema first, or check the smallest or most frequently updated tables first, etc.

So I ran this command to generate my CHECK TABLE script:

1
2
3
4
5
select concat('check table ', table_schema, '.', table_name, ';') as check_stmt
into outfile '/tmp/check_tables.sql'
from information_schema.tables
where table_type = 'BASE TABLE'
order by (data_length + index_length) asc;

And I planned to execute the script like this:

1
mysql -f < /tmp/check_tables.sql

However, when I ran the first query it crashed the MySQL server!

1
2
3
4
5
6
mysql> select concat('check table ', table_schema, '.', table_name, ';') as check_stmt
    -> into outfile '/tmp/check_tables.sql'
    -> from information_schema.tables
    -> where table_type = 'BASE TABLE'
    -> order by (data_length + index_length) asc;
ERROR 2013 (HY000): Lost connection to MySQL server during query

Of course that error message doesn’t always mean that mysqld crashed. Sometimes it just means your session was killed. But in this case I checked, and mysqld did indeed crash.

It was a harmless crash since it was on a passive server, but it pointed out a problem with the backup so I had to find out exactly what went wrong so I could fix it.

I checked the error log and found the source of the crash:

1
2
3
4
5
6
7
8
9
10
11
12
InnoDB: Error: tablespace id is 17722 in the data dictionary
InnoDB: but in file ./test/promo_history_old.ibd it is 18280!
130221  9:39:55  InnoDB: Assertion failure in thread 139984392570624 in file fil0fil.c line 765
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
130221  9:39:55 - mysqld got signal 6 ;
...

I’ve omitted the rest of the error output (including the backtrace) for brevity, but it’s clear which table caused the problem: test.promo_history_old.

After confirming that table is no longer needed I dropped it and re-ran the information_schema query successfully so I could move on with the rest of my integrity check.

In the future I’ll run a quick fail-fast test on information_schema before doing the more time-consuming check of running CHECK TABLE on all of the tables. I think this should be sufficient:

1
select * from information_schema.tables;

Of course I’ll still run CHECK TABLE on all of the tables, but the information_schema test will help me find some problems faster.

Comments

Comments