Monitoring Auto Increment Capacity in MySQL
Recently we came uncomfortably close to running out of auto increment values in a MySQL table. The following is a description of how it happened and what we did about it.
We thought the table had been created with plenty of auto increment values. We used an unsigned int for the auto increment primary key, so it had a maximum value of about 4 billion. We happen to use an
auto_increment_increment of 4 in this particular environment, so I figured the table could accommodate about 1 billion rows before we ran out of values. Six years of data in that table totalled 100 million rows, and given the nature of the data the row count will not increase much faster than that even if our business grows dramatically, so I wasn't worried about running out of auto increment values in this table for at least 20 years.
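The arithmetic behind that estimate is easy to check directly in MySQL (the numbers below are the ones from this post):

```sql
-- Unsigned INT max is 4,294,967,295. With auto_increment_increment = 4,
-- only every fourth value is usable, so the ceiling is about 1.07 billion rows.
SELECT
  4294967295 DIV 4                  AS usable_values,   -- ~1.07 billion
  100000000 / (4294967295 DIV 4)    AS fraction_used;   -- ~9% after six years
```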
Those assumptions are fine as long as we're only inserting data into the table and never deleting from it, which seemed fairly safe since it's a fact table that is updated only once per day. The problem is that about a year ago we made a code change to update the current day's data hourly throughout the day. As that change was implemented, rather than updating the existing rows and keeping their auto increment values, we bulk load the new data and overwrite the existing rows using
load data infile ... replace. The replace is implemented as a delete followed by an insert, which means we're now using about 20 times as many auto increment values per day as we were before.
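To make the failure mode concrete, here is a sketch of the pattern with a hypothetical fact table (the table and column names are illustrative, not our actual schema). Every hourly reload deletes each matching row by the unique key and inserts a replacement, consuming a fresh auto increment value each time:

```sql
-- Hypothetical fact table with a surrogate auto increment key
-- and a natural unique key:
CREATE TABLE fact_daily (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  day        DATE NOT NULL,
  metric_key VARCHAR(64) NOT NULL,
  value      BIGINT NOT NULL,
  UNIQUE KEY uk_day_metric (day, metric_key)
);

-- The hourly reload. REPLACE deletes the old row (matched on the
-- unique key) and inserts a new one, so id advances on every reload:
LOAD DATA INFILE '/tmp/today.csv'
REPLACE INTO TABLE fact_daily
FIELDS TERMINATED BY ','
(day, metric_key, value);
```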
Luckily the failure occurred in a test environment rather than production. Here’s the error we got when we ran out of auto increment values:
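The original error output isn't preserved here, but the symptom is characteristic: once the AUTO_INCREMENT counter reaches the column's maximum it stops advancing, so further inserts reuse the maximum value and fail with a duplicate-key error along these lines (exact wording varies by version):

```
ERROR 1062 (23000): Duplicate entry '4294967295' for key 'PRIMARY'
```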
Once I saw that error I checked the production database and found that the max(id) was dangerously close to the upper range of unsigned int values.
The typical fix for an issue like this is to increase the size of the column from a 4 byte INT to an 8 byte BIGINT. An unsigned BIGINT can store a maximum value of 18 quintillion.
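For anyone taking that route, the change is a single ALTER (table and column names here are placeholders; on a large table this rewrites every row, which is exactly why we reached for an online schema change tool below):

```sql
-- Widen the surrogate key from 4-byte INT to 8-byte BIGINT.
ALTER TABLE fact_daily
  MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
```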
In this case I decided to drop the auto increment primary key altogether. We added it years ago based on my misunderstanding at the time of the InnoDB clustered index, because I thought having a surrogate key would make the table smaller. The table already had a separate unique index that we could use as a natural primary key, and we did not have any code referencing the surrogate key, so we just decided to drop it.
The relevant table is about 15 GB, so I expected it to take several hours to drop the column. Given how close we were to running out of auto increment values we decided to drop the column using pt-online-schema-change so the application could continue writing to and reading from the table as we were dropping the column. That tool worked perfectly for this task, and we were able to fix the table in all of our production databases relatively quickly.
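For reference, the pt-online-schema-change invocation looks roughly like this; the database, table, and key columns are placeholders, and the tool refuses a DROP PRIMARY KEY unless you explicitly acknowledge it with --no-check-alter, so review the documentation for your version before running anything like it:

```shell
pt-online-schema-change \
  --alter "DROP COLUMN id, DROP PRIMARY KEY, ADD PRIMARY KEY (day, metric_key)" \
  --no-check-alter \
  D=mydb,t=fact_daily \
  --execute
```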
In addition to proper planning up front when designing the schema, we also need to monitor all of our auto increment columns to ensure that we never run out of values.
I came up with an ad hoc query to do this a while back, and others have devised similar queries. But we’re really looking for a more standardized way to monitor this. I saw a presentation on common_schema at the Percona Live MySQL Conference back in April, so I thought this would be a good chance to put that tool to use. Here’s a quick query I wrote against common_schema to find the 10 tables in my database that were the closest to running out of auto increment values:
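The original query didn't survive here, but common_schema ships an auto_increment_columns view that precomputes how full each auto increment column is, so a query to the same effect looks like this (view and column names as I recall them from common_schema's documentation):

```sql
-- Top 10 tables closest to exhausting their auto increment range.
-- auto_increment_ratio is the fraction of the column's capacity used.
SELECT table_schema, table_name, auto_increment_ratio
FROM common_schema.auto_increment_columns
ORDER BY auto_increment_ratio DESC
LIMIT 10;
```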
The query was easy to write, and it gave correct results, but it was slow. Looking at the code I saw that common_schema uses information_schema a lot, which makes sense, but information_schema is notoriously slow and can even impact end users if you query it on an active production database. So I was hesitant to use this query in production, but then I remembered a great blog post that Peter Zaitsev wrote a few months ago about improving information_schema performance. I tried the suggestion from that blog post:
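The specific statement isn't preserved here, but as I remember it the suggestion from that post was to stop InnoDB from refreshing index statistics on every information_schema query:

```sql
-- Skip the expensive statistics refresh when metadata tables are queried.
SET GLOBAL innodb_stats_on_metadata = 0;
```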
And then re-ran the query against common_schema.
The query took 47 seconds to run before, and less than 1 second after. With that change in place I am comfortable running the common_schema query in production.
I ran it manually and found a couple of other tables that were about 50% of the way to running out of auto increment values, so I will either drop the surrogate keys or change them to BIGINT as necessary.
We’ve solved the short term problem, so the next step is to add a Nagios check to monitor the databases and send an alert if an auto increment column gets to a certain threshold. We haven’t decided on the alert threshold yet, but we’ll probably start at or below 50%.