Alternatives for Chunking Bulk Deletes in common_schema
I’ve blogged about common_schema multiple times in the past, and it’s a tool I use frequently. Last week I had a project to delete millions of rows from multiple rollup tables in a star schema. Since the tables are not partitioned I needed to use DELETE instead of DROP PARTITION, but I didn’t want to delete millions of rows in a single transaction. My first instinct was to use common_schema’s split() function to break the deletes into chunks. So I ran a query on INFORMATION_SCHEMA to generate a bunch of statements like this:
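A sketch of the general shape of those statements, based on common_schema's documented QueryScript syntax (the schema, table, and WHERE condition here are illustrative stand-ins, not the originals):

```sql
-- Illustrative only: star.rollup_daily and the date condition
-- are stand-ins; split() chunks the delete over a suitable index.
call common_schema.run("
  split (delete from star.rollup_daily where created_at < '2013-01-01')
    pass;
");
```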
That’s the simplest way to do deletes with split(), and the tool will automatically determine which index and what chunk size to use. If I were running this on an active database (or a master) I would probably use throttle to control the speed of the deletes, but in this case it was running on passive replicas, so I just used pass to run the deletes with no sleep time in between them. I sorted the deletes by table size, from smallest to largest, and had a total of 33 tables to process.
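On a busier server, the throttled variant would look roughly like this (again with illustrative names); throttle 2 tells QueryScript to sleep after each chunk for twice as long as the chunk took to execute:

```sql
-- Sleep 2x each chunk's execution time between chunks,
-- trading total runtime for lower load on the server.
call common_schema.run("
  split (delete from star.rollup_daily where created_at < '2013-01-01')
    throttle 2;
");
```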
I started running the SQL on a stage database and it deleted data from the first 32 tables with no problem, but it got stuck on the 33rd table. I checked the process list and found this query running (table and column names have been changed for simplicity):
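The query was one of split()'s chunk-boundary calculations: it walks the chosen unique index, selecting the key values that bound the next chunk. Its general shape was something like the sketch below, over a composite unique key, with table and column names changed as noted above; the SQL common_schema actually generates differs in detail:

```sql
-- Sketch of a chunk-boundary query over a composite unique key
-- (dim_one_id, dim_two_id); names are stand-ins for the real ones.
select dim_one_id, dim_two_id
from (
  select dim_one_id, dim_two_id
  from star.rollup_monthly
  where dim_one_id > @chunk_start_1
     or (dim_one_id = @chunk_start_1 and dim_two_id > @chunk_start_2)
  order by dim_one_id asc, dim_two_id asc
  limit 1000
) as sel
order by dim_one_id desc, dim_two_id desc
limit 1;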
The relevant table has about 100 million rows, but I would expect the above query to be fairly fast since there is a unique index on the columns being selected. I ran an EXPLAIN and found that the query was doing a full table scan. I’m not sure exactly why, but the table does have some quirks:
- Two of the dimension columns use the TEXT data type, and thus only a prefix of each of those columns is indexed
- The default charset for the table is latin1, but for the TEXT columns it is utf8
- The table uses ROW_FORMAT=DYNAMIC
Rather than trying to figure out exactly why that query was doing a full table scan, I checked the common_schema documentation to see if there were any options I could use to avoid running this particular query. The parameters for split() are fairly limited, but I did try using start to see if that would get around it:
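A sketch of that attempt, passing start and an explicit chunk size as split() options (the starting key value, size, and table are illustrative):

```sql
-- start: begin chunking from this key value rather than the table's
-- minimum; size: rows per chunk. Values here are made up.
call common_schema.run("
  split ({start: '1234567', size: 1000}:
         delete from star.rollup_monthly where created_at < '2013-01-01')
    pass;
");
```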
I thought telling split() where to start and what chunk size to use might help, but it still ran the same query with the full table scan.
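QueryScript also offers a plainer chunking idiom that sidesteps boundary queries entirely: a while loop whose condition is a LIMIT-ed delete, which keeps iterating as long as rows are affected. The common_schema documentation shows an example along these lines:

```sql
-- Loop until the delete affects zero rows; throttle between chunks.
call common_schema.run("
  while (delete from world.Country where Continent = 'Asia' limit 10)
  {
    throttle 1;
  }
");
```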
So I decided to try that with my table. Here’s the command I ran:
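Assuming the while-loop idiom from the documentation, the command would have been shaped like this (the table name, condition, and chunk size are illustrative stand-ins for the real ones):

```sql
-- Repeatedly delete up to 1000 matching rows until none remain;
-- pass means no sleep between iterations.
call common_schema.run("
  while (delete from star.rollup_monthly
         where created_at < '2013-01-01' limit 1000)
    pass;
");
```

A LIMIT-ed delete like this re-scans from the start of the index on each pass, so each iteration has to skip over more already-processed territory as the remaining matches sit deeper in the table.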
As I expected, the delete started fast and slowed down over time, but within 3 hours the deletes were done!
At some point I intend to go back and figure out why my first approach was so slow, and I might file a bug report against common_schema. But for now I’m happy that I could get the tool to do what I needed, thanks to its flexibility.