
Alternatives for Chunking Bulk Deletes in common_schema

I’ve blogged about common_schema multiple times in the past, and it’s a tool I use frequently. Last week I had a project to delete millions of rows from multiple rollup tables in a star schema. Since the tables are not partitioned I needed to use DELETE instead of DROP PARTITION, but I didn’t want to delete millions of rows in a single transaction. My first instinct was to use common_schema’s split() function to break the deletes into chunks. So I ran a query on INFORMATION_SCHEMA to generate a bunch of statements like this:

call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table2 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table3 where the_date > '2013-03-30') pass;");
...

That’s the simplest way to do deletes with split(), and the tool will automatically determine which index and what chunk size to use. If I were running this on an active database (or a master) I would probably use throttle to control the speed of the deletes, but in this case it was running on passive replicas so I just used pass to run the deletes with no sleep time in between them. I sorted the deletes by table size, from smallest to largest, and had a total of 33 tables to process.
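
The generator query was roughly along these lines (the schema name, the table-name pattern, and the use of TABLE_ROWS for the size ordering are a reconstruction using the same simplified names as above, not the exact query):

-- Build one CALL statement per rollup table, smallest table first.
-- TABLE_ROWS is only an estimate for InnoDB, but good enough for ordering.
select concat(
  'call common_schema.run("split(delete from ', TABLE_NAME,
  ' where the_date > ''2013-03-30'') pass;");'
) as stmt
from INFORMATION_SCHEMA.TABLES
where TABLE_SCHEMA = 'flite'
  and TABLE_NAME like 'rollup\_table%'
order by TABLE_ROWS asc;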

I started running the SQL on a stage database and it deleted data from the first 32 tables with no problem, but it got stuck on the 33rd table. I checked the process list and found this query running (table and column names have been changed for simplicity):

select REQ_DATE, DIMENSION1, DIMENSION2, DIMENSION3, DIMENSION4, DIMENSION5 
from `flite`.`rollup_table33` 
order by REQ_DATE ASC, DIMENSION1 ASC, DIMENSION2 ASC, DIMENSION3 ASC, 
  DIMENSION4 ASC, DIMENSION5 ASC 
limit 1  
into @_split_column_variable_min_1, @_split_column_variable_min_2, 
  @_split_column_variable_min_3, @_split_column_variable_min_4, 
  @_split_column_variable_min_5, @_split_column_variable_min_6

The relevant table has about 100 million rows, but I would expect the above query to be fairly fast since there is a unique index on the columns being selected. I ran an explain and found that the query was doing a full table scan. I’m not sure exactly why it was doing a full table scan, but the table does have some quirks:

  • Two of the dimension columns use the TEXT data type, and thus only a substring from each of those columns is indexed
  • The default charset for the table is latin1, but for the TEXT columns it is utf8
  • The table uses ROW_FORMAT=DYNAMIC
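
To make those quirks concrete, a table definition along these lines would exhibit all three (the column names, the metric column, and the 100-character prefix length are placeholders rather than the real schema):

create table rollup_table33 (
  req_date date not null,
  dimension1 int not null,
  dimension2 int not null,
  dimension3 int not null,
  dimension4 text character set utf8 not null,
  dimension5 text character set utf8 not null,
  metric1 bigint not null default 0,
  -- TEXT columns can only be indexed by prefix, so only the first 100
  -- characters of dimension4 and dimension5 participate in the unique key
  unique key uk_dims (req_date, dimension1, dimension2, dimension3,
    dimension4(100), dimension5(100))
) engine=InnoDB default charset=latin1 row_format=dynamic;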

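For the record, the plan check was just an EXPLAIN of the stuck SELECT (minus the INTO clause); the full table scan shows up as type: ALL in the output:

explain select REQ_DATE, DIMENSION1, DIMENSION2, DIMENSION3, DIMENSION4, DIMENSION5
from `flite`.`rollup_table33`
order by REQ_DATE ASC, DIMENSION1 ASC, DIMENSION2 ASC, DIMENSION3 ASC,
  DIMENSION4 ASC, DIMENSION5 ASC
limit 1;
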
Rather than trying to figure out exactly why that query was doing a full table scan, I checked the common_schema documentation to see if there were any options I could use to avoid running this particular query. The parameters for split are fairly limited, but I did try using start to see if that would get around it:

call common_schema.run("split({start:'2013-03-30,0,,0,0,', size: 10000} : delete from rollup_table33 where req_date > '2013-03-30') pass;");

I thought telling split() where to start and what chunk size to use might help, but it still ran the same query with the full table scan.

At that point I started looking for alternatives to split(). I remembered seeing an example in the documentation using WHILE to do a delete with a WHERE clause and a LIMIT. Here’s the example:

while (DELETE FROM world.Country WHERE Continent = 'Asia' LIMIT 10)
{
  throttle 2;
}

So I decided to try that with my table. Here’s the command I ran:

call common_schema.run("
while (delete from rollup_table33 where req_date > '2013-06-14' LIMIT 10000) 
{ 
  pass; 
}");

As I expected, the delete started fast and slowed down over time, but within 3 hours the deletes were done!
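
If this had been an active master rather than a passive replica, I could have throttled the same loop, following the documentation example above:

call common_schema.run("
while (delete from rollup_table33 where req_date > '2013-06-14' LIMIT 10000)
{
  throttle 2;
}");

As I understand it, throttle 2 sleeps for roughly twice the time each chunk took before starting the next one, trading total runtime for a lighter load on the server.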

At some point I intend to go back and figure out why my first approach was so slow, and I might file a bug report against common_schema. But for now I’m happy that I could get the tool to do what I needed thanks to its flexibility.

