Flite Culture

3 Methods to Extract a Subset of Your Data Using Mysqldump

A few years ago I wrote a tool to extract a subset of data from a production database for use in QA and development environments. My goal was to have some real data for functional testing, but to make the data set small enough that a developer could easily download it and install it on their laptop in a few minutes.

I implemented the tool using mysqldump. As I have maintained the tool over the years I’ve employed a couple of different approaches, each of which I will describe in more detail below.

The first step was to identify the records I wanted to include in my subset. I created a couple of tables to store the ids of the records I wanted, and then some queries to populate those tables based on various criteria.

For example, say I want to dump the data for the 10 shortest PG rated movies in the sakila database. Here’s an example of populating a single ID table:

use sakila

drop table if exists temp_film;

create table if not exists temp_film (
  film_id smallint unsigned not null primary key
);

insert into temp_film 
select film_id 
from film 
where rating = 'PG' 
order by length asc 
limit 10;

It gets more complicated when we want to dump the data. The biggest question is: how should I form the WHERE clause in the mysqldump command? Running mysqldump with --where is useful, but we’re limited in what types of WHERE clauses we can use with it.

Here’s one way to do it:

1) Use EXISTS

My earliest approach was to use an EXISTS query.

mysqldump \
  --single-transaction \
  sakila \
  film \
  --where="exists (select NULL from temp_film where temp_film.film_id = film.film_id)" 

That works, but it’s not necessarily going to be fast. It also gets more complicated for child tables. For example, how do I get the data for only the actors who appeared in those films?

mysqldump \
  --single-transaction \
  sakila \
  actor \
  --where="exists (select NULL from temp_film \
  inner join film on film.film_id = temp_film.film_id \
  inner join film_actor on film_actor.film_id = film.film_id \
  where film_actor.actor_id = actor.actor_id)" 

Again, that works, but it’s adding complexity and has the potential to be slow for large tables.

Those commands are similar to what I ended up using in my tool, and that approach worked well for a few years. Eventually some of the EXISTS queries made the tool slower than I wanted, so I revisited the implementation. What I found was that in many cases it was faster to use an IN clause with the actual IDs than the EXISTS queries I had been using.

2) Use IN

Here are my previous examples rewritten to use an IN clause. I dynamically generate the mysqldump command from my id table:

-- raise the group_concat limit so the full id list fits in one string
set group_concat_max_len = 4294967295;

-- selecting from information_schema.tables for the film table returns exactly one row,
-- so concat() emits a single dump command
select concat('mysqldump --single-transaction sakila film --where="film_id in (',
  (select group_concat(film_id) from sakila.temp_film),
  ')"') as dump_command
from information_schema.tables
where table_schema = 'sakila'
  and table_name = 'film';

Here’s the output:

mysqldump \
  --single-transaction \
  sakila \
  film \
  --where="film_id in (2,83,237,247,292,402,430,542,575,794)"

Okay, that’s pretty straightforward. How about the actors? Why not just use a separate id table for them?

use sakila

drop table if exists temp_actor;

create table if not exists temp_actor (
  actor_id smallint unsigned not null primary key
);

insert into temp_actor 
select distinct film_actor.actor_id 
from temp_film
  inner join film_actor on film_actor.film_id = temp_film.film_id;

Then I can generate the mysqldump command using the same method:

set group_concat_max_len = 4294967295;

select concat('mysqldump --single-transaction sakila actor --where="actor_id in (',
  (select group_concat(actor_id) from sakila.temp_actor),
  ')"') as dump_command
from information_schema.tables
where table_schema = 'sakila'
  and table_name = 'actor';

That gives us the following command:

mysqldump \
  --single-transaction \
  sakila actor \
  --where="actor_id in (7,14,17,19,22,24,30,34,49,54,61,69,72,79,82,83,85,89,90,94,96, \
  106,107,110,113,114,122,129,136,142,145,146,147,150,160,161,167,168,177,181,186,195,196)"

In general I try to minimize the number of id tables. For one-to-many relationships I dump the child records using the parent’s id table. But for many-to-many relationships or multi-level relationships I consider adding a separate id table.
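
For example, sakila’s inventory table hangs off of film one-to-many, so a dump command for it can be generated straight from temp_film rather than from a new id table. Here’s a sketch following the same pattern as above:

set group_concat_max_len = 4294967295;

select concat('mysqldump --single-transaction sakila inventory --where="film_id in (',
  (select group_concat(film_id) from sakila.temp_film),
  ')"') as dump_command
from information_schema.tables
where table_schema = 'sakila'
  and table_name = 'inventory';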

I did some profiling to determine which tables benefitted from using IN instead of EXISTS, but in some cases I had to use EXISTS even though IN would have been faster, because mysqldump only supports IN clauses up to a certain size. I forget what the exact limit was, maybe 10,000 values?
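
One rough way to compare the two forms before committing to either is to run the same WHERE clauses as ordinary SELECTs (or through EXPLAIN) on the server; here’s a minimal sketch using the film predicates from above:

use sakila

-- predicate used by the EXISTS approach
explain select * from film
where exists (select NULL from temp_film where temp_film.film_id = film.film_id);

-- predicate used by the IN approach, with the literal id list generated earlier
explain select * from film
where film_id in (2,83,237,247,292,402,430,542,575,794);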

3) Use a Separate Schema

This method frees you from one of the biggest constraints of the other methods. It allows you to efficiently use joins to select the related rows you want, and it makes the mysqldump commands trivial.

The approach is to create a copy of every table in your schema, copy over the rows you want, and then dump the entire copy schema.

Following our film and actor examples from above:

create database sakila_subset;

use sakila_subset;

create table sakila_subset.film 
  like sakila.film;

create table sakila_subset.actor 
  like sakila.actor;

insert into sakila_subset.film
select film.*
from sakila.temp_film
  inner join sakila.film on film.film_id = temp_film.film_id;

insert into sakila_subset.actor
select actor.*
from sakila.temp_actor
  inner join sakila.actor on actor.actor_id = temp_actor.actor_id;
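
The copy only keeps the film/actor relationship intact if the film_actor join table comes along as well. For instance, it could be copied with the same pattern:

create table sakila_subset.film_actor 
  like sakila.film_actor;

insert into sakila_subset.film_actor
select film_actor.*
from sakila.temp_film
  inner join sakila.film_actor on film_actor.film_id = temp_film.film_id;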

Now I can dump the entire sakila_subset schema:

mysqldump \
  --single-transaction \
  sakila_subset 

If you want your copy to maintain the original schema name you can edit the output file and replace references to sakila_subset with sakila.
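
For example, assuming the dump was written to a file named sakila_subset.sql (a hypothetical file name), a simple text substitution along these lines would do it:

sed 's/sakila_subset/sakila/g' sakila_subset.sql > sakila.sql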
