Flite Culture

Loading Large Flat Files Into MySQL With LOAD DATA INFILE and Pt-fifo-split

It’s easy to load multiple rows of data from a file into a MySQL table with a single LOAD DATA INFILE command, but if the file is over a certain size you may want to load it in multiple chunks. When I have a file with millions of lines I typically use pt-fifo-split to separate the file into multiple chunks in order to limit the size of each transaction. Unlike most tools in Percona Toolkit, pt-fifo-split doesn’t actually interact with your database, but this is one MySQL use case where it’s very helpful.

Here’s a quick example in bash that illustrates how I can take a large tab-delimited file, break it into multiple 100,000-line chunks, and load the chunks into a table:

FLAT_FILE="/tmp/big_file.txt"
FIFO_PATH="${FLAT_FILE}.fifo"
LOAD_FILE="${FLAT_FILE}.load"
CHUNK_SIZE=100000

# Split the file
pt-fifo-split --force --lines ${CHUNK_SIZE} ${FLAT_FILE} --fifo ${FIFO_PATH} &
# Sleep 10 seconds to ensure ${FIFO_PATH} exists before entering the loop
sleep 10
while [ -e ${FIFO_PATH} ]
do
  # Write chunk to disk
  cat ${FIFO_PATH} > ${LOAD_FILE}
  # Load chunk into table
  mysql --database=test \
   --show-warnings \
   -vve "load data infile '${LOAD_FILE}' into table my_table;"
done

YMMV, so you should do some testing to determine the optimal chunk size based on your schema, storage engine, replication setup, the complexity of your LOAD DATA INFILE statement, etc.


Using MySQL Sandbox to Test Replication Behavior

It is easy to set up a replication topology with MySQL Sandbox, and I use it for lots of testing scenarios. For example, I used MySQL Sandbox to run the tests for yesterday’s post about the way UUIDs are replicated in row-based replication (RBR) versus statement-based replication (SBR).

Here’s how easy it is to set up a replicated sandbox for testing:

Installing MySQL Sandbox

If you’ve never installed MySQL Sandbox before, you can install it with a single cpan command run as root:

sudo su -
cpan MySQL::Sandbox

Download the MySQL binary

Pick the appropriate binary based on your OS and the version of MySQL you want to use. In my case I’m on Mac OS X and want to test MySQL 5.6, so I went to dev.mysql.com and downloaded the following compressed TAR archive:

mysql-5.6.22-osx10.8-x86_64.tar.gz

Create the sandbox

This is another one-liner:

make_replication_sandbox ~/Downloads/mysql-5.6.22-osx10.8-x86_64.tar.gz

Use the sandbox

MySQL Sandbox created 3 servers for me: a master and two replicas. Everything is installed under this directory:

$HOME/sandboxes/rsandbox_mysql-5_6_22

Here are the relevant my.cnf files:

master/my.sandbox.cnf
node1/my.sandbox.cnf
node2/my.sandbox.cnf

Here are some handy scripts to start and stop all 3 servers:

./start_all
./stop_all
./restart_all

And here are the handy scripts to connect to each server with the MySQL CLI:

./m
./s1
./s2

It’s easy to get confused about which server you are connected to when you are running multiple servers, so MySQL Sandbox overrides the default mysql prompt with a more informative one. For example, here’s the prompt for my master server:

master [localhost] {msandbox} ((none)) > 
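To confirm that replication is actually flowing before starting real tests, a quick sanity check like this sketch works (the repl_check table is hypothetical, and I’m assuming the default test schema exists):

-- run on the master (./m)
create table test.repl_check (id int primary key, note varchar(50));
insert into test.repl_check values (1, 'hello from the master');

-- a moment later, run on a replica (./s1 or ./s2); the same row should be there
select * from test.repl_check;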

Happy testing!


How to Safely Replicate UUID Values in MySQL

With MySQL replication, when you use the UUID() function to insert or update values in your database you need to be careful to ensure that the UUIDs are replicated properly. This is primarily a problem with statement-based replication.
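The examples below use a table called uuid_test. Its definition isn’t shown here, but something like this sketch is all that’s needed (the exact column types are my assumptions):

create table uuid_test (
  id int unsigned not null auto_increment primary key,
  binlog_format varchar(20) not null,
  uuid_string char(36) not null
) engine=InnoDB;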

Here’s a simple example with row-based replication to illustrate UUIDs replicating successfully:

Master

master > set binlog_format = 'ROW';
Query OK, 0 rows affected (0.00 sec)

master > insert into uuid_test (binlog_format,uuid_string) values(@@binlog_format,uuid());
Query OK, 1 row affected (0.00 sec)

master > select id,binlog_format,uuid_string,md5(uuid_string) from uuid_test;
+----+---------------+--------------------------------------+----------------------------------+
| id | binlog_format | uuid_string                          | md5(uuid_string)                 |
+----+---------------+--------------------------------------+----------------------------------+
|  1 | ROW           | f54448fa-7c20-11e4-8f6a-03e05a24a3c7 | a9f18c42b9d561f126ca007b1d273473 |
+----+---------------+--------------------------------------+----------------------------------+
1 row in set (0.00 sec)

Replica

replica1 > select id,binlog_format,uuid_string,md5(uuid_string) from uuid_test;
+----+---------------+--------------------------------------+----------------------------------+
| id | binlog_format | uuid_string                          | md5(uuid_string)                 |
+----+---------------+--------------------------------------+----------------------------------+
|  1 | ROW           | f54448fa-7c20-11e4-8f6a-03e05a24a3c7 | a9f18c42b9d561f126ca007b1d273473 |
+----+---------------+--------------------------------------+----------------------------------+
1 row in set (0.00 sec)

(Note that in addition to the UUID itself I also check the md5 hash of the UUID to make it easier to visually spot differences. This is relevant for UUIDs in particular since two non-matching UUIDs from the same machine will still have some matching substrings and thus may look like they match at first glance.)

In this case I can see the UUID value is the same.


Handling Invalid Numeric Values in Hive

Recently I found some NULL values in a numeric column of a Hive table that I expected to be NOT NULL. The query that populates the table extracts a numeric value from a JSON-formatted string using the get_json_object() UDF, like this:

coalesce(get_json_object(sub_object, '$.key'),0)

The intent of the COALESCE() call is to replace all NULL values of sub_object.key with 0, but when I looked at the actual values I saw both 0 and NULL present in the table. Why were some of the NULL values not replaced with 0? Because there are some non-numeric values in sub_object.key, and those ended up being inserted as NULL due to implicit type conversion. get_json_object() returns a string, which is then implicitly converted to an int, so a value like 123abc is not overridden by COALESCE(). Instead the implicit type conversion tries to cast 123abc as an int, and it ends up as NULL.
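Here is a minimal illustration of the difference, using a literal string in place of the get_json_object() output (this assumes a Hive version that allows SELECT without a FROM clause):

-- an invalid numeric string casts to NULL
select cast('123abc' as int);
-- '123abc' is not NULL, so COALESCE passes it through; the NULL only appears later, during the implicit cast to int
select coalesce('123abc', 0);
-- the explicit cast fails up front, so COALESCE can substitute the 0
select coalesce(cast('123abc' as int), 0);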

The solution? Use explicit type conversion instead of relying on implicit type conversion.

Specifically I updated my query to cast the string as an int, like this:

coalesce(cast(get_json_object(sub_object, '$.key') as int),0)

With that change in place my table no longer has any NULL values in that column, and all missing values are being treated equally and stored as 0.


Alternatives for Chunking Bulk Deletes in Common_schema

I’ve blogged about common_schema multiple times in the past, and it’s a tool I use frequently. Last week I had a project to delete millions of rows from multiple rollup tables in a star schema. Since the tables are not partitioned I needed to use DELETE instead of DROP PARTITION, but I didn’t want to delete millions of rows in a single transaction. My first instinct was to use common_schema’s split() function to break the deletes into chunks. So I ran a query on INFORMATION_SCHEMA to generate a bunch of statements like this:

call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table2 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table3 where the_date > '2013-03-30') pass;");
...

That’s the simplest way to do deletes with split(), and the tool will automatically determine which index and what chunk size to use. If I were running this on an active database (or a master) I would probably use throttle to control the speed of the deletes, but in this case it was running on passive replicas so I just used pass to run the deletes with no sleep time in between them. I sorted the deletes by table size, from smallest to largest, and had a total of 33 tables to process.
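For reference, the throttled variant would look something like this (syntax per my reading of the common_schema QueryScript docs, so double-check it; throttle 1 sleeps roughly as long as each chunk took to execute):

call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') {throttle 1;}");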


Feature Flags at Flite

Feature flags (or flippers or toggles or whatever you like to call them) are a really important piece of any development process, especially if you are employing Continuous Integration (CI). Even if you’re not quite there with CI, feature flags can still be a very useful tool that helps agile development teams be more productive, produce higher quality products, and work at the speed of business.


Avoiding MySQL ERROR 1069 by Explicitly Naming Indexes

Since I recently wrote about both MySQL error 1071 and error 1070 I decided to continue the pattern with a quick note on MySQL error 1069. In case you’ve never seen it before, this is MySQL error 1069:

ERROR 1069 (42000): Too many keys specified; max 64 keys allowed

I can’t think of a valid use case that requires more than 64 indexes on a MySQL table, but it’s possible to hit this error by inadvertently adding lots of duplicate indexes to a table. This can happen if you don’t explicitly name your indexes.
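As a quick preview (my_table and my_column are placeholders), MySQL auto-names an unnamed index after its first column and silently appends _2, _3, and so on for repeats, whereas a named index fails fast on the second attempt:

alter table my_table add index (my_column);                -- creates index `my_column`
alter table my_table add index (my_column);                -- silently creates duplicate index `my_column_2`
alter table my_table add index idx_my_column (my_column);  -- explicitly named
alter table my_table add index idx_my_column (my_column);  -- ERROR 1061 (42000): Duplicate key name 'idx_my_column'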

Read on for examples…


A Workaround for MySQL ERROR 1070

As documented in the Reference Manual, MySQL supports a maximum of 16 columns per index. That’s more than sufficient for most index use cases, but what about unique constraints? If I create a fact table with more than 16 dimension columns in my star schema and then try to add an index to enforce a unique constraint across all of the dimension columns, I’ll get this error:

ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed

To work around this, I can concatenate all of the column values together into a single hyphen-delimited string for comparison: by adding an extra column that stores the concatenated values and putting a unique index on that column, I can simulate a multi-column unique index.
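Here’s a rough sketch of the idea with a hypothetical fact table (only three dimension columns shown for brevity; the real case has more than 16):

create table fact_table (
  dim1 int not null,
  dim2 int not null,
  dim3 int not null,
  metric bigint not null,
  dim_concat varchar(255) not null,
  unique key uidx_dim_concat (dim_concat)
) engine=InnoDB;

-- the application (or a trigger) maintains the concatenated key on every insert
insert into fact_table (dim1, dim2, dim3, metric, dim_concat)
values (1, 2, 3, 100, concat_ws('-', 1, 2, 3));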

Read on for details…


Simulating Add Column if Not Exists in MySQL With Common_schema

Some MySQL DDL commands such as CREATE TABLE and DROP TABLE support an IF [NOT] EXISTS option which allows you to downgrade the error to a warning if you try to create something that already exists or drop something that doesn’t exist.

For example this gives an error:

mysql> drop table sakila.fake_table;
ERROR 1051 (42S02): Unknown table 'sakila.fake_table'

And this gives a warning:

mysql> drop table if exists sakila.fake_table;
Query OK, 0 rows affected, 1 warning (0.00 sec)

Note (Code 1051): Unknown table 'sakila.fake_table'

You may also want to use IF [NOT] EXISTS for column-level changes such as ADD COLUMN and DROP COLUMN, but MySQL does not support that.

Read on for some examples of how to simulate IF [NOT] EXISTS using the QueryScript language from common_schema.
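As a preview, the general shape looks something like this sketch (sakila.actor and new_column are placeholders, and you should check the common_schema documentation for the exact QueryScript syntax):

call common_schema.run("
  if (not exists (select * from information_schema.columns
                  where table_schema = 'sakila'
                    and table_name = 'actor'
                    and column_name = 'new_column'))
    alter table sakila.actor add column new_column int;
");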


Using Innodb_large_prefix to Avoid ERROR 1071

If you’ve ever tried to add an index that includes a long varchar column to an InnoDB table in MySQL, you may have seen this error:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

The character limit depends on the character set you use. For example, if you use latin1 then the largest column you can index is varchar(767), but if you use utf8 then the limit is varchar(255). There is also a separate 3072-byte limit per index. The 767-byte limit is per column, so you can include multiple columns (each 767 bytes or smaller) up to 3072 total bytes per index, but no column longer than 767 bytes. (MyISAM is a little different: it has a 1000-byte index length limit, but no separate column length limit within that.)

One workaround for these limits is to only index a prefix of the longer columns, but what if you want to index more than 767 bytes of a column in InnoDB?

In that case you should consider using innodb_large_prefix, which was introduced in MySQL 5.5.14 and allows you to include columns up to 3072 bytes long in InnoDB indexes. It does not affect the index limit, which is still 3072 bytes as quoted in the manual:

The InnoDB internal maximum key length is 3500 bytes, but MySQL itself restricts this to 3072 bytes. This limit applies to the length of the combined index key in a multi-column index.
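As a preview, enabling it on MySQL 5.5/5.6 looks roughly like this (large_prefix_test is a throwaway table name). Large prefixes require the Barracuda file format, which in turn requires innodb_file_per_table, plus a DYNAMIC or COMPRESSED row format:

set global innodb_file_per_table = ON;
set global innodb_file_format = 'Barracuda';
set global innodb_large_prefix = ON;

create table large_prefix_test (
  long_string varchar(1000),
  key idx_long_string (long_string)
) engine=InnoDB row_format=DYNAMIC default charset=latin1;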

Read on for details and examples about innodb_large_prefix.
