Flite Culture

Using MySQL 5.7 Generated Columns to Avoid ERROR 1070

When I first blogged about a potential workaround for MySQL error 1070, I received a suggestion in the comments that my workaround might be a useful trick with MariaDB virtual columns or MySQL 5.7 generated columns. Although I’ve already blogged about a simplified version of my original workaround, the arrival of the MySQL 5.7.7 release candidate inspired me to try an even simpler implementation using generated columns.
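
The core of the idea is to let a stored generated column hold the concatenation of the columns that don’t fit into the key, then include that column as the 16th key part. Here is a minimal sketch, with hypothetical table and column names rather than the actual DDL from my schema:

-- MySQL 5.7 sketch: a unique constraint spanning 20 logical columns.
-- The server maintains c16_to_c20 itself, so it can't drift out of sync.
-- Note that concat_ws() skips NULLs, so choose a separator and columns
-- that can't produce false duplicates.
alter table my_table
  add column c16_to_c20 varchar(255)
    generated always as (concat_ws('|', c16, c17, c18, c19, c20)) stored,
  add unique key uniq_c1_to_c20 (c1, c2, c3, c4, c5, c6, c7, c8,
    c9, c10, c11, c12, c13, c14, c15, c16_to_c20);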

Read on for my findings.


Speaking at Percona Live 2015

A week from today I am speaking at Percona Live: MySQL Conference and Expo 2015.

Here’s a quick summary of my talk:

My talk is for anyone who wants to make their life easier using MySQL tools. Rather than doing a deep dive on one or two tools, I will describe the methodology we’ve used at Flite to successfully integrate many different MySQL tools into our day-to-day workflow. The talk will take place Wednesday, April 15 at 3:30 PM in Ballroom E.

I most recently gave this talk at the San Francisco MySQL Meetup in February. If you are interested in the talk but won’t be at Percona Live, a video of that Meetup presentation is available online.

See you in Santa Clara next week!


Introducing Pt-reanimate

Most of the time I like pt-kill, but sometimes it gets a little overzealous and kills a query that I need. That’s why I’m glad Percona created pt-reanimate so I can bring those important queries back to life. Of course the queries are not exactly the same when they come back, but at least they come back.

Here’s an example of pt-reanimate in action. First I add a column to a large table:

alter table big_table add column new_col int;

Here is my DDL in the processlist table:

+---------+------+----------------+----------------------------------------------+
| command | time | state          | info                                         |
+---------+------+----------------+----------------------------------------------+
| Query   |  146 | altering table | alter table big_table add column new_col int |
+---------+------+----------------+----------------------------------------------+

Since I’m using MySQL 5.6 this is an online operation, so it does not block reads or writes on the table. Unfortunately pt-kill doesn’t know that, and because of the DDL’s long runtime it killed my query, which then disappeared from the processlist.

Luckily I am also running pt-reanimate, and it brought my query back to life! After pt-reanimate recovered my query, this is what it looked like in the processlist table:

+---------+------+--------+----------------------------------------------+
| command | time | state  | info                                         |
+---------+------+--------+----------------------------------------------+
| Zombie  |  187 | brains | alter table big_table add column new_col int |
+---------+------+--------+----------------------------------------------+

The good news is the query eventually completed successfully. The bad news is it tried to eat some of my other queries, but they were all fast-running queries so they escaped unharmed.


Avoiding MySQL ERROR 1052 by Prefixing Column Names in Multi Table Queries

MySQL is fairly relaxed about prefixing column names with a table name or alias in a query involving multiple tables. As long as the column name you use only exists in one of the tables in your query, MySQL does not require you to prefix it, regardless of SQL mode.

My policy has always been to prefix all column names with the table name or alias in a multiple-table query. This is not simply a style choice. If I reference a column name without prefixing it, the query may work today, but it is not future-proof. If the schema changes in the future and I add a column with the same name to one of the other tables involved in that query, then the query will fail with ERROR 1052:

ERROR 1052 (23000): Column 'used_to_be_unique' in field list is ambiguous

Therefore I recommend always prefixing all column names in a multi table query to protect your queries against future additive schema changes.

Here’s some code to illustrate the problem:

mysql> create table t1 (a int);
Query OK, 0 rows affected (0.03 sec)

mysql> create table t2 (a int, b int);
Query OK, 0 rows affected (0.02 sec)

mysql> select t1.a, b 
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
Empty set (0.00 sec)

mysql> alter table t1 add column b int;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select t1.a, b 
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
ERROR 1052 (23000): Column 'b' in field list is ambiguous
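
For completeness, here is the future-proof version of the same query: once the column is prefixed, adding t1.b no longer introduces any ambiguity.

mysql> select t1.a, t2.b
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
Empty set (0.00 sec)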

Updating Millions of Rows in MySQL With Common_schema

Last week I had to update several million rows in a single MySQL table. Rather than executing the update directly in the database, I decided to use common_schema’s split() function.

There are two reasons I chose to use common_schema for this task:

Chunking

Chunking breaks a single update statement into multiple statements, each updating at most a specified number of rows. The default chunk size is 1000; I changed it to 2000 by setting the size parameter.

Throttling

Throttling has two benefits: it reduces the load on the master host, and it limits the replication lag introduced when the update statements are executed on the replicas. It does this by sleeping after each chunk is updated, for a duration proportional to the runtime of that chunk’s update statement. After experimenting with several throttle values in a test environment I settled on 4, which means it sleeps for 4 times as long as each update takes: if an update takes 100 milliseconds, it sleeps for 400 milliseconds before executing the next update. Using throttle does not guarantee that you will completely avoid replication lag, but with some testing you should be able to find a value that strikes a balance between the total run time of your updates and the replication lag they introduce.
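
Putting the two together, the QueryScript I passed to common_schema looked roughly like this. The table name, column, and WHERE clause below are placeholders rather than my actual statement, and the exact option syntax may vary slightly between common_schema versions:

call common_schema.run("
  split ({size: 2000}: update my_big_table set archived = 1 where archived = 0)
    throttle 4;
");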


Adding a Unique Constraint With More Than 16 Columns in MySQL

A while ago I blogged about a potential workaround for MySQL error 1070 that could be used to add a unique constraint with more than 16 columns.

As a reminder here’s the error you get when you try to create a unique constraint with more than 16 columns in MySQL:

ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed

The solution I proposed should work, but when I started to implement it I made a couple of changes that merit this follow-up post. One change is fairly generic, and the other is specific to my use case.

The generic change was to only concatenate some of the columns, rather than all of them. Specifically I only need to concatenate N-15 columns. So if I want to have a unique constraint on 20 columns, I can include 15 of those actual columns and then concatenate the remaining 5 columns into a new column that is included as the 16th column in the unique constraint. When choosing which columns to concatenate, it makes sense to choose columns with short values, since that limits the amount of redundant data you are storing and helps prevent the table from taking up too much disk space.
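
As a sketch of that arithmetic, here is what a 20-column constraint could look like. The table and column names are hypothetical, and keeping the concatenated column in sync (via triggers or the application) is up to you:

-- 15 real columns plus 1 concatenated column = 16 key parts, the maximum allowed.
alter table my_table
  add column c16_to_c20 varchar(255) not null;

-- Backfill the concatenated column for existing rows; new and updated rows
-- need to be populated by whatever writes to the table.
update my_table
  set c16_to_c20 = concat_ws('|', c16, c17, c18, c19, c20);

alter table my_table
  add unique key uniq_c1_to_c20 (c1, c2, c3, c4, c5, c6, c7, c8,
    c9, c10, c11, c12, c13, c14, c15, c16_to_c20);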


Loading Large Flat Files Into MySQL With LOAD DATA INFILE and Pt-fifo-split

It’s easy to load multiple rows of data from a file into a MySQL table with a single LOAD DATA INFILE command, but if the file is over a certain size you may want to load it in multiple chunks. When I have a file with millions of lines I typically use pt-fifo-split to separate the file into multiple chunks in order to limit the size of each transaction. Unlike most tools in Percona Toolkit, pt-fifo-split doesn’t actually interact with your database, but this is one MySQL use case where it’s very helpful.

Here’s a quick example in bash that illustrates how I can take a large tab-delimited file, break it into multiple 100,000 line chunks, and load the chunks into a table:

FLAT_FILE="/tmp/big_file.txt"
FIFO_PATH="${FLAT_FILE}.fifo"
LOAD_FILE="${FLAT_FILE}.load"
CHUNK_SIZE=100000

# Split the file
pt-fifo-split --force --lines ${CHUNK_SIZE} ${FLAT_FILE} --fifo ${FIFO_PATH} &
# Sleep 10 seconds to ensure ${FIFO_PATH} exists before entering the loop
sleep 10
while [ -e ${FIFO_PATH} ]
do
  # Write chunk to disk
  cat ${FIFO_PATH} > ${LOAD_FILE}
  # Load chunk into table
  mysql --database=test \
   --show-warnings \
   -vve "load data infile '${LOAD_FILE}' into table my_table;"
done

YMMV, so you should do some testing to determine the optimal chunk size based on your schema, storage engine, replication setup, complexity of LOAD DATA INFILE statement, etc.


Using MySQL Sandbox to Test Replication Behavior

It is easy to set up a replication topology with MySQL Sandbox, and I use it for lots of testing scenarios. For example, I used MySQL Sandbox to run the tests for yesterday’s post about the way UUIDs are replicated in row-based replication (RBR) versus statement-based replication (SBR).

Here’s how easy it is to set up a replicated sandbox for testing:

Installing MySQL Sandbox

If you’ve never installed MySQL Sandbox before, you can do so by running a single command as root:

sudo su -
cpan MySQL::Sandbox

Download the MySQL binary

Pick the appropriate binary based on your OS and the version of MySQL you want to use. In my case I’m on Mac OS X and I want to test MySQL 5.6 so I went to dev.mysql.com and downloaded the following compressed TAR archive:

mysql-5.6.22-osx10.8-x86_64.tar.gz

Create the sandbox

This is another one-liner:

make_replication_sandbox ~/Downloads/mysql-5.6.22-osx10.8-x86_64.tar.gz

Use the sandbox

MySQL Sandbox created 3 servers for me: a master and two replicas. Everything is installed under this directory:

$HOME/sandboxes/rsandbox_mysql-5_6_22

Here are the relevant my.cnf files:

master/my.sandbox.cnf
node1/my.sandbox.cnf
node2/my.sandbox.cnf

Here are some handy scripts to start and stop all 3 servers:

./start_all
./stop_all
./restart_all

And here are the handy scripts to connect to each server with the MySQL CLI:

./m
./s1
./s2

It’s easy to get confused about which server you are connected to when you are running multiple servers, so MySQL Sandbox overrides the default mysql prompt with a more informative one. For example, here’s the prompt for my master server:

master [localhost] {msandbox} ((none)) > 
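
As a quick sanity check that replication is flowing, you can create something on the master and look for it on a replica. This is a hypothetical session (the database name is made up), using ./m and ./s1 as described above:

-- On the master (./m):
create database sandbox_smoke_test;

-- On a replica (./s1), a moment later:
show databases like 'sandbox_smoke_test';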

Happy testing!


How to Safely Replicate UUID Values in MySQL

With MySQL replication, when you use the UUID() function to insert or update values in your database you need to be careful to ensure the UUIDs are replicated properly. This is primarily a problem with statement-based replication.

Here’s a simple example with row-based replication to illustrate UUIDs replicating successfully:

Master

master > set binlog_format = 'ROW';
Query OK, 0 rows affected (0.00 sec)

master > insert into uuid_test (binlog_format,uuid_string) values(@@binlog_format,uuid());
Query OK, 1 row affected (0.00 sec)

master > select id,binlog_format,uuid_string,md5(uuid_string) from uuid_test;
+----+---------------+--------------------------------------+----------------------------------+
| id | binlog_format | uuid_string                          | md5(uuid_string)                 |
+----+---------------+--------------------------------------+----------------------------------+
|  1 | ROW           | f54448fa-7c20-11e4-8f6a-03e05a24a3c7 | a9f18c42b9d561f126ca007b1d273473 |
+----+---------------+--------------------------------------+----------------------------------+
1 row in set (0.00 sec)

Replica

replica1 > select id,binlog_format,uuid_string,md5(uuid_string) from uuid_test;
+----+---------------+--------------------------------------+----------------------------------+
| id | binlog_format | uuid_string                          | md5(uuid_string)                 |
+----+---------------+--------------------------------------+----------------------------------+
|  1 | ROW           | f54448fa-7c20-11e4-8f6a-03e05a24a3c7 | a9f18c42b9d561f126ca007b1d273473 |
+----+---------------+--------------------------------------+----------------------------------+
1 row in set (0.00 sec)

(Note that in addition to the UUID itself I also check the md5 hash of the UUID to make it easier to visually spot differences. This is relevant for UUIDs in particular since two non-matching UUIDs from the same machine will still have some matching substrings and thus may look like they match at first glance.)

In this case I can see the UUID value is the same.
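
Here is a sketch of the statement-based side of the same test; the interesting part is running the final SELECT on both the master and the replica and comparing the md5 values:

-- On the master, repeat the test with statement-based logging:
set binlog_format = 'STATEMENT';
insert into uuid_test (binlog_format,uuid_string) values(@@binlog_format,uuid());

-- Then run this on both the master and the replica and compare the output.
-- With SBR the UUID() call is re-executed on the replica, so the values
-- (and their md5 hashes) can differ between the two servers.
select id,binlog_format,uuid_string,md5(uuid_string) from uuid_test;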


Handling Invalid Numeric Values in Hive

Recently I found some NULL values in a numeric column of a Hive table that I expected to be NOT NULL. The query that populates the table extracts a numeric value from a JSON-formatted string using the get_json_object() UDF, like this:

coalesce(get_json_object(sub_object, '$.key'),0)

The intent of the COALESCE() call is to replace all NULL values of sub_object.key with 0, but when I looked at the actual values I saw both 0 and NULL present in the table. Why were some of the NULL values not replaced with 0? Because some values of sub_object.key are non-numeric, and those ended up being inserted as NULL due to implicit type conversion. get_json_object() returns a string, so a value like 123abc is not NULL and is not overridden by COALESCE(); only afterwards, when the string is implicitly converted to an int, does 123abc end up as NULL.

The solution? Use explicit type conversion instead of relying on implicit type conversion.

Specifically I updated my query to cast the string as an int, like this:

coalesce(cast(get_json_object(sub_object, '$.key') as int),0)

With that change in place my table no longer has any NULL values in that column, and all missing values are being treated equally and stored as 0.
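
The behavior is easy to reproduce with a literal value (the literal here is hypothetical, but it has the same shape as the real data):

-- A non-numeric string cast to int becomes NULL in Hive...
select cast('123abc' as int);

-- ...so casting before COALESCE() lets the fallback value kick in, and the
-- bad value is stored as 0 instead of NULL.
select coalesce(cast('123abc' as int), 0);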
