Flite Culture

Avoiding MySQL ERROR 1052 by Prefixing Column Names in Multi-Table Queries

MySQL is fairly relaxed about prefixing column names with a table name or alias in a query involving multiple tables. As long as the column name you use only exists in one of the tables in your query, MySQL does not require you to prefix it, regardless of SQL mode.

My policy has always been to prefix all column names with the table name/alias in a multi-table query. This is not simply a style choice. If I reference a column name without prefixing it, the query may work today, but it is not future-proof. If the schema changes in the future and I add a column with the same name to one of the other tables involved in that query, then the query will fail with ERROR 1052:

ERROR 1052 (23000): Column 'used_to_be_unique' in field list is ambiguous

Therefore I recommend always prefixing all column names in a multi-table query to protect your queries against future additive schema changes.

Here’s some code to illustrate the problem:

mysql> create table t1 (a int);
Query OK, 0 rows affected (0.03 sec)

mysql> create table t2 (a int, b int);
Query OK, 0 rows affected (0.02 sec)

mysql> select t1.a, b 
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
Empty set (0.00 sec)

mysql> alter table t1 add column b int;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select t1.a, b 
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
ERROR 1052 (23000): Column 'b' in field list is ambiguous
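
Prefixing the new column with its table alias resolves the ambiguity, so the same join keeps working after the schema change (continuing the session above; both tables are still empty):

mysql> select t1.a, t2.b
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
Empty set (0.00 sec)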

Updating Millions of Rows in MySQL With Common_schema

Last week I had to update several million rows in a single MySQL table. Rather than executing the update directly in the database, I decided to use common_schema’s split() function.

There are two reasons I chose to use common_schema for this task:

Chunking

Chunking breaks a single update statement into multiple statements, each updating at most a specified number of rows. The default chunk size is 1000, and I changed it to 2000 by setting the size parameter.

Throttling

Throttling has two benefits: it minimizes the amount of load on the master host, and it minimizes the replication lag introduced when the update statements are executed on the replicas. It achieves this by introducing sleep statements after each chunk is updated. The duration of the sleep statement is proportional to the runtime of the update statement. After experimenting with several different throttle values in a test environment, I decided on a value of 4, which means it will sleep for 4 times as long as each update takes. Thus if an update takes 100 milliseconds, it will sleep for 400 milliseconds before executing the next update. Using throttle does not guarantee that you will completely avoid replication lag, but if you do some testing you should be able to find an appropriate value that strikes a balance between the total run time of your updates and the level of replication lag they introduce.
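
As a rough sketch of what that looks like (the table, column, and WHERE clause here are hypothetical, and you should double-check the option syntax against the common_schema/QueryScript documentation for your installed version):

call common_schema.run("
  split({size: 2000}: update my_big_table set my_flag = 0 where my_flag = 1)
    throttle 4;
");

The size option caps each chunk at 2000 rows, and throttle 4 sleeps four times as long as each chunk took before starting the next one.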

Read on ✈

Adding a Unique Constraint With More Than 16 Columns in MySQL

A while ago I blogged about a potential workaround for MySQL error 1070 that could be used to add a unique constraint with more than 16 columns.

As a reminder here’s the error you get when you try to create a unique constraint with more than 16 columns in MySQL:

ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed

The solution I proposed should work, but when I started to implement it I made a couple of changes that merit this follow-up post. One change is fairly generic, and the other is specific to my use case.

The generic change was to only concatenate some of the columns, rather than all of them. Specifically I only need to concatenate N-15 columns. So if I want to have a unique constraint on 20 columns, I can include 15 of those actual columns and then concatenate the remaining 5 columns into a new column that is included as the 16th column in the unique constraint. When choosing which columns to concatenate, it makes sense to choose columns with short values, since that limits the amount of redundant data you are storing and helps prevent the table from taking up too much disk space.
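
Here is a minimal sketch of the idea for a 20-column constraint, using a hypothetical table and column names (c1 through c20) and leaving out how the concatenated column is kept in sync when its source columns change (triggers or application code are the obvious options):

alter table my_table add column c16_to_c20 varchar(255);

update my_table
   set c16_to_c20 = concat_ws('|', c16, c17, c18, c19, c20);

alter table my_table
  add unique index uniq_c1_to_c20
    (c1, c2, c3, c4, c5, c6, c7, c8,
     c9, c10, c11, c12, c13, c14, c15, c16_to_c20);

Using concat_ws() with a delimiter rather than plain concatenation avoids false matches where different combinations of values happen to concatenate to the same string (for example 'ab' + 'c' versus 'a' + 'bc').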

Read on ✈

Loading Large Flat Files Into MySQL With LOAD DATA INFILE and Pt-fifo-split

It’s easy to load multiple rows of data from a file into a MySQL table with a single LOAD DATA INFILE command, but if the file is over a certain size you may want to load it in multiple chunks. When I have a file with millions of lines I typically use pt-fifo-split to separate the file into multiple chunks in order to limit the size of each transaction. Unlike most tools in Percona Toolkit, pt-fifo-split doesn’t actually interact with your database, but this is one MySQL use case where it’s very helpful.

Here’s a quick example in bash that illustrates how I can take a large tab-delimited file, break it into multiple 100,000 line chunks, and load the chunks into a table:

FLAT_FILE="/tmp/big_file.txt"
FIFO_PATH="${FLAT_FILE}.fifo"
LOAD_FILE="${FLAT_FILE}.load"
CHUNK_SIZE=100000

# Split the file
pt-fifo-split --force --lines ${CHUNK_SIZE} ${FLAT_FILE} --fifo ${FIFO_PATH} &
# Sleep 10 seconds to ensure ${FIFO_PATH} exists before entering loop
sleep 10
while [ -e ${FIFO_PATH} ]
do
  # Write chunk to disk
  cat ${FIFO_PATH} > ${LOAD_FILE}
  # Load chunk into table
  mysql --database=test \
   --show-warnings \
   -vve "load data infile '${LOAD_FILE}' into table my_table;"
done

YMMV, so you should do some testing to determine the optimal chunk size based on your schema, storage engine, replication setup, complexity of LOAD DATA INFILE statement, etc.


Using MySQL Sandbox to Test Replication Behavior

It is easy to set up a replication topology with MySQL Sandbox, and I use it for lots of testing scenarios. For example, I used MySQL Sandbox to run the tests for yesterday’s post about the way UUIDs are replicated in row-based replication (RBR) versus statement-based replication (SBR).

Here’s how easy it is to set up a replicated sandbox for testing:

Installing MySQL Sandbox

If you’ve never installed MySQL Sandbox before, you can do so by running a single command as root:

sudo su -
cpan MySQL::Sandbox

Download the MySQL binary

Pick the appropriate binary based on your OS and the version of MySQL you want to use. In my case I’m on Mac OS X and I want to test MySQL 5.6 so I went to dev.mysql.com and downloaded the following compressed TAR archive:

mysql-5.6.22-osx10.8-x86_64.tar.gz

Create the sandbox

This is another one-liner:

make_replication_sandbox ~/Downloads/mysql-5.6.22-osx10.8-x86_64.tar.gz

Use the sandbox

MySQL Sandbox created 3 servers for me: a master and two replicas. Everything is installed under this directory:

$HOME/sandboxes/rsandbox_mysql-5_6_22

Here are the relevant my.cnf files:

master/my.sandbox.cnf
node1/my.sandbox.cnf
node2/my.sandbox.cnf

Here are some handy scripts to start and stop all 3 servers:

./start_all
./stop_all
./restart_all

And here are the handy scripts to connect to each server with the MySQL CLI:

./m
./s1
./s2

It’s easy to get confused about which server you are connected to when you are running multiple servers, so MySQL Sandbox overrides the default mysql prompt with a more informative prompt. For example, here’s the prompt for my master server:

master [localhost] {msandbox} ((none)) > 

Happy testing!


How to Safely Replicate UUID Values in MySQL

When you use the UUID() function to insert or update values in a replicated MySQL database, you need to be careful to ensure the UUIDs are replicated properly. This is primarily a problem in statement-based replication.

Here’s a simple example with row-based replication to illustrate UUIDs replicating successfully:

Master

master > set binlog_format = 'ROW';
Query OK, 0 rows affected (0.00 sec)

master > insert into uuid_test (binlog_format,uuid_string) values(@@binlog_format,uuid());
Query OK, 1 row affected (0.00 sec)

master > select id,binlog_format,uuid_string,md5(uuid_string) from uuid_test;
+----+---------------+--------------------------------------+----------------------------------+
| id | binlog_format | uuid_string                          | md5(uuid_string)                 |
+----+---------------+--------------------------------------+----------------------------------+
|  1 | ROW           | f54448fa-7c20-11e4-8f6a-03e05a24a3c7 | a9f18c42b9d561f126ca007b1d273473 |
+----+---------------+--------------------------------------+----------------------------------+
1 row in set (0.00 sec)

Replica

replica1 > select id,binlog_format,uuid_string,md5(uuid_string) from uuid_test;
+----+---------------+--------------------------------------+----------------------------------+
| id | binlog_format | uuid_string                          | md5(uuid_string)                 |
+----+---------------+--------------------------------------+----------------------------------+
|  1 | ROW           | f54448fa-7c20-11e4-8f6a-03e05a24a3c7 | a9f18c42b9d561f126ca007b1d273473 |
+----+---------------+--------------------------------------+----------------------------------+
1 row in set (0.00 sec)

(Note that in addition to the UUID itself I also check the md5 hash of the UUID to make it easier to visually spot differences. This is relevant for UUIDs in particular since two non-matching UUIDs from the same machine will still have some matching substrings and thus may look like they match at first glance.)

In this case I can see that the UUID value is the same on the master and the replica.

Read on ✈

Handling Invalid Numeric Values in Hive

Recently I found some NULL values in a Hive table in a numeric column that I expected to be NOT NULL. The query that populates the table extracts a numeric value from a JSON-formatted string using the get_json_object() UDF like this:

coalesce(get_json_object(sub_object, '$.key'),0)

The intent of the COALESCE() call is to replace all NULL values of sub_object.key with 0, but when I looked at the actual values I saw both 0 and NULL present in the table. Why are some of the NULL values not being replaced with 0? Because there are some non-numeric values in sub_object.key, and those ended up being inserted as NULL due to implicit type conversion. get_json_object() returns a string, which is then implicitly converted to an int, so a value like 123abc is not overridden by COALESCE(). Instead the implicit type conversion tries to cast 123abc as an int and it ends up as NULL.
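
A quick way to see the conversion rule in isolation (a minimal check, relying on the fact that Hive returns NULL for an invalid numeric cast rather than raising an error):

select cast('123abc' as int);  -- NULL, because '123abc' is not a valid int
select cast('123' as int);     -- 123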

The solution? Use explicit type conversion instead of relying on implicit type conversion.

Specifically I updated my query to cast the string as an int, like this:

coalesce(cast(get_json_object(sub_object, '$.key') as int),0)

With that change in place my table no longer has any NULL values in that column, and all missing values are being treated equally and stored as 0.


Alternatives for Chunking Bulk Deletes in Common_schema

I’ve blogged about common_schema multiple times in the past, and it’s a tool I use frequently. Last week I had a project to delete millions of rows from multiple rollup tables in a star schema. Since the tables are not partitioned I needed to use DELETE instead of DROP PARTITION, but I didn’t want to delete millions of rows in a single transaction. My first instinct was to use common_schema’s split() function to break the deletes into chunks. So I ran a query on INFORMATION_SCHEMA to generate a bunch of statements like this:

call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table2 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table3 where the_date > '2013-03-30') pass;");
...

That’s the simplest way to do deletes with split(), and the tool will automatically determine which index and what chunk size to use. If I were running this on an active database (or a master) I would probably use throttle to control the speed of the deletes, but in this case it was running on passive replicas so I just used pass to run the deletes with no sleep time in between them. I sorted the deletes by table size, from smallest to largest, and had a total of 33 tables to process.
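
For reference, the throttled variant just swaps pass for a throttle statement; something like this, with a throttle value of 2 meaning it sleeps twice as long as each chunk took:

call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') throttle 2;");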

Read on ✈

Feature Flags at Flite

Feature flags (or flippers or toggles or whatever you like to call them) are a really important piece of any development process, especially if you are employing Continuous Integration (CI). Even if you’re not quite there with CI, feature flags can still be a very useful tool that can help agile development teams be more productive, produce higher quality products, and work at the speed of business.

Read on ✈

Avoiding MySQL ERROR 1069 by Explicitly Naming Indexes

Since I recently wrote about both MySQL error 1071 and error 1070 I decided to continue the pattern with a quick note on MySQL error 1069. In case you’ve never seen it before, this is MySQL error 1069:

ERROR 1069 (42000): Too many keys specified; max 64 keys allowed

I can’t think of a valid use case that requires more than 64 indexes on a MySQL table, but it’s possible to get this error by inadvertently adding lots of duplicate indexes to a table. This can happen if you don’t explicitly name your indexes.
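
As a quick sketch of how that happens, using a hypothetical table: each unnamed ADD INDEX quietly creates another copy of the index, and MySQL auto-names the duplicates a, a_2, a_3, and so on.

alter table my_table add index (a);
alter table my_table add index (a);
alter table my_table add index (a);
show indexes from my_table;

If the index had been given an explicit name, the second statement would fail with ERROR 1061 (Duplicate key name) instead of silently piling up duplicates.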

Read on for the full examples…

Read on ✈