Flite Culture

Hash-based Workarounds for MySQL Unique Constraint Limitations

In the past I have written about MySQL’s limitations that prevent you from creating an index over a certain length, or with more than a certain number of columns. I’ve offered some solutions for those problems, but many of my solutions hit another limit further down the line. For example innodb_large_prefix allows you to create an index on larger columns, but only up to 3072 bytes, and my workarounds for the 16 column limit per index will also hit that 3072 byte limit once you add lots of columns or add some very long columns.

Today I’m going to suggest a hash-based solution that bypasses the 3072 byte limit for a specific subset of index use cases. The basic idea is to build a string by concatenating all of the columns we want to include in the index, and create a unique index on the hash of that string, rather than the string itself. The hash will be a fixed length regardless of the input columns, so we don’t need to worry about the 3072 byte limit or the 16 column limit.

For example instead of building a unique index on this:

concat(c1,'-',c2,'-',c3,'-',c4,'-',c5,'-',c6,'-',c7,'-',c8,'-',c9,'-',c10,'-',
c11,'-',c12,'-',c13,'-',c14,'-',c15,'-',c16,'-',c17,'-',c18,'-',c19,'-',c20)

I can build a unique index on this:

unhex(sha2(concat(c1,'-',c2,'-',c3,'-',c4,'-',c5,'-',c6,'-',c7,'-',c8,'-',c9,'-',c10,'-',
c11,'-',c12,'-',c13,'-',c14,'-',c15,'-',c16,'-',c17,'-',c18,'-',c19,'-',c20),256))

An index on a hash value is not useful for most queries, since it can’t speed up range scans or queries using an index prefix, but it will work for unique constraints, and you can also use it for single row lookups if you apply the same hash function to the values you are searching for.
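
Here is a minimal sketch of the idea, using hypothetical table and column names. The row_hash column holds the SHA-256 hash of the concatenated values (maintained by the application or a trigger), the unique index goes on that fixed-length column, and single row lookups hash the search values the same way:

-- Minimal sketch (hypothetical names): enforce uniqueness on the fixed-length
-- hash of the concatenated values rather than on the long string itself.
CREATE TABLE wide_unique (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  c1 VARCHAR(500), c2 VARCHAR(500), c3 VARCHAR(500), c4 VARCHAR(500),
  row_hash BINARY(32) NOT NULL,
  UNIQUE KEY uk_row_hash (row_hash)
);

-- The application (or a trigger) sets row_hash when writing the row:
INSERT INTO wide_unique (c1, c2, c3, c4, row_hash)
VALUES ('a', 'b', 'c', 'd',
        UNHEX(SHA2(CONCAT('a','-','b','-','c','-','d'), 256)));

-- Single row lookups apply the same hash expression to the search values:
SELECT * FROM wide_unique
WHERE row_hash = UNHEX(SHA2(CONCAT('a','-','b','-','c','-','d'), 256));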

Hat tip to Justin Swanhart for giving me the idea to use hashes for this.

Read on for details and code examples.


Groupwise Most Frequent Value Query in Hive

I am used to writing SQL queries and making small changes to accommodate Hive’s limitations, but sometimes I encounter a query that takes a little more work to implement in Hive. Recently I needed to write a query to calculate the most frequently occurring value by key. MySQL and Hive both lack a MODE() function, but there are several ways to accomplish this in MySQL. I usually execute two identical sub-queries to group by key and value, and use an outer join to exclude all of the key/value pairs that are not the most frequently occurring.
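
As a rough sketch of the MySQL version (table and column names are hypothetical), the exclusion join keeps only the key/value pair that no other pair for the same key beats on count, using the value itself as a tie-breaker:

-- Rough sketch (hypothetical names): most frequent val per k, using two
-- identical grouped sub-queries and an exclusion join.
SELECT f1.k, f1.val AS most_frequent_val
FROM (SELECT k, val, COUNT(*) AS cnt FROM my_table GROUP BY k, val) f1
LEFT OUTER JOIN (SELECT k, val, COUNT(*) AS cnt FROM my_table GROUP BY k, val) f2
  ON f1.k = f2.k
  AND (f2.cnt > f1.cnt OR (f2.cnt = f1.cnt AND f2.val > f1.val))
WHERE f2.k IS NULL;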

Read on for details on the MySQL query and how I needed to modify it to make it work in Hive.


Adding a Unique Constraint With More Than 16 Columns in MariaDB

When I started writing this post I planned to follow up on my series of posts by creating a unique constraint with more than 16 columns using a MariaDB virtual column the same way I used a MySQL generated column in my most recent post. During my testing I abandoned that plan when I discovered two things:

  1. MariaDB virtual columns impose a 252 character limit on the expression that defines the column. This works for concatenating lots of columns with very short names like I did in my last post, but in the real world it’s easy to find an example where the column names are long enough that a concatenation expression involving more than 16 columns exceeds 252 characters.
  2. MariaDB doesn’t have the same 16 column limit on indexes; instead it imposes a limit of 32 columns. Thus I can add a unique constraint on 17-32 columns in MariaDB without having to do anything special.

So I can’t use MariaDB virtual columns as a workaround to add a unique constraint on more than 16 columns, but it probably doesn’t matter because I actually don’t need that workaround as long as my unique constraint includes no more than 32 columns.
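
As a quick sketch with a hypothetical 20-column table, the constraint can simply be added directly:

-- Quick sketch (hypothetical table): MariaDB allows up to 32 columns per index,
-- so a 20 column unique constraint needs no workaround.
ALTER TABLE wide_table
  ADD UNIQUE KEY uk_20_cols (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10,
                             c11, c12, c13, c14, c15, c16, c17, c18, c19, c20);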

Read on for code examples.


Using MySQL 5.7 Generated Columns to Avoid ERROR 1070

When I first blogged about a potential workaround for MySQL error 1070, I received a suggestion in the comments that my workaround might be a useful trick with MariaDB virtual columns or MySQL 5.7 generated columns. Although I’ve already blogged about a simplified version of my original workaround, the arrival of the MySQL 5.7.7 release candidate inspired me to try an even simpler implementation using generated columns.
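
As a rough sketch of what that looks like (hypothetical table and column names), a stored generated column holds the concatenation of the columns that don’t fit in the 16-part index, and the unique index includes it as the 16th key part:

-- Rough sketch (hypothetical names): a MySQL 5.7 stored generated column
-- concatenates the extra columns and fills the 16th slot of the unique index.
ALTER TABLE wide_table
  ADD COLUMN c16_to_c20 VARCHAR(255)
    AS (CONCAT_WS('-', c16, c17, c18, c19, c20)) STORED,
  ADD UNIQUE KEY uk_20_cols (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10,
                             c11, c12, c13, c14, c15, c16_to_c20);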

Read on for my findings.


Speaking at Percona Live 2015

A week from today I am speaking at Percona Live: MySQL Conference and Expo 2015.

Here is a quick summary of my talk:

My talk is for anyone who wants to make their life easier using MySQL tools. Rather than doing a deep dive on one or two tools, I will describe the methodology we’ve used at Flite to successfully integrate many different MySQL tools into our day-to-day workflow. The talk will take place Wednesday, April 15 at 3:30 PM in Ballroom E.

I most recently gave this talk at the San Francisco MySQL Meetup in February. If you are interested in my talk but won’t be at Percona Live, you can find the video online here.

See you in Santa Clara next week!


Introducing Pt-reanimate

Most of the time I like pt-kill, but sometimes it gets a little overzealous and kills a query that I need. That’s why I’m glad Percona created pt-reanimate so I can bring those important queries back to life. Of course the queries are not exactly the same when they come back, but at least they come back.

Here’s an example of pt-reanimate in action. First I add a column to a large table:

alter table big_table add column new_col int;

Here is my DDL in the processlist table:

+---------+------+----------------+----------------------------------------------+
| command | time | state          | info                                         |
+---------+------+----------------+----------------------------------------------+
| Query   |  146 | altering table | alter table big_table add column new_col int |
+---------+------+----------------+----------------------------------------------+

Since I’m using MySQL 5.6 this is an online operation, and it does not block reads or writes on the table. Unfortunately pt-kill doesn’t know that, and due to the long runtime of my DDL it killed the query, which then disappeared from the processlist.

Luckily I am also running pt-reanimate, and it brought my query back to life! After pt-reanimate recovered my query, this is what it looked like in the processlist table:

+---------+------+--------+----------------------------------------------+
| command | time | state  | info                                         |
+---------+------+--------+----------------------------------------------+
| Zombie  |  187 | brains | alter table big_table add column new_col int |
+---------+------+--------+----------------------------------------------+

The good news is the query eventually completed successfully. The bad news is it tried to eat some of my other queries, but they were all fast-running queries so they escaped unharmed.


Avoiding MySQL ERROR 1052 by Prefixing Column Names in Multi Table Queries

MySQL is fairly relaxed about prefixing column names with a table name or alias in a query involving multiple tables. As long as the column name you use only exists in one of the tables in your query, MySQL does not require you to prefix it, regardless of SQL mode.

My policy has always been to prefix all column names with the table name/alias in a multiple table query. This is not simply a style choice. If I reference a column name without prefixing it, the query may work today, but it is not future-proof. If the schema changes in the future and I add a column with the same name to one of the other tables involved in that query, then the query will fail with ERROR 1052:

ERROR 1052 (23000): Column 'used_to_be_unique' in field list is ambiguous

Therefore I recommend always prefixing all column names in a multi table query to protect your queries against future additive schema changes.

Here’s some code to illustrate the problem:

mysql> create table t1 (a int);
Query OK, 0 rows affected (0.03 sec)

mysql> create table t2 (a int, b int);
Query OK, 0 rows affected (0.02 sec)

mysql> select t1.a, b 
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
Empty set (0.00 sec)

mysql> alter table t1 add column b int;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select t1.a, b 
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
ERROR 1052 (23000): Column 'b' in field list is ambiguous
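
The fix is simply to follow the prefixing policy and qualify b as well; continuing the same example:

mysql> select t1.a, t2.b
    -> from t1
    ->   inner join t2 on t2.a = t1.a;
Empty set (0.00 sec)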

Updating Millions of Rows in MySQL With Common_schema

Last week I had to update several million rows in a single MySQL table. Rather than executing the update directly in the database, I decided to use common_schema’s split() function.

There are two reasons I chose to use common_schema for this task:

Chunking

Chunking breaks a single update statement into multiple statements, each updating at most a specified number of rows. The default chunk size is 1000, and I changed it to 2000 by setting the size parameter.

Throttling

Throttling has two benefits: it minimizes the amount of load on the master host, and it minimizes the replication lag introduced when the update statements are executed on the replicas. It achieves this by introducing sleep statements after each chunk is updated. The duration of the sleep statement is proportional to the runtime of the update statement. After experimenting with several different throttle values in a test environment, I decided on a value of 4, which means it will sleep for 4 times as long as each update takes. Thus if an update takes 100 milliseconds, it will sleep for 400 milliseconds before executing the next update. Using throttle does not guarantee that you will completely avoid replication lag, but if you do some testing you should be able to find an appropriate value that will strike a balance between the total run time of your updates and the level of replication lag that it introduces.
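
To illustrate the chunk-and-throttle idea in plain SQL (this is not common_schema’s actual split() syntax; the table, columns, and WHERE clause are hypothetical), a hand-rolled version might look like this:

-- Plain SQL illustration of chunking plus proportional throttling.
-- NOT common_schema syntax; table, column, and WHERE clause are hypothetical.
DELIMITER //
CREATE PROCEDURE chunked_update()
BEGIN
  DECLARE rows_updated INT DEFAULT 1;
  DECLARE started DECIMAL(20,3);
  WHILE rows_updated > 0 DO
    SET started = UNIX_TIMESTAMP(NOW(3));
    -- Update at most 2000 rows per chunk; updated rows no longer match the WHERE clause
    UPDATE big_table SET status = 'archived' WHERE status = 'pending' LIMIT 2000;
    SET rows_updated = ROW_COUNT();
    -- Throttle factor of 4: sleep 4 times as long as the update took
    DO SLEEP(4 * (UNIX_TIMESTAMP(NOW(3)) - started));
  END WHILE;
END//
DELIMITER ;

common_schema handles all of this (plus progress reporting) for you; the sketch just shows what the size and throttle parameters control.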


Adding a Unique Constraint With More Than 16 Columns in MySQL

A while ago I blogged about a potential workaround for MySQL error 1070 that could be used to add a unique constraint with more than 16 columns.

As a reminder here’s the error you get when you try to create a unique constraint with more than 16 columns in MySQL:

ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed

The solution I proposed should work, but when I started to implement it I made a couple of changes that merit this follow-up post. One change is fairly generic, and the other is specific to my use case.

The generic change was to only concatenate some of the columns, rather than all of them. Specifically I only need to concatenate N-15 columns. So if I want to have a unique constraint on 20 columns, I can include 15 of those actual columns and then concatenate the remaining 5 columns into a new column that is included as the 16th column in the unique constraint. When choosing which columns to concatenate, it makes sense to choose columns with short values, since that limits the amount of redundant data you are storing and helps prevent the table from taking up too much disk space.
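
As a rough sketch with a hypothetical 20-column table, that means 15 real columns plus one extra column holding the concatenation of the remaining 5, kept in sync by the application or a trigger:

-- Rough sketch (hypothetical names): the 16th key part is a regular column
-- that stores CONCAT(c16,'-',c17,'-',c18,'-',c19,'-',c20).
ALTER TABLE wide_table
  ADD COLUMN c16_to_c20 VARCHAR(255) NOT NULL,
  ADD UNIQUE KEY uk_20_cols (c1, c2, c3, c4, c5, c6, c7, c8, c9, c10,
                             c11, c12, c13, c14, c15, c16_to_c20);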


Loading Large Flat Files Into MySQL With LOAD DATA INFILE and Pt-fifo-split

It’s easy to load multiple rows of data from a file into a MySQL table with a single LOAD DATA INFILE command, but if the file is over a certain size you may want to load it in multiple chunks. When I have a file with millions of lines I typically use pt-fifo-split to separate the file into multiple chunks in order to limit the size of each transaction. Unlike most tools in Percona Toolkit, pt-fifo-split doesn’t actually interact with your database, but this is one MySQL use case where it’s very helpful.

Here’s a quick example in bash that illustrates how I can take a large tab-delimited file, break it into multiple 100,000 line chunks, and load the chunks into a table:

FLAT_FILE="/tmp/big_file.txt"
FIFO_PATH="${FLAT_FILE}.fifo"
LOAD_FILE="${FLAT_FILE}.load"
CHUNK_SIZE=100000

# Split the file
pt-fifo-split --force --lines ${CHUNK_SIZE} ${FLAT_FILE} --fifo ${FIFO_PATH} &
# Sleep 10 seconds to ensure ${FIFO_PATH} exists before entering the loop
sleep 10
while [ -e ${FIFO_PATH} ]
do
  # Write chunk to disk
  cat ${FIFO_PATH} > ${LOAD_FILE}
  # Load chunk into table
  mysql --database=test \
   --show-warnings \
   -vve "load data infile '${LOAD_FILE}' into table my_table;"
done

YMMV, so you should do some testing to determine the optimal chunk size based on your schema, storage engine, replication setup, complexity of LOAD DATA INFILE statement, etc.
