Flite Culture

Closing the Unique Index Null Loophole

As documented in the MySQL Reference Manual:

a UNIQUE index permits multiple NULL values for columns that can contain NULL.

There are many scenarios in which this behavior is desirable. For example, imagine you create a user table that requires an email address but makes the username optional. You make the username column nullable, but you also want all non-null values in the column to be unique, so it makes sense to add a UNIQUE INDEX. If the UNIQUE INDEX didn’t allow multiple NULL values, you would need to move the username column to a separate table in order to both maintain uniqueness and allow users without usernames.
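As a minimal sketch of that scenario (the table and column names here are hypothetical, not from a real schema), the UNIQUE index happily accepts repeated NULLs while still rejecting repeated non-null values:

```sql
-- Hypothetical user table: email is required, username is optional
-- but must be unique when present.
CREATE TABLE user (
  user_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  email    VARCHAR(255) NOT NULL,
  username VARCHAR(64) NULL,
  PRIMARY KEY (user_id),
  UNIQUE INDEX username_index (username)
);

-- Both of these succeed: the UNIQUE index permits multiple NULLs.
INSERT INTO user (email, username) VALUES ('a@example.com', NULL);
INSERT INTO user (email, username) VALUES ('b@example.com', NULL);

-- The first succeeds; the second fails with a duplicate-key error.
INSERT INTO user (email, username) VALUES ('c@example.com', 'alice');
INSERT INTO user (email, username) VALUES ('d@example.com', 'alice');
```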

But there are also plenty of scenarios where this behavior can cause problems. For example, consider aggregated fact tables in a star schema. You typically only want one row for a given combination of dimension values, and you can enforce this using a UNIQUE INDEX. However, if one or more of the dimension columns are nullable then you are at risk of having duplicate data.

Recently this affected an ETL process at Flite that loads data into a MySQL star schema. Sometimes I need to re-run all or part of the ETL process after a failure, and I rely on unique indexes to prevent the same data from being loaded into the star schema more than once. Most of my dimension columns do not allow null values, but a few of them do, which led to some data duplication when I re-ran pieces of the ETL flow. To fix the problem I identified all of the affected columns, cleaned up the existing data, made the columns NOT NULL, and chose sentinel values for missing data: the empty string '' for text columns and 0 for numeric columns.
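A rough sketch of that cleanup (the table and column names here are illustrative placeholders, not the actual Flite schema):

```sql
-- Replace NULLs with sentinel values so the columns can be made NOT NULL.
UPDATE my_fact_table SET some_text_dim = ''   WHERE some_text_dim IS NULL;
UPDATE my_fact_table SET some_numeric_dim = 0 WHERE some_numeric_dim IS NULL;

-- Then tighten the column definitions so the loophole is closed for good.
ALTER TABLE my_fact_table
  MODIFY COLUMN some_text_dim VARCHAR(64) NOT NULL,
  MODIFY COLUMN some_numeric_dim INT NOT NULL;
```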

Read on for details on how to find and fix this problem. For my examples I will use the Foodmart sample database from Pentaho.

Introducing the problem

I’ll use the sales_fact_1997 table for my example:

mysql> desc sales_fact_1997;
+--------------+---------------+------+-----+---------+-------+
| Field        | Type          | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------+
| product_id   | int(11)       | NO   | MUL | NULL    |       |
| time_id      | int(11)       | NO   | MUL | NULL    |       |
| customer_id  | int(11)       | NO   | MUL | NULL    |       |
| promotion_id | int(11)       | NO   | MUL | NULL    |       |
| store_id     | int(11)       | NO   | MUL | NULL    |       |
| store_sales  | decimal(10,4) | NO   |     | NULL    |       |
| store_cost   | decimal(10,4) | NO   |     | NULL    |       |
| unit_sales   | decimal(10,4) | NO   |     | NULL    |       |
+--------------+---------------+------+-----+---------+-------+

It doesn’t have a UNIQUE INDEX, so I add one:

mysql> alter table sales_fact_1997 
    ->   add unique index all_dimension_index (product_id,time_id,customer_id,promotion_id,store_id);
Query OK, 86829 rows affected (0.69 sec)
Records: 86829  Duplicates: 0  Warnings: 0

If I dump the data and try to reload it, all rows are correctly rejected as duplicates:

mysql> select *
    -> into outfile '/tmp/sales_fact_1997.no_nulls.txt'
    -> from sales_fact_1997;
Query OK, 86829 rows affected (0.08 sec)

mysql> 
mysql> load data infile '/tmp/sales_fact_1997.no_nulls.txt'
    -> ignore into table sales_fact_1997;
Query OK, 0 rows affected (0.46 sec)
Records: 86829  Deleted: 0  Skipped: 86829  Warnings: 0

But if I make the promotion_id column nullable and replace 0 with NULL, then dump the data and try to reload it again, all of the rows with a NULL promotion_id are re-inserted, giving me duplicate data:

mysql> alter table sales_fact_1997
    ->   modify column promotion_id int null;
Query OK, 86829 rows affected (0.70 sec)
Records: 86829  Duplicates: 0  Warnings: 0

mysql> update sales_fact_1997
    -> set promotion_id = NULL
    -> where promotion_id = 0;
Query OK, 63652 rows affected (1.14 sec)
Rows matched: 63652  Changed: 63652  Warnings: 0

mysql> select *
    -> into outfile '/tmp/sales_fact_1997.with_nulls.txt'
    -> from sales_fact_1997;
Query OK, 86829 rows affected (0.06 sec)

mysql> load data infile '/tmp/sales_fact_1997.with_nulls.txt'
    -> ignore into table sales_fact_1997;
Query OK, 63652 rows affected (1.51 sec)
Records: 86829  Deleted: 0  Skipped: 23177  Warnings: 0

Finding vulnerable columns

The simplest way to find fact tables in the foodmart schema that have a nullable column in a UNIQUE INDEX is to run a query against common_schema:

mysql> select * 
    -> from common_schema._unique_keys 
    -> where has_nullable = true 
    -> and table_schema = 'foodmart' 
    -> and table_name like '%fact%';
+--------------+-----------------+---------------------+-----------------------+--------------+------------+------------------------------------------------------+-------------------+
| TABLE_SCHEMA | TABLE_NAME      | INDEX_NAME          | COUNT_COLUMN_IN_INDEX | has_nullable | is_primary | COLUMN_NAMES                                         | FIRST_COLUMN_NAME |
+--------------+-----------------+---------------------+-----------------------+--------------+------------+------------------------------------------------------+-------------------+
| foodmart     | sales_fact_1997 | all_dimension_index |                     5 |            1 |          0 | product_id,time_id,customer_id,promotion_id,store_id | product_id        |
+--------------+-----------------+---------------------+-----------------------+--------------+------------+------------------------------------------------------+-------------------+

That’s a good way to find out how many tables are affected, but to figure out which specific columns are affected, I wrote my own query against information_schema:

mysql> select c.table_schema,c.table_name,c.column_name  
    -> from information_schema.table_constraints tc
    ->   inner join information_schema.key_column_usage kcu on kcu.TABLE_SCHEMA = tc.TABLE_SCHEMA and kcu.TABLE_NAME = tc.TABLE_NAME and kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
    ->   inner join information_schema.columns c on c.table_schema = kcu.table_schema and c.table_name = kcu.table_name and c.column_name = kcu.column_name
    -> where tc.table_schema = 'foodmart'
    ->   and tc.constraint_type = 'UNIQUE'
    ->   and c.table_name like '%fact%'
    ->   and c.is_nullable = 'YES'
    -> order by c.table_schema,c.table_name,c.column_name;
+--------------+-----------------+--------------+
| table_schema | table_name      | column_name  |
+--------------+-----------------+--------------+
| foodmart     | sales_fact_1997 | promotion_id |
+--------------+-----------------+--------------+

Now that I’ve found the affected tables and columns, it’s time to fix them.

Fixing the problem

I can use a similar query against information_schema to generate the DDL to make those columns NOT NULL:

mysql> select concat('ALTER TABLE ',c.table_schema,'.',c.table_name,' MODIFY COLUMN ',  
    ->   c.column_name,
    ->   ' ',
    ->   c.column_type,
    ->   ' ',
    ->   case when c.CHARACTER_SET_NAME is not null and c.CHARACTER_SET_NAME != '' then concat(' CHARACTER SET ',c.CHARACTER_SET_NAME) else '' end,
    ->   case when c.COLLATION_NAME is not null and c.COLLATION_NAME != '' then concat(' COLLATE ',c.COLLATION_NAME) else '' end,
    ->   ' NOT NULL',
    ->   ';') as sql_stmt 
    -> from information_schema.table_constraints tc
    ->   inner join information_schema.key_column_usage kcu on kcu.TABLE_SCHEMA = tc.TABLE_SCHEMA and kcu.TABLE_NAME = tc.TABLE_NAME and kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
    ->   inner join information_schema.columns c on c.table_schema = kcu.table_schema and c.table_name = kcu.table_name and c.column_name = kcu.column_name
    -> where tc.table_schema = 'foodmart'
    ->   and tc.constraint_type = 'UNIQUE'
    ->   and c.table_name like '%fact%'
    ->   and c.is_nullable = 'YES';
+------------------------------------------------------------------------------------+
| sql_stmt                                                                           |
+------------------------------------------------------------------------------------+
| ALTER TABLE foodmart.sales_fact_1997 MODIFY COLUMN promotion_id int(11)  NOT NULL; |
+------------------------------------------------------------------------------------+

I can’t run the DDL yet because of the duplicate data I introduced earlier, so I need to fix the data first. There are many ways to remove duplicate data, and this method will not work for all cases. But here the row count is small and I know the duplicate rows are true duplicates, so my preferred method is simply to dump the data to disk, truncate the table, run the DDL, and then reload the data, letting the unique index enforce uniqueness:

mysql> select *
    -> into outfile '/tmp/sales_fact_1997.backup.txt'
    -> from sales_fact_1997;
Query OK, 150481 rows affected (0.12 sec)

mysql> truncate table sales_fact_1997;
Query OK, 0 rows affected (0.01 sec)

mysql> ALTER TABLE foodmart.sales_fact_1997 MODIFY COLUMN promotion_id int(11)  NOT NULL;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> load data infile '/tmp/sales_fact_1997.backup.txt'
    -> ignore into table sales_fact_1997;
Query OK, 86829 rows affected, 63652 warnings (0.75 sec)
Records: 150481  Deleted: 0  Skipped: 63652  Warnings: 63652

Now I can re-run the common_schema query one more time to confirm that the loophole is closed; it should return an empty result:

mysql> select * 
    -> from common_schema._unique_keys 
    -> where has_nullable = true 
    -> and table_schema = 'foodmart' 
    -> and table_name like '%fact%';