Flite Culture

Handling Invalid Numeric Values in Hive

Recently I found some NULL values in a numeric column of a Hive table that I expected to be NOT NULL. The query that populates the table extracts a numeric value from a JSON-formatted string using the get_json_object() UDF, like this:

coalesce(get_json_object(sub_object, '$.key'),0)

The intent of the COALESCE() call is to replace all NULL values of sub_object.key with 0, but when I looked at the actual values I saw both 0 and NULL present in the table. Why are some of the NULL values not being replaced with 0? Because there are some non-numeric values in sub_object.key, and those ended up being inserted as NULL due to implicit type conversion. get_json_object() returns a string, which is then implicitly converted to an int, so a value like 123abc is not NULL and is not overridden by COALESCE(). Instead, the implicit type conversion tries to cast 123abc as an int and the result ends up as NULL.
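Here’s a quick way to reproduce the behavior in a Hive session, using a string literal as a stand-in for the extracted value (and assuming a Hive version that supports SELECT without a FROM clause):

-- the extracted value is a non-NULL string, so COALESCE() passes it through unchanged
select coalesce('123abc', 0);      -- '123abc'

-- the NULL only appears when that string is implicitly cast to the int column
select cast('123abc' as int);      -- NULL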

The solution? Use explicit type conversion instead of relying on implicit type conversion.

Specifically I updated my query to cast the string as an int, like this:

coalesce(cast(get_json_object(sub_object, '$.key') as int),0)

With that change in place my table no longer has any NULL values in that column, and all missing or non-numeric values are treated the same and stored as 0.


Alternatives for Chunking Bulk Deletes in Common_schema

I’ve blogged about common_schema multiple times in the past, and it’s a tool I use frequently. Last week I had a project to delete millions of rows from multiple rollup tables in a star schema. Since the tables are not partitioned, I needed to use DELETE instead of DROP PARTITION, but I didn’t want to delete millions of rows in a single transaction. My first instinct was to use common_schema’s split() function to break the deletes into chunks. So I ran a query on INFORMATION_SCHEMA to generate a bunch of statements like this:

call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table2 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table3 where the_date > '2013-03-30') pass;");
...

That’s the simplest way to do deletes with split(), and the tool will automatically determine which index and what chunk size to use. If I were running this on an active database (or a master) I would probably use throttle to control the speed of the deletes, but in this case it was running on passive replicas so I just used pass to run the deletes with no sleep time in between them. I sorted the deletes by table size, from smallest to largest, and had a total of 33 tables to process.
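If I had needed throttling, each statement would have looked something like this (a sketch based on QueryScript’s throttle statement, where a ratio of 2 sleeps for roughly twice each chunk’s execution time):

-- same chunked delete, but sleep after each chunk to reduce load
call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') {throttle 2;}");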


Feature Flags at Flite

Feature flags (or flippers or toggles or whatever you like to call them) are a really important piece of any development process, especially if you are employing Continuous Integration (CI). Even if you’re not quite there with CI, feature flags can still be a very useful tool that can help agile development teams be more productive, produce higher quality products, and work at the speed of business.


Avoiding MySQL ERROR 1069 by Explicitly Naming Indexes

Since I recently wrote about both MySQL error 1071 and error 1070, I decided to continue the pattern with a quick note on MySQL error 1069. In case you’ve never seen it before, this is MySQL error 1069:

ERROR 1069 (42000): Too many keys specified; max 64 keys allowed

I can’t think of a valid use case that requires more than 64 indexes on a MySQL table, but it’s possible to get this error by inadvertently adding lots of duplicate indexes to a table. This can happen if you don’t explicitly name your indexes.
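Here’s a quick illustration of how the duplicates pile up (table and column names are made up):

create table t (id int primary key, val int) engine=innodb;

-- each of these succeeds, silently creating another duplicate index:
-- val, val_2, val_3, and so on up to the 64-key limit
alter table t add index (val);
alter table t add index (val);
alter table t add index (val);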

Read on for examples…


A Workaround for MySQL ERROR 1070

As documented in the Reference Manual, MySQL supports a maximum of 16 columns per index. That’s more than sufficient for most index use cases, but what about unique constraints? If I create a fact table with more than 16 dimension columns in my star schema, and then try to add an index to enforce a unique constraint across all of the dimension columns, I’ll get this error:

ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed

For multi-column unique indexes, internally MySQL concatenates all of the column values together in a single hyphen-delimited string for comparison. Thus I can simulate a multi-column unique index by adding an extra column that stores the concatenated column values, and adding a unique index on that column.
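As a rough sketch of the idea (shown with only three dimension columns for brevity; the real tables have more than 16, which is what triggers the error, and the extra column has to be kept in sync by whatever loads the table):

create table fact_demo (
  dim1 int not null,
  dim2 int not null,
  dim3 int not null,
  clicks bigint not null,
  dim_key varchar(255) not null,
  unique key uk_dim_key (dim_key)
) engine=innodb;

-- the load builds the hyphen-delimited key from the dimension values
insert into fact_demo (dim1, dim2, dim3, clicks, dim_key)
values (1, 2, 3, 100, concat_ws('-', 1, 2, 3));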

Read on for details…


Simulating Add Column if Not Exists in MySQL With Common_schema

Some MySQL DDL commands such as CREATE TABLE and DROP TABLE support an IF [NOT] EXISTS option, which downgrades the error to a warning if you try to create something that already exists or drop something that doesn’t exist.

For example this gives an error:

mysql> drop table sakila.fake_table;
ERROR 1051 (42S02): Unknown table 'sakila.fake_table'

And this gives a warning:

mysql> drop table if exists sakila.fake_table;
Query OK, 0 rows affected, 1 warning (0.00 sec)

Note (Code 1051): Unknown table 'sakila.fake_table'

You may also want to use IF [NOT] EXISTS for column-level changes such as ADD COLUMN and DROP COLUMN, but MySQL does not support that.
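Outside of common_schema, the general idea can be sketched in plain SQL by checking INFORMATION_SCHEMA before running the ALTER (the table and column names here are just placeholders):

-- build the ALTER statement only if the column is missing, otherwise run a harmless SELECT
set @ddl := (
  select if(count(*) = 0,
            'alter table sakila.actor add column middle_name varchar(45)',
            'select 1')
  from information_schema.columns
  where table_schema = 'sakila'
    and table_name = 'actor'
    and column_name = 'middle_name'
);
prepare stmt from @ddl;
execute stmt;
deallocate prepare stmt;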

Read on for some examples of how to simulate IF [NOT] EXISTS using the QueryScript language from common_schema.


Using Innodb_large_prefix to Avoid ERROR 1071

If you’ve ever tried to add an index that includes a long varchar column to an InnoDB table in MySQL, you may have seen this error:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

The character limit depends on the character set you use. For example, if you use latin1 then the largest column you can index is varchar(767), but if you use utf8 the limit is varchar(255). There is also a separate 3072 byte limit per index. The 767 byte limit is per column, so you can include multiple columns (each 767 bytes or smaller) up to 3072 total bytes per index, but no single column longer than 767 bytes. (MyISAM is a little different: it has a 1000 byte index length limit, but no separate column length limit within that.)

One workaround for these limits is to only index a prefix of the longer columns, but what if you want to index more than 767 bytes of a column in InnoDB?

In that case you should consider using innodb_large_prefix, which was introduced in MySQL 5.5.14 and allows you to include columns up to 3072 bytes long in InnoDB indexes. It does not affect the index limit, which is still 3072 bytes as quoted in the manual:

The InnoDB internal maximum key length is 3500 bytes, but MySQL itself restricts this to 3072 bytes. This limit applies to the length of the combined index key in a multi-column index.
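For reference, on 5.5 and 5.6 enabling it looks roughly like this; innodb_large_prefix only applies to tables that use the Barracuda file format with a DYNAMIC or COMPRESSED row format (the table below is just an illustration):

set global innodb_file_format = 'Barracuda';
set global innodb_file_per_table = 1;
set global innodb_large_prefix = 1;

-- with the large prefix enabled, a latin1 varchar(1000) can be indexed in full
create table t (
  long_col varchar(1000),
  key (long_col)
) engine=innodb row_format=dynamic default charset=latin1;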

Read on for details and examples about innodb_large_prefix.


Replacing Pt-slave-delay With MASTER_DELAY in MySQL 5.6

In the past I have used pt-slave-delay when I want to maintain an intentionally delayed replica. Now that I have upgraded to MySQL 5.6 I am switching over to use MASTER_DELAY, which is a built-in feature that does the same thing.

For example I can replace this:

pt-slave-delay --delay 7d --interval 1m --daemonize

With this:

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 604800;
START SLAVE;

The implementation is similar: the IO thread copies the events to the relay log as fast as normal, but the SQL thread only executes events older than the defined lag. The process to fast-forward the replica should be similar as well.
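The configured delay is also easy to verify on the replica, since MySQL 5.6 added SQL_Delay and SQL_Remaining_Delay fields to SHOW SLAVE STATUS:

SHOW SLAVE STATUS\G
-- relevant fields in the output:
--   SQL_Delay: 604800
--   SQL_Remaining_Delay: <seconds until the next delayed event runs, or NULL>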

So far I see a couple of advantages of using MASTER_DELAY:

  • Running stop slave and start slave manually can’t cause the replica to catch up beyond the defined lag, which could happen with pt-slave-delay
  • No need to monitor the daemon (I used monit for this when running pt-slave-delay)

Since MASTER_DELAY is part of CHANGE MASTER it is persisted with the other replication configuration data, so it doesn’t need to (and can’t) be defined in my.cnf, and it survives a reboot.


Fine-Tuning MySQL Full-Text Search With InnoDB

If you are using FULLTEXT indexes in MySQL and plan to switch from MyISAM to InnoDB, you should review the reference manual section on Fine-Tuning MySQL Full-Text Search to see what configuration changes may be required. As I mentioned in yesterday’s post, when I compared query results on my database with FULLTEXT indexes in MyISAM versus InnoDB I got different results. Specifically, the InnoDB tables were returning fewer results for certain queries with short FULLTEXT search terms. Here’s an example of a query that returned fewer results on InnoDB:

select id
from flite.ad_index
where match(name,description,keywords) against('+v1*' IN BOOLEAN MODE);

The issue was that all of the fine tuning I had done before was limited to MyISAM, so it didn’t affect InnoDB. In the past I configured MySQL FULLTEXT search to index words as short as 1 character (the MyISAM default is 4), and to index common words (not to use any stopword list). These are the relevant variables I set in my.cnf:

ft_min_word_len = 1
ft_stopword_file = ''

InnoDB has its own variables to control stopwords and minimum word length, so I needed to set these variables when I changed the tables from MyISAM to InnoDB:

innodb_ft_min_token_size = 1
innodb_ft_enable_stopword = OFF

Since those variables are not dynamic, I had to restart MySQL for them to take effect. Furthermore, I needed to rebuild the FULLTEXT indexes on the relevant tables. This is how the manual instructs you to rebuild the indexes:

To rebuild the FULLTEXT indexes for an InnoDB table, use ALTER TABLE with the DROP INDEX and ADD INDEX options to drop and re-create each index.

Rather than drop and recreate the indexes, I just used ALTER TABLE ... FORCE to rebuild the table (and indexes), like this:

alter table flite.ad_index force;

After making those changes I re-ran pt-upgrade, and now I am getting the same set of rows back from MyISAM and InnoDB. The order of the rows is slightly different in some cases, but as I mentioned yesterday that is expected behavior.
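One quick sanity check worth doing after the restart is to confirm that the new InnoDB settings are actually in effect:

-- both values should match what was set in my.cnf
show global variables like 'innodb_ft_min_token_size';
show global variables like 'innodb_ft_enable_stopword';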


Testing MySQL FULLTEXT Indexes in InnoDB Using Pt-upgrade

As I prepare to convert some MySQL tables with FULLTEXT indexes from MyISAM to InnoDB, I want to verify that running a standard production query set against the tables will return the same results with InnoDB that it did with MyISAM. Since I read Matt Lord’s blog post about the document relevancy rankings used for InnoDB full-text searches, I knew to expect some differences when sorting by relevancy, so I want to focus on getting the same set of rows back, mostly ignoring the order in which the rows are returned.

Percona Toolkit has a tool called pt-upgrade that works well for this purpose. I used 2 test servers with a copy of my production database. On one of the servers I left the tables in MyISAM, and on the other I converted the tables to InnoDB. I copied a slow query log from a production host running with long_query_time=0 to get the query set for testing. Since I was only interested in queries on a few tables, rather than running the entire slow query log against the servers, I just extracted the specific queries I was interested in and ran them as a raw log.

Here’s the command I used:

pt-upgrade --read-only \
  --database flite \
  --type rawlog /tmp/proddb-slow.log.raw \
  h=testdb33.flite.com \
  h=testdb47.flite.com

I used the --read-only flag so pt-upgrade would only execute SELECT statements, and not any statements that modify data.

Since I extracted the SQL queries from the slow query log instead of using the full slow query log, I used --type rawlog instead of the default of --type slowlog.

For the two hosts I compared, testdb33 is using FULLTEXT on InnoDB, and testdb47 is using FULLTEXT on MyISAM.

When I ran pt-upgrade it exposed several significant discrepancies. I will document those discrepancies and how I fixed them in a future post.
