Flite Careers

Handling Invalid Numeric Values in Hive

Recently I found some NULL values in a Hive table in a numeric column that I expected to be NOT NULL. The query that populates the table extracts a numeric value from a json-formatted string using the get_json_object() UDF like this:

1
coalesce(get_json_object(sub_object, '$.key'),0)

The intent of the COALESCE() call is to replace all NULL values of sub_object.key with 0, but when I looked at the actual values I saw both 0 and NULL present in the table. Why are some of the NULL values not being replaced with 0? Because there are some non-numeric values in sub_object.key, and those ended up being inserted as NULL due to implicit type conversion. get_json_object() returns a string, which is then being implicitly converted to an int, so a value like 123abc would not be overidden by COALESCE(). Instead the implicit type conversion tries to cast 123abc as an int and it ends up as NULL.

The solution? Use explicit type conversion instead of relying on implicit type conversion.

Specifically I updated my query to cast the string as an int, like this:

1
coalesce(cast(get_json_object(sub_object, '$.key') as int),0)

With that change in place my table no longer has any NULL values in that column, and all missing values are being treated equally and stored as 0.

Comments

Comments