Flite Careers

Using Explode and Lateral View in Hive

At Flite we store a lot of JSON formatted log data in Amazon S3, and we use Amazon Elastic MapReduce to run Hive queries on that log data using Amazon’s standard JSON SerDe. Sometimes we store embedded JSON objects, which can be extracted pretty easily using the get_json_object function.

We also store some embedded JSON objects, including arrays. Recently I wanted to do a rollup query in Hive using the exploded results of an embedded array using the explode UDTF along with the LATERAL VIEW syntax. Converting the JSON array (which is stored as a string) to a Hive array was a bit messy, but I got it to work. Below I will show the query and break down the steps to create the query.

The query

Here’s the final query to count impressions by placement, id, and version for a given partition:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
select d.placement_id, d.id, d.version, count(1) as impressions
from
(
  select c.placement_id,
    get_json_object(c.info_object,'$.id') as id,
    get_json_object(c.info_object,'$.v') as version
  from
  (
    select b.placement_id, info_object
    from
    (
      select placement_id,
        split(regexp_replace(regexp_replace(a.info_array_string,'\\}\\,\\{','\\}\\;\\{'),'\\[|\\]',''),'\\;') as info_array
      from
      (
        select placement_id,
          get_json_object(embedded_data, '$.info') as info_array_string
        from ad_impression
        where dt = '2014-04-15-12'
      ) a
    ) b lateral view explode(b.info_array) info_array_exploded as info_object
  ) c
) d
group by d.placement_id, d.id, d.version;

Read on for a detailed explanation of how the query is built.

Step 1: Extracting the info array as a string

The embedded_data column stores strings like this:

1
{"key1":"0","key2":"11.800","info":"[{\"id\":\"1234\",\"v\":\"3\"},{\"id\":\"5678\",\"v\":\"3\"}]","key3":"3"}

In this particular query I’m only interested in the “info” array from embedded_data, so I extract it like this:

1
get_json_object(embedded_data, '$.info')

Step 2: Reformatting the array string

This is likely not the only (or best) way to do this, but it works for my specific case.

The info_array_string value looks like this:

1
[{"id":"6256753","v":"3"},{"id":"6256751","v":"3"},{"id":"6256749","v":"3"}]

I want to use the Hive split function to convert that to an array, but before I do that I need to remove the surrounding square brackets, and replace the delimiter between each array element with a character other than a comma, since the comma is also used as a delimiter inside each array element.

To accomplish that I use the Hive regexp_replace function to remove the square brackets and replace all occurences of },{ with };{. Semi-colon is a safe delimiter in my case since I control the data format and know that semi-colons will never occur inside the array elements. As far as I know Hive doesn’t have a plain replace function, thus the need to use regexp_replace.

Here’s the code to do the string replacements:

1
regexp_replace(regexp_replace(a.info_array_string,'\\}\\,\\{','\\}\\;\\{'),'\\[|\\]','')

Step 3: Converting the array string to a Hive array

Now that the array string is formatted properly, I can use the Hive split function to convert it to a Hive array by splitting on semi-colon. Again, this only works if I know there are no other semi-colons in the formatted array string:

1
split(formatted_array_string,'\\;')

Step 4: Exploding the Hive array and joining to the exploded array

Now that I have a Hive array I can finally use the explode function. The point of explode is to expand an array in a single row across multiple rows, one for each value in the array.

So this:

1
[1,2,3]

Becomes this:

1
2
3
1
2
3

In my case I just explode my Hive array and then join to it with LATERAL VIEW in order to correlate each placement_id with the various objects from the info array:

1
lateral view explode(b.info_array) info_array_exploded as info_object

Step 5: Extracting the id and version values

Since the info object is itself an embedded object with multiple values, I still need to extract the version and id values in the outer query. I do that with get_json_object:

1
2
get_json_object(c.info_object,'$.id') as id,
get_json_object(c.info_object,'$.v') as version

Putting it all together, the resulting Hive query gives me the data I want.

Comments

Comments