Flite Careers

Groupwise Most Frequent Value Query in Hive

I am used to writing SQL queries and making small changes to accomodate Hive’s limitations, but sometimes I encounter a query that takes a little more work to implement in Hive. Recently I needed to write a query to calculate the most frequently occurring value by key. MySQL and Hive both lack a MODE() function, but there are several ways to accomplish this in MySQL. I usually execute two identical sub-queries to group by key and value, and use an outer join to exclude all of the key/value pairs that are not the more frequently occurring.

Read on for details on the MySQL query and how I needed to modify it to make it work in Hive.

Imagine I have multiple records with names of YouTube videos and their duration in seconds, and I need to calculate a definitive duration for each video by using the most frequently occuring value. In case of a tie I’ll use the longer duration.

Here’s a sample data set:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
mysql> select video_name,duration from video_time;
+-----------------------+----------+
| video_name            | duration |
+-----------------------+----------+
| How to Slice a Mango  |      194 |
| Diet Coke + Mentos    |      179 |
| Diet Coke + Mentos    |      179 |
| Charlie bit my finger |       54 |
| Charlie bit my finger |       55 |
| Charlie bit my finger |       55 |
| Gangnam Style         |      252 |
| Gangnam Style         |      252 |
| Gangnam Style         |      253 |
| Gangnam Style         |      253 |
+-----------------------+----------+
10 rows in set (0.00 sec)

As an aside, there is a simpler use case where I just want to get the most frequently occurring key across the entire data set, ignoring the values. That is easy in MySQL:

1
2
3
4
5
6
7
8
9
10
11
mysql> select video_name, count(*)
    -> from video_time
    -> group by video_name
    -> order by count(*) desc
    -> limit 1;
+---------------+----------+
| video_name    | count(*) |
+---------------+----------+
| Gangnam Style |        4 |
+---------------+----------+
1 row in set (0.00 sec)

That query won’t work in Hive because I can’t order by an aggregate function:

1
2
3
4
5
6
hive> select video_name, count(*)
    > from video_time
    > group by video_name
    > order by count(*) desc
    > limit 1;
FAILED: SemanticException [Error 10128]: Line 4:9 Not yet supported place for UDAF 'count'

In order to make that query work in Hive I simply alias the count:

1
2
3
4
5
6
7
hive> select video_name, count(*) as frequency
    > from video_time
    > group by video_name
    > order by frequency desc
    > limit 1;
OK
Gangnam Style 4

But that’s not what I’m actually trying to do. I want to group by video name and calculate the most frequently occurring duration for each video. This is the output I want:

1
2
3
4
5
6
7
8
+-----------------------+------------------------+
| video_name            | most_frequent_duration |
+-----------------------+------------------------+
| Charlie bit my finger |                     55 |
| Diet Coke + Mentos    |                    179 |
| Gangnam Style         |                    253 |
| How to Slice a Mango  |                    194 |
+-----------------------+------------------------+

In MySQL I can use the implementation I mentioned earlier with a query like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
mysql> select t1.video_name,t1.duration as most_frequent_duration
    -> from (select video_name, duration, count(*) as frequency
    ->       from video_time
    ->       group by video_name, duration) t1
    -> left outer join 
    ->      (select video_name, duration, count(*) as frequency
    ->       from video_time
    ->       group by video_name, duration) t2
    ->   on (t1.video_name = t2.video_name  
    ->     and (t1.frequency < t2.frequency 
    ->     or (t1.frequency = t2.frequency and t1.duration < t2.duration)))
    -> where t2.video_name is null
    -> order by t1.video_name;
+-----------------------+------------------------+
| video_name            | most_frequent_duration |
+-----------------------+------------------------+
| Charlie bit my finger |                     55 |
| Diet Coke + Mentos    |                    179 |
| Gangnam Style         |                    253 |
| How to Slice a Mango  |                    194 |
+-----------------------+------------------------+
4 rows in set (0.01 sec)

But when I run that query in Hive I get an error:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
hive> select t1.video_name,t1.duration as most_frequent_duration
    > from (select video_name, duration, count(*) as frequency
    >       from video_time
    >       group by video_name, duration) t1
    > left outer join 
    >      (select video_name, duration, count(*) as frequency
    >       from video_time
    >       group by video_name, duration) t2
    >   on (t1.video_name = t2.video_name
    >     and (t1.frequency < t2.frequency 
    >     or (t1.frequency = t2.frequency and t1.duration < t2.duration)))
    > where t2.video_name is null
    > order by t1.video_name;
FAILED: SemanticException [Error 10019]: Line 9:41 OR not supported in JOIN currently 'duration'

I could try to rewrite the query without an OR in the join by writing two separate queries and using UNION to combine them, but it still won’t work because Hive only supports equi-join, so I can’t include t1.frequency < t2.frequency in the join clause.

To get this to work in Hive I need to rewrite it with an equi-join, so the approach I took was to add another layer to one of the sub-queries to calculate the most frequent duration for each video name, then join the sub-query that has video name, duration, and frequency to that query.

Here’s what the final Hive query looks like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
hive> select t1.video_name,max(t1.duration) as most_frequent_duration
    > from 
    > (select video_name, duration, count(*) as frequency
    >       from video_time
    >       group by video_name, duration) t1
    > inner join 
    > (select t2.video_name,max(t2.frequency) as max_frequency
    > from (select video_name, duration, count(*) as frequency
    >       from video_time
    >       group by video_name, duration) t2
    > group by t2.video_name) t3 
    >   on t3.video_name = t1.video_name and t1.frequency = t3.max_frequency
    > group by t1.video_name
    > order by video_name;
OK
Charlie bit my finger 55
Diet Coke + Mentos    179
Gangnam Style 253
How to Slice a Mango  194
Comments

Comments