Flite Careers

Analyzing Apache Access Logs in Hive Without RegexSerDe

Parsing Apache logs is a well documented Hive use case, but when I tried to implement it recently using RegexSerDe I could not get it to work. I copied the gzipped Apache logs to S3 and used Amazon Elastic MapReduce to create the Hive table as documented, but when I tried to select from the table all of the values came out as NULL. I tried some different versions of the Hive DDL that I found online, but none of them worked.

Rather than continuing to debug that implementation I decided to take a different approach. Rather than using RegexSerDe, I just created a table with one column representing the entire log line and used the Hive regexp_extract function to select various virtual columns.

For example, here’s the table in Hive:

1
2
3
4
5
6
CREATE EXTERNAL TABLE IF NOT EXISTS apache_log (
  log_line STRING
)
PARTITIONED BY (dt STRING)
STORED AS TEXTFILE
LOCATION '${BUCKET}/apache';

And here’s a query to extract the HTTP response code from each log line so we can count how many times each response code occurs in a given day:

1
2
3
4
5
select regexp_extract(log_line,' ([0-9]{3}) (?:.*[0-9]{1,6})',1) as http_response_code, 
  count(1)
from apache_log 
where dt='2013-08-12'
group by regexp_extract(log_line,' ([0-9]{3}) (?:.*[0-9]{1,6})',1);

The rest of my project mostly involved parsing the request. I was going to have to use regexp_extract to do that anyways, so using that function to do all of the parsing was pretty easy.

Comments

Comments