Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Hortonworks Data Platform (HDP). I wrote a similar post using Azure HDInsight, but this time it’s on HDP.
For this post, I am using Hortonworks Sandbox on Hyper-V. The data I am using is from Twitter archive which can be downloaded by following these instructions. The twitter archive contains all your tweets from day one and information about each tweet which will help to determine the structure of Hive table.
From the Hyper-V Manager, connect to the Hortonworks Sandbox.
Login to sandbox and click Beeswax which is the Hive UI (Screen Capture 1)
Screen Capture 1 – Beeswax
From the query editor execute the following DDL statement to create a Hive database
CREATE DATABASE IF NOT EXISTS twitter LOCATION '/hive/warehouse';
Switch the database from default to twitter (Screen Capture 2)
Screen Capture 2 – Change Database
Execute the following DDL statement to create a Hive table.
CREATE TABLE IF NOT EXISTS twitter.tweets( tweet_id BIGINT, in_reply_to_status_id BIGINT, in_reply_to_user_id INT, status_timestamp TIMESTAMP, source STRING, text STRING, retweeted_status_id BIGINT, retweeted_status_user_id INT, retweeted_status_timestamp TIMESTAMP, expanded_urls STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Convert the tweets.csv to a tab separated text file and remove the header row. Copy the text file to HDFS using the File Browser (Screen Capture 3)
Screen Capture 3 – File Browser
Once the data file is uploaded to HDFS, load the data into Hive table using this statement
LOAD DATA INPATH '/user/hue/tweets_formatted.txt' INTO TABLE twitter.tweets;
Now execute the following Hive DML Statement from query editor.
SELECT text, word FROM twitter.tweets LATERAL VIEW explode(split(text,' ')) text_ex as word;
The explode() is a Hive built-in User Defined Table-Generating Function (UDTF) that breaks down a array into its elements. In this case the tweet gets broken into words. The LATERAL VIEW joins the output of explode() to the input row (tweet) creating a result set that contains n rows (words) for each tweet as shown below
Tweet | Output |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | SSIS: |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Creating |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Dynamic |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Data-driven |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | SSAS |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Partitions |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | http://t.co/dmC9RdnZ1H |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | #SSIS |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | #SSAS |
This result set can be easily filtered to separate hashtags from all other words to give a simple statistics about the commonly used hashtags.
SELECT word, count(1) as wcount FROM twitter.tweets LATERAL VIEW explode(split(text,' ')) text_ex as word WHERE word LIKE '#%' GROUP BY word ORDER BY wcount DESC LIMIT 5
The output would be something like this
#SSAS | 85 |
#MDX | 72 |
#SSIS | 61 |
#Hive | 15 |
#HDInsight | 8 |
If you wish, you can save the query to a HDFS folder for later use.
This simple example illustrates the capabilities of in-built Hive statements for common analytical tasks which could be accomplished without the need to write custom mappers and reducers.