Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform.
Login to Azure with your subscription.
Locate your HDInsight cluster and click Manage Cluster (Screen Capture 1) to bring up the Hive page.

Screen Capture 1 –HDInsight Cluster
I am using data from Twitter archive which can be downloaded by following these instructions. This archive contains all your tweets from day one and information about each tweet which will help to determine the structure of Hive table.
From the Hive page, execute the following DDL statement to create a Hive database and table (Screen Capture 2)
CREATE DATABASE IF NOT EXISTS twitter LOCATION '/hive/warehouse'; CREATE TABLE IF NOT EXISTS twitter.tweets( tweet_id BIGINT, in_reply_to_status_id BIGINT, in_reply_to_user_id INT, status_timestamp TIMESTAMP, source STRING, text STRING, retweeted_status_id BIGINT, retweeted_status_user_id INT, retweeted_status_timestamp TIMESTAMP, expanded_urls STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Screen Capture 2 – Hive DDL
Convert the tweets.csv to a tab separated text file and remove the header row. Copy the text file to Azure Storage using Azure Storage Explorer. Check the user guide of Azure Storage Explorer if you need help to upload the file.
Once the data file is uploaded to Azure Storage, load the data into Hive table using this statement (Screen Capture 3)

Screen Capture 3 – Load Data to Hive
LOAD DATA INPATH '/tweets_formatted.txt' INTO TABLE twitter.tweets;
Now execute the following Hive DML Statement.
SELECT text, word FROM twitter.tweets LATERAL VIEW explode(split(text,' ')) text_ex as word;
The explode() is a Hive built-in User Defined Table-Generating Function (UDTF) that breaks down a array into its elements. In this case the tweet gets broken into words. The LATERAL VIEW joins the output of explode() to the input row (tweet) creating a result set that contains n rows (words) for each tweet as shown below
Tweet | Output |
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData | Working |
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData | with |
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData | #Hive |
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData | in |
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData | #HDInsight |
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData | http://t.co/YseuEuD2yf |
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData | #BigData |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | SSIS: |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Creating |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Dynamic |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Data-driven |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | SSAS |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | Partitions |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | http://t.co/dmC9RdnZ1H |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | #SSIS |
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS | #SSAS |
This gives us a result set where the #hashtags can be easily filtered from all other words and aggregated to give a simple statistics about the commonly used #hashtags.
SELECT word, count(1) as wcount FROM twitter.tweets LATERAL VIEW explode(split(text,' ')) text_ex as word WHERE word LIKE '#%' GROUP BY word ORDER BY wcount DESC LIMIT 5;
The output would be something like this
#SSAS | 85 |
#MDX | 72 |
#SSIS | 61 |
#Hive | 15 |
#HDInsight | 8 |
This simple example illustrates the capabilities of in-built Hive statements for common analytical tasks which could be accomplished without the need to write custom mappers and reducers.
Related Links
Working with Hive in HDInsight
Upload data using Azure Storage Explorer
https://twitter.com/vnarayanaswamy/status/493653111500595200
One thought on “HDInsight: Analysing Twitter Hashtags using Hive”