HDInsight: Analysing Twitter Hashtags using Hive

Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform.

Login to Azure with your subscription.

Locate your HDInsight cluster and click Manage Cluster (Screen Capture 1) to bring up the Hive page.

Screen Capture 1 –HDInsight Cluster

I am using data from Twitter archive which can be downloaded by following these instructions. This archive contains all your tweets from day one and information about each tweet which will help to determine the structure of Hive table.

From the Hive page, execute the following DDL statement to create a Hive database and table (Screen Capture 2)

CREATE DATABASE IF NOT EXISTS twitter LOCATION '/hive/warehouse';
CREATE TABLE IF NOT EXISTS twitter.tweets(
tweet_id BIGINT,
in_reply_to_status_id BIGINT,
in_reply_to_user_id INT,
status_timestamp TIMESTAMP,
source STRING,
text STRING,
retweeted_status_id BIGINT,
retweeted_status_user_id INT,
retweeted_status_timestamp TIMESTAMP,
expanded_urls STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Screen Capture 2 – Hive DDL

Convert the tweets.csv to a tab separated text file and remove the header row. Copy the text file to Azure Storage using Azure Storage Explorer. Check the user guide of Azure Storage Explorer if you need help to upload the file.

Once the data file is uploaded to Azure Storage, load the data into Hive table using this statement (Screen Capture 3)

Screen Capture 3 – Load Data to Hive

LOAD DATA INPATH '/tweets_formatted.txt' INTO TABLE twitter.tweets;

Now execute the following Hive DML Statement.

SELECT text, word FROM twitter.tweets
LATERAL VIEW explode(split(text,' ')) text_ex as word;

The explode() is a Hive built-in User Defined Table-Generating Function (UDTF) that breaks down a array into its elements. In this case the tweet gets broken into words. The LATERAL VIEW joins the output of explode() to the input row (tweet) creating a result set that contains n rows (words) for each tweet as shown below

Tweet Output
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData Working
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData with
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData #Hive
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData in
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData #HDInsight
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData http://t.co/YseuEuD2yf
Working with #Hive in #HDInsight http://t.co/YseuEuD2yf #BigData #BigData
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS SSIS:
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Creating
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Dynamic
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Data-driven
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS SSAS
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS Partitions
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS http://t.co/dmC9RdnZ1H
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS #SSIS
SSIS: Creating Dynamic Data-driven SSAS Partitions http://t.co/dmC9RdnZ1H #SSIS #SSAS #SSAS

This gives us a result set where the #hashtags can be easily filtered from all other words and aggregated to give a simple statistics about the commonly used #hashtags.

SELECT word, count(1) as wcount FROM twitter.tweets
LATERAL VIEW explode(split(text,' ')) text_ex as word
WHERE word LIKE '#%'
GROUP BY word
ORDER BY wcount DESC LIMIT 5;

The output would be something like this

#SSAS 85
#MDX 72
#SSIS 61
#Hive 15
#HDInsight 8

This simple example illustrates the capabilities of in-built Hive statements for common analytical tasks which could be accomplished without the need to write custom mappers and reducers.

Related Links

Working with Hive in HDInsight

HiveQL

Azure Storage Explorer

Upload data using Azure Storage Explorer

https://twitter.com/vnarayanaswamy/status/493653111500595200

One thought on “HDInsight: Analysing Twitter Hashtags using Hive

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s