Update 18-June-2014
Please refer to Working with Hive in HDInsight, which is the updated version of this post. The original post was written when HDInsight was a separate entity in preview mode. Since then, HDInsight has been fully integrated into Microsoft Azure cloud services. The concepts explained in this post still hold, but some of the instructions and screen captures have changed significantly, so I recommend referring to the updated version of this post.
Original post
This post gives step-by-step instructions to create and load a Hive table on Windows Azure HDInsight. The steps are executed in the following sequence:
- Provision a Hadoop Cluster
- HiveQL DDL Statement
- Data preparation/ETL
- HiveQL DML for Data Loading
- HiveQL DML for Data Verification
Step 1: Provision a Hadoop Cluster
For this post, the preview version of Windows Azure HDInsight is used. At the time of writing, access to the preview is available by invitation. Once you have access to Windows Azure HDInsight, the first step is to create a Hadoop cluster; follow the instructions in this article to do so. I created a Hadoop cluster with the DNS name northstep for this exercise (Screen Capture 1).
Screen Capture 1 – Hadoop Cluster
Step 2: HiveQL DDL Statement
The next step is to create a Hive table using a HiveQL DDL statement. For this demonstration I am using a subset of data from BOCSAR (the NSW Bureau of Crime Statistics and Research). Specifically, I am using crime data by Premise Type by Offence, which is available for download in CSV format.
Log in to your Hadoop cluster and navigate to Interactive Console -> Hive. Copy the HiveQL DDL statement given below into the text box of the Interactive Hive console and hit Evaluate (Screen Capture 2).
CREATE TABLE IF NOT EXISTS incident_by_premise_type (
    incident_year  INT,
    division       STRING,
    incident       STRING,
    premise_type   STRING,
    incident_count INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001';
The above HiveQL DDL statement creates the table incident_by_premise_type in the default database if it does not already exist. Rows are delimited by the newline character and columns by the ctrl-A ('\001') character.
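To make the delimiter concrete, here is a small sketch in plain Python (my own illustration, not part of the HDInsight tooling; the sample values are made up) showing what one row of the data file looks like:

```python
# Build one row for the incident_by_premise_type table.
# Columns are joined with ctrl-A ('\x01'); each row ends with a newline.
DELIM = "\x01"

row = DELIM.join(["1995", "Sydney", "Arson", "Residential", "12"])

# Splitting on ctrl-A recovers the five columns of the table
print(row.split(DELIM))  # ['1995', 'Sydney', 'Arson', 'Residential', '12']
```

Because ctrl-A is a non-printing character, it is far less likely than a comma to appear inside the data itself, which is why Hive uses it as the default field delimiter.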
Check out these wikis for more information on Hive primitive data types and DDL Statements
Screen Capture 2 – Interactive Hive – DDL
Step 3: Data Preparation/ETL
Hive tables are usually loaded from text files or from another Hive table via a HiveQL DML statement. I used an SSIS package to prepare the BOCSAR data in the format required by the Hive table: columns delimited by ctrl-A, which is '\001', or Convert.ToChar(1) in C#. I used an SSIS script component as the destination after unpivoting the BOCSAR data. The code for the script component is given below. Note that column headers are not included as the first row of the text file.
#region Namespaces
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.IO;
#endregion

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    StreamWriter hiveFileStream;

    public override void PreExecute()
    {
        base.PreExecute();
        // Create (or replace) the ctrl-A delimited output file.
        String hiveFileName = Variables.hiveFilePath
            + Variables.csvFileName.Replace(".csv", ".txt");
        if (File.Exists(hiveFileName))
        {
            File.Delete(hiveFileName);
        }
        hiveFileStream = File.CreateText(hiveFileName);
    }

    public override void PostExecute()
    {
        base.PostExecute();
        hiveFileStream.Close();
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Join the five columns with ctrl-A; no delimiter after the last column.
        String delimiter = Convert.ToChar(1).ToString();
        String rowData =
            Row.dcYEAR.ToString() + delimiter +
            Row.DIVISION.ToString() + delimiter +
            Row.INCIDENT.ToString() + delimiter +
            Row.dcPREMISETYPE.ToString() + delimiter +
            Row.INCIDENTCOUNT.ToString();
        hiveFileStream.WriteLine(rowData);
    }
}
The Hive data files created by the SSIS package can be downloaded from here.
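If you do not have SSIS handy, the same CSV-to-ctrl-A transformation can be sketched in a few lines of Python (my own illustration, not part of the original package; the column order matches the table definition above):

```python
import csv
import io

DELIM = "\x01"

def csv_to_hive_rows(csv_text):
    """Convert CSV text (with a header row) into ctrl-A delimited Hive rows."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row; Hive data files must not contain one
    return [DELIM.join(row) for row in reader]

sample = "year,division,incident,premise_type,count\n1995,Sydney,Arson,Residential,12\n"
print(csv_to_hive_rows(sample))
```

Whichever tool you use, the essentials are the same: drop the header row and join the columns with '\001' so they match the FIELDS TERMINATED BY clause of the DDL.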
Step 4: HiveQL DML for Data Loading
When data is loaded into a Hive table, Hive simply moves the file from its ctrl-A delimited source into an HDFS folder structure of the form database > table > partition. In this exercise I am loading data from a local folder on the Hadoop cluster. To move the ctrl-A formatted Hive data files created by the SSIS package to the Hadoop cluster, connect via Remote Desktop using the cluster user id/password (Screen Capture 3). Create a local folder called bocsar_data on C: and copy the files over to this folder.
Screen Capture 3 – Remote Desktop
Navigate to Interactive Console -> Hive and execute the following HiveQL DML statements. Note that multiple HiveQL statements can be executed from the interactive console, provided each statement is terminated by a semicolon.
LOAD DATA LOCAL INPATH '/bocsar_data/Arson_Sydney_TressPass_Data_By_PremiseType.txt' INTO TABLE incident_by_premise_type;
LOAD DATA LOCAL INPATH '/bocsar_data/TransportOffence_Sydney_Data_By_PremiseType.txt' INTO TABLE incident_by_premise_type;
LOAD DATA LOCAL INPATH '/bocsar_data/Tresspass_Sydney_Data_By_PremiseType.txt' INTO TABLE incident_by_premise_type;
Check out this wiki for more information on the Hive Load statement.
Screen Capture 4- Load Data
Step 5: HiveQL DML for Data Verification
Now that the data is loaded into the Hive table incident_by_premise_type, verify the record counts by executing the following HiveQL SELECT statement. Check out this wiki for more information on the HiveQL Select statement.
select incident, count(1)
from incident_by_premise_type
group by incident;
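As a quick sanity check outside Hive, the same per-incident counts can be computed over the ctrl-A delimited files with plain Python (my own illustration; the sample rows below are made up):

```python
from collections import Counter

DELIM = "\x01"

def count_incidents(lines):
    """Tally rows per incident type, mirroring the HiveQL GROUP BY above."""
    # incident is the third column of incident_by_premise_type
    return Counter(line.split(DELIM)[2] for line in lines)

rows = [
    DELIM.join(["1995", "Sydney", "Arson", "Residential", "12"]),
    DELIM.join(["1996", "Sydney", "Arson", "Business", "7"]),
    DELIM.join(["1995", "Sydney", "Trespass", "Residential", "3"]),
]
print(count_incidents(rows))  # Counter({'Arson': 2, 'Trespass': 1})
```

If the counts from Hive match the row counts in your source files, the load completed cleanly.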
That’s it – you now have a Hive table loaded with data.