Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners

ApachePig

CUBE operator in Pig computes all possible combination of the specified fields. In this post I will demonstrate the use of Cube operator to analyse energy rating of air conditioners in Hortonworks Data Platform (HDP). Continue Reading

Pig: Expense Analysis using GROUP Operator

ApachePig

In this post I will demonstrate how to use Pig’s GROUP operator to analyse credit card expenses and determine the top expenses for the year and their percentage of the total expense. This exercise was done in Hortonworks Data Platform (HDP). Continue Reading

HDP: Analysing Twitter Hashtags using Hive

Twitter

Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Hortonworks Data Platform (HDP). Continue Reading

Power Query: Import Data from Web Pages

Power Query can discover and import data from websites. Often data warehouses rely on external data which is readily available in public websites for e.g. public holidays, school holidays, daylight savings, SIC codes, SWIFT codes, post codes etc. Power Query is perfectly suitable for such situations. Power Query can discover, fetch and transform data from a HTML table in a web page into a format that can be easily imported into data warehouses using SSIS package. It’s like an ETL tool for the web page data source.
Continue Reading

Azure: Delete HDInsight Cluster, yet Retain Data

Azure

Storage cluster (HDFS) in Hadoop is also the Processing cluster (MapReduce). Azure provides two different options to store data:

Option 1: Use HDInsight cluster to store data as well as to process MapReduce requests. For e.g. a Hive database hosted in an HDInsight cluster which also executes HiveQL MapReduce queries. In this instance data is stored in the cluster’s HDFS.

Option 2: Use HDInsight cluster to only process MapReduce requests whereas data is stored in Azure blob storage. For e.g. the Hive data is stored in Azure storage while the HDInsight cluster executes HiveQL MapReduce queries. Here the metadata of Hive database is stored in the cluster whereas the actual data is stored in Azure storage. The HDInsight cluster is co-located in the same datacentre as the Azure storage and connected by high speed network.

There are several advantage of using Azure storage (Option 2). Continue Reading

HDInsight: Analysing Twitter Hashtags using Hive

Twitter

Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstracts the implementation of mappers and reducers with a simple yet powerful SQL like query language. To demonstrate the inbuilt capabilities of HiveQL, I will be analysing hashtags from a twitter feed on Azure HDInsight platform. Continue Reading