A box plot is an effective way to visualize the distribution of your data. It takes only a few lines of code in R to produce a basic box plot.
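For instance, a minimal sketch in base R using the built-in mtcars dataset (the dataset and grouping are illustrative, not taken from the post):

```r
# Minimal sketch: box plot of MPG grouped by cylinder count, using base R.
# mtcars ships with R; the grouping variable is purely illustrative.
boxplot(mpg ~ cyl, data = mtcars,
        main = "Miles per Gallon by Cylinder Count",
        xlab = "Cylinders", ylab = "MPG",
        col  = "lightblue")
```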
Continue Reading
Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners

The CUBE operator in Pig computes all possible combinations of the specified fields. In this post I will demonstrate the use of the CUBE operator to analyse the energy rating of air conditioners on Hortonworks Data Platform (HDP). Continue Reading
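As a rough illustration (not the post's actual script), a minimal Pig Latin sketch of CUBE over a hypothetical air-conditioner file with brand, type and star_rating fields:

```pig
-- Minimal sketch: average star rating for every combination of brand and type.
-- File name and schema are hypothetical.
aircon = LOAD 'aircon.csv' USING PigStorage(',')
         AS (brand:chararray, type:chararray, star_rating:double);
cubed  = CUBE aircon BY CUBE(brand, type);
rated  = FOREACH cubed GENERATE
             FLATTEN(group) AS (brand, type),
             AVG(cube.star_rating) AS avg_star_rating;
DUMP rated;
```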
Pig: Expense Analysis using GROUP Operator

In this post I will demonstrate how to use Pig’s GROUP operator to analyse credit card expenses and determine the top expenses for the year and their percentage of the total expense. This exercise was done on Hortonworks Data Platform (HDP). Continue Reading
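A minimal Pig Latin sketch of the GROUP step, assuming a hypothetical expenses.csv with date, category and amount fields (not the post's actual data):

```pig
-- Minimal sketch: total spend per category, largest first.
expenses = LOAD 'expenses.csv' USING PigStorage(',')
           AS (txn_date:chararray, category:chararray, amount:double);
by_cat   = GROUP expenses BY category;
totals   = FOREACH by_cat GENERATE
               group AS category,
               SUM(expenses.amount) AS total_spend;
ranked   = ORDER totals BY total_spend DESC;
top_five = LIMIT ranked 5;
DUMP top_five;
```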
HDP: Analysing Twitter Hashtags using Hive

Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstract the implementation of mappers and reducers behind a simple yet powerful SQL-like query language. To demonstrate these capabilities, I will be analysing hashtags from a Twitter feed on Hortonworks Data Platform (HDP). Continue Reading
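To give a flavour of the approach, a minimal HiveQL sketch, assuming a hypothetical tweets table with a text column already loaded into Hive:

```sql
-- Minimal sketch: split tweet text into words, keep the hashtags, count them.
SELECT LOWER(word) AS hashtag, COUNT(*) AS mentions
FROM tweets
LATERAL VIEW EXPLODE(SPLIT(text, ' ')) exploded AS word
WHERE word RLIKE '^#'
GROUP BY LOWER(word)
ORDER BY mentions DESC
LIMIT 10;
```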
Power Query: Import Data from Web Pages

Power Query can discover and import data from websites. Data warehouses often rely on external data that is readily available on public websites, e.g. public holidays, school holidays, daylight savings, SIC codes, SWIFT codes, post codes etc. Power Query is perfectly suited to such situations: it can discover, fetch and transform data from an HTML table in a web page into a format that can easily be imported into a data warehouse using an SSIS package. It’s like an ETL tool for web page data sources.
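A minimal Power Query (M) sketch, assuming a hypothetical page whose first HTML table lists public holidays (URL and column names are made up):

```
let
    // Web.Page parses the HTML returned by Web.Contents into its tables
    Source = Web.Page(Web.Contents("https://example.com/public-holidays")),
    // Take the first table found on the page
    FirstTable = Source{0}[Data],
    // Type the columns so they load cleanly downstream
    Typed = Table.TransformColumnTypes(FirstTable,
        {{"Date", type date}, {"Holiday", type text}})
in
    Typed
```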
Continue Reading
Azure: Delete HDInsight Cluster, yet Retain Data

In Hadoop, the storage cluster (HDFS) is also the processing cluster (MapReduce). Azure provides two different options to store data:
Option 1: Use the HDInsight cluster to store data as well as to process MapReduce requests. For example, a Hive database hosted in an HDInsight cluster which also executes HiveQL MapReduce queries. In this instance data is stored in the cluster’s HDFS.
Option 2: Use the HDInsight cluster only to process MapReduce requests, while the data is stored in Azure blob storage. For example, the Hive data is stored in Azure storage while the HDInsight cluster executes HiveQL MapReduce queries. Here the metadata of the Hive database is stored in the cluster whereas the actual data is stored in Azure storage. The HDInsight cluster is co-located in the same datacentre as the Azure storage and connected by a high-speed network.
There are several advantages to using Azure storage (Option 2). Continue Reading
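To illustrate Option 2, a minimal HiveQL sketch of a table whose data lives in Azure blob storage rather than the cluster's HDFS (storage account, container and table names are hypothetical):

```sql
-- Minimal sketch: the table definition (metadata) lives in the cluster,
-- while the files it points to stay in the wasb:// container and survive
-- deletion of the cluster.
CREATE EXTERNAL TABLE tweets (
    id    BIGINT,
    text  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
LOCATION 'wasb://data@mystorageaccount.blob.core.windows.net/hive/tweets';
```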
HDInsight: Analysing Twitter Hashtags using Hive

Hive implements MapReduce using HiveQL. The built-in capabilities of HiveQL abstract the implementation of mappers and reducers behind a simple yet powerful SQL-like query language. To demonstrate these capabilities, I will be analysing hashtags from a Twitter feed on the Azure HDInsight platform.
Continue Reading
Working with Hive in HDInsight

This post is a tutorial for getting started with Hive in HDInsight.
The steps to be followed are given below; a minimal HiveQL sketch of the main steps appears after the list. As a prerequisite, you will need a Microsoft Azure subscription to try them out:
- Provision Azure Storage Account
- Provision HDInsight Cluster
- Create Hive Database and Tables
- Prepare Data as Ctrl-A separated Text Files
- Upload Text Files to Azure Storage
- Load Data to Hive
- Execute HiveQL DML Jobs
Continue Reading
SSIS: Creating Dynamic Data-driven SSAS Partitions

Partitions improve the query response of an SSAS cube by narrowing the MDX query to a pre-defined subset of data. Usually cubes are developed with a few basic partitions to begin with, and new partitions are added over time. Partitions are created using XMLA commands. The rest of this post explains the steps to create partitions using metadata and an SSIS package.
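For context, a minimal sketch of the kind of XMLA Create command involved; the database, cube, measure group and query names are hypothetical, and in practice the values would come from metadata via the SSIS package:

```xml
<!-- Minimal sketch: create one partition bound to a month-filtered query. -->
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>SalesDB</DatabaseID>
    <CubeID>SalesCube</CubeID>
    <MeasureGroupID>FactSales</MeasureGroupID>
  </ParentObject>
  <ObjectDefinition>
    <Partition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <ID>FactSales_201501</ID>
      <Name>FactSales_201501</Name>
      <Source xsi:type="QueryBinding">
        <DataSourceID>SalesDW</DataSourceID>
        <QueryDefinition>SELECT * FROM dbo.FactSales WHERE MonthKey = 201501</QueryDefinition>
      </Source>
      <StorageMode>Molap</StorageMode>
    </Partition>
  </ObjectDefinition>
</Create>
```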
Continue Reading
SSIS: Dynamic Processing of SSAS Partitions

SSAS partitions are primarily geared towards better query performance by narrowing the MDX query to a pre-defined subset of data. Each partition must be in a processed state for the cube to be available for browsing. However, in a delta ETL run it is more than likely that only some partitions will be refreshed with data while the rest remain unchanged. For example, in a hospital admissions cube partitioned by month, data movements are frequent for admissions made during the current or previous month compared to earlier months. This means that if the partitions updated by the ETL run can be identified, the SSIS package can be tailored to dynamically process only those partitions that require a data refresh while leaving the remaining partitions untouched.
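A minimal sketch of the corresponding XMLA Process command for a single partition; the object IDs are hypothetical and would be supplied dynamically by the SSIS package:

```xml
<!-- Minimal sketch: fully process just the partition touched by the delta load. -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>HospitalDB</DatabaseID>
    <CubeID>AdmissionsCube</CubeID>
    <MeasureGroupID>FactAdmissions</MeasureGroupID>
    <PartitionID>FactAdmissions_201501</PartitionID>
  </Object>
  <Type>ProcessFull</Type>
</Process>
```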
Continue Reading