Energy Rating Analysis of Air Conditioners using R Decision Trees

A decision tree is a data mining model that graphically represents the parameters most likely to influence an outcome, along with the extent of their influence. The output resembles a tree or flowchart with nodes, branches and leaves: the nodes represent the parameters, the branches represent the classification question or decision, and the leaves represent the outcome (Screen Capture 1). Internally, the decision tree algorithm recursively partitions the input dataset and assigns each record to the segment of the tree where it fits best.
Several R packages can generate decision trees. For this post, I am using the ctree() function from the party package. The input data is the energy rating of household air conditioners.
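
As a rough illustration of what a ctree() call looks like, here is a minimal sketch. It assumes the party package is installed, and it uses R's built-in iris dataset as a stand-in, since the air conditioner data is not reproduced in this excerpt. The formula and the name fit are illustrative only.

```r
# Assumption: the party package is installed (install.packages("party")).
library(party)

# Assumption: iris stands in for the air conditioner dataset.
# Species is the outcome; every other column is a candidate parameter.
fit <- ctree(Species ~ ., data = iris)

# Text summary of the splits chosen by the algorithm
print(fit)

# Tree diagram with nodes, branches and leaves
plot(fit)

# Compare the predicted class of each record with its actual class
confusion <- table(predicted = predict(fit), actual = iris$Species)
print(confusion)
```

With the real data, the outcome column and the input table would replace Species and iris in the call.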

R: Box Plot

A box plot is an effective way to visualize the distribution of your data. It only takes a few lines of code in R to produce a basic box plot.
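
A basic box plot might look something like the sketch below. It uses base R's boxplot() function with the built-in mtcars dataset as a stand-in, because the post's own data is not shown in this excerpt. The titles and axis labels are illustrative.

```r
# Assumption: mtcars (built into R) stands in for the post's data.
# Draw one box per cylinder count, showing the spread of fuel economy.
bp <- boxplot(mpg ~ cyl,
              data = mtcars,
              main = "Fuel economy by cylinder count",
              xlab = "Cylinders",
              ylab = "Miles per gallon")

# boxplot() also returns the numbers behind the picture:
# each column of bp$stats holds the lower whisker, lower hinge,
# median, upper hinge and upper whisker for one group.
print(bp$stats)
```

The returned statistics are handy when you want to report the median or quartiles alongside the chart.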

Pig: Using CUBE Operator to Analyse Energy Rating of Air Conditioners


Pig: Expense Analysis using GROUP Operator


HDP: Analysing Twitter Hashtags using Hive


Power Query: Import Data from Web Pages

Power Query can discover and import data from websites. Data warehouses often rely on external data that is readily available on public websites, e.g. public holidays, school holidays, daylight saving dates, SIC codes, SWIFT codes and post codes. Power Query is well suited to such situations. It can discover, fetch and transform data from an HTML table in a web page into a format that can be easily imported into a data warehouse using an SSIS package. It is like an ETL tool for web page data sources.

Azure: Delete HDInsight Cluster, yet Retain Data


In Hadoop, the storage cluster (HDFS) is also the processing cluster (MapReduce). Azure provides two different options for storing data:

Option 1: Use the HDInsight cluster both to store data and to process MapReduce requests, e.g. a Hive database hosted in an HDInsight cluster that also executes the HiveQL MapReduce queries. In this case the data is stored in the cluster's HDFS.

Option 2: Use the HDInsight cluster only to process MapReduce requests, with the data stored in Azure blob storage, e.g. the Hive data is stored in Azure storage while the HDInsight cluster executes the HiveQL MapReduce queries. Here the metadata of the Hive database is stored in the cluster, whereas the actual data is stored in Azure storage. The HDInsight cluster is co-located in the same datacentre as the Azure storage and connected to it by a high-speed network.

