In Hadoop, the storage cluster (HDFS) is also the processing cluster (MapReduce). Azure provides two options for storing data:
Option 1: Use the HDInsight cluster both to store data and to process MapReduce requests. For example, a Hive database is hosted in an HDInsight cluster that also executes the HiveQL MapReduce queries. In this case the data is stored in the cluster’s HDFS.
Option 2: Use the HDInsight cluster only to process MapReduce requests, and store the data in Azure blob storage. For example, the Hive data is stored in Azure storage while the HDInsight cluster executes the HiveQL MapReduce queries. Here the metadata of the Hive database is stored in the cluster, whereas the actual data is stored in Azure storage. The HDInsight cluster is co-located in the same datacentre as the Azure storage and connected to it by a high-speed network.
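To make Option 2 a little more concrete, here is a minimal sketch of how a location in Azure blob storage might be addressed from Hive. It assumes the wasb:// URI scheme of the form wasb://container@account.blob.core.windows.net/path; the function name and the container and account names are made up for illustration and are not from any SDK.

```python
def wasb_uri(container: str, account: str, path: str = "") -> str:
    """Build a wasb:// URI for a path inside an Azure blob container.

    Assumption: the cluster addresses blob storage with the wasb scheme.
    The container and account names passed in are placeholders.
    """
    # Strip a leading slash so the URI never contains a double slash
    # between the host part and the path.
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only.
sales_location = wasb_uri("hivedata", "mystorageaccount", "/sales/")
print(sales_location)
```

A URI built this way can then be used as the location of a Hive table, so the data lives in the storage account rather than on the cluster nodes.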
There are several advantages to using Azure storage (Option 2). Brad Sarsfield and Danny Lee have highlighted these in their post, Why use Blob Storage with HDInsight on Azure. Another advantage, which I find particularly useful when working with a Hive database, is that the HDInsight cluster can be created on demand and need not always be available. I can delete my HDInsight cluster when I am done with the data analysis (Screen Capture 1), and my Hive data is still retained in Azure storage. Should I need to revisit my data analysis, all I have to do is provision the HDInsight cluster again.
Screen Capture 1 – Delete HDInsight Cluster
By default, the metadata of the Hive database is stored on the head node of the HDInsight cluster. So if the cluster is deleted, the metadata of the Hive database is deleted as well. The metadata includes the database, schema and table definitions. After provisioning a new cluster, you therefore have to execute the DDL scripts of the Hive database again. There is no need to reload the data. You can avoid this step altogether if you use a SQL database to store the Hive metadata. When provisioning the cluster, select the option “Enter the Hive/Oozie Metastore” and complete the configuration.
Screen Capture 2 – Hive/Oozie Metastore
Now you can execute HiveQL queries on the new HDInsight cluster without reloading the data.
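If you stay with the default metastore instead, the DDL scripts have to be replayed on every new cluster. One way to keep such a script handy is to generate it. The sketch below assumes external tables over delimited text files; the helper name, table, columns and storage location are hypothetical, and your own DDL may well look different.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Return a HiveQL CREATE EXTERNAL TABLE statement as a string.

    columns is a list of (name, hive_type) pairs. An external table is
    assumed here so that the table definition only points at the files
    already in storage and no data has to be reloaded.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical table and location, for illustration only.
ddl = external_table_ddl(
    "sales",
    [("id", "INT"), ("amount", "DOUBLE")],
    "wasb://hivedata@mystorageaccount.blob.core.windows.net/sales/",
)
print(ddl)
```

The printed statement can be saved to a script file and run against the freshly provisioned cluster before the first query.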
Running an HDInsight cluster costs considerably more than storing the data. Provisioning the HDInsight cluster on demand while retaining the Azure storage is cost effective, especially when there are spending limits on your Azure subscription (Action Pack, for instance).
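The arithmetic behind this can be sketched in a few lines. All rates and sizes below are made-up placeholders, not Azure prices; substitute the figures from your own subscription.

```python
def monthly_cost(cluster_rate_per_hour, cluster_hours,
                 storage_rate_per_gb_month, storage_gb):
    """Monthly bill = cluster hours actually used + storage kept all month."""
    cluster = cluster_rate_per_hour * cluster_hours
    storage = storage_rate_per_gb_month * storage_gb
    return cluster + storage


# Placeholder numbers, NOT real Azure pricing.
CLUSTER_RATE = 2.0    # currency units per cluster hour (assumed)
STORAGE_RATE = 0.05   # currency units per GB per month (assumed)
DATA_GB = 100         # size of the Hive data (assumed)

# Cluster left running for a 30-day month versus 20 hours of analysis.
always_on = monthly_cost(CLUSTER_RATE, 24 * 30, STORAGE_RATE, DATA_GB)
on_demand = monthly_cost(CLUSTER_RATE, 20, STORAGE_RATE, DATA_GB)

print(f"always on: {always_on:.2f}, on demand: {on_demand:.2f}")
```

With these placeholder figures the storage share is the same in both cases, so the whole difference comes from the hours the cluster is kept alive.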