The Modern Data Warehouse: A Low-Cost Solution Using Big Data with HDInsight and PowerShell

Organizations have been reluctant to move their business intelligence solutions from traditional data warehouse infrastructure to big data for many reasons, but two of those reasons are false barriers to entry: cost and the complexity of provisioning environments. In this post, we will cover how to use PowerShell to commission and decommission HDInsight clusters on Azure. Used correctly, this lets an organization cut the cost of a big data solution to under $100 a month: create the cluster on demand on evenings or weekends, process the heavy batch workloads, write out the data files you need, and then delete the cluster so no charges are incurred the rest of the time. The storage files can remain, so when you spin up the cluster again, it is as if nothing ever happened to your previous data or work. The scripts below will remove the storage files as well, so they will need to be modified if you want to keep that portion around. You can also move your storage to a different resource group to avoid it being removed.

Before we start, it is important to note that keeping the clusters running all the time is not very expensive if you only require the lower end of the infrastructure. Cost only goes up as you scale up, and if cost is not a concern, you will reap the benefits of an always-on big data solution that can process live transactional data. If you are spinning clusters up and down, you are limited to the large-volume batch processing that big data solutions provide.

Provisioning A Server

  • All you need to get off the ground is an Azure subscription and PowerShell, which comes natively installed with Windows. To run PowerShell, simply open search and type in “PowerShell ISE” to launch the tool.
  • Download and install Azure PowerShell from the Azure Features page. This gives you all the cmdlets needed to run PowerShell commands against your Azure environment. The current download can be found at the following link.
    1. https://github.com/Azure/azure-powershell/releases/tag/v6.7.0-August2018
  • Once installed, when running PowerShell ISE, you should now see the Azure commands in the command pane to the right.
  • See the script below to provision a cluster, with credit to edX online education. This script creates a single resource group, a storage account and a cluster. It creates the lowest-cost cluster required for an HDInsight big data solution, which can be scaled up or down once provisioned.
    1. The variables at the top define all your names and passwords. Please be sure to update these accordingly and save them to a secure location.
    2. You can change the names, locations and credentials if you wish to spin up multiple instances.
    3. The line “Login-AzureRmAccount” is where the script pauses during execution to obtain your Azure credentials. These are the credentials you use to log in to your Azure Portal.
  • Once started, the script will post updates. This step can take anywhere between 10 and 20 minutes to complete. Once it completes, you will have a fully functioning HDInsight cluster on Azure, ready to go.
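
The provisioning steps described above (resource group, storage account, then cluster) can be sketched roughly as follows. This is a minimal illustration and not the original edX script. It assumes the AzureRM module from the release linked above is installed. Every name, the region, the HDInsight version and the node count are placeholders to replace with your own values.

```powershell
# Sketch only: every value in this block is a placeholder (assumption), not from the original script.
$resourceGroupName  = "bigdata-demo-rg"       # hypothetical resource group name
$location           = "East US"               # any region that offers HDInsight
$storageAccountName = "bigdatademostore01"    # 3-24 lowercase letters/digits, globally unique
$clusterName        = "bigdata-demo-cluster"  # globally unique
$containerName      = $clusterName            # blob container used as the cluster's default file system

# Prompt for the cluster login and SSH credentials rather than hard-coding passwords.
$httpCredential = Get-Credential -Message "Cluster login credentials" -UserName "admin"
$sshCredential  = Get-Credential -Message "SSH user credentials" -UserName "sshuser"

# The script pauses here for the credentials you use to sign in to the Azure Portal.
Login-AzureRmAccount

# 1. Resource group that will hold both the storage account and the cluster.
New-AzureRmResourceGroup -Name $resourceGroupName -Location $location

# 2. Storage account plus a blob container for the cluster's default storage.
New-AzureRmStorageAccount -ResourceGroupName $resourceGroupName `
    -Name $storageAccountName -Location $location `
    -SkuName Standard_LRS -Kind Storage

$storageKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName `
    -Name $storageAccountName)[0].Value
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageKey
New-AzureStorageContainer -Name $containerName -Context $storageContext

# 3. The HDInsight cluster itself. This is the step that takes 10 to 20 minutes.
$clusterParams = @{
    ResourceGroupName         = $resourceGroupName
    ClusterName               = $clusterName
    Location                  = $location
    ClusterType               = "Hadoop"   # assumed cluster type
    OSType                    = "Linux"
    Version                   = "3.6"      # assumed HDInsight version; adjust to what your subscription offers
    ClusterSizeInNodes        = 1          # worker node count; assumed smallest setting
    HttpCredential            = $httpCredential
    SshCredential             = $sshCredential
    DefaultStorageAccountName = "$storageAccountName.blob.core.windows.net"
    DefaultStorageAccountKey  = $storageKey
    DefaultStorageContainer   = $containerName
}
New-AzureRmHDInsightCluster @clusterParams
```

Passing the cluster settings as a hashtable (splatting) keeps them in one place, so spinning up a second instance is mostly a matter of changing the variables at the top.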

Deleting and Cleanup

  • The following script is all you need to clear out all the items created in the previous step. By doing this, you will ensure nothing is still actively running on Azure and you are not incurring any cost.
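
A minimal cleanup sketch, assuming everything was created in a single resource group as described above. The resource group and cluster names are placeholders and must match whatever you used when provisioning.

```powershell
# Sketch only: placeholder names (assumptions) that must match your provisioning run.
$resourceGroupName = "bigdata-demo-rg"
$clusterName       = "bigdata-demo-cluster"

# Sign in with the same Azure Portal credentials used for provisioning.
Login-AzureRmAccount

# Deletes the resource group and everything in it: the cluster AND the storage account.
# -Force skips the confirmation prompt.
Remove-AzureRmResourceGroup -Name $resourceGroupName -Force

# Alternative that keeps the storage files: delete only the cluster and leave the
# resource group and storage account in place. Use this INSTEAD of the line above.
# Remove-AzureRmHDInsightCluster -ResourceGroupName $resourceGroupName -ClusterName $clusterName
```

Deleting only the cluster stops the compute charges while the data stays in the storage account, which continues to be billed at storage rates.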

Between provisioning a cluster and cleaning it up is where you will want to add your big data processing automation steps. You can string together a series of PowerShell scripts this way to knock out a ton of batch processing very cost-effectively.
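
One way to string the steps together is to wrap the processing in a try/finally block so the cleanup runs even if a job fails. The sketch below assumes the cluster has already been provisioned earlier in the same session. The names and the Hive query are placeholders for illustration only.

```powershell
# Sketch only: placeholder names and query (assumptions); the cluster must already exist.
$resourceGroupName = "bigdata-demo-rg"
$clusterName       = "bigdata-demo-cluster"
$httpCredential    = Get-Credential -Message "Cluster login credentials" -UserName "admin"

try {
    # Define a batch job. The query here is a trivial stand-in for real processing.
    $jobDefinition = New-AzureRmHDInsightHiveJobDefinition -Query "SHOW TABLES;"

    # Submit the job to the cluster.
    $job = Start-AzureRmHDInsightJob -ClusterName $clusterName `
        -JobDefinition $jobDefinition -HttpCredential $httpCredential

    # Block until the job finishes so the cluster is not deleted mid-run.
    Wait-AzureRmHDInsightJob -ClusterName $clusterName `
        -JobId $job.JobId -HttpCredential $httpCredential
}
finally {
    # Always tear down, even on failure, so nothing keeps running and billing.
    # This removes the storage too; delete only the cluster if you want to keep the data.
    Remove-AzureRmResourceGroup -Name $resourceGroupName -Force
}
```

A script like this can be run from a scheduled task on evenings or weekends, which matches the on-demand pattern described at the top of the post.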