Streaming ETL using CDC and Azure Event Hubs. A Modern Data Architecture.

In modern data architecture, as data warehouses have gotten bigger and faster, and as big data technology has allowed us to store vast amounts of data, it is still strange to me that most data warehouse refresh processes found in the wild are still some form of batch processing. Even Hive queries against massive Hadoop …

The Modern Data Warehouse; Azure Data Lake and U-SQL to combine data

The modern data warehouse will need to use Azure Data Lake and U-SQL to combine data. Begin by navigating to your Azure Portal and searching for the Data Lake Analytics resource. Let’s start by creating a new Data Lake. Don’t worry: this service charges only for data in and out, not for simply remaining on like an HDInsight cluster, so you should not be charged anything, and we will not need to spin services up and down as we did earlier.
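To give a flavor of the kind of U-SQL script a "combine data" job looks like, here is a minimal sketch that extracts two CSV files from the Data Lake store and joins them into one output file. The paths, file names, and columns are hypothetical stand-ins, not the post's actual schema.

```
// Hypothetical U-SQL sketch: extract two CSV files and join them into a
// single output. Paths and columns are illustrative only.
@regions =
    EXTRACT Country string,
            Region  string
    FROM "/input/regions.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@scores =
    EXTRACT Country        string,
            HappinessScore double
    FROM "/input/scores.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// U-SQL uses C# expression semantics, hence == in the join condition.
@combined =
    SELECT s.Country, r.Region, s.HappinessScore
    FROM @scores AS s
    INNER JOIN @regions AS r
    ON s.Country == r.Country;

OUTPUT @combined
TO "/output/combined.csv"
USING Outputters.Csv(outputHeader: true);
```

Submitted as a Data Lake Analytics job, a script like this is billed per job run, which is what makes the pay-per-use point above work.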


The Modern Data Warehouse; Running Hive Queries in Visual Studio to combine data

In previous posts we have looked at storing data files in blob storage and using PowerShell to spin up an HDInsight Hadoop cluster. We have also installed some basic software that will help us get going once the services are provisioned. Now that the basics are ready, it is time to process some of that data using Hive and Visual Studio. In this scenario, we will load our Happiness Index data files into Hive tables and then consolidate that data into a single file.
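As a rough illustration of that scenario, the HiveQL below exposes CSV files in blob storage as an external table and then consolidates them. The storage account, container, and column names are hypothetical, not taken from the post.

```sql
-- Hypothetical HiveQL sketch: map the Happiness Index CSV files sitting in
-- blob storage to an external table, then consolidate into a single table.
-- Account, container, and column names are illustrative only.
CREATE EXTERNAL TABLE IF NOT EXISTS happiness_raw (
    country          STRING,
    region           STRING,
    happiness_score  DOUBLE,
    data_year        INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://data@youraccount.blob.core.windows.net/happiness/';

-- Consolidate all yearly files into one result set.
CREATE TABLE happiness_combined AS
SELECT country, region, AVG(happiness_score) AS avg_score
FROM happiness_raw
GROUP BY country, region;
```

Because the table is EXTERNAL, dropping it later removes only the Hive metadata; the underlying blob files survive, which matters for the spin-up/spin-down pattern used elsewhere in this series.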


Setting up tools to work with HDInsight and run Hive Queries – Azure Data Lake Tools and Azure Storage Browser

Two tools that are going to make life a bit simpler if you are going to be working with HDInsight and Azure blob storage are “Azure Data Lake and Stream Analytics Tools for Visual Studio” and Azure Storage Browser.

Azure Data Lake and Stream Analytics Tools for Visual Studio

  •  To run Hive queries, you’re going to need to install Azure Data Lake and Stream Analytics Tools into your version of Visual Studio, sometimes referred to as HDInsight Tools for Visual Studio or Azure Data Lake Tools for Visual Studio. You can install them directly from Visual Studio by selecting Tools -> Get Tools and Features.


The Modern Data Warehouse; The low-cost solution using Big Data with HDInsight and PowerShell

Organizations have been reluctant to transition their business intelligence solutions from traditional data warehouse infrastructure to big data for many reasons, but two of those reasons are false barriers to entry: cost and the complexity of provisioning environments. In the post below, we will cover how to use PowerShell to commission and decommission HDInsight clusters on Azure. Used correctly, this allows organizations to cut costs to under $100 a month for a big data solution by creating clusters on demand on evenings or weekends, processing all the heavy big data work they may have, producing the required data files, and then deleting the cluster so no charges are incurred the rest of the time. The storage files can remain, so when you spin up the cluster again, it is as if nothing ever happened to your previous data or work. Note that the scripts below remove the storage files as well and will need to be modified if you want to keep that portion around. You can also move your storage to a different resource group to avoid it being removed.
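A minimal sketch of that commission-and-decommission loop, assuming the current Az.HDInsight PowerShell cmdlets (an older post may use the AzureRM equivalents); all resource names, sizes, and placeholder values below are illustrative, not from the original scripts.

```powershell
# Hypothetical sketch: create an HDInsight cluster, do the heavy processing,
# then delete the cluster so compute charges stop. The storage account is
# pre-existing and survives the teardown, so the data is there next time.
$rg       = "bigdata-rg"          # illustrative resource group name
$cluster  = "my-hdi-cluster"      # illustrative cluster name
$location = "East US"

# Credentials for the cluster gateway (HTTP) and SSH access.
$httpCred = Get-Credential -Message "Cluster login"
$sshCred  = Get-Credential -Message "SSH login"

# Spin up a small Hadoop cluster attached to an existing storage account.
New-AzHDInsightCluster `
    -ResourceGroupName $rg `
    -ClusterName $cluster `
    -Location $location `
    -ClusterType Hadoop `
    -ClusterSizeInNodes 2 `
    -HttpCredential $httpCred `
    -SshCredential $sshCred `
    -StorageAccountResourceId "/subscriptions/<sub-id>/resourceGroups/$rg/providers/Microsoft.Storage/storageAccounts/mybigdatastore" `
    -StorageAccountKey "<storage-key>" `
    -StorageContainer "hdi"

# ... submit Hive jobs and produce the output data files here ...

# Tear the cluster down; the blob container keeps the data for next time.
Remove-AzHDInsightCluster -ResourceGroupName $rg -ClusterName $cluster
```

Because billing for HDInsight is per hour of cluster uptime, scheduling this pair of steps around the actual processing window is what drives the cost down.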
