HDInsight – jack of all trades master of some
http://jackofalltradesmasterofsome.com/blog
Consultant - Real Estate - Author - Business Intelligence

Streaming ETL using CDC and Azure Event Hub. A Modern Data Architecture.
http://jackofalltradesmasterofsome.com/blog/2019/03/21/modern-data-architecture-streaming-etl-using-cdc-and-azure-event-hub/
Thu, 21 Mar 2019 21:54:38 +0000

As data warehouses have gotten bigger and faster, and as big data technology has allowed us to store vast amounts of data, it is still strange to me that most data warehouse refresh processes found in the wild are still some form of batch processing. Even Hive queries against massive Hadoop infrastructures are essentially fast-running batch queries. Sure, they may run every half day or even every hour, but the speed of business continues to accelerate, and we must start looking at an architecture that combines the speed and transactional processing of Kafka/Spark/Event Hubs to create a real-time streaming ETL that loads a data warehouse at a cost comparable to, or even cheaper than, purchasing an ETL tool. Let's look at streaming ETL using CDC and Azure Event Hub.

Interested in learning more about Modern Data Architecture?

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox.

Now back to the article…

For this series, we will be looking at Azure Event Hubs or IoT Hubs. These were designed to capture fast streaming data, millions of rows from IoT devices or streaming sources like Twitter. But why should this tool be limited to those use cases? Most businesses have no need for that scale, but we can use the same technology to create a live streaming ETL into your data warehouse or reporting environment without sacrificing performance or putting strain on your source systems. This architecture can be used to perform data synchronization between systems and other integrations as well, and since we are not using it to its full potential of capturing millions of flowing records, our costs end up being pennies a day!

Others have emulated this sort of process by using triggers on their source tables, but that can add an extra processing step and overhead to your database. Enabling change data capture natively on SQL Server is much lighter than a trigger, and it lets you take the first step toward creating a streaming ETL for your data. If CDC is not available, simple staging scripts can be written to emulate the same behavior, but be sure to keep an eye on performance (a sketch of that approach follows below). Let's take a look at the first step of setting up native Change Data Capture on your SQL Server tables.
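
If you do go the hand-built route, the pattern generally looks something like the sketch below. This is only a sketch under stated assumptions: the dbo.Task table, its CreatedDate and LastUpdatedDate columns, and the etl.Watermark table are all hypothetical names, and hard deletes are not visible to this approach.

-- Hypothetical timestamp-based change detection (CDC emulation sketch).
-- Assumes dbo.Task has CreatedDate and LastUpdatedDate columns and that
-- etl.Watermark stores the last successful load time per table.
DECLARE @LastLoad    datetime2 = (SELECT LastLoadTime FROM etl.Watermark WHERE TableName = 'Task');
DECLARE @CurrentLoad datetime2 = SYSUTCDATETIME();

SELECT  t.*,
        CASE WHEN t.CreatedDate > @LastLoad THEN 'insert' ELSE 'update' END AS ChangeType
FROM    dbo.Task AS t
WHERE   t.LastUpdatedDate > @LastLoad
  AND   t.LastUpdatedDate <= @CurrentLoad;   -- deleted rows are not captured by this query

UPDATE  etl.Watermark
SET     LastLoadTime = @CurrentLoad
WHERE   TableName = 'Task';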

Steps

  1. First, enable Change Data Capture at the database and table level using the following scripts. More information is available on the Microsoft site. If you are not using SQL Server or a tool that has CDC built in, a similar process can be hand built as a read-only view by leveraging timestamps on last-created and last-updated dates.

EXECUTE sys.sp_cdc_enable_db;
GO

EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'dbo'
  , @source_name   = N'Task'
  , @role_name     = N'cdc_Admin';
GO

This step enables CDC on the database and adds it to the table "Task". SQL Server Agent must be running, as two jobs are created during this process: one to capture changes into the change table and one to clean it out.
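
To confirm that CDC is enabled and that the two Agent jobs exist, a quick check like the following can help. This is a minimal sketch using SQL Server's built-in catalog views and CDC procedures, with the database and table names from the example above.

-- Confirm CDC is enabled at the database and table level
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();
SELECT name, is_tracked_by_cdc FROM sys.tables WHERE name = 'Task';

-- List the capture and cleanup jobs that CDC created
EXECUTE sys.sp_cdc_help_jobs;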

Once CDC is enabled, SQL Server will automatically create a change table named "schema_tablename_CT" (in our case, cdc.dbo_Task_CT) under System Tables. This table will now automatically track all data changes that occur on the source table. You can reference the __$operation code to determine what change occurred using the legend below. If you wish to capture changes to only certain columns, see the Microsoft documentation on CDC to see how that can be set. If you are hand writing your SQL, this can also be programmed in when building your staging query.

1 = delete
2 = insert
3 = update (captured column values are those before the update operation). This value applies only when the row filter option 'all update old' is specified.
4 = update (captured column values are those after the update operation)

Now once you add, edit, or delete a record, you should be able to find it in the new CDC change table. Next, we will look at scanning this table and turning the data into JSON to send to an Event Hub!
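
As a preview, a minimal sketch of that scan is below. It assumes CDC was enabled on dbo.Task exactly as above (so the capture instance is dbo_Task) and SQL Server 2016 or later for the FOR JSON clause. In a real streaming loop you would persist the last LSN you processed and start from sys.fn_cdc_increment_lsn of that value instead of the minimum LSN.

-- Read all changes recorded for dbo.Task and shape them as JSON
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Task');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT  *
FROM    cdc.fn_cdc_get_all_changes_dbo_Task(@from_lsn, @to_lsn, N'all')
FOR JSON PATH;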

For more information on SQL Server CDC, please see the Microsoft documentation here.

Be sure to check out my full online class on the topic: a hands-on walkthrough of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, it covers the basics of navigating the Azure Portal through building an end-to-end modern data warehouse solution using popular technologies such as SQL Database, Data Lake, Data Factory, Data Bricks, Azure Synapse Data Warehouse, and Power BI. A link to the class can be found here.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Data Bricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into Azure Data Synapse Data Warehouse

The Modern Data Warehouse; Azure Data Lake and U-SQL to combine data
http://jackofalltradesmasterofsome.com/blog/2018/12/17/the-modern-data-warehouse-azure-data-lake-and-u-sql-to-combine-data/
Mon, 17 Dec 2018 15:10:04 +0000

The modern data warehouse will need to use Azure Data Lake and U-SQL to combine data. Begin by navigating to your Azure Portal and searching for the Data Lake Analytics resource. Let's start by creating a new Data Lake. Don't worry, this service only charges for the jobs you run, not for simply remaining on like an HDInsight cluster, so you should not be charged anything, and we will not need to spin services up and down like we did earlier.

You will need to give it a unique name as well as tie it to a pay-as-you-go subscription. A Data Lake Analytics account will also need a Data Lake storage layer. You can keep the naming as is or rename it if you wish. It is recommended to keep the storage encrypted. Deployment may take a few minutes.

Let's navigate to the Data Explorer and create a new folder to hold our sample data. Upload the same three Happiness Index files to the new folder.

Take a look through your Data Explorer at the tables. It should contain a master database, similar to what you would see if you were running a traditional SQL Server.

Create a new U-SQL Job that creates the new database in the catalog as well as a schema and a table.

CREATE DATABASE IF NOT EXISTS testdata;

USE DATABASE testdata;

CREATE SCHEMA IF NOT EXISTS happiness;

CREATE TABLE happiness.placeholderdata
(
    Region string,
    HappinessRank float,
    HappinessScore float,
    LowerConfidenceInterval float,
    UpperConfidenceInterval float,
    Economy_GDPperCapita float,
    Family float,
    Health_LifeExpectancy float,
    FreedomTrust_GovernmentCorruption float,
    GenerosityDystopiaResidual float,
    INDEX clx_Region CLUSTERED(Region ASC) DISTRIBUTED BY HASH(Region)
);

While the job is running, and once it has completed, Azure will present the execution tree. With the new table created, we can now load data into it.

USE DATABASE testdata;

@log =
    EXTRACT Region string,
            HappinessRank float,
            HappinessScore float,
            LowerConfidenceInterval float,
            UpperConfidenceInterval float,
            Economy_GDPperCapita float,
            Family float,
            Health_LifeExpectancy float,
            FreedomTrust_GovernmentCorruption float,
            GenerosityDystopiaResidual float
    FROM "/sampledata/{*}.csv"
    USING Extractors.Text(',', silent: true);

INSERT INTO happiness.placeholderdata
SELECT * FROM @log;

Once the data is loaded, we can query it to see what it looks like. The script below runs the select and outputs the results to a file that can be browsed in the Data Explorer.

@table = SELECT * FROM [testdata].[happiness].[placeholderdata];

OUTPUT @table
    TO "/OUTPUTS/Sampledataquery.csv"
    USING Outputters.Csv();



Be sure to check out my full online class on the topic: a hands-on walkthrough of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, it covers the basics of navigating the Azure Portal through building an end-to-end modern data warehouse solution using popular technologies such as SQL Database, Data Lake, Data Factory, Data Bricks, Azure Synapse Data Warehouse, and Power BI. A link to the class can be found here or directly here.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Data Bricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into Azure Data Synapse Data Warehouse



The Modern Data Warehouse; Running Hive Queries in Visual Studio to combine data
http://jackofalltradesmasterofsome.com/blog/2018/12/04/the-modern-data-warehouse-running-hive-queries-in-visual-studio-to-combine-data/
Tue, 04 Dec 2018 21:21:24 +0000

In previous posts we have looked at storing data files in blob storage and using PowerShell to spin up an HDInsight Hadoop cluster. We have also installed some basic software that will help us get going once the services are provisioned. Now that the basics are ready, it is time to process some of that data using Hive and Visual Studio. In this scenario, we will load our Happiness Index data files into Hive tables and then consolidate that data into a single file.

For this tutorial we are going to leverage Azure Storage Explorer to view our storage files as well as create folders. You can use this tool, as well as the shell, to create new folders and upload your data files. Note that a new folder will not actually be created until you upload files into it using this tool. In a real-world scenario, you would use a file transfer task or Data Factory to stage your files.

Open the tool and sign in to your Azure account. Once signed in, navigate to your HDInsight cluster's storage, create a new folder in your Hive storage called "data", and upload all three of our files to this location.

Open Visual Studio with the newly installed "Azure Data Lake Tools". If you look at Server Explorer and ensure you are signed in with the same Azure account where your cluster is located, you will see the cluster listed.

You can create a new table directly from Server Explorer. In the tool you can either script it or use the wizard to create the new table. Create it so that it has the same column names as the file. For the example below we will run the create table and the load table steps as one Hive script. Update the script to create and load the 2016 and 2017 files as well (the extra load statements are sketched after the script).

CREATE TABLE IF NOT EXISTS default.sourcedata2015(
    Country string,
    Region string,
    HappinessRank float,
    HappinessScore float,
    LowerConfidenceInterval float,
    UpperConfidenceInterval float,
    Economy_GDPperCapita float,
    Family float,
    Health_LifeExpectancy float,
    FreedomTrust_GovernmentCorruption float,
    GenerosityDystopiaResidual float)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/2015.csv' INTO TABLE sourcedata2015;
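
The 2016 and 2017 loads follow the same pattern. A minimal sketch, assuming tables sourcedata2016 and sourcedata2017 were created with the same column definitions and that the files were uploaded to the same "data" folder:

-- Load the remaining years into their matching tables
LOAD DATA INPATH '/data/2016.csv' INTO TABLE sourcedata2016;
LOAD DATA INPATH '/data/2017.csv' INTO TABLE sourcedata2017;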

Once you have created your three new Hive tables, it is time to consolidate them into one result using a simple SQL-like union statement, adding a Year column to the end.

SELECT * FROM
(
    SELECT *, '2015' AS Year FROM sourcedata2015
    UNION ALL
    SELECT *, '2016' AS Year FROM sourcedata2016
    UNION ALL
    SELECT *, '2017' AS Year FROM sourcedata2017
) CombinedTable

Once the job is complete, you should be able to view the results by clicking "Job Output". All we have done here is create a select statement, but this data could also have been inserted into a new Hive table for either more processing or a push to a targeted data warehouse (a sketch of that is below).
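
For example, a minimal sketch of writing the combined result to a new Hive table. The table name combinedsourcedata is a hypothetical choice, and ORC is just one storage format option:

-- Persist the combined result as a new Hive table for downstream use
CREATE TABLE IF NOT EXISTS default.combinedsourcedata STORED AS ORC AS
SELECT * FROM
(
    SELECT *, '2015' AS Year FROM sourcedata2015
    UNION ALL
    SELECT *, '2016' AS Year FROM sourcedata2016
    UNION ALL
    SELECT *, '2017' AS Year FROM sourcedata2017
) CombinedTable;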

I hope this introduction helps you get off the ground with the basics of running Hive queries. In later posts we will look at how HDInsight intersects with Azure Data Lake and Data Factory. Before you leave, be sure to delete your HDInsight cluster. Since this service is billed hourly rather than per job, it will continue to charge your account while it is running. You can delete it either via the Azure dashboard or using the PowerShell script from our previous posts.

Setting up tools to work with HDInsight and run Hive Queries – Azure Data Lake Tools and Azure Storage Explorer
http://jackofalltradesmasterofsome.com/blog/2018/11/27/setting-up-visual-studio-to-work-with-hdinsights-and-run-hive-queries/
Tue, 27 Nov 2018 22:00:30 +0000

Two tools that are going to make life a bit simpler if you are going to be working with HDInsight and Azure Blob storage are "Azure Data Lake and Stream Analytic Tools for Visual Studio" and "Azure Storage Explorer".

Azure Data Lake and Stream Analytic Tools for Visual Studio

  •  To run Hive queries, you're going to need to install Azure Data Lake and Stream Analytic Tools for your version of Visual Studio, sometimes referred to as HDInsight tools for Visual Studio or Azure Data Lake tools for Visual Studio. You can install them directly from Visual Studio by selecting Tools -> Get Tools and Features.

  • Once you have Visual Studio open, navigate to Server Explorer and verify you are connected to the right Azure subscription. You should now see your cluster information as created in the previous blog post, "Provisioning HDInsight Clusters using PowerShell".

You should be able to browse and interact with your storage from Visual Studio as well as the Azure dashboard at this point, but another helpful tool to install is "Azure Storage Explorer". It can be found and installed from the following location.

Storage Explorer can be used to create new folders and upload your data files, which makes that job a little easier. Please note that when creating new folders, the folder will not actually be created until you upload files into it using this tool.

That should be it! From the Server explorer, you can now see the HDInsight Cluster that was spun up as well as the storage accounts where files will be stored. In the next post we will look at building and running Hive queries against your data files to get them ready for reporting.

The Modern Data Warehouse; The low-cost solution using Big Data with HDInsight and PowerShell
http://jackofalltradesmasterofsome.com/blog/2018/08/30/the-modern-data-warehouse-the-low-cost-solution-using-big-data-with-hdinsight-and-powershell/
Thu, 30 Aug 2018 20:07:07 +0000

Organizations have been reluctant to transition their business intelligence solutions from the traditional data warehouse infrastructure to big data for many reasons, but two of those reasons are false barriers to entry: cost and the complexity of provisioning environments. In the post below, we will cover how to use PowerShell to commission and decommission HDInsight clusters on Azure. Used correctly, this allows organizations to cut costs to under $100 a month for a big data solution by creating the clusters on demand on weekends or evenings, processing all the heavy big data they may have, creating the data files required, and then deleting the cluster so no charges are incurred the rest of the time. The storage files can remain, so when you spin up the cluster again, it is as if nothing ever happened to your previous data or work. Note that the scripts below remove the storage files as well and will need to be modified to keep that portion around. You can also move your storage to a different resource group to avoid it being removed.

Before we start, it is important to note that keeping the clusters running all the time is not very expensive if you only require the lower end of the infrastructure. Cost only goes up as you scale up, and if this is not a concern then you will reap the benefits of having an always-on big data solution where you can process live transactional data. If you are spinning clusters up and down, you are limited to the large-volume batch processing that big data solutions provide.

Provisioning A Server

  • All you will need to get off the ground is an Azure subscription and PowerShell, which comes natively installed with Windows. To run PowerShell, simply open search and type in "PowerShell ISE" to launch the tool.
  • Download and install Azure PowerShell from the Azure features page. This will give you all the cmdlets needed to run PowerShell commands against your Azure environment. The current download can be found at the following link.
    1. https://github.com/Azure/azure-powershell/releases/tag/v6.7.0-August2018
  • Once installed, when running PowerShell ISE, you should now see Azure commands in the Commands pane to the right.
  • See the script below to provision a cluster, with credit to the EDX online education course (a sketch of such a script follows this list). The script will create a single resource group, a storage account, and a cluster, at the lowest cost required for an HDInsight big data solution. It can be scaled up or down once provisioned.
    1. The variables at the top define all of your names and passwords. Please be sure to update these accordingly and save them to a secure location.
    2. You can change the names, locations, and credentials if you wish to spin up multiple instances.
    3. The line "Login-AzureRmAccount" is where the script will pause during execution to obtain your Azure credentials. These are the credentials you use to log into your Azure Portal.
  • Once started, the script will post updates. This step can take anywhere between 10 and 20 minutes to complete. Once completed you will have a fully functioning HDInsight cluster on Azure ready to go.
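
The original EDX script is not reproduced here, but the following is a minimal sketch of that kind of provisioning flow using the AzureRM 6.7.0 module linked above. Every name, region, node count, and credential below is a placeholder assumption; adjust them before running.

# Minimal sketch of an HDInsight provisioning script (AzureRM module).
# All names, regions, and credentials below are placeholders.
$resourceGroupName  = "hdi-demo-rg"
$location           = "East US"
$storageAccountName = "hdidemostorage01"   # must be globally unique, lowercase
$containerName      = "hdidemo"
$clusterName        = "hdi-demo-cluster"   # must be globally unique
$clusterNodes       = 2

# Pauses here to prompt for your Azure Portal credentials
Login-AzureRmAccount

# Resource group, storage account, and default container
New-AzureRmResourceGroup -Name $resourceGroupName -Location $location
New-AzureRmStorageAccount -ResourceGroupName $resourceGroupName -Name $storageAccountName `
    -Location $location -SkuName Standard_LRS
$storageKey = (Get-AzureRmStorageAccountKey -ResourceGroupName $resourceGroupName -Name $storageAccountName)[0].Value
$context = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageKey
New-AzureStorageContainer -Name $containerName -Context $context

# Cluster login and SSH credentials (you will be prompted)
$httpCredential = Get-Credential -Message "Cluster login (admin) credentials"
$sshCredential  = Get-Credential -Message "SSH credentials"

# Provision the HDInsight cluster
New-AzureRmHDInsightCluster -ResourceGroupName $resourceGroupName -ClusterName $clusterName `
    -Location $location -ClusterSizeInNodes $clusterNodes -ClusterType Hadoop -OSType Linux `
    -Version "3.6" -HttpCredential $httpCredential -SshCredential $sshCredential `
    -DefaultStorageAccountName "$storageAccountName.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey -DefaultStorageContainer $containerName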

Deleting and Cleanup

  • The following script is all you need to clear out all the items created in the previous step. By doing this, you will ensure nothing is still actively running on Azure and you are not incurring any cost (a minimal sketch follows).
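
A minimal sketch of that cleanup, assuming everything was created in a single resource group as in the provisioning sketch above (the resource group name is a placeholder):

# Removing the resource group deletes the cluster and the storage account inside it
Login-AzureRmAccount
Remove-AzureRmResourceGroup -Name "hdi-demo-rg" -Force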

In between the steps of provisioning a cluster and cleaning it up is where you will want to add your big data processing automation. You can string together a series of PowerShell scripts this way to knock out a ton of batch processing very cost-effectively.
