Create a NodeJS Client Application to Submit Data to Event Hubs

Now that we have provisioned an Event Hub in Azure, let's create a Node.js client application to submit data to it.

Prerequisites

  • Visual Studio 2017
  • Install the Node.js SDK
  • From Visual Studio, select "Tools -> Get Tools and Features" and, in the window that opens, select the "Node.js development" workload to add the appropriate libraries to your Visual Studio instance.
  1. Start a command prompt in the folder where your script will reside. For this example we will use C:\EventHub.
  2. Create a package for the application using "npm init".
    1. Accept all defaults in the setup and select "yes".
    2. You will be shown the package.json that will be created for you.
  3. Install the Azure SDK package by running the command "npm install azure-event-hubs".
  4. Navigate back to the shared access policy you created in Azure and copy the connection string. Place it in the .js script file so the client can connect. This script simply sends a JSON payload to the Event Hub at a regular interval; a minimal sketch of such a script follows this list.
  5. Run "node eventclient.js" from the command prompt to begin sending messages to the Event Hub.
  6. If you navigate back to Azure, you will see the events being recorded in the Event Hub.
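The full eventclient.js from the original post is not reproduced here, so below is a minimal sketch of what such a script could look like. It is a hedged stand-in rather than the post's exact code: it uses the current @azure/event-hubs package (the successor to the azure-event-hubs package named above, installed with "npm install @azure/event-hubs"), and the connection string, Event Hub name, and payload fields are all placeholders.

// eventclient.js: minimal sketch; every value below is a placeholder, not a real credential.
const { EventHubProducerClient } = require("@azure/event-hubs");

const connectionString = "<namespace connection string from your shared access policy>";
const eventHubName = "<your event hub name>";

const producer = new EventHubProducerClient(connectionString, eventHubName);

async function sendSample() {
  // Build a batch and add a single JSON payload to it.
  const batch = await producer.createBatch();
  batch.tryAdd({
    body: { deviceId: "demo-1", reading: Math.random(), sentAt: new Date().toISOString() },
  });
  await producer.sendBatch(batch);
  console.log("Sent one event");
}

// Send a message every five seconds to simulate a slow, intermittent stream.
setInterval(() => sendSample().catch(console.error), 5000);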

Reading Data in Event Hubs

  1. Follow the same commands from the previous section to set up the Node.js package via the command prompt.
    1. npm init
    2. npm install azure-event-hubs
  2. Update the script with the connection string from the shared access policy that was set up with Send and Listen rights. A minimal sketch of an eventreader.js script follows this list.
  3. Run the command "node eventreader.js" to begin reading the messages going into the Event Hub.
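As with the sender, the original eventreader.js is not reproduced here; the sketch below is a hedged stand-in using the current @azure/event-hubs package, with the connection string and Event Hub name as placeholders. It listens on the default consumer group and prints each event body as it arrives.

// eventreader.js: minimal sketch; every value below is a placeholder.
const { EventHubConsumerClient } = require("@azure/event-hubs");

const connectionString = "<namespace connection string with Listen rights>";
const eventHubName = "<your event hub name>";

// "$Default" is the consumer group every Event Hub starts with.
const consumer = new EventHubConsumerClient("$Default", connectionString, eventHubName);

consumer.subscribe({
  processEvents: async (events, context) => {
    for (const event of events) {
      console.log(`Partition ${context.partitionId}:`, event.body);
    }
  },
  processError: async (err, context) => {
    console.error(`Error on partition ${context.partitionId}:`, err);
  },
});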

Part 1 – Provisioning Event Hubs

Provisioning an Azure Event Hub to capture real time streaming data

Provisioning an Azure Event Hub to capture real-time streaming data is fairly easy once you have an Azure account. Event Hubs can be used to capture data from many different sources, including databases and IoT devices. As we look at building a CDC streaming ETL, let's take a look at the basics of Event Hubs.

  1. Create a new Event Hubs resource in Azure by finding it in the search bar.
  • Create a new namespace.
    • Name it something unique and use the Basic pricing tier to limit cost, since our needs are fairly limited and do not need the full horsepower of Azure Event Hubs.
    • Select your subscription and create a resource group if you do not already have one.
    • Select "Create" to begin the deployment process.
  • Once the deployment completes, navigate to your new namespace and select "Add Event Hub".
    • Give it a name and leave all settings as they are to keep it small.
    • Once you hit create, the deployment process will start.


  • After completion, you should have an active Event Hub in your namespace.

Granting Access to the Event Hub

  1. On the right of the namespace window, select "Shared Access Policies".
  • Add a new policy and give it "Manage" access rights so it may send and listen to messages entering and leaving the Event Hub. As a best practice, different policies can be used for different applications. Once created, the policy exposes keys and a connection string that will be used to send and receive messages; the general shape of that connection string is shown after this list.
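For reference (this is the general format, not something copied from a real policy, and every value is a placeholder), a namespace-level connection string looks roughly like this:

Endpoint=sb://<your-namespace>.servicebus.windows.net/;SharedAccessKeyName=<your-policy-name>;SharedAccessKey=<your-key>

When the policy is created on a specific Event Hub rather than on the namespace, the string typically carries an additional ;EntityPath=<your-event-hub-name> segment at the end.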

Part 2 – Build a client app in NodeJS to send data to Event Hubs

Streaming ETL using CDC and Azure Event Hub. A Modern Data Architecture.

In modern data architecture, data warehouses have gotten bigger and faster, and big data technology has allowed us to store vast amounts of data, yet it is still strange to me that most data warehouse refresh processes found in the wild are still some form of batch processing. Even Hive queries against massive Hadoop infrastructures are essentially fast-performing batch queries. Sure, they may run every half day or even every hour, but the speed of business continues to accelerate, and we must start looking at an architecture that combines the speed and transactional processing of Kafka/Spark/Event Hubs to create a real-time streaming ETL that loads a data warehouse at a cost comparable to, or even cheaper than, purchasing an ETL tool. Let's look at streaming ETL using CDC and Azure Event Hub.

Interested in learning more about Modern Data Architecture?

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox!

Now back to the article…

For this series, we will be looking at Azure Event Hubs or IoT Hubs. These were designed to capture fast streaming data, millions of rows from IoT devices or feeds like Twitter. But why should the tool be limited to those use cases? Most businesses have no need for that scale, but we can use the same technology to create a live streaming ETL into your data warehouse or reporting environment without sacrificing performance or creating strain on your source systems. This architecture can also be used for data synchronization between systems and other integrations, and since we are not using it to its full potential of capturing millions of flowing records, our costs end up being pennies a day!

Others have emulated this sort of process by using triggers on their source tables, but that can add an extra step of processing and overhead to your database. Enabling change data capture natively on SQL Server is much lighter than a trigger, and you can then take the first steps toward creating a streaming ETL for your data. If CDC is not available, simple staging scripts can be written to emulate the same behavior, but be sure to keep an eye on performance. Let's take a look at the first step: setting up native Change Data Capture on your SQL Server tables.

Steps

  1. First, enable Change Data Capture at the database and table level using the following scripts. More information is available on the Microsoft site. If you are not using SQL Server or a tool that has CDC built in, a similar process can be hand built as a read-only view by leveraging timestamps on last created date and last updated date.

EXECUTE sys.sp_cdc_enable_db;
GO

EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'dbo'
  , @source_name = N'Task'
  , @role_name = N'cdc_Admin';
GO

This step enables CDC on the database and adds it to the table "Task". SQL Server Agent must be running, as two jobs are created during this process: a capture job that loads the change table and a cleanup job that purges it.

Once CDC is enabled, SQL Server automatically creates a change table named cdc.<schema>_<tablename>_CT (for example, cdc.dbo_Task_CT) under System Tables. This table will automatically track all data changes that occur on the source table. You can reference the __$operation column to determine what change occurred, using the legend below. If you wish to capture changes to only certain columns, see the Microsoft documentation on CDC for how that can be configured. If you are hand writing your SQL, this can also be programmed in when building your staging query.

1 = delete
2 = insert
3 = update (captured column values are those before the update operation). This value applies only when the row filter option ‘all update old’ is specified.
4 = update (captured column values are those after the update operation)

Now, once you add, edit, or delete a record, you should be able to find it in the new CDC table. Next, we will look at scanning this table and turning the data into JSON to send to an Event Hub; a rough sketch of that polling step follows.
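To preview how the pieces could fit together, here is a hypothetical Node.js sketch of that polling step. It assumes the mssql and @azure/event-hubs packages, the default capture instance name cdc.dbo_Task_CT from the example above, and placeholder credentials and connection strings; a production version would also track the last processed __$start_lsn rather than re-reading the whole change table, and would check whether the event batch is full before adding more rows.

// cdc-to-eventhub.js: hypothetical sketch that reads CDC rows and forwards them as JSON events.
const sql = require("mssql");
const { EventHubProducerClient } = require("@azure/event-hubs");

const sqlConfig = {
  server: "localhost",
  database: "SourceDB",
  user: "etl_user",            // placeholder credentials
  password: "<password>",
  options: { trustServerCertificate: true },
};

const producer = new EventHubProducerClient("<connection string>", "<event hub name>");

async function forwardChanges() {
  const pool = await sql.connect(sqlConfig);
  // In a real job, filter on __$start_lsn greater than the last LSN already processed.
  const result = await pool.request().query("SELECT * FROM cdc.dbo_Task_CT");

  const batch = await producer.createBatch();
  for (const row of result.recordset) {
    // Each change row (including its __$operation code) becomes one JSON event.
    batch.tryAdd({ body: row });
  }
  await producer.sendBatch(batch);
  console.log(`Forwarded ${result.recordset.length} change rows`);
}

forwardChanges().catch(console.error);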

For more information on SQL Server CDC, please see the Microsoft documentation here.

Be sure to check out my full online class on the topic: a hands-on walkthrough of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, learn the basics of navigating the Azure Portal through building an end-to-end modern data warehouse solution using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse, and Power BI. A link to the class can be found here.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Data Bricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into Azure Data Synapse Data Warehouse

Accelerating the Staging Process for your Data Warehouse

Real Estate Data Warehouse – Accelerating the Staging Process

The script below can be used to build a staging environment for any industry, not just real estate related databases. The specifics of a real estate data warehouse will be covered in a future blog post. The script will help you accelerate the staging process for your data warehouse.

When starting the process to capture data analytics, whether you are planning to eventually build a data warehouse or to feed a big data Hadoop cluster, it helps to stage your data away from your source systems. This provides many benefits, the primary one being a copy of the data to work with and process that is no longer in the transactional operational system. Having large processes or queries running against your transactional system introduces unnecessary risk, can cause bottlenecks or slowdowns, and can even open security holes that are not needed. When you have a staging environment, all you need is one service account managed by IT security that has read access to the source systems. From there, a scheduled refresh process that loads this data as a blanket truncate-and-reload can be set up easily. Many ETL tools can be used, and robust auditing and scheduling should eventually be put in place, but getting off the ground quickly to start prototyping and profiling your data will let you start providing value to the business a lot sooner.

For this reason, I wrote the SQL script below a while back to help me on new projects. Running this script against a linked server connection or a replicated database will quickly build a staging database with procedures to load all the data as truncate-and-reloads. These can then be wrapped in a master SQL procedure and scheduled, giving you a full staging ETL process without needing ETL tools. Remember, this is just an accelerator and will require some tweaking and optimization to get to a final state, but it should get you off the ground with your basic SQL based source systems.

/***********************
Start of Script
************************/

/***********************
Configuration
************************/

DECLARE @sqlCommand varchar(1000)
DECLARE @DatabaseNameStaging varchar(75)
DECLARE @DatabaseNameSource varchar(75)

SET @DatabaseNameStaging = 'Staging'
SET @DatabaseNameSource = 'SourceDB'

-- Add all tables to ignore to this list
IF OBJECT_ID('tempdb..#TablestoIgnore') IS NOT NULL DROP TABLE #TablestoIgnore
CREATE TABLE #TablestoIgnore
(
TableName varchar(255)
)

INSERT INTO #TablestoIgnore
SELECT ''
--UNION
--SELECT ''

-- Table to store the list of all tables in the source database
IF OBJECT_ID('tempdb..#TableList') IS NOT NULL DROP TABLE #TableList
CREATE TABLE #TableList
(
TableName varchar(255)
)

/***********************
Create Staging Database
************************/

SET @sqlCommand = 'IF NOT EXISTS(SELECT * FROM sys.databases WHERE NAME = '''+@DatabaseNameStaging+''')
BEGIN
CREATE DATABASE '+@DatabaseNameStaging+'
-- Set logging to simple
USE master ;
ALTER DATABASE '+@DatabaseNameStaging+' SET RECOVERY SIMPLE
END'
EXEC (@sqlCommand)

/***********************
Get List of All Tables
************************/

SET @sqlCommand = 'INSERT INTO #TableList
SELECT DISTINCT T.name AS Table_Name
FROM '+@DatabaseNameSource+'.sys.objects AS T
WHERE T.type_desc = ''USER_TABLE''
AND T.name NOT IN (SELECT TableName FROM #TablestoIgnore)
ORDER BY 1'
EXEC (@sqlCommand)

-- Build drop and create statements for each staging table
SELECT 'IF OBJECT_ID(''' + @DatabaseNameStaging + '.dbo.' + TableName + ''', ''U'') IS NOT NULL DROP TABLE ' + @DatabaseNameStaging + '.dbo.' + TableName + ';' AS DropStatement,
'SELECT TOP 1 * INTO ' + @DatabaseNameStaging + '.dbo.' + TableName + ' FROM ' + @DatabaseNameSource + '.dbo.' + TableName AS CreateStatement
INTO #DatabaseStatements
FROM #TableList

-- Run drop commands
DECLARE @MyCursor CURSOR;
DECLARE @MyField varchar(500);
BEGIN
SET @MyCursor = CURSOR FOR
SELECT DropStatement FROM #DatabaseStatements

OPEN @MyCursor
FETCH NEXT FROM @MyCursor INTO @MyField

WHILE @@FETCH_STATUS = 0
BEGIN
EXEC (@MyField)
FETCH NEXT FROM @MyCursor INTO @MyField
END;

CLOSE @MyCursor;
DEALLOCATE @MyCursor;
END;

-- Run create commands
BEGIN
SET @MyCursor = CURSOR FOR
SELECT CreateStatement FROM #DatabaseStatements

OPEN @MyCursor
FETCH NEXT FROM @MyCursor INTO @MyField

WHILE @@FETCH_STATUS = 0
BEGIN
EXEC (@MyField)
FETCH NEXT FROM @MyCursor INTO @MyField
END;

CLOSE @MyCursor;
DEALLOCATE @MyCursor;
END;

/***********************
Create All Stored Procedures
to Load Staging

*** THIS SECTION MUST BE RUN AGAINST THE STAGING ENVIRONMENT ***
*** This step may result in errors for identity tables. Those ETLs will need to be created by hand.
************************/

USE Staging

-- Create one truncate-and-reload procedure per table
BEGIN
SET @MyCursor = CURSOR FOR
SELECT TableName FROM #TableList

OPEN @MyCursor
FETCH NEXT FROM @MyCursor INTO @MyField

WHILE @@FETCH_STATUS = 0
BEGIN
-- Drop the load procedure if it already exists
EXEC ('IF OBJECT_ID(''spLoad'+@MyField+''', ''P'') IS NOT NULL DROP PROC spLoad'+@MyField)

-- Create a procedure that truncates the staging table and reloads it from the source
EXEC ('CREATE PROCEDURE dbo.spLoad'+@MyField+'
AS
BEGIN
SET NOCOUNT ON;
TRUNCATE TABLE '+@DatabaseNameStaging+'.dbo.'+@MyField+'
INSERT INTO '+@DatabaseNameStaging+'.dbo.'+@MyField+'
SELECT * FROM '+@DatabaseNameSource+'.dbo.'+@MyField+'
END')

FETCH NEXT FROM @MyCursor INTO @MyField
END;

CLOSE @MyCursor;
DEALLOCATE @MyCursor;
END;

DROP TABLE #DatabaseStatements

/***********************
End of Script
************************/

 



Be sure to check out my full online class on the topic: a hands-on walkthrough of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, learn the basics of navigating the Azure Portal through building an end-to-end modern data warehouse solution using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse, and Power BI. A link to the class can be found here or directly here.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Data Bricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into Azure Data Synapse Data Warehouse


