azure data factory – jack of all trades master of some http://jackofalltradesmasterofsome.com/blog

Streaming ETL with Azure Data Factory and CDC – Creating the Rolling ETL Window http://jackofalltradesmasterofsome.com/blog/2021/01/27/streaming-etl-with-azure-data-factory-and-cdc-creating-the-rolling-etl-window/ Wed, 27 Jan 2021 01:33:06 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Creating the Rolling ETL Window. This is Part 8; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

Creating the Rolling ETL Window

Now that we have our parameter-driven pipeline, we can create a trigger that uses a rolling time window to run intermittently and pick up changes.

  1. Click on Add Trigger -> New.
  2. Create a new trigger, set the type to Tumbling window, and set its start to a time in the future.
  3. On the following screen, set the start and end date parameters to the following expressions:

@formatDateTime(trigger().outputs.windowStartTime,'yyyy-MM-dd HH:mm:ss.fff')

@formatDateTime(trigger().outputs.windowEndTime,'yyyy-MM-dd HH:mm:ss.fff')
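
One operational note for a rolling window: the CDC cleanup job purges change rows after a retention period (4,320 minutes, or three days, by default), so any window you expect to replay must fall inside that retention. A quick way to inspect the current setting in SSMS; this queries standard CDC job metadata and is not specific to this series:

-- Retention is in minutes; the cleanup job removes change rows older than this
SELECT job_type, retention, threshold
FROM msdb.dbo.cdc_jobs;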

Streaming ETL with Azure Data Factory and CDC – Create a Parameter Driven Pipeline http://jackofalltradesmasterofsome.com/blog/2021/01/27/streaming-etl-with-azure-data-factory-and-create-a-parameter-driver-pipeline/ Wed, 27 Jan 2021 01:27:40 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Create a Parameter Driven Pipeline. This is Part 7; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

The previous step will pull all the changes in the CDC table, but we do not want to do this all the time. So let’s look at creating a rolling window for the CDC ETL.

  1. Navigate to the Parameters section of the pipeline and add two parameters, "triggerStartTime" and "triggerEndTime", and set them to yesterday's and today's dates in the format "2020-01-07 12:00:00.000".
  2. On the Lookup activity, update the query in the settings to the following so it uses the new parameters. SQL Server Agent must be running for this step, and the parameters must be valid dates. (A sketch for validating the resolved query in SSMS appears after this list.)

@concat('DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = ''',pipeline().parameters.triggerStartTime,''';
SET @end_time = ''',pipeline().parameters.triggerEndTime,''';
SET @from_lsn = sys.fn_cdc_map_time_to_lsn(''smallest greater than or equal'', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than'', @end_time);
SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct (@from_lsn, @to_lsn, ''all'')')

  3. Navigate back to the "True" condition and paste the following query into the source of the Copy activity so it tracks the changes using the new parameters as well:

@concat('DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = ''',pipeline().parameters.triggerStartTime,''';
SET @end_time = ''',pipeline().parameters.triggerEndTime,''';
SET @from_lsn = sys.fn_cdc_map_time_to_lsn(''smallest greater than or equal'', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than'', @end_time);
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, ''all'')')

  4. Open the sink dataset used in the "True" condition and click on Parameters.
  5. Add a new parameter called triggerStart.
  6. Head back to the Connection tab for the dataset, where we will add dynamic content for the directory and file.
  7. Add the following for the Directory and File sections.

Directory

@concat('dimProduct/incremental/',formatDateTime(dataset().triggerStart,'yyyy/MM/dd'))

File

@concat(formatDateTime(dataset().triggerStart,'yyyyMMddHHmmssfff'),'.csv')

  8. Navigate back to the Sink tab in the Copy activity and expand the dataset properties. Add the dynamic content for the new parameter.
  9. You can now trigger your run and see the new files landing in the data lake.
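
Before creating the trigger in the next part, it can help to validate the resolved query in SQL Server Management Studio. The sketch below is simply what the @concat expression above produces once the two pipeline parameters are replaced with example literal dates; adjust the dates and capture instance to match your own setup:

DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = '2020-01-07 12:00:00.000';  -- example value of triggerStartTime
SET @end_time = '2020-01-08 12:00:00.000';    -- example value of triggerEndTime
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than', @end_time);
SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, 'all');

Note that if there is no CDC activity inside the window, the mapped LSNs come back NULL and the table-valued function will raise an error rather than return zero rows.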

Streaming ETL with Azure Data Factory and CDC – Creating an Incremental Pipeline in Azure Data Factory http://jackofalltradesmasterofsome.com/blog/2021/01/27/streaming-etl-with-azure-data-factory-and-incremental-pipeline-azure-data-factory/ Wed, 27 Jan 2021 01:23:51 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Creating an Incremental Pipeline in Azure Data Factory. This is Part 6; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox.

Now back to the article…

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

Creating an Incremental Pipeline in Azure Data Factory

  1. Create a new pipeline and name it "Incremental_Load".
  2. From the Activities pane, drag a Lookup activity onto the main canvas. This activity will be used to track the new changes in the CDC table for a certain time frame. Name the Lookup "GetChangeCount".
  3. In the settings, add the custom query below, replacing the table name with the correct one for your capture instance. The query "SELECT capture_instance FROM cdc.change_tables" will give you the names of the CDC capture instances and can be tested in SQL Server Management Studio.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_DimProduct');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, 'all')

  4. Preview data will show the result of this query, i.e., how many changes have been recorded in the CDC table.
  5. In the Activities pane, expand the iteration and conditionals section and add an If Condition to the flow.
  6. Name it "HasChangedRow", add the expression "@greater(int(activity('GetChangeCount').output.firstRow.changecount),0)" in the properties window, and select the pencil icon next to the True condition.
  7. Add a Copy activity from the Move & Transform section and name it "Copy Incremental Data".
  8. In the Source tab, set the source to your SQL dataset and use the following query, replacing the table name if needed. Select Preview data to see the results.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_DimProduct');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, 'all')

  9. In the Sink tab, select your CSV blob dataset.
  10. You can now debug the main pipeline and check your storage account to confirm that the rows captured by CDC were moved. A quick LSN sanity check follows below if nothing shows up.
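
If the debug run copies no rows, a useful sanity check (a sketch, assuming the default dbo_DimProduct capture instance from this series) is to confirm that the capture instance exists and has valid LSN boundaries, which also tells you whether the SQL Server Agent capture job has picked anything up:

-- Validity range for the capture instance; a missing or misspelled instance typically yields an empty min LSN
SELECT sys.fn_cdc_get_min_lsn('dbo_DimProduct') AS min_lsn,
       sys.fn_cdc_get_max_lsn() AS max_lsn;

-- List every capture instance in the database
SELECT capture_instance, create_date
FROM cdc.change_tables;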

Streaming ETL with Azure Data Factory and CDC – Creating a Data Source Connection in Azure Data Factory http://jackofalltradesmasterofsome.com/blog/2021/01/27/streaming-etl-with-azure-data-factory-and-cdc-creating-a-data-source-connection-in-azure-data-factory/ Wed, 27 Jan 2021 01:20:05 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Creating a Data Source Connection in Azure Data Factory. This is Part 5; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

  1. Create a new dataset as a SQL Server type.
  2. Name it DimProperty and select your integration runtime for the local SQL Server. For the table, select the CDC table.
  3. Create another new dataset, and this time select Azure Blob Storage with the DelimitedText format.
  4. Name it csv_DimProperty and select New linked service.
  5. Name the blob storage linked service "DataLake" to match your storage account and point it to your storage account in the subscription.
  6. Select the "datalake" container from the file path section (or type it in), set "First row as header", and select OK to complete.
  7. We should now have our two datasets in the Resources section for the property CDC transfer. Select Publish all at the top to save your changes.
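
If you are not sure which change table to point the SQL dataset at, the CDC metadata can be queried directly in SQL Server Management Studio. This is a small sketch against the standard CDC system objects; nothing in it is specific to this setup:

-- Each row is one CDC-enabled table; its change table is named cdc.<capture_instance>_CT
SELECT capture_instance,
       OBJECT_SCHEMA_NAME(source_object_id) AS source_schema,
       OBJECT_NAME(source_object_id) AS source_table,
       create_date
FROM cdc.change_tables;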

Streaming ETL with Azure Data Factory and CDC – Provisioning Azure Blob Storage http://jackofalltradesmasterofsome.com/blog/2021/01/25/streaming-etl-with-azure-data-factory-and-cdc-provisioning-provisioning-azure-blob-storage/ Mon, 25 Jan 2021 02:28:39 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Provisioning Azure Blob Storage. This is Part 4; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox.

Now back to the article…

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

Provisioning Azure Blob Storage

  1. For this section, we are assuming you have already signed up for a Microsoft Azure account. If you have not, navigate to https://azure.microsoft.com/ to get your free account.
  2. Open the Azure Portal and provision a new storage account.
  3. Give the storage account a resource group and a name, and leave the rest as is. Select Review + create to complete setup.
  4. Once the storage account is set up, go to the resource, select "Storage Explorer", and create a new blob container called datalake.

Streaming ETL with Azure Data Factory and CDC – Provisioning Azure Data Factory http://jackofalltradesmasterofsome.com/blog/2021/01/25/streaming-etl-with-azure-data-factory-and-cdc-provisioning-azure-data-factory/ Mon, 25 Jan 2021 02:23:08 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Provisioning Azure Data Factory. This is Part 3; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

Provisioning Azure Data Factory

  1. For this section, we are assuming you have already signed up for a Microsoft Azure account. If you have not, navigate to https://azure.microsoft.com/ to get your free account.
  2. Open the Azure Portal and provision a new Data Factory.
  3. Set up a resource group, and pick a data factory name that has not been used before. Select Review + create to complete setup of Azure Data Factory.

Streaming ETL with Azure Data Factory and CDC – Setting up Audit Tables http://jackofalltradesmasterofsome.com/blog/2021/01/25/streaming-etl-with-azure-data-factory-and-cdc-setting-up-audit-tables/ Mon, 25 Jan 2021 02:16:18 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Setting up Audit Tables. This is Part 2; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

Setting up Audit Tables

  1. An audit table is required to track what has been loaded and sent to the streaming process of the ETL. The CREATE TABLE statement below creates the offset table, and the INSERT that follows seeds it with its first row of data.

/********Code************/

-- Create separate offset table to manage the last position/row sent to Azure
CREATE TABLE [dbo].[Audit_Streaming_ETL]
(
    [TableName] [varchar](50) NOT NULL,
    [MaxVal] [binary](10) NOT NULL,
    [LastUpdateDateTime] [datetime] NOT NULL DEFAULT getdate(),
    [LastCheckedDateTime] [datetime] NOT NULL DEFAULT getdate(),
    CONSTRAINT [PK_Audit_Streaming_ETL] PRIMARY KEY NONCLUSTERED
    (
        [TableName] ASC
    )
)
GO

INSERT INTO [dbo].[Audit_Streaming_ETL]
SELECT '', 0x00000000000000000000, '1900-01-01 00:00:00', '1900-01-01 00:00:00'
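
This series does not show the update side of the audit table, so the following is only a hypothetical sketch of how a load step might stamp the offset table after shipping a batch, recording the last LSN it sent for a given table; the TableName value and the choice of LSN are assumptions for illustration:

-- Hypothetical: record the highest LSN shipped for the DimProduct capture instance
UPDATE [dbo].[Audit_Streaming_ETL]
SET [MaxVal] = sys.fn_cdc_get_max_lsn(),      -- or the max __$start_lsn actually sent
    [LastUpdateDateTime] = getdate(),
    [LastCheckedDateTime] = getdate()
WHERE [TableName] = 'dbo_DimProduct';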

Streaming ETL with Azure Data Factory and CDC – Enabling CDC http://jackofalltradesmasterofsome.com/blog/2021/01/25/streaming-etl-with-azure-data-factory-and-cdc-enabling-cdc/ Mon, 25 Jan 2021 02:09:32 +0000

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Enabling CDC. This is Part 1; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video for downloading and restoring the database.

Enabling CDC on SQL Server

  1. First, enable Change Data Capture at the database and table level using the following scripts. More information is available on the Microsoft site. If you are not using SQL Server or a tool that has native CDC built in, a similar process can be hand-built as a read-only view by leveraging timestamps on last created date and last updated date.

/********Code**********/

EXECUTE sys.sp_cdc_enable_db
GO

EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'dbo'
  , @source_name = N'DimProduct'
  , @role_name = N'cdc_Admin';
GO

  2. This step enables CDC on the database and adds it to the table "DimProduct". SQL Server Agent must be running, as two jobs are created during this process: one to load the change table and one to clean it out. This is an additional process that will consume resources, so be sure to do your research before turning it on for all tables. A quick verification sketch follows this step.
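
To confirm the enablement worked, a quick check against the standard SQL Server catalog views looks like this; only the table name is specific to this series:

-- Is CDC enabled at the database level?
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();

-- Is the table tracked by CDC?
SELECT name, is_tracked_by_cdc FROM sys.tables WHERE name = 'DimProduct';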
  3. Once CDC is enabled, SQL Server automatically creates a change table named "cdc.schema_tablename_CT" under the System Tables folder. This table will now automatically track all data changes that occur on the source table. You can reference the __$operation code to determine what change occurred, using the legend below. If you wish to capture changes to only certain columns, see the Microsoft documentation on CDC to see how that can be set. If you are handwriting your SQL, this can also be programmed in when building your staging query.

1 = delete
2 = insert
3 = update (captured column values are those before the update operation; this value applies only when the row filter option 'all update old' is specified)
4 = update (captured column values are those after the update operation)
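
As an illustration of using those codes, the sketch below (against the dbo_DimProduct capture instance used in this series) returns only inserts and after-update images, which is often what a downstream sink cares about; deletes would be __$operation = 1:

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_DimProduct');
SET @to_lsn = sys.fn_cdc_get_max_lsn();
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, 'all')
WHERE __$operation IN (2, 4);  -- 2 = insert, 4 = update (after image)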

  4. Now, once you add, edit, or delete a record, you should be able to find it in the new CDC table. Next, we will look at scanning this table and turning the data into JSON to send to an Event Hub. Run the code below to insert into and then update the DimProduct table.

Insert

INSERT INTO [dbo].[DimProduct]
           ([ProductAlternateKey]
           ,[ProductSubcategoryKey]
           ,[WeightUnitMeasureCode]
           ,[SizeUnitMeasureCode]
           ,[EnglishProductName]
           ,[SpanishProductName]
           ,[FrenchProductName]
           ,[StandardCost]
           ,[FinishedGoodsFlag]
           ,[Color]
           ,[SafetyStockLevel]
           ,[ReorderPoint]
           ,[ListPrice]
           ,[Size]
           ,[SizeRange]
           ,[Weight]
           ,[DaysToManufacture]
           ,[ProductLine]
           ,[DealerPrice]
           ,[Class]
           ,[Style]
           ,[ModelName]
           ,[LargePhoto]
           ,[EnglishDescription]
           ,[FrenchDescription]
           ,[ChineseDescription]
           ,[ArabicDescription]
           ,[HebrewDescription]
           ,[ThaiDescription]
           ,[GermanDescription]
           ,[JapaneseDescription]
           ,[TurkishDescription]
           ,[StartDate]
           ,[EndDate]
           ,[Status])
     VALUES
           ('VV-2903'
           ,NULL
           ,NULL
           ,NULL
           ,'Test Product'
           ,'Test Product'
           ,'Test Product'
           ,0
           ,0
           ,'Black'
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL
           ,NULL)
GO

Update

UPDATE [dbo].[DimProduct]
SET FinishedGoodsFlag = 1
WHERE [ProductAlternateKey] = 'VV-2903'

  5. Now you can query the table "cdc.dbo_DimProduct_CT" to see the changes that were recorded, as in the sketch below.
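
For example, after running the insert and update above, a direct query of the change table should show one row for the insert (operation 2) and a before/after pair for the update (operations 3 and 4). The column list below is just a sketch; any captured column from DimProduct can be included:

SELECT __$start_lsn, __$operation, [ProductAlternateKey], [EnglishProductName], [FinishedGoodsFlag]
FROM cdc.dbo_DimProduct_CT
ORDER BY __$start_lsn, __$seqval;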
