Big Data – jack of all trades master of some

Visual Studio Database projects to Deploy Azure Synapse Pool

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox!

Now back to the article…

Get Visual Studio 2019

  1. Download and install Visual Studio 2019 Community Edition from https://visualstudio.microsoft.com/
  2. Verify the Data storage and processing workload is installed and that all updates are applied.

Create the database project

  1. Create a new project in Visual Studio.
  2. Choose the SQL Server Database Project template.
  3. To add your first item, right-click the new project in Solution Explorer and select Add -> New Item.
  4. From the list of items, select “Table (Data Warehouse)”, as this template generates create table statements with the distribution and columnstore index options used by Synapse.
  5. Add your code to the editor (see the sketch after this list for an example). Some items may still show errors in the editor, but they will not block the build. Save when ready.
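As a rough sketch, a data warehouse table script in the project might look like the following (the table and column names are hypothetical, and the distribution choice is an assumption you should adapt to your workload):

CREATE TABLE [dbo].[FactSales]
(
    [SaleId]     INT           NOT NULL,
    [ProductKey] INT           NOT NULL,
    [SaleAmount] DECIMAL(18,2) NOT NULL,
    [SaleDate]   DATE          NOT NULL
)
WITH
(
    DISTRIBUTION = HASH([ProductKey]),
    CLUSTERED COLUMNSTORE INDEX
);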

Update the Target Platform

  1. Right-click the project and select Properties.
  2. Set the Target platform to Microsoft Azure SQL Data Warehouse and save.

Publishing Changes to Server

  1. Right-click the project in Solution Explorer and select “Publish”.
  2. Select your Azure SQL Data Warehouse (Synapse dedicated SQL pool) as the target database connection and select “Publish”.
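Once the publish completes, you can confirm the deployment from the dedicated SQL pool itself; a minimal check, assuming the hypothetical FactSales table from the sketch above:

SELECT name, type_desc
FROM sys.tables
WHERE name = 'FactSales';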

Automatically Pausing and Resuming an Azure Synapse Workspace Pool Using Azure Data Factory

  1. Create a new Azure Data Factory pipeline.
  2. Add the Web activity and name it “PauseDW”.
  3. Set the method to “POST” and point the URL at the Synapse pool’s pause endpoint (a sketch of the call is shown after this list).
  4. In the advanced section, set the authentication to “MSI” and the resource to “https://management.core.windows.net”.
  5. In the Access control (IAM) settings for your Synapse workspace, assign the Contributor role to your data factory’s managed identity.
  6. Debug your pipeline and your pool should now pause. Replace the word “pause” in the API call with “resume” to have it work the other way around.
  7. Attach the pipeline to triggers that run at specific times of day to turn your resources on and off automatically.
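For reference, a sketch of the management REST call the Web activity would target, with placeholder subscription, resource group, workspace, and pool names (the api-version shown is an assumption and may differ for your environment):

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Synapse/workspaces/{workspaceName}/sqlPools/{sqlPoolName}/pause?api-version=2021-06-01

Swapping the trailing /pause for /resume gives the matching resume call.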

Looping SQL Tables to Data Lake in Azure Data Factory

When you load data from SQL Server, it is best to use one dynamic, metadata-controlled process instead of building individual pipelines per table. Learn how to loop through SQL tables dynamically to load from SQL Server to Azure Data Lake.

Setting up the ETLControl Database

Create the database ETLControl and the table to store the metadata for the ETL runs.

USE [ETLControl]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[ETLControl](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[DatabaseName] [varchar](50) NOT NULL,
	[SchemaName] [varchar](50) NOT NULL,
	[TableName] [varchar](50) NOT NULL,
	[LoadType] [varchar](50) NOT NULL
) ON [PRIMARY]
GO


Insert Into [dbo].[ETLControl]
Select 'DatabaseName', 'dbo', 'TableName1', 'Full'

Insert Into [dbo].[ETLControl]
Select 'DatabaseName', 'dbo', 'TableName2', 'Full'

Setting up Azure Data Factory

  1. Create a linked service to the SQL database.
  2. Create a dataset for the ETLControl database.
    • Point it to the linked service for SQL Server.
    • Do not assign it a table name. This will be done dynamically later.
  3. Add a new pipeline with a Lookup activity (named “Get-Tables” in the expression below).
    • Set the source query to “Select * From ETLControl”.
  4. Add a ForEach loop.
    • In the settings, add the dynamic Items expression “@activity('Get-Tables').output.value”.
  5. Add a new dataset for the SQL source.
    • Give it parameters for TableName and SchemaName.
    • Update the table to use the parameters.
  6. Add a new dataset for the Data Lake destination.
    • Give it a parameter for FileName.
    • Update the file path to use the dynamic content parameter.
  7. Add a Copy activity inside the ForEach loop.
    • Set the source using the values from the Lookup (see the sketch after this list).
    • Set the sink file name from the Lookup value, with a .csv extension.
  8. Debug the pipeline to see new files land in the Data Lake with dynamic names. There should be one file for each table that was loaded. You can extend the file names to include folder names and more dynamic storage paths if needed.
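As a rough sketch of the dynamic content inside the ForEach (the parameter names follow the ETLControl columns above and are assumptions you should match to your own datasets), the Copy activity source query and sink file name might look like this:

Source query:
@concat('SELECT * FROM [', item().SchemaName, '].[', item().TableName, ']')

Sink FileName parameter:
@concat(item().TableName, '.csv')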

Streaming ETL with Azure Data Factory and CDC – Creating the Rolling ETL Window

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Creating the Rolling ETL Window. This is Part 8; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video on downloading and restoring the database.

Creating the Rolling ETL Window

Now that we have our parameter-driven pipeline, we can create a trigger that uses a rolling time window to run intermittently and pick up changes.

  1. Click Add Trigger -> New.
  2. Create a new trigger of type Tumbling window and set it to start at a time in the future.
  3. On the following screen, set the start and end date parameters to the expressions below (a sketch of the full parameter mapping follows them).

@formatDateTime(trigger().outputs.windowStartTime,'yyyy-MM-dd HH:mm:ss.fff')

@formatDateTime(trigger().outputs.windowEndTime,'yyyy-MM-dd HH:mm:ss.fff')
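As a sketch of how these line up (assuming the triggerStartTime and triggerEndTime pipeline parameters from the parameter-driven pipeline post), the trigger's pipeline parameter mapping would be:

triggerStartTime: @formatDateTime(trigger().outputs.windowStartTime,'yyyy-MM-dd HH:mm:ss.fff')
triggerEndTime:   @formatDateTime(trigger().outputs.windowEndTime,'yyyy-MM-dd HH:mm:ss.fff')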

Streaming ETL with Azure Data Factory and CDC – Create a Parameter Driven Pipeline

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Create a Parameter Driven Pipeline. This is Part 7; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video on downloading and restoring the database.

The previous step pulls all of the changes in the CDC table, but we do not want to reprocess the full history on every run. So let's look at creating a rolling window for the CDC ETL.

  1. Navigate to the parameters section of the pipeline and add two parameters, “triggerStartTime” and “triggerEndTime”, setting them to yesterday’s and today’s dates in the format “2020-01-07 12:00:00.000”.
  2. On the Lookup activity, update the query in the settings to the following to use the new parameters. SQL Server Agent must be running for this step, and the parameters must be valid dates.

@concat('DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = ''',pipeline().parameters.triggerStartTime,''';
SET @end_time = ''',pipeline().parameters.triggerEndTime,''';
SET @from_lsn = sys.fn_cdc_map_time_to_lsn(''smallest greater than or equal'', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than'', @end_time);
SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct (@from_lsn, @to_lsn, ''all'')')

  3. Navigate back to the “True” condition and paste in the following query, which tracks the changes using the same parameters.

@concat('DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
SET @begin_time = ''',pipeline().parameters.triggerStartTime,''';
SET @end_time = ''',pipeline().parameters.triggerEndTime,''';
SET @from_lsn = sys.fn_cdc_map_time_to_lsn(''smallest greater than or equal'', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn(''largest less than'', @end_time);
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, ''all'')')

  4. Edit the sink dataset from the Sink tab in the True condition and click on Parameters.
  5. Add a new parameter called triggerStart.
  6. Head back to the Connection tab for the dataset, where we will add dynamic content for the directory and file.
  7. Add the following for the directory and file sections.

Directory

@concat('dimProduct/incremental/',formatDateTime(dataset().triggerStart,'yyyy/MM/dd'))

File

@concat(formatDateTime(dataset().triggerStart,'yyyyMMddHHmmssfff'),'.csv')

  8. Navigate back to the Sink in the Copy activity and expand the dataset properties. Add the dynamic content for the new parameter (see the sketch below).
  9. You can now trigger your run and see the new files landing in the data lake.
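A minimal sketch of that dataset property assignment, assuming the parameter names used above (the triggerStart dataset parameter simply receives the pipeline parameter):

triggerStart: @pipeline().parameters.triggerStartTime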

Streaming ETL with Azure Data Factory and CDC – Creating an Incremental Pipeline in Azure Data Factory

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Creating an Incremental Pipeline in Azure Data Factory. This is Part 6; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox!

Now back to the article…

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video on downloading and restoring the database.

Creating an Incremental Pipeline in Azure Data Factory

  1. Create a new pipeline and name it “Incremental_Load”.
  2. From the Activities pane, drag the Lookup activity onto the canvas. This task will be used to track the new changes in the CDC table for a certain time frame. Name the Lookup “GetChangeCount”.
  3. In the settings, add the custom query below, replacing the capture instance name with your own if needed. The query “SELECT capture_instance FROM cdc.change_tables” will list the names of the CDC capture instances and can be tested in SQL Server Management Studio.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_DimProduct');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
SELECT count(1) changecount FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, 'all')

  4. Preview data will show the result of this query, which is how many changes have been recorded in the CDC table.
  5. In the Activities pane, expand Iteration & Conditionals and add the If Condition to the flow.
  6. Name it “HasChangedRow”, set its expression to “@greater(int(activity('GetChangeCount').output.firstRow.changecount),0)”, and select the pencil next to the True condition.
  7. Add the Copy activity from the Move & Transform section and name it “Copy Incremental Data”.
  8. In the Source tab, set the source to your SQL dataset and use the following query, replacing the capture instance name if needed. Select Preview to see the results.

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_DimProduct');
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE());
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_DimProduct(@from_lsn, @to_lsn, 'all')

  9. In the Sink tab, select your CSV Blob dataset.
  10. You can now debug the main pipeline and check your storage account to confirm the rows captured by CDC were moved (a quick sanity check is sketched below).
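If no file shows up, you can sanity-check that CDC has actually captured rows by querying the change table directly in SQL Server Management Studio; a minimal sketch, assuming the dbo_DimProduct capture instance used above:

SELECT count(*) AS captured_rows
FROM cdc.dbo_DimProduct_CT;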

Streaming ETL with Azure Data Factory and CDC – Creating a Data Source Connection in Azure Data Factory

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Creating a Data Source Connection in Azure Data Factory. This is Part 5; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video on downloading and restoring the database.

  1. Create a new dataset as a SQL Server dataset.
  2. Name it DimProperty and select your self-hosted integration runtime for the local SQL Server. For the table, select the CDC table.
  3. Create a new dataset, and this time select Azure Blob Storage with the DelimitedText format.
  4. Name it csv_DimProperty and select New Linked Service.
  5. Name the blob storage linked service “DataLake” to match your storage account and point it to your storage account in the subscription.
  6. Select the “datalake” container from the file path section (or type it in), set first row as header, and select OK to complete.
  7. We should now have our two datasets in the Resources section for the property CDC transfer. Select Publish All at the top to save your changes (a quick check that CDC is enabled on the source table is sketched below).
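As a quick check that CDC is enabled on the source table, you can query the catalog; a minimal sketch, assuming the DimProduct table used elsewhere in this series (swap in your own table name):

SELECT name, is_tracked_by_cdc
FROM sys.tables
WHERE name = 'DimProduct';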

Streaming ETL with Azure Data Factory and CDC – Provisioning Azure Blob Storage

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Provisioning Azure Blob Storage. This is Part 4; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox!

Now back to the article…

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video on downloading and restoring the database.

Provisioning Azure Blob Storage

  1. For this section, we are assuming you have already signed up for a Microsoft Azure account. If you have not, navigate to https://azure.microsoft.com/ to get your free account.
  2. Open the Azure Portal and provision a new storage account.
  3. Give the storage account a resource group and a name, and leave the rest as is. Select Review + Create to complete the setup.
  4. Once the storage account is set up, go to the resource, select “Storage Explorer”, and create a new blob container called datalake.

Streaming ETL with Azure Data Factory and CDC – Provisioning Azure Data Factory

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Provisioning Azure Data Factory. This is Part 3; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video on downloading and restoring the database.

Provisioning Azure Data Factory

  1. For this section, we are assuming you have already signed up for a Microsoft Azure account. If you have not, navigate to https://azure.microsoft.com/ to get your free account.
  2. Open the Azure Portal and provision a new Data Factory.
  3. Set up a resource group, and pick a factory name that has not been used before (it must be globally unique). Select Review + Create to complete the setup of Azure Data Factory.

Streaming ETL with Azure Data Factory and CDC – Setting up Audit Tables

In this series we look at building a streaming ETL with Azure Data Factory and CDC – Setting up Audit Tables. This is Part 2; the rest of the series is below.

  1. Enabling CDC
  2. Setting up Audit Tables
  3. Provisioning Azure Data Factory
  4. Provisioning Azure Blob Storage
  5. Create Data Source Connection in ADF
  6. Create Incremental Pipeline in ADF
  7. Create a Parameter Driven Pipeline
  8. Create a Rolling Trigger

This series uses the AdventureWorks database. For more information on how to get that set up, see my YouTube video on downloading and restoring the database.

Setting up Audit Tables

  1. Audit tables are required to track what has been loaded and sent to the streaming process of the ETL. The CREATE statement below builds a separate offset table, and the INSERT that follows seeds it with an initial row of data.

/********Code************/

-- Create separate offset table to manage the last position/row sent to Azure

CREATE TABLE [dbo].[Audit_Streaming_ETL]
(
 [TableName] [varchar](50) NOT NULL,
 [MaxVal] [binary](10) NOT NULL,
 [LastUpdateDateTime] [datetime] NOT NULL DEFAULT getdate(),
 [LastCheckedDateTime] [datetime] NOT NULL DEFAULT getdate(),
 CONSTRAINT [PK_Audit_Streaming_ETL] PRIMARY KEY NONCLUSTERED
 (
  [TableName] ASC
 )
)
GO

INSERT INTO [dbo].[Audit_Streaming_ETL]
SELECT '', 0x00000000000000000000, '1900-01-01 00:00:00', '1900-01-01 00:00:00'
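As a rough sketch of how this offset table gets used later in the process (the capture instance name and the @to_lsn variable are assumptions for illustration), a load step would advance the stored position after a successful transfer:

UPDATE [dbo].[Audit_Streaming_ETL]
SET [MaxVal] = @to_lsn,
    [LastUpdateDateTime] = getdate(),
    [LastCheckedDateTime] = getdate()
WHERE [TableName] = 'dbo_DimProduct';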
