Setting up Azure Data Factory Integration Runtime for On-Prem Connections
http://jackofalltradesmasterofsome.com/blog/2021/01/12/setting-up-azure-data-factory-integration-runtime-for-on-prem-connections/

If you need to connect Azure Data Factory to an on-premises SQL Server, you will need to set up the Azure Data Factory integration runtime for on-prem connections.

Set up On-Prem Integration Run Time

  1. Click on Author & Monitor from Azure Data Factory
  • Create two new linked services
    • Linked Service to the database. Since we are using a local on-prem SQL Server, we will need to create an integration runtime. Click on Manage -> Integration runtimes -> New
  • Select Integration runtime setup followed by Self-Hosted.
  • Name the Runtime Setup and select “Next”
  • On the next step, choose Option 2: Manual Setup, then download and install the integration runtime, or download it from here.
  • Download and install the runtime. The IR should be installed on the same machine where your data is located.
  • Once the installation is done, you will be prompted with the Configuration Manager (Self-Hosted). Enter the keys from your Azure setup and hit Register.
  • Leave the name as is and check the box to “Enable Remote Access from Intranet” if you need it, but it is not necessary.
  • If all is set up correctly, you will get a confirmation message.
  • If you navigate back to your Azure environment, you will now see your new self-hosted integration runtime available.

Connecting to On-Prem SQL Server

  1. From Azure Data Factory, Select Linked Services -> New, and you will now see an option for a SQL Server.
  • Give the connection a name and fill in the rest of the details to line up with your local SQL Server, and set it to use the new integration runtime. If you have not created a SQL Authentication ID before, see the steps at the bottom of this tutorial for details.
  • Be sure to test the connection and hit Create once completed.

Setting up Security on Local Database for Integration Runtime

You will need to create a SQL-authenticated user or service account that Azure Data Factory will use to connect to this database.

  1. Connect to your database and select Security -> New -> Login
  • Create a new login name, set it to SQL Server Authentication, set a password, and on the User Mapping page grant access to AdventureWorksDB as owner (db_owner). An equivalent T-SQL sketch is shown below.
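If you prefer to script this instead of clicking through the SSMS dialogs, a minimal T-SQL sketch might look like the following. The login name and password are placeholders, and the post grants db_owner on AdventureWorksDB; a narrower role such as db_datareader may be enough depending on what your pipelines actually do.

-- Create the SQL-authenticated login used by the ADF integration runtime (placeholder name/password)
CREATE LOGIN adf_ir_user WITH PASSWORD = 'UseAStrongPasswordHere!1';
GO

USE AdventureWorksDB;
GO

-- Map the login to a database user and grant it the role used in this post
CREATE USER adf_ir_user FOR LOGIN adf_ir_user;
ALTER ROLE db_owner ADD MEMBER adf_ir_user;
GO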


Auto Spin up and Down Azure Virtual Machines (VM)
http://jackofalltradesmasterofsome.com/blog/2020/01/02/auto-spin-up-and-down-azure-virtual-machines-vm/

One of the best perks of moving your infrastructure to the cloud is the ability to pay only for the resources you use. With virtual machines, that usage can be limited even further by keeping servers on only when you need them (business hours or nightly data runs) and shutting them down otherwise to minimize costs. Here is a quick guide on how to auto spin up and down Azure Virtual Machines (VM).

Configuration of Automation Accounts

  • From the Azure Portal, navigate to “Automation Accounts”
  • Click “Add” to create a new account and fill in the basic information. Any name, subscription, resource group, and location will do, but keeping these in line with where your VM is located will make things easier. Keep “Create Azure Run As Account” set to “Yes” and select “Create”.

From your new Automation account, select “Runbooks”

  • Select “Browse Gallery” and, from the gallery section, select “Start Azure V2 VMs” followed by “Import” on the next screen.

  • Do the same exercise with “Stop Azure V2 VMs”. Both will now be available in the Runbooks menu.

From the main runbook screen, you can now click on “Edit” and then “Publish”.

Scheduling the Jobs

  • Click on “StartAzureV2Vm” to go to its details. From the menu on the right, select “Schedules” followed by “Add a schedule”.
  • Select “Link a schedule to your runbook” and then “Create a new schedule”. Give the schedule a name, description, and start date. Select “Recurring” for recurrence so that it runs daily, and leave “Set expiration” set to “No”.
  • Select “Parameters and run settings” and then “Configure parameters and run settings”. In the details, use the name of the resource group your VM is located in and the exact name of your virtual machine. Select OK to save and close the windows.

You may need to edit and publish the runbook again, but if not, you can run the job manually to test it or wait until the schedule fires. Repeat the same exercise for the shutdown runbook, giving it a time later than the start schedule previously created, to auto spin up and down Azure Virtual Machines (VM).

For more information, please see the official documentation from Microsoft.

https://docs.microsoft.com/en-us/azure/automation/automation-runbook-types

Streaming ETL using CDC and Azure Event Hub. A Modern Data Architecture.
http://jackofalltradesmasterofsome.com/blog/2019/03/21/modern-data-architecture-streaming-etl-using-cdc-and-azure-event-hub/

Data warehouses have gotten bigger and faster, and big data technology has allowed us to store vast amounts of data, yet it is still strange to me that most data warehouse refresh processes found in the wild are some form of batch processing. Even Hive queries against massive Hadoop infrastructures are essentially fast-performing batch queries. Sure, they may run every half day or even every hour, but the speed of business continues to accelerate, and we must start looking at architectures that combine the speed and transactional processing of Kafka/Spark/Event Hubs to create a real-time streaming ETL that loads a data warehouse at a cost comparable to, or even cheaper than, purchasing an ETL tool. Let’s look at Streaming ETL using CDC and Azure Event Hub.

Interested in learning more about Modern Data Architecture?

Side note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes delivered daily right to your email inbox!

Now back to the article…

For this series, we will be looking at Azure Event Hubs or IoT Hubs. These were designed to capture fast streaming data of millions of rows from IoT devices or feeds like Twitter. But why should this tool be limited to those use cases? Most businesses have no need for that scale, but we can use the same technology to create a live streaming ETL into your data warehouse or reporting environment without sacrificing performance or putting a strain on your source systems. This architecture can be used to perform data synchronization between systems and other integrations as well, and since we are not using it to its full potential of capturing millions of flowing records, our costs end up being pennies a day!

Others have emulated this sort of process by using triggers on their source tables, but that can add an extra step of processing and overhead to your database. Enabling change data capture natively on SQL Server is much lighter than a trigger, and it lets you take the first steps toward creating a streaming ETL for your data. If CDC is not available, simple staging scripts can be written to emulate the same behavior, but be sure to keep an eye on performance. Let’s take a look at the first step of setting up native Change Data Capture on your SQL Server tables.
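As a rough illustration of that fallback, a hand-rolled staging query can lean on last-created and last-updated timestamps. This is only a sketch: the table and column names (dbo.Task, CreatedDate, UpdatedDate) and the watermark variable are hypothetical, and unlike CDC this approach cannot see deleted rows without extra bookkeeping.

-- Watermark from the previous load (would normally be read from a control table)
DECLARE @LastRunTime datetime2 = '2019-03-01T00:00:00';

SELECT  t.*,
        CASE WHEN t.CreatedDate > @LastRunTime THEN 'insert' ELSE 'update' END AS ChangeType
FROM    dbo.Task AS t
WHERE   t.CreatedDate > @LastRunTime
   OR   t.UpdatedDate > @LastRunTime;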

Steps

  1. First, enable Change Data Capture at the database and table level using the following scripts. More information is available on the Microsoft site. If you are not using SQL Server or a tool that has native CDC built in, a similar process can be hand built as a read-only view by leveraging timestamps on last-created and last-updated dates (see the sketch above).

EXECUTE sys.sp_cdc_enable_db;
GO

EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'dbo'
  , @source_name   = N'Task'
  , @role_name     = N'cdc_Admin';
GO

This step will enable CDC on the database as well as add it to the table “Task”. SQL Server Agent must be running, as two jobs are created during this process: one to load the change table and one to clean it out.
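To confirm that CDC is actually on, you can run a quick sanity check against the catalog views (the table name matches the example above):

-- Is CDC enabled on the current database?
SELECT name, is_cdc_enabled
FROM   sys.databases
WHERE  name = DB_NAME();

-- Is the Task table being tracked?
SELECT name, is_tracked_by_cdc
FROM   sys.tables
WHERE  name = 'Task';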

Once CDC is enabled, SQL Server will automatically create a change table named “cdc.schema_tablename_CT” in the System Tables section. This table will automatically track all data changes that occur on the source table. You can reference the __$operation code to determine what change occurred using the legend below. If you wish to capture changes to only certain fields, see the Microsoft documentation on CDC to see how that can be set. If you are hand writing your SQL, this can also be programmed in when building your staging query.

1 = delete
2 = insert
3 = update (captured column values are those before the update operation). This value applies only when the row filter option ‘all update old’ is specified.
4 = update (captured column values are those after the update operation)

Now once you add, edit or delete a record, you should be able to find it in the new CDC table. Next, we will look at scanning this table and turning the data to JSON to send to an Event Hub!
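As a preview of that next step, the snippet below is a minimal sketch of reading the captured changes through the CDC table-valued functions and shaping the rows as JSON with FOR JSON. It assumes the default capture instance name dbo_Task created by the script above; in a real pipeline the @from_lsn watermark would come from the previous run rather than the minimum LSN.

-- Read all changes captured for the dbo.Task table and emit them as JSON
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Task');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Task(@from_lsn, @to_lsn, N'all')
FOR JSON PATH;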

For more information on SQL CDC please see their documentation here.

Be sure to check out my full online class on the topic: a hands-on walk-through of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, learn the basics, from navigating the Azure Portal to building an end-to-end modern data warehouse using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse, and Power BI. A link to the class can be found here.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Data Bricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into Azure Data Synapse Data Warehouse

The Modern Data Warehouse; Azure Data Lake and U-SQL to combine data
http://jackofalltradesmasterofsome.com/blog/2018/12/17/the-modern-data-warehouse-azure-data-lake-and-u-sql-to-combine-data/

The modern data warehouse will need to use Azure Data Lake and U-SQL to combine data. Begin by navigating to your Azure Portal and searching for the Data Lake Analytics resource. Let’s start by creating a new Data Lake. Don’t worry: this service only charges for data in and out rather than for simply staying on like an HDInsight cluster, so you should not be charged anything, and we will not need to spin services up and down like we did earlier.

You will need to give it a unique name and tie it to a pay-as-you-go subscription. A Data Lake Analytics account also needs a Data Lake storage layer. You can keep the naming as is or rename it if you wish, and it is recommended to keep the storage encrypted. Deployment may take a few minutes.

Let’s navigate to the Data Explorer and create a new folder for our sample data, then upload the same 3 files from the happiness index to the new folder.

Take a look through your Data Explorer at the tables. It should contain a master database, similar to what you would see on a traditional SQL Server.

Create a new U-SQL Job that creates the new database in the catalog as well as a schema and a table.

CREATE DATABASE IF NOT EXISTS testdata;

USE DATABASE testdata;

CREATE SCHEMA IF NOT EXISTS happiness;

CREATE TABLE happiness.placeholderdata
(
    Region string,
    HappinessRank float,
    HappinessScore float,
    LowerConfidenceInterval float,
    UpperConfidenceInterval float,
    Economy_GDPperCapita float,
    Family float,
    Health_LifeExpectancy float,
    FreedomTrust_GovernmentCorruption float,
    GenerosityDystopiaResidual float,
    INDEX clx_Region CLUSTERED(Region ASC) DISTRIBUTED BY HASH(Region)
);

While the job runs, and once it completes, Azure will present the execution tree. With the new table created, we can now load data into it.

USE DATABASE testdata;

@log =
    EXTRACT Region string,
            HappinessRank float,
            HappinessScore float,
            LowerConfidenceInterval float,
            UpperConfidenceInterval float,
            Economy_GDPperCapita float,
            Family float,
            Health_LifeExpectancy float,
            FreedomTrust_GovernmentCorruption float,
            GenerosityDystopiaResidual float
    FROM "/sampledata/{*}.csv"
    USING Extractors.Text(delimiter: ',', silent: true);

INSERT INTO happiness.placeholderdata
SELECT * FROM @log;

Once the data is loaded, we can query it in the Data Explorer to see what it looks like. The script below runs the select and outputs the data to a file that can be browsed.

@table = SELECT * FROM [testdata].[happiness].[placeholderdata];

OUTPUT @table
    TO "/OUTPUTS/Sampledataquery.csv"
    USING Outputters.Csv();
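As a small variation, and only a sketch against the same table (the output path and the HappinessScore threshold below are arbitrary choices), you can filter in the script and emit a header row:

@table =
    SELECT Region, HappinessRank, HappinessScore
    FROM [testdata].[happiness].[placeholderdata]
    WHERE HappinessScore > 6;

OUTPUT @table
    TO "/OUTPUTS/HighHappiness.csv"
    USING Outputters.Csv(outputHeader: true);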



Be sure to check out my full online class on the topic: a hands-on walk-through of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, learn the basics, from navigating the Azure Portal to building an end-to-end modern data warehouse using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse, and Power BI. A link to the class can be found here or directly here.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Data Bricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into Azure Data Synapse Data Warehouse



The Modern Data Warehouse; Running Hive Queries in Visual Studio to combine data
http://jackofalltradesmasterofsome.com/blog/2018/12/04/the-modern-data-warehouse-running-hive-queries-in-visual-studio-to-combine-data/

In previous posts we have looked at storing data files in blob storage and using PowerShell to spin up an HDInsight Hadoop cluster. We have also installed some basic software that will help us get going once the services are provisioned. Now that the basics are ready, it is time to process some of that data using Hive and Visual Studio. In this scenario, we will load our Happiness Index data files into Hive tables and then consolidate that data into a single file.

For this tutorial we are going to leverage Azure Storage Explorer to view our storage files as well as create folders. You can use this tool, as well as the shell, to create new folders and upload your data files. Note that a new folder will not actually be created until you upload files into it using this tool. In a real-world scenario, you would use a file transfer task or Data Factory to stage your files.

Open the tool and sign in to your Azure account. Once signed in, navigate to your HDInsight cluster’s storage, create a new folder in your Hive storage called “data”, and upload all 3 of our files to this location.

Open Visual Studio with the newly installed “Azure Data Lake Tools”. If you look at Server Explorer and ensure you are signed in with the same Azure account your cluster is located in, you will see the cluster listed in the menu.

You can create a new table directly from the server explorer. In the tool you can either script or use the wizard to create the new table. Create it so that it has the same column names as the file. For the example below we will just run the create table and the load table step as one Hive script. Update the script to load the 2016 and the 2017 files as well.

CREATE TABLE IF NOT EXISTS default.sourcedata2015(
       Country string,
       Region string,
       HappinessRank float,
       HappinessScore float,
       LowerConfidenceInterval float,
       UpperConfidenceInterval float,
       Economy_GDPperCapita float,
       Family float,
       Health_LifeExpectancy float,
       FreedomTrust_GovernmentCorruption float,
       GenerosityDystopiaResidual float)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE;

LOAD DATA INPATH '/data/2015.csv' INTO TABLE sourcedata2015;

Once you have created your 3 new Hive tables, it is time to consolidate them into a single result using a simple SQL-like UNION statement and adding a year column to the end.

SELECT * FROM
(
    SELECT *, '2015' AS Year FROM sourcedata2015 b
    UNION ALL
    SELECT *, '2016' AS Year FROM sourcedata2016 c
    UNION ALL
    SELECT *, '2017' AS Year FROM sourcedata2017 d
) CombinedTable;

Once the job is complete, you should be able to view the results by clicking “Job Output”. All we have done here is create a select statement, but this data could also have been inserted into a new Hive table for either more processing or a push to a target data warehouse, as sketched below.
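For example, a minimal sketch of persisting the combined result into a new Hive table might use a CREATE TABLE ... AS SELECT. The table name combined_happiness and the data_year alias below are hypothetical choices, not part of the original walkthrough:

CREATE TABLE IF NOT EXISTS default.combined_happiness
STORED AS TEXTFILE
AS
SELECT * FROM
(
    SELECT *, '2015' AS data_year FROM sourcedata2015
    UNION ALL
    SELECT *, '2016' AS data_year FROM sourcedata2016
    UNION ALL
    SELECT *, '2017' AS data_year FROM sourcedata2017
) src;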

I hope this introduction helps you get off the ground with the basics of running Hive queries. In later posts we will look at how HDInsight intersects with Azure Data Lake and Data Factory. Before you leave, be sure to delete your HDInsight cluster. Since this service runs on an hourly basis and is not billed per job, it will continue to charge your account. You can delete it either via the Azure dashboard or using the PowerShell script from our previous posts.

Setting up tools to work with HDInsights and run Hive Queries – Azure Data Lake Tools and Azure Storage Browser
http://jackofalltradesmasterofsome.com/blog/2018/11/27/setting-up-visual-studio-to-work-with-hdinsights-and-run-hive-queries/

Two tools that are going to make life a bit simpler if you are going to be working with HDInsight and Azure Blob Storage are “Azure Data Lake and Stream Analytic Tools for Visual Studio” and Azure Storage Explorer.

Azure Data Lake and Stream Analytic Tools for Visual Studio

  • To run Hive queries, you’re going to need to add Azure Data Lake and Stream Analytic Tools to your version of Visual Studio, sometimes referred to as HDInsight tools for Visual Studio or Azure Data Lake tools for Visual Studio. You can install them directly from Visual Studio by selecting Tools -> Get Tools and Features.

  • Once you have Visual Studio open, navigate to Server Explorer and verify you are connected to the right Azure subscription. You should now see your cluster information as created in the previous blog post on “Provisioning HDInsight Clusters using PowerShell”.

You should be able to browse and interact with your storage from Visual Studio as well as the Azure dashboard at this point, but another helpful tool to install is “Azure Storage Explorer”. It can be found and installed from the following location.

Storage Explorer can be used to create new folders and upload your data files, and it does make that job a little easier. Please note that when creating new folders, the folder will not actually be created until you upload files into it using this tool.

That should be it! From Server Explorer, you can now see the HDInsight cluster that was spun up as well as the storage accounts where files will be stored. In the next post we will look at building and running Hive queries against your data files to get them ready for reporting.
