Visual Studio Database projects to Deploy Azure Synapse Pool

Side Note: Want to learn SQL or Python for free, in less than 10 minutes a day and less than an hour total? Sign up for my free classes, delivered daily right to your email inbox.

Now back to the article…

Get Visual Studio 2019

  1. Download and install Visual Studio 2019 Community Edition from https://visualstudio.microsoft.com/
  2. During installation, verify that the "Data storage and processing" workload is selected and that all updates are applied.

Create the database project

  1. Create a new project in Visual Studio and choose the SQL Server Database Project template.
  2. To add your first item, right-click the project in the new solution and select Add -> New Item.
  3. From the list of items, select "Table (Data Warehouse)", as this template generates CREATE TABLE statements with the distribution and columnstore index options used by Synapse.
  4. Add your code to the editor; some references may still show errors, but they will not block the build. Save when ready. A minimal example of what such a table script can look like is shown below.
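For reference, here is a minimal sketch of the kind of script the "Table (Data Warehouse)" item produces. The table name, columns, and distribution choice are made up for illustration and should be replaced with your own design:

-- Hypothetical dimension table; names and types are illustrative only
CREATE TABLE [dbo].[DimProperty]
(
    [PropertyKey]  INT           NOT NULL,
    [PropertyName] NVARCHAR(200) NULL,
    [City]         NVARCHAR(100) NULL,
    [LoadDate]     DATETIME2     NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);

A regular SQL Server table template would not accept the WITH (DISTRIBUTION ..., CLUSTERED COLUMNSTORE INDEX) options, which is why the dedicated Data Warehouse item is used here.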

Update the Target Platform

  1. Right-click the database project and select Properties.
  2. Set the Target Platform to Microsoft Azure SQL Data Warehouse and save.

Publishing Changes to Server

  1. Right-click the project in Solution Explorer and select "Publish".
  2. Select your Azure SQL Data Warehouse as the target database and select "Publish".

Setting up tools to work with HDInsight and run Hive Queries – Azure Data Lake Tools and Azure Storage Explorer

Two tools that will make life a bit simpler if you are going to be working with HDInsight and Azure Blob storage are "Azure Data Lake and Stream Analytic Tools for Visual Studio" and "Azure Storage Explorer".

Azure Data Lake and Stream Analytic Tools for Visual Studio

  • To run Hive queries, you will need to add Azure Data Lake and Stream Analytic Tools to your installation of Visual Studio. This component is sometimes referred to as HDInsight Tools for Visual Studio or Azure Data Lake Tools for Visual Studio. You can install it directly from Visual Studio by selecting Tools -> Get Tools and Features.

  • Once you have Visual Studio open, navigate to Server Explorer and verify you are connected to the right Azure subscription. You should now see the cluster created in the previous blog post, "Provisioning HDInsight Clusters using PowerShell".

At this point you should be able to browse and interact with your storage from Visual Studio as well as from the Azure portal, but another helpful tool to install is "Azure Storage Explorer", which can be downloaded from Microsoft's site.

Storage Explorer can be used to create new folders and upload your data files, which makes that job a little easier. Please note that when you create a new folder, the folder does not actually exist in blob storage until you upload files into it.

That should be it! From Server Explorer you can now see the HDInsight cluster that was spun up, as well as the storage accounts where your files will be stored. In the next post we will look at building and running Hive queries against your data files to get them ready for reporting.
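As a small preview of what is coming in that post, here is a minimal HiveQL sketch that defines an external table over delimited files sitting in the cluster's blob storage and runs a simple aggregate. The storage account, container, path, and columns are placeholders for illustration only:

-- Hypothetical external table over CSV files already landed in blob storage
CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
    sale_id   INT,
    sale_date STRING,
    region    STRING,
    amount    DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net/raw/sales/';

-- Simple aggregate to confirm the files are readable
SELECT region, SUM(amount) AS total_sales
FROM sales_raw
GROUP BY region;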

Big Data for The Rest of Us. Affordable and Modern Business Intelligence Architecture – Adding Lifecycles to your S3 buckets to save cost and retain data forever!

I wanted to keep this post short since, as I mentioned in the previous post about cloud storage, our use case is already an affordable one, but it still makes sense to touch on a file movement strategy across storage tiers so we are maximizing cost savings against our baseline requirements. In S3 you can easily define lifecycle rules that move your files from standard storage to infrequent access and eventually to cold storage. The different pricing structures can be found in AWS's documentation.

Let's quickly discuss the different storage classes offered by AWS. For more details, see the AWS documentation.

Standard

Amazon S3 Standard offers high durability, availability, and performance object storage for frequently accessed data. Because it delivers low latency and high throughput, S3 Standard is perfect for a wide variety of use cases including cloud applications, dynamic websites, content distribution, mobile and gaming applications, and Big Data analytics.

Infrequent Access

Amazon S3 Standard-Infrequent Access (S3 Standard-IA) is an Amazon S3 storage class for data that is accessed less frequently but requires rapid access when needed. S3 Standard-IA offers the high durability, high throughput, and low latency of S3 Standard, with a low per-GB storage price and a per-GB retrieval fee. This combination of low cost and high performance makes S3 Standard-IA ideal for long-term storage, backups, and as a data store for disaster recovery.

Glacier (Cold Storage)

Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving. You can reliably store any amount of data at costs that are competitive with or cheaper than on-premises solutions. To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides three options for access to archives, from a few minutes to several hours.

In our basic scenario, where incoming files represent a whole year's worth of data, we can move data older than 3 years to infrequent access storage and data older than 7 years to cold storage. This way, day-to-day analytics runs against S3 Standard, while less frequent, more directional analysis (maybe once a quarter, for questions about company direction) runs against the infrequent access tier. In the rare case that an audit is required, for example during a merger or acquisition, data can be pulled from cold storage to meet those needs as well. This lets you maximize cost savings while keeping an effectively infinite data retention policy.

To set up a lifecycle policy on your data, just follow these few simple steps.

  • From your bucket, navigate to Management -> Add Lifecycle Rule.
  • Give your rule a name and a scope. You can use tags if you need the policy to apply to only a certain set of files.
  • In the Transition and Expiration sections, define when data moves to another tier and, if required, when a file expires. It is a good idea to tag files on upload so the same tags can be reused in lifecycle rules.
  • That's it. You now have an automated policy on your data without an ETL or file-movement task. You can create as many policies as you need, and each can be unique to how you wish to retain your data.

Accelerating the Staging Process for your Data Warehouse

The script below can be used to build a staging environment for any industry, not just real estate databases. The specifics of a real estate data warehouse will be covered in a future blog post; here the focus is on accelerating the staging process for your data warehouse.

When you start capturing data for analytics, whether you plan to eventually build a data warehouse or feed a big data Hadoop cluster, it helps to stage your data away from your source systems. This provides many benefits, the primary one being a copy of the data to work with and process that no longer lives in the transactional operational system. Running large processes or queries against the operational system adds unnecessary risk, can introduce bottlenecks or slowdowns, and may open security holes. With a staging environment, all you need is a single service account, managed by IT security, with read access to the source systems. From there, a scheduled refresh process that loads the data as a blanket truncate and reload is easy to set up. Many ETL tools can be used, and robust auditing and scheduling should eventually be in place, but getting off the ground quickly to prototype and profile your data lets you start providing value to the business much sooner.

For this reason, I wrote the SQL script below a while back to help me on new projects. Running it against a linked server connection or a replicated database quickly builds a staging database, along with procedures that load every table as a truncate and reload. These procedures can then be wrapped in a master SQL procedure and scheduled (see the sketch after the script), giving you a full staging ETL process without needing ETL tools. Remember, this is just an accelerator and will require some tweaking and optimization to get to a final state, but it should get you off the ground with basic SQL-based source systems.

/***********************
Start of Script
************************/

/***********************
Configuration
************************/

DECLARE @sqlCommand varchar(1000)
DECLARE @DatabaseNameStaging varchar(75)
DECLARE @DatabaseNameSource varchar(75)

SET @DatabaseNameStaging = 'Staging'
SET @DatabaseNameSource = 'SourceDB'

-- Add all tables to ignore to this list
IF OBJECT_ID('tempdb..#TablestoIgnore') IS NOT NULL DROP TABLE #TablestoIgnore
CREATE TABLE #TablestoIgnore
(
    TableName varchar(255)
)

INSERT INTO #TablestoIgnore
SELECT ''
--UNION
--SELECT ''

-- Table to store the list of all tables in the source database
IF OBJECT_ID('tempdb..#TableList') IS NOT NULL DROP TABLE #TableList
CREATE TABLE #TableList
(
    TableName varchar(255)
)

/***********************
Create Staging Database
************************/

SET @sqlCommand = 'IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = '''+@DatabaseNameStaging+''')
BEGIN
    CREATE DATABASE '+@DatabaseNameStaging+'
    -- Set logging to simple; staging is fully reloaded, so point-in-time recovery is not needed
    ALTER DATABASE '+@DatabaseNameStaging+' SET RECOVERY SIMPLE
END'

EXEC (@sqlCommand)

/***********************
Get List of All Tables
************************/

SET @sqlCommand = 'INSERT INTO #TableList
SELECT DISTINCT T.name AS TableName
FROM '+@DatabaseNameSource+'.sys.objects AS T
WHERE T.type_desc = ''USER_TABLE''
AND T.name NOT IN (SELECT TableName FROM #TablestoIgnore)
ORDER BY 1'

EXEC (@sqlCommand)

-- Build the drop and create statements for every staging table
SELECT 'IF OBJECT_ID(''' + @DatabaseNameStaging + '.dbo.' + TableName + ''', ''U'') IS NOT NULL DROP TABLE ' + @DatabaseNameStaging + '.dbo.' + TableName + ';' AS DropStatement,
       'SELECT TOP 1 * INTO ' + @DatabaseNameStaging + '.dbo.' + TableName + ' FROM ' + @DatabaseNameSource + '.dbo.' + TableName AS CreateStatement
INTO #DatabaseStatements
FROM #TableList

-- Run drop commands
DECLARE @MyCursor CURSOR;
DECLARE @MyField varchar(500);

BEGIN
    SET @MyCursor = CURSOR FOR
    SELECT DropStatement FROM #DatabaseStatements

    OPEN @MyCursor
    FETCH NEXT FROM @MyCursor INTO @MyField

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC (@MyField)
        FETCH NEXT FROM @MyCursor INTO @MyField
    END;

    CLOSE @MyCursor;
    DEALLOCATE @MyCursor;
END;

-- Run create commands (SELECT TOP 1 * INTO copies each table's structure into staging)
BEGIN
    SET @MyCursor = CURSOR FOR
    SELECT CreateStatement FROM #DatabaseStatements

    OPEN @MyCursor
    FETCH NEXT FROM @MyCursor INTO @MyField

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC (@MyField)
        FETCH NEXT FROM @MyCursor INTO @MyField
    END;

    CLOSE @MyCursor;
    DEALLOCATE @MyCursor;
END;

/***********************
Create All Stored Procedures to Load Staging
*** THIS SECTION MUST BE RUN AGAINST THE STAGING ENVIRONMENT ***
*** This step may result in errors for tables with identity columns; those ETLs will need to be created manually ***
************************/

USE Staging

-- Create one spLoad procedure per table; each procedure truncates and reloads its staging table
BEGIN
    SET @MyCursor = CURSOR FOR
    SELECT TableName FROM #TableList

    OPEN @MyCursor
    FETCH NEXT FROM @MyCursor INTO @MyField

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Drop the procedure if it already exists
        EXEC ('IF OBJECT_ID(''spLoad'+@MyField+''', ''P'') IS NOT NULL DROP PROC spLoad'+@MyField)

        -- Clear the seed row copied in by SELECT TOP 1 * INTO
        EXEC ('TRUNCATE TABLE '+@DatabaseNameStaging+'.dbo.'+@MyField)

        -- CREATE PROCEDURE must be the only statement in its batch, so it gets its own EXEC
        EXEC ('CREATE PROCEDURE dbo.spLoad'+@MyField+'
        AS
        BEGIN
            SET NOCOUNT ON;
            TRUNCATE TABLE '+@DatabaseNameStaging+'.dbo.'+@MyField+';
            INSERT INTO '+@DatabaseNameStaging+'.dbo.'+@MyField+'
            SELECT * FROM '+@DatabaseNameSource+'.dbo.'+@MyField+'
        END')

        FETCH NEXT FROM @MyCursor INTO @MyField
    END;

    CLOSE @MyCursor;
    DEALLOCATE @MyCursor;
END;

DROP TABLE #DatabaseStatements

/***********************
End of Script
************************/
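As mentioned above, the generated spLoad procedures can then be wrapped in a single master procedure and scheduled, for example from a SQL Server Agent job. Here is a minimal sketch of that wrapper; the individual procedure names are hypothetical and depend on the tables in your source database:

-- Hypothetical master procedure that refreshes the whole staging database;
-- each spLoad procedure below is generated by the script above
CREATE PROCEDURE dbo.spLoadStaging
AS
BEGIN
    SET NOCOUNT ON;

    EXEC dbo.spLoadCustomer;   -- illustrative table names only
    EXEC dbo.spLoadProperty;
    EXEC dbo.spLoadLease;
END

Scheduling this one procedure nightly gives you the blanket truncate-and-reload refresh described earlier without any separate ETL tooling; error handling and logging can be layered on once the prototype proves useful.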

 



Be sure to check out my full online class on the topic: a hands-on walkthrough of a modern data architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, it covers everything from navigating the Azure Portal to building an end-to-end modern data warehouse using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse and Power BI. A link to the class can be found on the blog.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Databricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into the Azure Synapse Data Warehouse


