The Modern Data Warehouse: Azure Data Lake and U-SQL to combine data

The modern data warehouse will need to use Azure Data Lake and U-SQL to combine data. Begin by navigating to the Azure Portal, searching for the Data Lake Analytics resource, and creating a new Data Lake Analytics account. Don't worry: this service bills per use rather than for simply remaining on like an HDInsight cluster, so you should not be charged anything while it sits idle, and we will not need to spin services up and down as we did earlier.

You will need to give it a unique name and tie it to a pay-as-you-go subscription. The account also needs a Data Lake storage layer; you can keep the default naming or rename it if you wish. It is recommended to keep the storage encrypted. Deployment may take a few minutes.

Let's navigate to the Data Explorer and create a new folder to hold our sample data, then upload the same three happiness index files to the new folder.
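If you would rather script the upload than click through the portal, a minimal sketch using the azure-datalake-store Python package (the Data Lake Storage Gen1 SDK) might look like the following. The tenant ID, store name, and file names are placeholders; adjust them to match your own account and happiness index files.

from azure.datalake.store import core, lib, multithread

# Interactive (device-code) login; service principal auth is also supported.
token = lib.auth(tenant_id='<your-tenant-id>')

# '<yourdatalakestore>' is the Data Lake storage account created above.
adls = core.AzureDLFileSystem(token, store_name='<yourdatalakestore>')

# Create the sample folder and upload the three local CSV files into it.
adls.mkdir('/sampledata')
for local_file in ['2015.csv', '2016.csv', '2017.csv']:
    multithread.ADLUploader(adls,
                            lpath=local_file,
                            rpath='/sampledata/' + local_file,
                            overwrite=True)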

Take a look through your Data Explorer at the tables. The catalog should contain a master database, much as if you were running a traditional SQL Server.

Create a new U-SQL job that creates a new database in the catalog, along with a schema and a table.

CREATE DATABASE IF NOT EXISTS testdata;

USE DATABASE testdata;

CREATE SCHEMA IF NOT EXISTS happiness;

CREATE TABLE happiness.placeholderdata
(
    Region string,
    HappinessRank float,
    HappinessScore float,
    LowerConfidenceInterval float,
    UpperConfidenceInterval float,
    Economy_GDPperCapita float,
    Family float,
    Health_LifeExpectancy float,
    FreedomTrust_GovernmentCorruption float,
    GenerosityDystopiaResidual float,
    INDEX clx_Region CLUSTERED(Region ASC) DISTRIBUTED BY HASH(Region)
);

While the job runs, and once it has completed, Azure presents the execution tree. With the new table created, we can now load data into it.

USE DATABASE testdata;

@log =
    EXTRACT Region string,
            HappinessRank float,
            HappinessScore float,
            LowerConfidenceInterval float,
            UpperConfidenceInterval float,
            Economy_GDPperCapita float,
            Family float,
            Health_LifeExpectancy float,
            FreedomTrust_GovernmentCorruption float,
            GenerosityDystopiaResidual float
    FROM "/sampledata/{*}.csv"
    USING Extractors.Text(delimiter: ',', silent: true);

INSERT INTO happiness.placeholderdata
SELECT * FROM @log;

Once the data is loaded, we can query it from the Data Explorer to see what it looks like. The script below runs the SELECT and outputs the data to a file that can be browsed.

@table = SELECT * FROM [testdata].[happiness].[placeholderdata];

OUTPUT @table
    TO "/OUTPUTS/Sampledataquery.csv"
    USING Outputters.Csv();
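As a side note, these U-SQL scripts do not have to be submitted through the portal. A rough sketch using the azure-mgmt-datalake-analytics Python package that was current when this post was written is shown below; the account name, service principal credentials, and script file name are all placeholders, and the exact model names may differ between SDK versions.

import time
import uuid

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, JobState, USqlJobProperties

# Service principal credentials; the resource URI targets Data Lake.
credentials = ServicePrincipalCredentials(
    client_id='<app-id>',
    secret='<app-secret>',
    tenant='<tenant-id>',
    resource='https://datalake.azure.net/')

adla_account = '<your-adla-account>'
job_client = DataLakeAnalyticsJobManagementClient(credentials, 'azuredatalakeanalytics.net')

# The U-SQL text from the sections above, read from a local file (placeholder name).
script = open('load_happiness.usql').read()

job_id = str(uuid.uuid4())
job_client.job.create(
    adla_account,
    job_id,
    JobInformation(name='Load happiness data',
                   type='USql',
                   properties=USqlJobProperties(script=script)))

# Poll until the job reaches the Ended state.
result = job_client.job.get(adla_account, job_id)
while result.state != JobState.ended:
    time.sleep(5)
    result = job_client.job.get(adla_account, job_id)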



Be sure to check out my full online class on the topic: a hands-on walkthrough of a modern data architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, it covers everything from navigating the Azure Portal to building an end-to-end modern data warehouse using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse, and Power BI. A link to the class can be found here.

Part 1 – Navigating the Azure Portal

Part 2 – Resource Groups and Subscriptions

Part 3 – Creating Data Lake Storage

Part 4 – Setting up an Azure SQL Server

Part 5 – Loading Data Lake with Azure Data Factory

Part 6 – Configuring and Setting up Data Bricks

Part 7 – Staging data into Data Lake

Part 8 – Provisioning a Synapse SQL Data Warehouse

Part 9 – Loading Data into Azure Synapse Data Warehouse



Setting up tools to work with HDInsights and run Hive Queries – Azure Data Lake Tools and Azure Storage Explorer

Two tools that will make life a bit simpler if you are going to be working with HDInsight and Azure Blob Storage are "Azure Data Lake and Stream Analytic Tools for Visual Studio" and "Azure Storage Explorer".

Azure Data Lake and Stream Analytic Tools for Visual Studio

  • To run Hive queries, you're going to need to install Azure Data Lake and Stream Analytic Tools into your version of Visual Studio; these are sometimes referred to as HDInsight Tools for Visual Studio or Azure Data Lake Tools for Visual Studio. You can install them directly from Visual Studio by selecting Tools -> Get Tools and Features.

  • Once you have Visual Studio open, navigate to Server Explorer and verify you are connected to the right Azure subscription. You should now see your cluster information as created in the previous blog post, "Provisioning HDInsight Clusters using PowerShell".

At this point you should be able to browse and interact with your storage from Visual Studio as well as from the Azure dashboard, but another helpful tool to install is "Azure Storage Explorer", which can be downloaded and installed from Microsoft's download page.

Storage Explorer can be used to create new folders and upload your data files, which makes that job a little easier. Please note that when you create a new folder, the actual folder will not exist until you upload files into it using this tool.
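If you prefer scripting the upload, a small sketch using the azure-storage-blob SDK generation that was current at the time (BlockBlobService) is shown below. The account name, key, container, and file name are placeholders; the container is the one backing the HDInsight cluster's default file system.

from azure.storage.blob import BlockBlobService

# Storage account and key for the account attached to the HDInsight cluster (placeholders).
blob_service = BlockBlobService(account_name='<storageaccount>', account_key='<account-key>')

# There are no real folders in blob storage; the "folder" is just a prefix on the
# blob name, which is why it only appears once a file has been uploaded.
blob_service.create_blob_from_path(
    container_name='<container>',
    blob_name='sampledata/2016.csv',
    file_path='2016.csv')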

That should be it! From Server Explorer, you can now see the HDInsight cluster that was spun up as well as the storage accounts where files will be stored. In the next post, we will look at building and running Hive queries against your data files to get them ready for reporting.

Big Data for The Rest of Us. Affordable and Modern Business Intelligence Architecture – Adding Lifecycles to your S3 buckets to save cost and retain data forever!

I wanted to keep this post short since, as I mentioned in the previous post about cloud storage, our use case is already an affordable one. Still, it makes sense to touch on a strategy for moving files to other tiers of storage so that we maximize cost savings against our base-level requirements. In S3, you can easily define lifecycle rules that move your files from standard storage to infrequent access and eventually to cold storage. The different pricing structures can be found in AWS's documentation.

Let's quickly discuss the different storage classes offered by AWS. For more details, see the AWS documentation.

Standard

Amazon S3 Standard offers high durability, availability, and performance object storage for frequently accessed data. Because it delivers low latency and high throughput, S3 Standard is perfect for a wide variety of use cases including cloud applications, dynamic websites, content distribution, mobile and gaming applications, and Big Data analytics.

Infrequent Access

Amazon S3 Standard-Infrequent Access (S3 Standard-IA) is an Amazon S3 storage class for data that is accessed less frequently but requires rapid access when needed. S3 Standard-IA offers the high durability, high throughput, and low latency of S3 Standard, with a low per GB storage price and per GB retrieval fee. This combination of low cost and high performance makes S3 Standard-IA ideal for long-term storage, backups, and as a data store for disaster recovery.

Glacier (Cold Storage)

Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving. You can reliably store any amount of data at costs that are competitive with or cheaper than on-premises solutions. To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides three options for access to archives, from a few minutes to several hours.

In our basic scenario, where we receive files representing a whole year's worth of data, we can consider moving data older than 3 years to infrequent access and data older than 7 years to cold storage. That way, basic day-to-day analytics runs against S3 Standard, while less frequent, more directional analysis (say, quarterly work on company strategy) can run from the infrequent access tier. In the rare case that an audit is required during an event such as a merger and acquisition, data can be pulled and provided from cold storage to meet those needs as well. This allows you to maximize cost savings while keeping an indefinite data retention policy.

To set up a lifecycle policy on your data, just follow these few simple steps:

  • From your bucket, navigate to Management -> Add Lifecycle Rule.
  • Give your rule a name and a scope. You can use tags if you need the policy to apply to a certain set of files.
  • In the Transition and Expiration sections, define the time period after which data is moved to another tier and, if required, when to expire the file. It is a good idea to tag files on upload so that the same tags can be used for the lifecycle rules as well.
  • That's it, you now have an automated policy on your data without an ETL or file-task process. You can create as many policies as you need, and each can be unique to how you wish to retain your data. A programmatic equivalent is sketched below.
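The same policy can also be applied programmatically. A minimal sketch with boto3 follows; the bucket name and the 'yearly/' prefix are placeholders, and the day counts approximate the 3-year and 7-year thresholds discussed above.

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-analytics-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'tier-out-yearly-data',
            'Filter': {'Prefix': 'yearly/'},
            'Status': 'Enabled',
            'Transitions': [
                {'Days': 3 * 365, 'StorageClass': 'STANDARD_IA'},  # ~3 years -> Infrequent Access
                {'Days': 7 * 365, 'StorageClass': 'GLACIER'},      # ~7 years -> Glacier
            ],
        }]
    })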
