Modern Data Architecture – Part 3 – Creating Data Lake Storage
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
Azure Storage is very cost effective: you can store vast amounts of data for a small monthly cost, and no storage cost is incurred for empty containers.
From the Azure Portal we will create a new Storage Account. You may see a resource for Data Lake Gen1 Storage, but the go-forward standard at the time of writing is a default Storage Account. Azure Storage is a Microsoft-managed service providing cloud storage that is highly available, secure, durable, scalable, and redundant. Azure Storage includes Azure Blobs (objects), Azure Data Lake Storage Gen2, Azure Files, Azure Queues, and Azure Tables. The cost of your storage account depends on usage and on the options you choose below.
- Search for Storage Account from your Portal and select “+ Add” at the top left
- From the Storage Account window, fill in the following details and select “Review + create” to complete the wizard.
- Subscription – The subscription you previously set up
- Resource Group – “training_resourcegroup_yourname” or the group you previously created
- Storage Account Name – This value allows only lowercase letters and numbers (no underscores) and must be between 3 and 24 characters long, so name the storage account “trainingsayourname”
- Location – East US 2 or the region you belong to. It is best practice to keep related services in the same region, because services that need to communicate with one another see slightly higher latency as the distance between them grows
- Performance – Standard
- Account Kind – StorageV2 (general purpose v2)
- Replication – Read-access geo-redundant storage (RA-GRS)
- Access Tier – Hot
- After a short time the storage account will be provisioned and will appear in your dashboard. No storage cost is associated with this resource until data is actually moved and stored here.
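The wizard choices above can also be written down as data, which makes the naming rule easy to check before you open the Portal. The following is a minimal sketch, not an official Azure SDK call. It assumes Azure's documented naming rule (3 to 24 characters, lowercase letters and digits only) and uses ARM-style property names and values (`Standard_RAGRS`, `StorageV2`, `eastus2`) as I understand them. The helper names `is_valid_storage_account_name` and `build_account_settings` are hypothetical.

```python
import re

# Assumed rule: 3-24 characters, lowercase ASCII letters and digits only.
_NAME_RULE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if `name` satisfies the assumed storage account naming rule."""
    return _NAME_RULE.fullmatch(name) is not None


def build_account_settings(name: str, location: str = "eastus2") -> dict:
    """Hypothetical helper: collect the wizard choices under ARM-style property names."""
    if not is_valid_storage_account_name(name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return {
        "name": name,
        "location": location,                 # East US 2
        "kind": "StorageV2",                  # general purpose v2
        "sku": {"name": "Standard_RAGRS"},    # Standard performance, RA-GRS replication
        "properties": {"accessTier": "Hot"},  # Hot access tier
    }


if __name__ == "__main__":
    # An underscore-free name passes; a name with underscores is rejected.
    print(is_valid_storage_account_name("trainingsayourname"))
    print(is_valid_storage_account_name("training_sa_yourname"))
```

Keeping the settings in one dictionary is only a convenience for review. Provisioning still happens through the Portal steps described above.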
Creating a Folder Structure
- Select “Storage Explorer (preview)”
- Right-click “blob containers” and select “Create blob container”. Name it “root” and leave it “private”
- Create a new virtual folder
- A virtual folder does not actually exist in Azure until you paste, drag or upload blobs into it. To paste a blob into a new virtual folder, copy the blob before you create the folder.
- From the Datafiles folder for this lab, upload the file “placeholderfile” into a folder called “raw”.
- Repeat the same steps for a folder named “staging”.
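To see why a virtual folder only appears once a blob is uploaded into it, it can help to model the container as a flat list of blob names in which the folder is just the part of the name before the “/”. The sketch below is a plain-Python illustration of that idea under that assumption. It does not call Azure, and `list_virtual_folders` is a hypothetical helper.

```python
def list_virtual_folders(blob_names, delimiter="/"):
    """Return the top-level virtual folders implied by a flat list of blob names."""
    folders = set()
    for name in blob_names:
        # A blob only implies a folder if its name contains the delimiter.
        if delimiter in name:
            folders.add(name.split(delimiter, 1)[0])
    return sorted(folders)


# Hypothetical contents of the "root" container before and after the uploads.
before_upload = []
after_upload = ["raw/placeholderfile", "staging/placeholderfile"]

if __name__ == "__main__":
    print(list_virtual_folders(before_upload))  # no blobs, so no folders
    print(list_virtual_folders(after_upload))   # folders derived from blob names
```

In this model, an empty container has no folders at all. “raw” and “staging” come into being only because a blob name starts with them, which is why the placeholder file is uploaded.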
Be sure to check out my full online class on the topic: a hands-on walkthrough of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, it covers everything from the basics of navigating the Azure Portal to building an end-to-end modern data warehouse solution using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse and Power BI. Link to the class can be found here or directly here.
Part 1 – Navigating the Azure Portal
Part 2 – Resource Groups and Subscriptions
Part 3 – Creating Data Lake Storage
Part 4 – Setting up an Azure SQL Server
Part 5 – Loading Data Lake with Azure Data Factory
Part 6 – Configuring and Setting up Data Bricks
Part 7 – Staging data into Data Lake
Part 8 – Provisioning a Synapse SQL Data Warehouse
Part 9 – Loading Data into Azure Data Synapse Data Warehouse