Modern Data Architecture – Part 7 – Staging Data into Data Lake
In Lab 5, we demonstrated loading data from our sample SQL Server into our data lake with Azure Data Factory. Although that tool is very handy, it is yet another service that has to be provisioned and monitored. It is often better to use ADF purely as a workflow orchestration tool and let a single tool handle all of your ETL. Databricks provides a robust way to do exactly that, so in this lab we will query data from our SQL Server and load it into the data lake.
For this lab we will hard-code the access keys and credentials directly in the notebook. The optional section at the bottom demonstrates how to store these credentials as secrets, which is how you would configure this in production.
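As a preview of that production pattern, credentials are typically read from a Databricks secret scope rather than typed into the notebook. The sketch below assumes a secret scope named “training-scope” with keys “sql-admin-password” and “storage-account-key”; all three names are hypothetical placeholders, not part of this lab.
%scala
// Sketch only: assumes a secret scope named "training-scope" (hypothetical)
// already exists and holds the SQL admin password and the storage access key.
val sqlPassword = dbutils.secrets.get(scope = "training-scope", key = "sql-admin-password")
val storageKey = dbutils.secrets.get(scope = "training-scope", key = "storage-account-key")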
Connect and Query SQL Server
- Create a new notebook in your Databricks workspace and name it “staging_customers”
Copy the code below and paste it into your Databricks notebook, replacing the highlighted sections with your own database information. Most of our code will be written in Scala to start, and we will place each piece of code in its own cell to make the job a bit easier to read and test. We will use hard-coded credentials and keys for this lab, but in the real world, be sure to configure secret-based access only.
Example
Block 1
Description – This loads the SQL Server JDBC driver for our database connection
Code
%scala
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
Block 2
Description – We configure the connection. Replace the items highlighted below with your SQL Server name, database name, username, and password respectively.
Code
%scala
val jdbcHostname = "training-sqlserver-vimal.database.windows.net"
val jdbcPort = 1433
val jdbcDatabase = "training_sqldatabase_vimal"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", "sqlserveradmin")
connectionProperties.put("password", "Password123")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
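Block 2 only configures the connection; to actually test it before reading a full table, you can push a trivial query through the same JDBC settings. This is an optional sanity check, not part of the lab, and the “connection_test” alias is just an illustrative name.
%scala
// Optional sanity check: a trivial query using the same JDBC settings.
// If the hostname, database, or credentials are wrong, this fails immediately.
val connectionTest = spark.read.jdbc(jdbcUrl, "(SELECT 1 AS ok) AS connection_test", connectionProperties)
connectionTest.show()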
Block 3
Description – This allows us to review and test the connection to our sales.customers table in the database
Code
%scala
val customer = spark.read.jdbc(jdbcUrl, "sales.customers", connectionProperties)
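Nothing has actually been pulled from the database yet; Spark has only resolved the table’s schema. As an optional check you can print that schema before going any further.
%scala
// Prints the column names and types Spark inferred from sales.customers.
customer.printSchema()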
Block 4
Description – Select the data from the table for display
Code
%scala
customer.select("*").show()
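show() prints a plain-text preview truncated to 20 rows by default. In a Databricks notebook you can also use the built-in display() helper to render the DataFrame as a sortable, formatted grid.
%scala
// Databricks-specific helper that renders the DataFrame as an interactive table.
display(customer)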
Block 5
Description – This section mounts your data lake storage in Databricks so that we can push data to it in the next step. Replace the highlighted values with your blob container name, your storage account name, the destination folder (which appears in both the source path and the mount point), the storage account name again in the config key, and your access key respectively.
Code
%scala
dbutils.fs.mount(
  source = "wasbs://root@trainingsavimal.blob.core.windows.net/raw",
  mountPoint = "/mnt/raw",
  extraConfigs = Map("fs.azure.account.key.trainingsavimal.blob.core.windows.net" -> "youraccesskey"))
To get the value for “youraccesskey”, we are going to need some additional information. From the Azure portal, navigate back to your storage account “trainingsayourname” and select “Access Keys”.
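If you created the secret scope sketched earlier, you can avoid pasting the key into the notebook at all. Here is a minimal variant of the mount call, again assuming the hypothetical “training-scope” scope and “storage-account-key” key.
%scala
// Same mount as above, but the key is pulled from a secret scope instead of
// being hard coded. The scope and key names are hypothetical placeholders.
dbutils.fs.mount(
  source = "wasbs://root@trainingsavimal.blob.core.windows.net/raw",
  mountPoint = "/mnt/raw",
  extraConfigs = Map("fs.azure.account.key.trainingsavimal.blob.core.windows.net" ->
    dbutils.secrets.get(scope = "training-scope", key = "storage-account-key")))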
Block 6
Description – This last section pushes your data to a CSV file on the data lake. Upon completion, you can see your new data in your storage account via Storage Explorer; the output path is written as a folder containing the CSV part file along with some data transfer metadata.
Code
%scala
customer.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/mnt/raw/customers.csv")
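As an optional check, you can list what Spark actually wrote under the mount point. With this writer, “customers.csv” is created as a folder holding a single part file plus _SUCCESS and other commit metadata, which is the transfer metadata mentioned above.
%scala
// Lists the contents of the output path on the mounted storage.
display(dbutils.fs.ls("/mnt/raw/customers.csv"))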
Be sure to check out my full online class on the topic: a hands-on walkthrough of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, learn the basics of navigating the Azure Portal through to building an end-to-end modern data warehouse solution using popular technologies such as SQL Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse, and Power BI. Link to the class can be found here or directly here.
Part 1 – Navigating the Azure Portal
Part 2 – Resource Groups and Subscriptions
Part 3 – Creating Data Lake Storage
Part 4 – Setting up an Azure SQL Server
Part 5 – Loading Data Lake with Azure Data Factory
Part 6 – Configuring and Setting up Databricks
Part 7 – Staging data into Data Lake
Part 8 – Provisioning a Synapse SQL Data Warehouse
Part 9 – Loading Data into the Azure Synapse Data Warehouse