{"id":356,"date":"2019-03-21T21:54:38","date_gmt":"2019-03-21T21:54:38","guid":{"rendered":"http:\/\/jackofalltradesmasterofsome.com\/blog\/?p=356"},"modified":"2023-02-28T14:51:08","modified_gmt":"2023-02-28T14:51:08","slug":"modern-data-architecture-streaming-etl-using-cdc-and-azure-event-hub","status":"publish","type":"post","link":"https:\/\/jackofalltradesmasterofsome.com\/blog\/2019\/03\/21\/modern-data-architecture-streaming-etl-using-cdc-and-azure-event-hub\/","title":{"rendered":"Streaming ETL using CDC and Azure Event Hub. A Modern Data Architecture."},"content":{"rendered":"\n<p>In Modern Data architecture, As Data Warehouses have gotten bigger and faster, and as big data technology has allowed us to store vast amounts of data it is still strange to me that most data warehouse refresh processes found in the wild are still some form of batch processing. Even Hive queries against massive Hadoop infrastructures are essentially fast performing bath queries. Sure, they may occur every half day or even every hour but the speed of business continues to accelerate and we must start looking at architecture that combines the speed and transactional processing of Kafka\/Spark\/Event Hubs into creating a real time streaming ETL to load a data warehouse at a cost that is comparable and even cheaper then purchasing an ETL tool. Let&#8217;s look at Streaming ETL using CDC and Azure Event Hub.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.udemy.com\/course\/modern-data-architecture-using-microsoft-azure\/learn\/lecture\/18527998#overview\">Interested in Learning More about Modern Data Architecture? <\/a>.   <\/p>\n\n\n\n<p><em>Side Note: Want to learn <a href=\"https:\/\/subscribepage.io\/8daystosqlcourse\">SQL <\/a>or <a href=\"https:\/\/subscribepage.io\/Py2Y8D\">Python <\/a>for free. In less then 10 minutes a day and less than an hour total? Signup for my free classes delivered daily right to your email inbox for free!  
<\/em><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/subscribepage.io\/8daystosqlcourse\"><img loading=\"lazy\" decoding=\"async\" width=\"558\" height=\"280\" src=\"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2023\/02\/image-1.png\" alt=\"\" class=\"wp-image-1305\" srcset=\"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2023\/02\/image-1.png 558w, https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2023\/02\/image-1-300x151.png 300w\" sizes=\"auto, (max-width: 558px) 100vw, 558px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/subscribepage.io\/Py2Y8D\"><img loading=\"lazy\" decoding=\"async\" width=\"572\" height=\"281\" src=\"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2023\/02\/image-2.png\" alt=\"\" class=\"wp-image-1306\" srcset=\"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2023\/02\/image-2.png 572w, https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2023\/02\/image-2-300x147.png 300w\" sizes=\"auto, (max-width: 572px) 100vw, 572px\" \/><\/a><\/figure>\n\n\n\n<p>Now back to the article&#8230;<\/p>\n\n\n\n<p>For this series, we will be looking at Azure Event Hubs or\nIoT Hubs. These were designed to capture fast streaming data of millions of\nrows from IoT devices or streaming data like Twitter. But why should this tool\nbe limited to these use cases? Most businesses have no need or requirement for\nthis use case, but we can use this technology to create a live streaming ETL to\nyour data warehouse or your reporting environment with out sacrifice\nperformance or creating a strain on your source systems. 
This architecture can be used to perform data synchronization between systems and other integrations as well, and since we are not using it to its full potential of capturing millions of flowing records, our costs end up being pennies a day!<\/p>\n\n\n\n<p>Others have emulated this sort of process by using triggers on their source tables, but triggers can add an extra step of processing and overhead to your database. Change data capture enabled natively on SQL Server is much lighter than a trigger, and with it you can take the first steps toward a streaming ETL for your data. If CDC is not available, simple staging scripts can be written to emulate it, but be sure to keep an eye on performance. Let\u2019s take a look at the first step: setting up native Change Data Capture on your SQL Server tables.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong><u>Steps <\/u><\/strong><\/h2>\n\n\n\n<ol class=\"wp-block-list\"><li>First, enable Change Data Capture at the database and table level using the following scripts. More information is available on the Microsoft site. If you are not using SQL Server or a tool that has CDC built in, a similar process can be hand-built as a read-only view by leveraging timestamps on last-created and last-updated dates.<\/li><\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>EXECUTE sys.sp_cdc_enable_db;\nGO\n\nEXECUTE sys.sp_cdc_enable_table\n    @source_schema = N'dbo'\n  , @source_name = N'Task'\n  , @role_name = N'cdc_Admin';\nGO<\/code><\/pre>\n\n\n\n<p>This step will enable CDC on the database as well as add it to the table \u201cTask\u201d. 
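<\/p>\n\n\n\n<p>To sanity-check the setup, you can query SQL Server&#8217;s catalog views. A minimal sketch, assuming the database in question is the current one and the table is \u201cTask\u201d:<\/p>

```sql
-- Confirm CDC is enabled at the database level
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name = DB_NAME();

-- Confirm the Task table is now tracked by CDC
SELECT name, is_tracked_by_cdc
FROM sys.tables
WHERE name = 'Task';
```

<p>Both flags should return 1 once the enable scripts have run. 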
SQL Server Agent must be running, as two jobs are created during this process: one to load the change table and one to clean it out.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"267\" height=\"79\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2019\/03\/image.png\" alt=\"\" class=\"wp-image-357\"\/><\/figure>\n\n\n\n<p>Once CDC is enabled, SQL Server will automatically create a change table named \u201cschema_tablename_CT\u201d under the System Tables folder. This table will automatically track all data changes that occur on the source table. You can reference the __$operation code to determine which change occurred, using the legend below. If you wish to capture changes to only certain columns, see the Microsoft documentation on CDC to see how that can be set. If you are hand-writing your SQL, this can also be programmed in when building your staging query.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"276\" height=\"205\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2019\/03\/image-1.png\" alt=\"\" class=\"wp-image-358\"\/><\/figure>\n\n\n\n<p>1 = delete<br> 2 = insert<br> 3 = update (captured column values are those before the update operation). 
This value applies only when the row filter option &#8216;all update old&#8217; is specified.<br> 4 = update (captured column values are those after the update operation)<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"623\" height=\"113\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2019\/03\/image-2.png\" alt=\"\" class=\"wp-image-359\" srcset=\"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2019\/03\/image-2.png 623w, https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2019\/03\/image-2-300x54.png 300w\" sizes=\"auto, (max-width: 623px) 100vw, 623px\" \/><\/figure>\n\n\n\n<p>Now, once you add, edit, or delete a record, you should be able to find the change in the new CDC table. Next, we will look at scanning this table and turning the data into JSON to send to an Event Hub!<\/p>\n\n\n\n<p>For more information on SQL Server CDC, please see the Microsoft documentation <a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/relational-databases\/track-changes\/about-change-data-capture-sql-server?view=sql-server-2017\">here<\/a>.<\/p>\n\n\n<p>Be sure to check out my full online class on the topic: a hands-on walkthrough of a Modern Data Architecture using Microsoft Azure. For beginners and experienced business intelligence experts alike, learn the basics of navigating the Azure Portal and building an end-to-end modern data warehouse solution using popular technologies such as SQL&nbsp;Database, Data Lake, Data Factory, Databricks, Azure Synapse Data Warehouse and Power BI. Link to the class can be <a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/08\/modern-data-architecture-using-microsoft-azure-online-class-and-free-ebook\/\">found <\/a><a href=\"https:\/\/www.udemy.com\/course\/modern-data-architecture-using-microsoft-azure\/learn\/lecture\/18527998#overview\">here<\/a>.  
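<\/p>\n\n\n\n<p>Before we wire the changes into an Event Hub in the next post, you can already inspect what CDC has captured using the table-valued functions SQL Server generates when CDC is enabled. A minimal sketch, assuming the capture instance took the default name \u201cdbo_Task\u201d for the table enabled above:<\/p>

```sql
-- Read every change captured so far for the dbo_Task capture instance
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Task');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

-- __$operation in the result: 1 = delete, 2 = insert, 4 = update
-- (pass N'all update old' instead of N'all' to also get 3, the before image)
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Task(@from_lsn, @to_lsn, N'all');
```

<p>Each row also carries the LSN at which the change occurred, which is what a streaming ETL can checkpoint against between scans. 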
<\/p>\n\n\n<p><!--StartFragment--><\/p>\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/08\/modern-data-architecture-part-1-navigating-the-azure-portal\/\">Part 1 &#8211; Navigating the Azure Portal<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/13\/modern-data-architecture-part-2-resource-groups-and-subscriptions\/\">Part 2 &#8211; Resource Groups and Subscriptions<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/13\/modern-data-architecture-part-3-creating-data-lake-storage\/\">Part 3 &#8211; Creating Data Lake Storage<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/13\/modern-data-architecture-part-4-setting-up-a-sql-server\/\">Part 4 &#8211; Setting up an Azure SQL Server<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/13\/modern-data-architecture-part-5-loading-data-lake-with-data-factory\/\">Part 5 &#8211; Loading Data Lake with Azure Data Factory<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/13\/modern-data-architecture-part-6-configuring-and-setting-up-data-bricks\/\">Part 6 &#8211; Configuring and Setting up Data Bricks<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/13\/modern-data-architecture-part-7-staging-data-into-data-lake\/\">Part 7 &#8211; Staging data into Data Lake<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/14\/modern-data-architecture-part-8-provisioning-a-synapsis-sql-data-warehouse\/\">Part 8 = Provisioning a Synapse SQL Data Warehouse<\/a><\/p>\n\n\n\n<p><a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2020\/04\/14\/modern-data-architecture-part-9-loading-data-synapse-data-warehouse\/\">Part 9 &#8211; Loading Data into Azure Data Synapse Data 
Warehouse<\/a><\/p>\n\n\n<p><!--EndFragment--><\/p>","protected":false},"excerpt":{"rendered":"<p>In Modern Data architecture, As Data Warehouses have gotten bigger and faster, and as big data technology has allowed us to store vast amounts of data it is still strange to me that most data warehouse refresh processes found in the wild are still some form of batch processing. Even Hive queries against massive Hadoop [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":135,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,28,27,11,29],"tags":[44,13,83],"class_list":["post-356","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-analytics","category-azure","category-big-data","category-data-warehouse","category-hdinsight","tag-azure","tag-etl","tag-event-hub"],"_links":{"self":[{"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts\/356","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/comments?post=356"}],"version-history":[{"count":8,"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts\/356\/revisions"}],"predecessor-version":[{"id":1313,"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts\/356\/revisions\/1313"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/media\/135"}],"wp:attachment":[{"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/media?parent=356"}],"wp:term"
:[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/categories?post=356"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/tags?post=356"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}