{"id":186,"date":"2018-12-04T21:21:24","date_gmt":"2018-12-04T21:21:24","guid":{"rendered":"http:\/\/jackofalltradesmasterofsome.com\/blog\/?p=186"},"modified":"2021-10-04T02:42:09","modified_gmt":"2021-10-04T02:42:09","slug":"the-modern-data-warehouse-running-hive-queries-in-visual-studio-to-combine-data","status":"publish","type":"post","link":"http:\/\/jackofalltradesmasterofsome.com\/blog\/2018\/12\/04\/the-modern-data-warehouse-running-hive-queries-in-visual-studio-to-combine-data\/","title":{"rendered":"The Modern Data Warehouse; Running Hive Queries in Visual Studio to combine data"},"content":{"rendered":"\n<p>In previous posts we have looked at storing data files to blob storage and <a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2018\/08\/30\/the-modern-data-warehouse-the-low-cost-solution-using-big-data-with-hdinsight-and-powershell\/\">using PowerShell<\/a> to spin up an HDInsight Hadoop cluster. We have also <a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2018\/11\/27\/setting-up-visual-studio-to-work-with-hdinsights-and-run-hive-queries\/\">installed some basic software<\/a> that will help us get going once the services are provisioned. Now that the basics are ready, it is time to process some of that data using Hive and Visual Studio. In this scenario, we will be loading our Happiness Index data files into Hive tables and then consolidating that data into a single file.<\/p>\n\n\n\n<!--more-->\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"623\" height=\"275\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image.png\" alt=\"\" class=\"wp-image-187\" srcset=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image.png 623w, http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-300x132.png 300w\" sizes=\"auto, (max-width: 623px) 100vw, 623px\" \/><\/figure>\n\n\n\n<p>For this tutorial we are going to\nleverage Azure Storage Browser to view our storage files as well as create\nfolders. You can use this tool as well as the shell to create new folders and\nupload your data files. The actual folder will not be created until you upload\nfiles using this tool. In a real-world scenario, you would use a file transfer\ntask or data factory to stage your files.<\/p>\n\n\n\n<p>Open the tool and sign into your Azure account. Once created, navigate to your HDInsights Cluster and in your storage and create a new folder in your Hive Storage called \u201cdata\u201d and uploaded all 3 of our files to this location.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"575\" height=\"301\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-1.png\" alt=\"\" class=\"wp-image-189\" srcset=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-1.png 575w, http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-1-300x157.png 300w\" sizes=\"auto, (max-width: 575px) 100vw, 575px\" \/><\/figure>\n\n\n\n<p>Open Visual Studio with your newly installed \u201cAzure Data Lake Tools\u201d installed. If you look at your server explorer and ensure you are signed in with the same Azure account as where your cluster is located, you will see the cluster listed in the menu.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"239\" height=\"180\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-2.png\" alt=\"\" class=\"wp-image-191\"\/><\/figure>\n\n\n\n<p>You can create a new table directly from the server explorer. In the tool you can either script or use the wizard to create the new table. Create it so that it has the same column names as the file. For the example below we will just run the create table and the load table step as one Hive script. Update the script to load the 2016 and the 2017 files as well.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"440\" height=\"202\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-3.png\" alt=\"\" class=\"wp-image-192\" srcset=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-3.png 440w, http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-3-300x138.png 300w\" sizes=\"auto, (max-width: 440px) 100vw, 440px\" \/><\/figure>\n\n\n\n<p>CREATE TABLE IF NOT EXISTS\ndefault.sourcedata2015(Country string,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Region string,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; HappinessRank float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; HappinessScore float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LowerConfidenceInterval float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; UpperConfidenceInterval float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Economy_GDPperCapita float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Family float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Health_LifeExpectancy float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FreedomTrust_GovernmentCorruption&nbsp; float,<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GenerosityDystopiaResidual float)<\/p>\n\n\n\n<p>ROW FORMAT DELIMITED<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FIELDS TERMINATED BY &#8216;,&#8217;<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; COLLECTION ITEMS TERMINATED BY &#8216;\\002&#8217;<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MAP KEYS TERMINATED BY &#8216;\\003&#8217;<\/p>\n\n\n\n<p>STORED AS TEXTFILE;<\/p>\n\n\n\n<p>LOAD DATA INPATH &#8216;\/data\/2015.csv&#8217; INTO table sourcedata2015;<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p>Once you have created your 3 new Hive tables, it is time to consolidate them into one table using a simple SQL like union statement and adding the year column to the end.&nbsp;<\/p>\n\n\n\n<p>Select * from <\/p>\n\n\n\n<p>(<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp; SELECT *, &#8216;2015&#8217; as Year FROM sourcedata2017 b<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp; UNION ALL<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp; SELECT *, &#8216;2016&#8217; as Year FROM sourcedata2017 c<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp; UNION ALL<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp; SELECT *, &#8216;2017&#8217; as Year FROM sourcedata2017 d<\/p>\n\n\n\n<p>)\nCombinedTable<\/p>\n\n\n\n<p>Once the job is complete, you should be able to view the results in see by clicking \u201cJob Output\u201d. All we have done here is created a select statement, but this data could have also been inserted into a new Hive table for either more processing or a push to a targeted data warehouse. <\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"277\" height=\"383\" src=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-4.png\" alt=\"\" class=\"wp-image-194\" srcset=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-4.png 277w, http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-content\/uploads\/2018\/12\/image-4-217x300.png 217w\" sizes=\"auto, (max-width: 277px) 100vw, 277px\" \/><\/figure>\n\n\n\n<p>I hope this introduction helps you get off the ground in the basics of running Hive queries. In later posts we will look at how HDinsights intersects with Azure Data Lake and Data Factory. Before you leave, be sure to delete your HDCluster. Since this services runs as an hourly service and is not billed on a job basis, so it will continue to charge your account. You can either delete via the Azure dashboard or <a href=\"http:\/\/jackofalltradesmasterofsome.com\/blog\/2018\/08\/30\/the-modern-data-warehouse-the-low-cost-solution-using-big-data-with-hdinsight-and-powershell\/\">using the PowerShell script<\/a> from our previous posts. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>In previous posts we have looked at storing data files to blob storage and using PowerShell to spin up an HDInsight Hadoop cluster. We have also installed some basic software that will help us get going once the services are provisioned. Now that the basics are ready, it is time to process some of that [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":135,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,28,27,11,29],"tags":[44,15,45],"class_list":["post-186","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-analytics","category-azure","category-big-data","category-data-warehouse","category-hdinsight","tag-azure","tag-big-data","tag-hive"],"_links":{"self":[{"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts\/186","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/comments?post=186"}],"version-history":[{"count":7,"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts\/186\/revisions"}],"predecessor-version":[{"id":200,"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/posts\/186\/revisions\/200"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/media\/135"}],"wp:attachment":[{"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/media?parent=186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/categories?post=186"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/jackofalltradesmasterofsome.com\/blog\/wp-json\/wp\/v2\/tags?post=186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}