37. Lab – Using the right distributions for your tables
Now, just to drive home the concept of data shuffles, what I’ll do is create our Sales Fact table again, create our dimension product table a couple of times, and then perform a simple join query between both tables. Then we’ll look at where the data shuffling actually takes place. This will help you understand distribution for your tables. So, first and foremost, I’m going to delete any existing Sales Fact table. Here in my session for my dedicated SQL pool, I’ll first drop the Sales Fact table.
Next, I’ll create the table. Here, the hash distribution is based on the product ID. So I’ll copy this, paste it here, and hit execute. That’s done. Now I’ll drop my dim product table, if I have one, and then create the dim product table again. So let me drop the table. That’s done. And next, let me create the table. So, what will be the default distribution? If you don’t mention one, it will be the round-robin distribution. Now I’ll go on to Azure Synapse Studio. Let me go on to the Integrate section and create a new Copy Data tool pipeline. Here, I’ll just run this once, and I’ll move on to the next step. Here, I’ll choose my Adventure Works connection. I’ll choose my product table, and I should also choose my sales fact table. I’ll proceed to the next step, and then I’ll choose the Azure Synapse connection. I’ll leave the targets as they are, so I’m just copying the data as it is from the source, that’s the Azure SQL database, onto the dedicated SQL pool. Here I won’t enable staging, and I’ll just do a normal bulk insert; I don’t want to add any complexity. And let’s ensure that the pipeline goes on to completion.
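For reference, the two table definitions I’m running here are along these lines — the column lists are illustrative, so focus on the WITH clauses:

```sql
-- Fact table: hash-distributed on the product ID
CREATE TABLE [dbo].[SalesFact]
(
    [ProductID]     INT NOT NULL,
    [OrderQuantity] INT,
    [UnitPrice]     MONEY
)
WITH
(
    DISTRIBUTION = HASH([ProductID]),
    CLUSTERED COLUMNSTORE INDEX
);

-- Dimension table: no DISTRIBUTION clause, so it defaults to ROUND_ROBIN
CREATE TABLE [dbo].[DimProduct]
(
    [ProductID]   INT NOT NULL,
    [ProductName] NVARCHAR(100)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX
);
```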
It’ll just take a few seconds. I’ll go on to Finish, and then on to the Monitor section to look at my pipeline. It’s in progress. As I said, it shouldn’t take much time. When I hit refresh, I can see it has succeeded; it took just 28 seconds. So, if I do a SELECT * from the product table and hit execute, I can see my data is in place, and if I go to my Sales Fact table, I can confirm I have that information in place as well. So this is also done. Now I’ll perform a very simple query. Here I am taking the product ID from my Sales Fact table and the product name from my dimension product table, and I’m doing a join between the Sales Fact table and the dimension product table based on the product ID. So let me take this and run it. I’ll hit execute, and we’re getting the information in place. Now, in the Monitor section, I’ll go on to SQL requests. I want to look at my dedicated SQL pool. Here I can see my statement, and you can see the join. I’ll close this and go on to this request ID.
And here in the request steps, you can see that there is a shuffle-move operation in place; as a result, many operations are now involved in the join. And that’s because, remember, you have your distributions. In our fact table, the rows are distributed based on the product ID, so different product IDs have been spread across the distributions. But our dimension product table is a round-robin distributed table, so its rows are randomly distributed across the distributions. When we join the product ID of the Sales Fact table with the product ID of the dimension product table, there needs to be some data shuffle to ensure that the product IDs being joined end up on the same distribution. So, if a product ID in the Sales Fact table is on one distribution, and the same product ID in the dimension product table is on another distribution, the rows need to be shuffled so that the join can be performed on one distribution. This can actually slow down a query when you have a large amount of data.
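If you prefer T-SQL over the Studio UI, you can inspect the same request steps through the DMVs. A sketch, where the request ID `'QID1234'` is a placeholder you would replace with the one returned by the first query:

```sql
-- Find the join query and note its request ID
SELECT request_id, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;

-- Inspect the distributed steps; a ShuffleMoveOperation here
-- indicates rows being moved between distributions for the join
SELECT step_index, operation_type, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'   -- replace with your request ID
ORDER BY step_index;
```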
Now, let’s see the impact if we create the dimension table as a replicated table. So here, let me drop my dimension product table, and let me create the dimension product table with the distribution of replicate. I’ll take this and copy it here. We are now ensuring that the product data is available on all distributions; it is being replicated. Now let me return to the Integrate section and create a new Copy Data activity again. I’ll proceed to the next step. This time I just want to copy the product dimension table data again, so I’ll choose the dimension product table. I’ll go on to the next step, and the next one. I’ll choose Azure Synapse, so the target is there. I’ll proceed to the next step, disable staging, and choose bulk insert. I’ll go on to the next step, and the next one, so that it now creates the pipeline to copy our data. I’ll hit Finish, go on to the Monitor section, and on to pipeline runs. I’ll just wait for this run to be completed. Now, keep in mind what we’re attempting to accomplish here. With our different distributions, the dimension product table will now be present on every distribution. So, even though your fact table rows are distributed using a hash, when you do a join, the rows that have a particular product ID on a fact table distribution can be joined with the rows for that product ID in the product dimension table on the same distribution.
And since all the rows are present on every distribution, it can directly do the join in parallel across the different distributions. But let’s confirm this. Let’s wait until we have all the data in place. When I click on refresh, we have our dimension product table in place. Just to confirm, we’ll do a SELECT * so we can see our tables are in place, and then we’ll perform the same join on our tables. So this is also done. If I now go on to the SQL requests, here I can see the query that I just executed. If I go on to the request ID, I can see there is no data shuffle operation in place; it has directly gone ahead and fetched the rows. So far, we’ve seen the impact of using the correct distribution type on fact and dimension tables. Normally, when it comes to fact tables, because you have a large amount of data, you should have a hash-distributed table. And when it comes to your dimension tables, it’s always good to have a replicated table in place.
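The change that removed the shuffle is just the DISTRIBUTION setting on the dimension table. A sketch, with an illustrative column list:

```sql
CREATE TABLE [dbo].[DimProduct]
(
    [ProductID]   INT NOT NULL,
    [ProductName] NVARCHAR(100)
)
WITH
(
    DISTRIBUTION = REPLICATE,   -- a full copy is available to every compute node
    CLUSTERED COLUMNSTORE INDEX
);

-- The same join now runs in parallel without a shuffle-move step
SELECT F.[ProductID], D.[ProductName]
FROM [dbo].[SalesFact] F
JOIN [dbo].[DimProduct] D
  ON F.[ProductID] = D.[ProductID];
```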
38. Points on tables in Azure Synapse
So in the earlier chapters, we looked at our external tables, and we had looked at tables in our dedicated SQL pool. This Microsoft documentation compares the two — I’ll attach the link as a resource to this chapter — and it is also important from an exam perspective. As I previously stated, table persistence is one of the most important reasons why you should have a dedicated SQL pool in place: you can have the table actually persisted in storage. Remember, external tables only point to data in an external storage location.
So with the help of table persistence, you can create regular tables in the dedicated SQL pool. Next is distribution. We have covered the different types of distributed tables, and these are only available with tables in the dedicated SQL pool. Again, an important note: you can’t use external tables with these different distribution techniques. When it comes to partitions, you can see that this is available in both: you can use it as part of your dedicated SQL pool or your serverless SQL pool. And then columnstore indexes are something that is only supported in your dedicated SQL pool; they are not available in your serverless SQL pool. So as I said, I’ll give this link as a resource to this chapter just to go through this table.
39. Lab – Windowing Functions
In this chapter, I want to go through the windowing functions that you can apply to your data in the dedicated SQL pool. A windowing function allows one to apply a calculation to a set of data that is defined within a window. With such a function, you can split the rows of data into different sets and then apply aggregate operations to the data in each set. When using windowing functions with SQL pools, you have to use the OVER clause.
This clause determines the partitioning and ordering of a row set before the window function is applied, and then you can apply the aggregate function accordingly. So I’ll show an example of how you can use windowing functions. Please note that when it comes to the exam, windowing functions are very important. When we come to another service known as Azure Stream Analytics, a similar windowing feature is available there as well; that’s another reason why I want to show windowing functions. So here I am showing the statements that will work with our Sales Fact table. Here, in addition to selecting some of the columns that I have in my Sales Fact table, I am also using the OVER clause. In the OVER clause, I am saying, “Please partition my rows based on the product ID.” And here we use the ROW_NUMBER function to identify the row number for each row, and we are ordering that by the order quantity. So let me take this statement and run it. In SQL Server Management Studio, I’ll run a new query against our dedicated SQL pool. Let me execute the statement and see what we get. So now I can see the row numbers for my rows.
This is now partitioning by the product ID. So I can see all of my product IDs over here, and if I scroll down, the rows are ordered by the order quantity. Here the product ID is 77. When I scroll down to the product ID of 78, I see that the row number has reset to one. So my rows are being partitioned by the product ID. Now let’s say that you want to perform some sort of aggregation on the rows based on the product ID. I can proceed to my next statement, in which I use the SUM aggregate function. I’ll select this and run the query. Here I can see the total order quantity for each product ID. So you can use the OVER clause along with window functions to compute aggregated values such as moving aggregates, cumulative aggregates, running totals, et cetera. I just wanted to give a very simple example of applying a windowing function. This becomes more important when we go on to Stream Analytics, which we’ll cover in that particular section.
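The two statements I ran are along these lines, assuming the column names ProductID and OrderQuantity in the Sales Fact table:

```sql
-- Row number within each ProductID partition, ordered by quantity
SELECT [ProductID],
       [OrderQuantity],
       ROW_NUMBER() OVER (PARTITION BY [ProductID]
                          ORDER BY [OrderQuantity]) AS RowNumber
FROM [dbo].[SalesFact];

-- Aggregate over the same partitions: total order quantity per product
SELECT DISTINCT [ProductID],
       SUM([OrderQuantity]) OVER (PARTITION BY [ProductID]) AS TotalOrderQuantity
FROM [dbo].[SalesFact];
```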
40. Lab – Reading JSON files
In this chapter, I want to go through how to read JSON-based files. We are going to be doing this from our serverless SQL pool. For this, we have to use the OPENROWSET function, which allows us to read a JSON-based file that is available in our Azure Data Lake Gen 2 storage account. The first thing I will do is upload my JSON-based file. So I have a Log.json file that is based on the same data that is available in my Log.csv file. I’ll go on to my data container, then on to my raw folder, and I’ll upload my Log.json file. I’ll have this as a resource attached to this chapter. So from my temporary directory, I’ll upload the Log.json file. Once we have the file in place, I’ll open up Synapse Studio for the workspace. I’ll go on to Data.
I’ll go on to the Linked tab here, to the Azure Data Lake Storage account that we had mapped out early on. I’ll go on to my data container, and I can go on to my raw folder. On the Log.json file, I can right-click, choose New SQL script, and choose to select the top 100 rows. What this will do is use the OPENROWSET function to take my Log.json file and read each row as one string. If I run this code as is, I get an error saying incorrect syntax. So here is the problem: the row terminator. In the generated script, the row terminator is the same as the field terminator. That’s why here in my code I have the change that we need to make to ensure that the right hexadecimal value is used as the row terminator for our file. I’ll copy this, paste it here, and hit the Run button again. So now it’s reading the data from our Log.json file.
Here you can see the content, but everything is being returned as one string; we have one row for the entire string of JSON data. That’s why I’ve taken it a step further, where we first map each row onto a JSON content column — here I’m just using varchar(max). Then, for the JSON content, I am using the JSON_VALUE function to take each property from the JSON content. And for the different data types, I am using the CAST function to cast one value as an integer and another as a datetimeoffset. So I’ll copy this statement and replace it. I have the SELECT; let me just make sure it’s correct if I go back. So everything else is the same. Let me run this command. And now I can see everything coming in separate columns. So in this chapter, I just wanted to show you how you can use the OPENROWSET function to query data that’s in a JSON-based file.
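The final query is along these lines; the storage URL and the JSON property names are placeholders for your own file:

```sql
-- Read each line of the JSON file as one varchar(max) string,
-- then pull individual properties out with JSON_VALUE and cast them.
SELECT CAST(JSON_VALUE(jsonContent, '$.Id') AS INT)               AS Id,
       JSON_VALUE(jsonContent, '$.Operationname')                 AS OperationName,
       CAST(JSON_VALUE(jsonContent, '$.Time') AS DATETIMEOFFSET)  AS [Time]
FROM OPENROWSET(
        BULK 'https://datalakeaccount.dfs.core.windows.net/data/raw/Log.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b',
        ROWTERMINATOR = '0x0a'   -- newline, instead of the generated default
     )
     WITH (jsonContent VARCHAR(MAX)) AS [rows];
```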
41. Lab – Surrogate keys for dimension tables
Now in this chapter, I want to go through surrogate keys. This was something I mentioned earlier. You implement this for your dimension tables. In our example, when we create a dimension table, we have the product ID. This product ID is coming in from the Products table, which is part of our SQL database transaction system. When we come to our dimension table, as I mentioned before, the data for your dimension table can come from a variety of sources, and if the product ID from two different sources is the same, it will be an issue in your dimension table.
In this case, the product ID is also known as the “alternate key” or the “business key”; that is, it refers to the key that is used in your transactional system. Now, to uniquely identify each row, you can create a surrogate key. You can create this as a simple incrementing integer value. The surrogate key is also sometimes referred to as the “non-business key.” In the dedicated SQL pool, we can use the Identity column feature to have incrementing integer values for the surrogate key. So here is the table definition that we are going to implement for our product dimension table. Here I am using the Identity column feature, and I’m naming this column “ProductSK.” So, first things first, let’s drop our current dimension product table. Here I am in SQL Server Management Studio; I’ll first drop the product table.
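The table definition is along these lines; apart from ProductSK and ProductID, the column list is illustrative:

```sql
CREATE TABLE [dbo].[DimProduct]
(
    [ProductSK]   INT IDENTITY(1,1) NOT NULL,  -- surrogate key, generated automatically
    [ProductID]   INT NOT NULL,                -- business / alternate key from the source system
    [ProductName] NVARCHAR(100)
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);
```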
Then, let me create the product table again. So we have the product table in place. Now let’s move on to the Integrate section, where we’ll use our Copy Data tool to copy our table from our SQL database onto Azure Synapse. Here, I’ll choose Adventure Works, which points to our SQL database. I’ll select my dimension product table here. I’ll go on to the next step. I’ll choose the Synapse connection, and the target matches the product table. I’ll proceed to the next step. Obviously, this will cause a problem in the column mapping, because we have the ProductSK column in the destination but not in the source. But we don’t need this column as part of our mapping, because the values for this column are going to be generated automatically. So I’m going to leave all of this as is.
I’ll proceed to the next step. I’ll choose my staging location; I’ll browse for the synapse container and hit OK. Here in the advanced settings, to keep it simple, I’ll just choose bulk insert. Proceed to the next step, and the next one. It will create a pipeline and start copying our data. I’ll hit Finish and proceed to the Monitor section. So here I have the pipeline in progress. Once the pipeline is complete and has succeeded, if I do a SELECT * from the products table and hit execute, here I can see the ProductSK column. Let me order it by the ProductSK column and hit execute. Here you can see that the values in the ProductSK column are not incrementing one by one; the increments actually depend on the number of distributions. Now, I’ve seen quite a few people complain in forums about why this isn’t a proper incrementing column the way identity columns behave in the SQL Database implementation. As I said, I’ve seen these issues brought up in a lot of forums. But when we come on to Azure Data Factory, I will show you a way in which we can ensure that the surrogate column has proper incrementing values. So wait on that; here I just wanted to demonstrate the concept of having this key in your dimension table.
42. Slowly Changing dimensions
Hi, and welcome back. In this chapter, I’d like to go through slowly changing dimensions. When it comes to your fact tables, you will typically only be appending data to the fact table. So, for example, if you have sales data, you will keep on adding sales information as soon as it is available in your transactional system. Whatever incremental inserts have happened in your transactional table, you’ll do the same inserts in your fact table as append operations. But when it comes to your dimension tables, sometimes there are updates that can happen to the rows in your dimension table.
For example, let’s say you have a dimension table that stores information about products, and that for a particular product ID in the transactional system, the product name changes. So there’s a small change, say, in the product name. There are different ways in which you can handle this change in your data warehouse, and these are known as “slowly changing dimensions.” There are different categories of slowly changing dimensions, and it just depends on the change that you’re going to make in the dimension table in your SQL data warehouse. So I’ll just quickly discuss Type 1 and Type 2 slowly changing dimensions. Type 2 is very important from an exam perspective, but it’s very easy to understand what a Type 2 dimension is. In Type 1, if there is a change to, let’s say, the product name for a particular product ID in the transaction system — say, in your SQL database — then in your dimension table, for the same product, you just go ahead and simply update the product name for that product ID. It’s as simple as that. So this is a Type 1 slowly changing dimension. Now, with Type 2, you would want to keep both the old value of the product name and the new value of the product name.
Your fact table would be looking at the product ID and not the product name, but in certain situations you might want to have both the old information and the new information in the dimension table. So here is an example. We have the surrogate key, we have the product ID, and we have both of the product names: this is the old product name, and this is the new one. What you do in your dimension table, in such a case, is introduce two additional columns: one known as the start date, and the other known as the end date. So you’re saying that for this product, the old name was valid between this start date and this end date, up to when the product name was changed. Then we have a new row, and here we have mentioned its start date and end date. The end date is an infinite date at this point in time, and your query needs to take the start and end dates into account when it joins with your fact tables. You can also have one more column just to specify which row holds the current value for the product itself. So there are different ways in which you can actually handle the changes in your dimension tables. In this particular chapter, I just wanted to go through the two main types of slowly changing dimensions.
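A Type 2 change can be sketched as two statements — expire the current row, then insert the new version. The column names and values here are illustrative:

```sql
-- Expire the current row for the changed product
UPDATE [dbo].[DimProduct]
SET    [EndDate]   = '2023-06-01',
       [IsCurrent] = 0
WHERE  [ProductID] = 101 AND [IsCurrent] = 1;

-- Insert the new version with an open-ended ("infinite") end date
INSERT INTO [dbo].[DimProduct]
       ([ProductID], [ProductName], [StartDate], [EndDate], [IsCurrent])
VALUES (101, 'New product name', '2023-06-01', '9999-12-31', 1);
```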
43. Type 3 Slowly Changing dimensions
I want to quickly cover Type 3 dimensions in this chapter. And the reason I want to cover this is because I got a question on Type 3 dimensions on the exam. I had thought that I would only get questions based on Type 2, but I did get a question on Type 3. That’s why I just want to quickly go through the Type 3 dimension. So in the Type 2 dimension, in terms of the columns, we would have a start date and an end date, and you can have an “is current” column as well. Here, if the name of, let’s say, the product changes, remember we will add an additional row. But in a Type 3 dimension-based table, you will not have the additional row in place. Instead of having the start and end dates in different columns, you’ll have the original name and, let’s say, the new name of the product in separate columns, and then you’ll have the effective date from when the change took place. So this is another way in which you can represent changes in your dimension table. The issue here is that if the product changes a third time, you will most likely need to add another column, which can be a problem; adding multiple rows, as in Type 2, is fine. As a result, I’ve seen people use Type 2 dimensions instead of Type 3 dimension tables. So again, just because I got a question in the exam, I want to have a quick look at Type 3 dimension-based tables.
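A Type 3 change, by contrast, touches only one row and keeps the history in extra columns. A sketch with illustrative column names:

```sql
-- Type 3: keep the previous value in a separate column on the SAME row
UPDATE [dbo].[DimProduct]
SET    [OriginalName]  = [CurrentName],       -- preserve the old value
       [CurrentName]   = 'New product name',  -- record the new value
       [EffectiveDate] = '2023-06-01'         -- when the change took effect
WHERE  [ProductID] = 101;
```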
44. Snowflake schema
In this chapter, I want to give a note on the snowflake schema. We have seen the star schema, and this is important from an exam perspective. Now, the major difference here is in your dimension tables; your fact table remains the same. But when it comes to your dimension tables, in addition to, let’s say, having a dim customer table as a dimension table, there might be another table, say a dim location table, that has a relationship with the customer table. So here you have relationships between the dimension tables as well. This is like having normalised tables in a traditional SQL database. That is one of the key differences when it comes to the snowflake schema. Now, as I previously stated, I will not go into detail about this schema, because it is the star schema that is critical for the exam. That’s why I just want to give a note on the snowflake schema.
45. Creating a heap table
In this chapter, I just want to go through the process of creating a heap table. So, for example, if you want to create a staging table in your dedicated SQL pool for data loading, and then put that data into another table after performing some sort of transformation, you can first create a heap table for your staging data. So here I am, creating a table named sales_fact_staging. If you want to create a heap table, you mention the HEAP keyword here.
Also, you can make it a round-robin distributed table so that the distribution also caters for faster loading of your data. So here, I can also mention that the distribution is round robin. Let me create this table. Here I am in SQL Server Management Studio; let me hit execute.
So we have this table in place. Now, let’s say you also want to optimise the queries that work on this particular table. Please note that this table is not created as a clustered columnstore table. You have two different types of tables here: either it can be a clustered columnstore table or it can be a heap table. The clustered columnstore table should be used for your final tables in your data warehouse, whereas, as I said, if you want a temporary staging table in place, you can create a heap table. Here you don’t have the option of creating a clustered columnstore index. Instead, what you can do is create a normal, non-clustered index with the help of the CREATE INDEX statement. Here you give the name of the index, and you mention which column of the table you want to create the index on. So if you have queries that mostly use the product ID column, you can create an index based on that particular column; if you’re searching for data based on the product ID, you can create a non-clustered index on it. So again, from the perspective of the exam, there are some important points here. First is the creation of the heap table; next is that you can include the round-robin distribution; and finally, you can create a non-clustered index on the heap table.
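Putting those three points together, a sketch of the staging table and its index (the column list is illustrative):

```sql
-- Staging table: heap plus round-robin distribution for fast loads
CREATE TABLE [dbo].[sales_fact_staging]
(
    [ProductID]     INT NOT NULL,
    [OrderQuantity] INT
)
WITH
(
    HEAP,
    DISTRIBUTION = ROUND_ROBIN
);

-- Optional non-clustered index to speed up lookups on the product ID
CREATE INDEX IX_SalesFactStaging_ProductID
ON [dbo].[sales_fact_staging] ([ProductID]);
```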
46. Lab – CASE statement
Now, in this chapter, I just want to go through the CASE statement that you can use as part of your SQL queries. The first thing I will do is create a simple product details table in our dedicated SQL pool. So here in SQL Server Management Studio, I’ll first create my product details table. Now, I have a very simple products file that I want to load into this product details table. The first thing I’ll do is go to my data container that’s in my Azure Data Lake Gen 2 storage account. I’ll go on to the raw folder and upload the file that I have in my temporary directory. So I’m uploading a simple products file, and I’ll hit upload. If I go on to the Product.csv file here, I can see the details in my file, and I can ensure that I don’t have an extra line at the end; I can hit delete if needed, or just hit save. Now, I’ll quickly use a pipeline to copy the data from this file onto our table. Pretty simple. I’ll go on to my workspace in Azure Synapse Studio, go on to the Integrate section, and create a new Copy Data tool pipeline. I’ll proceed to the next step. For my connection, I can select the Azure Data Lake Storage connection.
We had used this connection earlier for our staging area when copying data from the SQL database onto Azure Synapse, so now we can reuse the same connection to copy our data. Here, in the file or folder setting, I can browse for the file: I can go on to my data container, on to the raw folder, select my Product.csv file, and press OK to get the full path. I’ll unselect the recursively option because I only have one file that I need to copy onto my table. I’ll proceed to the next step and leave all these settings as they are. Here, you can preview the data; you can see the data right here itself. I’ll proceed to the next step. I’ll now select Azure Synapse as my connection. I’ll use an existing table and choose my product details table. I’ll go on to the next step. In the column mapping, it’s a very simple mapping that just gives the numbers one, two, three, and four for the source columns. That’s because I don’t have any column names defined in my file; I have the data directly. So it gives a number for each source column and maps it onto the target. You can change the mapping to the destination columns here if you want. I’ll go on to the next step. I need to again choose my staging area, so I’ll choose the same connection and browse for the same container, which is synapse.
Here in the advanced settings, I’ll again use bulk insert. I’ll proceed to the next step, and the next one. It will create my dataset and my pipeline, and it will run my pipeline. So in this chapter, I actually want to show two things. One is the CASE statement, which we’ll see shortly; it’s very simple. But the additional thing that I want to show you is in the pipeline in the Integrate section: we can not only copy data from our SQL database onto Azure Synapse; we are now also copying data from a file that is available in our Azure Data Lake Gen 2 storage account. Let me go on to the Monitor section, where I can see that my pipeline has succeeded. Let’s confirm: if I go to SQL Server Management Studio, we have all of our data in place. So now I am just working with the product status column and a CASE statement. First of all, I select the product ID, the product name, and the status. If the status has a value of ‘W’, then display ‘Warehouse’; if it’s ‘S’, then ‘Storage’; and if it’s ‘T’, then ‘Transit’. And here we have the END statement that ends the CASE. So this is like having a switch statement in a normal programming language, where, based on a particular value, we display something else. So if I take this, copy it here, and hit execute, you can see the status values are now translated. So we are using the CASE statement to display values based on the value in the status column itself.
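The query is along these lines; the table and column names here are placeholders for the ones created in this lab, and the decoded status names follow the mapping described above:

```sql
-- Map the single-letter status codes to readable names
SELECT [ProductID],
       [ProductName],
       CASE [ProductStatus]
            WHEN 'W' THEN 'Warehouse'
            WHEN 'S' THEN 'Storage'
            WHEN 'T' THEN 'Transit'
            ELSE 'Unknown'
       END AS [Status]
FROM [dbo].[Product_Details];
```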