DP-203 Data Engineering on Microsoft Azure Topic: Design and implement data storage – Azure Synapse Analytics Part 2
December 19, 2022

27. Lab – Transfer data to our SQL Pool

So in the prior chapters, we had gone ahead and built our fact and dimension tables in the SQL database itself. Now we’ll see how to transfer these tables onto Azure Synapse, into our dedicated SQL pool. If I go on to my SQL pools and on to my new pool, you’ll see I had paused this pool in between because I wasn’t working on it. I mentioned earlier that at any point in time you can pause your dedicated SQL pool so you don’t bear a compute cost. I’ll go ahead and resume the pool. Let me return to Azure Synapse in a new tab, go on to my workspace, and open up Synapse Studio again. We can perform our load operation into the tables in our dedicated SQL pool using the Integrate feature that is available in Azure Synapse Studio. This is actually built on creating pipelines that are based on Azure Data Factory.

This is something you will notice as we progress through the Azure Data Factory section. For now, we are going to create a pipeline that will copy our tables from our SQL database onto our dedicated SQL pool. The first thing I will do is ensure that I have my fact and dimension tables defined in my dedicated SQL pool. You can use pipelines to automatically create the destination tables themselves, but normally you will already have these tables in place in your data warehouse. So here I have the statements for creating my tables. You can see that most of the columns, especially those based on an ID or a key, are marked NOT NULL because I don’t want them to contain null values. It’s good practice to avoid nulls in your fact and dimension tables, especially for the keys used to join your fact and dimension tables. So, first and foremost, let me confirm that my pool is operational. It is now in the running state.
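As a reference, here is a minimal sketch of what those CREATE TABLE statements could look like. The exact script comes from the AdventureWorks-based tables built in the earlier chapters, so treat the table names, column names, and data types below as assumptions rather than the exact course script.

-- Minimal sketch of the destination tables in the dedicated SQL pool.
-- Column names and types are assumptions based on the AdventureWorks-style
-- tables described earlier; adjust them to match your own source schema.

-- If you are re-running the script, drop the tables first.
IF OBJECT_ID('dbo.SalesFact')   IS NOT NULL DROP TABLE dbo.SalesFact;
IF OBJECT_ID('dbo.DimCustomer') IS NOT NULL DROP TABLE dbo.DimCustomer;
IF OBJECT_ID('dbo.DimProduct')  IS NOT NULL DROP TABLE dbo.DimProduct;

-- No WITH clause here: the distribution type defaults to ROUND_ROBIN,
-- which is discussed in the later chapters.
CREATE TABLE dbo.SalesFact
(
    ProductID   INT            NOT NULL,   -- key into DimProduct
    CustomerID  INT            NOT NULL,   -- key into DimCustomer
    OrderQty    SMALLINT       NOT NULL,
    UnitPrice   DECIMAL(18,2)  NOT NULL,
    OrderDate   DATETIME       NULL
);

CREATE TABLE dbo.DimCustomer
(
    CustomerID  INT            NOT NULL,
    CompanyName NVARCHAR(200)  NULL
);

CREATE TABLE dbo.DimProduct
(
    ProductID              INT            NOT NULL,
    ProductName            NVARCHAR(100)  NULL,
    ProductSubCategoryName NVARCHAR(100)  NULL
);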

So in SQL Server Management Studio, let me just refresh the connection to my dedicated SQL pool. I’ll go to my databases, go on to my new pool, then right-click and select New Query. I’ll copy all of the statements and place them over here. I’ll first create my Sales Fact table, then my Dimension Customer table, and then my Dimension Product table. In subsequent chapters, we’ll look at what other aspects go into creating tables in Azure Synapse; at that time we’ll be looking at the different distributions that are available when creating tables. At this point I am not mentioning any sort of distribution, but in later chapters you’ll understand this concept of distribution. For now, I’m just creating the tables as they are. If you go on to the tables, there is no data in them yet. So all of the tables are now in Azure Synapse; if you want to drop the tables, I have kept those commands in place as well. Returning to Synapse Studio, I’m in the Integrate section, and I’m going to choose the Copy Data tool option.

Here we are presented with a wizard for copying data from the source to the target. Now, when we’re using a pipeline to copy data onto a target that is based on Azure Synapse, we need to have a storage account in place that is used as the staging area. The staging area is used to take data from the source, perform any work required on that data, and then send it on to Azure Synapse. Just for this, I’m going to use the Azure Data Lake Gen 2 storage account. So, if I go to the storage account, which is datalake2000, I’ll go to containers and create a new container called synapse. This will just be a temporary container for the data that’s required by the pipeline. We’ll also reuse the same container in other chapters. Now, I want to run this particular pipeline only once. This wizard is going to create a pipeline that has an activity to copy data from the source to the target. I’ll go on to Next. Now here, I need to create a new connection. We are at the part of defining our source: we have to define something known as a data set. What is the data that we want to copy from the source? For that, we need to create something known as a connection. So let’s go ahead and create a new connection.

Now you can see that you have the capability of connecting to different types of data stores. You can connect to data stores in Amazon Web Services and elsewhere in addition to Azure. I’ll choose Azure, and since our data is in an Azure SQL database, I’ll choose Azure SQL Database and hit Continue. I can give the connection a name here. I can choose my subscription and my server name, which is appserver1000, and then my database, AdventureWorks. I’ll select SQL Authentication as the authentication type, enter my username and password, and test the connection. I’ll then hit Create. So now we have the connection in place. And now you can see all of the tables that are present in our AdventureWorks database. Let me filter for our fact table and choose the sales fact table. I’ll then search for my customer table and my product table.

So now I have chosen three tables: my customer table, my product table, and my fact table. I can copy all three tables at one time. I’ll go on to Next. If you want, you can preview the data in advance, and there are other parameters as well, but I’ll leave everything as it is and proceed to the next step. Now we have to define our target settings: where are we copying our data to? Again, we have to create a new connection because we are now connecting to Azure Synapse. I’ll choose Azure, search for Azure Synapse Analytics, and hit Continue. I’ll leave the connection name as it is, choose my subscription, and choose the server name, which is my workspace, appworkspace9000. I have to choose my database, and I’ll go with SQL Authentication, for which I need the username.

If I go on to the workspace overview, I can get the username from there. I’ll enter the password, test the connection, and since it’s fine, I’ll hit Create. So this is also done. Now, it has automatically mapped the tables in our source onto the tables in our target, and that’s because the table names are the same. So it’s doing all of the mappings as they are; here you are mapping the tables from the source to the destination. I’ll go on to Next. Now, you can perform column-based mapping for each of the tables. If you don’t want to copy certain columns, you can go ahead and delete those columns here. But since I want to copy everything, I’ll leave everything as it is and go on to Next. In the settings, we now need to provide a staging account linked service. As I said, when we are copying data onto Azure Synapse, we need to have a staging account in place. The copy activity will internally use this staging account as a staging area for loading the data before it copies it onto the dedicated pool in Azure Synapse.

So we currently have one linked service for our SQL database and one for Azure Synapse. Now let’s create another linked service over here, one that points to our Data Lake Gen 2 storage account and, specifically, to the synapse container. As I said, the synapse container will be a kind of temporary area that’s going to be used by this copy activity. So I’ll hit New and name it Synapse Data Lake. For the type, I’ll use Azure Data Lake Storage Gen2. I’ll go on to my subscription, choose my storage account name, and hit Create. Next, I need to browse for the storage path, so I’ll browse for the container, click on synapse, and hit OK so that it comes up over here. Now, if I expand the advanced settings and scroll down, you can see that the copy method is currently using the PolyBase technique. But in this case, we don’t need to use PolyBase because we are just copying everything from the SQL database directly onto Azure Synapse.

So I can just do a very simple bulk insert. I’ll then go on to Next, leave everything as it is, and go on to Next again. And now it’s going to go ahead with the creation of the data sets and the pipeline, and then it will even run the pipeline. So now, with the help of a wizard, we’ve created a pipeline. The pipeline contains a copy activity, and that activity will be used to copy our data from the source to the destination. If I hit Finish, I can now see my pipeline in place in the Integrate section. If we go on to our Data section and on to the Linked tab, you can see something known as integration data sets. Our destination data set points onto Azure Synapse, and our source data set points onto the data in our SQL database. Again, we’ll be going through data sets and these connections in more detail when we cover Azure Data Factory. If you now go on to the Monitor section, under pipeline runs, you can see your pipeline first in the running state, and then you can see that it has succeeded.

So if I now go on to SQL Server Management Studio and do a SELECT * from the Sales fact table, you can see all of the data in place. You can also do a count; it’s always good to check the number of rows that have been copied. You can check the rows from the customer table and its count as well, and check the number of rows in the product table, which is also fine. So everything has been copied over; as mentioned, you can perform a confirmation by doing a count in your SQL database and verifying that it matches the number of rows in your destination. So what have we done? We’ve used the pipelines in the Integrate feature in Azure Synapse Studio to copy our data very easily from the source onto the destination. If you had to use other tools, it would take much more time to get everything set up just to copy data from the SQL database onto Azure Synapse. But you can see from Synapse Studio itself how easy it was to go ahead and copy the data.
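As a quick sketch, the row-count verification can be done with simple COUNT queries run against both the source database and the dedicated SQL pool (the table names below assume the dbo schema sketched earlier in this section):

-- Run these against both the source Azure SQL database and the dedicated
-- SQL pool, then compare the numbers; they should match after the copy succeeds.
SELECT COUNT(*) AS SalesFactRows   FROM dbo.SalesFact;
SELECT COUNT(*) AS DimCustomerRows FROM dbo.DimCustomer;
SELECT COUNT(*) AS DimProductRows  FROM dbo.DimProduct;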

28. Other points in the copy activity

Now, just a quick note when it comes to the Copy Data tool. I just went ahead and started the wizard again. In our destination data store, we went ahead and created a new connection to connect to our dedicated pool, and here it will again automatically populate the tables. This is in place because we’re doing it from Azure Synapse Studio. When we go on to Azure Data Factory, you’ll see the same screens there as well, but there we’ll have to explicitly create the connection; that’s why I created the connection from here. You can, however, reuse the same connection that we have in Azure Synapse Studio. When you get to Azure Data Factory, you’ll have a much better understanding of this concept.

Now, next, when it comes to the mapping from the source to the destination, what is very important are the types that are being mapped from the source and, if I scroll to the right, the types that they are being mapped onto in the destination. In our case, the types for the columns are very simple. The only change happening here is in how the character (varchar) columns are mapped onto the destination, where the copy pipeline simply applies sensible defaults. But it’s up to you to verify whether the data types in your source are being mapped onto the right data types in your destination. Right? So in this chapter, I just wanted to give a couple of quick points when it comes to the copy activity we had implemented in the last chapter.

29. Lab – Using Power BI for Star Schema

So in the prior chapter, we had gone ahead and copied our dimension and fact tables onto Azure Synapse. Normally, you can now use visualization tools to work with the data in your fact and dimension tables. And remember, these are built on the concept of the star schema, and tools like Power BI can actually understand that the tables are based on that star schema.

And you can start analysing the data in your fact and dimension tables. So I’ve launched Power BI Desktop again. I’ll click on Get Data; here I need to choose Azure, and I’ll choose Azure Synapse Analytics. I’ll double-click on the SQL data warehouse option, and I need to give the server name. So I’ll go into my workspace, take the dedicated SQL endpoint, and place it over here. I’ll hit OK, choose the database option, and give my username and password. So I’ll enter the username and the password and hit Connect. I’ll choose my dedicated SQL pool, and then my dimension tables and my sales fact table. We can see a preview of the sales fact table. Let me go ahead and hit Load. Once this is completed, it will load all of the tables into Power BI. On the right-hand side, we can see all of our queries based on the data in the tables. If I go on to the Model view and just hide everything, I can see that there is a relationship between my DimProduct table and the SalesFact table.

Now, please note that we have not defined any relationships on the tables themselves, because foreign key constraints aren’t supported in the dedicated SQL pool. But here, based on that star schema concept, Power BI has inferred that there is some relationship between the sales fact table and our dimension product table. Since we can’t see a relationship between the Sales Fact table and the Customer table, we can go ahead and create one. So here I can click on the context menu and hit Manage Relationships. I can click New, select my customer table and my sales fact table, create the relationship based on the customer ID, and then choose OK and Close. So now we have that relationship over here as well. I’ll go back to my report screen. Now I’m going to put a clustered column chart on my report. I’ll just hide the filters, drag the product ID onto the axis, and for the values I’ll choose the order quantity. So now I have this graph in place. Next, I’ll just go ahead and close this up, click on the main screen, and add something known as a slicer. In the slicer, I’ll go on to my DimProduct table and choose the product subcategory name.

Now, with my main chart showing the order quantity by product ID, I can also slice my data on the product subcategory name. So if I choose road bikes, I can now see the product IDs and order quantities that correspond to the products in that subcategory. So with the help of dimensions, you are now looking at a different view of your facts, and that’s the entire purpose of having dimension tables in place. You have your fact table that has all of the facts, but if you want to see your facts from a particular angle or a particular dimension, you can now do it with the help of the dimension table. And this is possible because you have that product ID in your fact table, and we have it in our dimension table as well. Again, there is a lot more that you can do in Power BI and a lot that you can build when working with your fact and dimension tables. This is obviously outside the scope of the current exam, but I thought it might be useful for students who are trying to understand why we would want to build fact and dimension tables in the first place. I just wanted to give you a very simple example.

30. Understanding Azure Synapse Architecture

Now, in this chapter, I want to go through the different types of tables that you can create in Azure Synapse in your dedicated SQL pool. Earlier on, we had seen how we could create our fact tables and how we could create our dimension tables. So these are table types that were based on a particular pattern when it comes to a data warehouse. So, whether you’re building the data warehouse in Azure Synapse or in another system, normally you will use this concept of fact and dimension tables.

This is when it comes to the pattern of how you have your data in your tables. But when it comes to Synapse, specifically your dedicated SQL pool, you also have something known as table types. These different table types are meant to help with the distribution of the data in the dedicated SQL pool. Hash-distributed tables, replicated tables, and round-robin distributed tables are the three table types. Now, one of the main reasons you have these different types of tables is the amount of data that you store and process in a data warehouse. With a traditional SQL database you might be dealing with a lot of data, but usually not as much as when you want to perform analysis at data warehouse scale. If you’re looking at working with petabytes of data, then the way you distribute that data, in order to get the best efficiency when analyzing it, is very, very important.

So when it comes to Azure Synapse, specifically the dedicated SQL pool, you have something known as distributions. What happens when you look at the architecture of the dedicated SQL pool? Let’s say a user fires a query that works on the data stored in a table in the SQL pool. This query goes on to something known as the control node, which is part of the SQL pool. The control node then distributes that request to different compute nodes. So in your pool, you will have different compute nodes. It’s like having multiple virtual machines running side by side, each processing your query and data in parallel. This is an advantage because if you are working with large amounts of data and can work with the data in parallel, you will be able to complete the job much faster. It’s like a divide-and-conquer approach, wherein you are dividing the query to ensure that you get the fastest result.

So that’s the idea over here: you can have your compute nodes work with your data in parallel to achieve a particular result. Now, the number of compute nodes that are allocated to your dedicated SQL pool depends on the number of data warehousing units that you allocate to the SQL pool. Remember, when you create your SQL pool, you have to decide on the number of data warehousing units. We are currently using the least expensive option, DW100, and I believe with this you only get one compute node, so you don’t really get the benefit of parallel processing. Normally, in a large organisation or enterprise where you have large amounts of data, you will use a higher data warehousing unit capacity to ensure that you have multiple compute nodes. Now, the next concept is your underlying data. There is a separation between the data layer and the compute layer in Azure Synapse as part of your dedicated SQL pool. In the data layer, you have something known as distributions. By default, there are 60 distributions; here, I’m just showing three distributions for simplicity’s sake. When you have data in a table, it is actually split across these different distributions. So a distribution is like an allocation of space, and your data is split across these different spaces.
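As an aside, if you want to check how many compute nodes are behind your own pool, here is a minimal sketch using the documented DMV sys.dm_pdw_nodes; run it while connected to the dedicated SQL pool.

-- List the control node and compute nodes backing the dedicated SQL pool.
-- At DW100 you would typically see a single compute node alongside the control node.
SELECT pdw_node_id,
       type,          -- CONTROL or COMPUTE
       name
FROM sys.dm_pdw_nodes
ORDER BY type, pdw_node_id;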

This splitting of the data across distributions gives an opportunity for a compute node to take a set of distributions and work with the data accordingly. In the simplest case, you can have one compute node that works with one distribution, another compute node that works with another distribution, and so on. So your data is split across multiple distributions, and when your compute nodes work with the data in these different distributions, it helps to break down that work in parallel, achieve a faster result, and then send the result back to the user. So the data is essentially sharded into these distributions in order to optimise performance, and all of your distributions are stored internally in your storage. One of the other big advantages of separating your data and your compute is that they can scale independently; you don’t have a dependency of one on the other. One of the main reasons I’m explaining this concept is for students to understand the different table types, because these table types are based on the underlying workings of the distributions and the compute nodes. So let me wrap this chapter up here, since I wanted to split this discussion into two parts. In the next chapter, let’s look at the different table types.

31. Understanding table types

So in the last chapter, I gave a kind of primer on the architecture of the dedicated SQL pool. Now, in this chapter, let me go through the different types of tables. The first type is the round-robin distributed table. Here, for the sake of simplicity, let’s say you’re loading this table of data into Azure Synapse, into your dedicated SQL pool, and I’m just showing three distributions; I said that by default there will always be 60 distributions in the underlying data storage layer. In a round-robin distributed table, one row of data is allocated to each distribution in turn.

This happens in round-robin fashion. So let’s say that you have a query that works with all of this data. If you had three compute nodes, these three nodes could target all three distributions at the same time, work with the data, get the result, and give it back to the user. If you only look at these three rows of data, you will not understand why you need to distribute your data. But when it comes to working with data at scale, you have to distribute your data across different distributions. When it comes to the architecture of systems that work with big data, this is always the underlying concept. Your data is distributed not only in Azure Synapse but also in a number of other popular database systems. There is always some methodology used in those systems to split the data. If you have large amounts of data, there are different technologies being used; it’s not always called distributions, but the concept is the same: to partition your data in such a way that it becomes easier and more efficient to query it.

This applies when working with big data and large amounts of data, and it’s not only about the querying aspect; it also helps from the storage aspect as well. Now, next, let’s go on to hash-distributed tables. Normally, you will use hash-distributed tables when you are creating your fact tables. So what happens in a hash-distributed table? Here, the data will be distributed based on a hash value that is calculated on a particular column. When you create a hash-distributed table, you have to mention which column is going to be used to hash the data. Assume you select the category column for hashing the data. In that case, one row of data will be in one distribution, whereas all the rows that have the same value for the category column will land in the same distribution.

So that’s how hash-distributed tables work. I’ll explain the benefit of using hash-distributed tables for your fact tables a little later on; at this point, I’m just trying to explain the different table types. And then finally, we have replicated tables. In a replicated table, a complete copy of the table is cached on each compute node. Here I’m also showing compute nodes, but let’s look at the distribution aspect for now: all of the rows of data are available in a distribution attached to each compute node, assuming that you have multiple compute nodes. So, keeping this in mind, we’ve got replicated tables, hash-distributed tables, and round-robin tables. Now, when it came to our tables, which we had copied earlier from our SQL database onto Azure Synapse, we never mentioned what the table types were. If I go on to the object explorer, on to my new pool, and on to my tables, then by default all of these tables are based on the round-robin distribution table type.

Even the icons for the tables denote that these are round-robin distributed tables. So our data will be spread across the multiple distributions. Here, in one of my SQL scripts for creating round-robin tables, you can see that since round-robin distribution is the default, you don’t need to mention anything for the table type; it will automatically be created as a round-robin distributed table. If you want to see the distribution of data, you can issue a command where you just need to mention the table name. So I’ll take this command and copy it across; note that this query window is connected as the SQL admin user. I’ll remove everything else, execute the statement, and here you can see all of the distributions. You can see that there are 60 distributions, and that the node ID is always one, which means we only have one compute node; as I said, that depends on the data warehousing units you use. Then you can see the number of rows in each distribution. You won’t see a perfectly uniform distribution of the rows, but the idea is to spread the rows as evenly as possible across all of the distributions.
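For reference, a minimal sketch of that check using DBCC PDW_SHOWSPACEUSED, the documented command for showing rows and space per distribution (the dbo.SalesFact name is the one assumed earlier in this section):

-- Show row counts and space used for each of the 60 distributions
-- of a table in the dedicated SQL pool.
DBCC PDW_SHOWSPACEUSED('dbo.SalesFact');

-- The output includes one row per distribution, with the row count,
-- reserved/data space, and the pdw_node_id hosting that distribution.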

32. Understanding Round-Robin tables

Now, when it comes to our round-robin distributed table, I have a query that I’m going to execute. In the next chapter, when we are looking at hash-distributed tables, I’ll execute the same query there.

I want you to notice the differences between using a round-robin distributed table and a hash-distributed table. In this query, I’m using a GROUP BY clause on the customer ID. Now, because our rows of data have been distributed across the different distributions, when you try to perform a join or a grouping based on a specific column, the nodes may need to bring the relevant rows together: all of the rows for a given customer ID must be gathered in one place in order to group by the customer ID. Even though you can have nodes working in parallel on the different distributions, if you don’t use the right table distribution, and if your query is designed in such a way that it does not make effective use of the distribution, then something known as data shuffling is going to happen. As I stated, in a round-robin distribution the rows of the table are simply spread across the different distributions. So in order to group everything together, the pool needs to do a data shuffling activity, because the data was just distributed randomly based on the round-robin technique. Let’s see this when I execute the query.
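The query being run is a simple aggregation over the fact table; here is a minimal sketch of what it could look like (the exact column list used in the course may differ slightly):

-- Aggregate rows per customer on the round-robin table.
-- Because the rows for one CustomerID can sit in any distribution,
-- this grouping forces a shuffle-move operation behind the scenes.
SELECT CustomerID,
       COUNT(*) AS OrderCount
FROM dbo.SalesFact
GROUP BY CustomerID
ORDER BY CustomerID;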

So, for my round-robin table, I’ll run this query. Now, if I go on to the Monitor section and on to SQL requests, we should be able to see the details about each SQL request that is executed, either in our dedicated pool or in our built-in serverless pool. It can take around five to ten minutes for our query to show up over here, so I’m going to wait for some time and then come back. Now, after waiting for some time, I can see my request in place. Please note that if you don’t see the query, check the filter; it might be set to show the SQL requests for your built-in serverless pool, but we’re looking at our dedicated SQL pool, so ensure the filter is set to your dedicated SQL pool.

Next, I’m looking at the SELECT CustomerID statement. If I go on to More here, I can see that this is the statement we executed. If I go on to the request ID and look at the steps, I can see a lot of steps being carried out. The main step over here is the shuffle-move operation. Because the data needed for the grouping is not co-located on the nodes, time was taken for this particular shuffle operation, and you can see all of the rows that were involved in it. This is because, in a round-robin table, the rows are just distributed in round-robin fashion; it’s not based on a particular pattern. When we look at hash-distributed tables in the next chapter, we’ll actually see an improvement in the request steps. So it also depends on the type of query you’re firing on your data. Since we were using a GROUP BY clause based on the customer ID column, the pool had to do this data-shuffling operation to ensure that it gets all of the rows for each customer ID. Because the data is just randomly distributed across the distributions by the round-robin technique, it had to collate the data together in order to give us the response; that’s why we had that data shuffle. We are going to see an improvement on this when we look at hash-distributed tables.
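If you prefer to inspect the same information with T-SQL instead of the Monitor UI, here is a small sketch using the documented DMVs sys.dm_pdw_exec_requests and sys.dm_pdw_request_steps (the 'QID1234' value is only a placeholder you would replace with your own request ID):

-- Find recent requests against the dedicated SQL pool.
SELECT TOP 10 request_id, status, command, total_elapsed_time
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;

-- Look at the steps of a particular request; replace 'QID1234' with the
-- request_id from the query above. A ShuffleMoveOperation step indicates
-- data shuffling between distributions.
SELECT step_index, operation_type, status, row_count, total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'
ORDER BY step_index;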

33. Lab – Creating Hash-distributed Tables

Now, in this chapter, I want to show you how you can create a hash-distributed table. We are doing this for my Sales Fact table. Here we go with the normal creation of a table, but with a WITH clause where we say that the distribution is based on the hash of the customer ID; a sketch of this statement appears after this paragraph. So I’ll first drop the existing Sales Fact table that we have and hit Execute. Once that’s done, I’ll issue the CREATE TABLE command and hit Execute. Now we need to load our data into our Sales Fact table again, so I’ll go on to the Copy Data tool; I’ll discard any pending changes and create a new Copy Data tool pipeline that runs once. In the connection, I’ll choose AdventureWorks, which points to our SQL database, and I’ll choose our sales fact table. I’ll go on to Next, and Next again. In the destination connection, I’ll choose the Azure Synapse connection; it maps onto my Sales Fact table, so I’ll go on to Next and Next again. Here, I’ll enable staging again.
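Before continuing with the wizard, here is a minimal sketch of that drop-and-recreate step, assuming the same SalesFact column list sketched earlier in this section:

-- Drop the existing round-robin version of the fact table.
IF OBJECT_ID('dbo.SalesFact') IS NOT NULL
    DROP TABLE dbo.SalesFact;

-- Recreate it as a hash-distributed table on CustomerID.
-- Rows with the same CustomerID hash value land in the same distribution,
-- so joins and GROUP BY operations on CustomerID avoid data shuffling.
CREATE TABLE dbo.SalesFact
(
    ProductID   INT            NOT NULL,
    CustomerID  INT            NOT NULL,
    OrderQty    SMALLINT       NOT NULL,
    UnitPrice   DECIMAL(18,2)  NOT NULL,
    OrderDate   DATETIME       NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerID)
);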

I’ll choose the Data Lake storage account, browse for the synapse container, and hit OK. In the advanced settings, I’ll just do a simple bulk insert. I’ll proceed to Next, and Next again, and it will create the pipeline along with the copy activity. I can then click to go to the Monitor section; if it does not take you there automatically, you can go to the Monitor section yourself. Here we can see the pipeline in progress. If you’re not seeing the earlier pipeline, it’s because there is a filter; you can select all of the pipelines right here at any point in time. I can click on Refresh to see the status of the pipeline; it’s still in progress. Once it has succeeded, let me look at the distribution of the data. Here we can see our distributions, and you’ll see a slightly different arrangement when it comes to the row distribution. Now, please note that we had unique values for the customer ID, but that does not imply that each CustomerID value must be assigned to a different distribution. Remember, there is a hash function that calculates a hash, and based on that hash the row is allocated to a particular distribution.

One of the primary advantages of using a hash-distributed table is that queries will be much more efficient when performing joins or grouping by the hash-distributed column itself. So let me run the query that selects the CustomerID and a count from the underlying table, groups by the customer ID, and also orders by the customer ID. The GROUP BY here is the important part. I’ll execute this statement, and we are getting the desired results. Now I’ll wait for some time until we get the output in the SQL requests view. After waiting for some time, I can see the request over here; again, I can identify the request, and I’ll go on to More. This is the statement that we issued. If I click on the request ID, you can see that there is only a return operation; that’s it, we don’t have any other steps. So this has made the query much more efficient, because we have grouped on a column that the table is hash-distributed on. That’s one of the biggest advantages of using a hash-distributed table, assuming you choose the right column for the hash distribution.

34. Note on creating replicated tables

Now, when it comes to a replicated table, I’m not going ahead with the complete creation of the replicated table; I’m just showing you how you would create it. It’s very easy: simply setting the distribution to REPLICATE will result in the creation of a replicated table. Even though I’m showing you how to create a replicated table for our sales fact table over here, that’s only acceptable in this case because we don’t have that many rows; you would not normally create a replicated table for a fact table. Your fact table is going to be large in size, and if you are replicating that data across all the nodes and distributions, you are just eating up space. So for a fact table, you should always look towards creating a hash-distributed table. The only reason I’m showing this particular script is so that you can reuse the same table definition and look at the different distribution options. As a best practice, you will normally create replicated tables for your dimension tables and use hash-distributed tables for your fact tables. Right, so in this chapter, I just wanted to give that note on replicated tables.
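A minimal sketch of the syntax, shown here against a dimension table since dimension tables are the usual candidates for replication (the DimCustomer column list is the assumed one from earlier in this section):

-- Replicated table: a full copy of the table is cached on each compute node,
-- which works well for small dimension tables that are frequently joined.
-- (If DimCustomer already exists from the earlier load, you would drop and reload it first.)
CREATE TABLE dbo.DimCustomer
(
    CustomerID  INT            NOT NULL,
    CompanyName NVARCHAR(200)  NULL
)
WITH
(
    DISTRIBUTION = REPLICATE
);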

35. Designing your tables

Now in this chapter, I just want to reiterate the point of ensuring that you use hash-distributed tables for your fact tables and replicated tables for your dimension tables.

So your fact table will normally be large in size, and using replicated tables is not the ideal approach, because then your data will be replicated across each compute node, which results in a waste of storage space. Instead, you can use a hash-distributed table and ensure that the data rows in the fact table are distributed across different distributions depending upon the column that you choose as the distribution column for the fact table. Since your dimension tables are smaller in size, you can consider using replicated tables for them; these are very useful for your joins. Normally, when it comes to your fact tables and your dimension tables, you’ll be performing joins between the tables to analyse the data. So if you consider our last scenario for our sales fact table, let’s say that you have distributed the data as per the customer ID. The hash function takes the customer ID and then distributes the rows across the different distributions. So, for example, you could have rows with a customer ID of one in one distribution, two in another, and three in another; this is just one example. And then you have your dimension table.

So let’s say you have a DimCustomer table; that’s a dimension table. Now, if this is a replicated table, all of the information about all of the customers will be available on each compute node. You can now effectively perform joins for a specific customer because the customer information is already present locally. This allows the compute nodes to work effectively in parallel, because the dimension data is already available on each compute node. If this were not the case, a compute node that had to join the fact rows where the customer ID is equal to one might find that the DimCustomer information for that customer ID sits on another compute node, on another distribution. Then data shuffling would be needed to move the data across and make it available on the other compute node, and this would make your query less efficient; it would take more time. That’s why you have these different table types, and you can use them based on the design patterns for your tables. That’s why it’s always recommended to use hash-distributed tables for your fact tables and replicated tables for your dimension tables.
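As an illustrative sketch of the kind of query this design helps (the table and column names are the assumed ones used throughout this section): with SalesFact hash-distributed on CustomerID and DimCustomer replicated, a join like the one below can run on each distribution without moving the dimension data around.

-- The fact rows for each CustomerID sit in a single distribution, and every
-- compute node already holds a full copy of DimCustomer, so the join and the
-- grouping on CustomerID can be computed locally without a shuffle-move step.
SELECT f.CustomerID,
       c.CompanyName,
       SUM(f.OrderQty) AS TotalQty
FROM dbo.SalesFact AS f
JOIN dbo.DimCustomer AS c
    ON f.CustomerID = c.CustomerID
GROUP BY f.CustomerID, c.CompanyName;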

The entire purpose of my explaining the internal workings of distributions, compute nodes, and these tables is to reiterate the benefit of using these different table types when you are working with large amounts of data. This does have an impact because, in the end, you want your queries to execute as quickly as possible to get the desired results, and when you are working with large amounts of data, this can really make a difference. Right, so in this chapter, I just wanted to explain the core concept of using replicated tables for your dimension tables and hash-distributed tables for your fact tables.

36. Designing tables – Review

So in this chapter, I just want to go through some important points when it comes to designing your tables in Azure Synapse. First, hash-distributed tables: this table type takes the data and distributes the rows across the compute nodes. The distribution of data is done via a deterministic hash function that assigns each row to one distribution. Consider using hash-distributed tables if the table size is more than two gigabytes on disk. Next, when it comes to choosing a distribution column, remember that when you create your hash-distributed tables, you have to decide which table column will actually be used for distributing the data. When choosing a column, ensure that you avoid something known as data skew and processing skew. Data skew means that the data is not distributed evenly across all of the distributions.

So, if you don’t have many unique values for the column itself, a lot of the data might end up with the same hash, which means it might land in the same distribution rather than being spread evenly across the 60 distributions. This then leads to processing skew: if your data is only in some distributions and not across all of them, then when you process the data with your queries, that processing will only happen on some distributions and not all of them. So it’s very important to choose the right column for a hash-distributed table. Choose a distribution column that has many unique values, that has no nulls or very few nulls, and that is not a date column; rows with the same date all hash to the same distribution, so processing for a particular date would then hit only that distribution. Also, choose a column that is commonly used in JOIN, GROUP BY, and HAVING clauses as part of your query, but not one used in WHERE clauses. The WHERE clause is used for filtering your results, and hash distribution is not the ideal fit for a column used in your WHERE clauses; for that, there is a separate technique known as partitioning, which we’ll go through a little later on. Next, your round-robin tables.

So, this type of table distributes the rows evenly across all distributions. If a query needs data from different distributions, then data movement might be required to get the results of the query. If there are no joins performed on the table, then you can consider using this table type. Also, if the table lacks a clear candidate column for hash distribution, consider using round-robin tables. Your temporary staging tables can often use round-robin tables because loading into them is fast. Once you have your staging tables and you’ve done your transformations, you can copy the data from the staging table onto your final table in Azure Synapse and use a different distribution type for that destination table. And then we have replicated tables: each compute node has a full copy of the table. Consider using replicated tables when the table size is less than two gigabytes in compressed format, and it’s always good to consider this table type for your dimension tables. Don’t consider this table type if the table has frequent insert, update, and delete statements, because these could require a rebuild of the replicated table. Normally, your dimension tables won’t have that many inserts, updates, or deletes. Right, so in this chapter, I just wanted to go through some important points when it comes to the different table types.
