47. Partitions in Azure Synapse
Now, in this chapter, I just want to give an introduction to table partitions. So we have looked at distributions when it comes to the different table types that are available for your dedicated SQL pool. But you also have this concept of table partitions in place. It helps to divide your rows of data into different logical partitions. All of this data can reside on the same physical disk; you just have the idea of having logical partitions here. So these partitions help to divide the data into smaller groups.
Normally, the data is partitioned by dates. So if you have a date-based column, you will normally partition your data with the help of dates. Now, we'll understand this better in our lab. Partitioning also helps in filtering data when using the WHERE clause in your queries. So if you want better efficiency in queries that use the WHERE clause, you can use partitioning. Here, the engine can process only the partitions that match the conditions mentioned in the WHERE clause. Now, don't worry, we'll go into a lab. Before I show you how we can work with partitions, I'd like to give you a quick overview of the partition concept.
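As a rough sketch of the idea (the table name, column names, boundary dates, and distribution choice here are placeholder assumptions, not the statement we'll use in the lab), a date-partitioned table in a dedicated SQL pool looks something like this:

```sql
-- A minimal sketch of a date-partitioned table in a dedicated SQL pool.
-- Table and column names are illustrative placeholders.
CREATE TABLE [SalesFact]
(
    [SaleId]   INT,
    [SaleDate] DATETIME,
    [Amount]   DECIMAL(10, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    -- Rows are grouped into logical partitions by the date column.
    PARTITION ( [SaleDate] RANGE RIGHT FOR VALUES
                ('2021-04-01', '2021-05-01', '2021-06-01') )
);
```

Note that the partitioning is purely logical; the rows can still sit on the same physical disks, on top of the 60 distributions the pool already gives you.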
48. Lab – Creating a table with partitions
So in this chapter, we are now going to look at an example of how we can create a table with partitions. So I'm going to be making use of the logdata table we have seen earlier on, and also taking data from our Log.csv file, which is in our Azure Data Lake Gen 2 storage account. Now, if you don't have these tables in place, we can just go ahead and easily recreate them. So firstly, let me drop the existing table if I have one, just to ensure that we are in sync when it comes to the tables. So here, I don't seem to have the table in place. Let me also check if I have any sort of external table. So I had an external table.
So I just want to go ahead and delete that. Now let me go ahead and create the logdata table without any partitions. I'm keeping the table as is. I'll hit execute. So this is done. Then I'm doing a simple copy of my data. When it comes to the Log.csv file, there is one distinction. So here's the change I've made: I'm picking up the Log.csv file from my cleaned directory. Remember, I mentioned earlier on that we could not create an external table with the Hadoop driver, which is available as part of the dedicated SQL pool: we were getting an error wherein we couldn't convert the varchar into a datetime. And then I mentioned that I now have a new Log.csv file, and I said I'd explain how to generate this file in the section on Azure Data Factory. But I'm using this file now when it comes to loading data into this logdata table. This file is in a folder known as "cleaned" in my data container. I'll ensure to attach this Log.csv file as a resource to this chapter so that you can follow along. Again, I will show you how to generate this particular file later. So here we can see that we have a time column in place.
Now, I'd like to see the different variations of the dates I have. So I am going to format the time column in a particular format, and I want to see the number of rows based on the date. So let me execute this to give you a better understanding of what I am trying to achieve. So I'll hit execute. So here I can see that when it comes to the date of March 20, 2021, I have 46 rows. Similarly, for another date, I have some other number of rows. Next, I want to determine which months I have. So I can see that my first month is 03. If I scroll down, I have 04 for April, and then 05 for May. Finally, I have 06 for June. So now I want to create a table that is partitioned based on the date in the time column. So now I will drop my logdata table. Again, the reason that I did this initially was to show you what we are trying to achieve when it comes to the time column. So let me first drop this logdata table. Now I am going to create the table again, but this time I am mentioning a partition. So let me copy this and paste it into SQL Server Management Studio.
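The grouping query I'm running is along these lines (a sketch; it assumes the logdata table has a datetime column named time, and the exact format string is my choice):

```sql
-- Count the rows per date by formatting the datetime column.
-- Assumes a logdata table with a [time] column of type DATETIME.
SELECT FORMAT([time], 'yyyy-MM-dd') AS [date],
       COUNT(*)                     AS [row_count]
FROM   [logdata]
GROUP BY FORMAT([time], 'yyyy-MM-dd')
ORDER BY [date];
```

Switching the format string to 'yyyy-MM' gives the per-month counts, which is how we can confirm the data spans the months 03 through 06.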
So this part remains the same. What's changing is that I'm using the PARTITION clause, and here I'm saying partition by the time column. Now, when it comes to the range, you can have RANGE RIGHT FOR VALUES or RANGE LEFT FOR VALUES. I'll explain with a diagram what happens if you specify RANGE RIGHT FOR VALUES based on the values mentioned here. So here, I'm showing you what's going to happen. Based on that particular statement, first of all, remember that in our data we had rows for the months 03, 04, 05, and 06. So now we are saying, "Create partitions based on the time values." If I go back to SQL Server Management Studio, what is the first partition boundary I am giving? The month of April. So, when you say RANGE RIGHT FOR VALUES, the first partition will hold the rows with a time less than the first of April. That means all of the rows for the month of March are going to go into partition number one. And then in partition number two, we are going to have values that are greater than or equal to the April boundary and less than the May boundary.
That will be partition number two, because that range was clearly stated by the boundary values. So here I am actually creating partitions for each month based on the row values that I have in the table. If you had RANGE LEFT instead, then the first partition would hold values less than or equal to the boundary date, and the next partition would hold values greater than that date. So now we are creating, as I mentioned, partitions of rows based on the values in the time column. So let me go ahead and execute the statement to create the table. So that's done. Now I'll execute the statement to copy data into the table. I'll hit execute. Now, when you do a SELECT * from the logdata table, let me hit execute, you won't see any changes. The data is still as is. It's only the logical partitions that are now based on the values in the time column. Now, there are some statements that you can use to view the partitions. All of this is taken from the Microsoft documentation.
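Putting the pieces together, the partitioned CREATE TABLE statement looks roughly like this (the column list is abbreviated and the boundary dates follow the months we saw in the data, so treat the details as assumptions rather than the exact lab script):

```sql
-- Recreate logdata with monthly partitions on the time column.
-- Column list is abbreviated for illustration.
CREATE TABLE [logdata]
(
    [Correlationid] VARCHAR(200),
    [Operationname] VARCHAR(200),
    [Status]        VARCHAR(100),
    [Time]          DATETIME
)
WITH
(
    PARTITION ( [Time] RANGE RIGHT FOR VALUES
                ('2021-04-01', '2021-05-01', '2021-06-01') )
);
-- RANGE RIGHT: each boundary value belongs to the partition on its right,
-- so partition 1 holds rows with Time < '2021-04-01' (the March rows),
-- partition 2 holds '2021-04-01' <= Time < '2021-05-01', and so on.
-- With RANGE LEFT, each boundary would belong to the partition on its left.
```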
Here we are making use of a number of system tables. So if I copy this, the only change that I have made to the statement is the name of the table. So I'll execute this. Now, as you can see, we have four partitions in place. Here, it's showing the row counts as equal. Now remember, they will not actually be equal; this is just an internal implementation detail of how the partitions have been made, and I'm not going deep into how the partitions work on the back end. Instead, in the next chapter, I'll show you one of the core benefits of actually having partitions in place. This chapter was just about creating a table with partitions. Finally, there is one more statement you can use to look at both the distributions and the partitions. So in SQL Server Management Studio, if I execute this, you can see a lot more information.
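As a simplified sketch, a query in this spirit can be written against the standard catalog views (the full query in the Microsoft documentation joins several more system views, including the pdw node-level ones, to show per-node and per-distribution detail):

```sql
-- Simplified sketch: list the partitions of a table with their row counts.
-- The table name 'logdata' is the only part specific to our lab.
SELECT  t.[name]  AS [table_name],
        p.[partition_number],
        p.[rows]  AS [row_count]
FROM    sys.partitions p
JOIN    sys.tables    t ON p.[object_id] = t.[object_id]
WHERE   t.[name] = 'logdata'
ORDER BY p.[partition_number];
```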
Here you have the partition ID, and then you have the node, et cetera. So, as I previously stated, I will not go into detail about how these partitions are structured on the back end. Now, just one important note when it comes to general practice. So, yes, we can create partitions in our table, but please remember not to create too many partitions, because your tables already have the implementation of distributions. Remember, you have 60 distributions in place, and on top of that, you're adding partitions as well. So, as per the Microsoft documentation, for optimal compression and performance of clustered columnstore tables, you should have a minimum of 1 million rows per distribution and partition. So in this chapter, I just wanted to show you how to create a table with partitions. In the next chapter, let's see the benefits of using partitions.
49. Lab – Switching partitions
One of the primary benefits of using partitions is that if your WHERE clause uses the time column on which the partitions are based, it can actually help to direct the query to the correct partition. Assume that in your WHERE clause, you are only working with times that fall within one of these date ranges. Then the query will only be targeted against one partition. So instead of performing an entire table scan for your WHERE clause, it can now be much more efficient and only be directed at a particular partition. But another advantage of partitions is that, let's say in this example, you don't want the values with dates earlier than April.
You can just delete that partition. You can issue delete, insert, and update statements on your data, just like you would with a traditional SQL database. However, because the amount of data in a data warehouse is much greater than in a traditional SQL database, those operations can be much more expensive. That's why you should be very careful when you perform those sorts of operations, and that's why, with partitions, you can issue statements to delete an entire partition instead. So sometimes companies do want to do that. They might want to remove stale data that's very old and no longer required in the data store. So instead of issuing a delete statement where the time column is within a date range, you can just simply delete a partition.
Now I'll show you an example in which we create a new table with partitions and simply switch a partition from our existing table to that new table. So here I am creating a new table; in terms of the partitions, I have a subset of the partitions from the original logdata table. I'm creating this table from logdata with a CREATE TABLE AS SELECT, using SELECT * and the condition 1 = 2. This means that it will only copy the schema structure of the logdata table onto the logdata_new table. No rows will be copied, because this condition evaluates to false. Now, when you create a table with the AS SELECT statement, you have to mention a distribution for the table. If we look at our earlier statement to create the logdata table, we didn't mention any distribution. So, remember what the default distribution will be? It will be round robin. That's why I said the distribution is round robin here.
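The CTAS statement is along these lines (the boundary dates and table name are assumptions based on what we've discussed, with the new table's first boundary starting in May):

```sql
-- Create an empty table with the same schema as logdata using CTAS.
-- The 1 = 2 condition is always false, so only the schema is copied,
-- no rows. CTAS requires an explicit DISTRIBUTION option.
CREATE TABLE [logdata_new]
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    PARTITION ( [Time] RANGE RIGHT FOR VALUES
                ('2021-05-01', '2021-06-01') )
)
AS
SELECT * FROM [logdata] WHERE 1 = 2;
```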
So let's create this table. So I'll hit execute, and there should be no data in this new table. So that's done. Now here I am issuing an ALTER TABLE command. I am saying, "Please alter my original logdata table and switch partition number two." So, for partition number two, if I go on to my diagram, it's all of the values that are greater than or equal to the April boundary and less than the May boundary. Remember that my new table begins with RANGE RIGHT FOR VALUES starting in May. So the first partition of the new table will only accept values that are less than the month of May. And since my original table has a partition bounded by the month of April and the month of May, I can switch partition number two across. So what are we doing? We are just transferring this entire partition from our original logdata table onto our new table. So it means the data remaining in our logdata table should be the other months, and the April data should now belong to the logdata_new table.
So let me try this out. So I'll take this. I'm saying, please switch this onto partition one of the new table. So let me execute this. So this is completed, and if you do a SELECT *, we can now see some rows in place. And here, if you look at the time, it will all be in the month of April. So, just to be sure, if I take our same grouping statement and run it, first against our logdata_new table, we can see that everything is only in the month of April. And if we do the same thing now for our logdata table, you can see that we have the month of March, and then we skip the month of April and go straight to May. So see how easily I have transferred data from one table to another, just by switching the partitions, right? So in this chapter, I wanted to show you one of the other benefits that we have when it comes to working with partitions.
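The switch itself is a metadata-only ALTER TABLE operation, roughly like this (the partition numbers are assumptions that follow the ranges described in the chapter, with the April rows in partition 2 of the source and partition 1 of the target):

```sql
-- Move the April partition from logdata onto logdata_new.
-- Both partitions must cover the same value range for the switch to work.
ALTER TABLE [logdata] SWITCH PARTITION 2 TO [logdata_new] PARTITION 1;

-- Verify: the April rows should now appear only in logdata_new.
SELECT FORMAT([time], 'yyyy-MM') AS [month],
       COUNT(*)                  AS [row_count]
FROM   [logdata_new]
GROUP BY FORMAT([time], 'yyyy-MM');
```

Because no rows are physically copied, this is far cheaper than a DELETE plus INSERT over the same date range.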
50. Indexes in a dedicated SQL pool
Hi, and welcome back. Now in this chapter, I just want to go through the theory when it comes to the indexes that are available for the tables in your dedicated SQL pool. So normally in database systems, when you define your tables, you also have the concept of defining indexes. So an index is like the index of a book. If you want to find something much more quickly, you can refer to the index of the book. Similarly for a table, when you have a query and the database engine needs a faster way of retrieving the data, it can use the index to find where the data is actually located. Again, this is a concept that is implemented in a lot of database systems.
Now, we don't need to go into details on indexes in general; it's only because, when it comes to the tables in the dedicated SQL pool in Azure Synapse, we need some theory in place on the different types of indexes that are available. So normally, your tables will be created automatically with a clustered columnstore index. In a data warehouse system, your tables are actually stored column by column on the underlying physical infrastructure, the underlying physical disks. This makes it much faster when it comes to working with the data. Normally, in database systems like, let's say, Azure SQL Database or any SQL database system, the data is stored row by row. But when you work with a large amount of data in a data warehouse, you typically perform aggregation and analysis, looking at columns of data. That's why the data is stored column by column. So when it comes to an Azure Synapse table, you have the clustered columnstore index that gets created automatically. This sort of index provides the highest level of data compression and the best overall query performance. But there are limitations. You can't have a clustered columnstore index if you have columns of the type varchar(max), nvarchar(max), or varbinary(max). Also remember that it's not ideal anyway to have columns with the max representation of varchar.
So that is also another performance overhead. If you can't avoid those column types, you won't be able to create the table with a clustered columnstore index, but you do have other options. Also, the clustered columnstore index is not ideal for small tables that have less than 60 million rows. So these are some of the points when it comes to the clustered columnstore index. Next is something known as heap tables. Now, if you are temporarily loading data into a table for staging purposes, then you can make use of a heap table. For example, if you want to perform transformations on the data before it is moved into an actual production-based table, then you can consider the use of a heap table. Heap tables are faster to load, and the reads can be taken from the cache. So it makes it much faster when it comes to loading data and reading data. Now, if you want to create a heap-based table, you make use of the WITH (HEAP) option when you are creating the table. Now back to indexes.
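A heap-based staging table can be sketched like this (the table and column names are placeholders):

```sql
-- A heap table for temporarily staging data before transformation.
-- The HEAP option replaces the default clustered columnstore index.
CREATE TABLE [logdata_staging]
(
    [Operationname] VARCHAR(200),
    [Status]        VARCHAR(100),
    [Time]          DATETIME
)
WITH ( HEAP, DISTRIBUTION = ROUND_ROBIN );
```

Heaps skip the columnstore compression work, which is why loads into them tend to be faster.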
Apart from the clustered columnstore index, there are other indexes. If you can't make use of that, you can specifically create a clustered index. The clustered index will be created based on just the columns that you specify as part of the table. So here I am saying, "Please create a clustered index based on the product ID." So, if your queries frequently use the product ID to search for data, you can create a clustered index based on the product ID. Next are non-clustered indexes. When it comes to the clustered index, you can only create one per table. But let's say that you have queries that also make use of another column in the table when searching for data. In that case, you can create a non-clustered index.
So if you want to improve the filtering on other columns, you can create a non-clustered index for those other columns. However, the disadvantage here is that it actually increases table space, because the non-clustered index is like having another index in addition to your table data. When it comes to a clustered index, the data-searching structure is built on top of the table itself; there is no need for an additional index view of the data. When it comes to non-clustered indexing, however, an additional index is required. So when a search is done, it will first refer to the index, and the index will then point to the data in the underlying table. And remember, all of this is very important because you're working with a huge amount of data when it comes to your data warehouse, right? So in this chapter, I just wanted to go through some other important concepts, which are the indexes for your tables in Azure Synapse.
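Sketching both index types together (the Product table and its columns are illustrative placeholders, not tables from our lab):

```sql
-- A table with an explicit clustered index on ProductId, used when a
-- clustered columnstore index isn't suitable. Only one clustered index
-- is allowed per table.
CREATE TABLE [Product]
(
    [ProductId]   INT,
    [ProductName] VARCHAR(200),
    [Category]    VARCHAR(100)
)
WITH ( CLUSTERED INDEX ( [ProductId] ) );

-- An additional non-clustered index to speed up filtering on another
-- column, at the cost of extra table space.
CREATE INDEX [IX_Product_Category] ON [Product] ( [Category] );
```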
51. Quick Note – Modern Data Warehouse Architecture
Welcome back. Now, in this chapter, I just want to give a quick note when it comes to a modern data warehouse architecture. So we have already seen how we can take all of our files and place them as part of a data lake, that is, our Azure Data Lake Gen 2 storage account. You can use this data lake to store all types of files, whether they are in JSON format, Parquet-based format, or CSV format. You can then later on take this data, along with data from your structured data sources, and create tables in a data warehouse.
So if you need to build reports using Power BI from data in the data warehouse, you can take the data from either your structured data sources or even files in your Azure data lake. Now, the data lake is not only meant for taking your data and then publishing it onto a data warehouse; you can also have the data available for your data scientists to perform data science itself, or you can use the data for machine learning. So there's a lot that you can do. Once you have your data in place in the data lake, you can use it for multiple purposes. So in this chapter, I just wanted to give a quick note when it comes to the modern data architecture.
52. Quick Note on what we are taking forward to the next sections
Now, before we actually wrap up, I want to give a quick note when it comes to the next section on Azure Data Factory. Over there, we'll be using a lot of resources that we've already seen or created in this particular section. So for example, I'm going to show you how to build the fact and dimension tables using Azure Data Factory from the tables that we have in our database. So I'm keeping my database servers and my databases in place because I'll be reusing them when it comes to the section on Azure Data Factory.
And I will also be keeping my Synapse workspace in place. I'll also have a dedicated SQL pool set up. So I'll just pause the pool whenever it is not required, and then resume the pool when it is required, because Azure Synapse is something that we need not only in Azure Data Factory; we'll be seeing parts of it in other sections as well. Even when it comes to the usage of Azure Databricks, there'll be one or two scenarios wherein we need to see how we can take data and put it in a dedicated SQL pool. So I'm reusing all of the artefacts that we have created up until now. And when it comes to our containers, I'm also keeping our Azure Data Lake Gen 2 storage account in place. So again, we'll be making use of files such as our Parquet-based files, our Log.csv file, et cetera. So I'm keeping all of this as is. I just want to give the students a note on what we are taking forward to the next section.
53. What about the Spark Pool
Right? So with this, the section on Azure Synapse comes to an end. But that doesn't mean we won't be working with Azure Synapse. In subsequent sections, I will continue to use the dedicated SQL pool wherever possible. In future sections, I'll also be keeping my SQL databases in place. Also, you will notice that in this particular section, I have not covered the Spark pool. As you can see, Azure Synapse provides a variety of compute options. We have the serverless SQL pool; we have the dedicated SQL pool, which we have covered; but we have not covered the Spark pool. Now, the reason I have not covered the Spark pool is because, before I do, I just want to have a quick primer when it comes to Python, Scala, and notebooks. And then we'll move into the Spark pool, because when we look at building notebooks in the Spark pool, we have to look at working with Scala and also with Python. So, before we start working on the Spark pool, I thought we'd go over Python and Scala basics.