DP-203 Data Engineering on Microsoft Azure Topic: Design and implement data storage – Azure Synapse Analytics Part 1

December 19, 2022

17. Lab – Loading data into a table – COPY Command – Parquet

Now in this chapter, I want to show how you can load data that’s based on the park-based files. So in the last chapter, we saw how we could load data from the log CC file. You can do the same thing with your park-based files as well. So remember, we do have park A files that are available in our Azure Data Lake Gent storage account. Now I’ll just go ahead and delete the data from our log table. I won’t go ahead and create our table. The only reason I’ve kept this great table command in place is just for reference; we’re going to use the same table. So I’ll first delete the existing data from our table.

So now over here, it’s saying the permission is denied, and that’s because we haven’t given permissions to our new user. So I said that in order to make all this much simpler, I’ll go on to the tab where we’re logged in as our SQL admin user. Delete all of the data just to confirm. So we don’t have any data in our table. Now I’ll use the same copy command. I have changed the name of my Data Lake Gentoo storage account. Here I’m using my parquet folder, which is in my raw folder, the park directory. And here I’m saying, please take all of the parquet-based files that are available in this particular directory.

When I scroll down here, I’m using the file type as parquet. I’m requesting that you please use the park-based files found in this directory. And here I am showing now how you can use the identity to authorize yourself or to authorize this particular command to take the data from the Data Lake Gen 2 storage account. Here we need to use a shared access signature. So delete the existing one that I have first and let’s create a new shared access signature because I’m not sure if the previous one has expired. So I’ll go on to my shared access signature. I’ll hide this, I’ll choose the Blob service, I’ll scroll down, I’ll generate the shared access signature, I’ll copy the SAS token, I’ll place it over here, and I’ll just remove the question mark. And let’s now issue this command. So this is done. I can see the same number of rows have been affected. Let me get the data from the table, and as expected in this chapter, I just want to show you how you can pick data from parquet-based files.

18. Pausing the Dedicated SQL pool

Now in this chapter, I want to give two notes. So the first note is about pausing your dedicated SQL pool. So, as I previously stated, one of the ways you can actually reduce your compute cost if you are following along is to pause the dedicated SQL Pool. So, if you go to your SQL Pool, or if you go to the new pool here, you have the option to pause your SQL Pool. So once your pool is actually paused, you will not be charged for the computed cost.

So I normally do this. So at the end of the day, I go ahead and pause my pool, and then on the next day, I start the pool again, and then I start working on it. But again, if you’re scared of not having any sort of cost, you can just go ahead and delete the pool altogether. The only thing is that you lose the data in that pool. But you can always go ahead and recreate the data if you have your log CSV file in place, and if you have your other data in place, it’s very easy to go ahead and copy the data once again into your pool. Another thing to keep in mind is that you can use Synapse Studio to execute your commands. But a couple of reasons as to why I’m considering SQL Server Management Studio are that I’m much more familiar with using SQL Server Management Studio. The other thing is, if you want to work with SQL-based authentication, then you need to use SQL Server Management Studio. Here, when you’re actually working with the data, you’re actually logged in as your admin root user.

So with this user, again, you have the ability to work with the underlying data in your dedicated or service SQL pool. But with the help of SQL Management Studio, you can log in either with SQL-based authentication or even with your ad-based authentication. And in the section where it comes to security, I will show you how you can log in with a user based on Azure ad authentication. So Azure, remember, is your identity store in Azure. In Azure Active Directory, you can go ahead and create users, groups, et cetera. And in this section, when it comes to security, at that point in time, I’ll actually go on to Azure Active Directory. I’ll explain Azure Active Directory, and I’ll show you how you can use Azure Active Directory authentication when it comes to Azure Synapse and your dedicated SQL Pool. So just a couple of points before actually moving ahead.

19. Lab – Loading data using PolyBase

Now in this chapter, I’ll show you how you can load data using PolyBase. So far, we’ve looked at the Copy command. Now we look at PolyBase, and in a subsequent chapter, we’ll actually see how to use the Pipeline feature. That’s the data integration feature that’s available with Azure Synapse. We’ll actually use that feature when we’re looking at fact and dimension tables. So for now, we’ll just look at these two operations. One was the copy command, and the other is using PolyBase. Now, here we are again when it comes to the script.

So we don’t need to create the master key for encryption. Again, I’m just giving it here so that we have the complete script in hand. If you want to run this in isolation, if you want to look at the existing database scope credentials, So we’ve already done this before, but if you want to execute the command to see if you already have a database scope credential with this particular name, here you can look at the system view of Sys database underscore scope credentials. So if I copy this, and here I am in the tab as a SQL admin user, let me quickly close the other tabs. So I’ll hit execute. So I can see that I do have the name “Azure Storage Credential.” This is the credential identity. If you want to look at your external data sources, you can issue this command.

So this is another look at Sys’s external underscore data underscore sources. I can also see one of the log lines with highlighted data. Here is my data source, so we already have that in place. Next, we have our file format. So earlier on, we defined the parquet-based file format. So let me look at this view. So you can use this view to see if you have existing external file formats in place. As a result, I have viewoffsexternal underscore file underscore formats. I’ll hit on Execute so I can see that I have my text file format in place and that I have my Park-based file format in place as well. Now I’ll first drop my existing table, which is logdata, because we are going to recreate this table. So I’ll execute this command to delete the table. Now I’m going to create an external table with the name “log data underscore external.” It will have my location, my data source, and my file format. We’ve already seen this before.

So first, I am creating my external table, and if you want, you can confirm this. So, from our logged data underscore external hit on Execute, select Start. So we have the information in place. But remember, this is now an external table. And now we can use the Create table command with a select star that selects all of my rows for my external table. Here I’m mentioning something known as a distribution, and I’m mentioning something known as the cluster index. Please note that I have chapters that are going to explain both of these concepts. For now, I’ll take this command and let’s create a table. So I’ll hit execute. So we have now an external table which pointsonto our external data, but at the same time, we also now have an internal table. So select a star from the log data just to confirm that we have the data in place. I’ll hit execute, and now I can see my data. So using PolyBase, you can actually take the data from an external table and push it onto an internal table in your dedicated SQL pool.

20. Lab – BULK INSERT from Azure Synapse

Now, in this chapter, I want to show you how you can perform the copy operation in Azure Synapse. So we can copy data from our log CC file, which is in our Azure Data Lake Gentoo storage account, onto our dedicated SQL pool and onto a table in the pool itself. What we can do is actually link AzureSynapse to our Azure Data Lake Gen 2 storage account. It brings our storage account much closer to Azure Synapse. Here I can choose “Connect to external data.”

And here we have some options in place. I’ll go with Azure Data Lake Storage Generation 2. And here on out, I’ll give this name as it is. So it’s creating something known as a new Linked Service. And we’ll cover all of these aspects when we go onto Azure Data Factory, because this link service concept is actually something that is actually part of Azure Data Factory. The Link service is just like now—a connection to a data store. So we’re creating a link service, a connection to our Azure Data Lake Gen 2 storage account. The authentication method is the account key. Here I’ll choose my subscription, and I’ll choose my storage account name. I’ll choose my storage account name, “DataLake 2000,” and I’ll hit on “Create.” Now, before we can copy the data, when it comes to fetching the data itself, we have to give some permissions specifically to our storage account. Normally, the Azure admin user is used. We have seen that we can browse for our data either from the Azure Storage Explorer or from the Azure Portal itself.

However, for certain other services, you must explicitly grant permissions for either reading or writing to the storage account. The same thing we have to do over here So I need to move on to Access Control. So now I need to add a role even for my Azure admin account user to work with the data, which is in our Data Lake Gen 2 storage account. Please keep in mind that I will go over all of this in detail in the security section of this course. At this point in time, let’s just add the required permissions. Here I’ll click on “Add” and add a role assignment. I’m going to select the role of a Blob data contributor. So this allows my user to both read and write data. Here. I’ll choose my Azure Admin account. So I’ll choose the first. I’ll just click on it, and then I’ll hit Save. When you return to your workspace after adding a role, it usually only takes a couple of minutes to reflect. If you go on to the Linked tab, Here you will see what the primary data LakeGen Two storage account is that is attached to your workspace. Remember when we created the Azure Synapse workspace? At that point in time, we had to mention the Azure Data Lake Gen 2 storage account. But now, since we have created or linked another storage account, I can click on the context menu and hit Refresh. And here I can see my Data Lake 2000 account in place. Let me just collapse this. If I click now on my data container, I can see my files.

If I go onto my log CSV file, if I right-click, I can do a preview of the data so I can see it. Over here, I’ll hit OK, I’ll right-click on the new SQL script file, and I’ll do a bulk load. Here we have the options for the text-based file and for the PDF-based file. If I scroll down, I can specify that the first row has two. This is to ignore the header row here and continue. Yeah, it will create a new sequel script. So first, I need to choose an existing table. That’s our log data table. I’ll open up the script. If you want, you can configure the column mapping, but I’ll directly open up the script. I’ll just collapse this. I’ll hide the properties, and here it is generating the copy statement for you. So I’ll just run this copy statement. So this is done. Now let’s select the top 100 rows from our table and see my data in place. The main difference here is that we have now created a linked service on our Data Lake GenTwo storage account, and we can now issue the copy command from there.

21. My own experience

Hi, and welcome back. Now in this chapter, I just want to share a couple of my experiences when it came to working with both the external tables and the dedicated SQL pool. So in the beginning, when I was also working with external tables and trying to pull data from my log CC file for my parquet-based files, I did find some issues. So understanding these various concepts on how to pull your data takes time. And the main reason for this is because your source data, your values, could be in a different format. The destination, as I mentioned, could have different data types that would not be compatible with the data that you have in your source. So there are actually a lot of things that can go wrong when you are trying to, say, create an external table or even load data from your source.

So, even when creating the table, it takes a lot of R&D to figure out what is the best way to either create an external table or load data from an external data source. So initially I used to get a lot of truncation errors, and what do I mean by that? And in the error message, you’ll not even find out whether there’s any sort of truncation that is actually happening; it will just give some sort of generic error. It’s only after a lot of digging and researching that I will find out that this is due to a truncation error. And what do I mean by a truncation error? So, let’s say I’ve defined a column based on barca, and I have a string that will be stored in this column. Let’s say if I specify the width as only 20, If my source contains a longer string that cannot fit in this column, it will throw an error. This is known as a truncation error. And PolyBase might not be able to give you the best message when it comes to the truncation of data. Again, it depends from scenario to scenario.

I said this is based on my working with both external tables and fetching data from an external source. So initially, what I had done was just to ensure that data could actually be populated, because this is only one issue; you could have many issues.So what I should do is that, in the beginning, I should define the max as the barca. So that means that any string could potentially fit in this particular column. That was just my initial step. However, you should never define a column based on barca max, especially in your dedicated SQL pools. That’s because this is not a recommendation when it comes to performance. So you should find the ideal width when it comes to what you should define for your columns. Now, I stated that this is not a sequel because I will not go into detail about data types and how to define them.

I’m just trying to share all of my personal experiences. So I said at the start that I should have defined it as Maxso to ensure that my data was coming in. Because I stated that there are numerous things that can go wrong, Firstly, when it comes to creating your database code credential, there could be something wrong over here when it comes to your data source. So over here, I’m actually picking up data from a Data Lake Gen 2 storage account. That’s why I’m using this particular protocol. If you are taking data from, say, Azure Blob storage, this location would change, and the protocol would change. So that is also a potential issue that could actually happen.

Then there’s the file format. So in the CSE file format, remember that I said to take the first row from the second row and to ignore the first row, which had header information. If your file didn’t have header information, you would actually be missing one row. If your file didn’t have header information, you don’t need to mention that particular part when it comes to creating the external file format, right? So there are a lot of steps, and things can go wrong at each step. Now, obviously, next is: what are the allowed data types? So again, this depends on your RND. You have to go through the data types of your destination. So, in the delegate SQL Pool, you look at the available data types and try to match them to your source. So, in my labs, I actually give you the files that I’ve been working with, and I said that I’m attaching all of these files as resources to this chapter.

Now, you can use any file or any data source that you have, but if you’re working with your own data, all of this is something that you need to consider. You could spend some time making sure everything is perfect before putting your data into the dedicated SQL Pool. And if you do encounter errors, go to forums like Stack Exchange to see if other people have encountered the same issue and to try to find a solution so you can use whatever data set you have. The reason I’m using my own data sets is because you could get a lot of open-source data sets, but often there could be a change saying that you cannot use those data sets anymore. So I didn’t want to take a chance. That’s why I’m using my own data sets, which you are free to use. There is no problem. So you can download my data sets, play with them, and understand what goes into working with data.

22. Designing a data warehouse

Hi, and welcome back. So far, we have seen how to load data into a table in our dedicated SQLpool or how to build external tables. Now we are going to look at a pattern when it comes to building tables in a sequel data warehouse, and that pattern is when it comes to building tables such as your fact tables and something known as dimension tables. This is based on the pattern that’s either known as the star schema or the snowflake schema. So before we actually dwell on this, if you look at how the tables are arranged when it comes to a traditional SQL database, So remember, over here you can have primary keys that are defined for a particular table.

This helps to uniquely identify the rows within the table, and then you can define relationships between one table and another with the help of foreign keys. When it comes to a traditional SQL database system, you normally follow the pattern of database normalization, wherein you ensure that duplicate data is not present in your tables. And you structure your tables in such a way that you develop relationships between the individual tables. This is in a normal relational database that is used for your OLTP workloads and your transactional workloads. Now, when it comes to the analysis of your data and when it comes to building a SQL data warehouse, you would develop or have your tables in the form of fact tables and dimension tables.

So, what exactly are fact tables? Fact tables actually store information or measurements or metric units that actually correspond to your facts. So, for example, you can have a sales table; that’s a fact table. This actually records all the sales that have been made. So let’s say for the Umi platform, they have a separate sales table wherein all the purchases or sales of the courses have been made. These are your actual facts—what has actually happened. As a result, the sales figures reflect actual sales. When you now want to perform analysis on the data in the fact table in your SQL data warehouse, you supplement this analysis with the help of dimension tables. And don’t worry, we will actually see a simple example of how to build a fact table and how to build dimension tables.

So dimension tables help provide some sort of context for the facts that are being presented. So, for example, let’s say you want to know the types of products that are being sold. So you might have a separate dimension table known as “Dim Product” that would have all of the product information in your sales fact table. You might have something called a product key that corresponds to the product key available in the dim product table. When you now perform a join on the sales table and your dim product table, you can then have a view of those products that have been sold. Or you might have another table for the customers that’s known as a “dim customer table.” And again, in your fact table, you might have another key known as a customer key. You can perform another join on the dim customer table.

Now, for those who are not aware of fact and dimension tables, again, you might be thinking this is somewhat similar to a traditional SQL database. Even over there, we have the concept of having primary keys and then foreign keys and then having joins across these different tables. But there are some subtle differences when it comes to building your fact and dimension tables. And as we go through the duration of creating these tables, you will understand this a little bit better. And another important point is that when it comes to a SQL data warehouse in the dedicated SQL pool in AzureSynapse, there is no concept of a foreign key. You can only perform joins on your table. So this is another important concept. In a traditional SQL database, you can create foreign keys, but you can’t do this in the SQL data warehouse when it comes to the dedicated SQL pool in Azure Synapse. So what goes into actually building your fact table?

So I said, “You build your fact table.” Let’s say that in this case, if you have a sales table in your traditional SQL database, you can actually build facts around the sales data. Now in your fact table, the sales data will keep getting added. And what do I mean by this? So let’s say that you have your application that is interacting with a traditional SQL database, which holds all of your transactional data. Let’s say you have very simple orders and sales data that are being held in a table in the transactional system. So again, let’s say if you have the Udma platform, they are storing these sales data in a separate table. This is in your transactional data system. Now, when it comes to your OLAP analysis and your ecosystem, So we are talking now about your data warehouse, and you will have a similar table. The sales data table has a fact table in the SQL data warehouse.

So this contains facts about the sales that are being made. Now, when it comes to your transactional data in your OLTP system, So you might have transactions that are present for, say, the year 2021 and maybe for the year 2020. But in your fact table in your AntiguaSQL data warehouse, you might have transactions for all of the previous years as well. So all of the transactions in your SQL Data Warehouse article just keep getting added to the sales fact table. And the reason for this is because, as I mentioned before, in your SQL data warehouse you will have terabytes or petabytes of data, and you will be performing analysis on all of this data. That’s why you will have this historical data as well. So normally in your transactional data system, you’ll have only the most recent data, whereas in your SQL data warehouse, you’ll have historical data as well because you want to perform analysis. So let’s say you want to perform year-on-year analysis on how the sales have been progressing. You can use all the data from the previous years as well. And then I mentioned the dimension tables.

So to supplement your analysis of the fact table, you might then take or create dimension tables based on the tables that you already have in your transactional data store. So again, let’s say that you have a SQL database, and you have your application that is actually connecting onto the SQL database. I mentioned that you could have a sales data table in your SQL database, and you might have a similar type of fact table in your SQL data warehouse. Then you might also have a course table, a customer table, and an instructor table in your SQL database system. And using these tables, you can go ahead and build your dimension tables in the SQL data warehouse. You could use those dimension tables to answer questions like which courses sold the most during the year, which regions have the most customers, or who the most popular instructors are. So I said the dimension tables just helped to supplement the analysis of the facts that are in the fact table. And then you come on to this particular structure that is known as the star schema. You have your fact table in the middle, and then you have your dimension tables that actually form some sort of star. You could add another dimension table here.

And this sort of pattern representation of your fact and dimension table is known as the “star schema.” There is another data pattern known as the snowflake schema. We’ll go through that a little bit later on. We’ll actually be focusing more on the star system when it comes to this particular exam, right? So in this chapter, we want to start with an overview of the fact and dimension tables. In subsequent chapters, we’ll see how to build the fact and dimension tables.

23. More on dimension tables

I discussed fact and dimension tables in the previous chapter. In this chapter, I just want to go through dimension tables a little bit more. So you will construct your fact and dimension tables based on information that you already have. So you might have, let’s say, your sales data, your product data, and your customer data all in your SQL database.

And I mentioned that you might have an application that interacts with that SQL database. The data will then be continuously transferred from the SQL data store to your SQL data warehouse for analysis. Now, when it comes to building your dimension table, remember that your dimension table gives an extra aspect to how you would analyse the data in your fact table. Now, it’s not necessary that the data for the dimension table will come only from the SQL database; it can actually come from different source systems. So you might have some users or some people in your company who are managing the products via CSV files. So you could have that information in CSV format. And sometimes, in order to build your dimension table, you might be taking information from different source systems.

So you might have one department that manages the products via CSV files, and you might have another department that manages similar types of products, let’s say via a SQL database. I mean, this is possible. So you might have data that’s coming from different source systems. So now let’s say that you have two source systems, Source System A and Source System B. Now, just for the easy understanding of dimension tables, I am showing both systems’ tables over here. So each table has a product ID and a product name. The difference is that the products are based on course names, whereas here the products are based on books. So let’s say that the company is selling both courses and books based on those courses as well.

So one could be video courses, and the other could be books based on those courses themselves. And let’s say you have one department that manages the products based on the courses and another department that manages the products based on the books. Now, both of these tables have the same product ID but different product names. And let’s say that you now need to build a dimension table based on the data from these different source systems. So, if I only had the product ID, the product name, and the product price, there will be a problem when you start building your final report, especially if it’s based on the product ID.

So in your source system, you might have the product ID, which has helped uniquely identify that product within that particular source system. But now, when it comes to your product dimension table, it’s becoming difficult to identify the product based on just the product ID. So what you will normally do is make this a practice. So when you build your dimension table, you will be creating an extra column that will help uniquely identify the row within that dimension table. This is normally known as the surrogate key. This helps to uniquely identify each row in the table. Now, when it comes to Azure Synapse, you can just easily create something known as an “identity column” to generate this unique ID. So this unique ID can just be a simple number, a simple integer, to just represent how to uniquely identify the row in the dimension table. So this is one of the important concepts I actually want to introduce when it comes to building your dimension table.

Now, some other important points when it comes to building your dimension table So try not to have null values when it comes to the properties in your dimension table, because remember, you’re using a dimension table to perform analysis on the details in your fact table. So having null values will not give the desired results when you’re using the reporting tools. Also, you can’t try to replace null values with some default values, but again, this depends on the type of reports that you are trying to build. So again, it’s up to you to decide what you should do with those rows that have null values, right? So in this chapter, I just want to go through some extra concepts when it comes to building your dimension tables.

24. Lab – Building a data warehouse – Setting up the database

Now, in order to see an example of how to build fact and dimension tables, I’m again going to be looking at using the Adventure Works sample database. So if you don’t have any data to work with, you can actually use this Adventure Works sample database. If you have your own data, you can actually use the same concepts to build the fact and dimension tables.

Now, in the earlier chapter, we saw how to look at the sample data that is available in the Adventure Works database. So if I go on to the Object Explorer, we had already done this earlier on, but this is a very light version of the Adventure Works database, and this won’t be enough to actually build our fact and dimension tables. I want more. I’d like to show you how to properly build your fact and dimension tables to some extent. Obviously, there is a lot that actually goes into building your dimension tables, but just to get an idea, I want to have more data in place before building the fact and dimension tables.

So for that, I want to use a full-blown version of the Adventure Works database. All of these backup files are now useless for creating a database on Azure SQL. So what I’ve done is that, in the background, I’ve created an instance of this plain SQL server on a virtual machine. I created a database based on this backup file, and then I created something known as a backpack file that can now be used with Azure SQL so that we can have all of the data in place. So you will actually find this backpack file attached as a resource to this chapter. So you can go ahead and download that particular file. If it is zipped, you can just go ahead and unzip the contents of the file. So in my temp directory I have the backpack file in place, and we are now going to be using this to create a new database.

So in Azure, I need to go onto my SQL database server, so I can go on to any database. I can click on the server name, and here I can go ahead and choose the option of importing a database. Here I can click on “Select Backup.” So now it’s asking me to select the backup from one of my storage accounts. So I’ll do something like open all resources in a new tab. Let’s go over to our data lake. Gen Two storage account So I’ll go on to my Data Lake 2000 account, I’ll go on to my containers, I’ll go on to my data container, I’ll just create a new directory, I’ll click on Save, I’ll go on to the SQL directory, and I’ll hit on Upload. So I’ll upload that data backpack file, and I’ll hit Upload. So once this is done, this is Data Lake 2000. Here I’ll choose Data Lake 2000. I’ll choose my data container, and I’ll choose a SQL directory. I’ll choose the backpack file. I’ll hit select here, and I’ll configure the database. So again, we can choose the standard DTU model. Again, in terms of the data size, I can choose one gigabyte I can apply here.

I can give a database name and my password for the SQL Server, and let me go ahead and hit OK. Now, this might take around five to ten minutes. So what is it going to do? It’s going to create a brand new database on our database server, and then it’s going to import all of the data from that backpack file. Then we’ll have more data to work with when it comes to building our fact and dimension tables. So let’s come back once we have the database in place. Now, once I’ve received notification that a request to import the database has been submitted, I’ll proceed to SQL Server management studio. I’ll go on to my SQL database server for my databases. I’ll right-click and hit on “refresh.” And here I can see my Adventure Works database. If I go onto my tables, I can now see a lot of tables in place. So we’re just going to be using three or four tables. But then I wanted more information to be in place to build our fact and dimension tables, right? So let’s put it in this chapter, where we now have our data.

25. Lab – Building a Fact Table

So let me go on to the Sales Order Detail table. I’ll right-click; let me return the rows. I’ll also go on to my sales order header table. I’ll right-click, and I’ll select the rows. So in my sales order header table, I have information such as the sales order ID, order date, due date, et cetera. So I have a lot of information over here. So I’m just taking a very small subset of this information to build my actual fact table. If I go on to my sales order detail table, I have some information over here as well. Here. I have the product ID. So I’m going to be mapping this product ID onto our product dimension table that will be created a little bit later on. And from my Sales Order Header table, I actually should be having something known as a customer ID, and we’ll be mapping this onto the dimension customer table that we’ll create a little bit later on. Now, your fact table could just contain facts from one of your tables in your SQL database in your transaction system. Since this is already a built-in database,

I am joining the information from two tables to build my Fact table. Now, the first thing I am doing is creating a view. So I’m creating a view based on the information from both tables. So I’m joining my sales order detail table with a left join. That means I am taking all of the rows in my Sales Order Detail table and only taking the matching rows in the Sales Order header table. I’m doing a match based on the SalesOrder ID that is available in both tables. And I’m taking the information such as the product ID, the sales order ID, the order quantity, the unit price, the order date, the customer ID, and the tax amount. And then I’m using the select in Cross to create a new table known as Sales Fact from this particular view. Now, please note that in this particular demo, I’m actually creating my Fact table in my SQL database itself. Now, the right approach is to take your data and build your fact table directly in Azure Synapse in your SQL Data Warehouse. And I’ll actually show how to do this when we look at Azure Data Factory.

The reason I am creating a view in my SQL database and building the fact table is that I first want to show you how I’m building my actual fact table so that when we look at Azure Data Factory, you’ll have a better idea of what we have done so far. Because if I went ahead and built a pipeline to transfer data onto a table in AzureSynapse for my Fact table, you might not understand what I am trying to achieve. So that’s why what I’m doing initially is I’m going to be building my fact and dimension tables in the SQL database itself. And then I’ll directly copy these tables onto Azure Synapse. And I said, “Don’t worry.” In the section in Azure Data Factory, I’m going to show you how you can build a pipeline to take the joined data from these two tables and then build your table in Azure Synapse. So normally, you will not build your fact and dimension tables directly in the transaction database system. As I previously stated, I am creating fact and dimension tables over here for the sake of simplicity and understanding. So first I am creating a view based on the information in both tables, and then I can create a table based on that view. So let me first create my view. I’ll hit Execute if you want, but you can’t see the data in that view. I’ll hit execute. So you’re seeing the data from both tables. So we have our product ID. And we have our customer ID.

Now I’ll create a table based on that view. So this is also done. We have a number of rows in place. So, if I select Star From and select my Sales Fact table, I should be able to see all of my information in one place, right? So at this point in time, I have just built a simple fact table. So the fact table has information about the product (the product ID), and the information about the actual product will actually be in our dimension product table. Then we’ll also have, as I mentioned before, a dimension table based on our customer. And then some facts I just have are the order quantity, the unit price, the order date, and the tax amount. You can have multiple columns in your fact table and any amount of information. I said, just to make this more simple, just to have a simple fact table, I’ve just chosen a certain number of columns for my fact table.

26. Lab – Building a dimension table

We had to create our fact table in the final chapter. Now let’s go ahead and create our dimension tables. So I have the code over here. I’ll go on to SQL Server Management Studio. Actually, let me just delete the open tabs and continue to my database. Right-click “new query,” and I’ll paste in my code over here. So, firstly, I’m building my Dimension customer table.

Now, I’m building my Dimension Customer table based on the Sales Customer table that is available in the Adventure Works sample database. I’m doing a left join onto the store table that is part of the sales schema. And here the join is based on the store ID being equal to the business entity ID. And I’m ensuring that I don’t take any null values. So since my business entity ID and my store ID are very important, I want to ensure that I only take the rows where the business entity ID is not equal to null. Now, the way that I’ve actually formulated this statement is that I first try to understand how I can get information about the customer. So again, this Adventure Works database is something that I’m not familiar with. So I took time to understand: what are the columns? What is the data in all of these tables?

So, based on how I get my customer information right, I built my select query and decided which columns I want. So, you can have many columns as part of your dimension table, but again, I’m keeping it simple. Then I noticed that when I was building my view, I could see null values when it came to the business entity ID. And this is something that I do not want. So when you build your dimension table, it’s not a good idea to have any sort of null value because remember, you’re using your dimension tables when it comes to the analysis of your data. So it’s not a good practise to have null values in your dimension table. So here I am actually ensuring that we don’t take any null values, and this is something that we also consider when we are transferring data using Azure Data Factory.

So for now, I’m building this customer view. So the same concept: I’m building everything in my SQL database itself, and then I’m building the Customer table based on this Customer view. So again, you can go ahead and look at the information in both the customer table and the store table. So I’ll execute this query so that this is in place. Then I’ll load data into my customer table, so that is also done. So I have some rows in place so we can look at the information in our Dimension Customer table. So I have the customer ID, the store ID, and the business entity ID. And what is the store name? Now, next, I want to build myproduct table, my Dimension product table. Now, in order to get the information that I was looking for, I had to do a join with multiple tables. So, my main product table, which has all of my product information, along with the product model table, which has some information about the model of the product, and also the subcategory table to get the category of the product itself So, I wanted to get all of this information, and here’s the interesting part: I am now taking information from more than two tables.

So I’m doing a left join now on two tables onto my primary product table. So remember all my rows in the product table and all the matching rows in the product model table and the product subcategory table. And then I’m taking my columns of interest, and I’m making sure to not take any product model ID where it is null. So again, when I was looking at the information, I could see some product models where the product model ID was null. So again, in a similar process, I’ll build my product view, then next I’ll build a product dimension table, and then we can select Star from my dimension product. Yeah, I can see my product ID, my product model ID, and the product subcategory ID. I can see the product name, the product model name, and the product subcategory name. Now, in case at any point in time you want to drop the view or the table, I’ve just added these statements over here, right? So now we’ve also built our dimension tables.

Uncategorized

Related posts:

Leave a Reply Cancel reply