DP-203 Data Engineering on Microsoft Azure Topic: Design and Develop Data Processing

DP-203 Data Engineering on Microsoft Azure Topic: Design and Develop Data Processing – Azure Data Factory Part 3

December 19, 2022

11. Deleting artefacts in Azure Data Factory

Now, in this chapter, I just want to show you how you can delete your data sets, your pipelines, et cetera. So we might have a lot of pipelines and a lot of data sets. It might get a little bit too cluttered, and you might want to go ahead and delete them. Now, remember that you are not charged, as per my understanding, for having a data set in place because this is the view of your data. You’re also not charged for the pipeline. You are only charged when you run the pipeline, right?

So whenever a pipeline run occurs, or whenever there is any sort of data transfer or data transformation that requires compute infrastructure that is required for performing that process, then you are charged. You are not charged for just having the pipelines in place and for having the data sets in place, but from our perspective, you should keep everything clean. So I’ll delete the data sets and the pipelines, because it’s very easy to go ahead and recreate your data sets and pipelines. So for that, I’ll first close whatever windows I have opened. Next, I’ll go on to each dataset and choose the delete option. So first, I’ll delete all of my data sets and my pipelines. Now, please note that our linkages will still be in place. So if you go on to the Manage section here, you can see all of the linked services. So you have your Link service on your Azure DLX storage account, your SQL database, and Azure Synapse. So this still stays in place.

12. Mapping Data Flow

Now we are going to see chapters on mapping data flows. I just have a small slide presentation on mapping data flows because I want to show more clearly how you can use this feature. So far, we’ve seen very simple copy data activity when you want to perform transformation on your data. There are other activities that are also available in your pipeline. But when you want to create custom transformations, you’ll actually use mapping data flows. And mapping data flows gives you a visual way of creating this transformation process, so you can visually create your data transformations in Azure Data Factory with the help of mapping data flows.

Here, you can write the required transformation logic without actually writing any code whatsoever. The data flows are then run on Apache Spark clusters. So this is a little different from what we have been seeing so far when it comes to the Azure integration runtime. So now your data flows will run on separate Apache Spark clusters here, as your data factory will handle all of the transformations in the data flow. Now there is also something known as a “Debug Mode.” Now, the debug mode will actually help you see the results of each transformation in the mapping data flow itself. And I’ll show you why using debug mode is advantageous. In debug mode, the data flow is executed interactively on a Spark cluster. Now, in the debug mode, you’ll be charged on an hourly basis while the cluster is actually active.

I’ll actually go on to the pricing page for Azure Data Factory. Let’s look at the pricing when it comes to mapping data flows. So, again, we are on the pricing page for Azure Data Factory. So we had looked at this table, but when we wanted to look at mapping data flows, we had to scroll down. So here we have data flow, execution, and debugging. So now, when it comes to mapping data flows, since transformation is required, you’ll have the proper compute infrastructure in place that will be used for running your transformation. Here you can see that you are charged based on the cluster execution and debugging time per virtual core per R. Now, the minimum cluster size to run a data flow is eight virtual cores. Because, see, the assumption is that when you’re looking at large organizations that want to perform transformations, you’re looking at performing transformations on a lot of data. And if you want to get the desired results quickly, it’s ideal to spread that load over a number of virtual cores so that the transformation process can actually happen in parallel. So this is very important when it comes to the mapping data flow.

Now, here are the different compute options that are available, and what is the cost per virtual computer? So remember, this is the cost per R, right? Your mapping data flow might just take, say, four to five minutes. So, here for you, as an example, it could be 00:19, three into eight, right? That’s the number of virtual courses, and ideally, it should be the amount of time that is actually consumed when it comes to the mapping data flow. Now I understand why costing is so important for every student. That’s why I am telling you this upfront. Now, I am doing labs wherein I want to show you different examples of using the mapping data flow. But before that, I want to be very clear when it comes to the costing app aspect and working with the mapping data flow.

13. Lab – Mapping Data Flow – Fact Table

Now in this chapter, I want to show you how to use the mapping data flow to generate our sales fact table based on the data that we have in our Azure SQL database. Yes, we saw how we could use the query feature in the Copy Data Tool activity to take a query that we saw over here and use it to copy data from our source to our destination. But I want to show how we can achieve the same thing when it comes to the mapping data flow. We’re going to do the same thing even when building the dimension tables. So the advantage of mapping data flows is that, at a later point in time, you can add flexibility to your entire flow. Remember, with a query you are restricted to hard-coding the query itself, whereas in the flow you can add various other aspects to design your data flow.

So let’s go ahead and see how to create a mapping data flow to move data from the Azure SQL Database onto Azure Synapse to build our Sales Fact table. And here we are going to base this on our query, which is based on the query that is used for building a sales fact table. So in the Authors section of Azure Data Factory, under Data Flows, I’m going to create a new data flow. I’ll just hide this and give it a name. Now here is the first thing I will do: add my source. Now, if you look at our query, the data is being taken from the Sales Order Detail table and then performing a left join on the Sales Order header table. So I need to add multiple sources over here. So first I’ll click on “Add Source.” I’ll close this and give the name of the source.

I’ll copy this, just give the stream name, and I’ll create a new data set. I’ll choose to continue here because your SQL database provides the option. Now we’re going to give the data set a name. So I’ll put the data set under “sales order detail.” I can put a DT, and here I can choose my linked service, which is your SQL database. Let me search for the sales order detail table.

We’ll import the schema from the connection, which I’ll hit OK on, so in the first part of our data flow, we have mentioned our source. It has taken all of the columns from the Sales Order Detail table. Here you can see the various source options. So we’re taking everything off the table. You can also get it from a query as well. If you look at the projection here, you can see all of the columns along with the type. So, we have our first source in place. I need to now define my next source. I’ll now take the sales order header’s name. I’ll just mention it here with the stream. Next, I’ll create a new data set.

I’ll go with Azure SQL Database once more. I’ll hit “continue.” Here I’ll give the name of the data set. Again. I’ll choose a MySQL database. I’ll choose my table name for that sales order header. I’ll hit OK, and this table has 26 total columns. Now I want to perform a join between these two tables so I can hit on this plus symbol to add an activity to this particular data flow. So I’ll click on the plus symbol, and here you can see the number of operations that you can actually perform. Now, on your data, I’ll choose the first, which is the join. So here we can join multiple tables. Here I can give a name for the join, and I can say, “Please join the Sales order detail stream with the Sales order header stream.” If I scroll down, I can choose the type of join. So I want to have a left outer join so that we have all of the rows from the sales order detail table. I’ll scroll down, and in the joint conditions, I’ll just move this on top. Now, I want to join what in the joint conditions. So if you go back here, I’m trying to join the sales order ID. So here, I’ll just put the sales order ID. I’ll hit enter if I scroll on to the right. I’ll also create a sales order ID.

I’ll hit “Enter now” in your browser. So here you should be able to see all of the columns in the particular stream, but since I already know what column I want, I can just paste it over here. So here we are saying that please now perform a join between two tables based on the sales order ID. So we now have a combined stream over here, but we have 37 total columns. It’s now time to add a sync. So the sync represents our destination. Where do we want to take all of this data? So I’ll choose the sync here, and I’ll give the name so the incoming stream is the sales join. Here I’ll create a new data set. This time I’ll choose Azure Synapse. I’ll hit “continue.” Now this will be a sales fact table in AzureSynapse. I’ll choose my link service as Azure Synapse, and here I can choose our existing Sales Fact table. Let me ensure that I don’t have any data in my sales fact table. I’ll hit Execute to delete all of the rows so that I don’t have anything. in the sales fact table. I’ll hit execute. I don’t have anything. So here I’ll hit OK.

Now here you can see that we have 34 columns in total, but what I’ll do is that I’ll go on to the mapping section. So here it’s doing an automatic mapping, but I’ll just switch this so that we can see all of the columns that are being mapped. Now, as soon as I remove the auto-mapping here, we can see that the number of columns has now been reduced. So now it has detected that these are the output columns that we have in our sync in Azure Synapse. The only thing left to do is to decide on a sales order ID. because the sales order ID is present in both of our databases. So I can choose the one in the sales order detail stream, right? And we also have this in place. So now we have our data flow mapping in place. So, what are we doing? We are first taking the data from our sales order detail table, then from the sales order header table; this is in our SQL database. Then, on both of our tables, we perform a left-outer join. So we’ll have a combined data set over here, and then we are transferring it onto Azure Synapse. Remember that the data flow will execute all of this on Apache Spark clusters.

So this entire join activity—taking all of the data and then moving it on to the sync—will actually happen on Apache Spark. Now, yes, all of this could have been done via our query using the Copy data activity that Hierarch mentioned before. But in mapping data flow, I said you could add more flexibility, as you have seen over here in terms of a proper flow. Because if you want to add something in between to work with your sales fact data, you can do so. So, with this in mind, let’s go ahead and publish everything. So now we are publishing our data sets and our data flow. When I’m finished, I’ll hit the publish button. So I’ll expand this. So we have a mapping data flow in place. Now, how do we execute this? Well, for that, we have to create a pipeline. So I have to create a new pipeline. I’ll just close this here. I’ll give the pipeline a name here in the move and transform sections, and I’ll now move the data flow. So now one of the advantages of having a separate data flow is that you can have multiple activities as part of your pipeline over here.

So you could have one copy of the data activity, then your data flow activity to perform something, and then another copy activity. So you can keep on extending the activities in your pipeline over here. So, we now only have data flow activity. So here I’ll give a name for the data flow. This is just for the activity itself. Here in the settings, I need to choose my sales fact data flow. We just created this earlier on, so now when it comes to computing, So remember, in the pricing, as I told you, these are the compute options that are in place. So you get four cores by default, and this is the minimum. You can’t choose anything less than this. So it’s four cores. Now, because we’re moving this to AzureSynapse, we’ll need a staging service. So I’ll again choose Azure Data Lake Storage. Let me go ahead and just hide this. I’ll browse for my Synapse container. Hit OK; we have everything in place. Nothing in parameters is required.

Let me validate everything. Everything is fine. I’ll press the “Publish all” button. It will publish a pipeline once you have this in place. I’ll now trigger my pipeline. I’ll hit OK; I can then go on to the monitor section. Now, this pipeline will take time to run. It will not be as fast as the other activities, which are just used for copying data. And that’s because this activity first needs time to build our Apache Spark clusters. Once it has the clusters in place, it will run our activity on those clusters. So it takes time, right? So let’s come back once this is complete. Now, once the pipeline is complete, you can see it took around five minutes. Now, if I go on to my Sales Fact table and let me execute this so I can see that I have my data in place, I know that this is a longer version of ensuring that you have data in your Sales Fact table. But normally, when you want to build data at your destination and perform any sort of transformation, it’s always good to have a data flow in place.

14. Lab – Mapping Data Flow – Dimension Table – DimCustomer

Let’s take a look at how we can use the mapping-data flow feature to create our dimension tables now. Now, when it comes to creating our dimension tables, if you look at the Customer View table, we have the left join. So we already know how to combine two tables. It will be two data sets. But here we have the additional wear condition wherein we want to ensure that we eliminate the rows where the business NTID is not null. So this will be an additional step. So we need to first get our two tables and perform a join. So I’ll go on to the author section. Here I’ll create a new data flow. I’ll give the name; I’ll just hide this so that we have a better view and also hide the properties. So we need to add our source. So we have the customer table. So I’ll put this in the customer stream. I’ll choose a new data set. I’ll choose Azure.SQL.

I’ll hit “continue.” I’ll give the name of our data set. I’ll choose our Link service. I’ll choose the table that saves the customer money. I’ll hit okay, so we have that in place, and I’ll add my other source. The Store table is the other source. I’ll put it in the stream. I’ll create a new data set. I’ll choose Azure SQL Database, hit Continue, give the same name for the table, and give a name for the data set. I’ll choose my link service again, choose Sales Store, and then hit OK, so we have two streams in place. I’ll make a join. So I’ll choose the join condition. Here I’ll let the customer join the store. I’ll choose the right stream, which has the store stream.

I’ll just move this up. I’ll perform a left outer join. Now, in the customer stream column, we want to ensure that we take the store ID. So here. I need to find the store ID. And on the store counter. I need to use the business entity. ID. So I can enter the business entry ID here and press Enter. Now here, note that when it comes to the joint conditions, you have something known as an expression builder. So there are certain fields where you can actually open up this expression builder and create an expression that will be evaluated at runtime. And we’ll actually see some scenarios where we can actually use these expressions. Not right now, but I just want to point this out to you here in the Expression Builder. So we have different functions, and if we look at the input schema here, you can see all of the fields.

So we have our store ID over here. So I just want to introduce this concept of the Expression Builder at this point in time. So we are done with our joint. The select step is the next step I’m going to take. I want to now choose only the columns that I want from the 13 columns that I have over here. Now, the reason I am having this select step is because I want to perform an additional step of removing the rows that have null values. So I first want to select my columns, perform the additional step of removing the rows that have null values, and then put it on my sync. In the earlier chapter, when we looked at the mapping data flow for our sales fact table, there, when defining the sync directly, we had done the auto mapping, wherein we could remove the columns that were not required. But here in this flow, I want to show the select step, wherein now you can actually select which columns are of interest. So I’ll put select columns here. So my incoming stream is the Customer Join Store. Now, here, I can decide what columns I require, and you can also give a different name for the column as well. Now, what are the columns that we require? So, we need the customer ID, the store ID, and the business ID.

So we need these particular columns and the name. So let’s look at the customer ID and the store ID. So we have the customer ID. and we have the store ID. So we don’t need the person’s ID. So I can delete this; we don’t need the territory ID, we don’t need the account number, the row is good, we don’t need the modified date, we do need the business entity ID, and we do need the name. So remove the other columns. So now we only have our columns of interest. Now I’m going to select the alter row operation. So with this, we can actually alter the rows that we want based on certain conditions. So first I’ll give you a name, right? So remove nulls. So remember what’s happening. Over here, we have our entire data set, which is a combination of these two data sets. We are now limiting ourselves to only four columns. So now for this particular step, we have the entire data set, which contains four columns, but it also contains the rows that have null values, right? So we want to now remove those null values.

So, in the alter row conditions, I’ll say delete, which means delete the rows if the business entity ID is null. Now, here we can actually make use of the expression builder. Here we’ll see all of the expression elements that are there, and if I search for null, I’ll get all of the functions that actually have some sort of operation based on null values. So the function that I want is null. And what do I want to check for? I want to check for null values in the business entity ID. So I’ll choose the input schema in the expression elements, I’ll remove this null filter, and here I’ll click on the business entity ID, and I have my expression over here if you want to combine multiple expressions. You can use the different operators that are available. I’ll hit on, save, and finish. So remember here: what are we doing? We are now deleting those rows where the business entity ID is equal to null. And now, finally, I’ll add my destination. So I’ll add my sync. So I’ll go on to sync. I’ll give a name: DM Customer. My incoming stream is removing nulls. Let me create a new data set.

So I need to choose Azure Synapse. I’ll hit “continue.” I’ll choose the data set as “Dim Customer.” I’ll put up my link service. I’ll load the existing table and let me hit “okay.” Here, I’ll go on to the mapping here. Let me remove auto-mapping. because our store name is just the name that’s given over here. So now we have everything in place. I’ll go ahead and publish everything. So this will publish our data sets and our data flow. I’ll hit the publish button. While this is being done, let’s see if we have anything on our customer table. I’ll hit execute. So we do have something on the customer table. So we need to delete it. So all the rows are gone. So we had 1336 rows. So we have our data flow in place. Now let me create a pipeline based on this data flow. I’ll hide this. I’ll give you a name.

I’ll hide the properties. I’ll insert data flow. I’ll just give it a name in the settings. For staging, I’ll select my dimension flow. I’ll choose my data lake storage, browse for the container, hit OK, and publish everything. Once this is done, I’ll trigger my pipeline. So it will go ahead and run our pipeline. Now, in our next chapter, let’s wait till the pipeline has actually executed, and then we’ll see the results. But in the next chapter, we’ll actually add to our data flow, right? More steps are required to complete our product dimension table. So one of the advantages of having that data flow is that you can add multiple steps to the data flow itself. We are currently adding one flow to manage our customer table: our dimension customer table. But in the same vein, we can now add the same concept to our dimension product table. Another thing I want to point out, which you will see in another chapter, is this feature of the data flow debug.

So with this option enabled, you will have the ability to actually debug your mapping data flow in the designer itself. And this is very useful to see if your mapping data flow works as expected. So normally for me, when I am doing my Rand D and trying to understand how the mapping data flow works, I always use data flow debug. Now, there is obviously a caveat because if you enable data flow debugging, you are going to be paying for the underlying machine that’s going to run this debugging operation. So remember, this particular pricing is for data flow execution and debugging as well.

And, once again, how much do you pay per hour? It’s basically this price. So, if you work in a large organization, you may already have an enterprise contract with Microsoft for Azure. So at that point in time, it’s obviously very useful to use this debugging feature. And even if you are determined to understand how mapping data flow works, you will enable the data flow debug option to actually see how the flow works. I will definitely show you the advantages of enabling data flow debugging. In a subsequent chapter, let me go on to the monitor section just to see my dimension pipeline.

So it’s still running. Let’s wait till this is complete. Now, once the pipeline is complete and I can see it has succeeded, if I now check for the rows in the customer table, I can see the rows over here if I look at the count, and I can see the same number of rows. If you want, you can also order this by customer ID. And you can see all of the information here. Right? So in this chapter, I just want to show you how you can build a mapping data flow for your dimension tables.

15. Lab – Mapping Data Flow – Dimension Table – DimProduct

So in the last chapter, we had gone ahead and built our dimension customer table with the help of the mapping data flow. Let’s now use the same mapping data flow to also build a dimension product table. So in the dimension product table, we have our main product table that’s doing a left join with the product model table, the product subcommittee category. And again, we have to check whether the product model ID is not null. So I’ll go on to the author section here, and I’ll go on to my dimension flow, which will work on the same thing. So I’ll collapse this. Now here, let me add a source. So let’s start with the sources for our product table.

So I’ll choose the product table. I’ll put this under the product stream. I’ll create a new data set. Choose Azure SQL Database, hit Continue, give the data set name, choose my database, and choose my table source. production OK, let me add another source. This will now be our product model table. Create a new data set, choose the Link service, choose the table, hit OK, and let’s create another source. This time, we have our product subcategory as a stream, a new data set as your SQL Database, click Continue, select the Link service, select the product subcategory, and click OK, so we have all three tables in place. Let’s do a join now. So I’m going to do a join. So this join will be from our product table to the product model table.

As a result, I’ll insert “product join product model” here. My left stream is the product stream, and my right stream should be the product model stream. I’ll do a left outer join. Yeah, I need to decide on my column. So both of these are based on—wait for it—predicate. Here is the product model ID. So I’ll choose the product model ID. Let me just copy it from here. Place it here and hit on Enter.Place it here and hit on Enter.Right, we’ve got that done. Next, I’ll perform a join. So now our left stream will be the one that we just joined. And our right stream should be the product subcategory stream. Again, the left outer join Yeah, let me choose the column. So the column is the subcategory ID. Enter, enter, enter. So we’ve also got that in place. We can also use this name. Now that this is in place, I can now select my columns of interest. So I’ll choose the select schema modifier. So I’ll say choose columns. I will delete the columns that I don’t require.

So what are they? So I need the product ID. I need the product name, I need the safety stock level, and I need the product model ID. I need the model name, and I also need the product subcategory ID and the category name. So I need the product ID. I also need the product name. I don’t need the product number, the make flag, or anything else. I do need the safety stock level. I don’t need this. I don’t need this. I don’t need the list price. So just delete the columns that I don’t require. So I do need the subcategory ID. I do need the model ID. We already have one model ID in place. I need the model name. We already have the subcategory ID, so I don’t need that. I do need the category ID and the category name. I also don’t need the product category ID, right? So it’s 123-4567 if I go back. So here I also have 123456 and seven. So there are products listed under the category name that we can rename. So I’ll just copy this. So it’s this name over here that I’ll rename.

This next part is our product name. Let me copy this. So the product name is here. And finally, what else did we rename over here? The product model name So I’ll just copy this and place it here. So now we’ve also renamed all of our columns also renamed.Now next I need to just give a unique nameover here because we already have one select columns here. So that should be fine. Now next, I need to again select my rows. So I’ll choose an alternate row. So again, I’ll say remove nulls. This is for the product. And here, what do we need to check? We will check if the product model ID is not null. So here I’ll choose my expression builder. Again, I’ll search for Is. Now I’ll just choose that I’ll remove this. Go on to my input schema. What is a product model ID? So I can just search for it. Select that it’s over here. Click on “Save” and “Finish.” You have this in place. Now I’ll add my sync to the sync, making it a dimension product stream. Choose new.

I have to choose Azure Synapse. Continue by pressing Enter. But this is the data set. So it’s stem product DT, myLink service, and dimension product. Hit OK, go on to mapping, remove auto mapping, and make sure everything is being mapped properly. Now, in terms of my product table, I also have that surrogate key, which is also in place. So we don’t need to map this particular column. So I’ll remove this. So I just have seven columns over here. Let me now publish everything right, so we’ve done quite a lot. Let me hit publish. So we have a good mapping data flow for both of our dimension tables. I’ll now launch SQL Server Management Studio. Let me delete this from the customer table. Let me delete it from my dimension product table. I’ll hit execute. So this is also done. So going back here, let me now go on to my pipeline. So I can actually go on to my dimension pipeline.

I don’t need to change anything. I just made a change to the mapping data flow. I can now directly trigger my pipeline. Right, so we’ve made a change to the mapping dataflow to now consider both of our dimension tables. Let’s come back once the pipeline is complete. Now, once the pipeline run has succeeded, if I go on to SQL Server Management Studio, let me first check if the rows are there in the customer table. So that is in place. Now I’ll select Star from the product table, and I can see all of my information in one place. So in this chapter, we have seen how we could complete our mapping data flow to consider both of our dimension tables.

16. Lab – Surrogate Keys – Dimension tables

Now in this chapter, I want to show you how you can actually generate values for the surrogate key for your dimension tables in Azure Data Factory. So, earlier on, we saw how to use the identity column feature that is available for a table in a dedicated SQL Pool. This time, we’ll make use of the SurveyKey feature available in Azure Data Factory. Now, the first thing that we need to do is recreate our table because I don’t want the identity feature to be available for the product SK column.

So this is a surrogate key. So let me take all of this. I’ll go on to SQL Server Management Studio. So currently, I have this data in the table. So let me first drop the table, and then let me recreate the table again, right? So this is done, and I’ll go on to the author section. Now, remember earlier on we had created our dimension flow, which was used for our product table? So I’ll just expand on this. So I’ll just zoom in so that we have all of our data. Now, when it comes to our sync, we add seven columns, right? Now, after removing the nulls, after this phase, I’m going to add another derived column. But this time, we have something known as the surrogate key.

So I’ll choose that. Yes, I can make the stream dangerous. This is the incoming stream. What is the key column name? Sulfur produces SK. And the starting value is one. I’ll go now on to my sync. And I need to add one more mapping here in the mapping. This is now for the SK product. And here, this will be mapped onto the product output column. So let me publish this. And now let me go on to the dimension pipeline, and let’s trigger this pipeline. So here the only difference is that now we’re generating the values for the surrogate key for the dimension table from Azure Data Factory. Let’s wait until the pipeline is complete. Now, once the pipeline run has succeeded, I can see that it has succeeded if I go onto SQL Server Management Studio and look at the information in the dim product table. So here I can see that the product SK values are being generated.

17. Lab – Using Cache sink

Now, in this chapter, I want to show you how you can use the cachet sync and lookup feature, which is available as part of the mapping data flow. So you can use the cache sync feature to write data to the Spark cache and not the data store. The cache lookup can then be used as a reference to reference the data in the cache sync, for example, in our use case.

So let’s say that we had our product dimension table. We have the values for the product SK. Now we want the next set of rows so that if they are added on to the dimension table, we want to ensure that the product SK key starts from where it left off. It should not start from the value of one again. So the first thing that we can do is get the maximum value that is current in the productSK column and then store it in the cache. And then we can reference that value using the cache lookup feature. So let’s see how we can implement this. So in the last chapter, we used the surrogate key feature in Azure Data Factory to add our data into the product dimension table. Now, if I look at the maximum value of productSK, I can see it’s five, not four. So for the next values of product SK, they should be values greater than five, not four.

So in Azure Data Factory, first of all, let me enable the data flow debug option in this particular case because we are going to be making changes to our current mapping data flow and I want to use the data flow debug option. Now, the first thing that we need to do when it comes to our dimension product table is to first ensure that we get the maximum value of the current substitute key in our current dimension product table. And then our surrogate key over here should take it forward from there. So we now need to add another source. So I’ll go down onto the source over here and let me choose now. Let me give a name to the dimension product stream. Then I’ll choose a new data set of applications here on Continue, I’ll choose my link service, and I’ll choose my product table. I’ll hit OK; now I’ll go on to the source options. Yes, I’ll select a query and insert the query determining the maximum product SK value from my dimension product table.

Next, I’ll go on to the projection, and I’ll import the projection again. So this is also done. In place of the Type integer, we now only have one column. If I go on to the data preview and hit Refresh, I can see a value of five, not four. Instead of selecting a data set, I’d like to add this value to a sync in the destination. I’m going to choose the cache of the internal Apache Spark cluster. Simply enter a name in the mapping. I have it as product SK if I disable auto-mapping. So, what have we done so far? We made certain that we took the maximum value of product SK and added it to a cache and a sync. So remember, this is going to be added to the cache for the Apache Spark cluster. Now, on top of that, I have to go on to my surrogate key.

And here I need to now add, if I just go here, a derive column because I now want to change the value of the surrogate key. So here I’ll choose that product, SK. And here I’ll go on to the expression builder. And now here, if you actually go on to cache lookup, you can see you have the product cache in place. So here I’m going to enter an expression. So here I am saying now, please take the value of product SK, which will be one. So remember, in the earlier step when we create the surrogate key, it has a default value of one, and then keep on adding it to what’s there in the cache. Here I am saying, “Please take the first row, the first value,” because we only have one value in the cache, and that’s product SK. Here you can hit refresh to look at the data preview, just to confirm, right? So everything is fine.

As a result, you can also leave this dangerous. We don’t have to actually look into these problems. I can click on “Save” and “Finish.” So we have the expression in place. If I proceed to my dim product stream, this product SK remains unchanged. Now I’ll publish everything. Now, if I run this pipeline again, all of the duplicate data will also be added. But then we should be seeing different values in the product SK column. Let me also disable this data flow debug. So now that we’re in the mounting section, let’s come back once our pipeline has completed its run. Now, once the pipeline run has succeeded, if I go on to SQL Server Management Studio and let me select Star from the product ID, So here I can see that there are some product IDs with the value of 839. So it has gone ahead and again duplicated the values for the dimension product, but it has not duplicated the values for the product SK. So in this chapter, I’m just going to go over how to use the cache sync and lookup features.

18. Lab – Handling Duplicate rows

Now, in this chapter, I’ll show you how to ensure that duplicate values in duplicate rows do not enter your data flow. So we are going to be doing this for a product table, for a product mapping data flow. I just want to show you how many things you can actually put in your entire mapping data flow. Now, in the last chapter, we had done a duplication of our data in the dimension product table because I was showing the cache sync and lookup features. The first thing I’ll do is delete from the table where the product SK is greater than five, not four, so that we actually remove the first set of duplicate data and only have the original amount of 295 rows. Now, next, I just want to check on the existing product IDs that are in place in the table. So I’ll just hit “execute.” So I can see these are all of the product IDs, and I can see there is a unique value for each product ID. Now, let me do one thing. Let me delete again the product IDs or rows where the product ID is greater than 900.

So 99 rows have been removed. So, if you select count stars from the dimension product table, you’ll see that I have 196 in place. Our original count was 295; So, when I’m copying data, say, in our mapping data flow and building the dimension product table, I want to make sure that only new rows are added to my dimension product. I don’t want to copy the duplicate rows. I don’t have duplication of rows in my dimension product table because if I run this mapping dataflow hazardous, it will take all of the rows of the products that we have and again add them to our dimension product table. But I only want to choose those rows that are now missing. So for that, what we have to do is add yet another source. So the source is going to be the data set based on our current dimension product table. So here I’ll create a new data set. I’ll choose Azure Synapse; hit on “Continue.” I’ll just give a name, choose MyLink service, and choose my Dimension product table. OK, so we have this in place; we have this with a source; let me give you the name of the source. So all the data from the dimension product Now I want to add a step after removing all of the nulls. And here I want to choose the existing operation.

Now, in the exist operation, I can choose my left stream to have all of the rows that are coming in after removing the nulls, and my right stream to be my entire synapse. So my dimension product synapses all data, right? So we’re mapping this to that, and I’m telling you to only take the rows that don’t exist, with the condition that the product ID on the left hand side equals the product ID on the right hand side. So, if the product IDs don’t exist, if the equality doesn’t exist, only then do I want to carry this data forward, right? So I hope this is understood. Here we have our entire data set, which I have defined as a new source. Here we had our existing step of removing the rows, which basically didn’t have any nulls.

Now, those rows where the product IDs don’t exist and the equality doesn’t exist, carry that forward on to the next step for the surrogate key, and that’s it. We can leave everything as it is. I’ll go ahead and publish everything, and then I’ll trigger my dimension pipeline. I’ll hit on it. OK, now please note that our dimension pipeline also has the customer dimension table. And if you actually look at our customer dimension table, you will see duplicate values in the table. So, if I slightly select a star from the dimension customer table, I should only be counting 1336 rows, I believe. But then I have more than that because I’ve not been keeping track of it. So here I am only currently focusing on one of the tables, which is our dimension product table. Now, once the pipeline run has succeeded, if I go to SQL Server Magma Studio and look at the count of rows, I can now see IAP 295 if I do a select star from the product table. So here I can see all of the product IDs. So, when we were copying data onto our dimension product table, we made sure there were no duplicate rows in place.

19. Note – What happens if we don’t have any data in our DimProduct table

Now in this chapter, I want to talk about an issue that can pop up in your workflow if you don’t have any data in your dimension product table. This is when it comes to defining your surrogate key. So we’ve already seen step by step how we can actually build our dimension product table, and everything worked fine, and that’s because along the way we did have some data in our dimension product table. But if you don’t have any data, you will actually face an issue. So first of all, let me see if I have anything in my product dimension table.

So I have nothing in place. Now I also go ahead and enable the data flow debug. Now, if I go to the stage where we have our derive columns, let me go to the data preview and hit refresh. So here, if you scroll on to the right, please note that these are the rows that are not going to be included. But still, if you go on to the right, you can see the product SK column is coming up as null. So why is this the case? So if I go on to my surrogate key here, let me go on to the data preview, and let me hit refresh. So if I go on to write, I can see that I have my product SK in place, right? So we are getting the values. Then why is it that in the derive column, we are actually now receiving the product skull over here? So if I go on to our derive column settings here, let me open up the expression builder and let me refresh this. So we’re currently looking at what’s in the cache and adding it to product SK. That is a surrogate key. So our surrogate key, as per our steps, is returning the correct values. So we have 12345. That’s fine.

Then why are we still getting null values? So here you can see that when it comes to the output, we are getting null values. And that’s because here in this step, remember when we were getting it from the cache in one of our earlier steps in the workflow? It was looking at the most expensive product SK. But we don’t have any rows in our dim product table. That’s why now this value is coming in as null, and that’s why this expression is now invalid. Everything that is being returned is null. So this is just a use case. I’d like to point out that if you don’t have any rows in your tables, you should consider all use-case scenarios. So now here’s what I’ll do. I’m going to change the expression here; I’m going to be using the if null function that is available here in mapping data flow. And please ensure that if this value is null, it is replaced with a value of zero. So, remember, we’re looking at what this value is. If it is null, then return the value of zero and return what is present in the cache, and now if I hit on-refresh, you can see you’re getting the output you desired. So again, just a note: when you have a workflow in place, please look at all the use case conditions when it comes to working with your data.

20. Changing connection details

Now, in this chapter, I just want to give a note on changing connection details. So let’s say you’ve created a LinkService and you want to change the connection details for that Link service.So it’s very easy. You can go on to the Manage section, which is an Azure data factory. You can go on to your Link service, and you can go on to the Link service questions. So, for example, if you want to go on to Azure Synapse here, you can change different aspects and choose your pool, choose the authentication type, and change the details over here. So I just wanted to give a quick note in case you wanted to make use of an existing Link service and change the connection details.

Uncategorized

Related posts:

Leave a Reply Cancel reply