DP-203 Data Engineering on Microsoft Azure Topic: Design and Develop Data Processing – Azure Data Factory Part 2
December 19, 2022

6. Lab – Azure Data Lake to Azure Synapse – Parquet files

Now, in this chapter, we’ll look at how to transfer data from our Parquet-based files onto our log data table using Azure Data Factory. Earlier, in the Azure Synapse section, we saw how we could transfer data from the Parquet files. That, again, is based on the data that we have in our Log.csv file and on our log data table. So again, just to confirm: if I go on to my Data Lake Gen 2 storage account, on to containers, on to my data container, and then on to raw, we have this Parquet-based folder in place in which we have all of these files. We’ll make use of these same files. Before we move on, I’ll go to SQL Server Management Studio and delete whatever data I have in the log data table.

So that is done; going back onto Azure Data Factory. Now, before we actually create another pipeline to copy data from Parquet onto our table, let me go on to the Author section. I just want to review what artefacts we have over here based on what we’ve done in the last chapter. Firstly, we have our pipeline. I’ll just hide this so we have a better view of our pipeline. Here we have our Copy Data activity, and you can see that you can drag a lot of activities onto this canvas. For the Copy activity, you have a lot of settings or properties for the activity itself.

So if you look at the properties of the activity here, we have the source that is defined. This is the source data set, which points to a file path, and we have properties for the source. Similarly, we have the sink, which includes the copy method that we specified. We have the mapping between the columns from the source to the destination, and we have settings. So if you want to make any sort of change to the pipeline, you can do it right here. Next is the data set. In the last chapter, we copied data from our Log.csv file, and for this it created a data set; it’s like a view of the data at that point in time. At the same time, it created a destination data set. This represents the data that would be copied onto Azure Synapse in our dedicated SQL pool, so it has created a data set for Synapse that is specific to that pool and table. So for each Copy activity that you have, you’ll have a source and a destination data set.
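To make the relationship between the pipeline, the Copy activity, and the two data sets a bit more concrete, here is a rough sketch, written as a Python dictionary, of the kind of JSON definition Azure Data Factory authors behind the scenes for such a pipeline. The names are placeholders and the property names are written from memory, so treat it as an illustration rather than the exact definition the Copy Data tool generates; the Code view in ADF Studio shows the real thing.

```python
# Rough sketch of the JSON behind a pipeline with a single Copy activity.
# Names and some property values are assumptions made for illustration.
copy_pipeline = {
    "name": "CopyLogCsvToSynapse",                      # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyLogData",                  # the Copy activity on the canvas
                "type": "Copy",
                "inputs":  [{"referenceName": "SourceLogCsvDataset",
                             "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SynapseLogDataDataset",
                             "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},  # CSV source settings
                    "sink":   {"type": "SqlDWSink"},            # dedicated SQL pool sink
                    "enableStaging": True,                      # staging for Synapse copies
                },
            }
        ]
    },
}

if __name__ == "__main__":
    import json
    print(json.dumps(copy_pipeline, indent=2))
```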

You can reuse a data set if the aspects of that data set are not changing. But if, say, the number of columns in the source data changes, it’s best to create a new data set that can be used in a new activity. So I’ll create a new pipeline. Instead of going on to the Home section and creating a new Copy Data activity, let’s go on to the Author section. Here, under Pipelines, I can create a new pipeline, so I’ll do that. I’ll hide this. Here we can give a name to our pipeline. Once I’ve given the name, then, under Move and Transform, I’ll take the Copy Data activity and just drag it onto the canvas. I can now give a name to the Copy Data activity. I’ll now hide the properties for this Copy Data activity.

We can give a timeout and a retry interval. Now I’ll go on to my source. Here I’ll create a new source data set because my source is different; I now want to take data from my Parquet-based files. So I’ll hit on New. Here I’ll choose Azure, and I’ll go with Azure Data Lake Storage Gen 2. I’ll hit continue. Here I can now choose the format of the files, so I’ll choose Parquet. I can give it a name, so I’ll give it the name “parquet log,” and I can reuse the same linked service, which points to the Azure Data Lake Gen 2 storage account. Here, I’ll browse for my files. So I’ll go on to the data container, and I’ll go on to my raw folder.
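For reference, the source data set created here ends up as a small JSON document of its own, referencing the linked service and the folder in the data lake. Below is a hedged Python-dict sketch of roughly what a Parquet data set pointing at an ADLS Gen 2 location looks like; the data set name, container, and folder path are placeholders.

```python
# Approximate shape of the Parquet data set created in this step. Names and paths
# are placeholders; check the data set's "Code" view in ADF Studio for the exact JSON.
parquet_log_dataset = {
    "name": "parquetlog",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorageGen2",  # existing linked service
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",  # ADLS Gen 2 location type
                "fileSystem": "data",           # the "data" container
                "folderPath": "raw/parquet",    # the Parquet-based folder
            },
            "compressionCodec": "snappy",
        },
        "schema": [],  # left empty because the schema is not imported yet
    },
}
```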

Now, when it comes to the properties of the data set, I won’t import the schema at this point in time. By “schema,” I mean that Azure Data Factory can try to figure out what the schema is for the data in that file. So if there are column headers, let’s say in the Log.csv file, it can take that information. And even when it comes to Parquet-based files, remember that I said that the column names and the data types are embedded in the Parquet-based file itself. So we could import the schema, but I won’t do this at this point in time.

And that’s because I still have to tell Azure Data Factory to take all of the files in that Parquet-based folder. So I’ll hit OK, and now that we have our source data set in place, I need to choose a wildcard file path. Here we can see that it has automatically taken all of the Parquet-based files; however, you can choose which files to take at any time. Now, if I just go ahead and open up the data set one more time and go onto the schema tab, let me import the schema; I can import the schema from the connection/store. And here we can see the schema in place.
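The reason this schema import works without any headers being defined anywhere is that a Parquet file carries its own schema. As a quick sanity check outside of Data Factory, you could inspect one of the files locally; here is a minimal sketch using pyarrow, where the local file path is an assumption (download one of the files from the Parquet folder first).

```python
# Minimal sketch: inspect the schema embedded in a Parquet file with pyarrow.
# "log.parquet" is a placeholder path for a locally downloaded copy of one file.
import pyarrow.parquet as pq

schema = pq.read_schema("log.parquet")  # reads only the file footer, not the data
for field in schema:
    # Column names and types are stored in the file itself, which is what
    # Azure Data Factory picks up when you import the schema.
    print(field.name, field.type)
```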

So I also wanted to ensure that I showed you how you could import the schema from here. One way of importing the schema is when you define the data set; alternatively, you can open up the data set afterwards and import the schema. By importing the schema, Azure Data Factory now takes all of the column names and the types of the different columns. So, just not to lose track: we’re in the data set tab, and we can switch between tabs at any point in time. If I go back onto my Parquet-to-Synapse pipeline, we’re back on track; the previous pipeline was the one taking the Log.csv file. So, now that we’ve completed this, we can actually close this.

So we’re back on the Copy Data activity. Now I’ll go on to the sink. Here I can choose the same destination data set; this refers to the log data table in Azure Synapse, so I can choose that here. Again, it’s using PolyBase, and I’ll leave that as is. Next, let me go on to the mapping and click on “Import schemas.” So now it’s doing a mapping between the source columns and the destination columns, and everything seems to be fine. I’ll go on to settings. Now I need to enable staging because we are copying data onto Azure Synapse. I need to choose my Data Lake Gen 2 storage account, and I’ll browse for my Synapse container.
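Putting the source, sink, and staging choices together, the Copy activity’s type properties end up looking roughly like the sketch below: a Parquet source with a wildcard path, a dedicated SQL pool sink using PolyBase, and staging enabled against the data lake. The property names are written from memory and the paths are placeholders, so verify against the activity’s JSON in the Studio.

```python
# Approximate typeProperties for the Copy activity configured in this lab.
# Names and paths are placeholders for illustration only.
copy_activity_type_properties = {
    "source": {
        "type": "ParquetSource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "wildcardFolderPath": "raw/parquet",  # folder with the Parquet files
            "wildcardFileName": "*.parquet",      # take every file in the folder
        },
    },
    "sink": {
        "type": "SqlDWSink",
        "allowPolyBase": True,                    # copy method left as PolyBase
    },
    "enableStaging": True,
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorageGen2",
            "type": "LinkedServiceReference",
        },
        "path": "synapse",                        # the Synapse staging container
    },
}
```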

OK, so this is also fine. We’ve now created our pipeline and activity on the canvas in the designer. If we want to run this pipeline, the first thing we should do is validate it to ensure that all of the properties have been set correctly. So if I click on Validate, it will validate the pipeline. That’s done. Next, we need to publish the pipeline, so I’ll hit “Publish.” This is now going to publish our pipeline and the data set for our source. It’s not publishing the data set for our destination because we already have that from before. I’ll hit the Publish button. It’s only once the publish completes that we can actually trigger our pipeline.

One other thing is that you have to ensure that validation passes before you can actually publish the pipeline. Publishing the pipeline is the same as saving it and the changes you’ve made to it. But before you can publish the pipeline, you have to ensure that it has no validation errors. When we go on to working with a Git repository for Azure Data Factory, you will see that you can save your pipeline at any point in time. The restriction here is that, in order to save your pipeline, it must be error-free, which means you have to complete the whole pipeline before you can save it.

And that’s sometimes an issue, because you may want to make changes to the pipeline at a later point in time. You want to first have it in, let’s say, draft mode, and then make the required changes. For that, you need to connect Azure Data Factory to a Git repository, which we’ll see in a later chapter. Now we want to run this pipeline. Here we can trigger the pipeline manually, so I’ll hit the Trigger button and then hit OK. It’s now going to trigger the pipeline, and we should get a notification that a new pipeline run has started. If I go on to the Monitor section, I can see my Parquet pipeline in progress, and after just a few seconds I can see it has succeeded. If I open SQL Server Management Studio, I should be able to see my log data. So in this chapter, we have seen how we could easily copy data from our Parquet-based files onto a table in Azure Synapse, and this time we created the pipeline and the activity from scratch.

7. Lab – The case with escape characters

Now in this chapter, I just want to go through a scenario wherein you could have escape characters in your data. And what do I mean by this? If I look at my Log.csv file, I said I had taken this information from my Activity Log data. When you look at the Activity Log data, there is actually an extra column which I had not considered; I had deleted it, and I mentioned this early on. So I didn’t have this column in place in any of the exercises where we were moving data from the source to the destination. This column actually holds the full URL of the resource that was being impacted. Now, if you look at the resource value here, the very first character is one that is also used in escape sequences. And what do we mean by this? Let’s go on to the Microsoft documentation.

So you have these different escape sequences to signify or represent something. For example, you could have backslash n (\n) to signify a new line. When you look at a .csv file, you have your values separated by commas, and then internally you have an escape sequence like this to signify that the next row is going to start on a new line. So these are escape sequences that are used to signify a particular action. Now, the problem that we have in our data is that the value itself contains such a character. Even though our data is not meant to indicate an escape sequence, the engine or software that is used for transferring the data from the source onto the destination has to understand this and not interpret it as an escape sequence.
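To see why this matters outside of Data Factory, the short Python sketch below parses the same kind of row twice: once with the backslash treated as an escape character and once without. The sample value is made up; it simply mimics a field that happens to contain backslashes as ordinary data.

```python
# Demonstrates how an escape-character setting changes what value a CSV parser
# produces. The row below is made-up sample data, not the actual log file.
import csv
import io

raw = '42,Succeeded,"C:\\logs\\vm01"\n'   # last field contains backslashes as plain text

# Parsed with no escape character configured: the backslashes survive as data.
plain = next(csv.reader(io.StringIO(raw)))

# Parsed with backslash configured as the escape character: the parser consumes
# each backslash as an escape, so the value comes out different.
escaped = next(csv.reader(io.StringIO(raw), escapechar="\\"))

print(plain[2])    # C:\logs\vm01  -> backslashes kept
print(escaped[2])  # C:logsvm01    -> backslashes treated as escapes and removed
```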

Because, by default, when it sees this character, it will think that it is an escape sequence, and it will throw an error when it tries to take the value of that row. So this is very important: in the end, your data is what matters, and you need to understand the different aspects of your data. That’s why I want to bring this up in this particular chapter. So in Azure Data Factory, I just want to show you a setting that is available for dealing with such characters. Firstly, in my data container, I’ll go on to the raw directory and upload the file which has that resource data. So I’m going to upload my file as log_original. I’ll hit “open,” and I’ll hit upload over here. So we have that file in place. If I go onto Azure Data Factory, go on to the home page, and click on “Ingest,” I will just create a quick copy activity using the built-in copy task. I’ll click on “next.” Here I’ll choose the same connection to Azure Data Lake Storage, and I’ll browse for that file. So I’ll choose log_original.csv.

I’ll remove the “recursively” option and go on to the next step. So now it is detecting our data set. If we preview the data here, we can see that, by default, the resource column is being detected properly. And that’s because, if you look at the file format settings here, there is an escape character already defined. With the help of this backslash character, we’re telling Azure Data Factory how to treat that character if it comes across it in the data in our file, so that the value is not misinterpreted as an escape sequence. Now I’ll go on to SQL Server Management Studio. There, I’ll right-click on my dedicated SQL pool and hit New Query. I’ll drop my table first, and I’ll also copy the script to create our table; I’ll need to add another column, which is where we will keep the Resource value. So I’ll create this table. Once this is done, I’ll go back to Azure Data Factory and go on to the next step.

Now here, I’ll choose my existing connection for Azure Synapse, and I’ll use my existing log data table. I’ll go on to the next step. Here I can see the mapping is fine, but I’ll need to change the types, so I’ll just go on to the next step for now. Here I’ll give the pipeline a name, choose my staging linked service, and browse for the Synapse container. I’ll hit OK, and under Advanced it is using PolyBase. Now, if I go back to the previous screen, I can disable type conversion, and here I can again choose integer, and then select datetime for the time column. So this is also fine, and here at the end you can see the Resource column is being mapped correctly. I’ll proceed to the next step.

I’ll go on to Next and ensure that the pipeline is created along with the Copy activity, then hit “finish.” I’ll proceed to the Monitor section now and just wait for this pipeline to complete, and I can see it has succeeded. If I go on to SQL Server Management Studio, I can see all of my data in place, and if I go on to the last row, I can even see the data in my Resource column. Now, you might think that this was a very simple exercise, but I want to bring this particular point to your attention. In my experience, when I was trying to transfer data that had this escape sequence onto a destination, I did face issues; I used to get errors where the copy activity would not go through. It is because Azure Data Factory has these capabilities for understanding the data that you have that you can copy it correctly onto the destination.
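For reference, the escape character shown by the Copy Data tool corresponds to a property on the delimited-text data set definition itself. Below is a hedged Python-dict sketch of roughly where that setting lives; the container, folder, and file names are placeholders, and the property names are written from memory, so confirm them in the data set’s JSON view.

```python
# Approximate typeProperties of a DelimitedText (CSV) data set, showing where the
# escape character is configured. Names and paths are placeholders.
log_original_dataset_type_properties = {
    "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "data",
        "folderPath": "raw",
        "fileName": "log_original.csv",
    },
    "columnDelimiter": ",",
    "escapeChar": "\\",        # backslash, the default escape character in ADF
    "quoteChar": '"',
    "firstRowAsHeader": True,
}
```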

8. Review on what has been done so far

In this chapter, I just want to have another review of the entire Azure Data Factory process. So far, we’ve seen a couple of copy activities in which we take data from a source and transfer it to a destination. We are not doing any sort of major transformation as of yet; all we are doing is a simple copy from the source to the destination.

Sometimes you might need just that degree of simplicity in your pipeline. Later on, we will be looking at the mapping data flow to see how you can perform transformations in your pipeline. But just for now, I want to ensure that students have a firm understanding of the Azure Data Factory service. So what have we seen so far? We have been creating a pipeline in Azure Data Factory, but we have only been using one activity so far: the Copy activity, which copies something from a source to a destination.

So far, we were copying data in the pipeline from our Azure Data Lake Gen 2 storage account onto Azure Synapse, onto our data warehouse. For our Copy activity, we were creating connections, or linked services. A linked service is a connection onto your source or your destination. Then a data set would be created for both your source and your destination, and the pipeline would use these data sets to copy the data from the source to the destination. And remember, this entire copy activity needs to happen somewhere. Your data needs to be taken from the source, wherever that is, and it must be moved by some sort of compute infrastructure before it lands in the Azure SQL data warehouse. You need something in between to carry out the transfer.

Now, when it comes to Azure Synapse, yes, you do need to have a staging area in your Azure Data Lake Gen 2 storage account or in Blob storage. But that’s just for the data. You still need some sort of compute infrastructure to take the data, put it in the staging area, and then put it into Azure Synapse. This activity is not going to happen at your source, and it is not going to happen at your destination; it needs to happen somewhere else. And that is the integration runtime that’s part of Azure Data Factory. So Azure Data Factory will create compute infrastructure in the background, similar to virtual machines, that will take data from the source and send it to the destination. This is very important: there’s something in between, completely managed by Azure Data Factory, that carries out this entire copy process in this particular case.

Next, I want to talk about the costing when it comes to working with Azure Data Factory, because there’s a cost for using every resource, and I want to bring this to your attention at this point in time. There are a lot of aspects to the pricing for Azure Data Factory version 2, which is what you will be using. First is the number of runs: it’s one USD per 1,000 activity runs, so you only pay one dollar for a batch of 1,000 runs. Next is something known as a data movement activity. This is charged at 0.25 USD according to a specific metric measured in Azure Data Factory.

In addition, we are currently utilising the Azure integration runtime. Remember, this integration runtime is the underlying compute infrastructure that is used for moving your data from the source to the destination. So these are the two main parts, and the main part of the pricing is actually the data movement activity. When it comes to the integration runtime that we are using, even in Azure Data Factory, if you go on to the settings for a pipeline, you can see that you will be charged 0.25 USD per DIU-hour (Data Integration Unit per hour). Now, to see how much you have actually consumed, you can go on to the Monitor section; for any pipeline run, you can see the consumption. So you might see that you used one activity run and that the data movement activities came to, say, 0.0333 DIU-hours. You take that and multiply it by 0.25 to see how much you have actually consumed in terms of cost. So obviously, if you are practising along, it’s very important to understand the cost when it comes to Azure Data Factory.
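As a quick worked example of that arithmetic, using the figures above (which are illustrative list prices and vary by region and over time), the cost of one such pipeline run could be estimated like this:

```python
# Rough cost estimate for a single Copy pipeline run, using the figures mentioned
# above. Prices are illustrative and change over time; check the Azure pricing page.
ACTIVITY_RUN_PRICE = 1.00 / 1000   # 1 USD per 1,000 activity runs
DIU_HOUR_PRICE = 0.25              # 0.25 USD per DIU-hour of data movement

activity_runs = 1                  # one Copy activity executed
diu_hours = 0.0333                 # data movement consumption shown in Monitor

orchestration_cost = activity_runs * ACTIVITY_RUN_PRICE
data_movement_cost = diu_hours * DIU_HOUR_PRICE

print(f"Orchestration: ${orchestration_cost:.4f}")   # $0.0010
print(f"Data movement: ${data_movement_cost:.4f}")   # $0.0083
print(f"Total:         ${orchestration_cost + data_movement_cost:.4f}")
```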

9. Lab – Generating a Parquet file

Now, in this chapter, I want to show you how you can generate Parquet-based files. Earlier on, I had mentioned that, for the Parquet-based files I’ve been using as part of my labs, I had actually used Azure Data Factory to convert my Log.csv file into those Parquet-based files, and then I could use them for my subsequent labs. So let’s see how we can go about the conversion. Firstly, let me go on to my Data Lake Gen 2 storage account. In my raw folder, let me create a new directory; I’ll just give the directory a name and hit Save. So in this directory, we don’t have anything at all. Now, in Azure Data Factory, I’ll go on to the Author section, create a new pipeline, and give it a name.

Next, I’ll go on to the source. Now, I can use my existing source data set; this is the data set that points to my Log.csv file. If I hit on preview data, I can see that it points to my Log.csv file, so I can use the same data set. Always try to ensure that you give meaningful names to your data sets. Next, I don’t need the recursive option. I’ll go on to my sink. Here I need to create a new sink data set, because I am now generating Parquet-based files in my storage account. So I’ll hit on New. Here again I’ll choose Azure, and I’ll choose Azure Data Lake Storage Gen 2. I’ll hit “continue,” choose Parquet, and hit continue again. I can name it here.

So I’ll say this is the parquet output, and I’ll choose the same linked service for my Azure Data Lake storage. Here, I’ll browse for the path: my data container, my raw directory, and then my new parquet directory. I’ll hit OK. Again, I’m not importing any schema; I’ll keep it as “none” and hit OK. Here, in terms of the mapping, I’ll import the schemas. I can see here that my Id column will be mapped to the Id column in the destination. Now, I want to ensure that this is taken as an integer, so I’ll select Int32. Because remember, in Parquet, even the type is part of the Parquet-based file, so I want to ensure that the right type is written to the destination. And remember, even though in our source CSV file everything is a string, in Azure Data Factory we can tell it to map a column onto a particular type.
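Conceptually, the conversion this pipeline performs is the same as reading the CSV with explicit column types and writing it back out as Parquet. The sketch below does that locally with pandas and pyarrow; it is not how Data Factory does it internally, and the file and column names ("Id", "Time") are placeholders standing in for whatever your Log.csv actually contains.

```python
# Local illustration of what the pipeline does: read a CSV (where everything is
# text), apply explicit types, and write Parquet, which stores those types in the
# file itself. File and column names are placeholders; this is just the same idea
# done with pandas + pyarrow, not ADF's internal mechanism.
import pandas as pd

df = pd.read_csv("log.csv")

# Map the columns we care about onto proper types, the same way the mapping tab
# maps the Id column to Int32 and the timestamp column to datetime.
df["Id"] = df["Id"].astype("int32")
df["Time"] = pd.to_datetime(df["Time"])

# Parquet embeds the column names and types, so a later pipeline (or Synapse)
# can pick the schema up straight from the file.
df.to_parquet("log.parquet", engine="pyarrow", index=False)

print(pd.read_parquet("log.parquet").dtypes)
```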

For the time column, I’ll again change the type and choose datetime, and hit Enter just to ensure that this is also in place. I’ll leave the rest as is, go on to the Settings tab, and leave everything there as is as well. Since we are not copying onto Azure Synapse, I don’t need to enable any sort of staging area. Now, let me publish everything. I’m not doing an explicit validation over here; during the publish, if there are any validation errors, they will come up. It’s not showing any validation errors, so I can hit “publish.” Again, note that it is creating a new pipeline and a new data set. I’ll hit the Publish button. Once this is done, I’ll trigger the pipeline; again, a manual trigger. I’ll go on to the Monitor section, where I can see my pipeline for generating Parquet, and I can see it has already succeeded. So if I now refresh my new parquet directory, I can see my log Parquet file over here, and it’s 1.25 MB.


