DP-203 Data Engineering on Microsoft Azure Topic: Design and Develop Data Processing – Azure Data Factory Part 1
December 19, 2022

1. Section Introduction

Hi, and welcome to the section on Azure Data Factory. This is your ETL tool, your Extract, Transform, and Load tool, so you can use it for your batch processing needs. We are going to look at, first of all, the copy activity that you can use in Azure Data Factory; this is the most common activity that is available. From there, we'll move on to Mapping Data Flow, where I'll walk you through a use case scenario for creating dimension and fact tables. And then I'll also go through some other scenarios when it comes to Azure Data Factory. Again, there is a great deal to cover. So let's move on.

2. Extract, Transform and Load

Hi, and welcome back. Now in this chapter, I just want to go through a quick overview when it comes to ETL, that is, your extract, transform, and load process. Assume, for example, that there is a requirement to extract data from various data sources and then load that data into a target data store. So, for example, you might want to take data from an Azure SQL database or from a SQL Server instance hosted in an on-premises environment.

And maybe you want to take that data and load it onto an Azure SQL data warehouse. And maybe you also want to perform some sort of transformation or change on the data before it is actually loaded onto the SQL data warehouse. This sort of process is known as the "extract, transform, and load" process: you perform the extraction of the data, you perform the transformation, and you load the data into the target area. Over here, you would need to have a separate transformation engine in place to transform the data. That engine has the ability to perform operations such as filtering and sorting the source data and joining data from different data sources before the data reaches the target.

When it comes to Azure, you can make use of Azure Data Factory. In terms of Microsoft SQL Server, there is also SQL Server Integration Services, which is available when it comes to performing ETL. Now, a variation of this is also available; it's known as the "extract, load, and transform" (ELT) process. Over here, the load and the transformations actually happen on the target itself, so you don't need a separate transformation engine. However, the target system must be powerful enough to perform the transformations. This is something that is available in Azure Synapse as well. So in this chapter, I just wanted to give a quick overview of the extract, transform, and load process.

3. What is Azure Data Factory

Hi, and welcome back. Now in this chapter, let’s have an introduction to Azure Data Factory. Now this is a cloud-based ETL tool. So in terms of ETL, we’re looking at the extraction of data. We’re looking at the transformation of your data and then at the loading of data into the destination data store. You can go ahead and create data-driven workflows in Azure Data Factory.

And these workflows can help orchestrate data movement. As part of this course, we are going to look at quite a few examples when it comes to Azure Data Factory. It also helps to transform your data. So you can extract the data, you can perform transformations, and you can load the data into the destination data store. When you consider the overall Azure Data Factory process, the first step is to connect to the necessary data sources. The next step is to ingest the data from the source itself. You can then transform the data in the pipeline if required, and finally you can publish the data to a destination.

A destination could be an Azure SQL data warehouse (that is, Azure Synapse), an Azure SQL database, or Azure Cosmos DB. You can also monitor the pipeline as it is running. Now, let's look at the different components when it comes to Azure Data Factory. First, you have the linked service. This enables you to ingest data from a data source; a linked service can also be used to create the required compute resources to take the data from the data source. We then have datasets. A dataset is used to represent the data structure within the data store that is being referenced by the linked service object. And then we have the activity. This contains the actual transformation logic, or it can be something as simple as a copy activity that copies data from the source to the destination.

Over here, I'm showing a representation of Azure Data Factory. In the Azure Data Factory service, you'll define your pipeline. Your pipeline is nothing but a logical grouping of activities in the data factory. So, for example, if you want to copy data from a source to a destination, you'll create a copy activity within the pipeline itself. You can create many activities within a pipeline in Azure Data Factory. Now, when you create an activity as part of a pipeline, you'll also create linked services. A linked service is similar to a connection string; it is used to allow Data Factory to connect to a data store, such as the source. In this example, the source is an Azure SQL database.

And over here, you can also create a linked service for the destination, which can be an Azure SQL data warehouse. So if you want to perform a simple copy activity of data from an Azure SQL database onto an Azure SQL data warehouse, you'll create two linked services: one for the source and one for the destination. Then you'll define a dataset. A dataset is nothing but a named view of the data that is used to reference the data within the activities themselves. So over here, you'll have one dataset for the source, and you'll have another dataset for the destination.
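Behind the scenes, everything you define in the Data Factory UI is stored as a JSON definition. As a rough sketch only (the names, table, and connection string placeholders below are my own assumptions, not values from the course), a linked service and a dataset for the Azure SQL database source could look something like this:

{
  "name": "AzureSqlSourceLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<your-server>.database.windows.net,1433;Database=<your-db>;User ID=<user>;Password=<password>;"
    }
  }
}

{
  "name": "SourceTableDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureSqlSourceLinkedService",
      "type": "LinkedServiceReference"
    },
    "type": "AzureSqlTable",
    "typeProperties": {
      "schema": "dbo",
      "table": "SalesData"
    }
  }
}

The linked service holds the connection information, while the dataset simply points at a structure (here a table) through that linked service.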

Now, Azure Data Factory is a completely managed service. So if you are performing a simple copy activity of data from, let's say, an Azure SQL database onto an Azure SQL data warehouse, Azure Data Factory will create the required compute infrastructure. This is known as the integration runtime. That compute infrastructure will be in charge of retrieving the data from the Azure SQL database and copying it to the destination. Remember that in Azure Data Factory, you can also perform transformations on the data in addition to extracting it from the source and loading it into the destination, and that transformation will likewise be done by the underlying compute infrastructure. This compute infrastructure is basically transparent to you; all of it is managed by Azure Data Factory. This alleviates the burden of having to manage the underlying compute infrastructure that would otherwise be required to perform the activities in a normal data pipeline. So in this chapter, I just wanted to give an introduction to Azure Data Factory.
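To tie the components together, here is a similarly hedged sketch of a pipeline with a single copy activity; "WarehouseTableDataset" is a hypothetical dataset for the destination, defined in the same way against a Synapse linked service:

{
  "name": "CopySqlToWarehousePipeline",
  "properties": {
    "activities": [
      {
        "name": "CopySourceToWarehouse",
        "type": "Copy",
        "inputs": [ { "referenceName": "SourceTableDataset", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "WarehouseTableDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "AzureSqlSource" },
          "sink": { "type": "SqlDWSink" }
        }
      }
    ]
  }
}

When this pipeline runs, the integration runtime mentioned above is the compute that actually reads from the input dataset and writes to the sink.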

4. Starting with Azure Data Factory

Moving on to the Advanced tab, I'll leave everything as it is. Let me go directly on to Review and Create, and let's hit Create. This will just take a couple of minutes. Now, once the Data Factory has been created, I'll go on to the resource, and here I'll open up Azure Data Factory Studio. And now we can start working with Azure Data Factory. So in this chapter, I just wanted to start simple, by creating a resource based on the Azure Data Factory service.

5. Lab – Azure Data Lake to Azure Synapse – Log.csv file

Now in this chapter, we'll see how to copy our Log.csv file, which we had seen in the previous chapters when it came to Azure Synapse, onto our table in our dedicated SQL pool. We are going to be making use of Azure Synapse as the destination, with our dedicated SQL pool, because most of the exam concepts are based on transferring data onto your dedicated SQL pool.

So I have ensured that I have maintained my Azure Synapse workspace. Remember, if you have deleted your Synapse workspace, that's fine; you can recreate it and create your dedicated SQL pool again. Now, when you look at Azure Data Factory, there will be a lot of aspects that are similar to what you have seen in the Integrate section of your Synapse Studio workspace, because Azure Synapse has this integration with Azure Data Factory. So, back in Azure Data Factory, we have the Home section and the Author section, where we can create pipelines, datasets, data flows, and Power Queries. We have the Monitor section, and finally, we have the Manage section. I'll go on to the Home section, and if I click on Ingest, this is going to open up the Copy Data tool, something we had seen earlier on as well. So if you choose that particular option, it will directly open up the Copy Data tool.

Now, we are going to be looking at a lot of the options that are available in Azure Data Factory, but at this moment I want to start as simply as possible, so I'd like to copy our Log.csv file. If I go back onto my Azure Data Lake Gen 2 storage account, we'll be reusing this; in my raw directory, I have my Log.csv file. In the Copy Data tool, I'll choose the option of running it once, and then I'll move on to the next step. Here, we need to choose our source connection. You can see that there are no connections in place, even though we created connections earlier in the Integrate section when it came to Azure Synapse; those are not available here in Azure Data Factory, because this is a separate resource altogether. When we look at the chapters on source control configuration with Git, we'll see some things that can be shared between Azure Synapse and Azure Data Factory. So for now, we need to create a new connection.

So I'll create a new connection, and remember, we are creating a connection for our source. Again, you can see a lot of options for connecting to different data stores. I'll choose Azure, then Azure Data Lake Storage Gen2, and hit Continue. Here, for the name, I'll just mention Azure Data Lake Storage; this will point to our Data Lake Gen 2 storage account. When it comes to the connection, I'll choose my subscription, choose my account name, test the connection, and hit Create. Now remember, this is creating a new connection; it's creating a linked service that will connect to this data store, which is our Azure Data Lake Gen 2 storage account. Once we have our connection in place, I'll browse for my file. I'll go on to my data container, then the raw folder, choose the Log.csv file, and hit OK, and I have the location here itself. I'll remove the recursively option because we are only migrating the contents of one file onto our table in Azure Synapse.
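Coming back to the connection we just created: under the hood it is stored as a linked service definition, which you can also view as JSON from the Manage hub. A minimal sketch, assuming account key authentication and a placeholder storage account name (not values from the course):

{
  "name": "AzureDataLakeStorage",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<your-storage-account>.dfs.core.windows.net",
      "accountKey": {
        "type": "SecureString",
        "value": "<account key>"
      }
    }
  }
}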

Now, before I move ahead, let me go to SQL Server Management Studio and see if we have data in our table. I'll hit execute, and we do have data in our table. Remember, this is the same table we used in the chapters on Azure Synapse. If I go on to one of my earlier SQL scripts, this is the command that was used for creating the table, so you can refer to the same command if you don't have the table in place. But I have made sure that I have my dedicated SQL pool in place, and I also have my table in place. So first things first, let's delete the data from this table; I'll hit execute. Now I don't have any data within this table, which we can confirm. So, no data. Going back to the Copy Data tool, let's go on to the next step. Here we can preview our data, so we can see our data in place.

Here, in the file format settings, you can see that there is support for a variety of file formats. You can choose which character actually separates each column and which character separates each row. You also have the option of treating the first row as the header, and there is other information in the advanced section as well. You can also specify a skip line count; so if you want to start from a particular line, you can choose that option.
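These file format choices end up as properties on the delimited text dataset that the Copy Data tool generates. A hedged sketch, assuming the data container and raw folder from this lab and default delimiters (other settings, such as the skip line count, would appear alongside these):

{
  "name": "LogCsvDataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "AzureDataLakeStorage",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "data",
        "folderPath": "raw",
        "fileName": "Log.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}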

I'll go on to the next step. Now we need to specify our target, so let's specify our target. I'll create a new connection; I'll go on to Azure, ensure that I select Azure Synapse Analytics, and hit Continue. Here, let me leave the name as is. I'll choose my subscription, my Synapse workspace, my database, and SQL authentication. There are many authentication types that are available. I'll go on to my workspace in another tab, take the SQL admin username, mention it here, put in the password, and hit Create so that I have the connection in place. Here I'll choose an existing table, so I'll choose my log data table, and proceed to the next step. It is now performing the column mapping. Now, when you look at the column mapping, remember that for your CSV file, everything is taken as a string, so all of the columns on the left-hand side are strings. On the right-hand side, we have the destination, and the type is being taken based on the type of the columns in our destination. So if you look at our table definition, the ID is an integer, and we have the time as datetime; going back here, we can also see date and time. So I'll leave this as is and go on to Next. Here, we can give a name to our task, and I can scroll down. Now again, since we are copying data onto Azure Synapse, we need to have a staging area. The staging area can be in our Azure Data Lake Gen 2 storage account. Since we already have a linked service for that storage account, I can choose that. I can select Browse, select my synapse container, and click OK.
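Just like on the source side, the Azure Synapse connection created here with SQL authentication is stored as a linked service. A hedged sketch, with the server, database, user, and password values as placeholders:

{
  "name": "AzureSynapseAnalytics",
  "properties": {
    "type": "AzureSqlDW",
    "typeProperties": {
      "connectionString": "Server=tcp:<your-workspace>.sql.azuresynapse.net,1433;Database=<your-dedicated-pool>;User ID=<sql-admin-user>;Password=<password>;"
    }
  }
}

In practice, the password portion would be stored as a secure string or referenced from Azure Key Vault rather than sitting in plain text like this.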

Now I'll go on to the advanced settings. Here, I'll leave the copy method as PolyBase. Now, if I try to click on Next, it gives me an error message saying that PolyBase cannot be used with something known as type conversion. So there is an automatic type conversion that happens in Azure Data Factory, and I need to go back to the previous step. Here, there is a Type conversion setting, so I'll disable this. Now, for the source, you can also mention what the types are, so here I can set the different types. I should see a list of the types in the drop-down, but here in Chrome, I am not able to see all of the types; this seems to be some sort of UI issue. But if I hit Tab on my keyboard, I can see the different types that are available for the source. So I want to select the integer type, Int32, and then press Enter.

So here I'm specifically saying: please treat my ID column as an integer so that it can be mapped onto the integer type in the destination. Next, I'll scroll on to my time column, keep pressing Tab until I reach the DateTime type, and hit Enter. So we have this in place now. If you didn't have these settings in place when you ran your pipeline, you would actually get an error saying that the string could not be converted into a date and time. I'll move on to the next step. Here, everything remains the same, so I'll move on to the next step, then the next, and finally hit Finish. If I go on to the Monitor section, I can see my pipeline in progress. Let's just wait for some time; this should be done within 30 seconds. And within 30 seconds, I can see that it has succeeded. If I now select my data from the log table, I can see all of the data in place. So we've started simple: we have looked at how to copy our Log.csv file using the Copy Data tool.
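As a final reference, here is a hedged sketch of roughly how these last choices (PolyBase, staging, type conversion disabled, and the explicit source types) would appear inside the generated copy activity's JSON; only the ID and Time columns are shown, the remaining column mappings are omitted, and the exact names and values are assumptions rather than copied from the course:

"typeProperties": {
  "source": { "type": "DelimitedTextSource" },
  "sink": { "type": "SqlDWSink", "allowPolyBase": true },
  "enableStaging": true,
  "stagingSettings": {
    "linkedServiceName": {
      "referenceName": "AzureDataLakeStorage",
      "type": "LinkedServiceReference"
    },
    "path": "synapse"
  },
  "translator": {
    "type": "TabularTranslator",
    "typeConversion": false,
    "mappings": [
      { "source": { "name": "Id", "type": "Int32" }, "sink": { "name": "Id" } },
      { "source": { "name": "Time", "type": "DateTime" }, "sink": { "name": "Time" } }
    ]
  }
}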
