DP-203 Data Engineering on Microsoft Azure Topic: Design and Develop Data Processing

DP-203 Data Engineering on Microsoft Azure Topic: Design and Develop Data Processing – Azure Data Factory Part 4

December 19, 2022

21. Lab – Changing the Time column data in our Log.csv file

Now in this chapter, we’ll see how to use Azure Data Factory to convert our log CSV file into another log CSV file. But the change that we want to make is that we want to format the time column. Earlier on in some of the chapters, I had shown you that I had a log CSV file in a folder that is actually called the data container. So now I have gone ahead and deleted that file. But I told you that I would show you how to generate this file. So I want to make sure that this time column is in a general format when it comes to having the date and time and not in this long date and time format. So we are going to be using mapping data flow for this particular use case.

Again, you can try different ways to implement this. In Azure Data Factory, I’m showing you how to use the mapping data flow. So in Azure Data Factory, I’ll go on to the authors section. Here I’ll create a new data flow. So just give a name. At the same time, I’ll also enable data flow debugging. I’ll hit on OK. Let me just hide this for the moment and let me add a source, give it a name, and let me choose a new data set. I’ll go with Azure Data Lake. hit on “Continue.” I’ll choose my delimited text file. Just give a name. I’ll choose my linking service. I’ll browse for the file path. So it’s data; it’s in the raw directory. It’s my CAC log file. I’ll hit OK.

The first row is a header. I’ll hit OK here now; once this is done, I want to now add a derived column. So I’ll just put the name “change time” here. Here, I’ll scroll down. Let me just hide this. So I want to now choose my time column, and here I’ll enter or open up the expression builder. Now in the expression, I’m using the two timestamp functions to convert my string. So when it comes to my data source there, my time is coming in as a string because it is being picked up from the CSV file. I’m not also making a change by telling us your data factory how to infer that column. I’m leaving; it has a string. Then I take the substring from index position zero to index position ten of the time column.

This will give me the date. Then I’m putting a space here. I’m using the concatenate string function to concatenate the date that is being extracted. There is a space here, and then I take the time to specify what the format should be. So let me just hit save and finish, and let us wait for this data flow debug to be in place. Now that we have the data flow debugging in place, if I go back to the expression builder and hit refresh. So you can see that we are getting the desired output, where we have the date and time separated by a space. So here, the first substring function is being used to extract the first ten characters. So starting from index position zero up to a length of ten, the next substring was to only take the time and then use the concatenation function to combine the strings.

And then I’m converting it to a timestamp. Again, let me save and finish. If I go on to the log source again, if I open up the data set here when it comes to the schema, I have not made any change. It is deducing the schema provided here. If you actually wanted, if you went on to the time flow, if you went on to the log source, if you went on to projection, you could project the types in your source. So for the time being, I’m leaving it alone because I only want to change the time. Now, let me add a sync here. I’ll create a new data set again. Here on Continue, deliver a text file here on Continue, just give it a name, and select my link service. I’ll just browse for my data container in the clean directory. This time I’ll hit OK; I won’t import any schema. The first row has the header, and let me hit OK. Now in the sync, I’ll go on to settings, and here, in terms of the file name option, I’ll output onto a single file. Here I’ll set it on a single partition.

That should be fine. And here I’ll give the name as “log CSV,” otherwise it will split the file into multiple parts. Let me now publish everything. I’ll hit publish, then let me go on to the pipeline. So I’ll just make a new pipeline to finish the publishing. I’ll give a name to the pipeline. I’ll just hide this here. I’ll go ahead and put the data flow here. I’ll proceed to Settings and select my data flow, after which I’ll publish this. We don’t have to make any changes to the pipeline. Then let me trigger the pipeline. OK, I’ll hit on it. I’ll go on to the monitor section, and let’s wait until this is complete. Now, once the pipeline is complete, I can hit refresh, and I can see that it has succeeded. If I go and refresh this particular folder, I can see my Log CSE file. And when you look at the time column in this file, you will see that it is as per the date and time specification that we have mentioned as part of the mapping data flow. Right, so this marks the end of this chapter.

22. Lab – Convert Parquet to JSON

Now, in this chapter, I want to show you how to convert the PNG-based files that we have onto a JSON-based file. So again, we’re using the Copy Data tool in this particular case. So if we go on to our containers, if I go on to my data container, if I go on to my raw folder, if we go on to our parquet-based folder, here we have three files in place. Now, I want to use the Copy Data tool to take all of these files and only generate one log JSON-based file. So, in the data again, let me go back to the raw folder. I’ll actually delete the log JSON file that we have. Allow me to access Azure Data Factory. I’ll continue on my way home. Otherwise, let’s go on to authors and create a new pipeline.

So I’ll create a new pipeline. I’ll convert this parquet to JSON, hide the properties, use the Copy Data activity, and give it a name in the source. So I’ll create a new data set. I’ll choose Azure, Data, and Lake. Hit on. Continue. Choose parking. Hit on. Continue. Simply enter your name and select my Link service. I’ll browse for the files. So on the park directory, hit OK. I’ll just import the schema so it’ll actually read all of the parquet-based files. I’ll hit OK. Yeah, I need to put a wildcard file path so that it takes all of the Patty-based files. Yeah, we can choose the recursive option. Now I’ll go on to my sync. I’ll come up with something new. Again. I’ll choose Azure, Data, and Lake. Gen two. I’ll hit “continue.” I’ll choose JSON here and continue. So I’ll name the data set “log JSON.” Choose my Link service, and I’ll browse for the file. So I want it to be in the data container in the raw directory.

I’ll hit OK. And I’d like to call it “log JSON.” I’ll hit OK. So it’s saying there’s no schema because we don’t have any schema in the path and we don’t even have the file name in place. So I will not import the schema, and I’ll place it as “none.” I’ll hit OK. So the reason I did not change the setting first is because I want you to see what error you can get if Azure Data Factory can’t detect the schema because we don’t even have the file in place. When we choose Import from Schema from, let’s say, Azure Synapse, where we already have our table in place over there, it can actually detect what the underlying columns in the table are and what the underlying data type is, but we don’t even have our log JSON file in place as of yet.

That’s why I’m choosing the import schema, which has none for now. I’ll hit OK. Now, here in the Copy behavior, I’ll choose to merge the files so that all of the park-based files will actually be mapped onto one single log JSON file. Let me hit publish. I’ll hit “Publish” over here. Once this is done, I’ll trigger the pipeline. I’ll hit OK, and I’ll go on to the motor section. Let’s wait until this is complete. Once the pipeline has succeeded and you can hit refresh, you can see the run end. And as you can see, it was a success. If I go to my directory and hit refresh, I can see my log dot JSON file in place. Right, so this is around 14.42 MB. Now, in the next chapter, let’s see how we can transfer the data from this file onto Azure Synapse and then onto our table.

23. Lab – Loading JSON into SQL Pool

So in the last chapter, we saw how we could use the Anzio data factory to generate our JSON-based files. Now let’s make use of that and actually create a pipeline to take our data from that log JSON file and basically copy it onto Anzio Synapse. But let’s change the existing pipeline itself, which I’ll do over here. So I’ll go on to the author section. I’ll just hide this. So we’ve completed our parquet to JSON copy activity. Here, let me add another copy activity. And here, let me ensure that I connect both of the copy activities together. So I’ll go on to parse JSON. So this is hazardous. I’ll go on to copy the data over here. And let me rename this in the source: I am going to choose our source data set as the log JSON. Right? This will be the file path in the data set.

Leave everything as it is. I’ll go on to the sync. Now I’ll go on to the sync and create a new sync data set. Yeah. I’ll choose Azure Synapse. I’ll hit “continue.” I’ll give the name of the data set. I’ll choose Azure Synapse. I’ll choose my log data table. I’ll hit OK; now I’ll go on to the mapping. Let me import the schemas. So here we can see that the schema is being imported from the JSON-based file. And here it is being mapped onto our table. Let me just hide this for the moment. I want to go on to settings. I want to enable staging. Since we are copying it onto Azure Synapse, I need to choose Data Lake again. Browse for “Synapse.” Hit OK, so everything is in place. a mapping; a sync; a source. Right? So we want now to ensure that once our park basedfiles are converted onto JSON, then it will go on tothe next step to copy everything onto our Synapse log table. So, first things first. Let me delete this JSON log file. So that’s done.

Now I’ll go on to SQL Server Management Studio. Let’s delete the log data file. So we have nothing over here. Let us choose a star from the log data. And the number of rows that we had was three: 800:57. So it’s taking the number from here. So let’s do a select statement from the log data. Hit execute. So nothing is in place. Let’s now publish our pipeline. I’ll hit publish, and I’ll now trigger the pipeline. Okay, I’ll hit on it. I’ll go on to monitor, and let’s wait for the pipeline to complete. Now I can see that our pipeline has succeeded if I hit refresh, so it’s done. If I go on to my data directory and hit refresh, I should see my log JSON file in place. And here, if I select all, I can see all of my data. Here you can see the number of rows: 38,057. So in this chapter, I just want to show you that you can add multiple activities to your pipeline.

Right, so we had two activities. One was to convert apart-based files onto JSON, and the next was to take the JSON-based files onto Azure Synapse. Obviously, you can just take one step to take the Parquet-based files and load them onto Azure Synapse. But in this particular scenario, I just wanted to show that you can have two activities in your pipeline. In fact, here you might also have a mapping dataflow activity that can make some sort of transformation to the data before it is loaded onto Azure Synapse. So it’s up to you. This is your canvas. You can go ahead and create the activities that will be part of your pipeline.

24. Self-Hosted Integration Runtime

Hi, and welcome back. Now, in the next set of chapters, we’ll look at the implementation of the self-hosted integration runtime. Now, so far we have looked at examples at transferring data either from an Azure SQL database or from Azure Data Lake Gen two storage accounts. And at that point in time, we were using the Azure integration runtime.

So the Azure integration runtime already has support for connecting to a variety of data sources. But let’s say that you have a data source that could be hosted on a virtual machine. This machine could either be in Azure or it could be on your on-premises infrastructure. Let’s say you want to move data files or a SQL database that is hosted on this VM onto, let’s say, Azure Synapse. Then we have to install the integration runtime on this VM and then register the server with Azure Data Factory. So if you have your own custom system on which you want to take that data and then transfer it using Azure Data Factory, you then have to go ahead and make use of the self-hosted integration runtime.

Now, in our particular example, what we are going to do is first launch an Azure virtual machine. So you have the Virtual Machine Service available in Azure, using which you can create a VM on the cloud platform. This virtual machine will be based on Windows Server 2019. And then on this VM, we are going to install the NGINX web server. Then I’ll also install the self-hosted integration runtime on this VM. We will then connect our VM with an integration runtime to Azure Data Factory, and then we’ll copy one of the log files that is generated by NGINX onto a table with the help of Azure Data Factory. Now, I’m not going into details on Windows Server or on the Virtual Machine Service because that is not in the scope of this exam here. The entire purpose is to show you what the feature is or what the purpose of using the self-hosted integration runtime is. So let’s go on to the next set of chapters to see how we can implement this.

25. Lab – Self-Hosted Runtime – Setting up nginx

So, first things first. Let’s go ahead and create an Azure virtual machine. Now, for this, I’ll go onto the “Virtual Machines” section. And here I’ll create a new virtual machine. I’ll just hide this. As a result, we’ll be presented with a wizard for creating a new virtual machine. So this helps you have a virtual machine or a machine in the cloud. Now, I need to choose my existing resource group. I need to give the VM a name here. I need to choose the region. I’ll leave it as “North Europe.” In the availability options, I’ll choose no infrastructure, no redundancy required, any image, and I’ll leave it as a Windows Server 2019 data center. There are many images available.

So based on the size that’s chosen over here, we get two virtual CPUs and eight gigabytes of memory that will be assigned to the virtual machine. Here you can see the estimated cost per month. We’ll just be running this machine for just an hour for our scenario, or even under that. Next, I need to give details for the administrator’s account. So, you’re going to be using the username and password that you mentioned over here to log into the Azure Virtual Machine. So ensure that you remember the username and password that you enter. So, I am entering the username “demo user.” And when it comes to the password, I’m mentioning Azure Adhere 1, 2, 3. And next, when it comes to the password itself, there are some constraints on the password itself.

So please take note of these constraints. Now, here’s a password that meets all of the constraints. You can specify your own password. I’ll scroll down. I’m also going to select port 80 here in the inbound ports. So, we are going to be installing the NGINX web server on this machine. So, web servers normally listen on port 80. So Just has a good method of implementation. We must ensure that we select the inbound port of 80.

So if you want to ensure that the web server can listen to these requests, you can choose port 80. I’ll go on to Next for the disk; I’ll leave everything as it is, so it will allocate one OS disc onto that underlying Virtual Machine. You can also go ahead and create an additional data disk. So I said I wouldn’t go into detail about the service because it isn’t covered in the exam. What is in the scope of the exam is the self-hosted integration runtime. Now, your virtual machines need to be part of a virtual network. So you will have a network that is an isolated part of the cloud that is going to host your virtual machine. So this wizard will create a new virtual network for you.

And within the network, you need to have a subnet in place. That subnet within that virtual network will be hosting the virtual machine. It will also create that subnet for you. It will also create a public IP address. This public IP address will allow Internet access to this virtual machine. Now, please note that in SEO, the virtual machine will be a resource. Your virtual network will be a separate resource. Your public IP address will be another resource. So this wizard is going to create multiple resources for you. Next, I’ll leave everything else as it is. I’ll go on to Next for management. I’ll leave everything as it is, go on to advance, go on to tags, go on to review, and then create. And next, I’ll go ahead and hit on Create. Here you can note the cost per hour. So I’ll hit “create.” This will take around four to five minutes.

Let’s come back once we have the virtual machine in place. Once the deployment is complete, I’ll move on to the resource. And here we are on the page for the virtual machine. Now we are going to connect to this virtual machine. So I’ll hit on Connect and choose RDP, which is the Remote Desktop Protocol that is used for connecting onto a Windows machine. I’ll choose this. And here we can download the RDP file. Now I’ll click on the file that says “Download” and hit “Connect.” And here we need to mention what the username and password were that we mentioned during the creation of the VM. So now we are connecting to an Azure VM in the cloud. As you can see, you are now receiving a virtual machine based on Windows Server 2019. Now, by default, it will launch something known as Server Manager.

Now, I want to install the NGINX web server. So first, I’ll go on to the local server. Here I’ll go on to the enhanced security configuration. Currently, it’s on. I’ll just turn this off and hit OK so that we can now download something via Internet Explorer. So now I’ll enter the URL for Internet download. Don’t worry; I’ll ensure the URL is attached as a resource. On to this chapter. Now, let me click on the download button here. Now, when we come on to the mainline version, Now, here, I’ll choose the stable version. I’ll choose NGINX for the Windows part. And here it will ask us to either open the file or save it. So I’ll save the file by clicking Save. As a result, a zip file will be downloaded. I’ll open up the folder. I’ll copy the zip file.

So this is on the Azure VM. Now I’ll go on to the D drive. I’ll paste this. I’ll right-click, and I’ll extract everything. I’ll click on extract. So now we are on the D drive. In that particular folder, I’ll go into this folder. So here is the application. Now, on this machine, I’ll go on to the Windows menu. I’ll just type in “run” so that we can actually run the command prompt. Now I’ll right-click on the command prompt and run it as an administrator. Now, I want to move on to this directory. So, what I’ll do first is copy this here, make a CD, and then move on to that directory. I’ll just right-click what I’ve copied onto the clipboard.

Then I’ll go on to the D drive. And here, as you can see, we are in that folder. Now that we are in this folder, I’ll just run NGINX. Now, after hitting “enter,” let me just minimise this. If I go on to the NGINX folder, if I go on to the log files, if you see a process ID over here, that means you do have NGINX. Now, running on the server here, you can see an access log file, and you can also see an error log file. Now, on Internet Explorer, on the same machine, if I now go to “local host,” I can now see the NGINX home page. That means we do have NGINX on the server. If I go on to the access log file now, I can see the details of the access because you have access to the file from our local machine. So this was step number one: just to have a web server running on a virtual machine.

26. Lab – Self-Hosted Runtime – Setting up the runtime

So in the last chapter, we created an Azure virtual machine and installed a web server on this VM. Now, I want to copy this data that I have in the access log file onto a table in Azure Synapse. Assume your security department wants to know what IP addresses are being used to connect to the web server and what type of request is being made, because all of this information is being recorded in the access log file. So now I want this information from the access log file to come into a table in Azure Synapse so I can make use of Azure Data Factory.

For this particular process, I can have an activity in place to copy the data from the access log file onto a table. And for this, we are going to be using the self-hosted integration runtime. Now, if you are asking yourself, “We have an Azure VM in place; why can’t we use the Azure integration runtime since it is an Azure resource?” If you create a data set in Azure Data Factory and select Azure, you will notice that there is no option to retrieve data from an Azure Virtual Machine. For that, you have to install the integration runtime and then register that integration runtime with Azure. So here, if I go on to the Manage section, I can choose integration run times. Let me just collapse this. And here we have one standard Azure integration runtime. I can hit the new button to create a new integration runtime.

So the first option is to create either an Azure or a self-hosted integration runtime. I’ll choose that, and here on Continue, I’ll choose the self-hosted option and hit on Continue. Now I need to give the integration runtime a name. So I’ll just give it the same name as the virtual machine. I’ll hit “create.” Now here it is saying how to set up the integration runtime. So we first need to set up something on that virtual machine. This is in terms of the express setup. In terms of the manual setup, you have to download and install the integration runtime. So let’s choose this. And we have to ensure that we use one of these authentication keys so that the integration runtime on our virtual machine will register itself with Azure Data Factory.

So, let me click on this link, and it will open up a page. I’ll copy this link, and I’ll go on to Azure VM. I’ll open a new tab with the link here. And now let me hit the download button. I’m going to choose the latest version of the integration runtime. Then I’ll type “Allowance” and press the “Run” button. So now it’s running the setup for the integration runtime. So first, it will download the integration runtime and then run the setup. I’ll proceed to the next one. So, once the download is finished, it will launch this wizard in accordance with the terms of the licence agreement. I’ll proceed to the next one. It will install it in this default location. I’ll go onto Next, and I’ll click on Install.

This will just take a minute or two. Once the setup is complete, I’ll hit the finish line here. We now need to register the integration runtime. And here we need to add the authentication key. So, returning to Azure Data Factory So we have these two keys over here. We can take either key one or key two. So I’ll copy key one, paste it here, and hit on Register.So now this integration runtime is resting. This virtual machine has a runtime on the Azure data factory. I’ll hit “finish.” Now I’ll go back to Azure Data Factory. I’ll close the screen. Demo VM is now visible here. You can see the status is unavailable. It just takes a minute to ensure that the self-hosted integration runtime is made available.

If I click on Refresh now, you can see the status is running if I go on to my integration runtime; if I go on to the nodes here, I can see my node in the running state here. If you click on the edit for Limit Concurrent Jobs, you can limit the number of concurrent jobs that can run on this integration runtime. Remember, you can have many pipelines that can use this integration runtime. So you can limit the number of jobs that are actually running via this integration pipeline. On our Demo VM virtual machine, at the moment, there is no limit because we have mentioned zero. That means there’s no limit on the number of concurrent jobs. Now that we have the integration runtime in place, let’s go on to our next chapter to see how to build our pipeline.

27. Lab – Self-Hosted Runtime – Copy Activity

So now that we have the self-hosted integration runtime in place, it’s time to make use of it. Now we’re going to develop a pipeline that will consist of two activities.

One will be a normal copy activity, and the other will be a mapping data flow. Now, the copy activity will be used to copy the access log file from our server onto the Azure Data Lake Gen 2 storage account. So, if I go onto our server, note that we want to copy all of the contents that are in this particular folder, the access log. And we want this to be split up into columns of data that will be sent onto a table in Azure Synapse. So here I’m showing the table that we’re going to create in Azure Synapse in our dedicated pool. So, I have some columns that will actually map onto our data.

So, I’m not taking all of the data; I’m taking some of the data from each row and then adding it on to the column data values in our table. So, I said the first activity was going to be a simple copy activity. And then the mapping data flow will take our data from our Azure Data Lake Gen 2 storage account and then send it on to Azure Synapse. So, let’s go ahead and first create our copy activity. So here, let me create a new pipeline. I’ll just hide this. I’ll hide this as well. I’ll give a name to the pipeline. Next, I’ll put in the copy data activity. I’ll also go on to my storage account. I’m in the raw folder of the data container; let me add a directory. So I’ll give NGINX a directory name and select safe. So in this folder, we don’t have anything in place.

Here I’ll give a name for the activity. Here I’ll go on to my source. I’ll create a new source data set. Remember that we want to select an honourable file as your virtual machine this time. So, I have to choose a file as my data source. And here I have to choose a file system. I’ll hit “continue.” Now I’m going to be choosing a binary file. The access log file is no longer a delimited or comma-separated file. The Sebit file is not a JSON-based file. It’s some file that has rows of data. So I’m trying to treat this as an abeyant object that needs to be copied hazardously from our server onto Azure Data Lake Generation 2.

So here, I’m going to choose binary. And here, let me give you a name. Now, in the link service, I have to create a new link service. Here I can give the name of my link service. And here now instead of the integration runtimewhich is provided by a zero, I canuse my demo VM integration runtime. And here we need to specify the folder. So here is my access.dot logfile in this particular location. So I’ll copy this. I’ll put it on the host’s server. Here I need to mention what the user ID and password are for the virtual machine. I’ll test the connection. So this is in place. I’ll hit “create.” Next, I’ll browse for the file. So now you can see all the files in that particular folder. I’ll hit OK for the access log file, and I’ll hit OK here; let me hide this. Now let me go on to the sync here. Let me create a new sync data set.

I’ll go with Gen 2 for your data lake storage. I’ll hit “continue.” I’ll choose the binary format again because I said I want to copy the file as is. I’ll choose to continue. Allow me to provide the name in the link service. I can select my Azure data lake storage, and then browse for the folder where the raw data will be stored in the NGINX directory. I’ll hit OK. And here I am going to mention the name “hasaccess lock” because I want to copy the file “hazardous.” I’ll hit OK. So we’ve got normal copy activity in place. I’ll just go to the source, make sure everything is fine, go to general; everything is fine. So everything is in place. Let me publish this particular pipeline. Once this is done, I’ll trigger the pipeline. So now it will run the pipeline. So this is my pipeline. Just ignore these other ones. This is something I’ve done earlier on.

So I can see it has succeeded. If I go to my directory and hit refresh, I can see my access log file. If I go on to the file and click Edit, I can see the information as it is, right? So, in this chapter, I want to complete the first activity, which is the copy activity. And in the next chapter, we’ll go through the mapping data flow. Now, there are two reasons as to why I’m using this two-step process. So the first reason is that normally you will take your data from different sources and ingest it into the Azure data lake. Remember, your data lake needs to be a central place for all of your files. The second point is that from a mapping data flow, you can’t take a file from a virtual machine. So there’s no facility for using that in mapping data flow when it comes to the source. So, because of these two reasons, we’re doing this as a two-step process. We are done with the first step. Let’s move on to the next chapter to perform the second step.

28. Lab – Self-Hosted Runtime – Mapping Data Flow

So in the last chapter, we completed our copy data activity. In this chapter, let’s look at the mapping data flow that we are going to add as part of our pipeline.

So firstly, I’ll go on to the authors section, and here let me create—sorry, a new data flow. Hide this. I’ll give you a name. Now, for this mapping data flow, I’m going to make use of the data flow debug option. So I want to show the benefit of using the dataflow debug option when you’re creating your mapping data flow. Now, remember, as I mentioned before, you have to pay for the running cost when it comes to the underlying machine, which will have eight cores. So remember, there was a price that was shown before when it came to the pricing for debugging as well. So the debug option actually creates Apache Spark clusters that you can actually debug your flow on. So I am going to enable this, right?

So the good thing is that the debug time to live is given as one R, so that if you’re not doing anything, the session will expire and it will stop that compute virtual machine so that you don’t bear the cost. This is a good aspect. So I’ll hit or, okay, so it’ll start that dataflow debug, which could take four to five minutes because the initiative spins up those ApacheSpark clusters in the back end. In the meantime, let’s go ahead with our mapping data flow. So, first and foremost, my source So now my source is going to come from my Azure Data Lake Generation 2 storage account. So, let me give this stream a name.

Here. I’ll choose something new. When it comes to the data set, I’ll select your data lake from Gen 2 and press the Continue button. Now this time I’m going to choose the option of a delimited text file and hit Continue. And here, let me give you a name. Here, I’ll choose my AzureData Lake storage link service. I’ll browse for the file in Gen X. I’ll choose the access log file. I’ll hit RK. Now I won’t import the schema as of yet, right? I’ll click on “none” and let me hit “OK.” Now let’s wait until the data flow debug is initialized. Now, once we have the cluster in the ready state, I can go on Data Preview, and here I can hit Refresh.

So now we can actually see the data as it is being processed in the mapping data flow. Again, it just takes a minute or two for the data to be refreshed over here. So, once we have the data in place here, you can see that there is only one column. So everything is taken as one complete string, and it is given as only one row, which has one column. We now need to divide our data into multiple columns so that we can begin using those different columns to insert data into our table into zero synapses. So remember, this is not a comma-delimited file, but each field over here is separated by a space. So we can now use a space character to signify that there is a separation between the different fields.

So I can go to my source settings, open up my data set, and here I can choose what is a column delimiter. Now I see that there is no way to give a space. So for that, I can click on Edit, and here I can just press the space bar once to mention a space. And I can now leave everything else dangerous behind. And now let me go back onto the access log, flow data onto Data Preview, and let me hit Refresh. And now you can see we have made some improvements. We can see that things are now being split into multiple columns. We have a total of ten columns, and there is only some information that I want. So here, when it comes to the remote address, that is signified by column zero. I don’t need columns one and two. Then there’s the local time, which I’m representing as Aarca for this particular demo with this field.

However, you will notice this excerpt in square brackets. That’s something that we need to remove; we shouldn’t have that in our data. Next, I don’t need column four. Now column five will actually map to my request. Then we have the status and the bytes. So the status will be in column six, the bytes will be in column seven, and then columns eight and nine will actually map onto my remote user and my user agent. So let’s now keep this going so we know that we’re getting our information in place. Now I’ll go on to the projection section and import the projection so that we now get the columns in Data Preview.

We now know that Azure Data Factory can now split the data into different columns. And now let’s import the projection. Here you can see that for columns six and seven, which are actually the status and the bytes being represented in the short integer format, Now next I want to choose my columns of interest. So I’ll choose the select schema modifier here; let me go ahead. So let me just hide this. So I do want column zero. I want that map to have the remote address, so I’ll change it here. Now I don’t need column one and column two, and column three will be mapped onto our local time local.

We didn’t need column four as well. And then we wanted column five. We also wanted this column, then the bytes, then the remote user, and finally the user agent. So I’m selecting my columns here and also naming them. Let me go on to data preview, and let me hit again on refresh. Now, if you’re asking me, “How do I know what values should map to what columns over here?” So, I’ve done some research on understanding the NGINX access logs, and then based on each field in each row in an access log, I am trying to map it onto the correct column names here. So based on the documentation for NGINX, they are saying that the first field actually denotes the remote address from where the request is coming from.So based on that documentation, I have ensured to map it to the right columns. So here we can see our remote address, our local time, our request, our status, the bytes, the remote user, and finally the user agent. Now next we need to make somechanges, especially on to Time Local. So now let me use the derive column modifier here.

I want to change some columns. So what is the first column I want to change? I want to change the time Local column.And here I’m going to open up the expression builder. We can now take our input schema and use it in the expression where I want to use the substring function. So we go on to input the schema. Let me take the Time Local string, and I’m going to now extract a particular substring from the Time Local string altogether. So remember, I now want to ignore the first character, which is a square bracket, and then take all of the other characters after that to represent the time. So for that, I’m going to start with an index value of two, right from the second character, and then up to a length of 20. So I’m assuming that the time values are of a fixed length. Please note that there are other string functions available if you want to make this more dynamic.

Now, since we’ve enabled that data flow debugging here in the Data Preview, I can hit Refresh so that we can see what the impact of our expression is, which is again very, very important. If you want to see if an expression is actually working as it should, you can use the Data Preview feature that is available with the data flow debug. Remember, all of this is running in the background on that Apache Spark cluster. Here you can see the original values as well as the output. So we’re removing the first character and starting over with the second, and we’re using all of these characters as output. Click on “Save” and “Finish.” Now, next, I also want to ensure that I take the status and the bytes because currently these are in the short integer format, and I want to convert them to a proper integer format. So there is a two-int integer function.

And here I just need to add what my input schema is. So the first will be status. I can now go on to the bytes. Let me copy this. I can go on to the bytes here. So here you can see all of your derived columns. And here, I can just put bytes on save and finish. So here you can see all of the derived columns. Now, finally, let me go on to my sync. So I’ll put this in the server logs, Synapse. Now, before I can create a new data set, let’s create this table. So I’ll go on to SQL Server Management Studio on my new pool. I’ll right-click. Hit on “New Query” and let me create the table so that’s in place. Let me now create the new data set.

Choose Azure Synapse. Hit Continue, give a name, and choose My Link Service. Choose my table. Now we’ll move on to the mapping. Let me disable auto-mapping to ensure everything is fine. Yes, let me go on to the data preview and hit on Refresh.I want to ensure everything is as it should be. So everything seems to be fine. I’ll now publish my data sets and my data flow. Right, the publication is finished. Now I’ll go on to my access log pipeline. And then I’ll add data flow after the copy data activity. I’ll just give it a name in the settings. I’ll choose my access log. Data flow. I’ll need to choose my Staging Link service carefully. I’ll browse for the Synapse container. I’ll connect both of these together.

Let me publish everything and also ensure that I delete this file. Because as part of the Copy Data activity, it should copy this file, right? Once this is in place, let’s trigger our pipeline and go back once we have the pipeline complete. Now, once this pipeline has also succeeded, if you look at your directory, you can see the access log file. And now that we’re in SQL Server management studio, let’s select a star from the server logs, and you should be able to see your data in place. Also, if you’re following along, you should disable the dataflow debug if you haven’t already. Since we don’t require it anymore, ensure that it is disabled so that you don’t pay an additional cost.

Uncategorized

Related posts:

Leave a Reply Cancel reply