DP-203 Data Engineering on Microsoft Azure Topic: Design and Develop Data Processing – Azure Data Factory Part 5
December 19, 2022

29. Lab – Processing JSON Arrays

Hi, and welcome back. Now in this chapter, I want to go through a mapping data flow that uses the flatten transformation when it comes to your JSON-based files, and I'll explain what this means and how we implement it with the help of an example. In Azure Synapse, I'm going to create a customer course table that has these columns. So we have the customer ID, the customer name, something known as "registered," which is basically just a bit value of true or false, and then we have the course, which is a varchar.

So when it comes to a Boolean value, the data type in Azure Synapse is represented by a bit. So first things first: I'll go on to my data container, into the raw directory, and I'll create a new directory here. I'll just name it "customer." I'll go on to the directory, and I'm going to upload a JSON file that I have on my local system. So it's a customer JSON file. I'll click "Open" and then "Upload." If I go on to the file and click on Edit, you can see the JSON contents. Now here we have the customer ID, the customer name, "registered," and then we have an array.

So these could be the courses that the customer has purchased. So, when we put this into a table in Azure Synapse, we want to explode, expand, or flatten this array so that there are three rows for the customer where the customer ID is equal to one — one row for each course. The same is true where the customer ID is equal to two. So we want to go ahead and use the flatten transformation to get this particular implementation in place. The first thing I'll do is create the table in Azure Synapse, so let me switch over to SQL Server Management Studio, where I'm logged in as the SQL admin user.
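For reference, here is a minimal sketch of what that table could look like, along with the rough shape of the input JSON. The column names, data types, and distribution settings are assumptions for illustration; the transcript does not show the actual script.

-- Input JSON shape (illustrative field names):
--   { "customerId": 1, "customerName": "...", "registered": true, "courses": ["...", "...", "..."] }
-- After the flatten transformation, each element of "courses" becomes its own row.
CREATE TABLE [dbo].[Customer_Course]
(
    [CustomerId]   INT          NOT NULL,
    [CustomerName] VARCHAR(200) NOT NULL,
    [Registered]   BIT          NOT NULL,  -- Boolean values map to the bit type in Azure Synapse
    [Course]       VARCHAR(200) NOT NULL   -- one row per course once the array is flattened
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);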

So let me create the customer course table. That's done. Now let me go on to Azure Data Factory. Let me create a new data flow. I'll give it a name and just hide this panel. I'll add the source and choose a new data set. I'll choose Azure Data Lake Storage Gen2 and hit Continue. I'll choose JSON here and continue. I'll just give it a name and choose my linked service for Azure Data Lake Storage. I'll browse for the file: I'll choose the customer directory, choose the customer JSON file, and hit OK. Hit OK over here. So we've got that in place. Now here, I am going to use the flatten formatter, so I'll choose that.

So this will now give us the ability to unroll the array that we have in each JSON object. So here I can unroll by the courses array, and the unroll root is courses itself. Here you can see that you now have the customer stream columns, and courses is also mapped as courses over here. Now, this is only the part where it expands the array of values. Next, let me choose the sink. So I'll choose the sink as the destination and give it a name, with a new data set. I'll select Azure Synapse Analytics, hit Continue, choose my linked service and my table, and hit OK. Then I'll go on to the mapping and disable auto-mapping. Here I need to choose the customer ID; the customer ID is currently coming in as a string. So let me go back to the customer stream and go on to the projection. So this is fine. Now, one more thing that I need to do in my customer stream: if you look at our JSON objects, you can see it is an array of JSON objects — it's enclosed in square brackets, which means it's an array of objects. So in Azure Data Factory, in our customer stream, under the source options, in the JSON settings, I can choose "Array of documents." That will ensure that it treats the file as an array of documents. Once we have this in place, let me publish everything. Hit Publish. Once this is done, I'll create a new pipeline and just hide this panel. Here, let me choose our data flow activity, and under Settings, I'll choose that data flow.

I'll choose my staging linked service and browse for the Synapse folder. Hit OK and just hide this. Let me publish this — hit Publish — and then I'll trigger my pipeline. And then let's wait for this to be completed. Now, once the customer pipeline has succeeded, I'll go on to SQL Server Management Studio, and let's do a SELECT * from the table. And here you can see that for each user, the courses in the array have been expanded, and we now have six rows in our table. So in this chapter, I just wanted to show you the flatten transformation that is available in Azure Data Factory.
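If you want to double-check the result, a query along these lines (using the assumed table name from the sketch earlier) should return one row per customer-and-course combination:

-- Two customers with three courses each gives six rows in total.
SELECT CustomerId, CustomerName, Registered, Course
FROM   dbo.Customer_Course
ORDER BY CustomerId, Course;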

30. Lab – Processing JSON Objects

Hi, and welcome back. Now in the last chapter, I showed you the flatten transformation that is available when it comes to your JSON data. So in that JSON data, for each record, we had an array. Now let's suppose you also have objects within each JSON object in your data. Let me give you an example. So firstly, in my raw customer directory, let me delete the customer JSON file that I have at hand. I'll hit Delete and then OK, and now I'll upload another customer JSON file. So from my temporary directory, I have now copied another customer JSON file. I'll hit Upload; let me go on to the file and click on Edit. Here you can see that in addition to the course information that we have as an array, we also have the details for each customer, which is itself a nested JSON object. So I have two customers over here when it comes to this information. So let me go back onto Azure Data Factory, go on to Author, and go on to our customer data flow. Let me make a change right here. I'll go on to the customer stream and open up my data set, and here in the schema, let me import the schema again from the connection/store. So now it has detected that we have the details, and under the details we have the mobile and the city. So I'll close this.

Now I'll return to my customer data flow. So now I have five columns in total, and if I go on to the flatten stream, I'll leave the unroll by as it is, because this is only used for arrays. But here we have an additional object. So let me reset the input columns over here and manually add each column. So I have the customer ID, which is mapped as is. I'll add another mapping — this will be for the customer name, which is also left as is. I'll add another mapping for registered, and then another mapping for the mobile within the details, which we can leave named as mobile. I'll add the final mapping, which will be for the city within the details. So now we can also take each value of the internal object as well. Now here, I'm going to first drop my customer course table and then recreate it. So in SQL Server Management Studio, I'll first drop the customer course table. Then I'll recreate my table so that I have the mobile and the city as well. Now I'll go on to my customer Synapse sink. Here I'll open the data set, and in the schema, let me again import the schema because it has changed.
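For reference, the recreated table from this step might look like the sketch below. The Mobile and City columns hold the values from the nested details object; as before, the exact names, types, and distribution settings are assumptions.

-- Drop and recreate the sink table with two extra columns for the nested details object.
DROP TABLE [dbo].[Customer_Course];

CREATE TABLE [dbo].[Customer_Course]
(
    [CustomerId]   INT          NOT NULL,
    [CustomerName] VARCHAR(200) NOT NULL,
    [Registered]   BIT          NOT NULL,
    [Course]       VARCHAR(200) NOT NULL,
    [Mobile]       VARCHAR(50)  NULL,  -- details.mobile from each JSON object
    [City]         VARCHAR(200) NULL   -- details.city from each JSON object
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);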

So it's adopted the new schema. Now I'll go back to my customer data flow. Let me add a mapping here in the mapping section. So this is for the mobile — I can see the mobile has come over here. Let me add another mapping. This will be for the city, and it has come up over here. So everything is as it should be. Let me publish everything. Then I can go on to my customer pipeline and trigger it. Right, so I've just made some changes to the mapping data flow to also take those internal JSON objects into consideration. Let's come back once the pipeline has completed its execution. Now, once the customer pipeline has succeeded, if I go to SQL Server Management Studio and select the data, you can see the mobile number as well as the city. So this chapter was another example of how you can take your JSON-based data and transfer it onto a table.

31. Lab – Conditional Split

Now in this chapter, I want to show you how you can use the conditional split that is available in your mapping data flow. For this particular purpose, I'll use our log source data set, which points to our log CSV file. So let me create a new data flow — I'll just hide this — and I'll name it conditional split. Here I'll add my source, and I'll choose our existing data set, which is the log source. I can give the stream a name. Now, what I want to do next is only take the rows where the resource group is equal to new-grp; I don't want to take the rows that belong to another resource group. For this, we can use a conditional split: we can split the rows so that we only have rows that belong to new-grp on one side and all of the other rows on the other side.

So first, let me check if I have any data in my log data table. I do have some data in place, so let me delete the data from this table. So this is done. Normally we had 38,057 rows when it came to all of the rows in our log CSV file. Here I will add a conditional split, and I can name it split resource group. So we have our incoming log stream. Now I can give the split stream a name, say the new-grp stream, and here I can add the condition. Let me open up the expression builder and just hide this. Here I can go on to my input schema, look at my resource group, and say that if the resource group is equal to new-grp, then those rows should go to a separate stream altogether. Let me hit Save and finish. Next, I need to add a stream name for the rows that do not meet this condition, so I'll just give it a name. So now, in the conditional split, we have our original stream of all of the data, the stream where it filters on the resource group being equal to new-grp, and the stream with all of the other resource groups. Now we can add a sink for one stream, and if you want, you can add a different sink for the other stream.

I've enabled data flow debugging, so if I go on to the data preview for this particular step and hit Refresh, then scroll to the right and look at my resource group, I can see that every row belongs to new-grp. Now let me add a sink here and look at my data sets. So here is my log data set; let me choose that and leave the rest as is. So I have now added a sink onto one of the streams, and I'm just leaving the other stream as is. If you want, you can give an output stream name here. Let me publish this mapping data flow. Then I'll go ahead and create a new pipeline, go on to the data flow activity, go on to the settings, and choose my data flow. I'll just hide this, choose my linked service for staging, and browse for my container. Let me publish this pipeline and then trigger it. I'll go on to the Monitor section, and let's wait for this pipeline to be completed.

Now, after some time, we come back and can see that the pipeline run has succeeded. So, let me do a SELECT * from the log data table in SQL Server Management Studio. Here we can see it fetching the rows, and you can see the number of rows is less; it's not 38,057. If you scroll to the right, you'll notice that the resource group is entirely new-grp. So we've filtered the rows using that conditional split step that we have in our mapping data flow.
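Once the load has finished, you can verify the sink with a couple of quick queries. The table and column names below (logdata and ResourceGroup) and the exact resource group value are assumptions based on the lab; adjust them to your own names.

-- After the pipeline run, the sink should contain fewer than the original 38,057 rows...
SELECT COUNT(*) AS row_count
FROM   dbo.logdata;

-- ...and every remaining row should belong to the resource group used in the split condition.
SELECT DISTINCT ResourceGroup
FROM   dbo.logdata;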

32. Lab – Schema Drift

Now, in this chapter, I want to go through the concept of schema drift, which is available automatically as part of your pipelines. We've actually had schema drift in place throughout: if you create a new pipeline, even when you look at the mapping data flow, when you add a source data set, that schema drift option is already enabled by default. So I'll just create a pipeline to explain the concept of schema drift. The first thing I'm going to do is create a new container; I'll give it the name schema and hit Create. I'll go on to the schema container, and here I'm going to upload a file. So from my temp directory, I'm going to upload a log CSV file and hit Upload. Then I'm going to create a directory, just give it a name, and hit Save. I'll go on to the new directory and upload the other file that I have. Now, this first log file has had one column removed: if I click on Edit, here I just have a subset of the records from my log CSV file, and if I scroll to the right, I don't have the resource group column in place; I only have the resource type. And if I go on to the other file — that's the log CSV file in the new directory — and click on Edit, if I scroll to the right here, I do have the resource group in place.

So now, when it comes to your input data, there is a difference in the schema: one file doesn't have one of the columns in place. Now, when you are creating a mapping data flow to, let's say, copy from the source onto the destination, if you have enabled schema drift, then Data Factory can handle the situation where the schema is different across your files. So, first, let me return to my Data Lake Gen2 storage account, and here, let me create a new container. I'll just give it a name and hit Create. Now I'll go on to Azure Data Factory, and I'm going to create a new mapping data flow. So I'll scroll down, create a new data flow, and just hide this. I'll name it schema drift. First, I'll add my source; my source is going to be the log files. Here I'll create a new data set. I'll go with Azure Data Lake Storage Gen2, choose the delimited text format, and hit Continue. I'll just give it a name. Here I'll choose the linked service that points to my Azure Data Lake Gen2 storage account, and I'll browse for the path — it's in my schema container, which I'll choose.

OK, now when it comes to importing the schema, I am going to choose None. The reason for this is that we are allowing schema drift, which you can see is an option that is automatically enabled in the source settings. This allows the schema to be different across our source files. Now, if I import the schema, it will look at the log CSV file that we have in the schema container and import the schema from there. But then what happens to the schema of the other file? So when you want to implement schema drift, it's better to let Azure Data Factory determine the schema at runtime while it's copying each file, instead of importing the schema up front and telling Azure Data Factory what the schema should look like.

That's why I'm choosing None for the import schema, and I'll hit OK. Now that we have this in place, there is also one more option: infer drifted column types. If you want Azure Data Factory to also work out what the column types should be for any new columns that get added to your source, you can enable this option. But for us, we don't need it, because I'm going to be doing a simple copy of these log files as JSON-based files into another container. If you are implementing schema drift and you choose to sink onto a table in your dedicated SQL pool, there is a lot of work that you will need to do to ensure that your table then has the modified columns. Normally, when it comes to your databases, these are structured tables, so they follow a particular structure — a structure that you would already have thought through, architected, and defined.

Trying to apply schema drift for your files onto tables might not be the right approach when it comes to a data engineering pattern. If you are looking at, say, a staging area where you want to take the files, do some sort of transformation, and then move them to a staging area that is not structured, then that's fine. In our case, I'm going to be adding a sink to a JSON file, so it doesn't make a difference, because JSON-based files are semi-structured files; they don't follow any sort of strict schema design. Now, before I add the sink, I'm going to add a derived column. In the derived column, note that no columns are being detected by Data Factory, because we told it not to import the schema — we want it to infer the columns at runtime during the pipeline run. So now I'd like to use the derived column to instruct Azure Data Factory to carry over whatever columns it finds exactly as they are. So here in the column mapping, I'm going to add a column pattern. I'll delete this particular column, and in "Each column that matches," I'll enter a condition.

So, whenever there is a column of type string — and all of the columns in our CSV file are of type string — just map whatever the incoming column name is, which you can address with $$, and when it is copied onto the sink, just keep the same column name. So if our column name is ID, then at the destination it should be ID; if the column name is resource group, it will be resource group; if it's resource type, it will be resource type at the destination. So, in the derived column, we are simply asking it to map the columns as they are.

We are doing this because, remember, our columns are not defined up front; we need them to be picked up automatically at runtime. Now, there are many other things that you can do when it comes to adding a column mapping based on rules — this is a simple one where we are mapping the columns as they are. Then I'll add my sink. So I'll scroll down and choose my sink here; I'll name it the JSON sink. Please note that even here we are allowing schema drift so that the same definition of our schema can be mapped onto our JSON file. I'll click on New here, choose Azure Data Lake Storage Gen2, and hit Continue. I'll choose JSON here and continue. Let me give it a name, choose my linked service, and browse for my container.

So I'll go with the schema drift container here and hit OK. Here again, no importing of any sort of schema; I'll hit OK. Then I'll go on to the settings. Here, in the file name option, I'll choose Output to single file, set the single partition, and name the single file — I'll just say log.json. Now we have everything in place. Let me publish this, and then, once the publish is complete, let me create a new pipeline. I'll name it the schema pipeline. Then I'll choose the data flow activity, go on to Settings, and choose my data flow, which is schema drift. We don't need anything else, so I'll hit Publish now. Once done, I'll trigger the pipeline and hit OK, so it will trigger the pipeline. Let's wait till the pipeline is complete. Now, once my pipeline run is complete, if I go on to my schema drift container, I can see my log.json file in place. If I go on to the file and click Edit, I can see rows 21 to 40, which contain my resource group, and the other rows, which do not — but everything has been copied. So you can make use of this schema drift feature if you have a difference in the schema when it comes to your input.

33. Lab – Metadata activity

Now in this chapter, I'm going to show you how you can use two activities in your pipeline. The Get Metadata activity can be used to get metadata about your source. For example, if you look at Data Lake Gen2 storage accounts, you can get information such as the item name, the item type, the size, et cetera. If you scroll down, you can see that it can also get information for your relational databases as well. Now the first thing I'm going to do is go to my Azure Data Lake Gen2 storage account. I'm going to go on to my data container, and here, let me add a new directory. I'll name the directory "customer." I'll go on to the directory, and here I'm going to upload two files. So here in my temporary directory, I have customer one and customer two. Let me click on Open and click on Upload. If I go to the customer one JSON file and click Edit, it only has two simple objects in place. Each object has a customer ID, a customer name, and registered.

Next, if I go on to customer two and click on Edit, here again I have two simple objects — customer ID three and customer ID four. So we have four customers in place, split across two files. Now, in my dedicated SQL pool, I'm going to create a table, and I want Azure Data Factory to take the information from my files and add that data to this particular table. In addition to the customer ID, the customer name, and the registered information, I also want the file date and the name of the file from which this information is coming. So first of all, let me create this table. I'm logged in as the SQL admin in SQL Server Management Studio, so let me first create the table. Right now I don't have any information in the table itself, so nothing is there. Now I'll go on to Azure Data Factory and on to the Author section. Here, instead of working with a mapping data flow, I'm going to create a pipeline.
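For reference, a minimal sketch of this table is below. The table name, column names, and types are assumptions; the key idea is the three customer columns plus two extra columns that will hold the file's modified date and its name.

-- Sketch of the sink table; FileDate and FileName will be populated from the
-- Get Metadata activity's last-modified date and item name via additional columns in the copy.
CREATE TABLE [dbo].[Customer]
(
    [CustomerId]   INT          NOT NULL,
    [CustomerName] VARCHAR(200) NOT NULL,
    [Registered]   BIT          NOT NULL,
    [FileDate]     DATETIME     NULL,
    [FileName]     VARCHAR(500) NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);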

So I'll create a new pipeline and just hide this. I'll give it a name — this will be the customer pipeline. Here I'm going to use the activities that are available on the left-hand side. First, I'm going to choose the Get Metadata activity: I'll go on to General, choose Get Metadata, and put it on the canvas. Let me give it a name, say get metadata files, and just hide this. Here, let me go on to the data set and create a new data set. I'll choose Azure Data Lake Storage Gen2 and hit Continue. I'll choose JSON and hit Continue. I'll just give it a name here — I'll name it the customer folder. For the linked service, I'll choose my Azure Data Lake storage linked service. Here I'll browse for the data container, choose the customer folder, and hit OK.

Here, let me hit OK. So I have my data set in place. Now here in the field list, I can choose New, and in the argument, you can see that we have several fields available, because this is a metadata activity — we can get information about the underlying files in our particular folder. The first thing I want to do is get the child items: I want this activity to first get all of the items that are in my customer folder. Then, for each file, I want to perform another activity. So here I have to go on to Iteration and conditionals, take the ForEach activity, and connect the get metadata files activity onto it. I'll give the ForEach activity a name and go on to its settings, and here in the items, let me add dynamic content. Here I can choose the output from the previous activity, which is get metadata files, and ask for the child items. So here I can put the child items, and now whatever child items are returned by the Get Metadata activity will be given to this ForEach activity. So let me hit Finish.

Now, within the ForEach activity, we can add sub-activities — for each file in that particular folder, what do we want to do? So here I'll click on Activities. We've arrived at the canvas itself, but our parent activities are showing up over here, and now we can add our child activities here. So what do I want to do? For each file, I want to get the file name and the date. For that, I again need to choose the Get Metadata activity; here, I want to get the metadata for all files. Now I'll go on to the data set and create a new data set, which again points to my files. I'll choose Azure Data Lake Storage Gen2 and hit Continue. I'll choose JSON and hit Continue. Let me just give it a name, choose my linked service, browse for that folder again, and hit OK. Hit OK. Now here, I again want to add the field list; this time what I want is the item name and the last modified date of the file itself. Next, I'll open up my data set and move on to Parameters. I'll create a new parameter — a parameter for the file name. Then I'll go on to my connection, and here I'll add dynamic content.

Here, I'll choose the file name — so the filename parameter of the data set — and I'll hit Finish. So here we are trying to make our data set more dynamic in nature, so that for each file, it will automatically get the file name. Now, how will it actually get the file name? If I go back to my get metadata for all files activity, you can see our customer files data set, and here you can see that it has now added that data set property. So now we want to pass a value for the file name property, which will then be passed on to the customer files data set, and the customer files data set can then be used for each file in our folder as we iterate. So, how do we get the value here? Again, I can insert dynamic content here, and I can use the ForEach iterator's current item for each file.

And here I can get the name of each item, and I'll hit Finish. I know that there are a lot of steps to digest in this particular workflow, but the entire idea is to show you how you can use the Get Metadata activity along with the ForEach activity. If you need to go through this a couple of times, please do, so that you get a good idea of how this works. You might not get it the first time, because we need to add this file name as a data set property before we can supply its value — all of this is being used to take in the content dynamically. Next, I'll choose the Copy Data activity; this will be used to copy onto the sink, so let me connect it up as the copy to Synapse step. I'll go on to the source: here the data set will be customer files, and here again the file name will be the item name. I'll hit Finish. Then I'll go on to the sink and add a new sink data set. I'll choose Azure Synapse Analytics and hit Continue. Here we can give a name, I'll choose my linked service, and I'll choose my customer table. I'll hit OK. Here, if I go on to the mapping, we'll just leave the mapping as it is.

We don't need to do the mapping. If I go on to the sink settings, I can do a simple bulk insert. Also, to ensure that we get the file name and the file date, in the source I have to create new additional columns. So I'll just cancel this. I'm going to add the file date — this is the same column that we have in our table — and I need one for the file name as well, so let me add a new column for that. Now for the file date value, I'm going to choose dynamic content: I'll select get metadata for all files, pick the last modified property, and click Finish. Then for the file name, I'll again add dynamic content: here again it's get metadata for all files, and this time it will be the item name. Let me hit Finish. Once I have everything in place, let me publish this; I'll hit Publish here. So this is all done; let me now trigger my pipeline and go on to the Monitor section. Now that our customer pipeline has succeeded, if I do a SELECT *, I can see all of the user information, including the name of the file each row comes from and the file date in terms of the modified date. So in this chapter, I just wanted to give an example of how you can use the Get Metadata activity along with the ForEach activity as well.
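A final check along these lines (again using the assumed table name from the earlier sketch) should show all four customers, each tagged with the file it came from and that file's last-modified date:

-- Four rows expected: two customers from each of the two uploaded JSON files.
SELECT CustomerId, CustomerName, Registered, FileDate, FileName
FROM   dbo.Customer
ORDER BY CustomerId;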

34. Lab – Azure DevOps – Git configuration

Hi, and welcome back. Now in this chapter, I want to show you how you can use Git for version control of Azure Data Factory pipelines. Normally, developers will use a version control tool such as Git to keep different versions of their code base. You've probably heard of GitHub — this is a service available on the internet where developers can host their Git repositories. So, similar to having repositories where you version control your code, you can also use Git for version control of your pipelines in Azure Data Factory. You have the option of using GitHub or Azure Repos. In our case, we are going to be using Azure Repos when it comes to working with Git repositories with Azure Data Factory. Now, here I am on the page for Azure DevOps.

So, Azure DevOps is a complete suite of tools that helps with the entire CI/CD pipeline, that is, continuous integration and deployment. You can make use of Azure Repos, which can be used for hosting your Git repositories. You can start for free with Azure DevOps: when you click on Start free, you can sign up with the same account details that you use for working with Azure. I've already gone ahead and created my free account with Azure DevOps, so I'm going to sign in to Azure DevOps with my account. So now I'm logged in with my same Azure admin account, and I have access to Azure DevOps. Once you're in Azure DevOps and have created your own account, you can create a new project and give the project a name.

Here I'll set the visibility to private, and I can hit Create. We are only going to be using a handful of features when it comes to Azure DevOps. Now that we are here, we can go on to the Repos part of Azure DevOps, and by default, you have an empty repository in place. So this is a Git-based repository. Now, in Azure Data Factory, you can go on to the Manage section. Here you can go on to Git configuration, and here you can hit Configure. So now you can attach Azure Data Factory to the Azure Repos feature that is available in Azure DevOps. Here you choose the repository type: do you want to connect to GitHub, or do you want to connect to Azure DevOps Git? I'll choose Azure DevOps Git here. Next, I'm going to choose my Azure Active Directory. Normally, you will only have one directory; since I work a lot on Azure, I have multiple Azure Active Directories in place. I'll just choose my default directory and hit Continue. Now I'll select my organization name.

So, on top of the project that we have created, we are actually logged in to something known as an organization, which has the same name as my admin account. I'll choose that, and then I should get a list of projects. I'll select my DP-203 project. Then let me choose the repository name — I'll choose the same repository name. When you create a project in Azure DevOps, by default it will create a default repository with the same name. So this is your Git repository, which will be used for version control. From this page you can now create a new collaboration branch. This branch will be used for hosting the changes that you make to your pipelines in your data factory. I'll hit Create. Here we also have something known as a publish branch. So your collaboration branch is typically where you make changes to your pipelines.

Once you are done with the changes to your pipelines, you will then publish them onto the publish branch. For the purpose of this particular demo, I will not import any existing resources, and let me hit Apply. So the repository has been configured, and now it's asking us to set a working branch; I'll hit Save. So now we have configured our repository. Now, let me go on to the Author section. Here you can see that I don't see any pipelines, any data sets, or any data flows. This does not mean that you have lost your pipelines, data sets, or data flows. This is now giving you a view of the factory resources that are part of the collaboration branch in your repository. You can switch to live mode at any point in time, and there you can see your previous artifacts — your pipelines, your data sets, and your data flows. So no need to worry; you still have everything in place.

Let me go back to Azure DevOps Git mode. Now, let me create a simple pipeline. So let me just hide this and hide this as well, and give it a name. I'll just create a very simple pipeline with a simple Copy Data activity. Now, do you notice something different about the designer? Here, you will see that we have this extra Save option in place. We didn't have this option when working with our live version: at any point in time, if we wanted to save a draft of our pipeline, we could not do that. We had to ensure that our pipeline was complete, then validate it, and then publish the pipeline. But here, at any point in time, you can save your pipeline. So why is this possible? Because everything in your Git repository in Azure Repos is now version controlled. So if I refresh this page, now here you can see a pipeline folder, and in it you can see the copy pipeline JSON file.

So this is now a representation of the copy pipeline that you have created here. Let me just complete the details of the pipeline. I'll go on to the source and create a new source data set. I'll just go with Azure Data Lake Storage Gen2 and hit Continue, select the delimited text format, and hit Continue. I'll create a new linked service — so you can see, we don't even have the linked service in place. I'll choose my Azure Data Lake Gen2 storage account and hit Create. Then I'll browse for my log CSV file and hit OK. First row as header; I'll hit OK. Let me go on to the sink. I'll create a new data set here as well. Again, I'll choose Azure Data Lake Storage, select the delimited text format, and hit Continue. Let me again choose the same linked service, browse for a different container, and hit OK.

First row as header, and I'll choose not to import the schema. I'll hit OK, and here I'll just leave everything as it is. Let me just save everything — so it's saving the pipeline. Now that we have the pipeline in place, you can validate the pipeline; this is also fine. Now, before we can trigger the pipeline, we need to publish it, so let's do that. It is now publishing everything onto that adf_publish branch in Azure Repos. So again, we have our linked service, our data sets, and our pipeline; I'll hit OK. Once this is completed, you can see that it is also generating ARM templates. This is an additional artifact that gets generated for your pipeline when you connect it to Azure Repos. And then you can go ahead and trigger your pipeline. Now, if I refresh this page in Azure Repos, you can see that I have my linked service and my data sets — everything is now showing up as part of the repository. So in this chapter, I wanted to show you how you can add Git configuration to Azure Data Factory.

35. Lab – Azure DevOps – Release configuration

Now, in the previous chapter, we saw how we could integrate Azure Data Factory with Azure Repos. We saw that once we publish, our pipeline — or any changes to our pipeline — is reflected as code in Azure Repos. Now, if you are looking at a normal cycle when it comes to continuous integration and delivery with Azure Data Factory: we established our collaboration branch, which is sometimes also known as the main branch. Apart from that, you could also create another branch known as a feature branch, wherein your developers or your data engineers can keep adding features to a pipeline. So you could have your pipeline built first, and this could be part of the collaboration branch.

Then, when data engineers want to make changes to this pipeline, they might create a new branch known as a feature branch, which will have the baseline of the pipeline. They will make their changes to the pipeline, and once those changes are complete, they will create something known as a pull request to merge those changes onto the main branch, or the collaboration branch. In our case, it was very simple: I was the only person making changes to the pipeline or creating the pipeline, which is why I was making all of those changes in the collaboration branch. But your developers or your data engineers can create multiple branches, which helps in identifying what changes have been made in terms of features in your pipeline. Once all the changes have been reviewed by their peers, they can be published to Azure Data Factory. So apart from the collaboration branch, remember, we had one more branch — the adf_publish branch. Once all your changes are published onto Azure Data Factory, you have one final version of the pipeline on that branch. Now, to go a step further, you might have multiple data factory resources: one resource for development, one resource for staging, and another resource for production.

So what you can do is, if you have a pipeline that has been published in, let's say, your development data factory resource, you can promote that same pipeline onto your other data factory resources. Remember, at this point in time, you have multiple layers: you have your branches within a repository, within a data factory resource. And if you want to share that pipeline with other Azure Data Factory resources, you can use the Azure DevOps suite to publish it to different Azure Data Factory resources. So earlier on, we had published this simple pipeline from our collaboration branch onto our publish branch. Let's look at how we can move this entire pipeline to a completely different Azure Data Factory resource. So first of all, in All resources, let me create a new data factory resource.

I'll look for Data Factory and click Create. I'll choose my resource group and the same location, give it a name, and proceed to the next step. Now, we don't need to configure Git here; Git is normally used in your development scenarios, and this is our production data factory resource. I'll go on to Next for networking and leave everything as it is, go to Advanced, go to Tags, go to Review and Create, and then hit Create. Let's come back once we have the data factory resource in place. Once we have the resource, I'll go on to it and launch Azure Data Factory. And here we should not be doing anything in the Author section.

So if I go on to the Author section, you can see we don't have anything. Now I'm going to go back to Azure DevOps. Here, I'm going to go on to Pipelines and then on to Releases, and let me create a new release pipeline. I'll start with an empty job. I can enter my stage name here and then close this. Here I can add an artifact: I can choose Azure Repos, choose my project, and choose my source repository. For the default branch, I'll choose my adf_publish branch — remember, all of our published artifacts go onto this branch. I'll leave everything else as it is and click on Add. So you can actually create multiple stages in this particular pipeline; this is known as a release pipeline. Here, we can just give it a name. Now, I'll go on to the production stage and choose the job and the task. In the agent job, I'll add a task. I'll search for ARM here, choose ARM template deployment, and click on Add.

Now, I'll go on to the template deployment task and scroll down. Here, I need to choose my subscription, so it is trying to load my subscriptions, and I'll choose my test environment subscription. Now I need to authorize this project to actually make use of this subscription, so I'll click on Authorize. It is now using my Azure admin account to authorize that Azure subscription. So by default, you have a lot of security measures in place: even though we are using this service with our Azure admin account, it is not taken for granted that we have access to our Azure subscription. So this will just take a minute or two.

But once we have this in place, I can choose my test environment subscription — this was just creating a connection to that subscription. I'll go ahead and choose my existing resource group, so it should be the data-grp resource group, and the location is North Europe. Then I'll scroll down. Now, in terms of the template, I'll browse for the template. Here I can go on to my Git repository, and then I can choose my ARM template for the factory — the ARMTemplateForFactory.json file. I'll hit OK here.

For the template parameters, I'll browse again, and now I'm going to select my ARM template parameters for the factory — the ARMTemplateParametersForFactory.json file — and press OK. Then I'll go ahead and select override template parameters, because I must provide the name of my new factory. My new factory is called production factory 1000, so I'll replace the factory name here and hit OK. Now I'll save this release pipeline, and then I'll create a release of it. I'll hit Create, and then I can click on the release. So I've now performed a manual trigger of this specific release pipeline.

Please note that there is a lot more you can accomplish in Azure DevOps. When it comes to the objectives for the exam, there are just two important points: one is how you enable Git configuration in Azure Data Factory, and the other is how you actually promote your pipelines from one data factory resource to another.

When it comes to deploying your applications and databases via pipelines, that is normally covered by other exams like the AZ-400 exam, which focuses completely on Azure DevOps and the entire continuous integration and deployment cycle. So here, at the moment, you can see it is now doing a deployment onto the production stage, and it has succeeded. So what have we done? If I now refresh this page of the production factory, you can see we have one pipeline in place, and we also have our two data sets. If I go on to the Manage section, here in my linked services I can also see my linked service in place. So we have now promoted our entire pipeline from one Azure Data Factory resource to another.

36. What resources are we taking forward

So now, again, just a quick note: what am I going to keep in terms of resources? I'm going to be keeping my Azure Data Factory resource in place, for multiple reasons — I want to reference this particular resource in subsequent sections, and even when it comes to monitoring and security, there are some aspects of Azure Data Factory that we still need to cover. I am also keeping my Synapse workspace in place, and I'm going to keep my dedicated SQL pool in place too — but obviously, I'll pause the pool when it is not required. Again, in the chapters on working with Azure Databricks, we need to have our dedicated SQL pool in place. My Data Lake Gen2 storage account is also staying in place, since I'm going to be making references to the log CSV file and to Parquet-based files wherever required. So I'm carrying all of this forward, just to let you know.
