DP-900 Microsoft Azure Data Fundamentals Topic: Data Ingestion and Processing
December 20, 2022

1. Modern Data Warehouse

Now we’re going to dig into data warehouses a little bit more, because one of the requirements of the exam is to describe the components of a modern data warehouse. And so in this video, we’re going to take a look at a typical architecture diagram of a data warehouse situation and break it down in terms of where the data is coming from, where it’s going, how it’s getting there, et cetera. So I took this diagram from Microsoft’s website. You can see here that we have some type of process going from left to right: ingesting data, storing it, training a data model, and using Synapse Analytics to feed into Analysis Services, which feeds into Power BI.

And so we’re going to start off at the absolute left of that diagram, which is the data sources. So within a data warehouse scenario, all the data comes from somewhere else, right? There’s no system that is writing data row by row, line by line, like there would be in an OLTP system. So the data in a data warehouse is always ingested and loaded from another source. Now, that could be structured data, which means it’s already sitting in a database of some sort with columns and rows, and we’re just effectively copying and transforming that data. Or it could be unstructured data, which means it’s basically text files, images, or other things that are sort of sitting there and need to be structured before they can be put into a data warehouse.
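To make that distinction a bit more concrete, here is a small Python sketch; the record and field names are just made up for illustration.

```python
import csv
import io
import json

# Structured: tabular data with a fixed set of columns, like a database row
# or a CSV export of one.
csv_text = "order_id,customer,amount\n1001,Contoso,250.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["amount"])       # every row has the same columns

# Semi-structured: JSON carries its own field names, and each document can
# have a slightly different shape.
doc = json.loads('{"order_id": 1001, "customer": "Contoso", "tags": ["priority"]}')
print(doc.get("tags", []))     # optional fields may or may not exist

# Unstructured: free-form log text or binary data such as images has no
# schema at all; it needs processing before it fits into warehouse tables.
log_line = "2022-12-20 09:00:01 INFO order 1001 accepted"
```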

And so, looking at the examples here, you can have CSV files or JSON files, any kind of log output, things like that. Those are generally considered semi-structured or unstructured data. So you’ve got one or more data sources, and they don’t all have to be the same type; you can pull the data in from wherever it sits. Now, the first component that does the work here is called Azure Data Factory. Azure Data Factory is what they call a data integration service. With it, you can create activities, pipelines, and processes to move data from one source to another. We’re going to talk about Data Factory in more detail in the next section, but for now, we can see that this is the first step. It’s also called an orchestration service because it’s basically controlling the flow: it takes the data from its source and puts it into the next step of the workflow. Now, in this particular diagram, the data isn’t being processed yet; it’s just being loaded. So this is an ELT-type situation where, first of all, the data is extracted and loaded into, in this case, a data lake.
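Data Factory itself is normally configured through the Azure portal or JSON definitions rather than hand-written code, but to make the extract-and-load idea concrete, here is a hedged Python sketch using the azure-storage-blob SDK to copy raw files from a source storage account into a data lake landing container. The account names, containers, paths, and connection strings are all placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection strings; in practice these would come from secure
# configuration such as Azure Key Vault.
SOURCE_CONN = "DefaultEndpointsProtocol=https;AccountName=sourceacct;..."
LAKE_CONN = "DefaultEndpointsProtocol=https;AccountName=datalakeacct;..."

source = BlobServiceClient.from_connection_string(SOURCE_CONN).get_container_client("exports")
landing = BlobServiceClient.from_connection_string(LAKE_CONN).get_container_client("raw")

# Extract and load: copy each raw file as-is into the lake's landing zone.
# No transformation happens here -- that is the "EL" part of ELT; the "T"
# comes later in Databricks or Synapse.
for blob in source.list_blobs(name_starts_with="sales/2022-12-20/"):
    data = source.get_blob_client(blob.name).download_blob().readall()
    landing.upload_blob(name=blob.name, data=data, overwrite=True)
```

The point is that nothing is reshaped on the way in; the files land in the lake exactly as they were extracted.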

Now, it doesn’t have to be a data lake. It could also simply be Blob storage. It’s almost the same type of storage facility, just optimised differently. So all your data is coming in from outside and is being stored first in a data lake. Then we’re going to use something like the Azure Databricks service. Now, Azure Databricks is based on the Databricks platform, which works off of Apache Spark, and it allows you to manipulate data at large scale. So you can use the data lake as your source, and then you can run updates, insert data, and create columns. You’re basically performing a series of operations on the data, and it gets stored in another data store. You’ve also got different programming languages: Python, Scala, and SQL are among the languages Spark supports, and you use them to build your transformations. So the transformed data gets extracted from the original data and stored in a data warehouse, which in this diagram is Azure Synapse Analytics. And Synapse Analytics is massively parallel.
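As a rough illustration of what one of those Databricks transformation steps might look like, here is a PySpark sketch; the paths, column names, and output table are hypothetical, and `spark` is the session object a Databricks notebook provides automatically.

```python
from pyspark.sql import functions as F

# Read the raw CSV files that were landed in the data lake (hypothetical path).
raw = (spark.read
       .option("header", "true")
       .csv("abfss://raw@datalakeacct.dfs.core.windows.net/sales/"))

# Transform: fix types, derive a new column, and filter out bad rows.
curated = (raw
           .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
           .withColumn("total", F.col("quantity").cast("int") * F.col("unit_price").cast("double"))
           .filter(F.col("quantity").cast("int") > 0))

# Load the cleaned result somewhere the warehouse layer can pick it up,
# here as a managed table (it could equally be written back to the lake
# or pushed into Synapse).
curated.write.mode("overwrite").saveAsTable("curated_sales")
```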

It can grow in computing power depending on the workload that you’re giving it. The data gets stored inside Synapse Analytics in a format that is optimised for reading rather than updating. It’s not a transactional system; it’s a data warehouse system. Also on the diagram, we have Azure Analysis Services, which covers the analytics requirement with an OLAP-type system. It’s able to take the data from the data warehouse and build it up with cubes and dimensions in such a way that you can do complex queries and multidimensional analysis on that data.
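For example, the kind of read-heavy analytical query the warehouse is optimised for might look like this sketch using pyodbc from Python; the server, database, credentials, and table names are all placeholders.

```python
import pyodbc

# Hypothetical connection to a Synapse dedicated SQL pool.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"
    "DATABASE=salesdw;UID=analyst;PWD=<password>"
)

# A typical warehouse-style query: large scans and aggregations over a fact
# table joined to dimensions, with no row-by-row updates involved.
query = """
    SELECT d.calendar_year, p.category, SUM(f.total) AS revenue
    FROM fact_sales AS f
    JOIN dim_date AS d ON f.date_key = d.date_key
    JOIN dim_product AS p ON f.product_key = p.product_key
    GROUP BY d.calendar_year, p.category
    ORDER BY d.calendar_year, revenue DESC;
"""

cursor = conn.cursor()
for year, category, revenue in cursor.execute(query):
    print(year, category, revenue)
```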

Finally, you end up with a Power BI user who’s able to point at Analysis Services, Synapse, and Databricks and pull out the data that is needed from those systems. Power BI is a reporting tool. You can build dashboards. I think we talked about that right at the beginning of this course in terms of the different types of charts and reports that you can build with Power BI. You can then publish those to other people, publish them on the web, publish them even to mobile devices, et cetera. So, returning to the diagram, we can see the full setup of what they call a modern data warehouse: data coming in, Data Factory, a data lake, and Databricks all playing a role in getting that data into a data warehouse such as Synapse Analytics. You can use Azure Analysis Services to build complex cubes and dimensions, and then Power BI to run reports off of that. So that is what Microsoft considers to be a modern data warehouse.

2. Azure Data Factory

Now that we’ve talked about data warehousing, let’s talk about the other side of that, which is the data ingestion process within Azure. I should point out that this is a deep topic. When we start talking about Azure Data Factory, there’s a lot to it. It’s basically the job of someone called a data engineer to do this processing. So this is the DP-900 course, and what we’re doing is talking and thinking about concepts here. But if you want to get into the details of it, there is a certification for data engineers: the DP-200 and DP-201 exams. Microsoft put out this description of what a data engineer is.

So data engineers are responsible for data-related implementation tasks. So this is the creation of databases, the ingesting of data (both batch and streaming), the transformation of data, the security around that, the backups and restores and retention policies, how you’re going to handle performance, and linking up with external data sources. So we talked about data warehousing. Let’s talk about ingestion and transformation. Now, the product that does this within Azure is primarily Azure Data Factory.

Now, Data Factory is what’s called a data orchestration tool. And for the orchestration part, if you think of a symphony or an orchestra, you have the conductor at the front, and he’s the one directing the orchestra, telling the drums when to come in, when the cymbals are to come in. So Data Factory is the conductor. It brings in data from external sources and can potentially do various transformations, and the data basically ends up being stored somewhere. Now, in the data warehousing example, it just copied the data and placed it in the data lake. But Data Factory does support transformation. It’s very similar to a product called SQL Server Integration Services, or SSIS, where you can copy data from one source to another but also do things to it along the way. So let’s look at the features of Azure Data Factory.

So, there’s the orchestration part, which is the data flow from one source to a destination. You can have data transformation tasks. This includes creating new columns, merging two columns into a single column, and reformatting columns as you bring them in. You can also put this on a schedule, so it runs on an automated basis, again, similar to SSIS. And in fact, if you’re currently running SQL Server Integration Services on-premises, your SSIS packages can be imported into Azure Data Factory. Now, let’s talk about some of those definitions. One thing that’s related to Data Factory is a thing called “activities.” Activities are basically actions that you perform on your data. It could be a copy activity, a transformation activity, or an if-then condition where you don’t copy data if it doesn’t meet certain criteria, et cetera. Next up, we have the concept of a pipeline. A pipeline is a grouping of activities, and it’s exactly as it sounds. So your Data Factory is the orchestrator. It brings data from A, performs a series of activities on it, and finally saves it at destination B.

And that is a pipeline: it’s a combination of activities. Now, a single data factory can have multiple pipelines, so you’re not just limited to copying data from one source to another and then being unable to do anything else. You can have multiple pipelines, and you can have the pipelines be sequential or have parallel activities within a single pipeline. And finally, the last concept is the trigger. So obviously, a data factory exists and you have your pipeline defined. How is that pipeline triggered to run? It can be triggered based on a schedule. So every day at 6:00 a.m., this pipeline runs: it pulls data from this source, performs this transformation, and stores it at this destination. It could be listening for an event. So let’s say you have a Blob storage account, and you’re going to wait until a file gets uploaded, and that’s going to trigger your pipeline to run. There’s also the concept of a tumbling window, which we’ll talk about in a second; it’s an interesting way of triggering things. And of course, it can be run manually. So you could just go into Azure, go into the Data Factory tool, and hit run, just as it sounds, right?
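To give a feel for how these pieces fit together, here is a deliberately simplified sketch, written as a Python dictionary, of roughly what a pipeline with a single copy activity plus a daily schedule trigger looks like. The names are invented and the structure is trimmed down from the real Data Factory JSON, so treat it as an illustration of the concepts rather than something you could deploy as-is.

```python
# A trimmed-down, illustrative shape of a Data Factory pipeline and trigger.
# Real definitions include more properties (linked services, datasets, etc.).
pipeline = {
    "name": "CopySalesToLake",
    "properties": {
        "activities": [
            {
                "name": "CopyFromSqlToLake",
                "type": "Copy",                     # a copy activity
                "inputs": [{"referenceName": "SourceSalesTable"}],
                "outputs": [{"referenceName": "RawSalesInLake"}],
            }
        ]
    },
}

trigger = {
    "name": "DailyAt6am",
    "properties": {
        "type": "ScheduleTrigger",                  # run on a schedule
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2022-12-20T06:00:00Z",
            }
        },
        "pipelines": [{"pipelineReference": {"referenceName": "CopySalesToLake"}}],
    },
}
```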

So you can run it manually by executing it. You can have a schedule that is very simple: run twice a day, twice a week, or every hour. You can also restrict things so that you don’t have two runs happening at the same time. And the last concept here is the tumbling window, which, like I said, I think is very interesting. Let’s say you are running the pipeline at a predetermined time, and you can basically say, “I’m running it at 9:00 a.m., but I only want to process the data that belongs to yesterday.” So you’re timeboxing the start and end of the data that you’re working on. It doesn’t have to run at 9:00 a.m. and pull in data up to 8:59 a.m.; you can go back in time and have the 9:00 a.m. run cover yesterday’s data. Effectively, you can process data for a specific time period and ignore things that are outside that window, as the sketch below shows. It’s pretty interesting to me, anyway. That’s the overview of Azure Data Factory: the activities and pipelines, and the way that things are triggered and scheduled. That’s pretty much what you need to know for the DP-900 exam.
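Just to illustrate that timeboxing idea, here is a plain Python sketch (not Data Factory code, and the values are made up): a run that fires at 9:00 a.m. is handed a window covering yesterday, and only records inside that window get processed.

```python
from datetime import datetime, timedelta

def tumbling_window(run_time: datetime, window: timedelta = timedelta(days=1)):
    """Return the (start, end) of the time box this run is responsible for.

    The window end is aligned to the start of the day the run fires on, so a
    9:00 a.m. run covers all of the previous day and nothing else.
    """
    window_end = datetime(run_time.year, run_time.month, run_time.day)
    window_start = window_end - window
    return window_start, window_end

start, end = tumbling_window(datetime(2022, 12, 20, 9, 0))
print(start, "->", end)   # 2022-12-19 00:00:00 -> 2022-12-20 00:00:00

# The run then only processes records whose timestamp falls inside
# [start, end), ignoring anything outside that time box.
records = [
    {"id": 1, "ts": datetime(2022, 12, 19, 14, 30)},   # inside the window
    {"id": 2, "ts": datetime(2022, 12, 20, 8, 45)},    # outside (today)
]
in_window = [r for r in records if start <= r["ts"] < end]
print([r["id"] for r in in_window])   # [1]
```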
