DP-900 Microsoft Azure Data Fundamentals Topic: Data Analytics Workloads

December 20, 2022

1. Data Analytics Workloads

So the fourth and final section of this exam and course focuses on data analytics. And that is where 25% to 30% of your score lies. Now, data analytics is effectively the reporting aspect of databases. It is how you’re going to extract your so-called business intelligence. It’s how you’re going to get your insights out of the data. Sometimes those are quite straightforward. Sometimes that involves “correlating” a lot of other data together to come up with an insight. and sometimes that is very complicated. And the types of insights you want to get require machine learning and other things to go in there. AI runs your models against your data in order to extract some type of insight. When it comes to workloads in analytics, there are three main workloads, and I’m focusing on relational databases. Here, you don’t often have this type of analytics on a Cosmos, DB, or non-relational database. When you get into data analytics, you’re often talking about relational databases. And so there are three types of database workloads. The first one is called OLTP, which stands for online transaction processing. That’s your day-to-day database that’s behind your business system.

When customers are interacting with your systems and your employees are interacting with your systems, that gets updated in an OLTP database. Now, you might want to pull that data out into an analytics database called OLAP, or Online Analytics Processing. and that allows you to preprocess your data. And then you can run some deep insights into it. You can create pivot tables, drill down, see your data from many different angles, and try to find some insights.

Finally, you may need a data warehouse. And a data warehouse basically allows you to collect your data from multiple sources, from multiple different geographies, and even from structured and unstructured options. You can pull it into a single database and run reports against it; that’s quite common. That’s what the data warehouse is actually designed for, but it’s not necessarily dimensional, so you can look at it from different angles, but you can run big reports against it in a data warehouse. So in this section of the course, we’re going to break down each of these database workload types, give examples for each, and hopefully, at the end, we’ll get an understanding of database analytics. In Azure.

2. OLTP

So online transaction processing, abbreviated as OLTP, is the most common type of database you’re likely to encounter. It is essentially the backend database for your business systems. Your OLTP database is any type of website or application that reads and writes to a database. Typically, this could be your ordering system; this could be any type of system that is processing transactions and that allows rows to be updated, rows can be deleted, and data can change, and it’s constantly happening. Now you can run reports from this database. It’s not uncommon to do that. You can use SQL queries, select statements, and joins, and the database is sort of optimised for what I would call general use.

So you can write to it, and you can read for it. There are indexes and things like that. What are the traits of an OLTP? Well, they’ve talked about this earlier in the course, and that is how a data object can be broken up into multiple tables in order to have some consistency as well as reduce the amount of data that you’re writing. And so you can have an order that has an order details table, a products table, and a customer table. All of the information relating to an order is spread across multiple tables. That’s normalization. Another important feature of an OLTP is data integrity. So you wouldn’t be able to insert a record into a table that is a child table unless the parent table had the appropriate records, et cetera. ETCA, you would not permit a partially completed transaction to be written to the database until the entire transaction was completed. There’s the concept of consistency. We talked about this during the Cosmos DB section, but essentially, when you write a piece of data to the database, when you do an immediate read from that database, you’re expecting the same data to come back to you. There is no delay between the time that you do a write and the time that it shows up in a query.

Like I said, it’s optimised for general use. And so you have quite a lot of writing to do for a database like this. With some reading, the data is updatable. Most of these other data processing styles, the data workflows that we’re going to talk about, don’t allow you to update the data. So if you need your data to changeover, this is an LTP system. Finally, there is a range of data sizes ranging from megabytes to terabytes. We’re not talking about petabyte- or exabyte- or ytabyte-type databases. These are outside the scope of a standard SQL Database and SQL Server. Now, in Microsoft Azure, I just mentioned a couple of different options. If you need an OLTP system, your options are going to be Azure SQL Database, SQL Server installed in a virtual machine, or any of those Azure database options for MySQL or PostgreSQL. Even MariaDB. Those are OLTP systems. Azure managed or managed by yourself? within a VM.

3. OLAP

The next style of data workload is called OLAP, or online analytical processing. So we just talked about transactional databases, but databases are not really designed for really complex analysis. If you need to gather your data together and look at it from many different dimensions and angles, an Oltap system might not be the best data structure for that. Furthermore, at OLTP, a data database is in use. It’s a production system. You’re reading, you’re writing, you run a query at 9:30, and then by 9:05 the results of that query are going to be slightly different, and that may interfere with some of your analysis and your reporting.

You might want a database that is fairly stable and that you at least understand when it gets updated, so that you can run your reports and get a consistent set of results for the time that you run it. Consider the following scenario: you’re running a series of reports, perhaps one for each region, and you’re running these reports. And if the data is changing, then the numbers aren’t going to match on report number one, and report number two might be off by one or two. And that’s going to be a constant source of frustration when you have particularly complex reports; when you’re talking about lots of joins, sorts, group buys, and things like that, that could actually slow down a transactional database in SQL Server form. You’re going to fill up your tempdb with lots of temporary data, you’re going to use your memory, and you’re going to use your CPU, but that system is also running your production systems.

And so you don’t want your reporting work to interfere with your day-to-day business. And so, having a separate system that you can use to run deep, intense reports with lots of groupings and lots of sorting and things like that, you don’t have to worry about taking away processing power from your production system. Also, once you have your separate database and have extracted this data into a separate format, you can do some preprocessing, clean up the data, rename certain things, reformat things, and there is space for you to actually manipulate the data to run reports on. And finally, if you’re familiar with SQL Server Analysis Services, there’s the concept of cubes, dimensions, and measures. And so these are basically different ways of looking at data that take processing power to prepare. But once you’ve done that, you can run reports across multiple dimensions very quickly, and the work has already been done for you. So if we’re looking at OLAP or online analytical processing, there is no concept of locking rows.

When you’re in a production OLTP system, when a row is being updated, the database grabs the row and places a lock on it so that no other process can update that row. Then it does the update, and then it releases the lock. As a result, if multiple people are accessing the same data at the same time, this can cause problems. This concept of locking is what they call blocking. Okay? So you end up with raised conditions, and there are processes that take a long time because they’re waiting for other systems to release blocks. In an analytic system, there’s no locking at all because it’s a read-only database. There are no updates allowed, so you’re not running a report. And then you run it again, and it comes back with different results. Your data is basically static state.It’s designed for reading. As a result, it’s essentially optimized. The entire database engine is optimised to deliver the results as quickly as possible.

You can basically create indexes across all of your columns by combining things and grouping stuff together. And the database can build unique things that are specifically for reporting that your transaction database can’t do. These databases come in gigabyte sizes as well. You are not running an analytics process on terabytes, petabytes, exabytes, or any other size. It’s a lot of data, but it’s still a manageable amount of data. When looking into Azure, you can modify SQL Server. There’s a thing called a column store. And so you can basically modify SQL Server for this analytics process. You can also use specific Azure analysis services. It’s a specific data type that is designed for OLAP. And also, just like you have SQL Server in a VM, SQL Server Analysis Services SSAs is a product that you can instal in a VM or even run on your own premises. Let’s not forget that there are hybrid setups where you pull your data out of Azure onto your on-premises server and then use things like SSAs to analyse that data.

4. Synapse Analytics SQL Data Warehouse

So the third type of data workload that you sometimes see is called a “data warehouse.” Now, a data warehouse has a slight advantage over an analytics database, and we’re going to try to compare and contrast them in this video. First of all, the data warehouse serves as a central repository for your data. So a common use case for this will be if you have SQL servers or Azure SQL databases that control many different regions of your business. Let’s say you have your European, your North American, your South American, and your Asian businesses. They all have their own unique OLTP systems, but you’re going to want to run reports across them. And so you basically pull your data out of those regional systems and into a single data warehouse. Sometimes that’s called a roll-up database sometimes. And so you’re able to do end-to-end reporting across customers and across regions.

You’ve got multiple data sources being pulled into a single one. The other use case for this is when you have historical data, which is oftentimes the case in transaction systems. It grows and grows, and you might want to trim that a little bit. That’s called an archiving process. As a result, you want your transaction data to be as small as possible in order to be as performant as possible. But your data warehouse can be huge. You can go back 10, 15, or 20 years of data if you don’t want that clogging up your day-to-day systems. And so using data warehouses for historical data is a common use case. Also, you might want to have reports that generate end-of-day reports and end-of-month reports; those go into tables, but they don’t exist in your transaction system. But you’re basically running reports and applications that are generating data for your data warehouse at the end of the month, at the end of the day, et cetera. Now, data warehouses are commonly used for reporting systems.

And so if you have power users who use business intelligence tools and they want to go directly to a data source and start to build reports, well, pointing them to a data warehouse is quite a useful thing. So you can say, “Here you go; here’s all of our data going back the last ten years.” Create as many reports as you want; why not? So giving them a data source that is a data warehouse often makes a lot of sense. And so they can go about their business. They can run reports that take hours to run, and they’re not affecting the day-to-day operations of your business. You don’t have to ever run back to the analytics section of your office and say, “Are you guys doing something?” We’re seeing a big spike in CPU on the database. And that shouldn’t happen when you’ve got a proper data warehouse system set up. Now, there are a couple of types of data warehousing that are now quite common. When we’re talking about data warehousing, we’re talking about a SQL data warehouse, which has been renamed to Azure Synapse Analytics. But that’s quite a large enterprise system. It’s called “massively parallel processing.” As a result, SQL Data Warehouse can generate additional computing power. If you have reports that are going to take a long time to run, To do that processing, the data warehouse can basically grow to a higher CPU utilisation and then shrink back down. But that’s only for really big uses. For a data warehouse, you can use regular Azure SQL Database or SQL Server. So basically, you have a separate system, but it’s still an Azure SQL database. But that’s something that they can run reports against, or you can generalise things and do your processing against that. Synapse is one of the options.

You also have big data solutions. So hadoop. HDInsight is something within Microsoft Azure. And so you have the Apache Hive type of database. Or there’s LLAP, which is the interactive query on HDInsight. And so those are, again, big processes. They’ve got clusters of computers. You can grow into massive systems. It’s not necessary if you don’t need that. Now, one of the questions on the exam syllabus is, “When do you use a data warehouse?” And so I put together these five common scenarios for when it would make sense to use a data warehouse. The first one is if you’re currently experiencing really long-running queries and other things that are affecting your day-to-day operations. Remember when I said OLTP uses locking and things like that? Well, you don’t want your reporting to start interfering with your data by causing blocks and long-running processes. So that could be a good use case for pulling something into a data warehouse. If there are some specific columns and formats and renaming of things that has to happen for the reporter people that are working in the reporting department to be able to analyse the data, well, maybe that can be pulled out.

So you’re using your OLTP system for day-to-day operations, and you don’t care that the date column is a number. But when they’re using it in their reports, it can’t be one, two, or three. It has to be Monday, Tuesday, or Wednesday. Right? Again, we talked about archiving. If you need to reduce the size of your OLTP database but want to keep the historical database for reporting purposes, you can do so during the archiving process. Data warehouses are good at pulling in data from multiple sources. Like I said, if you have Asian, European, North American, or South American data sources and you want to pull them into a single database, And finally, I mentioned this briefly, but if you are finding that users are being directed to your main Oltb database, but they don’t really like the table names and column names, and this is an ongoing source of confusion, then creating them in their own data warehouse would be something that would be beneficial.

Uncategorized

Related posts:

Leave a Reply Cancel reply