Pass Amazon AWS Certified Data Analytics - Specialty Exam in First Attempt Easily
Last Update: Nov 26, 2023
Free VCE files for Amazon AWS Certified Data Analytics - Specialty certification practice test questions and answers, exam dumps are uploaded by real users who have taken the exam recently. Download the latest AWS Certified Data Analytics - Specialty (DAS-C01) certification exam practice test questions and answers and sign up for free on Exam-Labs.
Domain 1: Collection
1. Collection Section Introduction
Welcome to the Collection section. In this section, we will learn how to move data into AWS. There are many ways to do that, and I like to group them into three categories.
The first is real-time collection, where we can perform immediate action on our data. For this, we'll go over Kinesis Data Streams, Simple Queue Service (SQS), and AWS IoT for the Internet of Things. All of these services allow you and your applications to react in real time.
The second is near-real-time collection. For this, you will use Kinesis Data Firehose or the Database Migration Service. Don't worry, all of this will be covered in this section.
Finally, there is a third type of collection that is not close to real time at all, and it is called batch. You usually use it when you want to move a larger amount of data into AWS for historical analysis, perhaps to do some big data analysis on it. Here we have Snowball and Data Pipeline. All these services will be described in this section.
There's a lot to learn, but I hope you now have a general picture of where things stand in terms of real time, near real time, and batch. I will see you in the next lecture.
2. Kinesis Data Streams Overview
Let's start our course with Kinesis. Kinesis is a really important exam topic and, overall, a really important collection topic on AWS if you want to do big data. So what is Kinesis? Some view it as a managed alternative to Apache Kafka, which is great for gathering a lot of data in real time. It's great if you want to gather data such as application logs, metrics, IoT data, or clickstreams. Overall, it's great for any kind of real-time big data. It integrates with a lot of stream processing frameworks such as Spark, NiFi, and others. And you get some really cool data replication, because data is replicated synchronously to three AZs. So by default, this is a highly available setup.
Now, Kinesis is comprised of multiple services. The first one is Kinesis Streams, or Kinesis Data Streams, which is low-latency streaming ingest at scale. Then there is Kinesis Analytics, which Frank will describe to you later on in the course, and which performs real-time analytics on streams using SQL. And finally there is Kinesis Firehose, to load streams into S3, Redshift, Elasticsearch, and Splunk. We will see all of them in this course, obviously.
Now, let's first talk about Kinesis as a whole. What is the architecture surrounding Kinesis? In the center we have the Amazon Kinesis service, and the first thing we're going to visit is Amazon Kinesis Streams. Streams can take a lot of data from, say, clickstreams, IoT devices (for example, a connected bike), or metrics and logs coming directly from your servers. So the streams will just ingest a lot of data, and we'll see how that works later on in this section. But then you would like to analyze that data in real time. Maybe you're trying to compute a metric, maybe build alerts. For this, you can optionally use the Amazon Kinesis Analytics service.
And then once you have all that data, you may want to store it somewhere for later analysis, real-time dashboards, and that kind of thing. That would be Amazon Kinesis Firehose. Now, as I mentioned, Kinesis Firehose is a near-real-time service. We'll see what that means when we do a deep dive on Kinesis Firehose. But going into the exam, remember that it's near real time, not real time. So what does Kinesis Firehose do? It can deliver your data to an Amazon S3 bucket, an Amazon Redshift database, Splunk, or Elasticsearch, as we'll see later on. So that gives you a high-level overview of the architecture surrounding Kinesis.
Now, let's do a deep dive into Kinesis Streams. A Kinesis stream is divided into what are called shards. A shard, for those who don't know the term, is the equivalent of a partition. So what does it look like? We have a producer, and it's going to produce to our stream in Kinesis. But that stream is actually made of shards. In this example we have three shards. And then consumers will read that data from the shards.
Kinesis Streams does not store your data forever. It stores it for 24 hours by default, so you basically have one day to consume the data. But if you want more safety, and obviously it will be more expensive, you can retain that data for up to seven days. This is something to consider, because Kinesis gives you the ability to reprocess and replay data. Once you consume the data, it is not gone from Kinesis Streams. It will be gone based on the data retention period, but you are able to read the same data over and over as long as it is in the Kinesis stream. That means that multiple applications can consume the same stream. That means you can do real-time processing, with throughput at huge scale. Now, one thing you should know is that a Kinesis stream is not a database.
Once you insert data into Kinesis, it's immutable. That means it cannot be deleted. So it's an append-only stream. Okay, so we have a high-level overview. Now let's do a deep dive into shards.
Basically, one stream is made of many different shards, or partitions. When you get billed in Kinesis, you get billed per shard provisioned. You can have as many shards as you want. By the way, batching is available, or you can do a per-message PUT, and we'll see this in the producer section. Now, the number of shards can evolve over time, so you can reshard or merge, and we'll see these operations. Overall, the records are not going to be ordered globally. They're going to be ordered per shard. So within a shard, all the records will be ordered based on when they're received. We've seen this graph before, but just to reiterate: the producers produce to one or more shards (here we have four shards), and consumers will receive the data.
Now, what do we produce to these shards? We produce records. And what is a record made of? It's made of something called a data blob. That data blob is up to 1 MB, and you need to remember this. It's bytes, so you can represent anything you want. Then you will have a record key. That record key helps Kinesis know which shard to send the data to. So when you choose a record key, you need it to be very well distributed, for example a user ID if you have a million users. That will allow you to avoid the hot partition problem, where all the data goes to the same shard. Finally, you will have a sequence number. This is not something the producer sends; it gets added by Kinesis after the record is ingested. It represents the position of the record in the shard at the time the data was added to that shard.
So finally, there are a few limits you need to know in Kinesis Data Streams. The first one is that a producer can only produce 1 MB per second or 1,000 messages per second at write, per shard.
So if you have ten shards, you get 10 MB per second or 10,000 messages per second. If you go over that limit, you will get what's called a ProvisionedThroughputExceeded exception.
Now, there are two types of consumer in Kinesis. The first one is the classic consumer, and this is the one you may have heard of. You get 2 MB per second at read per shard across all the consumers, and you get five API calls per second per shard across all the consumers. Then you have consumer enhanced fan-out. This is a new kind of consumption that may not be described very well online yet. With it, you get 2 MB per second at read per shard, per enhanced consumer, and there are no API calls needed, so it's a push model. The really cool thing, and we'll see this in detail, is that you can scale out to a lot more consumers with enhanced fan-out.
And finally, you need to remember data retention, which is 24 hours by default and can be extended to seven days. So that's it for the high-level overview of Kinesis. I'll see you in the next lecture.
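To make the record key and the write limits more concrete, here is a minimal Python sketch. It relies on the fact that Kinesis hashes the partition key with MD5 into a 128-bit integer, but it assumes the shards split that key space evenly, which is only true for a stream that has not been resharded unevenly. The function names `shard_for_key` and `write_capacity` are made up for this illustration.

```python
import hashlib
from collections import Counter
from typing import Tuple


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    # Kinesis hashes the partition key with MD5 into a 128-bit integer.
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    # Assumption: each shard owns an equal, contiguous slice of the key space.
    range_size = 2**128 // shard_count
    return min(hash_value // range_size, shard_count - 1)


def write_capacity(shard_count: int) -> Tuple[int, int]:
    """Return (MB per second, records per second) the stream can ingest."""
    return shard_count * 1, shard_count * 1000


# A high-cardinality key (user IDs) spreads records across all four shards.
user_counts = Counter(shard_for_key(f"user-{i}", 4) for i in range(10_000))

# A low-cardinality key (two device types) can reach at most two shards,
# and 90% of the traffic lands on a single one: a hot shard.
device_keys = ["ios"] * 9_000 + ["android"] * 1_000
device_counts = Counter(shard_for_key(key, 4) for key in device_keys)

print(sorted(user_counts.items()))
print(sorted(device_counts.items()))
print(write_capacity(10))  # (10, 10000)
```

The user ID key touches every shard, while the device type key never uses more than two of them, which is the hot partition problem described above.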
3. Kinesis Producers
So we need to know how we can produce data for Kinesis. Here is our Amazon Kinesis stream, and the exam will expect you to know at a high level how each of these producers works. The first one is the SDK. The SDK allows you to write code or use the CLI to directly send data into Amazon Kinesis Streams. The second one is the Kinesis Producer Library, or KPL. Remember that acronym. The Kinesis Producer Library is a bit more advanced: you're going to write better code, and it has some really good features that I will describe in this lecture, which allow you to get enhanced throughput into Kinesis Streams. The third one is the Kinesis Agent. The Kinesis Agent is a Linux program that runs on your server. So remember that it's an agent that runs on servers and allows you to take a log file, for example, and send it reliably into Amazon Kinesis Streams. Finally, you can use third-party libraries that usually build upon the SDK, such as Apache Spark, Kafka Connect, NiFi, etc. All of these will allow you to send data to Kinesis Streams reliably. So look at this diagram, and remember that Kinesis Streams can get data from various sources in different ways.
Now, let's do a deeper dive into all the methods on the left-hand side. The first one is the producer SDK, and it uses the PutRecord or PutRecords (with an S) API. So anytime you see PutRecord, that means SDK. With PutRecord, as the name indicates, you send one record. With PutRecords, with an S, you send many records. PutRecords uses batching and therefore increases your throughput. That's because you send many records as part of one HTTP request, so you save on HTTP requests and get increased throughput. Now, if you do go over your throughput, you will get a ProvisionedThroughputExceeded exception, and it's important to know how to deal with this. We'll see this in the very next slide.
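As a rough illustration of how PutRecords batching saves HTTP requests, the sketch below groups events into request payloads shaped like the arguments of boto3's `put_records` call. It only builds the payloads and does not call AWS. The 500-records-per-request cap is the documented PutRecords limit; the per-request size limit is ignored here for brevity, and the stream name and the helper `build_put_records_requests` are hypothetical.

```python
MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def build_put_records_requests(stream_name, events):
    """Yield one PutRecords payload per batch of up to 500 (key, data) events."""
    batch = []
    for partition_key, data in events:
        batch.append({"Data": data, "PartitionKey": partition_key})
        if len(batch) == MAX_RECORDS_PER_CALL:
            yield {"StreamName": stream_name, "Records": batch}
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield {"StreamName": stream_name, "Records": batch}


events = [(f"user-{i}", f"click {i}".encode("utf-8")) for i in range(1200)]
requests = list(build_put_records_requests("clickstream", events))

# 1,200 records go out in 3 HTTP requests instead of 1,200 PutRecord calls.
print([len(request["Records"]) for request in requests])  # [500, 500, 200]

# With boto3 installed and credentials configured, each payload could be sent as:
#   boto3.client("kinesis").put_records(**request)
```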
Now, this producer SDK can be used in various ways. You can use it in your applications, but also on your mobile devices such as Android, iOS, etc. So when would you choose to use the producer SDK? Any time you have a low-throughput use case, you don't mind higher latency, you want a very simple API, or maybe you're just working straight from AWS Lambda. This is the kind of place where you would use the producer SDK. You should also know that there are some managed AWS sources for Kinesis Data Streams. They use the producer SDK behind the scenes, but you don't see it. These managed sources include CloudWatch Logs.
Now, about that exception. You get a ProvisionedThroughputExceeded exception when you are sending more data than you can. For example, you're exceeding the number of megabytes per second or the number of records per second you can send to any shard. So when you get this, you need to make sure that you don't have a hot shard. For example, if you have the device type as your key and 90% of your devices are iPhones, then you're going to get a hot partition, because almost all your devices are iPhones and their data will all go to the same partition. So make sure the key you choose is distributed as much as possible in order not to get a hot partition.
So the solution, if you get a ProvisionedThroughputExceeded exception, is to retry with backoff. That means that you will retry after maybe 2 seconds, and if it doesn't work, you'll retry after 4 seconds, and then after 8 seconds. These are just arbitrary numbers. Or you can increase the number of shards you have in Kinesis, basically to increase the amount of scaling you can do. And you need to ensure that your partition key is a good one, a very distributed one. So for the mobile device example, instead of choosing Apple versus Android, you would choose maybe the device ID, which for sure is going to be different for each of your users. Just an example.
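The retry-with-backoff idea can be sketched in a few lines of Python. This is only an illustration: `ThroughputExceeded` is a hypothetical stand-in for the real ProvisionedThroughputExceeded error, `send` is whatever function actually writes to the stream, and the 2, 4, 8 second delays are the same arbitrary numbers used above.

```python
import time


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for the ProvisionedThroughputExceeded error."""


def send_with_backoff(send, record, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call send(record), doubling the wait after each throttled attempt."""
    for attempt in range(max_attempts):
        try:
            return send(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 2 s, 4 s, 8 s, ...


# Demo with a fake sender that is throttled twice, then succeeds.
attempts = {"count": 0}


def flaky_send(record):
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ThroughputExceeded()
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = send_with_backoff(flaky_send, b"payload", sleep=delays.append)
print(result, delays)  # ok [2.0, 4.0]
```

Passing `sleep` as a parameter is just a convenience so the demo runs instantly; in a real producer you would leave the default.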
So now let's talk about the Kinesis Producer Library, or KPL. You need to learn exactly how this works, because it is very important going into the exam. It's an easy-to-use and highly configurable C++ or Java library, and from personal experience, I've seen Java being used more with the KPL. It's used when you want to build high-performance, long-running producers, and it has an automated retry mechanism. So the exception I just described, which we have to deal with ourselves when using the API, is something the Kinesis Producer Library automatically knows how to deal with.
Now, the Kinesis Producer Library has two kinds of API. There is a synchronous API, which is the same as the SDK, and an asynchronous API, which will give you better performance, but you need to deal with the asynchronicity, obviously. So every time in the exam you see that we need to send data asynchronously to Kinesis Data Streams, usually the Kinesis Producer Library, or KPL, will be the way to do it. It's a really nice library because it is also able to send metrics to CloudWatch for monitoring. So any time you write an application with the KPL, you can monitor it directly in CloudWatch.
It also supports a very important mechanism called batching. Batching has two parts, both turned on by default, and they will help you increase throughput and decrease cost. You absolutely need to know those. The first one is to collect records and write to multiple shards in the same PutRecords API call. The second one is to aggregate, which increases latency but also increases efficiency. It's the capability to store multiple records in one record, and with this you're able to go over the 1,000 records per second limit. I'll describe this to you in the very next slide. It also allows you to increase the payload size and therefore increase throughput, so you can reach that 1 MB per second limit more consistently.
If you want to use compression, that is, to make your records smaller, this is not something the Kinesis Producer Library supports out of the box. Unfortunately, you would have to implement this yourself. Also, when we send a record with the Kinesis Producer Library, it's a very special record, and you cannot just read it using the CLI. For this, you will need to use the KCL or a special helper library, and we'll see the KCL in the next lecture.
So let's talk about batching, because this is such an important concept to understand in the Kinesis Producer Library and something the exam will ask you about. Here, for example, I'm sending a record to Kinesis, it's 2 KB, and I'm sending it using the Kinesis Producer Library. It turns out it's not going to be sent right away. The library is going to wait a little bit to see if more records are coming in. Maybe I then send a record of 40 KB, and then a record of 500 KB. At some point the Kinesis Producer Library will say: wait a minute, I can aggregate all these records into one record. So instead of sending three records, now we're sending one record, and that record is still less than 1 MB. This happens multiple times. So maybe we have 130, 80, and 200 KB next, and the library aggregates those into one record as well. So now we have two records. That is aggregation.
Then the library says: wait a minute, now we have to send two big records, but we're not going to use two PutRecord API calls. We can use one call and do collection, so we'll use the PutRecords API. So here you see, I wanted to send seven records to Kinesis and I ended up doing only one API call, because we have aggregation and collection.
And so how does Kinesis know how long to wait to batch these records? You can control it using the RecordMaxBufferedTime setting, which defaults to 100 milliseconds. Basically you're saying, okay, I'm willing to wait 100 milliseconds.
So that adds a little bit of latency in exchange for being more efficient. If you want less delay, you decrease that setting, and if you want more batching, you increase it. So that's it for the KPL. Just remember the batching mechanism. It's really important.
The last way to produce to Kinesis is to use the Kinesis Agent. This agent is installed on a server, monitors log files, and sends them directly to Kinesis Data Streams. It only needs some configuration. It's a Java-based agent, and it's actually built on top of the KPL. That makes it really reliable and efficient. For now, you would install it only in Linux-based server environments. Its features are that it can read from multiple directories and write to multiple streams. It has a routing feature based on the directory or log file you're using, and it can even preprocess the data before sending it to Kinesis Data Streams. It can convert multi-line records to a single line, and it can do CSV to JSON or log to JSON. On top of that, the Kinesis Agent is really well written, so it will handle things like log file rotation, checkpointing, and retry upon failure. And because it uses the KPL, it will emit metrics to CloudWatch for monitoring. So anytime you need to do aggregation of logs en masse in almost real time, the Kinesis Agent is the way to go.
All right, so we've seen all the ways to produce into Kinesis. I know that's a lot, but remember, you need to understand at a high level how they work. I hope I achieved just that. I will see you in the next lecture. Bye.
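The KPL batching described in this lecture, aggregation followed by collection, can be simulated with a small Python sketch. This is not the KPL itself: the real library packs records into its own binary format, while this sketch only tracks record sizes. The arrival times are invented for the example, 1 MB is treated as 1,000,000 bytes for simplicity, and the 500-records-per-call cap is the PutRecords limit.

```python
MAX_AGGREGATE_BYTES = 1_000_000  # 1 MB record limit, simplified to 10^6 bytes
KB = 1_000


def aggregate(records, max_buffered_ms=100, limit=MAX_AGGREGATE_BYTES):
    """Pack (arrival_ms, size_bytes) records into aggregated records.

    A buffer is flushed when the buffering window has elapsed or when the
    next record would push the aggregate over the size limit.
    """
    aggregated, current, window_start = [], [], None
    for arrival_ms, size in records:
        window_expired = current and arrival_ms - window_start >= max_buffered_ms
        too_big = current and sum(current) + size > limit
        if window_expired or too_big:
            aggregated.append(current)
            current = []
        if not current:
            window_start = arrival_ms  # a new buffering window starts here
        current.append(size)
    if current:
        aggregated.append(current)
    return aggregated


def collect(aggregated, max_per_call=500):
    """Group aggregated records into PutRecords calls."""
    return [aggregated[i:i + max_per_call] for i in range(0, len(aggregated), max_per_call)]


# Hypothetical arrival times (ms) paired with the record sizes from the lecture.
incoming = [(0, 2 * KB), (30, 40 * KB), (60, 500 * KB),
            (150, 130 * KB), (180, 80 * KB), (210, 200 * KB)]

aggregated = aggregate(incoming)
calls = collect(aggregated)
print(len(incoming), "user records ->", len(aggregated), "aggregated records ->",
      len(calls), "PutRecords call")
```

Raising `max_buffered_ms` makes the windows longer, so more user records end up in each aggregated record at the cost of extra latency, which is the trade-off the buffering setting controls.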
4. Kinesis Consumers
Now let's talk about how we can consume data from Kinesis Data Streams. We'll talk first about the classic consumers. The first one is the Kinesis SDK. The same way we could use the CLI or some programming language to push data to Kinesis Data Streams, we can use the SDK or the CLI to read data from Kinesis Data Streams. We can also use the Kinesis Client Library. I hinted at it in the previous lecture; it's named KCL. So we produce with the KPL and read with the KCL. There's also the Kinesis Connector Library, which could also be abbreviated to KCL, but it's a different thing from the Kinesis Client Library. And then we have third-party libraries such as Apache Spark, Log4j appenders, Flume, Kafka Connect, all these things. But the exam expects you to know that Apache Spark is able to read from Kinesis Data Streams. As a consumer, we can also use Kinesis Data Firehose, and AWS Lambda if we need to. There's also a consumption mechanism called Kinesis consumer enhanced fan-out, and I will discuss it in the next lecture.
So for now, let's just consider how a classic consumer works on Kinesis. First, the SDK and GetRecords. This is classic Kinesis, and it means that the records are polled, or pulled, by the consumer from a shard. Each shard gets a maximum of 2 MB per second of total aggregate read throughput. So for each shard, remember: 1 MB per second for producers, 2 MB per second for consumers.
Here's the example. Our producer is producing to our Kinesis data stream, and it's made up of, let's say, three shards. With three shards, we have 6 MB per second of aggregate throughput downstream, but each shard itself gets 2 MB per second of its own. Now we have a consumer application, and it wants to read from, for example, shard number one. It will do a GetRecords API call, and the shard will return some data. If the consumer wants more data, it needs to do another GetRecords API call.
So that's why it's called a polling mechanism. Every time you run GetRecords, it will return up to 10 MB of data, or a maximum of 10,000 records. Because 10 MB of data goes over the 2 MB per second total, after a 10 MB response you will need to wait 5 seconds before you do another GetRecords.
There's also another limit you need to know: there's a maximum of five GetRecords API calls per shard per second. That means that your consumer application cannot just call GetRecords 20 times per second. It can do it only five times per second, which means that you will get 200 milliseconds of latency on your data. So remember that number, because it's really important.
Now, what does that mean if we look at these constraints and start adding more consumers? If five consumer applications consume from the same shard, and they're different applications that all need to read the same data, then each consumer can basically poll once a second and can receive less than 400 KB per second. That means that the more consumers you have, the less throughput you will have per consumer. So if we had consumer B and consumer C, they would all share that limit of 2 MB per second per shard, and they would all share that limit of five GetRecords API calls per second. It's really important to understand that, and we'll see how Kinesis enhanced fan-out for consumers solves that problem.
The next thing we'll explore is using the Kinesis Client Library, or KCL, to consume data. It's a Java-first library, but it also exists for other languages such as Golang, Python, Ruby, Node, and .NET, and it allows you to read records produced with the KPL. Remember how the KPL does some aggregation? Well, the KCL does the de-aggregation. The idea is that with the KCL, multiple consumers in one group can share multiple shards.
There's also a shard discovery mechanism, which means that your Kinesis data stream can be consumed by one application as well as a second one, acting together as a group. On top of that, there's a checkpointing feature, which means that if one of these applications goes down and then comes back up, it can remember exactly where it was consuming. How? It uses an Amazon DynamoDB table to checkpoint the progress over time and to synchronize who is going to read which shard, which makes it really helpful. So DynamoDB is used for coordination and checkpointing, and the table has one row for each shard to consume from.
And so, because we now have DynamoDB in the equation with the Kinesis Client Library, we need to consider how to provision throughput for that DynamoDB table. You need to make sure you provision enough write capacity units or read capacity units (WCU or RCU), or you need to use on-demand for DynamoDB so that you don't provision any capacity at all. Otherwise, DynamoDB may throttle, and that throttling will in turn slow down the KCL. That is a very popular question at the exam: "My KCL application is not reading fast enough, even though there is enough throughput in my Kinesis data stream. What's the problem?" Well, the problem is that you have probably under-provisioned your DynamoDB table, so it cannot checkpoint fast enough and therefore cannot consume fast enough. So, very important.
Finally, there's a record processor API to process the data, which makes it really convenient to treat these messages one by one. So what you need to remember about the Kinesis Client Library is that it uses DynamoDB for checkpointing, that it is therefore bound by DynamoDB limits, and that it's used to de-aggregate records from the KPL.
There's also the utterly confusing Connector Library. It could also be abbreviated KCL, but it is the Kinesis Connector Library.
And it's an older Java library from 2016. It leverages the KCL under the hood, and it's used to write data to Amazon S3, DynamoDB, Redshift, or Elasticsearch. The Connector Library must be running on, for example, an EC2 instance, since it is an application whose sole purpose is to take data from Kinesis Data Streams and send it to all these destinations. Now, you may be thinking: this is already something we can do with Kinesis Firehose. And that's true, we can already do this with Kinesis Firehose, and we'll see this. For some of these targets we can use Kinesis Firehose, for example S3 and Redshift, and for others we can use AWS Lambda. So overall, I would say the Kinesis Connector Library can appear at the exam, but it's kind of deprecated and replaced by Kinesis Firehose and Lambda together.
So let's talk about Lambda. Lambda can read records from a Kinesis data stream, and the Lambda consumer also has a small library, which is really nice, used to de-aggregate records from the KPL. So you can produce with the KPL and read from a Lambda consumer using that small library. Lambda can be used to do lightweight ETL, so we can send data to Amazon S3, DynamoDB, Redshift, Elasticsearch, or really anywhere you want, as long as you can program it. Lambda can also be used to read in real time from Kinesis Data Streams and trigger notifications, for example to send an email in real time, or whatever you may want. Finally, and Frank will describe this at length, there is a configurable batch size, which we'll see in the Lambda section. Basically, you can say how much data at a time Lambda should read from Kinesis, which helps you regulate throughput.
So overall, we've seen all the ways we can read from Kinesis Data Streams. There are many different ones, but I hope this gives you some context on which one is good for which use case. I will see you in the next lecture.
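As a small illustration of the Lambda consumer, here is a Python handler sketch. Kinesis delivers each record's data to Lambda base64-encoded under `record["kinesis"]["data"]`. The sketch assumes the producers sent plain JSON payloads that were not aggregated by the KPL (aggregated records would need the de-aggregation library mentioned above), and the sample event is trimmed to the fields used here.

```python
import base64
import json


def handler(event, context):
    """Decode every Kinesis record in the batch and return the parsed payloads."""
    payloads = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded in the Lambda event.
        raw = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers sent JSON documents, not KPL-aggregated records.
        payloads.append(json.loads(raw))
    return payloads


# A hand-built sample event with a single record, for local testing only.
sample_event = {
    "Records": [
        {
            "kinesis": {
                "partitionKey": "user-1",
                "sequenceNumber": "1",
                "data": base64.b64encode(json.dumps({"page": "/home"}).encode("utf-8")).decode("ascii"),
            }
        }
    ]
}

print(handler(sample_event, None))  # [{'page': '/home'}]
```

The number of records in `event["Records"]` is what the configurable batch size controls.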
5. Kinesis Enhanced Fan Out
So I'd like to make sure I teach you about real-world big data, not just exam big data. And even though this might not be in the exam just yet, I think it will appear very soon. To me, Kinesis enhanced fan-out is what I call a game-changing feature. It appeared in August 2018, so I'm sure it will be in the exam very soon anyway.
So how do we leverage this? It works if you use KCL 2.0 or AWS Lambda, since November 2018, so it's supported by both. And what does enhanced fan-out mean? It means that each consumer will get 2 MB per second per shard. So it looks similar to before, but it's not exactly the same.
We have a producer producing to a Kinesis data stream that has, for example, one shard. In this example, we have consumer application A, and it does an API call named SubscribeToShard. Automatically, the Kinesis data stream will start pushing data at the rate of 2 MB per second. So we're not polling anymore; we're just saying "subscribe to shard," and the shard will send us 2 MB per second. That means that if we have 20 consumers overall, we'll get 40 MB per second per shard. So we can add another consumer application, do the subscribe call again, and get another 2 MB per second. Before, we had a 2 MB per second limit per shard, but now we get 2 MB per second per shard per consumer. The reason is that enhanced fan-out has Kinesis pushing data to consumers over HTTP/2.
The added benefit of this is, first of all, that we can scale to a lot more consumer applications. The other one is that we get reduced latency. Remember that before, we had 200 milliseconds of latency with one consumer, because it could poll five times a second. And if you had more consumers, they could each have 1 second of latency.
Well, with enhanced fan-out we now get reduced latency because data is pushed to us, and on average you'll get 70 milliseconds of latency, which is a lot less than 200 milliseconds or 1 second. That is why, to me, it's a game-changing feature, and I'm just happy to tell you that it exists. Obviously you have to pay a bit more for it, and the pricing page will help you understand how much.
Now, what's the difference between enhanced fan-out consumers and standard consumers, and when would you use each of them? You use a standard consumer when you have a low number of consuming applications, say one, two, or three, you can tolerate 200 milliseconds or more of latency, and you want to minimize cost, because the cost of a standard consumer is already included in Kinesis. You use enhanced fan-out consumers if you want to have multiple applications consuming the same stream. We're talking about five or ten applications at the same time. You would also use them when you have very low latency requirements, so you can only tolerate maybe 70 milliseconds of latency. And obviously, this will bring you a higher cost. As I said, see the Kinesis pricing page for that. By default, when you use enhanced fan-out consumers, you have a limit of five consumers that can use fan-out, but you can increase that with a service request to AWS Support.
So I hope that makes sense. I hope you're excited about this new feature. Honestly, it's awesome, and I'm pretty sure the exam will ask you about it very, very soon, so it's good to learn about it. To me, it's game-changing, because now we can have 20 consumers on a shard without any impact on each of them. All right, I've said it enough. I will see you in the next lecture.
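The difference between the two consumer types comes down to simple arithmetic, sketched below in Python. The function names are made up for this illustration, and the numbers are the per-shard limits quoted in this lecture (2 MB per second and five GetRecords calls per second shared by standard consumers, 2 MB per second for each enhanced fan-out consumer).

```python
def standard_consumer(consumers, shards=1):
    """Return (MB/s per consumer, ms between polls) when sharing GetRecords limits."""
    # All standard consumers share 2 MB/s per shard.
    per_consumer_mb = 2.0 * shards / consumers
    # All standard consumers share five GetRecords calls per second per shard.
    poll_interval_ms = 1000 * consumers / 5
    return per_consumer_mb, poll_interval_ms


def enhanced_fan_out(consumers, shards=1):
    """Return (MB/s per consumer, total MB/s pushed) with enhanced fan-out."""
    # Each registered consumer gets its own 2 MB/s per shard.
    per_consumer_mb = 2.0 * shards
    return per_consumer_mb, per_consumer_mb * consumers


print(standard_consumer(1))   # (2.0, 200.0)
print(standard_consumer(5))   # (0.4, 1000.0)
print(enhanced_fan_out(20))   # (2.0, 40.0)
```

With five standard consumers on one shard, each gets 0.4 MB per second and can poll about once per second, while 20 enhanced fan-out consumers still get 2 MB per second each.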