AWS Certified Data Analytics - Specialty (DAS-C01) Certification Video Training Course Outline
Domain 1: Collection
Domain 2: Storage
Domain 3: Processing
Domain 4: Analysis
Domain 5: Visualization
Domain 6: Security
Preparing for the Exam
Appendix: Machine Learning topic...
Domain 1: Collection
6. Kinesis Scaling
From an exam and real-world perspective, it's really important to understand how you can scale Kinesis. The first operation you can do is to add shards. In the Kinesis world, adding shards is known as shard splitting, as we'll see in a moment, and it's used to increase the stream's capacity. Keep in mind that each shard receives one megabyte per second, so if you have ten shards, you have ten megabytes per second, right? So if you want to increase the stream capacity, you need to split a shard. It can also be helpful for dividing a hot shard. If you have a hot shard, one that's being used more than the others, we'll just split it, and that will increase our throughput. So what happens when you split a shard? The old shard is closed and will be deleted once the data in it has expired.

Let's have a look at the diagram, because I think it will make a lot more sense. We have shards 1, 2, and 3 in this example, and they all occupy the same amount of key space. Now, shard 2 is very hot, and we want to split it to increase throughput on the key space of shard 2. So we do a split operation, and what happens is that shard 4 and shard 5 are created. As you can see, together they occupy the same key space as shard 2. But because we now have two shards, we get two megabytes per second on that key space instead of one megabyte per second, while shards 1 and 3 remain the same. Before the split, the stream looked like it had three shards; after the split, we have four shards, and we can see that shards 4 and 5 occupy the space of shard 2. Shard 2 will still be around as long as the data in it has not expired, but when it expires, it will be gone. So now this is the state of our new Kinesis stream: when producers write data to it, they have four shards to write to.
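To make the split concrete, here's a small Python sketch. The stream and shard names are hypothetical, and the actual boto3 call is shown commented out (it would need AWS credentials); the testable part is just the hash-key arithmetic: splitting a shard at the midpoint of its hash key range, which is the usual way to split evenly.

```python
# Sketch: splitting a hot shard at the midpoint of its hash key range.
# Stream/shard names are hypothetical; the boto3 call is shown but not executed.

def split_point(starting_hash_key: int, ending_hash_key: int) -> int:
    """Midpoint of a shard's hash key range -- the usual even split point."""
    return (starting_hash_key + ending_hash_key) // 2

# A shard covering the full MD5 hash key space spans 0 .. 2**128 - 1:
full_range_end = 2**128 - 1
mid = split_point(0, full_range_end)
print(mid)  # each child shard now covers half of the parent's key space

# The actual API call would look like this (requires AWS credentials):
# import boto3
# kinesis = boto3.client("kinesis")
# kinesis.split_shard(
#     StreamName="my-stream",
#     ShardToSplit="shardId-000000000002",
#     NewStartingHashKey=str(mid + 1),
# )
```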
So this is a really important concept to understand, because you can split as many shards as you want over time and increase your throughput this way. Now, what's the inverse operation to adding or splitting shards? It's to decrease the shard count, or merge shards. When you merge shards, you decrease the stream's capacity and save on some costs. You'd use it to group two shards that have low traffic: you merge them together to save money. The old shards are closed and deleted, again based on data expiration.

So now, for example, say our stream has shards 1, 4, 5, and 3, just like before, and we want to merge 1 and 4 together. We merge them, and they become shard 6. Maybe shards 1 and 4 didn't get much traffic, so we can merge them together and save some cost. Shards 5 and 3 remain the same as before. And so now we've gone from four shards to three shards in our new stream because of the merge. You can see how, by splitting and merging, we can basically reshape our entire Kinesis stream to increase and decrease throughput over time. This is how it works.

So what about auto scaling? You may ask: do we have to do these things manually? Well, there is auto scaling, but it's not a native feature of Kinesis. There's an API call you can use to update the shard count, called UpdateShardCount, and you can implement some kind of auto scaling using AWS Lambda. There's a whole blog post on it if you're curious; this is the architecture diagram, directly from that blog. So read that blog if you're interested in implementing auto scaling, but you do have to wire it up a bit manually to make it work. Now, what are the limitations of resharding? Well, you cannot do resharding in parallel, so you need to plan capacity in advance.
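The UpdateShardCount approach above can be sketched in Python. The stream name is hypothetical and the boto3 call is commented out; the helper encodes one real constraint of the API, namely that a single call can't set a target of more than double or less than half the current shard count.

```python
# Sketch of scaling via the UpdateShardCount API mentioned above.
# A single call may only target between half and double the current
# shard count; this helper checks that before we'd make the call.
# Stream name is hypothetical; the boto3 call is shown but not executed.

def valid_target_shard_count(current: int, target: int) -> bool:
    """UpdateShardCount only accepts targets in [current/2, current*2]."""
    return current / 2 <= target <= current * 2

print(valid_target_shard_count(10, 20))  # True: doubling is allowed
print(valid_target_shard_count(10, 25))  # False: more than 2x in one call

# import boto3
# kinesis = boto3.client("kinesis")
# kinesis.update_shard_count(
#     StreamName="my-stream",
#     TargetShardCount=20,
#     ScalingType="UNIFORM_SCALING",
# )
```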
So that means you can't reshard 1,000 shards at a time; that would not work. As a result, if you anticipate high throughput, you must plan capacity ahead of time. You can only perform one resharding operation at a time, and each one takes a few seconds. So, for example, if you have 1,000 shards, it will take you about 30,000 seconds, or 8.3 hours, to double the shard count to 2,000. That gives you an idea that scaling in Kinesis is not instantaneous; it takes time. And as you can see, doubling from 1,000 shards to 2,000 shards takes over 8 hours, so we need to plan well in advance.

On top of that, there are some other limitations. They're a bit complex, and you don't need to remember them, but I've added the slides just in case you need to use Kinesis in the real world and need to know about them. So here they all are. Basically, they mean that you can't scale up too quickly and you can't scale down too quickly; there are limits that AWS imposes on you, so I won't read them out to you, and you don't need to know them for the exam. Okay? You just need to know that resharding cannot be done in parallel and that each shard takes a few seconds to reshard. So remember that for 1,000 shards, it takes about 8.3 hours to double the shard count to 2,000. That's what I want you to remember out of this. You don't need to remember all the limitations, obviously, but they're here for reference, in case you need them for your real-world application. Alright. So that's it for Kinesis scaling. I hope you liked it, and I will see you in the next lecture.
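The 8.3-hour figure above is easy to verify: doubling N shards means N sequential split operations, and the lecture assumes roughly 30 seconds per split.

```python
# Back-of-the-envelope check of the figure above: resharding is sequential,
# so doubling N shards means N split operations, one after another.
# Assumes ~30 seconds per split, the rough number used in the lecture.

def hours_to_double(shards: int, seconds_per_split: int = 30) -> float:
    """Hours needed to go from `shards` to 2*`shards` via sequential splits."""
    return shards * seconds_per_split / 3600

print(round(hours_to_double(1000), 1))  # ~8.3 hours from 1,000 to 2,000 shards
```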
7. Kinesis Security
Okay, let's talk about Kinesis security. Security is going to be a big part of your exam; it has a dedicated section, so we need to know about security for each technology. Let's start with Kinesis. Basically, we control access and authorization to Kinesis using IAM policies. We get encryption in flight using the HTTPS endpoints; that means the data we send to Kinesis is encrypted and cannot be intercepted. We can also get encryption at rest within the Kinesis streams using KMS. And if you want to encrypt and decrypt your messages on the client side, you must do so manually. It is possible, but it is harder: there are no client libraries to help you do that, and it's up to you to implement that kind of encryption yourself. Finally, if you want to access Kinesis from within your VPC, within a private network, you can use something called VPC endpoints. They're available for Kinesis and basically allow you to access it not over the public internet but over your private VPC network.

So that's Kinesis security in a nutshell. Don't worry: at the end of the course, I'll have a whole section on security, describing in depth the security of each technology. But at least this gives you some idea of how Kinesis security works. Alright, that's it. I will see you in the next lecture.
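As a sketch of the encryption-at-rest piece, here's how turning on KMS server-side encryption for a stream might look with boto3. The stream name is hypothetical, and the request is only built, not sent, so the testable part is the argument dictionary itself.

```python
# Sketch: enabling server-side encryption at rest on a Kinesis stream via KMS.
# Stream name and key alias are placeholders; the request is built, not sent.

def sse_request(stream_name: str, kms_key: str) -> dict:
    """Build the arguments for kinesis.start_stream_encryption()."""
    return {
        "StreamName": stream_name,
        "EncryptionType": "KMS",
        "KeyId": kms_key,  # a KMS key id, ARN, or alias
    }

req = sse_request("my-stream", "alias/aws/kinesis")
print(req["EncryptionType"])  # KMS

# import boto3
# boto3.client("kinesis").start_stream_encryption(**req)
```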
8. Kinesis Data Firehose
Okay, let's talk about Kinesis Data Firehose. What is it? Well, Kinesis Data Firehose is a fully managed service that does not require any administration and is near real time. You have to remember this well: Kinesis data streams are real time, while Data Firehose is referred to as near real time. The reason, as we'll see when we look at batching, is that there is a 60-second minimum latency if your batch is not full. So we don't have a guarantee that data will arrive at the destination right away; hence the "near real time" label for Kinesis Data Firehose.

So how do you make use of Firehose? Well, you use it to load data into Redshift, Amazon S3, Elasticsearch, or Splunk. You have to remember these four destinations by heart: Redshift, S3, Elasticsearch, and Splunk. The really cool thing about Kinesis Data Firehose is that, unlike Kinesis data streams, there is automatic scaling. That means if you need more throughput, Kinesis Data Firehose will scale up for you, and if you need less, it will scale down for you as well. It also supports many data formats and is able to do data conversion, for example from JSON to Parquet or ORC, but only if the destination is S3. Other data transformations can be done through AWS Lambda; for example, if you wanted to transform a CSV file into a JSON file, you could use an AWS Lambda function coupled with Kinesis Data Firehose to do that. Firehose also supports compression when the target is Amazon S3: the data being sent can be compressed with GZIP, ZIP, or Snappy directly into S3. And if you load it into Redshift further down the road, then GZIP is the only supported compression mechanism. Finally, you pay only for the amount of data that goes through Firehose. The really cool thing is that you don't need to provision Firehose in advance; you only get billed for used capacity, not provisioned capacity.
Finally, the exam will try to trick you into thinking that Spark or the KCL can read from KDF, Kinesis Data Firehose. This is not the case: Spark Streaming and the Kinesis Client Library do not read from Firehose; they only read from Kinesis data streams. I just wanted to make that very clear right now.

Now let's look at a diagram, which will probably make more sense. In the middle we have Kinesis Data Firehose, and it has a bunch of sources that can send data to it. Just as before, we can use the SDK or the KPL; instead of writing directly to Kinesis data streams, we can write directly into Kinesis Data Firehose. The Kinesis Agent is also able to write directly into Kinesis Data Firehose. We can also direct Kinesis data streams to Kinesis Data Firehose, and that is a very common pattern. CloudWatch Logs and Events also has an integration with Kinesis Data Firehose. And finally, IoT rule actions, which we'll see in this section, can also send data to Kinesis Data Firehose. So remember these sources on the left side; it's very important for you to understand how they work.

Now, as I said, Kinesis Data Firehose is able to do some transformation, and for this we use a companion Lambda function that takes the data, transforms it, and sends it back to Firehose before delivery. Talking about delivery, where do you send the data to? I just told you on the last slide; if you can close your eyes and remember, that's perfect. The destinations are Amazon S3, Redshift, Elasticsearch, and Splunk. Again, remember these four destinations; it's really important from an exam perspective. Okay, so what does the delivery diagram look like? Well, we have Firehose in the middle, and it's going to have a source. For example, my source in this example is Kinesis Data Streams. It sends that stream into Firehose, and maybe we'll do some data transformation. So remember, I said CSV to JSON.
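To illustrate the "Direct PUT via SDK" source described above, here's a minimal sketch of writing one record into a Firehose delivery stream with boto3. The stream name is hypothetical and the API call is commented out; one detail worth noticing is that Firehose treats records as raw bytes, so newline-delimiting JSON records is up to the producer.

```python
# Sketch: writing a record into Firehose with the SDK ("Direct PUT").
# Delivery stream name is hypothetical; the boto3 call is not executed here.
# Firehose records are opaque bytes, so we newline-terminate each event
# ourselves to keep the S3 output parseable line by line.

import json

def make_record(event: dict) -> dict:
    """Serialize one event as a newline-terminated Firehose record."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

record = make_record({"invoice": 536365, "country": "United Kingdom"})
print(record["Data"].endswith(b"\n"))  # True

# import boto3
# firehose = boto3.client("firehose")
# firehose.put_record(DeliveryStreamName="my-firehose", Record=record)
```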
Actually, there are a bunch of blueprints available for AWS Lambda to help you transform the data in Firehose into the format you want. Once the data is formatted and transformed, we can send the output, for example, to an Amazon S3 bucket. If S3 is your destination, it goes straight there; if Redshift is your destination, it actually goes through S3 first, and then a COPY command is issued to load the data into Redshift.

So what about the source records? Can we get the source records into an S3 bucket? The answer is yes: any source record that goes through Kinesis Data Firehose can be configured to land in another bucket of your choice. So if the exam asks, "how do we get all this source data into an Amazon S3 bucket through Kinesis Data Firehose?", well, this is a direct feature of Firehose. There's also a really good feature for transformation failures: we can archive them into an Amazon S3 bucket, and if there is a delivery failure, we can have the failed deliveries put into an Amazon S3 bucket too. What you should remember from this diagram is that Kinesis Data Firehose does not cause data loss: either the data ends up in your targets, or you keep a record of all the transformation failures, delivery failures, and even the source records, if you want to, in other S3 buckets. So remember this diagram; it's really important. That's basically your understanding of Kinesis Data Firehose.

Now we're going to look into how often Kinesis Data Firehose sends the output data to the target, and for this we need to understand the Firehose buffer. So how does Firehose work? Well, Firehose receives a bunch of records from the source and accumulates them into a buffer. And a buffer, as its name implies, is not flushed all the time; it's flushed based on rules, and those rules are based on time and size. So we need to define a buffer size, for example, 32 megabytes.
That means that if you have enough data from the source to fill a buffer of 32 megabytes, it is automatically going to be flushed. The other setting is buffer time, for example, two minutes. That means that if after two minutes your buffer is not full, it's going to be flushed nonetheless. Note that Firehose can automatically increase the buffer size to increase throughput, so it scales really nicely for you and accommodates any throughput. If we have high throughput and a lot of data going through Kinesis Firehose, the buffer size limit will be hit and it will flush based on buffer size. But if we have low throughput, usually that means the buffer size limit will not be hit; instead, the buffer time limit will be hit. For buffer time, the minimum you can set is one minute, and you can set the buffer size as low as a few megabytes.

Okay, so let's compare Kinesis data streams versus Firehose; they look similar, but they're actually not. With streams, you're going to have to write custom code for your producer and your consumer, and it's going to be real time: about 200 milliseconds of latency for classic consumers, or 70 milliseconds of latency for enhanced fan-out consumers. And you must manage the scaling yourself, so you need to perform shard splitting and shard merging to tune your cost and throughput over time. Data storage is going to be one to seven days, and you have some replay capability. You can have multiple consumers, and you can also use streams with Lambda to insert data in real time into Elasticsearch, for example. Firehose, on the other hand, is fully managed, with auto scaling. It will send data to S3, Splunk, Redshift, and Elasticsearch for you, and it has some serverless data transformation you can do with AWS Lambda. And it's near real time: the lowest buffer time, as I said, is one minute. Scaling is automated, and there is no data storage.
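The flush rule just described, whichever limit is hit first wins, can be captured in a few lines. The numbers mirror the 32 MB / two-minute example above and are purely illustrative.

```python
# Tiny sketch of the Firehose flush rule described above: the buffer
# flushes when EITHER the size limit or the time limit is reached,
# whichever comes first. Limits mirror the example (32 MB / 120 s).

def should_flush(buffered_mb: float, elapsed_s: float,
                 size_limit_mb: float = 32, time_limit_s: float = 120) -> bool:
    return buffered_mb >= size_limit_mb or elapsed_s >= time_limit_s

print(should_flush(32, 10))    # True: high throughput hits the size limit
print(should_flush(0.5, 120))  # True: low throughput hits the time limit
print(should_flush(1, 30))     # False: neither limit reached yet
```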
So Firehose does not replay data or anything like that. You really need to see the differences between the two; to me, it's clear what they are. Finally, keep in mind that with the KPL producer, you have the option of producing into streams or Firehose; that is fine. So remember your use case: if your use case is to have applications read the data, it's probably going to be streams. But if all your use case is about is sending data into S3, Redshift, Elasticsearch, or Splunk, then maybe Firehose is the right fit for you. And remember the real-time versus near-real-time constraints. So that's it for Firehose. I hope you liked it, and I will see you in the next lecture.
9. [Exercise] Kinesis Firehose, Part 1
Let's begin by building out the core of many of the systems we're going to build throughout this course. We will configure a Kinesis Data Firehose stream to capture data from a server log generated on an Amazon EC2 instance and publish it to an Amazon S3 data lake. Later in the course, we'll use that S3 data lake in many different ways; as you can see, we'll analyze it using a multitude of other tools. But the first step is to create it, so let's dive in and get started.

Sign in to your AWS console; hopefully you've already created an account, and if not, go ahead and do that. We're going to go to the Kinesis service, so just type that in, and that will bring us to the Kinesis dashboard. Let's hit the Get Started button. We want to create a Kinesis Firehose delivery stream, so let's click on "Create Delivery Stream." And we need to give this stream a name; let's call it PurchaseLogs, just like this. Okay, now we'll choose Direct PUT as the source. That's because we're going to use the Kinesis Agent to publish data into this Firehose application, so we're actually going to be putting data directly into Firehose, not reading it from some other intermediate stream. You can see it here: the Amazon Kinesis Agent is one of the examples they give for direct input, so that makes sense. Let's go ahead and hit Next.

Optionally, Firehose can do custom transformations of your records as you load them in. We're not actually going to do this, because our source is already pretty well structured in CSV format. But if you wanted to, there are options for doing your own custom transformations here, and there are automatic routines that can transform things into JSON for you. Again, we don't need that; CSV data is totally fine for our purposes. So let's hit Next. Now we need to tell it where the Firehose delivery stream is going to put things, and we want to put it in Amazon S3.
So the default setting here of Amazon S3 makes sense; we'll keep that selected. However, we have to say what bucket to put it in, so we're on to creating a bucket. Let's go ahead and create a new bucket under our account here. Every bucket needs a globally unique name, so you'll have to come up with your own name for this bucket and remember what it is, because we will need to refer back to it later on. I'll just make up a name; how about orderlogs-sundog-edu? That's what I'll call mine. Again, you need a unique name for yours, so experiment until you find one that is both memorable and unique in the world. And I'll go with the default region of US East. Create that S3 bucket, and it's automatically populated here as the S3 destination for this Firehose stream. Cool. Moving on: we don't need any special prefixes here, so we'll just hit Next.

Now we need to specify some configuration options for how we're going to store things in S3. I'm going to stick with the five-megabyte buffer size. That means the incoming data from Firehose will be split up into files five megabytes in size, and that's good enough for my purposes. The buffer interval, however, I'm going to change. By default, it's five minutes, which is 300 seconds. I'm impatient; I want to be able to play with my data more quickly as I'm experimenting with it. So I'm going to set that to the minimum value of 60 seconds. And again, it's a very important thing to remember on the exam that the minimum buffer interval for Kinesis Firehose is 60 seconds; it can't go any lower than that. So if you see exam questions that say "need to deliver data on a one-second basis," Firehose is not able to do that. Firehose is only near real time; it can only give you data every minute. If you need a true real-time application, Firehose is not what you want; you want a Kinesis data stream instead. We have no need for compression or encryption here.
But it's important to realize that you have those options here. If you did want to encrypt your data at rest, all you'd have to do is check that checkbox, and it would be encrypted in your S3 bucket. And we'll go ahead and enable error logging, why not? Finally, we need an IAM role; that's Identity and Access Management, which provides the security permissions that Firehose needs to access our S3 bucket. We can just hit Create New, and we'll create a new IAM role. We'll stick with the default name here; that's fine. Open up the policy document, and you can see that by default it does all the stuff we need: we're granting permission to access our S3 bucket and both read and write to it, and that's really all we need here. It also has some Lambda permissions in case you want to use that, but they're not causing any problems, so we'll stick with this. It also has the permissions that we need for Kinesis. So basically, the default IAM policy document for Firehose gives it access to the services it might need, including Kinesis, Lambda, and S3. Fine by us. If we really wanted to be more strict, we could trim this down to just what we need for our application, but we're not too paranoid in this course. Go ahead and hit Allow.

And now we're just about ready to create our Firehose stream. Hit Next, and we'll do a final review of our settings: we're calling it PurchaseLogs, using Direct PUT, and it's going into the S3 bucket, whatever you named it; remember, you have your own unique name there. We have set the buffer conditions to five-megabyte files every 60 seconds, and we have an IAM role that was automatically defined to give this Firehose stream the permissions it needs. Go ahead and hit "Create Delivery Stream," and our Firehose delivery stream is in the process of being created. How cool is that?
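The console steps above can also be scripted. Here's a minimal boto3 sketch that builds the same configuration; the bucket ARN, account ID, and role ARN are placeholders you'd replace with your own, and the request is only built, not sent.

```python
# Sketch: scripting the console exercise above with boto3.
# Bucket ARN, account id, and role ARN are hypothetical placeholders;
# the request dict is built but the API call is not executed here.

def delivery_stream_config(name: str, bucket_arn: str, role_arn: str) -> dict:
    """Arguments for firehose.create_delivery_stream(), mirroring the console."""
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "DirectPut",  # Direct PUT, as chosen in the console
        "ExtendedS3DestinationConfiguration": {
            "BucketARN": bucket_arn,
            "RoleARN": role_arn,
            # 5 MB files, flushed at least every 60 seconds -- as in the exercise
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        },
    }

cfg = delivery_stream_config(
    "PurchaseLogs",
    "arn:aws:s3:::my-order-logs-bucket",
    "arn:aws:iam::123456789012:role/firehose_delivery_role",
)
print(cfg["DeliveryStreamName"])

# import boto3
# boto3.client("firehose").create_delivery_stream(**cfg)
```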
10. [Exercise] Kinesis Firehose, Part 2
Now that we have our Firehose delivery stream set up, we need to set up an EC2 instance to actually publish our fake order information into that Firehose stream, from which it will ultimately end up in our S3 bucket. So I'm back at the AWS Management Console here, and we're going to select EC2. Real quick, we're going to launch a new instance. We're going to go with the Amazon Linux 2 AMI; not Amazon Linux, but Amazon Linux 2, because it comes with more AWS tools pre-installed for us, which makes life a little bit easier. So go ahead and hit Select on the latest Amazon Linux 2 AMI. We'll stick with the t2.micro instance, which should be eligible for the free tier; if your account is young enough, this shouldn't cost you anything. Go ahead and hit Review and Launch. We're fine with the defaults, so we press the Launch button once more. We do need a key pair, however, so we can actually log into this thing. So I'm going to create a new key pair; we'll call it BigData, for lack of a better name. Download the key pair and make sure you keep that file safe; there's no way to get it back once you've downloaded it. So we have a new BigData.pem file that we can use to connect later on. Go ahead and click "Launch Instances," and we'll just have to wait a couple of minutes while that spins up. Go back to "View Instances" and wait for the instance to come out of pending status until it's actually running and ready for us to log in. So I'll pause this and come back when that's ready.

Okay, that was actually pretty quick. Now, to connect to our new EC2 instance, all we need to do is select it and click the Connect button to get instructions. You can see that if you're on Mac or Linux, it's pretty straightforward: you can just use the SSH command there, exactly as they say, after making sure you have the correct permissions on that .pem file that you downloaded. I'm on Windows, though, and if you are too, you probably want to use the PuTTY program.
Go ahead and download that if you need it. It's a terminal application for Windows, and clicking on that link will tell you more details about how to connect using PuTTY. But I'll walk you through it right now in case you're new to it. The first thing we need to do is convert our .pem file into a .ppk file that PuTTY can use for SSH. To do that, I need to launch the PuTTYgen utility that ships with PuTTY. I go down to PuTTY here in my Start menu, select PuTTYgen, hit Load, and select the .pem file that I just downloaded; to actually see it, I need to change the filter to All Files. And there's BigData.pem. Open that, okay, and hit Save Private Key, and that will convert it to the .ppk format. We'll call that BigData as well. So now I have a BigData.ppk file that will make PuTTY happy.

Let's close out of PuTTYgen and open up the PuTTY application itself. Now I'm going to paste in the host name that EC2 gave me; just copy that from the Connect window here and paste it into the host name field. Then we go into the SSH area here, open that up, click on Auth, and select the .ppk file that we just made, then go back to the main Session window. We probably want to save this so we can get back to it easily; we'll call it AWS Big Data and hit Open. And we should be able to connect. Yes. We want to log in as ec2-user, and in we are. Cool.

So the first thing we need to do is install the Kinesis Agent here so we can actually send this log data into the Kinesis Firehose. To do that, we're just going to type "sudo yum install aws-kinesis-agent", and make sure you spell everything right; it's easy to spell "kinesis" wrong. Off it goes, and we have the Kinesis Agent installed. Cool. So now we need to get our fake log data that we're going to be playing with throughout the course. Let's go ahead and download that from our server here. We can just type "wget http://media.sundog-soft.com/AWSBigData/LogGenerator.zip".
Pay attention to capitalization here, folks, because it does matter; one little typo and it won't work. All right, we have our LogGenerator.zip file. Let's go ahead and unzip it: "unzip LogGenerator.zip". You can see that we have both the Python script that I wrote, which will generate fake log data into the /var/log directory of our instance, and also the source log data, the OnlineRetail.csv file. Let's change the permissions on that script so we can actually run it: "chmod a+x LogGenerator.py". And let's take a quick look at what we have here. If you want to examine that script, you can. You don't really need to concern yourself with how it works, but what it does is read however many lines you want from the OnlineRetail.csv file and basically create a bunch of CSV files inside the /var/log/cadabra directory for you. And if you want to take a peek at the log data itself, we can do that as well. You can see it's in the OnlineRetail.csv file, and it contains, as it says, an invoice number, a stock code, a description, a quantity, the invoice date, the unit price, a customer ID, and a country for every order. This is a real e-commerce data set from some UK retailer that sells crafts, basically. So real data is here, and it comes with real-world problems too, as you'll see later in the course. Hit Q to get out of there.

All right, so let's go ahead and create the log directory that we're going to put everything in: "sudo mkdir /var/log/cadabra". And again, make sure you spell that right; it's very easy to misspell "cadabra," because it's not a real word anyway. And let's go ahead and configure the Kinesis Agent to write our stuff into there. So let's say "cd /etc/aws-kinesis". And here is the actual Kinesis Agent configuration file, so we can say "sudo nano agent.json" (nano is our editor).
And this is the configuration file for the Kinesis Agent.
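For reference, here's a sketch of what an agent.json for this exercise might look like, generated from Python so we can sanity-check it as valid JSON. The endpoint region, file pattern, and delivery stream name follow this exercise's setup and are assumptions; adjust them to your own region and names.

```python
# Sketch: generating a minimal /etc/aws-kinesis/agent.json for this exercise.
# Endpoint, file pattern, and stream name are assumptions based on the setup
# above (us-east-1, /var/log/cadabra, PurchaseLogs) -- adjust to your own.

import json

agent_config = {
    "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
    "flows": [
        {
            # Tail every log file the generator writes into /var/log/cadabra
            "filePattern": "/var/log/cadabra/*.log",
            # ...and ship each line to our Firehose delivery stream
            "deliveryStream": "PurchaseLogs",
        }
    ],
}

print(json.dumps(agent_config, indent=2))
```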