Amazon AWS DevOps Engineer Professional – Monitoring and Logging (Domain 3) Part 2
August 27, 2023

4. Kinesis – Data Streams Overview

AWS Kinesis is something that can come up in the exam quite a bit, but not around the technical details of Kinesis; it is more around the features, the capabilities, and the ability to give you, for example, real-time insights and real-time dashboards if you pipe all the data in real time to somewhere else. Okay, so let's have a look at Kinesis in this theory lecture, and then we'll do a small hands-on in the next lecture. So, Kinesis is a managed alternative to something called Apache Kafka. Apache Kafka is a streaming system, and Kinesis can be a managed alternative to it. It doesn't have the same API, but it has the same real-time streaming capability.

It's great if you want to pipe in real-time application logs, metrics, IoT data, click streams and so on. It's also great if you want to do real-time big data; it's compatible with stream processing frameworks such as Spark, NiFi, et cetera. And when you send your data to Kinesis, it will be automatically replicated to three Availability Zones. As such, it is a highly available service. There are three kinds of services within Kinesis. There is Kinesis Streams, or Kinesis Data Streams: this is low-latency streaming ingest at scale. You have Kinesis Analytics to perform real-time analytics on streams using the SQL language, and then finally Kinesis Firehose to load streams into S3, Redshift, and Elasticsearch.

And the ones that will come up most often at the exam will be Kinesis Streams and, most especially, Kinesis Firehose, so we'll have details on these. So what does Kinesis look like? Here's the Kinesis service, and we have Kinesis Streams. It will onboard a lot of data coming from various sources such as click streams, IoT devices, metrics and logs. They will be sending the data using the SDK to Kinesis Streams. And then maybe we want to do some analytics in real time, for example some aggregations, windowing, joins, et cetera, ensuring that, for example, we can get metrics or insights right away from within the Kinesis service in real time.

And then maybe you want to deliver the data somewhere. So we'll use Kinesis Firehose, and Kinesis Firehose, for example, could send this data to an Amazon S3 bucket. This is just one example. We could, for example, not use Kinesis Analytics; we could use Kinesis Firehose directly after Kinesis Streams, or just use Kinesis Firehose on its own. So we have lots of different combinations, but this is one example that illustrates what we can do with Amazon Kinesis. So let's look into streams now. Streams are divided into shards or partitions, and in Kinesis, well, it's called a shard.

So we have producers, and they send data to Kinesis, and the data will be distributed across different shards. So maybe we'll send to Shard One, Shard Two or Shard Three, and then consumers will be reading the data directly in real time. The data retention in Kinesis is one day by default, but you can go up to seven days, so one week, and because of this data retention you have the ability to reprocess or replay data. That also means that you can have multiple applications consuming the same stream at the same time.

And this is something that's quite different from another type of pub/sub messaging system such as SQS, where typically only one application consumes each message. It's real-time processing with scalable throughput: the more shards you have, the more throughput you'll have. And once the data is inserted into Kinesis, it cannot be deleted. So this is the concept of immutability, and as such, once you send data to Kinesis it will be there for one day, up to seven days, and afterwards it will be deleted. But you cannot delete data manually. So one stream is made of multiple shards, and the billing is going to be per shard provisioned.

So you have to think in advance about how many shards you will provision, and you can have as many shards as you want. You can send messages in batch, and the number of shards can evolve over time: you can reshard, splitting shards or merging shards, if you want to increase or decrease the number of shards and therefore the overall throughput. The records will be ordered per shard. So if you have a producer sending data and a consumer reading data, every time we send data from the producer, a new record will be appended to one shard. And so the records will be ordered per shard, but not across shards.
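To make the resharding idea concrete, here is a minimal sketch using the Python SDK (boto3). The stream name is a placeholder, and UpdateShardCount is one way to change the shard count uniformly without issuing the individual split and merge calls yourself.

```python
import boto3

kinesis = boto3.client("kinesis")

# Double the shard count of a hypothetical stream; Kinesis performs the
# necessary shard splits (or merges, when scaling down) behind the scenes.
kinesis.update_shard_count(
    StreamName="demo-stream",        # placeholder stream name
    TargetShardCount=4,              # e.g. going from 2 shards to 4
    ScalingType="UNIFORM_SCALING",
)
```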

So what do the records look like in Kinesis? Well, you have a data blob; it can represent up to 1 MB, and it can be anything, it's bytes. Then you have a record key, and the key is sent along with the message to allow your producer to direct it to the correct shard. And so, if you have a hot key problem, then you'll have a hot partition problem, and that means that one shard will be overloaded. So this is something you have to look out for. And then finally, once you send it to Kinesis, Kinesis will append a sequence number, which is a unique ID for each record put in the shard. And this is added by Kinesis after ingestion.
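As a minimal sketch of what producing one of these records looks like with the Python SDK (boto3); the stream name, region, and payload below are just placeholders:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

# The record key (PartitionKey) decides which shard receives the record;
# Kinesis returns the sequence number it appended after ingestion.
response = kinesis.put_record(
    StreamName="demo-stream",                                # placeholder stream name
    Data=json.dumps({"temperature": 21.5}).encode("utf-8"),  # data blob, up to 1 MB
    PartitionKey="sensor-42",                                # same key -> same shard (watch for hot keys)
)

print(response["ShardId"], response["SequenceNumber"])
```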

So the limits you need to know going into the exam are that producers can send one megabyte per second or 1,000 messages per second of writes per shard, otherwise you will get a ProvisionedThroughputExceeded exception. And this is something you have to look out for if you have a hot shard problem. And for consumers, you can read two megabytes per second per shard across all the consumers, or do five API calls per second per shard across all the consumers. So that means that if you have, say, three different applications consuming at the same time, there is a possibility of throttling: each application may request one megabyte, the limit is two megabytes per shard, so together they would request three megabytes per shard at the same time and get throttled. The idea is that the more consumers you have, the more throttling is possible.
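Here is a minimal sketch of producer-side retry handling with boto3, assuming a hypothetical stream name. On the write side, hitting the per-shard limit raises ProvisionedThroughputExceededException, which you would normally back off on and retry:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(stream, data, key, retries=5):
    """Retry writes that hit the per-shard limit (1 MB/s or 1,000 records/s)."""
    for attempt in range(retries):
        try:
            return kinesis.put_record(StreamName=stream, Data=data, PartitionKey=key)
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Hot shard or too many writers: back off exponentially and retry.
            time.sleep(2 ** attempt * 0.1)
    raise RuntimeError("Still throttled after retries; consider splitting shards")
```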

Now, data retention: it's 24 hours of data retention by default, and it can be extended to seven days. So what can produce data to Kinesis? Well, we have the Kinesis SDK, so this is something we can use in our applications; the Kinesis Producer Library (KPL), which is a Java library; the Kinesis Agent; and CloudWatch Logs. So CloudWatch Logs can directly send data to Kinesis streams, and this is something you definitely have to know for the exam.

And we'll be doing a hands-on on this anyway to see how that works. There are also third-party libraries such as Spark, Log4J Appenders, Flume, Kafka Connect, NiFi and so on. But the ones you have to remember here are the SDK and CloudWatch Logs, which can send data to your Kinesis streams.
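As a sketch of that CloudWatch Logs integration, a subscription filter can forward a log group into a Kinesis stream. The log group name, stream ARN, and role ARN below are placeholders, and the role must allow CloudWatch Logs to call kinesis:PutRecord on the stream:

```python
import boto3

logs = boto3.client("logs")

# Subscribe a log group to a Kinesis stream; every log event matching the
# filter pattern (empty = everything) is forwarded to the stream.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/my-app",                       # hypothetical log group
    filterName="to-kinesis",
    filterPattern="",
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/demo-stream",
    roleArn="arn:aws:iam::123456789012:role/CWLtoKinesisRole",
)
```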

Now, for consumers: what can consume the stream? You have the Kinesis SDK, the Kinesis Client Library (KCL), which is a Java library, the Kinesis Connector Library, Spark, and then Firehose, a service we'll see in the next lecture that can read directly from Kinesis streams. AWS Lambda as well, if you want to chain a Lambda function to your stream, and also third-party libraries such as Spark, Log4J Appenders, Flume, Kafka Connect and so on. There's a special consumer you need to know, which is the Kinesis Client Library (KCL) that we saw in the previous slide. It uses a DynamoDB table to checkpoint offsets, and this table is also used to track the workers and share the work amongst shards. So what happens is that if you want to read your streams and shards in a distributed manner, you have your stream in here and you have a DynamoDB table that will be used to checkpoint offsets and track the work amongst the workers.

And so each Kinesis KCL app, perhaps running in Java, will be checkpointing progress into DynamoDB and consuming messages in parallel from your Kinesis stream. And what you need to remember here is that you cannot have more KCL applications than shards in your stream. So if you have six shards, you can have at most six KCL applications reading at the same time, all sharing the same DynamoDB table to share the reads. Okay, so that's it for this Kinesis Data Streams introduction. It's not everything you have to remember, but hopefully this is something you already know. And I will see you in the next lecture to talk about Kinesis Data Firehose and Kinesis Analytics. So, see you in the next lecture.
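As a quick aside before moving on: the KCL is a Java library that handles shard iteration, worker coordination, and DynamoDB checkpointing for you. If you just want to see what reading a single shard looks like at the raw SDK level (with none of that coordination), here is a rough boto3 sketch with a placeholder stream name:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Start at the oldest available record (replay); use "LATEST" for new data only.
iterator = kinesis.get_shard_iterator(
    StreamName="demo-stream",            # placeholder stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(1)  # stay well under the 5 GetRecords calls/second/shard limit
```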

5. Kinesis – Data Firehose & Analytics Overview

So now let's talk about Kinesis Data Firehose. And this is something super important for the exam, because you will use it many, many times to deliver data into different places. It's a fully managed service and there is no administration needed. And it's near real time. So Kinesis Data Streams is real time, but this is near real time: we're talking about 60 seconds latency minimum for non-full batches. And with Kinesis Data Firehose, you can load data into Redshift, Amazon S3, Elasticsearch, and Splunk. So remember this. There's automatic scaling, and you can do data transformation through AWS Lambda.

So you can change, for example, a CSV into JSON using a Lambda function. It supports compression, so if your target is Amazon S3, you can have GZIP, ZIP, and Snappy compression. And you're only going to pay for the amount of data going through Firehose. So we don't have to scale it, we don't have to provision it; Amazon will do that for us, and we pay exactly for the amount going through Firehose. So let's look at a diagram. This is Data Firehose. And what can send data to Firehose is the KPL, the Kinesis Agent, Kinesis Data Streams, CloudWatch Logs and Events, and AWS IoT. So quite a few event sources can go into Firehose.
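To make the Lambda transformation idea concrete, here is a minimal sketch of a Firehose transformation function in Python that turns hypothetical "TICKER,PRICE" CSV lines into JSON. Firehose hands the function a batch of records with a recordId and base64-encoded data, and expects back the same recordIds with a result of Ok, Dropped, or ProcessingFailed:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose transformation Lambda: turn CSV lines into JSON records."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        ticker, price = line.split(",")            # assumes "TICKER,PRICE" CSV rows
        transformed = json.dumps({"ticker": ticker, "price": float(price)}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",                        # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```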

But again, among these sources, the ones you have to remember as really important are Kinesis Data Streams and CloudWatch Logs and Events. And we can have a Lambda function do some transformations on the fly and deliver data to Amazon S3, Redshift, Elasticsearch, and Splunk. And you have to remember these destinations like the back of your hand. Okay, so this is it. And what about the difference between Streams and Firehose, in case it wasn't obvious? Well, with Streams you are going to write custom code, so you can write your own consumers and your own producers.

It's going to be real time, with about 200 milliseconds latency for classic consumers. And you must manage scaling yourself, so you need to make sure you split shards or merge shards to scale out or in based on your streaming and throughput needs. You have data storage of one to seven days, and you have replay capability and multiple consumers. And you can use Lambda, for example, as a consumer if you wanted to insert data in real time into Elasticsearch. Okay, so here the key information is real time, whereas Firehose is going to be fully managed, so you don't have to do anything. And you can send data to S3, Splunk, Redshift, and Elasticsearch.

You can do serverless data transformations with Lambda, and it's near real time. Okay? So here the lowest buffer time is 1 minute. It's automated scaling, so you don't have to do anything. And you have no data storage, so you cannot replay data or anything like this. So the big difference here is that if you are asked, okay, how do I send data to Elasticsearch in real time, you have to use Kinesis Data Streams with a Lambda function. If you're talking about loading data into Elasticsearch near real time, then Firehose is going to be a great candidate. Okay, finally, let's talk about Kinesis Data Analytics at a high level.

So you can perform real-time analytics on Kinesis streams using SQL, and this is called Kinesis Data Analytics. So we have auto scaling, it's managed so there are no servers to provision, and it's continuous, in real time. And you only pay for what actually goes through Kinesis Data Analytics. You can create streams out of the real-time queries and pipe these streams, for example, into either Kinesis Streams or Kinesis Data Firehose. Okay, so that's it for this theory lecture. In the next lecture, we'll be going through creating a Kinesis Data Firehose and playing with it. So see you in the next lecture.

6. Kinesis – Data Firehose Hands On

So let's go to the Kinesis console, and in Kinesis, what we're going to do in this hands-on is get started and create our first Data Firehose. The reason is I want to show you directly how we can deliver data into S3. So I'm not going to create a data stream; I'm just going to create a delivery stream, and this will be a Kinesis Data Firehose delivery stream. So here we go. I'm going to call it demo-firehose. And then you need to choose a source: it could be Direct PUT or other sources. So remember, the other sources can be AWS IoT, CloudWatch Logs, or CloudWatch Events. So we'll just keep it as Direct PUT or other sources, but I could also choose a Kinesis stream if I had a stream already created.

So I'm going to choose Direct PUT or other sources. Excellent. So here it's saying that to do Firehose puts, you can use the PutRecord API or PutRecordBatch to send data directly to Firehose. You can also use the Amazon Kinesis Agent, use IoT to create IoT rules and send data directly into Firehose, or use CloudWatch Logs. And we'll see this in the hands-on when we go over CloudWatch Logs and CloudWatch Events, where we go over the events as well and see how that works. So, okay, let's click on Next. And do we need to process the records? Do we need to have a Lambda function look at these records and convert them into something before delivery? For now, we can say no, but if you enable it, you will need to specify the Lambda function for that.
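As an aside, here is roughly what those PutRecord and PutRecordBatch calls look like from the Python SDK (boto3), assuming the demo-firehose delivery stream we are creating and purely hypothetical payloads:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Single record (PutRecord)
firehose.put_record(
    DeliveryStreamName="demo-firehose",
    Record={"Data": json.dumps({"ticker": "AMZN", "price": 135.2}).encode() + b"\n"},
)

# Several records in one call (PutRecordBatch)
firehose.put_record_batch(
    DeliveryStreamName="demo-firehose",
    Records=[{"Data": json.dumps({"ticker": t, "price": p}).encode() + b"\n"}
             for t, p in [("MSFT", 410.0), ("GOOG", 142.7)]],
)
```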

Back to the record processing option: I'm going to disable it for now. And do I want to convert the record formats? For now, I'm just going to say no, but you could definitely convert between different formats using a Lambda function or this option in here. So we'll click on Next. And now we have to choose a destination: where do we want to send the data to? Amazon S3, Redshift, Elasticsearch Service, or Splunk. And for each specific destination, you will need to specify some parameters in here. So I'm just going to choose S3, and I'm going to choose a bucket; I'm going to choose my AWS DevOps course bucket, and this will be great. And for the prefix, I'm going to say kinesis-data-firehose, and that should be good.

Okay, and for the error prefix, I'm going to say kinesis-data-firehose-errors; this is only used if the delivery doesn't work. Okay, click on Next. The buffer size is how much data I want to accumulate before sending it into S3. So I can say write to S3 every five megabytes, and it can go between 1 and 128 megabytes, so you can have a large buffer; I'll just use a smallish value, for example 32. And the buffer interval is how often I want to write to S3. The minimum I can set is every 60 seconds, but I can set it all the way to 900 seconds. And I like to deliver every minute, so I'll just have a buffer interval of 60 seconds. But the larger this buffer interval, for example 120 seconds, the larger the file and the fewer API calls Kinesis Data Firehose will be making to S3.

So maybe we'll just keep it at 120 seconds. Here we go. Do I want to have S3 compression and encryption? Yes, I'll have GZIP, for example, and I would love to encrypt my data using a KMS key, for example my own KMS key or the default AWS-managed S3 key. Excellent. Okay, now error logging: yes, we want to enable error logging. We could tag our Data Firehose delivery stream. And then finally we need an IAM role for this, so we can create a new IAM role in here. And this is going to take me straight into IAM. I'm going to call this role firehose_delivery_role. Excellent. And it will allow Firehose to write to my S3 bucket.
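For reference, here is roughly how the same delivery stream could be created from the Python SDK (boto3); the account ID, bucket name, and role ARN are placeholders standing in for whatever you set up in the console:

```python
import boto3

firehose = boto3.client("firehose")

# Equivalent of the console steps above: Direct PUT source, S3 destination,
# 32 MB / 120 s buffering, GZIP compression.
firehose.create_delivery_stream(
    DeliveryStreamName="demo-firehose",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose_delivery_role",
        "BucketARN": "arn:aws:s3:::aws-devops-course-bucket",
        "Prefix": "kinesis-data-firehose/",
        "ErrorOutputPrefix": "kinesis-data-firehose-errors/",
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 120},
        "CompressionFormat": "GZIP",
    },
)
```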

Now, let's click on Next. Everything looks good, and we'll create this delivery stream. And now we're all set. It can take a minute before the status is updated, so let's wait a little bit until this is done. Okay, so now the delivery stream has been created, and what we can do is just send demo data to see how that works. So we'll select it, test with demo data, and start sending demo data to the delivery stream. And so what should happen is that the data is being delivered, and as the data is accumulated by Kinesis Firehose, it will be flushed into S3 every two minutes or every 32 megabytes, whichever comes first. So what I'll do is go into my destination bucket right now, and I should be waiting until I see a Kinesis folder in here.

And that should happen in two minutes, so let me wait a minute. And my data has just arrived under this kinesis-data-firehose2019 prefix. So I guess with the prefix I should have added a slash so that it goes into the right directory. So let me just correct this right now: I can edit my stream, scroll all the way down, and add a trailing slash to the prefix. This way it will not produce the same little error I have right now. Okay, let's save this. We need to update the role, so I'll click on Create new or update, because we are delivering to a new prefix. This will be done very quickly; click on Allow. And here we go, now we should be good, and save it.

So now we should have a correct prefix. But nonetheless, the data has been delivered into S3, and if we go in there, we see our file that has been delivered by Firehose. And if we download it and open it, we start seeing every single little ticker record that was sent by the sample demo data into this giant file. And so we get a lot of information in here around what's going on with these ticker symbols. And that's just dummy data produced by the Kinesis Data Firehose service on its own. But that's it: we have demoed how Kinesis Data Firehose works. When you're ready, you can just start some demo data and then stop when you're happy with it.

And you also have some nice capabilities around monitoring, to see how much incoming data was coming in, how many records, how many times things were delivered, and how long it took. So it took 121 seconds for a batch to be full and then delivered to S3, and so on. Some nice information. And then you can go into the Amazon S3 logs, and if you wanted to see directly from CloudWatch Logs whether there were any delivery errors, they would appear right here. So that's it for this one lecture. But now we have a Kinesis Data Firehose ready, and we can see how it sends data to Amazon S3. And we'll be using this, actually, in future lectures, so make sure you keep it around. All right, that's it. I will see you in the next lecture.
