1. AWS Well-Architected Framework
Hey everyone, welcome back. In today’s video, we will be discussing the Well-Architected Framework. Now, before we do that, I would like to share a small story that relates to the Well-Architected Framework. So, up until 10th grade, I had never owned a phone. My parents were quite strict, and when I got into college, and the college was quite far, like one and a half hours one way to get there, my parents finally allowed me to get my first mobile phone. I was pretty excited, like, which one should I really buy? And I had absolutely no idea. In general, when you buy your first phone, you want to make sure that it is the best in its class while also being within your budget. So I went through a few blog posts that had specific pointers: all right, if you want to buy a phone within a budget of $200, just make sure it has the following things in place. And you simply check the boxes. If you just follow it, you ensure that at least you’ll have a good phone.
And you will not end up with a cheap phone that stops working after a few months. That blog with the checklist really helped me get a good phone, and that was the first Motorola phone I used; I’ve continued to use Motorola phones since then. Anyway, this was just a small story that we can relate to the Well-Architected Framework. Typically, whenever you are designing your architectures in AWS, you need to make sure that they follow best practices, and to help ensure that, AWS has constructed the Well-Architected Framework. Now, this framework is basically designed around five pillars, which are operational excellence, security, reliability, performance efficiency, and cost optimization. If you make sure that whatever architecture you design on AWS follows the operational procedures and the pointers within each of these design pillars, then your architecture will not only be scalable, it will also be secure, cost-optimized, and good in terms of performance. So it’s like a framework; you just have to look into what the framework says and apply the same within your architecture design. Now, typically, this matters if you are an enterprise customer, and I have worked in organisations that have around 5,000 servers and are big AWS enterprise customers.
And typically, the AWS support team will come to your organisation, go through the entire architecture, and review it against the five pillars of the Well-Architected Framework. Then they’ll send you a PDF document outlining which areas of the Well-Architected Framework you’re missing. Let’s say you are lacking on the security side: what are the security components that are defined in the Well-Architected Framework but that your organisation is not implementing? That document really helps ensure that your architecture, and the way you are dealing with your applications on AWS, follows the best practices and standards that AWS recommends. So, with this, let’s go ahead and understand each of these five pillars. Now, the first pillar is Operational Excellence. Operational excellence is concerned with operating and monitoring systems in order to deliver business value. This pillar contains various design principles. The first pointer is to perform operations as code. Then you have annotate documentation, and make frequent, small, reversible changes. This is very similar to the Agile methodology. You also refine operations procedures regularly, anticipate failure, and learn from operational failures. Let me give you a few real-world examples that relate to operational excellence. In one of the organisations that I was working with, there was no “perform operations as code”; everything there was completely manual. All of the servers and services were launched and managed by hand. And one evening, one guy just terminated a production server. He did not even realise it; he was planning to terminate some other server.
But he was just chatting around, and without realising it, he terminated a critical production server. The only saving grace was that the production server had been launched just four or five hours earlier. But that is dangerous; what if he had deleted the database? This is why it is critical to perform operations as code. Instead of doing everything manually, you can make use of infrastructure as code, like CloudFormation or Terraform. What happens there is that you get the power of a pull request and approval process, similar to what you typically have in your application code management. So, let’s say I want to delete or terminate a server.
I’ll write that inside the code. I’ll send that code as a pull request to my manager. Once he approves it, then I’ll go ahead and apply that change to the production environment. So, doing things through infrastructure as code is very important and is also one of the design principles of the Operational Excellence pillar. One more important part here is anticipating failure. And related to that is the last point: learn from operational failure. This is like a postmortem or RCA. Once the application has failed and you have recovered, you learn, you do an RCA (root cause analysis), and you define the things that you will do to ensure that this does not happen again. But one of the good practices is to also do a premortem. One methodology is the postmortem, where you learn after failure, and the second is the premortem, where before the failure occurs, you design or test various failure scenarios and learn beforehand. So let’s say you have an application in production. You investigate the various failure scenarios that could occur and, through team discussions and various other methods, figure out how you can ensure that the worst scenario does not occur. As a result, both premortems and postmortems are significant. So that’s about the first pillar, which is operational excellence. This is a very high-level overview. We could actually create a dedicated course on each of the pillars, but since this is not a dedicated course, we’ll just have a high-level overview so that we understand the Well-Architected Framework. Now, pillar two is security. Security is very important, and this pillar basically focuses on protecting information and systems.
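To make that pull-request-driven workflow concrete, here is a minimal Python sketch of the plan, review, and apply cycle that infrastructure-as-code tools give you (Terraform plans, CloudFormation change sets). The resource names and the state format here are hypothetical simplifications, not any real tool’s API.

```python
# Minimal sketch of a plan -> review -> apply workflow, similar in spirit to
# Terraform plans or CloudFormation change sets. Everything here (resource
# names, state format) is a hypothetical simplification for illustration.

def plan(current: dict, desired: dict) -> list:
    """Diff the current state against the desired state into proposed actions."""
    actions = [("terminate", name) for name in current if name not in desired]
    actions += [("launch", name) for name in desired if name not in current]
    return actions

def apply_plan(current: dict, desired: dict, actions: list, approved: bool) -> dict:
    """Apply the actions only after explicit approval (the merged pull request)."""
    if not approved:
        raise PermissionError("change plan has not been approved")
    state = dict(current)
    for action, name in actions:
        if action == "terminate":
            del state[name]
        else:
            state[name] = desired[name]
    return state

current_state = {"web-1": "m5.large", "db-1": "r5.xlarge"}
desired_state = {"web-1": "m5.large"}  # db-1 is proposed for termination

change_plan = plan(current_state, desired_state)
print(change_plan)  # [('terminate', 'db-1')] -- this is what the reviewer sees
new_state = apply_plan(current_state, desired_state, change_plan, approved=True)
print(new_state)    # {'web-1': 'm5.large'}
```

The point is not the toy diff logic but the shape of the workflow: the plan is reviewable text, and nothing touches production until it is approved.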
Now, I’ve been working exclusively on security for the past five years, and from what I’ve seen, security is one of the lowest priorities even in large corporations, unless and until they are compromised. Once they get compromised, security suddenly becomes a very high priority. I’ll give you an example. I got a call from the CEO of a very big organisation in India, and I went for a meeting. I then realised that they were forming a security team. And I was like, why suddenly? They had been operational for the past six years, and it was found that some of their critical production servers were compromised. They had millions of customers, but they never had a security team and never considered security a priority. And hence, pillar two, security, is extremely important. One of the first important principles within this pillar is implementing a strong identity foundation. This is very important. Again, I’ll give you an example of a startup for which I was doing a consultation and which was also compromised. When we found out what had happened, it turned out that some people had left the organisation about six or seven months back. However, their users were still present, and one of those accounts, belonging to a person who had left the organisation, was still being used to access the AWS account for various activities. This is the reason why you should have a very strong identity foundation; single sign-on is what a lot of organisations prefer. You also need to have security at all the layers. For example, in addition to having a security policy at the network ACL level and then at the security group level, you can also have a HIDS (host-based intrusion detection system) and an IPS at the host level. So even if someone by mistake launches a machine in a public subnet with a permissive security group, you still have a HIDS agent as well as an IPS within your system, which can protect against various attack vectors. You must also protect data in transit and at rest. This typically happens with the help of encryption: TLS for data in transit, and something like AES for encryption at rest.
Another important principle is to keep people away from data. Avoid giving random access to developers within the production environment. Instead, form a central log management system so that all the logs from your production go to that central system, and do away with giving developers access to the production system. This is very important as far as security is concerned. The third pillar here is reliability. Reliability basically focuses on the ability to prevent, and quickly recover from, failures in order to meet the business’s and customers’ needs. I have a very funny but nightmarish real-world example for that. In one of the organisations that I have been working with, they used to take a database backup of a critical production server on a weekly basis, as far as I remember. And suddenly, the data centre where the production server was hosted went down, due to which the production server went down. Now, the backups were stored in a centralised storage system, so the backups were still present. So the management decided, all right, let’s launch that server in the backup data centre that we had. And when the DB team tried to restore that backup, it never worked. It was realised that the backup procedure itself was incorrect; all the backups that were present were corrupted, and the new server could not be restored from the backup. And that is where it was realised that testing the recovery procedure is extremely important. You need to make sure that whatever recovery procedure you are formulating, or whatever recovery procedure is present within your organisation, is tested so that it just works. You should also be able to automatically recover from failure. This is possible with the help of various AWS services, such as Auto Scaling and others.
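The lesson about testing recovery can be sketched in a few lines of Python: a backup only counts once you have actually restored it somewhere and verified the result. The file names here are illustrative; in practice this drill would restore a real database dump on a schedule.

```python
import hashlib
import os
import shutil
import tempfile

def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def take_backup(source: str, backup: str) -> str:
    """Copy the data aside and return the digest of the original."""
    shutil.copyfile(source, backup)
    return sha256(source)

def verify_restore(backup: str, expected_digest: str) -> bool:
    """Actually restore the backup to a scratch location and prove that the
    restored copy matches the original. An untested backup is just hope."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = os.path.join(scratch, "restored.db")
        shutil.copyfile(backup, restored)
        return sha256(restored) == expected_digest

# Demo with illustrative file names; a real drill would restore a DB dump.
with tempfile.TemporaryDirectory() as d:
    source = os.path.join(d, "prod.db")
    backup = os.path.join(d, "prod.db.bak")
    with open(source, "w") as f:
        f.write("critical production data")
    digest = take_backup(source, backup)
    print(verify_restore(backup, digest))  # True -- run this on a schedule, not once
```

Had the organisation in the story run something like `verify_restore` weekly, the corrupted backups would have been caught long before the data centre went down.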
Another principle is to stop guessing capacity. This can be accomplished using Auto Scaling as well as Lambda-based automation. The next pillar here is performance efficiency. This is, once again, a critical pillar in most organisations. I used to have one colleague in the application team, and even for the testing environment, like for development, he used to launch m5.2xlarge instances. And when you looked at the CPU utilisation, it was hardly one or two percent, so it was really not at all performance optimised. This is where performance efficiency really comes into play, because if your applications are not performance efficient, you will lose on a cost basis. And this is the reason why, generally, in this pillar, using serverless architectures is quite preferred. Serverless helps you not only reduce your cost, but you also don’t really have to worry much about handling capacity, because serverless can scale based on the requests that you might have. Various services can help here: Lambda is one, API Gateway is another, and load balancers and various other services can also help you. The last pillar of the Well-Architected Framework is cost optimization. This is one of the initiatives that a lot of organisations typically undertake because their AWS running costs are very high. Within this, you must ensure that whatever services and servers you have are fully utilised. You don’t want a c4.2xlarge server to have 2 or 3% CPU utilisation. So you need to make sure that your performance and your efficiency are at their best. Now, one of the important parts over here is to stop spending money on data centre operations. Data centres are really expensive. So if you have a data centre, try and move it to AWS; it will really help you decrease the cost. One more important point is to use managed services.
For example, one of the startups I’ve been working with used to use MongoDB. They had a MongoDB cluster of five servers, and those five servers used to be quite big, which meant a very high cost for that organisation. So later they moved completely to DynamoDB instead of MongoDB, and it really helped them bring down the cost. This is where it’s better to use managed services, because if you don’t use managed services, you’ll have to manage the cluster, high availability, and a variety of other things yourself. So all of those aspects are part of cost optimization.
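The MongoDB-to-DynamoDB story boils down to simple arithmetic: a self-managed cluster costs instance hours plus the operations time nobody bills for, while a managed service bills per request and per GB stored. Here is a rough Python sketch of that comparison; all the numbers are illustrative, not real AWS prices.

```python
HOURS_PER_MONTH = 730  # commonly used approximation (365 * 24 / 12)

def self_managed_monthly_cost(nodes: int, instance_price_hr: float,
                              ops_hours: float, ops_rate: float) -> float:
    """Self-managed cluster: instance hours plus the operations time spent
    on patching, backups, replication, and failover drills."""
    return nodes * instance_price_hr * HOURS_PER_MONTH + ops_hours * ops_rate

def managed_monthly_cost(request_millions: float, price_per_million: float,
                         storage_gb: float, storage_price: float) -> float:
    """Managed, pay-per-request service (DynamoDB-style billing)."""
    return request_millions * price_per_million + storage_gb * storage_price

# Illustrative numbers only -- not real AWS prices.
mongo_cluster = self_managed_monthly_cost(nodes=5, instance_price_hr=0.40,
                                          ops_hours=40, ops_rate=50)
dynamodb = managed_monthly_cost(request_millions=100, price_per_million=1.25,
                                storage_gb=200, storage_price=0.25)
print(round(mongo_cluster, 2), round(dynamodb, 2))  # 3460.0 175.0
```

The exact figures will differ per workload; the structural point is that the ops-time term in the self-managed formula disappears entirely when you move to a managed service.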
2. AWS Personal Health Dashboard
Hey everyone, how is it going? In today’s video, we will be discussing the AWS Personal Health Dashboard. The Personal Health Dashboard provides you with notifications as well as an overview whenever there are errors or anomalies in the AWS services. Now, we also have something called the AWS Service Health Dashboard, and I’m sure that many of you have used it. If not, the Service Health Dashboard basically gives you an overview of whether a service is working correctly or not. I’ve spent the last five years working in enterprises that use AWS, and there are a lot of instances where a production service goes down due to a backend AWS issue. So it’s not the case that if you’re using AWS, you are 100% safe; AWS also has its own issues. So whenever your production goes down or there’s some networking issue, you need to verify whether it’s an AWS issue or an issue on your side. That was typically done with the help of the Service Health Dashboard. Within the Service Health Dashboard, you’ll find information about various services, as well as whether or not each service is operational.
Typically, this is organised by region: you have North America, South America, Europe, and Asia Pacific. As you can see, this is a very high-level overview, and it is not a personalised dashboard; it is a global dashboard that AWS provides. And since it is global, there are a lot of things, like event-driven operations, that you cannot perform with it. Due to that, AWS really recommends you get a personalised view of the AWS services with the help of your Personal Health Dashboard. So let’s go ahead and understand more about that. The AWS Personal Health Dashboard displays issues that are impacting your resources, are going to potentially impact a service, or are already impacting a service that you are using in your AWS account. This is how the dashboard looks: you see sections like open issues, scheduled changes, and other notifications, and it will give you the list of open issues that are related to your environment within AWS. So let’s go ahead and look into how exactly it might look in the management console. I’m in my management console over here, and here you have something called a bell icon. Simply clicking on the bell icon will bring up a list of alerts. You have open issues, scheduled changes, and other notifications. Typically, if there are no notifications, you will not see an orange mark over here; if there are any, you will be able to see them from here. So let’s click here and select the “view all alerts” button. Within this section, you have a list of open issues, a list of scheduled changes, and other notifications. So, let’s say you’re running a production server and you suddenly encounter some networking issues.
Now, you are not sure whether it was due to the changes that you made last evening or whether it is due to some AWS component on their side. So the first thing that you would typically check is your Personal Health Dashboard, to see whether the open issues show certain AWS-side failures that are occurring, due to which your component might be down. This is something that you will be able to see over here. Now, there is also something called an event log. Even though the dashboard only shows the last seven days, the event log can contain data for up to 90 days. So you will be able to see various pieces of information over here. All of the events you see here fall under the category of “issue”. So let’s click on an operational issue, and it will tell you what exactly the issue was, in which region it was, what the start time was, and what the end time was. In most cases, no affected entities are discovered within the affected resources. But in the event that your EC2 instances are affected, you will be able to see the list of instances within your environment that are affected by this specific issue. As a result, the SysOps or SRE team can quickly determine which resources are impacted by the AWS-side operational issue. Now, one more powerful thing about the Personal Health Dashboard is notifications. On the top right, you have an option to set up a notification via a CloudWatch event.
Let’s click here and look into what exactly this is. So this is the CloudWatch rules page. Let us now assume that you want to receive an email. You have three teams: one for SRE, one for SysOps, and one for security. Whenever there is a problem with security services, you want the security team to be notified. If there is a compute-related problem, you want the SysOps team to be notified. And whenever there is an issue with something like API Gateway or similar services, you want the SRE team to get alerted. All of those can be done through CloudWatch. Now, from the service name, we need to select “Health”. So this is the Health service, and instead of listing all the events, what you can do is specify specific Health events. So let’s say that there is an issue related to IAM. Now, IAM is identity and access management, and it is generally managed by the security team. Within IAM, you can specify the event category. Say I want to only get events related to an issue; like, whenever there is an issue related to the IAM service, the security team needs to be notified. There are event type codes here as well, with various codes related to API issues, operational issues, SAML-related issues, and federation-related issues. So I’ll say any event type code and any resources over here. Once you’ve selected that, you can set a target on the right side. If I click on a target over here, there can be various targets; it can be an SNS topic.
Now, an SNS topic in turn can be integrated with email functionality, where within the SNS topic you can specify your topic name, which is associated with your security team. Or you can even have a Lambda function here, which becomes quite powerful. You have a Lambda function, and you can put logic within it. Let’s say there is an EC2 instance that is marked for host failure. You can have a Lambda function that will automatically restart that EC2 instance so that it migrates from the potentially failing host to a completely new host within that availability zone. So Lambda is also one of the important targets that the CloudWatch event source provides. Typically, within the organisations with which I’ve been working, we have various teams, and as per the use case that we decided on, any issue related to services that the security team handles goes to a separate SNS topic. That SNS topic is integrated with the email address of the security team, so whenever there is a health issue related to security services, only the security team will get an alert. If there is a health issue related to a compute service, only the SysOps team would get an alert, and so on. This is why the Personal Health Dashboard becomes so important; the Service Health Dashboard does not provide you with the same level of flexibility as the Personal Health Dashboard.
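As a sketch of what such a rule looks like under the hood, the following Python builds the event pattern that matches AWS Health “issue” events for the IAM service, which is the same pattern the CloudWatch Events console generates for you. The rule name and SNS topic ARN in the commented-out boto3 calls are hypothetical examples, not values from the video.

```python
import json

# Event pattern matching AWS Health "issue" events for the IAM service, in
# the format that CloudWatch Events / EventBridge rules expect.
pattern = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["IAM"],
        "eventTypeCategory": ["issue"],
    },
}

event_pattern = json.dumps(pattern)
print(event_pattern)

# With AWS credentials configured, the rule and an SNS target for the
# security team could be wired up roughly like this (names are hypothetical):
#
#   import boto3
#   events = boto3.client("events")
#   events.put_rule(Name="iam-health-issues", EventPattern=event_pattern)
#   events.put_targets(
#       Rule="iam-health-issues",
#       Targets=[{"Id": "security-team",
#                 "Arn": "arn:aws:sns:us-east-1:123456789012:security-alerts"}],
#   )
```

Each team would get its own pattern (different `service` lists) and its own SNS topic, which is exactly the per-team routing described above.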
3. AWS Pricing Model
Hey everyone, and welcome back. In today’s lecture, we will speak about the fundamentals of pricing as far as the AWS environment is concerned. So let’s go ahead and understand more about this. AWS has more than 50 services, ranging from infrastructure as a service to platform as a service to software as a service. In the earlier lectures, as I hope you already know, we have discussed quite a few details related to each of these platforms. Now, for each of the AWS services, we generally pay for the exact amount of resources that we use. This is what you call a “pay as you go” model. So if you are running a server for, let’s say, 1 hour, you will pay for 1 hour and not for the entire month. In a non-cloud-based environment, by contrast, you generally have a monthly subscription: you pay for a server for one month, and even if you only use it for an hour or two, you still pay for the month. In cloud environments, however, it is a pay-as-you-go model. So when you talk about the pricing aspect, definitely the first point that comes to mind is “pay as you go”. This is the first aspect. The second point is that we pay less when we reserve. What do I mean by this? If we know that we are going to run one server with 16 GB of RAM continuously for one year, we commit to AWS: I’ll need that, so reserve that resource for me. In that case, you pay even less than you would pay on demand. Let me just show you what I mean by this. I’ll open up the AWS console and show you EC2. If you go to the Instances section, you see you have spot instances, you have reserved instances, and you have dedicated hosts. Whenever you simply launch an instance, it is basically on demand. To reserve, you see that you have to purchase a reserved instance.
In the event that you purchase a reserved instance, you commit to AWS for a certain amount of time, which can be one year or even three years. When you commit, you pay less. So that is what the second point is all about. Third, as you use more, you pay less per unit. So the more you use, the less you pay; that will be discussed shortly. And lastly, you pay even less as AWS grows. We’ll be discussing each of these points in the upcoming slides. So, let us discuss the pay-as-you-go model. The pay-as-you-go model allows customers to pay only for the resources that they use. One of the great benefits is that there are no large upfront expenses. Say for ten days I need, let’s assume, five terabytes of storage. I pay only for that amount for those ten days. If I delete all of my storage after ten days, I no longer have to pay. So that is one of the big benefits. Second, pay for only what has been used, and third, pay for only as long as you need it. These are the great benefits. You also have no long-term contracts. There are no contracts that say you need a compulsory reservation of one year, and there are no license-based pricing dependencies in such contracts. This is quite important. Let me just show you this: let me launch an instance, and within the instance type, you see you even have a Microsoft Windows server. So I can select the Windows server and actually launch it without having to worry about licensing aspects. If I use it for 10 hours, I only pay for 10 hours. So this is one of the very big benefits, where you don’t have to worry about licensing at all. Third, it allows us to create resources that are not based on forecasts. When you are in a data centre environment, you have a forecast: within the next three months, I’ll be needing, let’s assume, 100 servers.
So you’ll be purchasing all of those hundred servers right now and not really using any of them yet. With AWS, you don’t really have to worry about forecasts: if you need 100 servers after three months, launch that many servers after three months. So again, this is a great benefit. The next point is to pay less when you reserve. We already discussed that for certain services, like EC2 or RDS, we can purchase reserved instances depending upon the predicted usage that an organisation has. A reserved instance allows us to save up to 75% of the on-demand cost. We can pay for a reserved instance in three ways. One is All Upfront, which provides the largest discount. The second is Partial Upfront, and the third is No Upfront, which gives the smallest discount. For example, if I reserve for three years and pay all of the money upfront, that is referred to as All Upfront, and that gives us the largest discount. You also have No Upfront, where you commit for three years but do not pay anything upfront; in such scenarios, you get a smaller discount. So that is what the second point is all about. Third, pay less by using even more, where we can get volume-based discounts as the usage increases. If you look at certain services like S3, you see the first 50 terabytes per month is priced at around $0.023 per GB, and for the next 450 TB per month, the pricing decreases. As a result, the more you use, the lower the pricing becomes. One important thing to remember is that this does not apply to all services; it only applies to certain AWS services. The next point is to pay even less as AWS grows. AWS has actually lowered its pricing more than 44 times over the years. Even though prices decrease as AWS grows, one of the major reasons for this is increased competition. As of 2018, you have Azure, you have Google Cloud Platform, and a lot of other competition coming up.
And this is one of the major factors in the reduction of prices. There are some blogs that have compared the prices year by year: you have 2011, 2012, 2013, 2014. If you compare pricing from 2011 to 2014, a three-year difference, you’ll notice that prices have been cut in half; that is like a 50% pricing reduction. And, as you can see, in one of the more recent blog posts from May 2017, Amazon has actually reduced the pricing of various services quite a bit. This falls under the same category. So these are some of the important pointers for the AWS pricing model.
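The three pricing ideas above (pay as you go, reserve to save, volume tiers) can be sketched as a small Python calculator. The 75% reserved discount mirrors the maximum saving mentioned earlier; the storage tier sizes and prices are illustrative examples in the style of S3, not AWS’s current rate card.

```python
def on_demand_cost(hours: float, price_per_hour: float) -> float:
    """Pay as you go: billed only for the hours actually used."""
    return hours * price_per_hour

def reserved_cost(hours: float, price_per_hour: float, discount: float = 0.75) -> float:
    """Reserved capacity: up to ~75% off on-demand for a 1- or 3-year commitment."""
    return hours * price_per_hour * (1 - discount)

def tiered_storage_cost(gb: float, tiers: list) -> float:
    """Volume discount: each tier is (size_in_gb, price_per_gb); a size of
    None means 'everything above this point'. Prices are illustrative."""
    cost, remaining = 0.0, gb
    for size, price in tiers:
        used = remaining if size is None else min(remaining, size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost

# Illustrative S3-style tiers: first 50 TB, next 450 TB, everything above.
tiers = [(51_200, 0.023), (460_800, 0.022), (None, 0.021)]

print(on_demand_cost(10, 0.10))                      # 1.0 -- ten hours, not a month
print(round(reserved_cost(8_760, 0.10), 2))          # 219.0 -- a full year at 75% off
print(round(tiered_storage_cost(60_000, tiers), 2))  # 1371.2 -- 60,000 GB across two tiers
```

Notice how the 60,000 GB example pays the cheaper rate only on the portion above the first tier boundary; the discount applies per marginal unit, not retroactively to the whole volume.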