CompTIA CASP+ CAS-004 Topic: Infrastructure Design (Domain 1)
December 14, 2022

1. Infrastructure Design (OBJ 1.2)

In this section of the course, we’re going to discuss how to determine the proper infrastructure security design by considering some key requirements such as scalability, resiliency, and automation. So in this section, we’re again going to be focusing on Domain 1, Security Architecture, specifically objective 1.2: given a scenario, analyze the organizational requirements to determine the proper infrastructure security design. As we begin moving through this section, we’re going to start out by discussing scalability, including both vertical and horizontal scaling for our resources. Then we’re going to discuss the different forms of resilience, including high availability, diversity and heterogeneity of resources, allocation strategies, redundancy, replication, and clustering. Next, we’re going to move into automation and discuss auto-scaling, SOAR, and bootstrapping. Finally, we’re going to discuss the concepts of performance, containerization, virtualization, content delivery networks, and caching, as well as how these affect our organizational networks and their associated performance. So let’s get started in this section on infrastructure design.

2. Scalability (OBJ 1.2)

In this lesson, we’re going to discuss scalability requirements. Now, when I talk about scalability, scalability is measured by the number of requests an asset or server can effectively support simultaneously. In traditional networks and servers, you have reached the limits of scalability when the system or application can no longer handle additional requests effectively. This occurs due to the limitations of processing capability, the amount of physical memory, or even the amount of network bandwidth that you may have available. Now, if you begin to run out of resources when trying to support requests from users, then you need to scale up or scale out to increase those resources and therefore be able to support the additional requests. Scaling up is known as “vertical scaling.” Scaling out is known as “horizontal scaling.” Let’s take a quick look at each of these. Vertical scaling refers to adding more resources, such as more processing power, more memory, more storage, or more bandwidth, to your existing machine.

For example, on my laptop, I have 16 GB of RAM. Now, if I find the system slowing down when I open really large files, I may need to increase my available memory. And I could do that by adding an additional 16 GB of RAM, bringing my total up to 32 GB of RAM. This is a really simple example of scaling up, or using vertical scaling, going from 16 to 32 GB in the same system. Now, this is very popular in traditional systems and networks, and it can also be done when you’re using cloud services as well. For example, let’s pretend for a moment that you’re going to start a new blog and you decide to use a simple cloud solution like Amazon Lightsail to host that blog. Well, you might start out with a $5-per-month plan. And this has 1 GB of memory, a single-core processor, 40 GB of disk space, and two terabytes of data transfer each month. Now, over time, you start to gain more readers for your blog, and you realise you need to scale up vertically. So you click a button and you upgrade to the $20-per-month plan. This now gives you 4 GB of memory instead of 1 GB of memory and also doubles your processors and disk space.

And now your data transfer allowances will also go up. All of this went up vertically, giving you additional resources. Now, you can keep scaling up vertically until you reach the largest plan size, which at the time of filming was around $160 per month. That plan will give you 32 GB of memory, an eight-core processor, 640 GB of disk space, and seven terabytes of data transfer per month. But what do you do when you can’t vertically scale anymore, but you’re still getting more and more readers for your fabulous new blog? Well, you’re going to need to rearchitect your website to allow for horizontal scaling, also known as scaling out. You see, vertical scaling is really easy because you don’t have to recode or redevelop anything on your website at all. Instead, you just add more memory, more processors, or more bandwidth, and you’re still using a single machine, a single virtual machine, to do all of your processing. And that is how you initially designed your website. So scaling vertically is really easy. But with horizontal scaling, we now have to break down that workload into smaller pieces of logic that can be executed in parallel across multiple machines.

If you’re running your website on a database, horizontal scaling allows you to partition the data across multiple databases so that each database only contains a portion of the total data. For example, let’s go back to the example of the blog. You might have a collection of thousands of different articles split across multiple servers, with one server holding the articles you wrote during each year. Now, when someone wants to read one of the articles, the request is going to get answered by the corresponding server and database based on the article year being requested. In general, scaling out will be better in the long run because each individual machine is less expensive to add than scaling up. Now, scaling up is going to be more expensive because you’re doing more on a single machine and you’re adding additional resources. And that doesn’t scale nearly as well as scaling out does. Additionally, when you scale out, it becomes virtually limitless because you can keep adding smaller, inexpensive machines to meet the increasing workload that’s being caused by all these additional users and their requests. By focusing on horizontal scaling, you can become infinitely elastic, and your only real limit becomes your ability to pay for cloud computing time from your cloud service provider.

You can have instant and continuous availability, no limit to hardware capacity, cost that is assessed on a per-use basis, built-in redundancy, and easier sizing and resizing to meet your needs by using horizontal scaling and scaling out in this manner. Now, the challenge with horizontal scaling is that you need to design your applications to work in smaller, self-contained blocks that interact to create the results you need. To do this, your application has to be designed as a stateless app on the server side so that it can be divorced from any single server for long periods of time. Any time your application has to rely on server-side tracking of what it’s doing at any given moment, that’s going to require that your session be tied to a particular server. If, on the other hand, all session-related specifics can be passed seamlessly across literally hundreds of servers, that’s going to allow us to infinitely scale out much more quickly. The ability to handle a single session, or thousands or millions of sessions, across multiple servers interchangeably is at the very epicentre of horizontal scaling. To do this, you always want to focus on using service-oriented architectures, or SOA, whenever you’re building for horizontal scaling. By using microservices and developing your application with independent web, application, caching, and database tiers, you can scale each of these blocks independently as needed to meet your increasing loads and demands for your services and be able to infinitely scale outward.
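To make the sharding example above a bit more concrete, here is a minimal sketch in Python of routing article requests to per-year database shards from a stateless handler. The shard names and connection strings are hypothetical, and a real deployment would use a proper data-access layer rather than this toy lookup.

```python
# Minimal sketch (hypothetical names): routing blog article requests to
# per-year database shards, the horizontal scaling example from this lesson.
from dataclasses import dataclass

@dataclass
class Shard:
    year: int
    dsn: str  # connection string for this shard's database (hypothetical)

# One small, inexpensive database server per year of articles.
SHARDS = {
    2020: Shard(2020, "postgres://db-2020.example.internal/blog"),
    2021: Shard(2021, "postgres://db-2021.example.internal/blog"),
    2022: Shard(2022, "postgres://db-2022.example.internal/blog"),
}

def route_request(article_year: int) -> Shard:
    """Pick the shard that holds articles for the requested year.

    Because this handler keeps no server-side session state, any web
    server in the pool can run it, which is what lets us keep adding
    machines (scaling out) instead of growing one machine (scaling up).
    """
    try:
        return SHARDS[article_year]
    except KeyError:
        raise ValueError(f"No shard holds articles for {article_year}")

print(route_request(2021).dsn)  # -> postgres://db-2021.example.internal/blog
```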

3. Resiliency Issues (OBJ 1.2)

Security professionals need to design their networks to be resilient. Resiliency is focused on maintaining business continuity for critical services, applications, and data. Simply put, to ensure resiliency, we must ensure availability. Now, it’s often cheaper and easier to utilise the same type of software and systems throughout our environment. Creating a homogeneous environment, though, puts all of our systems at a greater risk of a security breach. If an attacker identifies a single vulnerability that exists in the operating system that supports all of our servers, for example, then they could reuse that attack over and over again throughout our entire network. To increase our availability and resiliency, it is better to have a mixture of different operating systems to provide the services we need inside our networks. This is known as having diversity in your components or having a heterogeneous system. This, however, is going to make operations within the network much more challenging.

So your organisation needs to make a calculated risk management decision when deciding whether to use different operating systems to increase resiliency, or the same operating system across all of its servers to increase efficiency and reduce complexity. This is an operations-versus-security tradeoff that we have to balance. Now, while our organisations may not decide to use different operating systems to run all of our servers, we should at least distribute the servers across two or more data centres to ensure some resiliency. This is known as the “distributed allocation” of our resources. Every critical component that supports our network should have redundancy, not just in the components but also in location. For example, if one of our data centres loses power or is flooded, our second data center should be able to pick up all the critical services from that lost data center.

Another method of increasing resilience is by using automation and orchestration. Orchestration involves creating fully automated workflows. For example, we can use orchestrated workflows to start additional cloud servers based on additional user requests for services and increased demand. By reacting to the demand signal, orchestration can perform numerous functions to aid in the overall resiliency of that system. This is known as “course of action orchestration,” and it involves using a series of “if-then” conditions to start services and scale them up or out, or to turn services off and scale them down or in, based on the conditions set and the course of action that should be taken. Another key resiliency issue is persistent and non-persistent data. Persistent data is going to be stored on a device and is not lost when the device loses power or is shut down. The most common form of persistent data is information that’s stored on your hard drive.

Now, non-persistent data, on the other hand, is usually stored in memory, and it’s going to be lost whenever power is removed. As we consider increasing resiliency, you should consider how you will best protect your non-persistent data. This is extremely important in the world of databases. To overcome the issue of non-persistent data, we should utilise a journaling process. Before making any data changes to the disk, the journaling process stores them in a transaction log.
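To make the journaling idea concrete, here is a minimal sketch in Python of a write-ahead transaction log: the change is recorded in the journal first, and a recovery routine replays the journal after a crash. The file names and record format are hypothetical, and real database engines also handle checkpointing and truncation of the journal.

```python
import json
import os

JOURNAL = "transactions.log"   # hypothetical append-only transaction log
DATAFILE = "data.json"         # hypothetical primary data store

def write_change(key, value):
    # 1. Record the intended change in the journal first (persistent).
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"key": key, "value": value}) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. Only then apply it to the primary data store.
    data = json.load(open(DATAFILE)) if os.path.exists(DATAFILE) else {}
    data[key] = value
    with open(DATAFILE, "w") as f:
        json.dump(data, f)

def recover():
    """Replay the journal after a crash so committed changes become persistent."""
    data = json.load(open(DATAFILE)) if os.path.exists(DATAFILE) else {}
    if os.path.exists(JOURNAL):
        for line in open(JOURNAL):
            entry = json.loads(line)
            data[entry["key"]] = entry["value"]
    with open(DATAFILE, "w") as f:
        json.dump(data, f)

write_change("user:42", {"name": "alice"})  # journaled first, then applied
```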

That way, if power is removed prior to those transactions actually being applied, we can utilize the transaction log upon restoration to ensure that the changes become persistent. Now, as stated earlier, resilience goes hand in hand with availability. If our systems are designed to be highly available and redundant, we will create a more resilient system. By utilizing fault-tolerant measures such as redundant hardware, redundant power sources, and redundant network connections, we can increase resilience across the entire system. It’s also important to consider the likelihood of an attack when we’re developing our systems for resilience. For example, if an organization is likely to be the victim of a distributed denial of service attack, it is going to be important to design and install solutions to mitigate against that threat.

Now, as we identify the various vulnerabilities and threats that may affect our organization, we can apply controls to lessen their impact and increase our resiliency against those threats. Another key to resiliency is creating high availability. It is the security professional’s responsibility to configure high availability between our servers and networks for our critical services. For example, for our networks, we often use load balancers to ensure we utilize two or more network connections to service all of the inbound and outbound requests. When we talk about high availability, we’re talking about making sure that our systems are up and available at all times. After all, networks need to be designed to maintain the availability of the resources that they’re hosting. This uptime is critically important, regardless of whether a hardware or software failure occurs.

Just as maintaining the confidentiality and integrity of our data is important to security, so is its availability. Because if the user can’t access the data, it really doesn’t matter how secure that data really is. So to achieve this, we often rely on redundancy within our networks as well as the ability to scale up or scale out as needed. There are numerous availability controls that are used to provide a high level of reliability and availability to users of the network. These include redundancy, fault tolerance, service level agreements, mean time between failures and mean time to repair metrics, single points of failure, load balancing, clustering, failover, and fail-soft.

Redundancy is a term that refers to having extra components that are not strictly necessary for the function but are used in the event of the failure of another component. For example, at my office, I have redundant network connections. In fact, I have three of them, which gives us three different ways to connect to the Internet. Our primary connection is through a microwave link, our secondary connection is from our local cable company, and our tertiary connection is through a cellular modem. On a daily basis, our network is configured to load balance across both the microwave link and the cable modem link, but we reserve the cellular link as a backup and redundant connection if one of the other two fails. This setup provides us with resilience between the two connections as well as redundancy by having a third connection that acts as a backup or failover.

Another way that we can achieve high levels of availability in our networks is by using redundant hardware. By using multiple pieces of physical hardware, such as mirrored hard drives, redundant routers, or things like that, our networks can suffer the loss of a physical component without interrupting access to the network or the data it contains. A truly redundant network would have a duplicate of every single hardware component, so the service remains up at all times. Unfortunately, while this might provide the best service, it also at least doubles the cost of operations, and true redundancy in our networks is too costly for most organizations. Instead, we rely on fault-tolerant technologies. These systems provide uninterrupted access by clustering systems or devices together into pools, and that way we can provide our services. For example, a RAID, or redundant array of inexpensive disks, is one such fault-tolerant system. It utilises a pool of physical disks to provide a single logical disk for use on the network. Now, in most configurations, even if one hard drive in the array fails, the data is still going to be available to the users.

We’ll discuss RAID more in depth in its own lesson. Service level agreements, or SLAs, are another way for us to mitigate the cost of maintaining availability while also minimizing our downtime. Assume we don’t have enough resources to maintain a fully redundant network, so we’ll instead use a service level agreement with the manufacturer of our routers to ensure we get a replacement within 4 hours of any router failure. Doing this provides us with the ability to recover quickly in the case of a failure at a much lower cost than having full redundancy by buying multiple routers. Now, service level agreements can also act as a formalized agreement between a service provider and an organization. For example, we may have an SLA that states our Internet service provider is going to provide us with 100 megabits per second of bandwidth at least 99.9% of the time. Or our SLA may state that our web hosting service must maintain a 99.99% uptime. Now, when things fail, as they often do, it’s important for us to be able to recover quickly from that.

And one of the ways we do that is by buying quality components that have a low mean time to repair or a high mean time between failures. Now, the first metric we need to think about is the mean time to repair, or MTTR. MTTR is the average amount of time it takes to get a particular device fixed and back online. The shorter the mean time to repair, the better it is for our organization’s availability. The second metric is the mean time between failures, or MTBF. Now, the mean time between failures is a measure of how reliable the device is because it’s going to measure the average amount of time that the device will operate before it breaks or fails. If you have a longer mean time between failures, that means you have higher reliability and higher availability for your organization. This is a common metric that device manufacturers will give you to help you make a better purchasing decision. Because if one device has a three-year MTBF and another has a five-year MTBF, the three-year device is less reliable, and we’d want to buy the five-year one. Now, when it comes to availability, it’s also important for us to consider how our network is designed and identify any single points of failure. Whenever possible, it is best to design our network without any single points of failure, such as a single router that connects our business to the Internet or a single server that houses the only copy of some important data. Redundancy is the key to eliminating single points of failure. As you find these single points of failure, you need to design redundancy into those systems so that data and those network choke points are available from multiple places inside the network.
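As a quick worked example, these two metrics are commonly combined into a steady-state availability figure using the formula availability = MTBF / (MTBF + MTTR). The sample numbers below are made up for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from the common MTBF/MTTR formula."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical devices: a 3-year MTBF versus a 5-year MTBF, both with a
# 4-hour mean time to repair (for example, backed by a replacement SLA).
three_year = availability(3 * 365 * 24, 4)
five_year = availability(5 * 365 * 24, 4)
print(f"3-year MTBF device: {three_year:.5%} available")
print(f"5-year MTBF device: {five_year:.5%} available")
```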

Next, we have load balancing. Load balancing is used to spread the computational workload across multiple hardware products. For example, if we have a website with lots of users generating requests, we may need a server farm to handle all of that load. To do this, we’ll use a content switch. Now, a content switch is going to spread the load for millions of users across a large number of web servers to ensure that all of our users receive the proper service. When it comes to load balancing requests to our data, we often rely on replication. In terms of adding redundancy to our systems, replication is the act of copying or reproducing something. We’re going to use replication to spread our data across multiple servers. Replication is a form of backup that is provided in real time and is always connected. Remember, though, that replication is not a true backup because it’s not going to allow you to go back in time and restore that data. Instead, it just mirrors or copies the same data across multiple servers. For example, my website utilises a content distribution network, or CDN. When I upload a new video, it’s replicated across the CDN to all of our endpoints, which include servers at various locations around the globe.

So if one of my students wants to watch a video, they’re redirected to the server closest to their location, and they watch the replicated version of that video. But if one of my team members decides to delete the original video from our main server, that deletion action will also be replicated to all the other servers around the world, and they’ll lose their copy of that video too. For this reason, remember that replication is simply a copy of the original server, and it will be deleted or modified whenever the original is deleted or modified as well. Replication can be used to ensure high availability of the data, but it is not considered an integrity measure. In large data centers, we’re also going to use clustering to provide resiliency. Now, clustering is like load balancing, except it relies on software instead of hardware to perform the load balancing functions. Clustering is going to distribute the load using either round robin, weighted round robin, or least-connection algorithms to spread the load across those virtual instances. For example, at one of my previous organizations, we ran a Microsoft Exchange Server to provide email to all of our users.

Now, we had about 150 users on that system, and that is entirely too many users to provide adequate levels of service using a single server. So we used a clustered environment instead. We had five servers in that cluster, and even if one of those servers failed, the other four servers would provide service to the users requesting access to their email. When you’re using clustering, any node in the cluster can perform the workload, since all of them are redundant and configured to work together to provide resiliency and higher availability. Another concept that’s important to high availability is the concept of failover. Now, failover is a concept used in systems engineering that allows the system to switch over automatically to a backup system if the primary system cannot continue to operate. Fail-soft, on the other hand, cannot fail over to another system and instead is going to terminate any noncritical processes when a failure occurs in an attempt to continue operations on that particular piece of hardware. So, as you can see, there are lots of different ways to design resilience into your systems, including high availability measures, diversity or heterogeneity of components and software, and using orchestration to automatically respond to changing conditions by distributing your resources across different servers or data centers, creating redundancy, utilising replication, or configuring clusters.
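As a quick illustration of the clustering algorithms mentioned a moment ago, here is a minimal sketch of round-robin and least-connection selection; the node names and connection counts are hypothetical.

```python
from itertools import cycle

# Hypothetical five-node cluster, like the Exchange example above.
nodes = ["mail1", "mail2", "mail3", "mail4", "mail5"]

# Round robin: hand each new request to the next node in a fixed rotation.
rr = cycle(nodes)
round_robin_picks = [next(rr) for _ in range(7)]

# Least connection: hand the next request to the node with the fewest
# active connections right now.
active = {"mail1": 12, "mail2": 7, "mail3": 7, "mail4": 20, "mail5": 3}
least_conn_pick = min(active, key=active.get)

print(round_robin_picks)  # ['mail1', 'mail2', 'mail3', 'mail4', 'mail5', 'mail1', 'mail2']
print(least_conn_pick)    # 'mail5'
```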

4. Automation (OBJ 1.2)

In this lesson, we’re going to discuss automation and specifically the concepts surrounding auto-scaling, SOAR, and bootstrapping. Automation is defined as the use of electronics and computer-controlled devices to take control of processes. In terms of infrastructure design, automation can be implemented to scale our cloud computing resources either up or out, meaning vertically or horizontally, based upon your needs. This is known as “auto-scaling,” and it’s simply a form of automation. Auto-scaling will monitor your applications and servers and adjust their capacities automatically to maintain consistent, predictable performance at the lowest possible cost. Essentially, you’re going to set up target utilisation levels in the system, and if those targets are hit, the system will scale up or down accordingly.

Now, let’s consider a simple example of this. Let’s say I’m running my web server on AWS, Amazon Web Services, and I have designed it to horizontally scale so that it is infinitely elastic upwards and downwards. For this example, let’s pretend that for each EC2 compute instance that’s spawned, I can host 100 users on my website. So, if I average 1,000 simultaneous users, I would need to have about ten instances running inside of EC2, or ten compute resources. Now, if I configure auto-scaling for my web server, I can set a limit of 70% utilization for those instances. Any time I hit 70%, I’m going to consider the system to be under too much load, and I’ll automatically add another instance using auto-scaling. On the other hand, if demand falls and fewer people visit my website, whenever I reach 20% or less, an instance should be taken offline to save me money. So I’m not paying for computing that I’m not using.
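A minimal sketch of that threshold logic might look like the following; the function name and the exact thresholds are just the example numbers from this lesson, not any particular cloud provider's API.

```python
def scale_decision(avg_utilization: float, current_instances: int,
                   scale_out_at: float = 0.70, scale_in_at: float = 0.20,
                   min_instances: int = 1) -> int:
    """Return the new instance count based on simple threshold rules."""
    if avg_utilization >= scale_out_at:
        return current_instances + 1          # add an instance (scale out)
    if avg_utilization <= scale_in_at and current_instances > min_instances:
        return current_instances - 1          # remove an instance (scale in)
    return current_instances                  # within bounds, no change

print(scale_decision(0.75, 10))  # 11 -> load is high, add a server
print(scale_decision(0.15, 10))  # 9  -> load is low, remove a server
print(scale_decision(0.50, 10))  # 10 -> no change
```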

So 70% or more, add more servers; 20% or less, remove some servers. That’s the basic logic I’m going to employ with auto-scaling. Now, all day long, the system is going to automatically add or remove EC2 instances based on those two conditions. That is what auto-scaling is all about. So, even if I have up to 1,000 concurrent users and they all decide to visit my website right now, at the same time, the system should detect that load and quickly horizontally scale out more EC2 instances to accept those requests and provide the service that my users require. Next, let’s talk about SOAR. SOAR is an acronym that stands for Security Orchestration, Automation, and Response. Now, SOAR is a class of security tools that helps facilitate incident response, threat hunting, and security configurations by orchestrating and automating runbooks and delivering data enrichment.

SOAR is a stack of compatible software programs that enables an organization to collect data about security threats and respond to those security events without any human assistance. For this reason, the number one place you’re going to see SOAR used is in incident response, because it can automate so many of your actions. I like to think of it as SIEM 2.0. Essentially, it’s a next-generation SIEM. It takes all the security information and event monitoring system data and integrates it into the SOAR, and when you combine those two elements, you have your next-generation SIEM. And you get this really awesome product that can act on your behalf. SOAR gives you the ability to scan security and threat data and identify different things within it. You can then analyze it using machine learning, and you can automate the process of doing data enrichment to make that data inside the system even more powerful for your analysts to be able to use.

Finally, you can use SOAR to perform incident response and even automatically provision new network resources. That means you can create new accounts, and you can easily create new VMs as well. If you’re using VDI, or virtual desktop infrastructure, you can also delete somebody’s infected workstation and then create a new virtual machine for them to use. And if you use this type of capability correctly, you can do all of this with automated playbooks. Now, when discussing this, I mentioned the word “playbook.” What exactly is that? Well, a playbook is essentially a checklist of actions that you’re going to perform to detect and respond to a specific type of incident. So, for example, you may have a playbook that talks about what to do in the event of a phishing campaign. If somebody clicks on a link in that phishing campaign, you’re going to have several steps they need to take.

You’re going to go to the machine, isolate it from the network, perform a virus scan to determine whether the machine is actually infected, check the registry to make sure there’s nothing persistent in there, and then back up all the user data, format the computer, reinstall the operating system, and restore all their data back onto that machine. These might be all the different actions you’re going to take based on this playbook. Now, these playbooks can either be manual or automated, but when we’re talking about a playbook, we’re just talking about the fact that there are several steps you need to do in any given process based on some sort of trigger. Now, if I automate that playbook and I run it a lot, it can move from being a playbook to being a runbook. A runbook is an automated version of a playbook, and it leaves clearly defined interaction points for human analysis. So for example, my SOAR might say that if somebody clicks on a link in the phishing email, it should do steps one through five.

Then, when you get to step two, pause and send it to an analyst, and they’re going to decide whether or not to reimage the machine. By doing this, the human and the SOAR can work together more effectively, saving the human’s time while keeping them in the loop for any critical decisions that you may not feel comfortable having a machine make through automation. This allows both of these things to work together to create a better environment and help reduce the workload of our cybersecurity analysts. Because again, we only have so many cybersecurity professionals, and if we’re having to waste their time on very minor things that we can automate, that’s not very helpful to us.

So instead, we want to automate what we can, and SOAR allows us to do a lot of that automation while keeping the analysts in the loop. Finally, let’s talk about bootstrapping and its place within automation. Now, Bootstrap is a simple Perl framework that is intended to take old manual release processes and modernise them to allow them to be automated. Essentially, when you think of bootstrapping, I want you to think about how you can automate testing and deployment. Bootstrapping is often used to deploy cloud-based applications by allowing you to create templates for an entire stack that needs to be deployed across the cloud to multiple regions or availability zones quickly and easily. Remember, if you need to operate at scale, you can’t do all of this manually. You need to learn to automate your operations, and you can automate your scaling by using auto-scaling. You can automate your security responses by using SOAR, and you can automate your testing and deployment by using the Bootstrap framework.
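To tie the SOAR discussion together, here is a minimal sketch of what an automated phishing runbook with a clearly defined human interaction point could look like. Every function name here is hypothetical, and a real SOAR platform would typically express this as a visual or YAML-defined workflow rather than raw Python.

```python
def isolate_host(host): print(f"[auto] isolating {host} from the network")
def scan_host(host): print(f"[auto] running AV scan on {host}"); return True
def check_persistence(host): print(f"[auto] checking registry run keys on {host}")
def backup_and_reimage(host): print(f"[auto] backing up user data and reimaging {host}")

def ask_analyst(question: str) -> bool:
    # Human interaction point: the runbook pauses here for a decision.
    return input(f"[analyst] {question} (y/n): ").strip().lower() == "y"

def phishing_runbook(host: str):
    isolate_host(host)                        # step 1: contain the machine
    infected = scan_host(host)                # step 2: verify infection
    check_persistence(host)                   # step 3: look for persistence
    if infected and ask_analyst(f"Reimage {host}?"):
        backup_and_reimage(host)              # steps 4-5: backup, reimage, restore
    else:
        print(f"[auto] releasing {host} back to the network")

# Example trigger (hypothetical hostname):
# phishing_runbook("WKSTN-042")
```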

5. Performance Design (OBJ 1.2)

In this lesson, we’re going to talk about performance considerations and how to increase the performance that your end users experience. When it comes to the performance of your infrastructure, there are going to be four key areas that you need to consider: latency, traffic, errors, and saturation. Now, latency is the measure of the time it takes to complete a given action. Each component in your system, your network, or your applications is going to be measured differently when you’re trying to capture the latency involved in responding to a request, but the overall concept is the same. How long does your end user have to wait for that thing to happen? For example, if you go to diontraining.com, how long does it take from the time you type in diontraining.com and press the Enter key to when the web page finally loads up? That is the latency you experience as an end user. In the example of our website, our goal is to keep the response time on our home page under 2 seconds.

But we know that some of our web apps within our student portal may take up to 10 seconds to fully load the information because we have to make a bunch of API calls. Now, that is a long latency, but it is something we are actively working to reduce in order to create a more enjoyable student experience for our end users. To do that, we have to look at all the different parts of the system that make up that service, measure the latency across the entire system and within each component, and then identify the bottlenecks that are causing the service to respond with a higher latency. The second area we need to consider is traffic. Now, traffic is a measurement of the busyness of your systems and their components. For example, how many people are currently visiting diontraining.com? If there are only five people, then we’re having a low-traffic period. If it’s 500 people, that’s going to be a high-traffic period for us.

By understanding your traffic patterns, you can determine when the load or demand on your services is going to be at its highest and when you should add additional resources to handle that load. Often, you’re going to find that periods of high latency also correlate with periods of high traffic. If this is the case, you may need to adjust your auto-scaling limits if you’re going to be using cloud-based servers. Next, we need to discuss errors. Errors are another important thing to track and monitor because they’re going to help you identify issues that could cause increased latency due to improper coding, decreased traffic due to broken links or broken systems, and increased user frustration. Also, errors could be an indication of a possible security issue.

So it is really important to read your error logs and determine the root causes in order to keep the performance of your systems high. Fourth, you need to consider your saturation levels. Now, saturation is measured by how much of a given resource is being utilised by your system or service. This is usually measured as a percentage of total capacity. For example, if you have 16 GB of memory and are currently using 8 GB of it, you are at 50% saturation. Using some basic assumptions, you should be able to handle roughly twice the current load before your system crashes, because you’re not fully saturated yet since you’re only at 50% loading. So now that we understand the basics of performance by measuring our latency, traffic, errors, and saturation, it’s also important for us to look for solutions to help improve our performance.
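Before we look at those solutions, here is a minimal sketch of the saturation measurement just described, using the 16 GB memory example from above; the 80% warning threshold is an arbitrary illustration.

```python
def saturation(used: float, total: float) -> float:
    """Saturation expressed as a percentage of total capacity."""
    return used / total * 100

# The 16 GB / 8 GB example: 50% saturated, so roughly twice the current
# load fits before memory is exhausted.
mem_pct = saturation(used=8, total=16)
print(f"Memory saturation: {mem_pct:.0f}%")
if mem_pct >= 80:
    print("Consider scaling before users start to feel the latency.")
```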

Two great ways to do this are to use content delivery networks, or CDNs, and caches. Now, a content delivery network, or CDN, refers to a geographically distributed group of servers that work together to provide fast delivery of Internet content. A CDN is going to allow for a quicker transfer of the assets needed for loading Internet content, including web pages, JavaScript files, style sheets, images, and videos. CDNs are extremely popular, and today the majority of web traffic is actually stored and served through CDNs, resulting in reduced latency. After all, if I have a student who’s sitting in India trying to access data from my server in the United States, that inherently has latency involved because of the distance between those two locations. But if that same student in India is instead accessing my content from a CDN in Asia, we can reduce the latency by 300 or 400 milliseconds on average, and that is about a third of a second less time for them to wait.

By using a CDN, you can also improve your website load times, reduce your bandwidth costs, increase content availability and redundancy, and improve your website security. This is because a lot of CDNs also provide DDoS mitigation, better security certificates, and other optimizations for you. For example, Cloudflare and Akamai are both well-known and respected CDNs that provide these additional security features for the websites they host. Finally, let’s talk about caching. Caching is a high-speed data storage layer that stores a subset of data so that future requests for that data are served up faster than would be possible by accessing the data’s primary storage location. Now, caching occurs at many places in your systems. For example, your own laptop or desktop keeps a copy of recently accessed websites in its local file cache.

Your proxy server keeps a copy in its cache, and if you run a website, you likely have a cache of your pages that’s served up to visitors, before trying to get an updated version from the server’s database, if nothing has changed since then. Caching can be performed in high-speed memory such as RAM or in cloud memory engines, at the application or storage layer, depending on your configuration. If you’re going to use caching in your system’s design, remember that you need to include controls such as the time to live, or TTL.

That way, your cache always knows how old the data it’s holding is. For example, on my website, we use a six-hour cache, so after six hours, our cache will delete what it has and pull the latest copy from the master database to serve our students. Now, if I update the content and I want to ensure everybody gets it right away, instead of waiting for six hours, I can simply clear the cache after updating the content, and then the cache will begin to rebuild over time as more people access my website. Now, in this lesson, we just discussed a little bit about performance concerns, and I provided you with two methods to help increase your system’s performance: using CDNs and caching. Remember, if you’re having performance issues, first you need to identify your end-to-end performance by measuring it. Then measure the individual components so you can identify where your performance bottleneck is and then remove it.
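As a concrete illustration of the TTL control described above, here is a minimal sketch of a time-to-live cache in Python, assuming a simple in-memory store; production systems would normally use a dedicated cache engine instead.

```python
import time

class TTLCache:
    """Minimal time-to-live cache: entries expire after ttl_seconds."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}          # key -> (value, time stored)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key, fetch):
        """Return the cached value, or call fetch() if missing or expired."""
        entry = self._store.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]                   # cache hit, still fresh
        value = fetch()                       # cache miss: go back to the origin
        self.set(key, value)
        return value

    def clear(self):
        self._store.clear()       # force everything to repopulate from the origin

pages = TTLCache(ttl_seconds=6 * 60 * 60)                   # the six-hour cache from above
html = pages.get("/home", fetch=lambda: "<h1>Home</h1>")    # first call hits the database
html = pages.get("/home", fetch=lambda: "<h1>Home</h1>")    # second call is served from cache
```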

6. Virtualization (OBJ 1.2)


7. Securing VMs (OBJ 1.2)

In this demonstration, I want to show you how you can better secure your virtual machine from the outside environment. We’re going to do that by using encryption as well as disabling sharing between the host operating system, in my case, the Mac system, and my Windows 10 virtual machine. If I don’t allow sharing between the two, then even if the Windows machine gets a virus or some sort of malware, it can’t escape the virtual environment and can’t affect my host operating system, which is the Mac.

So from within VirtualBox, on the left side, you’ll have all of your systems listed. On this one, it’s a new install, and I only have the one Windows 10 machine that I made earlier. Now if you go ahead and click on it and then click on Settings, you’ll have the settings for it. There are two areas that I want to go through. The first one is disk encryption. You have this file on your system that’s holding your entire Windows operating system, that virtual machine. So I’ll open Finder and find it in my VirtualBox VMs folder. In here, I have this one folder that is Windows 10. And underneath it, you’ll see that I have some log files. I have some configuration files, which are my VirtualBox settings.

And this VDI, this nine-gigabyte file, is the entire hard drive image of that Windows 10 machine. Right now, it’s not encrypted. And so anybody who got access to this machine could get the data from that Windows machine, and we don’t want that. So one of the things you want to do is go into VirtualBox, and under the General tab, there is this disk encryption setting. You can enable disk encryption. You can choose which cipher it’s going to use: AES-256 or AES-128. AES-256 has a larger key, so it’s going to be better. And then you can give it a long, strong password. Something like this is going to give me a nice, long 16-character password that is a mixture of uppercase, lowercase, special characters, and numbers. When I click OK, it’s going to go through and encrypt that disk image. Because the encryption process requires processing and disk access resources, this may take some time on your system, depending on its power. On my system, this only took about 30 seconds because we have a very high-performance system using solid-state hard drives. The second thing we want to do is go into our settings and look at our shared folders. So, right now, you can see that we have a nicely encrypted disk under Disk Encryption.

And when we go over here to our shared folders, we don’t have any set up right now. Now, inside the shared folders, if you add a folder, this will make a connection between the virtual machine and the host machine. So in my case, if I want to go ahead and connect my Mac’s desktop folder to this Windows machine, I can go ahead and hit auto-mount, which will allow me to auto-mount this folder as a shared resource that the Windows machine can get. You can also do it as a read-only version, so it’s a one-way transfer where the Windows machine can read from the Mac but can’t write anything back to it. Right now, I have it set up as a two-way share. We’ll go ahead and hit OK, and I can go ahead and boot up this Windows machine. When you go to boot it, because we’ve set that encryption, we do have to enter that long, strong password each time so that the file can be decrypted, allowing us to boot up from the hard drive. And now, once we’re booted up into Windows, we can click on the folder, we can go to the network, and you’ll see there is now this network server called VBOXSVR. This is what hosts all the shared files and folders. So we can see that the desktop folder that I shared is now present. And it appears that the Mac desktop is currently empty from here.

Now if I look at the Mac, you can see there’s nothing on my desktop. Let’s go ahead and make a file here just to show that we have a connection between the two. I’ll create a text document, and now if I go back over to the other side, you’ll see from Windows that it’s there. And that two-way connection is dangerous, because if you have a host system and a virtual machine and you get some sort of virus or malware inside the virtual machine, it can then be transferred to your host computer. So what I recommend is that we don’t have that connection set up. So inside VirtualBox, I like to go in and delete those connections and make sure that this virtual machine stays isolated and that there is not a connection between the two. And this will provide you with a little more security when using these virtual machines.
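The same shared-folder hardening shown in this demonstration can also be scripted with VirtualBox's VBoxManage command-line tool. Here is a minimal sketch that wraps it in Python; the VM name, share name, and host path are assumptions based on the demo, and you should double-check the exact flags against your VirtualBox version.

```python
import subprocess

VM = "Windows 10"   # assumption: the VM name used in the demo

def vbox(*args):
    """Run a VBoxManage subcommand and fail loudly if it errors."""
    subprocess.run(["VBoxManage", *args], check=True)

# Remove the existing two-way shared folder so malware in the guest
# cannot reach the host's file system through it.
vbox("sharedfolder", "remove", VM, "--name", "Desktop")

# If a share is genuinely needed, add it back read-only instead,
# so data can only flow from host to guest.
vbox("sharedfolder", "add", VM, "--name", "Desktop",
     "--hostpath", "/Users/demo/Desktop", "--readonly", "--automount")
```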

8. Containerization (OBJ 1.2)

The newest type of virtualization that’s becoming popular in our networks is known as container-based virtualization, also known as containerization. This type of virtualization is much more focused on servers than the end user, though. With this type of virtualization, the operating system kernel is shared across multiple virtual machines, but the user space for each virtual machine is uniquely created and managed. Containerization is a type of virtualization that’s applied by a host operating system to provision an isolated execution environment for an application. Containerization is considered fairly secure because it enforces resource segmentation and separation at the operating system level. Now, containerization is commonly used with Linux servers, and some examples of this container-based virtualization include things like Docker, Parallels Virtuozzo, and OpenVZ.

Now, what does containerization really look like? Well, you have a piece of hardware, and then on top of that hardware you have a host operating system, usually Linux. And then you have a container manager, something like Kubernetes or Docker. Now, this container manager is going to be used to create different containers that contain different applications. In this case, I have three containers.

The first container is based on the kernel of the host OS, in this example a Linux system, and so we have a container running on Linux that contains some applications. Now, container two can do the same thing, and container three can do the same thing. Since all three containers are sharing the same host operating system files, this takes far fewer resources than doing pure virtualization using virtual machines, like we talked about in our virtualization lesson. If we instead used individual virtual machines, each one would need its own copy of an operating system, and that could be eight to 10 GB for each one.

But with containers, we can all share the same operating system. So it uses a lot less storage space and takes a lot less processing power. This is the real benefit of using something like a container from an operational perspective. Now, because these containers are logically isolated, they can’t actually interface with each other. If I wanted to have two containers talk, I would actually have to connect them through a virtual network and do the right routing and switching to allow them to talk. By default, they have no way of talking to each other. And this is a great thing from a security standpoint because we have this segmentation. Now here’s your big warning when it comes to dealing with containers: if an attacker compromises the host OS underneath, for example, that Linux operating system I was just discussing, guess what happens? Well, that attacker now has access to all of the containers and their data, because that one operating system is used by all the containers being hosted. This is one of the biggest vulnerabilities when you use containers. I have a container system that’s running 50 different servers right now, each running different services using containerization.

But if somebody is able to get to that one server underneath, that Linux operating system, they now have access to all 50 of those hosted servers and all of their data. Another risk to consider is how your containers and other virtual machines are actually going to be hosted. Once we begin to rely on virtualization and cloud computing, it becomes very important to recognise that our data might be hosted on the same physical server as another organization’s data. As a result, we introduce some vulnerabilities into our systems’ security. First, if the physical server crashes due to something one of the other organisations is doing, it can actually affect all the organisations on that same server. Similarly, if one organisation is not maintaining the security of the virtual environments they’re hosting on that physical server, there is a possibility that an attacker can exploit that to the detriment of all the other organisations hosted on that same server. Just as there are concerns with interconnecting our networks with somebody else’s, there are also concerns with hosting multiple organizations’ data on the same physical server. It’s important for us to properly configure, manage, and audit user access to the virtual machines that are being hosted on those servers.

Also, we want to ensure that our virtual machines have the latest patches, antivirus software, and access controls in place. In order to minimise the risk of having the physical server’s resources overwhelmed, it’s always a good idea to set up our virtual machines with proper failover, redundancy, and elasticity. By monitoring network performance and the physical server’s resources, we should be able to balance the load across several physical machines instead of relying on just one. Another vulnerability that might be exploited by an attacker arises when we use the same type of hypervisor across all of our virtual machines. So, for example, if our organisation relies solely on VMware’s ESXi hypervisor and a new exploit is created for that hypervisor, an attacker could use that against all of our systems. So if it’s successful on one of our servers, they’re likely going to try it on the rest of our servers. And if all of our servers use the same platform, in this case VMware’s ESXi, this vulnerability could be exploited across our entire organization.

The challenge with this, though, is that again, if we utilize multiple hypervisor platforms, our support costs and our training requirements are also going to increase. For this reason, most organizations do choose to use a single platform, but at least they’re making a measured risk decision when they do that. To mitigate that risk, the organisation should utilise proper configurations, ensure the hypervisor always remains patched and up to date, make it accessible only through a secure management interface, and tightly control access to it. Now, these are some of the things you must consider when you start figuring out whether you’re going to virtualize your systems and migrate them to an on-premise or cloud-based solution. First, are you going to virtualize? If so, are you going to use traditional virtual machines or are you going to use containerization? What is the risk/reward ratio here? There’s a balancing act you have to do. It’s a business decision, and it’s also a cybersecurity decision. And so you’re going to have to measure these things and decide what is best to select for your organisation and its particular use cases.
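To make the container segmentation point concrete, here is a minimal sketch using the Docker SDK for Python (docker-py), which is an assumption on my part since the lesson only names Docker and Kubernetes generally; it assumes a Docker engine is running locally, and the image and network names are hypothetical. It attaches two containers to separate user-defined bridge networks so they cannot reach each other unless deliberately connected.

```python
import docker

client = docker.from_env()   # talks to the local Docker engine

# Two user-defined bridge networks: containers attached to different
# networks are segmented and cannot talk to each other by default.
net_a = client.networks.create("app_a_net", driver="bridge")
net_b = client.networks.create("app_b_net", driver="bridge")

# Two containers sharing the host's kernel but isolated from one another.
web = client.containers.run("nginx:alpine", name="app_a_web",
                            detach=True, network="app_a_net")
api = client.containers.run("nginx:alpine", name="app_b_api",
                            detach=True, network="app_b_net")

# app_a_web cannot resolve or reach app_b_api unless we deliberately
# connect both containers to a common network, for example:
# net_a.connect(api)
```

Even with this segmentation, remember that the shared host operating system underneath remains the single point whose compromise exposes every container, so it has to be hardened and patched.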
