Amazon AWS Certified Database Specialty – Amazon Redshift Part 3
June 18, 2023

13. Redshift concurrency scaling

Now let’s talk about Redshift concurrency scaling. Concurrency scaling is a feature of Redshift that automatically scales your cluster capacity to support an increase in read workload. So if your Redshift cluster detects an increase in concurrent read requests, it automatically adds capacity, and concurrency scaling can scale to support a virtually unlimited number of concurrent queries. Concurrency scaling is a parameter in your WLM (workload management) queue configuration, and it is a dynamic parameter, so enabling it does not require a reboot.

To enable concurrency scaling, you simply set the concurrency scaling mode parameter to “auto” for a given WLM queue. And remember, you can continue with your read and write operations while concurrency scaling is in progress. Concurrency scaling works on a credit basis: you accumulate about one hour of free concurrency scaling credits every single day, and these credits are used automatically whenever your cluster sees an increase in concurrent read requests. All right; a sketch of enabling this through the WLM configuration follows below, and then let’s continue.
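As an illustration, here is a minimal boto3 sketch of enabling concurrency scaling on a WLM queue. The parameter group name and the queue layout are assumptions, but wlm_json_configuration and the “auto” concurrency scaling mode are the actual settings involved:

    import json
    import boto3

    redshift = boto3.client("redshift")

    # Example WLM queue layout with concurrency scaling set to "auto"
    # for the first queue.
    wlm_config = [
        {"query_concurrency": 5, "concurrency_scaling": "auto"},
        {"short_query_queue": True},
    ]

    redshift.modify_cluster_parameter_group(
        ParameterGroupName="my-wlm-params",  # assumed parameter group name
        Parameters=[
            {
                "ParameterName": "wlm_json_configuration",
                "ParameterValue": json.dumps(wlm_config),
                "ApplyType": "dynamic",  # dynamic parameter, so no reboot is needed
            }
        ],
    )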

14. Redshift scaling

Now, let’s talk about scaling options in Redshift. So this is your Redshift cluster with a leader node and compute nodes, and you have different options to scale it. The first one is elastic resize, which you use to add or remove nodes in your existing cluster. This takes just a few minutes, typically 4 to 8. Remember that this is a manual process, not automatic, and while the elastic resize is in progress, your cluster will be unavailable. Then we also have the classic resize option, where you change the node type, the number of nodes, or both. The way classic resize works is that it copies your data into a new cluster. Both resize types go through the same API call, as sketched below.
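Here is a hedged boto3 sketch of the two resize types; the cluster identifier, node type, and node counts are assumptions:

    import boto3

    redshift = boto3.client("redshift")

    # Elastic resize: add or remove nodes in place (typically a few minutes).
    redshift.resize_cluster(
        ClusterIdentifier="my-cluster",  # assumed identifier
        NumberOfNodes=4,
        Classic=False,  # False means elastic resize
    )

    # Classic resize: change node type and/or count; data is copied to a new cluster.
    redshift.resize_cluster(
        ClusterIdentifier="my-cluster",
        NodeType="ra3.4xlarge",
        NumberOfNodes=4,
        Classic=True,
    )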

During a classic resize, your source cluster is in read-only mode. So remember that whenever a classic resize operation is in progress, the source cluster is read-only. Alternatively, you can use a snapshot restore followed by a classic resize: you take a snapshot of your original Redshift cluster, restore it to a new cluster of the same size, and then resize that restored cluster using classic resize. The benefit of the snapshot-restore approach over a plain classic resize is that your source cluster remains fully available throughout the scaling process. Once the operation completes, you manually copy the delta of new data over to the new cluster. A sketch of this approach follows below; then let’s continue.
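A minimal sketch of the snapshot-restore-then-resize flow, with all identifiers assumed:

    import boto3

    redshift = boto3.client("redshift")

    # 1. Snapshot the source cluster (the source stays fully available).
    redshift.create_cluster_snapshot(
        SnapshotIdentifier="my-cluster-snap",  # assumed snapshot name
        ClusterIdentifier="my-cluster",
    )

    # 2. Restore the snapshot into a new cluster of the same size.
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier="my-cluster-resized",
        SnapshotIdentifier="my-cluster-snap",
    )

    # 3. Classic-resize only the restored cluster; the source is untouched.
    redshift.resize_cluster(
        ClusterIdentifier="my-cluster-resized",
        NodeType="ra3.4xlarge",
        NumberOfNodes=8,
        Classic=True,
    )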

15. Redshift backup, restore and cross-region snapshots

Now, let’s talk about the backup and restore options in Redshift. Redshift maintains at least three copies of your data: the original, a replica on the compute nodes, and a backup copy in S3. Snapshots in Redshift are point-in-time backups of your cluster, stored internally in S3, and these backups are incremental: the first backup is a full backup, and subsequent backups save only the changes. When you restore from a backup, you always restore into a new cluster. And you have two types of backups: automated and manual.

With automated backups, you can configure the schedule as per your requirements. You could have automated backups every 8 hours or every 5 GB of data changed, or you can define a custom schedule, and you also set the retention period for the automated backups. Manual backups, on the other hand, can be retained as long as you want; unless you delete them, they are kept. In addition to this, you can ask Redshift to automatically copy your snapshots to another AWS region, irrespective of whether they are automated or manual snapshots. A sketch of the schedule and retention configuration follows below; after that, let’s see how to copy these snapshots to another region.
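A hedged boto3 sketch of a custom schedule and retention settings; the schedule name, cluster identifier, and values are assumptions:

    import boto3

    redshift = boto3.client("redshift")

    # A custom automated-snapshot schedule, e.g. every 8 hours.
    redshift.create_snapshot_schedule(
        ScheduleIdentifier="every-8-hours",     # assumed schedule name
        ScheduleDefinitions=["rate(8 hours)"],
    )
    redshift.modify_cluster_snapshot_schedule(
        ClusterIdentifier="my-cluster",
        ScheduleIdentifier="every-8-hours",
    )

    # Retention period, in days, for automated snapshots. Manual snapshots
    # have their own retention and are otherwise kept until you delete them.
    redshift.modify_cluster(
        ClusterIdentifier="my-cluster",
        AutomatedSnapshotRetentionPeriod=7,
    )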

If you have unencrypted snapshots, the process is simpler than with encrypted snapshots. So let’s look at the unencrypted case first. For unencrypted snapshots, you simply configure cross-region snapshots from the Redshift console: you provide a destination region, specify the retention period for your automated and manual snapshot copies, save, and you’re done. If you’re using encrypted snapshots, the process is slightly more elaborate.

When you configure cross-region snapshots for encrypted snapshots, you additionally have to specify what is called a snapshot copy grant. You provide a name for the snapshot copy grant and choose the KMS key to be used for encryption; you can use the AWS-owned key, a key from your current account, or a key from another AWS account. When you save this, the snapshot copy grant is created in the destination region, and it allows the destination region to store the encrypted snapshots. So that’s about it; before we continue, the same setup through the API is sketched below.
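A hedged boto3 sketch of the cross-region copy configuration. The regions, names, and KMS key alias are assumptions; note that the snapshot copy grant lives in the destination region:

    import boto3

    # The snapshot copy grant is created in the destination region and lets
    # Redshift use a KMS key there to encrypt the copied snapshots.
    dest = boto3.client("redshift", region_name="eu-west-1")  # assumed destination
    dest.create_snapshot_copy_grant(
        SnapshotCopyGrantName="my-copy-grant",  # assumed grant name
        KmsKeyId="alias/my-dest-key",           # assumed key in the destination region
    )

    # Enable cross-region snapshot copy on the source cluster.
    src = boto3.client("redshift")              # client in the source region
    src.enable_snapshot_copy(
        ClusterIdentifier="my-cluster",
        DestinationRegion="eu-west-1",
        RetentionPeriod=7,                      # days to keep copied automated snapshots
        SnapshotCopyGrantName="my-copy-grant",  # only needed for encrypted clusters
    )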

16. Redshift Multi-AZ deployment alternative

Now, let’s talk about Multi-AZ deployments in Redshift. We already know that there is no Multi-AZ in Redshift, but there is an alternative you can use to implement a Multi-AZ-style deployment. So let’s find out how to do that. What you do is run two separate clusters in different AZs and load the same data into both: you have two Redshift clusters running in two Availability Zones, and you use a snapshot of one cluster to restore data into the other, since you can restore a cluster snapshot to a different Availability Zone. The two clusters will then have the same set of data, and you can use Redshift Spectrum, for example, to query the data spread across these two AZs. A sketch of the restore step follows below.
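A minimal sketch of standing up the second cluster from a snapshot in a different AZ; the identifiers and the AZ are assumptions:

    import boto3

    redshift = boto3.client("redshift")

    # Restore a snapshot of the primary cluster into a second AZ in the same
    # region; the result is an independent cluster holding the same data.
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier="my-cluster-az2",    # assumed name for the second cluster
        SnapshotIdentifier="my-cluster-snap",  # snapshot taken from the primary
        AvailabilityZone="us-east-1b",         # assumed different AZ
    )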

17. Redshift availability and durability

Now let’s talk about Redshift availability and durability, considering different outage scenarios. So let’s say we have a drive failure. In this case, your Redshift cluster remains available, although with some decline in query performance. The way Redshift handles a drive failure is by using the data stored on replicas on other drives within the node: it replaces the failed drive and copies the data over from the replica stored on the other drives within that particular node. Of course, if you have a single-node cluster, this is not going to work; single-node clusters do not support data replication, so in that case you must restore from a snapshot. But in multi-node clusters, drive failures are handled by Redshift automatically. Now, that was the drive failure scenario.

What happens if we have a node failure? In the case of node failures as well, Redshift has the capability to detect and replace the failed nodes. The only thing to remember here is that the cluster remains unavailable until the node gets replaced. And we already know that Redshift does not have Multi-AZ support, so if the AZ goes down, your cluster will not be available until the AZ outage gets resolved, but your data is still preserved. Apart from this, as we already discussed, you can always restore from a snapshot to another AZ within the region, and when you do this, your frequently accessed data gets restored first, so your business operations can resume as soon as possible. All right, so that’s about availability and durability in Redshift. Let’s continue to the next lecture.

18. Redshift security

Now, let’s talk about Redshift security. First, encryption. Redshift supports encryption at rest as well as encryption in transit. For encryption at rest, Redshift uses hardware-accelerated AES-256 encryption, and you can use your own KMS keys here, or use HSMs (hardware security modules). For encryption in transit, we use SSL connections. And when you make API requests to Redshift, they must be signed using the SigV4 process, which simply signs your API requests using your AWS credentials. When you use Redshift Spectrum, for example, you can also make use of S3 SSE (server-side encryption). And in addition to this, if you’re interested in end-to-end encryption, you can use client-side encryption, also called envelope encryption.

So the connection between your client application and Redshift is encrypted using SSL. Similarly, the connection between your data source (like Kinesis) and S3 is also encrypted using SSL/TLS. The connection between S3 and Redshift, typically used for the COPY command, can be encrypted using S3 server-side encryption. And if you need client-side encryption, you use envelope encryption, which encrypts your data at the source, so as the data moves from the source all the way to Redshift, it remains encrypted. So that’s about encryption; a sketch of enabling encryption at rest and requiring SSL follows below.
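A hedged sketch of both encryption pieces with boto3; all names, the key alias, and the placeholder password are assumptions:

    import boto3

    redshift = boto3.client("redshift")

    # Encryption at rest with a customer-managed KMS key.
    redshift.create_cluster(
        ClusterIdentifier="secure-cluster",    # assumed identifier
        NodeType="ra3.xlplus",
        NumberOfNodes=2,
        MasterUsername="admin",
        MasterUserPassword="Replace-Me-1",     # placeholder only
        Encrypted=True,
        KmsKeyId="alias/my-redshift-key",      # assumed KMS key alias
    )

    # Encryption in transit: require SSL connections via the parameter group.
    redshift.modify_cluster_parameter_group(
        ParameterGroupName="my-params",        # assumed parameter group
        Parameters=[{"ParameterName": "require_ssl", "ParameterValue": "true"}],
    )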

Now, let’s talk about IAM and network. Access to Redshift resources is controlled at four different levels. First, cluster management is done using your IAM policies. Second, connectivity is set up using your VPCs and security groups. Third, for database access, you use GRANT/REVOKE statements and CREATE USER / CREATE GROUP statements to manage your users, groups, and permissions. And remember that you can only talk to Redshift through the leader node; you cannot access the compute nodes directly. Whenever you make a request, it goes to the leader node, and the leader node distributes your query across the compute nodes, gathers the data back from them, and returns the aggregated result to you. Finally, Redshift also supports IAM authentication, for use with temporary credentials and for SSO. All right, so that’s about Redshift security; a sketch of the temporary-credentials flow follows below, and then let’s continue to the next lecture.
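As an illustration of the IAM authentication piece, a minimal sketch of requesting temporary database credentials; the user, database, and cluster names are assumptions:

    import boto3

    redshift = boto3.client("redshift")

    # Temporary database credentials via IAM, instead of a stored DB password.
    creds = redshift.get_cluster_credentials(
        DbUser="analyst",                # assumed database user
        DbName="dev",                    # assumed database
        ClusterIdentifier="my-cluster",
        DurationSeconds=900,
        AutoCreate=False,
    )

    # creds["DbUser"] and creds["DbPassword"] are then used by the SQL client,
    # which always connects to the leader node endpoint.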

19. Enhanced VPC routing in Redshift

Now let’s talk about the concept of enhanced VPC routing in Redshift. Enhanced VPC routing is used to manage data traffic between your Redshift cluster and other resources that would otherwise require internet access. For example, if you use the COPY or UNLOAD commands, this traffic typically goes over the internet if you don’t have enhanced VPC routing in place. If you set up enhanced VPC routing, all your COPY and UNLOAD traffic is routed through your VPC instead. And as I said, if you don’t enable enhanced VPC routing, this traffic is routed over the internet, and this includes traffic within AWS, like S3 traffic, for example.

So if your S3 bucket is in the same region as your cluster, you can use VPC endpoints (S3 gateway endpoints, to be specific) to route the traffic through your VPC to the bucket. If your S3 bucket sits in a different region, you may have to use a NAT gateway instead: you set up the NAT gateway, and the traffic is routed through your VPC to the bucket sitting in the other region. And if you want to connect to AWS services outside your VPC, you can consider using an internet gateway as well. A sketch of enabling enhanced VPC routing and creating the S3 endpoint follows below; after that, let’s look at how this interacts with Redshift Spectrum.
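A hedged sketch with boto3; the cluster identifier, VPC ID, and route table ID are assumptions:

    import boto3

    redshift = boto3.client("redshift")
    ec2 = boto3.client("ec2")

    # Turn on enhanced VPC routing so COPY/UNLOAD traffic flows through the VPC.
    redshift.modify_cluster(
        ClusterIdentifier="my-cluster",
        EnhancedVpcRouting=True,
    )

    # Gateway endpoint for S3 in the cluster's VPC, for same-region buckets.
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",             # assumed VPC
        ServiceName="com.amazonaws.us-east-1.s3",  # assumed region
        RouteTableIds=["rtb-0123456789abcdef0"],   # assumed route table
    )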

So here we have Redshift Spectrum, which accesses your S3 data and stores your external tables in an external data catalog like Athena, Glue, or a Hive metastore. By default, Redshift Spectrum does not use enhanced VPC routing, and if you want to route that traffic through your VPC, you need to do some additional configuration. For example, you need to modify your IAM role to allow traffic to the S3 bucket from Redshift Spectrum, and you also need to modify the policy attached to your S3 bucket. Typically, if you have enabled enhanced VPC routing on your Redshift cluster, Spectrum will not be able to access the S3 bucket if the bucket policy restricts access only to the VPC endpoint that you use for enhanced VPC routing.

So what you do instead is use a bucket policy that restricts access to specific principals, like particular users or accounts; this allows Redshift Spectrum to access the S3 bucket. In addition to this, we already know that Spectrum uses Athena and Glue to store your external table schemas, so you also have to configure the VPC to allow your cluster to access AWS Glue and Athena. All right, so that’s about enhanced VPC routing; a sketch of such a principal-based bucket policy closes this lecture.
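A hedged sketch of a principal-based bucket policy; the account ID, role name, and bucket name are all assumptions:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Allow a specific IAM role (e.g. the one the cluster uses for Spectrum)
    # instead of restricting access to a VPC endpoint.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/my-spectrum-role"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-spectrum-bucket",
                "arn:aws:s3:::my-spectrum-bucket/*",
            ],
        }],
    }

    s3.put_bucket_policy(Bucket="my-spectrum-bucket", Policy=json.dumps(policy))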

20. Redshift monitoring

Now, let’s talk about Redshift monitoring. Monitoring is typically provided by CloudWatch, and Redshift is no exception. You can monitor the different CloudWatch metrics right from within your Redshift console, or through CloudWatch itself. Typical metrics are CPU utilization, latency, throughput, et cetera; a sketch of pulling one of these follows below. In addition to the CloudWatch metrics, Redshift also provides monitoring information about your query and load performance. This data is not published to CloudWatch; it’s only displayed within the Redshift console. It includes cluster performance, query history, database performance, as well as information about concurrency scaling.
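For example, a minimal sketch of pulling the cluster’s CPU utilization for the last hour; the cluster identifier is an assumption:

    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Average CPU utilization over the past hour, in 5-minute buckets.
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/Redshift",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "ClusterIdentifier", "Value": "my-cluster"}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    print(stats["Datapoints"])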

We also have what’s called the Redshift Advisor, which offers recommendations for cost optimization and performance tuning of your Redshift cluster. All right, let’s continue. Let’s talk about database audit logging in Redshift. Redshift provides you with audit logs. These are typically stored in S3 buckets, and they are not enabled by default; you must enable them if you want to use them. There are three parts to Redshift audit logs. First, the connection log, which stores connection and disconnection information as well as any failed login attempts. Second, the user log, which stores changes to user information.

And finally, we have the user activity log, which logs the queries that are run against the Redshift cluster. Now, in addition to audit logs, Redshift also maintains the STL and SVL tables (STL and SVL are prefixes of the table names). These are similar to audit logs, and they are available by default on every Redshift cluster node. For example, the STL_QUERY table logs all executed queries, and the STL_CONNECTION_LOG table logs connection and disconnection information along with failed login attempts, similar to the connection log in the Redshift audit logs. And just like with any other AWS service, we have CloudTrail to capture API calls made to your Redshift cluster. A sketch of enabling audit logging and querying an STL table follows below.
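A hedged sketch of both; the bucket, database, and user names are assumptions, and the STL query goes through the Redshift Data API here, though any SQL client works:

    import boto3

    redshift = boto3.client("redshift")

    # Audit logs are off by default; enable them into an S3 bucket.
    redshift.enable_logging(
        ClusterIdentifier="my-cluster",
        BucketName="my-audit-log-bucket",  # assumed bucket
        S3KeyPrefix="redshift-audit/",
    )

    # STL tables are queried like ordinary tables.
    data = boto3.client("redshift-data")
    data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",                    # assumed database
        DbUser="admin",                    # assumed user
        Sql="SELECT query, starttime FROM stl_query ORDER BY starttime DESC LIMIT 10;",
    )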

21. Redshift pricing

Now let’s talk about Redshift pricing. With Redshift, you only pay for what you use. You pay for compute node hours. You pay for backup storage, whether it is manual or automated. And just like with any other AWS service, you also pay for data transfer, but you don’t pay for data transfer between Redshift and S3 within the same region. Then you also pay for the data scanned by Spectrum: the amount of S3 data that your Spectrum queries scan is what you pay for. You also pay for managed storage if you’re using RA3 nodes.

And for concurrency scaling, you pay a per-second rate. As for the free credits, you get about one hour of free concurrency scaling credits per 24 hours; so every day you accumulate about an hour of concurrency scaling credit for free, and anything above that is charged at the per-second rate. And for Spectrum, there are no charges when you are not running queries; you only pay for the amount of S3 data that your queries scan. All right.

22. Redshift related services – Athena and Quicksight

Now let’s look at Athena and QuickSight. So, Athena is an analytical tool, similar to Redshift, but kind of a watered-down version of Redshift, so to say. You can run SQL queries on top of your S3 data using Athena, you pay per query, and you can output results to S3 as well. Athena supports CSV, JSON, Parquet, and ORC data formats. All Athena queries are logged in CloudTrail, and those logs can be sent to CloudWatch Logs for monitoring within CloudWatch. And Athena is great for sporadic queries: if your use case is running analytics once in a while, say once a month, then Athena is a good option.

Redshift, by contrast, is for sustained usage: if you have a large amount of data and query it constantly, then you should consider Redshift instead of Athena. And then we have QuickSight. QuickSight is a visualization tool, or a BI (business intelligence) tool, and you can use QuickSight to create different analytical dashboards that help you make business decisions. QuickSight integrates with different services like Athena, Redshift, EMR, RDS, and so on. So here is a quick example of using Athena: it can read data from S3 and output results to S3, the results can be analyzed using QuickSight, and the queries are logged in CloudTrail, which can be sent over to CloudWatch Logs.

And you can also set up CloudWatch alarms from those CloudWatch Logs. Now, let’s also look at QuickSight. QuickSight, as I mentioned, is used to create analytical dashboards using different sources, which include Redshift as well. You can gather data and analytical information from sources like RDS, Redshift, Athena, and S3, as well as third-party tools like Salesforce, Teradata, Excel files, Jira, and so on. Now, this is not an analytics course, and QuickSight is not a database, so we’re not going to go deeper into it; that’s all we need to know about QuickSight from the exam perspective. To wrap up the simple scenario of Athena in action described above, a sketch of kicking off such a query follows below; then let’s continue to the next lecture.
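A minimal sketch of running an Athena query over S3 data; the database, table, and results bucket are assumptions:

    import boto3

    athena = boto3.client("athena")

    # Pay-per-query SQL over data sitting in S3; results are written back to S3.
    athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM sales;",          # assumed table
        QueryExecutionContext={"Database": "my_database"},  # assumed Glue database
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )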

23. AQUA for Redshift

Now let’s talk about AQUA for Redshift. AQUA is the Advanced Query Accelerator, a newer feature of Redshift. AQUA is a distributed and hardware-accelerated cache that sits on top of your Redshift cluster’s storage. It allows you to run data-intensive tasks like filtering and aggregation closer to the storage layer, which is S3, and what this does is avoid networking bandwidth limitations. If you compare this with other data warehouses, they typically need to move data to the compute nodes for processing.

But with AQUA, you can run your data-intensive tasks closer to the storage layer without having to move the data to the compute nodes for processing, and this enables Redshift to run up to ten times faster than other cloud data warehouses. AQUA can process large amounts of data in parallel across multiple nodes, and it can automatically scale out to add additional capacity. And you do not need to make any code changes to start using AQUA if you’re already using Redshift. So that’s all about AQUA, and that also brings us to the end of this section on Redshift. I hope you found this information useful. Let’s continue to the next section, where we discuss another database service.
