8. Cloud Storage Demo Part 2 of 2
Now if I select the file, you can see that that’s the container picture, and so forth. And if I go back here, I could just as well edit the metadata if I wanted: I could use that image again and add whatever metadata I liked. And then, if I wanted to, I could copy the file as well; I don’t have to just move it. So if I go here to Browse, let’s go back to the test folder, select it, and copy the file there. And now if I go to the test folder, you can see that it’s not there. Well, I wonder why. One of the things you want to be aware of is that there can be some latency when you do these operations. So if I go back and refresh, this may take a second.
Let’s go back, and again, it could take ten minutes or it could take a minute; you just never know. A lot of it depends on the latency. We’ll check again in a second... and yes, the copy is finally here. As you can see, it says “Copy of.” So now let’s go ahead and delete that. Actually, we need to go up here first. Delete it, and it’s out of the way. Okay, that’s essentially a very simple process for creating a bucket and a folder and moving files around. In the next demo I have coming up, we’re going to talk more about transfer, and essentially the capabilities you could use to migrate files over from AWS or, let’s say, from your own data centre, assuming it meets the requirements, into GCP.
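The “refresh and check again” behaviour above can be handled in code with a simple polling loop. Here is a minimal sketch in Python; `fake_object_exists` is a hypothetical stand-in for a real Cloud Storage existence check (for example, one made with a client library), used so the pattern is clear without needing credentials.

```python
import time

def wait_until_visible(check, timeout_s=600, interval_s=5):
    """Poll `check()` until it returns True or the timeout expires.

    Useful after copy/move operations, where the new object may not be
    listed immediately because of the latency discussed above.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False

# Stand-in check that succeeds on the third poll, simulating latency:
calls = {"n": 0}
def fake_object_exists():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_until_visible(fake_object_exists, timeout_s=60, interval_s=0)
```

The interval and timeout are tunable; as noted in the demo, visibility can take seconds or minutes, so a generous timeout is safer than a single check.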
So to close it out, let’s go back to Storage, and you can see that I have my buckets. Let’s do a quick clean-up here, and let me just make sure I’m where I’m supposed to be: right there. Before I delete this, I want to check a few things. You can see that the lifecycle here is enabled. Now, before deleting anything, pay attention to what you delete, because one of the things that is very easy to do is accidentally delete your files if you do not organise them properly. Google does warn you, of course, before you delete something, which is great. So let’s go over to this bucket here. It looks like we have a little latency going on... and we did. Now I’m in that bucket. What I’d like to do now is delete that folder. It asks, “Do you want to delete it?” There are two files in there, and it says that once you do it, you can’t undo it. I’m happy to go ahead and get it done, so I tell it to delete. Now let’s select the remaining files; as you can see, I can select a folder or however many files I choose.
I can go ahead and delete those as well, and now I’m back to where I started. Let’s go back to the buckets. As you can see, this is what we had beforehand, with the exception of this one. So guess what we want to do? We want to delete that bucket, and we’ll delete that one. Now, before you delete any bucket, be aware that if any of your applications are using it, the console still lets you delete it. So if I go here to delete, it tells you that you can’t undo this. Be very cautious, because if you have applications using this bucket and dropping files into it, and you delete the bucket, guess what happens when they try to drop or retrieve files? The bucket is not available. This is where you can create significant, but entirely avoidable, issues when you delete those buckets and files. So that was about it for buckets. Let’s go ahead and continue on.
9. Cloud Spanner Overview
So, for Spanner, we’re going to talk about the overview and then also some of the product features. Cloud Spanner is essentially the only enterprise-grade, globally distributed, strongly consistent database service built for the cloud, specifically to combine the benefits of a relational database structure with non-relational horizontal scale. Okay, so what exactly makes this unique? Well, that’s a good question. Cloud Spanner takes a distributed approach. Typically, if you want to scale SQL, you generally need to scale up; when I say scale up, I mean scale vertically, adding more resources to that virtual machine. SQL does not scale very well in general: relational databases don’t really scale horizontally, because they’re not meant to do that. Cloud Spanner has a unique niche in the sense that it’s really the only offering out there that accomplishes what it does.
Because Cloud Spanner is really the only enterprise-grade, globally distributed, strongly consistent database service, Amazon really doesn’t have anything comparable. You’ll read about Redshift or Aurora, for example, but those are really totally different use cases: you’ve got a SQL database that might be comparable to Cloud SQL, but it’s not globally distributed; it just doesn’t scale that way. So what Cloud Spanner does that no other competitor can currently do is scale globally. You can scale it horizontally, it’s fully managed and automatic, availability is high, and it is strongly consistent. There’s a white paper that Google posted which goes through this, and I think it’s excellent reading, especially if you’re a data architect. If you’re going to be a data engineer or a data architect, you definitely want to read the paper; if you’re already familiar with SQL, you’ll get it really quickly. Cloud Spanner addresses a lot of challenges, especially around consistency. Because this is built on the Google network, and Google built and owns its own network infrastructure, the best of any cloud provider, its performance is second to none, and Google is able to scale this out multi-regionally.
This is a big deal for you as a customer. You could scale this out to Asia, the US, and Europe, for example, and still be able to sleep at night; you can really distribute this globally. And again, it’s very secure, with a lot of built-in capabilities around security; it is built, like all the other Google services, on the Google Cloud Platform’s foundation of security. Another important distinguishing factor is that it’s managed by Google’s SRE team, the Site Reliability Engineering team, so you have some of the highest-paid, most skilled engineers managing this infrastructure. That’s something Amazon certainly can’t say. Where does Cloud Spanner fit in? Well, it’s relational, so you can take SQL workloads and run them on Cloud Spanner; that’s what it’s meant for, and you can distribute that database globally. So if you have a database that you have to scale and you want high availability, this is what you want to use. It’s a good use case for financial transactions and ad technology, and there are numerous use cases on the Spanner page: eCommerce, logistics, telecom, you name it. It’s also extremely quick to deploy.
On the security side, it supports encryption by default, both in transit and at rest; this is again a de facto standard on Google Cloud, and Spanner of course inherits Google’s broader security architecture as well. When it comes to multi-language support, it’s going to support everything you’d want: Java, Node.js, PHP, Python, and Ruby. I don’t believe it supports Ruby on Rails yet, but it does support Ruby. Now, on the Cloud Spanner page there’s a good little comparison you can pull up: Spanner supports high availability; its scalability is horizontal, whereas a traditional database’s is vertical; and replication is automatic, so you don’t have to worry about it, whereas typically it’s something you configure yourself in other situations.
And again, strong consistency, right? The white paper we’re going to look at talks about ACID versus BASE. Those of you who are data engineers, database gurus, and big data folks know exactly what I’m talking about: does the transaction only get acknowledged once it is fully committed? In other words, the response confirming the transaction as complete comes only after the transaction has actually been committed. Cloud Spanner’s relational model is designed to support full relational semantics: you can address it with SQL commands, it supports schema changes, and you can reuse your existing SQL skills. Here are some links. Now let’s go over to the PDF I want to show you; it will also be available for download with the lecture. Here is the PDF: it’s the Spanner TrueTime paper.
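The commit-then-acknowledge (ACID) pattern described above can be illustrated with any relational engine. The sketch below uses Python’s built-in sqlite3 purely as a stand-in; Spanner’s client API is different, but the semantics the lecture describes, where either the whole transaction applies or none of it does, are the same idea.

```python
import sqlite3

# In-memory database as a stand-in relational engine (illustrative only;
# not the Cloud Spanner API).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

# Transfer funds atomically: either both updates apply, or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure the rollback leaves both balances untouched

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 70, 2: 80}
```

Only after the `with` block commits does the caller see the result, which is the “acknowledged once fully committed” behaviour being described.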
And then there’s the CAP theorem. Remember consistency, availability, and partition tolerance? The paper goes through what the CAP theorem is and essentially covers all of that. I won’t actually spend time on it here because it’s a little outside the realm of what we want to cover, but I want to make sure you understand it, especially if you’re trying to understand the difference between Cloud SQL and Spanner, which have very different use cases even though both support SQL. Cloud SQL is not going to scale globally. Now, could you scale Cloud SQL globally on a regional basis, with different instances? You could, but that’s not really the way you want to go when you’re running an international production application. In a lot of cases, you want transactional consistency, right? That’s very important: strong, external, global transactional consistency. You don’t want to open an instance and find several transactions going on against the same data at the same time. What do you do then? You can’t guarantee consistency. You don’t want two users modifying the same record at once, right? That’s never what you want. The paper also goes through how availability is handled, and through different scenarios around reliability.
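The hazard of two users hitting the same record without strong consistency can be shown with a toy, fully deterministic example: two read-modify-write sequences interleave, and the second write silently clobbers the first (a classic lost update). This is a conceptual sketch, not anything from the paper itself.

```python
# Two "users" each read the same starting balance, then write back their
# own computed result. Without transactional isolation, the second write
# clobbers the first: a classic lost update.
balance = 100

read_a = balance            # user A reads 100
read_b = balance            # user B reads 100, before A has written
balance = read_a - 30       # A writes 70
balance = read_b - 20       # B writes 80, silently losing A's update
print(balance)              # 80, not the correct 50

# With serialized (strongly consistent) transactions, B's read would see
# A's committed write, and the final balance would be 100 - 30 - 20 = 50.
```

This is exactly the class of anomaly that Spanner’s externally consistent transactions rule out.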
Now, Chubby. What exactly is Chubby? There’s an original Google paper on Chubby, their distributed lock service, which goes through how availability is handled in distributed systems. It covers a lot of terminology and is a very good paper, and if you’re going to take the Data Engineer exam, you’ve got to read it; consider it required material. Then there’s TrueTime. This is important too, because if you have a distributed system, how do you handle timestamping? How do you handle multiple transactions that are a tenth of a second apart, or even less than that? There are numerous details to consider here. This is a really good Friday afternoon read, so take a look at it; it’s downloadable. And again, if you’re taking the Data Engineer exam, you have to read this. It will definitely get you up to speed on some of the questions you may see on the exam.
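On the timestamping question: TrueTime exposes time not as a single instant but as an uncertainty interval [earliest, latest] that is guaranteed to contain true time, and Spanner’s commit-wait rule delays acknowledging a commit until the uncertainty around its timestamp has passed. The real API is internal to Google, so the sketch below is a hypothetical model of the interval idea only.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # guaranteed lower bound on "true" time
    latest: float    # guaranteed upper bound on "true" time

def tt_now(uncertainty_s=0.007):
    """Hypothetical TrueTime-style clock: true time lies inside the interval."""
    t = time.time()
    return TTInterval(t - uncertainty_s, t + uncertainty_s)

def definitely_after(a: TTInterval, b: TTInterval) -> bool:
    """True only if interval `a` lies entirely after interval `b`."""
    return a.earliest > b.latest

def commit_wait(ts: float, clock=tt_now):
    """Commit-wait: block until the clock's earliest bound passes `ts`,
    so every event observed afterwards is unambiguously later than `ts`."""
    while clock().earliest < ts:
        time.sleep(0.001)

t1 = tt_now()
commit_wait(t1.latest)   # after this, later events are ordered after t1
t2 = tt_now()
assert definitely_after(t2, t1)
```

The point of the wait is that two transactions a tenth of a second apart (or closer) can still be globally ordered, because their intervals no longer overlap.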
10. Cloud Datastore Overview
Cloud Datastore is generally going to be used in use cases where you have web and mobile applications, though certainly not only for those. We’ll go ahead and talk about some of the features as well. Datastore is a highly scalable NoSQL database for your applications. It automatically handles sharding and replication, providing you with a highly available and durable database that can scale automatically to handle your applications’ load. For example, you may want to use Cloud Datastore for your web applications.
Perhaps you will want to use it for mobile apps as well. It supports various data types, ACID transactions, and multiple access methods, such as a JSON API and open-source client libraries. It’s got a lot of nice built-in features, like strong consistency and global scalability. It is a managed service, and it supports ANSI 2011, which is important for SQL folks to know, because Datastore essentially fits into the non-relational category over here. It is primarily suited to mobile and web use. You may see it in applications such as games, user profiles, and perhaps some advertising technology as well. Applications that generally need to scale with a lot of queries, or perhaps a lot of indexing, could also be a good use case. Again, the right use case really depends on what you’re looking for in your application.
It is a schemaless database, which shapes what you’re going to want to do with it; but for web or mobile cloud apps, it’s generally a good solution. Datastore is a pay-per-use model: you pay for what you use. A REST API is supported, and it’s affordable and fast. Now, on the Datastore page, the pricing is there, and the pricing can be a little bit confusing. From what I remember, it is based on the amount of data you store, but also on the number of entity reads and writes, and on operations as well. You do get a free quota, and there is a pricing guide you’ll definitely want to look at before you scale out your app. As I stated, there are charges for reads and writes. Datastore can scale to terabytes. It supports what’s called a persistent hash map, along with filters, objects, and properties, and it supports puts and gets as well. Granularity is attribute-based. Now, what is an attribute? An attribute is basically as granular as you’re going to get: while SQL supports fields, attributes offer a finer granularity that may be useful.
One gut instinct is that you’ll definitely want to use App Engine with this for deploying your applications; look into that and make sure it works for you. I’ve not really seen use cases where you deploy this with an in-house app or on virtual machines, but things change, and perhaps I just haven’t encountered that Datastore area. It is a great option if you’re looking at NoSQL, and it is fully managed. You can access it with JSON or with what are called ORMs, object-relational mappers, for those folks who are developing with objects; Objectify is one example. The link for Datastore is there.

Now, the terminology is a little different, and this is what gets folks at first, even when I teach the class; it’s the hardest part to get. We’re all familiar with SQL, so here is a comparison. In relational database or SQL terms, a “kind” is a category of object, similar to a table. An “entity” is a single object, similar to a row. A “property” is like a field, and a “key” is like a primary key. This is the jargon that will get you on the Data Engineer exam if you don’t understand it, so this is a good starting point. This is really a whole course in itself, in the sense that you need to go to the Datastore developer guide to know where to start, because it’s a different query language. Another thing I want to mention is that it has a very easy-to-use dashboard, and it supports different data types: integers, floating points, strings, dates, you name it. There is a lot to talk about; you could literally talk about Datastore all day, but for this course I just want you to know the fundamentals.

You can replicate to multiple locations using Cloud Datastore replication, and it can be multi-regional. If you replicate with multi-regional redundancy, you’ll of course have higher availability.
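The kind/entity/property/key jargon can be made concrete with a tiny pure-Python model. This is not the real google-cloud-datastore client; the `Entity` class below is an illustrative stand-in showing how the terms line up with SQL concepts.

```python
from dataclasses import dataclass, field

# SQL term       -> Datastore term
# table          -> kind
# row            -> entity
# field / column -> property
# primary key    -> key

@dataclass
class Entity:                    # roughly a SQL row
    kind: str                    # roughly the SQL table name
    key: str                     # roughly the primary key
    properties: dict = field(default_factory=dict)  # roughly the columns

profile = Entity(
    kind="UserProfile",
    key="user-1001",
    properties={"display_name": "Ada", "level": 12, "score": 3.5},
)

# Properties are schemaless: another entity of the same kind can carry a
# completely different set of properties, with no schema migration needed.
guest = Entity(kind="UserProfile", key="guest-1", properties={"anonymous": True})

print(profile.properties["level"])  # 12
```

Holding this mapping in your head makes the Datastore developer guide, and the exam questions built on this jargon, much easier to parse.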
Because if one region goes down, what happens? You have another region that is available. It supports regional locations as well. Now, the reason you may want to choose regional over multi-regional is that you’ll get lower latency and better performance.
But the trade-off is what? The higher availability you’d get from multi-regional. Google does have global points of presence, POPs in its network, that support lower latency for Datastore replication. So, just a reminder: Datastore is a schemaless database, which allows you to worry less about making changes to your underlying data structure as your application evolves. It has a powerful query engine that allows you to search for data across multiple properties and sort as needed. Once again, you’re going to use this directly in the Google cloud; it’s not something you really want to run in-house. Basically, you load the data into Datastore and then query against it, which is really what it’s for.
11. Cloud Bigtable Overview
Cloud Bigtable has some amazing capabilities. It is fully managed and petabyte-scale, with low latency and seamless scalability for throughput. It has some AI capability, in that it essentially learns and adjusts to access patterns. Cloud Bigtable uses a low-latency storage stack; this is important in the sense that it has its own storage stack, optimised for this specific service, with redundant auto-scaling of the storage as well. When we talk about Cloud Bigtable, you’re definitely going to use it for big data, and that could be for a wide range of applications: advertising technology, finance, the Internet of Things, or anything else you require. Low-latency workloads, user analytics, and data analysis are just some of the use cases I’ve seen this tool used for. It really just depends on the scenario you need and the cost structure that works for you.
You can also tie this into Hadoop ecosystems. For example, you could tie it in with Cloud Dataflow and Dataproc, and it supports the HBase API. Bigtable can scale to petabytes, and through the key-value HBase API it supports row scans and writes; the granularity is on a row basis. Bigtable is designed to be queried against; that’s what it’s really used for. You’re going to want to use this to query data in general. Now, in the SDK there’s also a utility that can emulate the Bigtable services locally, so that developers don’t have to go back out to Google to develop code; it can be done locally as well. And, once again, if you’re going to develop against the Bigtable data services, you should be aware of this. Bigtable really is a great choice for big data. It supports a data API, streaming, and batch processing. For those unfamiliar with the two types of processing, streaming data is exactly what it sounds like: it’s processed sequentially, as it arrives.
Batch processing is done in loads, as if you were throwing things in the laundry and running a load at a time: you run it whenever you need to, kick it off, or schedule it. When we talk about performance, again, it’s very high performance; the workflows are generally considered faster than some comparable products, like Amazon’s. Now, here’s how the processing works at a high level. You have the clients that query against the Bigtable nodes, and the nodes then go over to the storage. The data storage system is called Colossus, which is a file system. The database structure for the storage, which isn’t shown in this diagram, works like this (I have a link for the white paper you can download, which goes through how Colossus works as a file system): basically, it uses what’s called sharding. Sharding takes blocks of contiguous rows, also known as tablets, and these tablets are distributed throughout the file system.
So it’s a distributed file system, and that distributed file system holds those contiguous rows, the tablets, distributed throughout it. The workload is distributed across that whole global file system, essentially, and this is what improves scalability and performance: it allows you to get that query back a split second quicker. That’s the goal of sharding and the Colossus file system. All the requests from the clients go through the front end, and the front end sends the data to the nodes; these nodes are called tablet servers, not that you need to know that right now. The cluster is part of an instance, and that instance handles requests and distributes them throughout the file system. That’s how it works at a high level.
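The tablet idea above can be sketched as splitting a sorted row-key range into contiguous blocks and spreading those blocks over nodes. The sketch below is conceptual only; the node names, round-robin assignment, and tablet size are illustrative assumptions, not Bigtable internals.

```python
def shard_into_tablets(sorted_row_keys, tablet_size):
    """Split a sorted list of row keys into contiguous blocks ("tablets")."""
    return [sorted_row_keys[i:i + tablet_size]
            for i in range(0, len(sorted_row_keys), tablet_size)]

def assign_tablets(tablets, nodes):
    """Spread tablets across tablet servers round-robin (illustrative only;
    the real system rebalances based on load)."""
    assignment = {node: [] for node in nodes}
    for i, tablet in enumerate(tablets):
        assignment[nodes[i % len(nodes)]].append(tablet)
    return assignment

row_keys = [f"user#{i:04d}" for i in range(10)]      # already sorted
tablets = shard_into_tablets(row_keys, tablet_size=4)
print([t[0] for t in tablets])  # first key of each tablet (the boundaries):
# ['user#0000', 'user#0004', 'user#0008']

nodes = ["node-a", "node-b"]
print(assign_tablets(tablets, nodes))
```

Because each tablet covers a contiguous key range, a query for a key range only touches the nodes holding the relevant tablets, which is where the scalability win comes from.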
Of course, how load balancing works, how schemas are set up, and how performance is handled is a little more complicated. There’s also full auditing, logging, and access control. There’s a lot to talk about; this is literally a one-day subject area, and if you take some of the data courses, you spend five or six hours on it. Cloud Bigtable integrates with big data tools such as Dataproc and Dataflow, as you can see here. But the way you want to look at it is this: you have a user here who’s essentially using gsutil. Now, gsutil is on the command line; it’s part of the SDK, which you’re going to download in the demo. Also part of the SDK is the bq tool, the BigQuery tool, which is how your developers can run queries efficiently. There’s also a web interface as well as an API, and you can query Cloud Storage. As part of that, you could also query BigQuery and display the results; so, let’s say, you want to run reports and then dump those reports into Cloud Storage.