Google Professional Data Engineer – Datalab ~ Jupyter
July 30, 2023

1. Datalab

Let’s quickly talk about Datalab, which is the Google Cloud Platform’s version of Jupyter or Anaconda: an environment where you can interactively execute code. Jupyter or IPython notebooks are exactly this, notebooks for running code. The way Datalab works is via a container running on a VM instance. To run your code, you’ll need to connect to this Datalab container on the cloud from your browser. There are a few interesting bits about notebooks and Datalab which are worth keeping in mind. Notebooks have become incredibly popular for passing code around because they are far better than plain text files: they include the code as well as documentation in the form of Markdown, and even the results.

Those results could be text, images, HTML, or JavaScript. All of this makes them really convenient, and it’s an added convenience that notebooks can now be stored in Cloud Storage, linked to a Git repository. An option that you can take advantage of is to clone that notebook repository from Cloud Storage to your VM’s persistent disk. You do need to keep in mind that this is a clone. The clone becomes your workspace, in which you will add, remove, or modify files. These notebooks will be auto-saved, but if that persistent disk gets deleted and you have not committed your code to Git, then all of your changes will be lost. Just keep this in mind.

The whole point of a Datalab notebook is to write code which can then access other GCP services: BigQuery, Cloud ML, and so on. This access happens via a service account which needs to be authorized accordingly. If you use Jupyter or Anaconda for Python notebooks, you’ve seen that there is something known as a kernel. Opening a notebook basically starts up a back-end kernel process which manages the session and its variables. Each notebook in Datalab has one Python kernel, so if you have N notebooks open, you will be running N kernel processes. Remember that kernels are single-threaded, and each one holds its whole session in memory, so with several notebooks open your workload is going to be really memory-heavy.

If execution is slow, you might want to change your machine type in order to improve performance. As we shall see during the course of the labs, there are multiple ways of connecting to Datalab. Starting with the gcloud SDK: here the Docker container runs in a Compute Engine instance and you connect to it through an SSH tunnel. You could also use the Cloud Shell command-line interface to achieve the same effect; here again the Docker container runs on a Compute Engine VM instance and you connect to it, but this time through Cloud Shell rather than through your own SSH client. The last way of connecting to Datalab is via your local machine: in this case you run the Docker container on your local machine, which must therefore have Docker installed.

2. Lab: Creating And Working On A Datalab Instance

We’ve already studied that Datalab is a virtual machine instance which comes preinstalled with Jupyter notebooks and all of the tools that you might require in order to run your machine learning models. Now let’s see how we can connect to our Datalab instance on the Google Cloud Platform. Click on the hamburger icon on the top left and go to APIs and Services. In order to be able to use Datalab, we need to enable the Compute Engine API on our GCP account. So go to Dashboard, which will take you to a screen which looks like this. Click on Enable APIs and Services at the very top. The next screen will give you a search box where you can search for the API that you want to enable; we want the Google Compute Engine API.

Type in Compute and select the Google Compute Engine API. Click on the Enable button here in bright blue and wait a little until the API is enabled. We now go back to the main page of our project: click on Google Cloud Platform and that will take you to the home page for this test project. We are now ready to bring up our Cloud Shell terminal window, which will allow us to connect to a VM instance on which Datalab will be installed. Open up Cloud Shell and make sure your gcloud components are updated: gcloud components update is what you have to run first, and then make sure you run gcloud components install datalab. The next command will actually create a VM instance which has Datalab installed on it.

We use the datalab create command from our terminal window, and the name of our VM instance is training-data-analyst. You can pick any name that you want; just make sure you remember it so that you can connect to this instance later. If you already have a Datalab VM instance that you created earlier using datalab create, you can connect to it by calling datalab connect, as you see on screen here in the subtitle at the bottom. To create an instance, use datalab create; to connect to a previously created instance, use datalab connect. When you call datalab create, it will ask you to specify a zone where you want your VM instance created. I’m going to use zone 37, which is Asia South, and then wait for my instance to be set up.
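Put together, the Cloud Shell steps dictated so far look roughly like this; the instance name is the lab’s (lowercased, since VM names cannot contain capitals or spaces) and the zone is only an example, so substitute your own:

    # Update the gcloud components and install the Datalab CLI component
    gcloud components update
    gcloud components install datalab

    # Create a new Datalab VM instance (you will be prompted for a zone
    # unless you pass one explicitly; asia-south1-a is only an example)
    datalab create training-data-analyst --zone asia-south1-a

    # Later, reconnect to an instance created earlier
    datalab connect training-data-analyst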

It will first create a disk to attach to my VM instance, then create the repository of Datalab notebooks and go through the bunch of steps needed to create the Datalab instance called training-data-analyst. This command will create an SSH tunnel between your Cloud Shell instance and your Datalab instance, and then ask you to click on the Web Preview button at the top right in order to connect to Datalab. So notice the Web Preview button at the top right. I’m going to change the port which I use for my Datalab instance: I’m going to set it to port 8081. Go ahead and hit Change and Preview, and this will open up your Datalab instance in a new browser window. Now, this happened very quickly.

That’s because I’ve sped up the video; it might take a few minutes to set up and connect. Within the docs folder here, you’ll see a number of sample IPython notebooks, such as Hello World.ipynb. You can open up this notebook just as if you were running Jupyter on your local machine, and then use the code cells to type in your Python code. Notice I’ve printed hello, and hello is displayed on screen. You are now set up with Jupyter/IPython notebooks on Datalab. All of this code is running on a VM instance in your Google Cloud Platform project; it’s not on your local machine at this point. If you’ve chosen to use the Google Cloud Platform to run your TensorFlow code, you’re now up and ready to work with Datalab.

3. Lab: Importing And Exporting Data Using Datalab

Amongst other things, in this lecture you’ll learn how you can reference Cloud Storage objects using Datalab. All the files you’ve been storing on the cloud, you can access from within Datalab very easily. In this lecture, we’ll see how we can get data into Datalab for processing and how we can export data from Datalab for storage. All the code that we are going to run in this demo is in datalab/docs/tutorials/BigQuery/Importing and Exporting Data.ipynb. The cool thing about these tutorials that Google has provided is that there are lots of comments which allow you to follow along very easily; you know exactly what’s going on if you just read those comments.

For example, using Datalab, you can import data into a BigQuery table from a CSV file that’s stored on the cloud, in your bucket in fact. Or you can perform a bunch of transformations and manipulations within your Python code and stream data from Python directly into BigQuery. Here are some useful libraries that we’ll be using in this program: the Datalab Context, the BigQuery library aliased as bq, which allows us to work with BigQuery, and the Cloud Storage library aliased as storage. We’ll also be using Pandas, the data analysis library that Python offers, which is very powerful and great for working with large data sets, along with some IO libraries for handling the raw data streams.
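As a rough sketch, the import cell looks something like the following; the exact module paths depend on which generation of the Datalab libraries your image ships (this assumes the google.datalab flavour):

    # Datalab and GCP client libraries used throughout this notebook
    import google.datalab.bigquery as bq      # work with BigQuery
    import google.datalab.storage as storage  # work with Cloud Storage
    from google.datalab import Context        # project and credential context

    import pandas as pd                       # data frames
    from io import BytesIO                    # wrap raw bytes in a file-like stream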

The %%gcs command allows us to interact with Google Cloud Storage directly from our IPython notebook on Datalab. %%gcs -h gives you a help listing of all the subcommands that you can run; you can use %%gcs to copy, create, delete, list, and do a whole bunch of other manipulations on Cloud Storage. The %%gcs list command, with no additional arguments, lists all the buckets that we have available on Cloud Storage; we have the loony Asia bucket and the loony US bucket. The %%gcs view command allows us to view the contents of the files that you’ve stored in your bucket. Datalab will also give you reasonable errors if you type the command wrong.

Here I had to specify the complete path to a file, and this allows me to view its contents. I’m going to go back, execute the %%gcs list command, and move on with the other bits of code in this file. %%gcs read allows us to read objects that are stored in our bucket and store them in variables. In this notebook, we read some sample data that Google has made publicly available to us: a CSV file named cars.csv, and we store the contents of this file in a variable named cars. Use --variable along with your %%gcs read command in order to store the information in a variable. Print the value of cars and you’ll see that there are four cars represented here: a Ford, two Chevys, and a Jeep Grand Cherokee.
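A sketch of the %%gcs cells used so far, each in its own notebook cell; the object path is the public sample file this lab reads, so swap in your own bucket to experiment:

    %%gcs list

    %%gcs view --object gs://cloud-datalab-samples/cars.csv

    %%gcs read --object gs://cloud-datalab-samples/cars.csv --variable cars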

Play around with these commands so that you get more familiar with how they work. Now I’m going to read a particular object from my own bucket, the loony Asia bucket, and store it in a variable called baby_names. If you print the baby_names variable, you’ll see the contents of our 2016 baby names text file. Switch back to the cars variable; that’s what we are going to use in the remaining portion of this code. In the next step, we’ll write this data from the CSV file out into BigQuery, complete with a schema. Use Pandas to read in this data as a CSV and store it in a data frame, df. We use Pandas in order to easily access the schema of this particular table; there’s a convenient method in the BigQuery library that lets you work with data frames.

bq.Schema.from_data, when you give it a data frame, will give us the schema for this BigQuery table. You can now use the BigQuery library to create a brand new data set; we are going to call this data set Importing Sample. The data set will be created within the same project which holds this Datalab environment. And here is the Python code that we can use to create a new table: bq.Table. Specify the name of the table qualified by the name of the data set, specify the schema, and indicate whether you want the contents of the table to be overwritten, that is, the table recreated, if it already exists. We run this bit of code and then switch over to the BigQuery web console to check whether the data set and the corresponding table have been created.
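Roughly, the cells just described look like this. The data set identifier importingSample is assumed (the spoken name “Importing Sample” cannot contain a space), and whether the cars variable holds bytes or text depends on the %%gcs magic, hence the BytesIO/StringIO comment:

    # Parse the CSV text read from Cloud Storage into a Pandas data frame
    df = pd.read_csv(BytesIO(cars))   # use io.StringIO(cars) if the magic returned str

    # Infer a BigQuery schema from the data frame's columns and dtypes
    schema = bq.Schema.from_data(df)

    # Create the data set, then the cars table with that schema,
    # overwriting (recreating) the table if it already exists
    bq.Dataset('importingSample').create()
    sample_table = bq.Table('importingSample.cars').create(schema=schema, overwrite=True)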

And there you see it: the cars table within Importing Sample. Here is the schema that BigQuery has inferred from the data frame: it’s taken the headers and figured out the data types to get the schema. If you look at the Preview tab and also the Details tab, you’ll find that this table is empty. Details says the number of rows is zero and the table size is zero bytes. That’s because we haven’t loaded the data into this table yet. The next code cell in our Datalab notebook performs this load: you take sample_table and call the load operation on it. Make sure you point to the CSV file on Cloud Storage. load takes in a bunch of other options: the source format is CSV, and you want to skip leading rows.

That is, you want to skip the header row. Refresh the BigQuery web console and you’ll find that the cars table now shows that it has four rows and 228 bytes of data. Switch back to Datalab and examine how we loaded this table: the mode in which we loaded it is append, which means that if you rerun this particular line of code, the data from the CSV file will be appended to the table. Rerun this line of code and switch over to BigQuery. This time, in BigQuery, let’s run a real query: SELECT * from the Importing Sample cars table. When you execute this, you’ll find that it has eight rows. The CSV file originally had four, but when you appended data by rerunning that bit of code, the four rows were appended once again to your BigQuery table.
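The load cell looks roughly like the following; note the append mode, which is exactly why rerunning the cell added the four CSV rows a second time:

    # Load the CSV from Cloud Storage into the table.
    # source_format identifies the file as CSV; skip_leading_rows=1 drops the header row.
    # mode='append' means every rerun appends the file's rows again.
    sample_table.load('gs://cloud-datalab-samples/cars.csv',
                      mode='append',
                      source_format='csv',
                      csv_options=bq.CSVOptions(skip_leading_rows=1))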

The BigQuery table has been loaded, and we can now query this data using the %%bq query magic. With %%bq query -n you give the query a name, Importing Sample here, and then specify the SQL that you want to run; the query is a SELECT * on the Importing Sample cars table. The query can then be executed with %%bq execute -q, referencing it by the name we gave it. And there on screen you can see the eight rows that are present in the cars table within the Importing Sample data set.
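The two query cells look roughly like this, again assuming the importingSample naming; the first cell only defines and names the query:

    %%bq query -n importingSample
    SELECT * FROM importingSample.cars

and then, in a separate cell, the named query is actually run:

    %%bq execute -q importingSample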

We can also insert into a BigQuery table using a Pandas data frame. Let’s first set up a Pandas data frame; this time we’ll read from Cloud Storage using the Python APIs, just to see the different ways in which it can be done. This involves setting up a storage bucket object pointing to the bucket, cloud-datalab-samples (that’s the name of the bucket here), and the file that we are interested in, cars2.csv. We read this as a stream into the cars2 variable and convert the stream into a data frame using Pandas’ read_csv. It’s really as straightforward as that: we read a stream from Cloud Storage and stored it in a data frame. This is what the resultant data frame looks like: it has two entries for cars, a Honda and a Tesla.
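A sketch of that read, assuming the google.datalab.storage object API (Bucket.object and read_stream); the df2 variable name is just for this sketch:

    # Read cars2.csv from the public sample bucket through the Python API
    sample_bucket = storage.Bucket('cloud-datalab-samples')
    cars2 = sample_bucket.object('cars2.csv').read_stream()

    # Convert the raw stream into a data frame
    df2 = pd.read_csv(BytesIO(cars2))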

You can clean up the data in your data frame using the helpful methods that a data frame provides. The fillna call simply replaces every NaN (not-a-number) cell with an empty value, because that’s the value we specified in place of the NaN cells. Let’s insert the contents of this data frame into our sample table: call sample_table.insert and pass in the data frame, and BigQuery will insert the rows into the table. sample_table.to_dataframe() allows you to view your table as a data frame, and that’s what is printed out to screen. Notice that our table now has ten rows: the eight rows that existed before and the two rows that we just inserted using a data frame. Switch over to the BigQuery web console, run the SELECT *, and you’ll be able to verify that this table now has ten rows.
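Those steps, roughly:

    # Replace NaN cells with an empty value so the rows insert cleanly
    df2 = df2.fillna(value='')

    # Stream the data frame's two rows into the existing BigQuery table
    sample_table.insert(df2)

    # Read the table back as a data frame to confirm it now holds ten rows
    sample_table.to_dataframe()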

In the next step we’ll write to Cloud Storage from our BigQuery table. For that, let’s first access the project ID of the current project; this you can do via the Context for this Datalab environment. We’ll create a brand new bucket in Cloud Storage. The bucket name is made up of the current project ID with a datalab-samples suffix, and within this bucket we want to export the contents of our BigQuery table into a CSV file called cars.csv, which lives in a tmp folder. Print out this information, and this is what the bucket and the path to the CSV file look like. Instantiate a new Cloud Storage bucket using storage.Bucket, then call the bucket’s create and exists methods to ensure that it has been created successfully.

As you can see, it’s not just the %%gcs command; you can also use the Python API to manipulate Cloud Storage from Datalab. It will just take you a few seconds to switch over to the web console and confirm that the bucket has been created. The bucket is currently empty; we haven’t extracted the BigQuery table contents to it yet. We’ll instantiate a new reference to this BigQuery table using bq.Table, specifying the data set and the table within it, Importing Sample cars. The table’s extract method is what allows us to export data from BigQuery; the destination is our bucket. Once this bit of code runs through, you can use the web console to check whether cars.csv is present in your newly created bucket.
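Sketching those export cells; the bucket-name suffix follows the lab’s convention and tmp/cars.csv is the object path dictated above:

    # Build a bucket name from the current project ID
    project = Context.default().project_id
    sample_bucket_name = project + '-datalab-samples'
    sample_bucket_path = 'gs://' + sample_bucket_name
    sample_bucket_object = sample_bucket_path + '/tmp/cars.csv'

    # Create the bucket and confirm it exists
    sample_bucket = storage.Bucket(sample_bucket_name)
    sample_bucket.create()
    sample_bucket.exists()

    # Export the BigQuery table to that CSV object
    table = bq.Table('importingSample.cars')
    table.extract(destination=sample_bucket_object)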

And it is. You can also confirm this within your Python notebook by using the %%gcs list command. If %%gcs list is used along with --objects and you specify the name of a bucket, it will list only the objects within that bucket. We can access and check the same bucket using the Python API: use storage.Bucket to instantiate the bucket, access the first object within it at index zero, perform a read_stream operation, and then finally print the data. There you see it: the entire contents of our BigQuery table, in the CSV file stored on Cloud Storage, which we read from within Datalab. You can also export data from a BigQuery table to a file on the local file system; you simply call the table’s to_file method and specify a full path to that file.

That file will be stored on the virtual machine instance where you have Datalab running. This demo also shows you how you can run Bash shell commands from within your IPython notebook: notice the !ls -l command; you don’t need to switch to a terminal window to run it, you can run it from within the notebook. And there is a short Python snippet to check the contents of this file. At the end of the lab, clean up after yourself: we make sure to delete the bucket and also the data set from BigQuery. Check the web console in BigQuery as well as Cloud Storage to ensure that the delete is complete. So how would you reference Cloud Storage objects from within Datalab? There are several ways to do so: you can use the %%gcs command, and you can also use the Cloud Storage Python API.
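For reference, the verification, export-to-file, and clean-up cells from this lab are sketched below, one snippet per cell; the $-variable expansion in the %%gcs magic and the exact delete calls are assumptions, so defer to the tutorial notebook if they differ:

    %%gcs list --objects $sample_bucket_path

    # The same check through the Python API: read the first (and only) object
    print(list(storage.Bucket(sample_bucket_name).objects())[0].read_stream())

    # Export the table to a file on the Datalab VM's disk and inspect it with bash
    table.to_file('/tmp/cars.csv')
    !ls -l /tmp/cars.csv

    # Clean up: delete the exported object, the bucket, and the BigQuery data set
    sample_bucket.object('tmp/cars.csv').delete()
    sample_bucket.delete()
    bq.Dataset('importingSample').delete(delete_contents=True)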

4. Lab: Using The Charting API In Datalab

At the end of this lecture, you’ll be very familiar with Google’s Charting API, and you’ll be able to make the call as to when you would use the annotation chart versus the line chart to visualize data. Google Charts is seamlessly integrated with Datalab and allows you to visualize data very quickly and very naturally from within your Python notebook. This demo is a brief introduction to working with Google Charts in Datalab. Google offers a sample tutorial for Google Charts: look in the Data folder under the tutorials directory, where you’ll see Interactive Charts with Google Charting APIs. Open up this notebook and you’ll see a bunch of charts displayed very prettily.

We don’t want to look at the execution outputs from an earlier run, so an easy way to clear all the results is to hit Clear and then Clear all Cells; this lets you start afresh with any notebook. Run the %%chart --help command to see what charts Google makes available to you. You’ll find a whole host of options listed: there are annotation charts, area charts, bubble charts, org charts, scatter charts, and so on. You can get help on an individual chart by specifying the name of the chart, in this case the table chart, along with the --help option. You can see that the table chart takes a -f option for the fields it should plot and a -d option for the data that it plots.
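In cell form (each line in its own cell), those help commands look roughly like:

    %%chart --help

    %%chart table --help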

Let’s first check out a line chart. In this example we plot data that we pull from BigQuery, so we’ll see how the Google Charting API and BigQuery work together. Select the timestamp and average latency for HTTP requests; this comes from a public data set that Google makes available for the Datalab examples, some sample logs from 2014. The query is stored under the name timeseries, as you can tell by the -n timeseries parameter at the very top of the cell. Execute this query using %%bq execute -q and specifying the name of the query, timeseries. Instead of seeing this data in boring table form, let’s draw a line chart that will let us see how the latency looked for the web requests made to this particular site.

Within this chart, the --fields argument specifies which columns we want to plot on our line chart, that is, timestamp and latency. The --data argument specifies the data set that we want to use to plot the chart: the timeseries query result. Execute this bit of code and you’ll find a pretty chart set up, with the timestamp on the x axis and the latency on the y axis. You can see from the documentation that bar charts and column charts are similar to line charts, so let’s switch our chart over to, say, bars and see how it looks. And here is a bar chart mapping the same data. It’s less intuitive, though; the line chart was clearly better for this plot. With the same timeseries data set you can play around with a bunch of different chart types; I tried the column chart, the scatter chart, and so on.
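A sketch of those cells, one per snippet; the table path for the 2014 sample HTTP logs is from memory and may differ, so take the exact SQL from the tutorial notebook, but the chart cells show the --fields/--data pattern described above:

    %%bq query -n timeseries
    SELECT timestamp, latency
    FROM `cloud-datalab-samples.httplogs.logs_20140615`
    LIMIT 1000

    %%chart line --fields timestamp,latency --data timeseries

    %%chart bars --fields timestamp,latency --data timeseries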

Let’s now see a scatter chart with a new query. A reference to this query will be stored under the name births. The query uses the public natality data set to get the gestation weeks and the weight in pounds for the babies born in a particular period, and we limit it to 1,000 records. Display this data in a scatter chart. As you can see in the code cell right in front of you, you can specify a whole bunch of attributes for your chart to make it more visually appealing and readable: this chart has a title, a height and width, and labels on the X and Y axes, babies’ birth weight versus gestation weeks. The data lends itself very well to a scatter chart; notice the clumping between 30 and 45 weeks. The Charting API also gives you the values of individual points if you hover over them.
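Roughly, again one cell per snippet; the options in the chart cell’s body are the kind of YAML-style settings the lab describes (title, size, axis labels), so treat them as an illustrative sketch:

    %%bq query -n births
    SELECT gestation_weeks, weight_pounds
    FROM `bigquery-public-data.samples.natality`
    LIMIT 1000

    %%chart scatter --fields gestation_weeks,weight_pounds --data births
    title: Birth weight vs. gestation weeks
    height: 400
    width: 800
    hAxis:
      title: Gestation weeks
    vAxis:
      title: Weight (pounds)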

We’ll now see how a pie chart works by performing some queries on GitHub data, again using a public data set. We want the repository language and a count of how much activity that particular language saw over a particular period; this comes from a public data set with the table name github_timeline. You can play around with the query as you like. I’m going to simply change the limit to 100 records and change the title of the chart to Top 100 OSS (open source) programming languages. And there you see it: JavaScript is the most popular. The same query run with the original ten records that were specified is much more readable and much more visually appealing; JavaScript is first and Java seems to be second.
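A rough reconstruction of the pie-chart cells; the query name and the exact aggregation are assumptions, and the LIMIT is the value that was switched between the 10-row and 100-row versions:

    %%bq query -n languages
    SELECT repository_language, COUNT(repository_language) AS activity
    FROM `bigquery-public-data.samples.github_timeline`
    WHERE repository_language IS NOT NULL
    GROUP BY repository_language
    ORDER BY activity DESC
    LIMIT 100

    %%chart pie --fields repository_language,activity --data languages
    title: Top 100 OSS programming languages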

You can also view this data as a column chart; it’s less fun, though. This is what the column chart looks like for the top ten programming languages. Let’s take a look at one last kind of chart, the time series chart. Here we query weather data from the GSOD sample table; this is a data set that we are familiar with. We select the maximum temperature and the day on which that temperature occurred. The name of this time series chart type is annotation, so specify annotation in your %%chart command. The fields that we want to chart are timestamp and temperature, and the data that we use is this weather data set from our BigQuery query. And here is the annotation chart.
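Assuming the weather query has been saved under the name weather with timestamp and temperature columns, as just described, the two chart cells differ only in the chart type:

    %%chart annotation --fields timestamp,temperature --data weather

    %%chart line --fields timestamp,temperature --data weather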

Now, the annotation chart is very similar to a line chart, except that it allows you to zoom in to different levels if you want to look at the data more closely; it provides built-in zoom controls that you can use. You can plot the exact same data with a line chart and it will look very much the same, but you won’t be able to zoom in to different levels. So if you want to give your chart’s viewer the ability to zoom in and manipulate the chart to focus on whatever they are interested in, use an annotation chart rather than a line chart. And this answers the question I asked at the beginning of this lecture: you would use the annotation chart when you want to give the viewer the ability to zoom in to different levels and view the chart in different ways.
