Google Professional Data Engineer – Regression in TensorFlow Part 3
July 27, 2023

6. Lab: Multiple Regression in TensorFlow

At the end of this lecture, you should have a good idea of some of the decisions that you need to make while training a machine learning model. These decisions will affect the parameters that you use during the training process. In this lecture, we’ll implement a model for multiple regression in TensorFlow. The code for this is very similar to linear regression; it’s just an extension to allow more causal variables. Linear regression can be thought of as representing a cause-effect relationship between two variables. The cause is an independent variable and the effect is a dependent variable. One cause, one effect. For example, the movement of government bond yields affects oil prices. But it is often the case that an effect depends not just on one cause, but on multiple causes.

Your independent variables can number more than one, while the effect is one dependent variable. Modeling this is a multiple regression problem. For example, the returns of Exxon Mobil depend not only on the Nasdaq share index, but also on how oil prices change. Many causes, one effect. You’ll find that the steps for multiple regression are very similar to those that we followed for linear regression. You should be able to find exact comparisons. We’ll set up our multiple regression model using a different data set, which requires a different helper function, read_xom_oil_nasdaq_data. The return value of this function is a tuple with three fields: returns for the Nasdaq, oil, and Exxon. As before, each of the returns is represented as a one-dimensional array.

The file-reading function, which reads in data for our multiple regression, is a little different. We still read in CSV files for the Nasdaq, oil, and Exxon Mobil. We access the date and the adjusted close in each of these files, sort the data frame by date, and find the percentage change, or the returns. At the end of this helper function, we return the Nasdaq data, oil data, and the Exxon data: a tuple with three fields representing returns. As before, we’ll set up a baseline implementation for multiple regression using Python’s scikit-learn library. The next step is to get the returns for the Nasdaq, oil, as well as Exxon using the helper function that we just set up. The first step is to combine our regressors, the two causal variables of our multiple regression: the Nasdaq data as well as the oil data.
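The percentage-change step inside this helper can be sketched in plain Python. The prices below are made-up values, used only to show the arithmetic that pandas’ pct_change() performs on the adjusted-close column.

```python
def pct_change(prices):
    """Convert a series of adjusted-close prices into period-over-period returns.

    returns[i] = (prices[i + 1] - prices[i]) / prices[i], which is what
    pandas' pct_change() computes (minus its leading NaN entry).
    """
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

# Hypothetical adjusted-close prices for one instrument:
prices = [100.0, 102.0, 99.96]
returns = pct_change(prices)
# First return: (102 - 100) / 100 = 0.02
# Second return: (99.96 - 102) / 102 = -0.02
```

Each series (Nasdaq, oil, Exxon) gets this same treatment before being returned as a one-dimensional array of returns.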

If you remember, when you use Python libraries for regression, you have to set up the data the way the library expects it to be, which means our x variables should be arrays of arrays. Because we are working with both the Nasdaq and oil data arrays, we achieve this by using the vstack function. vstack stacks arrays vertically to make a single array of arrays. After performing the vstack operation, we take the transpose of the result. This will give us data in the format that we require to run multiple regression using these Python libraries. Set up a linear regression model to represent this, and fit this model to our combined regressors, that is, the Nasdaq returns as well as the oil returns, with the Exxon data as our dependent variable. You can compute a score for the resultant model.
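The vstack-and-transpose step can be sketched with NumPy; the return values here are tiny made-up examples, just to show the shapes involved.

```python
import numpy as np

# Two 1-D arrays of returns (made-up values) for our two causal variables.
nasdaq_returns = np.array([0.01, -0.02, 0.03])
oil_returns = np.array([0.005, 0.01, -0.015])

# vstack stacks the arrays vertically: shape (2, 3), one row per variable.
stacked = np.vstack((nasdaq_returns, oil_returns))

# Transposing gives shape (3, 2): one row per data point, one column per
# regressor, which is the array-of-arrays layout scikit-learn expects for X.
x_data = stacked.T
```

After the transpose, each row of `x_data` holds one day’s (Nasdaq, oil) pair of returns, ready to be passed as the X argument of a regression fit.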

This is the R-squared, a measure of the goodness of fit of this model. This is a slightly different measure than we’ve been working with, so let’s ignore it for now, and we can finally print out the coefficients and intercept of our regression. The one difference here is that because this is multiple regression, you’ll have two coefficients and one intercept. Execute this bit of code, and here are the weights and biases for this model. We have two weight variables, one for the Nasdaq data and one for the oil data, and one bias variable. Let’s now set up a multiple regression model using TensorFlow. You’ll find that the code for this is very similar to our linear regression code, except that you have to take into account an additional x variable.
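A minimal baseline can be sketched with NumPy’s least-squares solver, which computes the same two coefficients and intercept that scikit-learn’s LinearRegression would. The data here is synthetic, generated from known weights so the recovery can be checked; it stands in for the real returns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic returns: xom = 0.5 * nasdaq + 0.2 * oil + 0.001 (known weights).
nasdaq = rng.normal(0.0, 0.01, size=200)
oil = rng.normal(0.0, 0.01, size=200)
xom = 0.5 * nasdaq + 0.2 * oil + 0.001

X = np.vstack((nasdaq, oil)).T                 # shape (200, 2)
X_aug = np.column_stack([X, np.ones(len(X))])  # extra column for the intercept

# Ordinary least squares: minimizes ||X_aug @ w - xom||^2.
coef_nasdaq, coef_oil, intercept = np.linalg.lstsq(X_aug, xom, rcond=None)[0]
```

Because the synthetic data is noiseless, the solver recovers the two weights and the bias essentially exactly; on real returns the fit would be approximate, with R-squared measuring the goodness of fit.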

Each of these x variables has its own weight variable that we have to calculate. So we set up the Nasdaq weight W and the oil weight W as variables, initialize them both to zeros, just like we did before, and also notice their shape. They are arrays of arrays with just one element in each dimension. Set up a variable for the bias b and initialize that to zero as well. We need two placeholders to feed in our two x variables, our returns for the Nasdaq as well as oil prices. Notice the shape of these placeholders. They’re the same as we saw before. The first dimension is None, or unspecified, because we don’t know how many data points we’ll be feeding in, and the second dimension is one. We perform two intermediate computations.

We multiply the Nasdaq x by its corresponding weight, and we do the same thing with the oil x. These are both matrix multiplication operations. We can now calculate the predicted value of y by fitting the linear equation: y is equal to Nasdaq Wx plus oil Wx plus b. We need to feed in labels for our training data set, the actual y values that correspond to the x data that we fed in. We use a placeholder to hold this data. We call it y_ (y underscore), as before, and once again its shape is None comma one. We don’t know how many data points we are going to feed in, but each data point has a dimension of one. The cost function is calculated in an identical manner: perform the reduce_mean computation on the squares of the differences between the actual and predicted values.
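The two intermediate multiplications, the linear combination, and the squared-error cost can be sketched with NumPy. The shapes mirror the (None, 1) placeholders, and all the values here are illustrative, not real returns.

```python
import numpy as np

# Three data points, each of dimension one, matching the (None, 1) placeholders.
nasdaq_x = np.array([[0.01], [0.02], [-0.01]])
oil_x = np.array([[0.005], [-0.01], [0.02]])
y_actual = np.array([[0.012], [0.008], [0.001]])

# Weights shaped (1, 1) and a scalar bias, as in the TensorFlow model.
nasdaq_W = np.array([[0.5]])
oil_W = np.array([[0.2]])
b = 0.001

# Two intermediate matrix multiplications, then the linear combination:
# y = nasdaq_x @ nasdaq_W + oil_x @ oil_W + b
y_pred = nasdaq_x @ nasdaq_W + oil_x @ oil_W + b

# Cost: mean of the squared differences between actual and predicted values,
# which is what tf.reduce_mean over tf.square(y_actual - y_pred) computes.
cost = np.mean((y_actual - y_pred) ** 2)
```

The optimizer’s job in the TensorFlow version is to adjust the two weights and the bias so that this mean squared error is minimized.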

We set up an optimizer to minimize this cost function. We use the FTRL optimizer as we did before. You are of course free to choose other optimizers, such as the gradient descent optimizer or the AdaGrad optimizer. We reshape all our NumPy arrays so that they are in the form that the model expects. If you remember, a reshape of (-1, 1) converts a one-dimensional array to an array of arrays, where each inner array has just one element. Store the length of the entire data set in a Python variable. And now we are ready to look at our function, train_with_multiple_points_per_epoch. The arguments to this function, steps, train_step, and batch_size, form the three parameters that we can change within our training model.
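The reshape step can be sketched in a couple of lines of NumPy; the returns array below is a made-up example.

```python
import numpy as np

returns = np.array([0.01, -0.02, 0.03, 0.005])

# reshape(-1, 1): the -1 tells NumPy to infer that dimension, so a 1-D array
# of length n becomes an (n, 1) array of single-element inner arrays.
reshaped = returns.reshape(-1, 1)

# Length of the entire data set, stored in an ordinary Python variable.
all_points = len(returns)
```

The (n, 1) layout matches the (None, 1) placeholder shapes, so slices of these reshaped arrays can be fed directly into the feed dictionary.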

For any machine learning model that you have to set up, these are three decisions that you need to make: the number of epochs or steps that you’ll use to train your model, what optimizer you’ll use to minimize your cost function, and finally, what size of data you’ll feed in in every epoch, the batch size for the training of the model. These are some of the properties that you can tweak to see if your result improves. A whole block of code here is exactly the same as we saw in linear regression. We initialize all the variables using the global variables initializer, and we instantiate a session. We set up a for loop which runs the training model on a batch of data multiple times.

The number of times is equal to the number of epochs, and we identify the start index and the end index of the batch of data that we are going to feed in in every epoch. In order to run the training step of our regression model, we need to feed in three pieces of information. The feed dictionary has to have the x values for the Nasdaq and oil, and our y labels for Exxon. Once we set up the feed dictionary, we can call session.run on our optimizer, and we print some data to screen every 500 iterations. This will help us see how the model converges. Go ahead and call train_with_multiple_points_per_epoch for 5,000 epochs, passing in our optimizer.
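The batch bookkeeping inside that for loop can be sketched in plain Python. The function name is an assumption for illustration, but the start/end arithmetic, with wrap-around via the modulo operator, is the standard pattern.

```python
def batch_bounds(epoch, batch_size, dataset_size):
    """Start and end indices of the batch of data to feed in a given epoch.

    The modulo makes the batches wrap around once we run past the end
    of the data set, so every epoch still gets a valid slice.
    """
    start = (epoch * batch_size) % dataset_size
    end = min(start + batch_size, dataset_size)
    return start, end

# With 10 data points and a batch size of 4, successive epochs see:
bounds = [batch_bounds(e, 4, 10) for e in range(4)]
# epochs 0..3 -> (0, 4), (4, 8), (8, 10), (2, 6)
```

Each (start, end) pair selects the slice of the reshaped x and y arrays that goes into the feed dictionary for that epoch’s run of the train step.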

That is, the FTRL train step, and use the entire data set size as our batch size. At the end of 5,000 steps you can see that our weight values W1 and W2 are very close to the baseline that we had calculated earlier, as is our bias value. Scroll back up and a quick comparison will show you that this is true. Some of the decisions that you need to make in the training stage of your machine learning model are: how many epochs will you run it for, what optimizer function will you use, and finally, what is your batch size? Tweaking all of these will give you different results, and you need to find the combination that works best for you.

7. Logistic Regression Introduced

Here is a question that I’d like you to keep in mind as we go through the contents of this video. In linear regression, we fit a linear curve, that’s a straight line, through a set of points. In logistic regression, what kind of curve do we fit through a set of points? Let’s now turn our attention to yet another family of regression models: logistic regression models. Because these are a lot less familiar than linear regression, we are going to spend a fair bit of time talking about the ideas and concepts behind logistic regression. First, linear regression seeks to quantify effects given causes. Logistic regression is very similar: given causes, it seeks to predict the probability of effects.

Linear and logistic regression are quite similar, yet also subtly different. They are similar in some of the underlying mathematics, but different in their use cases and in the fact that logistic regression can be used only for categorical y variables. When we do get to the implementation of logistic regression in TensorFlow, we shall see that there are two important differences between linear regression and logistic regression. The first has to do with the use of the softmax activation function. The second has to do with the use of cross entropy as the cost function, in preference to minimizing squared error. We’ll get to all of that eventually. Let’s first start with an analogy. Let’s say that you have a really important deadline.

One way of approaching that deadline might be to start work only five minutes before the deadline. Good luck with meeting your objectives. Another way might be to go to the other extreme and become really conservative: start a whole year before the deadline. But again, that might be overkill. The problem with both of these approaches is that each of them is suboptimal in some important respect. Let’s say that you start work a year in advance of the deadline. You are certain to meet that deadline, but you are also virtually certain to accomplish nothing else of use in that period. If you swung to the other extreme and started work just five minutes before the deadline, you have no chance, a 0% probability, of meeting that deadline.

But your probability of getting other important work done is very high. What we really want is a Goldilocks solution. We want to work neither too fast nor too hard. What we really want to do is work smart. Let’s try and quantify the intuition behind each of these approaches. Working fast would involve starting very late and basically hoping for the best. Working hard would involve starting very early and doing almost nothing else productive. Working smart, however, would involve starting as late as possible while still being virtually sure of meeting the deadline. The question that we then really want to answer is: how much in advance of the deadline should we start work so that our probability of meeting that deadline is 95%, and our probability of getting other important work done is also 95%?

If we were to plot this on a graph, what we are really looking for is a point, a sweet spot, on the graph. On the x axis, we would have the time before the deadline when we start work. On the y axis, we would have the probability of meeting that deadline. A solution like starting just five minutes before the deadline has a 0% probability of meeting that deadline: basically flush up against the x axis. On the other hand, if we started a year in advance, we would be at a 100% probability of meeting that deadline, but there would be virtually nothing else that we would get done. The sweet spot is the Working Smart solution, which corresponds to a point somewhere along this graph where our probability of hitting that deadline is 95%.

What we now want to find out is how much in advance of the deadline we should start. What we really want to know is where that Working Smart point intersects the x axis. Logistic regression exists for exactly such use cases. It helps us to plot a curve shaped like the S which you see on screen now. Then, given this S curve, we can find when we ought to start work. We simply drop a horizontal line from the 95% mark, find the point of intersection of that horizontal line with our S curve, and then drop a vertical line to find where it intersects with the x axis. Let’s say that answer turned out, hypothetically, to be eleven days. Then that is telling us that we need to start work eleven days before the deadline in order to still have a 95% probability of making it.
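The "drop a horizontal line at 95%" step amounts to inverting the logistic function. The intercept a and coefficient b below are assumed values chosen only for illustration, just as the eleven-day answer in the lecture is hypothetical.

```python
import math

def days_needed(p, a, b):
    """Invert p = 1 / (1 + e^-(a + b * x)) to find the x for a target probability p.

    Solving the logistic equation for x gives x = (ln(p / (1 - p)) - a) / b.
    """
    return (math.log(p / (1 - p)) - a) / b

# Assumed curve parameters, chosen only for illustration.
a, b = -0.5, 0.3
start_days = days_needed(0.95, a, b)  # how early to start for a 95% chance
```

Plugging the result back into the S curve recovers the 95% probability, confirming the inversion, and this is exactly the geometric construction of dropping the horizontal and then the vertical line.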

By telling us what the relationship is between our actions and the outcome, logistic regression has helped us to strike a balance between working too fast and working too hard. We are able to work smart because logistic regression provided us with this curve. And indeed that’s the whole point. Logistic regression helps find how probabilities are affected by actions. There are a few interesting things going on in this graph, so let’s analyze them. This is an S shaped curve. On the y axis, we have the probability of meeting our deadline. On the x axis, we have the time to the deadline. In this instance, there are only two possibilities. We either meet the deadline or we miss it. This is a binary outcome. The probability of that outcome is represented by this curve, which flattens out at either end.

So there is a floor of zero and a ceiling of one. Clearly, probabilities cannot be negative or greater than one, so this shape is an important one for any representation of probabilities to share. We can actually generalize most of these properties. Y is a hit-or-miss variable. It’s categorical. It takes only two values, zero or one. X, on the other hand, is a continuous variable. It represents how much before the deadline we start. It can take any value. And we can now explicitly model E(y), that’s the probability that y works out to be equal to one. Logistic regression makes use of a very specific curve equation, that is, the curve equation that you see on screen now. This is known as an S curve. This form of the equation is already assumed implicitly in logistic regression.

In much the same way, a linear form like y = a + bx is assumed by linear regression. And the similarities don’t end there. Logistic regression is the process of finding the best-fitting such S curve. As in linear regression, a is the intercept and b is the regression coefficient. The only difference is that these constants are now associated with the exponent of e, the universal constant 2.71828, which is the base of the natural logarithms. And in this way we can use this equation to relate the probability of y being one with the value of the causal variable x. In this equation, if the constants a and b are positive, we get an equation which is S-curved, sloping up.

This implies that as x increases, the probability of y being one increases as well. On the other hand, if we change this equation so that a and b are negative rather than positive, then the curve gets flipped around the y axis. In a situation like that, as the value of x increases, the probability of y being one decreases. Once again, this is a flipped S curve, and this occurs when the constants a and b are negative rather than positive. S curves are popular because they do a really good job of modeling how actions determine probabilities. Let’s say you start five minutes before a deadline versus ten minutes before a deadline. That’s going to have little impact on your probability of making that deadline.

It’s still going to be zero. So in this region, the curve is pretty flat. Likewise, let’s say you start one year in advance of the deadline versus nine months in advance. You are still certain of making it. Once again, the probability is going to be maxed out at one and quite independent of changes in how much time you put in. And that’s why, once again, the curve flattens out at the extreme. At either extreme, the curve flattens out because the probabilities are not changed a whole lot by your actions. But in the intermediate portion of the curve, the probability of the outcome changes very quickly, almost linearly. It is very sensitive to changes in the cause variables. This serves as a very realistic model of many cause-effect relationships, and that’s why logistic regression is so popular.
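The S-curve equation and its flattening at the extremes can be checked numerically. The a and b values below are arbitrary illustrations, one upward-sloping choice and one flipped, negative choice.

```python
import math

def logistic(x, a, b):
    """p(y = 1) = 1 / (1 + e^-(a + b * x)): the S curve assumed by logistic
    regression, with intercept a and regression coefficient b."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Positive b: an upward-sloping S curve. Sampled at the far left, the
# middle, and the far right, it is flat near 0, crosses 0.5, then flattens
# out near 1.
up = [logistic(x, 0.0, 1.0) for x in (-6, 0, 6)]

# Negative b flips the curve around: the probability falls as x rises.
down = [logistic(x, 0.0, -1.0) for x in (-6, 0, 6)]
```

Comparing the two samples shows the mirror symmetry: the flipped curve at the far left matches the upward curve at the far right, which is the flattening behavior the lecture describes at both extremes.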

Also, let’s be mindful of the difference between categorical and continuous variables. Continuous variables can take on any value; they are drawn from an infinite set of values: for example, height, weight, income, and so on. Categorical variables, on the other hand, can only take a finite set of values. Think enums in a programming language, or think male and female, or day of the week, or month of the year. There is a specific, discrete set of values, and the categorical variable can only assume one of those values. In logistic regression, the x variables can be either categorical or continuous, but the y variable can only be categorical. Many categorical variables are yes/no or true/false; they take just two values. Such categorical variables are called binary variables.

This bit is an important one. Logistic regression can only be used with a categorical y variable. And really, logistic regression is a way to estimate how the probabilities of categorical variables are influenced by explanatory variables. Let’s come back to the question we posed at the start of this video. Logistic regression, like linear regression, involves fitting a curve of a specified shape through a set of points. The difference is that in the case of linear regression, that curve is linear, i.e., a straight line. In the case of logistic regression, it is an S curve, or logistic curve. This is a characteristic curve in which the log-odds of the outcome is linear in the input. S curves are widely used in modeling, for instance, to predict tipping points, if you’ve heard of them or read the book of that name.