**13. Lab: Logistic Regression**

How would you apply the softmax activation function within a neuron when setting up a logistic regression model in TensorFlow? By the end of this lecture, you should be able to answer this question confidently. In this lecture, we'll discuss how to implement logistic regression in TensorFlow. We'll start by setting up the computation graph, specify the cost function, and go all the way to a converged model. In our implementation of logistic regression with stock market data, we had gone as far as setting up a baseline. The percentage accuracy of the baseline implementation, which we built using the statsmodels API in Python, was 72.8%. We import the TensorFlow libraries and then set up the variables whose values will be determined as a result of this regression.

There are significant differences in how these variables are instantiated compared with linear regression. Let's take a look at the W variable first. Note that the shape of W is now [1, 2], and the shape of b, the bias variable, is just [2]. We've discussed why this is the case in some detail earlier. In the case of W, the first dimension, 1, comes from the one-dimensional feature vector, the returns on the S&P 500. The second dimension, 2, comes from the number of categorical values the output of this regression model can take: Google stock is up or down, based on one feature, the returns of the S&P 500.

The shape of the bias vector for the neuron that learns logistic regression is [2]: the affine transformation that feeds the softmax activation function has one bias value for each of the two output classes. The W variable is initialized to all ones using tf.ones, and the b variable is initialized with tf.zeros, as before. In the next step, we set up the placeholder to feed in the x data for our logistic regression. The x data is a single-dimensional feature vector containing the monthly returns for the S&P 500. The shape of this data is [None, 1], exactly the same as in linear regression. The feature vector is represented as an array of arrays. The first dimension is None, or unspecified, because we don't know how many S&P 500 return data points there will be.

The second dimension is 1 because there is exactly one value for every data point. Next, we set up the placeholder for the actual output values, y_. The shape of this placeholder is [None, 2]. The first dimension is None, or unspecified, because we don't know how many labels there will be in the training data we feed in. The second dimension is 2 because the categorical label can take one of two values, up or down (true or false), and this is represented in one-hot notation. If you remember our discussion of categorical variables and one-hot representation, a binary categorical variable that can only take the values true or false can be represented as a single array with two elements.

If the value is true, the first element of the array is one and the second element is zero. If the value is false, the first element is set to zero and the second element is set to one. Every record representing a categorical variable with two possible values will have two elements, and this explains why our y_ placeholder has the shape [None, 2]. Once we define the x and y_ placeholders, we are ready to set up our affine transformation. This is the multiplication of x by W, plus b. We multiply x by W using tf.matmul and assign the result to the variable y. Once we have the affine transformation set up, the next step is to set up the softmax activation function as well as our cost function in TensorFlow.
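To make the shapes concrete, here is a NumPy sketch of the same affine transformation. The return values are made up for illustration; only the shapes matter.

```python
import numpy as np

# Four hypothetical monthly S&P 500 returns: shape (4, 1), i.e. [None, 1]
x = np.array([[0.02], [-0.01], [0.03], [-0.02]])

W = np.ones((1, 2))   # shape [1, 2]: one input feature, two output classes
b = np.zeros(2)       # shape [2]: one bias per output class

logits = x @ W + b    # affine transformation x*W + b: shape (4, 2)
```

Each of the four data points produces a row of two logits, one per output class (up or down), which is exactly the shape the softmax step expects.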

The libraries combine both of these into one function, which gives you the slightly scary-looking code you see on screen right now. Let's parse exactly what's going on here. The tf.nn neural network library has a softmax function, softmax_cross_entropy_with_logits. This takes in the output training labels as well as your logits. The labels are the y_ placeholder that we feed in, and the logits are our affine transformation. The cross entropy is then calculated by performing a tf.reduce_mean operation on the result of this softmax_cross_entropy_with_logits function. If you remember our discussion of cross entropy from earlier lectures, it's a measure of the difference between two probability distributions.
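As a sanity check on the math, here is a NumPy sketch of what softmax followed by cross entropy computes. The numbers are illustrative, and this is not the actual TensorFlow implementation, which fuses the two steps for numerical stability.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability, then normalize
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(labels, logits):
    # Mean over examples of -sum(label * log(predicted probability))
    p = softmax(logits)
    return -np.mean(np.sum(labels * np.log(p), axis=1))

labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot: true, then false
logits = np.array([[2.0, 0.0], [0.0, 3.0]])   # confident, correct logits
```

Because the logits here strongly favor the correct class in each row, the resulting cross entropy is small; wrong or uncertain logits would drive it up, which is exactly why minimizing it trains the model.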

Our objective is to choose the parameters of our logistic regression such that we have a minimum value for the cross entropy. A low cross entropy implies that the probability distributions of our predicted values and our actual values are very similar: the lower the cross entropy, the better the fit of our logistic regression. Just as in the case of linear regression, we need an optimizer to minimize this cross-entropy function. We use the gradient descent optimizer with a learning rate of 0.5. This optimization forms the training step of our logistic regression. The most interesting part of these two steps is how we use the softmax activation function within our neuron together with the cross-entropy cost function.

Without worrying about the mathematical implementation of these two, it's really important to understand the intuition for why they exist and the roles they play; using the TensorFlow library abstracts away all the grungy mathematical details. At this point, we've completed the setup of the computation graph for logistic regression, and we are ready to feed in data to train the model. The x data for our model has to be fed in as a two-dimensional array. The first dimension represents the number of points in the data set, and the second dimension is the value of each point, a single number enclosed within a nested array.

The most important thing to notice here is that we are only interested in the value of the returns. If you remember, we had passed in an entire column of ones for the intercept; that is no longer needed, and we are only interested in column zero. We use the NumPy expand_dims method to convert our one-dimensional array to a two-dimensional array, where the second dimension has just one element within every sub-array. Next, we set up the y labels for our training data set in the format the regression expects. The result of this logistic regression can be one of two categorical values: Google stock was up or down. These values are represented by true or false, and the true/false values in turn can be represented in one-hot notation.
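For illustration, the reshaping step might look like this (the return values here are made up):

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03])   # 1-D array of returns, shape (3,)
x_data = np.expand_dims(returns, axis=1)  # 2-D array, shape (3, 1)
# x_data is now [[0.02], [-0.01], [0.03]]
```

The values are untouched; expand_dims only wraps each scalar in its own sub-array so the data matches the [None, 1] shape of the x placeholder.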

This is why we iterate through every element of our y data: if the element is true, we represent it as [1, 0], and if it's false, we use [0, 1], in one-hot notation. As we've seen before, [1, 0] represents true and [0, 1] represents false. The next few lines of code set up and run the regression, and this is very similar to both the linear and multiple regressions we've seen before. dataset_size is a Python variable that holds the length of our entire data set. We then set up a function called train_with_multiple_points_per_epoch.
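A minimal sketch of that label conversion, assuming the labels arrive as a list of booleans:

```python
# Convert boolean labels to one-hot: True -> [1, 0], False -> [0, 1]
y_data = [True, False, True]
labels_one_hot = [[1, 0] if label else [0, 1] for label in y_data]
# labels_one_hot == [[1, 0], [0, 1], [1, 0]]
```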

Abstracting the training of your regression model into this function allows you to tweak the number of epochs, the optimizer, and the batch size, so you can quickly change the properties of your training and find what works best for you. We set up a session object, initialize all the variables in the program, and then set up a for loop that trains the model in every epoch. There is some basic arithmetic to work out which points have to be fed in for a particular batch. We set up the x values of the batch and the corresponding y labels as part of the feed dictionary, which we use to call session.run on our training step.
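The batch-selection arithmetic might look like this sketch; the variable names and the wrap-around scheme are assumptions for illustration, not the lecture's exact code:

```python
dataset_size = 103   # e.g. number of monthly return data points
batch_size = 50

for epoch in range(3):  # a few epochs, for illustration
    # Start of this epoch's batch, wrapping around the data set
    start = (epoch * batch_size) % dataset_size
    end = min(start + batch_size, dataset_size)
    batch_indices = list(range(start, end))
    # batch_indices would be used to slice both x_data and the one-hot labels
```

Epoch 0 covers points 0-49, epoch 1 covers 50-99, and epoch 2 picks up the 3 leftover points before the indices wrap around again.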

Then, to watch the model converge, you can print some values to the screen: W, b, and the cross entropy. At this point, we have a fully trained logistic regression model. All that is left is to see how well this model performs: whether the predicted values match the actual values, and what the percentage accuracy is. Calculating this is a little tricky, and it involves the use of a function, tf.argmax. tf.argmax helps us identify whether the prediction was that Google stock is up or down, and whether the actual value was up or down. This is the general structure of the tf.argmax function: it finds the index of the largest element in a tensor. A tensor is, after all, a multidimensional array.

It takes two arguments: the first is the tensor itself, and the second is the dimension along which we want to check for the largest element. Let's consider an example of a tensor T with two dimensions. The values present in dimension zero are not important here; take a look at the values present in dimension one. When we run tf.argmax on this tensor and specify dimension one, we are looking for the index at which the largest value along dimension one is present: index three in our example. Notice that the second argument is one, which references the dimension along which we look for the largest value. The return value of tf.argmax is the index of that element.
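NumPy's argmax behaves the same way, so here is a quick sketch with made-up values:

```python
import numpy as np

T = np.array([[0.1, 0.4, 0.2, 0.9, 0.3],
              [0.8, 0.1, 0.5, 0.2, 0.6]])

# Index of the largest value along dimension 1, computed for each row
idx = np.argmax(T, axis=1)
# idx == [3, 0]: row 0 peaks at index 3, row 1 at index 0
```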

The return value will be three. In our TensorFlow program for logistic regression, we invoke tf.argmax twice: once on the actual labels that we passed in, y_, and once on the predicted values. Let's look at the actual labels first. Earlier in this lecture, we spoke about the labels being represented in one-hot format, which means a value of true is represented as [1, 0], and a value of false as [0, 1]. Running tf.argmax on y_ will therefore give us zero if the label was true, that is, Google stock is up, and one if the label was false, that is, Google stock is down. The result of argmax is the index of the one-hot element.

A list of zeros and ones, as you can see on the right side of the screen, is the result of applying tf.argmax to our actual labels, y_. The second invocation of tf.argmax is on our predicted labels. These are the outputs of our logistic regression after passing through the softmax activation function. The output of a logistic regression that uses softmax as its activation function is a set of probabilities: a probability for each categorical value. Consider the example on screen: in the very first row, the probability that the output label is true is 0.7, which means the probability that it's false is 0.3. The probabilities in every output row should sum to one.

This is different from the one-hot representation of our actual values: here, the probability of each output categorical value can be any number from zero to one. We have just two possible output values, Google stock was up or down. The 50% rule generally applies in binary classification problems of this kind. In our example of classifying whales as fish or mammals, if the probability of a whale being a fish is less than 50%, the whale is classified as a mammal. The same 50% rule can be applied to our binary logistic regression problem as well. Say the probability of Google stock being up, the probability of true, is 0.7; that is above 50%, so the output of argmax will be index zero. If the probability of false, Google stock was down, is 0.56, that's above 50%, so

the output of argmax will be index one. Let's get back to our code, where we invoke tf.argmax twice: once on the actual labels in one-hot notation and once on the predicted labels, which hold probabilities for each output value. Both of these are tensors, of actual and predicted labels. Applying tf.argmax to each of these tensors gives us a list of zeros and ones. These two lists can be compared element by element using tf.equal, and the result of tf.equal is a list of true and false values: true if the prediction was correct, false if the prediction was wrong. We get true if the indices of the actual label and the predicted value matched, false if they did not.

This list of true and false values is stored in the variable correct_prediction. With this list, we do a clever calculation to find the average: we perform a tf.cast from boolean to float32, so that every true is cast to a one and every false to a zero. Once we have a list of zeros and ones, we find its average by calling tf.reduce_mean, which gives us the percentage accuracy. This accuracy can then be computed using session.run, just like any other computation node. Let's go ahead and run this bit of code. We'll train with multiple points per epoch: the number of epochs is 20,000, the training step is our gradient descent optimizer, and the batch size is 50.
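The whole accuracy computation can be sketched in NumPy, with illustrative labels and probabilities:

```python
import numpy as np

# Actual labels in one-hot notation: true, false, true, true
actual = np.array([[1, 0], [0, 1], [1, 0], [1, 0]])

# Predicted probabilities from softmax (each row sums to 1)
predicted = np.array([[0.7, 0.3],    # predicts true  (correct)
                      [0.44, 0.56],  # predicts false (correct)
                      [0.2, 0.8],    # predicts false (wrong)
                      [0.9, 0.1]])   # predicts true  (correct)

# argmax turns both tensors into lists of class indices, which we compare
correct = np.argmax(actual, axis=1) == np.argmax(predicted, axis=1)

# Cast the booleans to floats and average them to get the accuracy
accuracy = correct.astype(np.float32).mean()
# accuracy == 0.75: three of the four predictions matched
```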

And this gives us an accuracy of about 72%, very close to, but a little less than, the accuracy we got from the logistic regression solver in the statsmodels API. We'll train our model once again, but this time with a batch size equal to the entire data set: we include the entire data set in each batch. When you run the model in this manner, you'll find that the accuracy matches our baseline implementation: 72.8%. To recap: the tf.nn library in TensorFlow contains a special method called softmax_cross_entropy_with_logits, which combines the cross entropy and the softmax activation function within a neuron for logistic regression. Calculating the cross entropy as the cost function and applying the softmax activation are both performed within the same function.

**14. Estimators**

Here’s a question, a pretty easy one, that I’d like you to think about as we go through the contents of this video: why are estimators called a high-level way to perform tasks like linear or logistic regression in TensorFlow? Estimators are a sort of convenience. They are cookie-cutter mathematical models made available to us by TensorFlow: APIs for standard problems, such as linear regression, linear classification, and so on. These are high-level APIs that allow us to outsource all of the complexity of gradient descent to the estimator. Estimators really come into their own when we are working with very complicated neural networks with large feature vectors.

Those feature vectors might contain hundreds of attributes. Estimators are a quick and easy way to infer relationships between different attributes inside the feature vector. Estimators can also be extended by plugging custom models into a base class. From a software engineering perspective, this is an interesting extension mechanism because it relies on composition rather than inheritance; if you’re familiar with the strategy design pattern, that is exactly how estimators can be extended. We’ve now cycled through implementations of both linear and logistic regression in TensorFlow, so we are quite familiar with all of the wiring that goes into setting up these models.

We need to set up the computation graph, specify the cost function, instantiate a gradient descent optimizer, make decisions about the batch size and the number of epochs, and finally set up a for loop that iteratively invokes the optimizer. This is the basic outline we need to adopt for both linear and logistic regression, and it’s quite clear that this is a lot of wiring for operations as common as these two. This is where estimators come in handy: they are a high-level way of performing these same algorithms with just a few lines of code. So here’s the basic idea of how all estimators work. We need to instantiate an estimator object, and we need to specify a list of all possible features, which may form a very large, possibly very complex feature vector.

We then need to create an input function that tells our estimator object how to get the data it cares about from that feature vector. This input function also specifies the number of epochs and the batch size. And that, basically, is all we need to do. The estimator object takes on the responsibility of instantiating the optimizer, fetching data from the feature vector, carrying out whatever numerical optimization is required to minimize the cost function, and giving us back a nicely trained model, which we can then use to evaluate new test data. This is particularly useful if you’re dealing with a very complex neural network.

Notice that in a neural network, all of the data that flows between layers is, in a sense, a set of elements of a large feature vector. One of the drawbacks of neural-network-based solutions is that we don’t have much visibility into the relationships between those features, particularly those that lie in hidden layers. Estimators help provide a solution. We can very quickly look for simple relationships, such as those implied by linear or logistic regression, between intermediate features inside our neural network, and we can do so without having to know anything at all about the computation graph, the cost function, or the optimizer.

All we need to do is make decisions about the number of epochs and the batch size; the rest is handled for us by the estimator. Hopefully, we can now answer the question we posed at the start of this video. Estimators are a high-level way to perform linear or logistic regression, or similar operations, in TensorFlow because they abstract away many of the nitty-gritty details of doing so yourself. You don’t have to choose which gradient descent optimizer to use. You don’t have to decide between mini-batch and stochastic gradient descent. All of these decisions are handled for you. That’s why estimators are a high-level API.