Google Professional Data Engineer – TensorFlow and Machine Learning Part 11
August 6, 2023

24. Individual Neuron

Here is a question that I’d like us to consider as we learn the contents of the video: neural networks consist of complex building blocks which have been interconnected in simple ways. Is this statement true or false? Let’s now really home in on the role that neurons play in the learning process. The nodes in the computation graph are really just neurons. Each individual neuron performs only very simple operations on our data. These neurons, however, are connected in very complex and sophisticated ways. This is an important point to keep in mind. Each individual neuron is really simple in how it works. What makes the neural network as a whole complex is the interconnections between those neurons. And those interconnections vary a great deal based on our application.

For instance, convolutional neural networks are a type of wiring up of neurons which tend to be used for image processing applications. On the other hand, if we were going with a text processing or a natural language processing application we would more likely go with a different type of neural network called a recurrent neural network. And the big difference here would be in the manner in which the feature data is represented and in which the neurons are connected in the corresponding neural network. Indeed, when actually using neural networks in real life applications the challenge is architecting the neural network, making decisions about the different layers and how they will be connected to each other.

The typical unit of abstraction is a layer, because a layer consists of a group of neurons which perform similar functions. All of the complexity in neural networks is in the interconnections. The individual neurons really only apply two simple functions to their inputs. The first of these is known as an affine transformation. This is simply a linear transformation, very much like linear regression. The second of these is an activation function. This helps to model nonlinear functions like XOR or logistic regression and so on. Let’s now home in on a single neuron and understand its operation. As with any neural network, there are going to be a bunch of inputs into this neuron.

Let’s call them x1 through xn. Those inputs are multiplied by weights w1 through wn, and then a bias b is added. This transformation is known as an affine transformation because we are taking an input vector x, multiplying it by a weight vector w and adding a bias b, and this affine output is what gets passed to the activation function. That’s the second step in the operation of a neuron. An example of a common activation function is something known as the ReLU, which stands for rectified linear unit. The ReLU returns the maximum of its input and zero. So if you pass in the output of the affine transformation, that is wx + b, the output of a ReLU activation function will simply be max(wx + b, 0).
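To make those two steps concrete, here is a minimal sketch of a single neuron in plain Python with NumPy. The specific values of x, w and b are made up purely for illustration; in a real network they would come from your feature vector and from training.

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])    # inputs x1..xn (a feature vector)
w = np.array([0.8, 0.1, -0.4])    # weights w1..wn (learned during training)
b = 0.2                           # bias (also learned during training)

# Step 1: the affine transformation, a weighted sum plus a bias.
affine = np.dot(w, x) + b

# Step 2: the ReLU activation, the maximum of the affine output and zero.
output = max(affine, 0.0)

print(affine, output)
```

Everything a single neuron does is captured by those two lines of arithmetic; the sophistication comes from wiring large numbers of such neurons together.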

Now, the exact transformation that we apply in the activation function might change, and the exact values of the weights and the biases might change. But at heart, each neuron is only going to carry out these two functions, one after the other. Let’s understand this in a little more detail. Focus on the inputs. Here we have a lot of x values, x1 through xn. Together these potentially represent a feature vector: the features of whatever problem instance we are seeking to work with. The output from the activation function is the output of the neuron as a whole. There are different activation functions which are used in different use cases. The affine transformation and the activation function collectively define the neuron.

So training a neural network refers to the process of nailing down the values of the parameters in each of these functions. For the affine transformation, the parameters are w and b. So this is just a weighted sum with a bias added. Those weights let the neuron adjust the amount of weightage given to different features in the feature vector; the values w1 through wn accomplish this. The value b just gets added on to the output. It’s independent of any of the input features, and so it’s called the bias. And once again, we should be absolutely clear on where the values of w and b come from. The answer is that they are determined during the training process. This is true of all implementations of neural networks.

Specifically, in TensorFlow, w and b are variables. Note this again: w and b are variables, not placeholders and not constants. And the actual values of w and b are determined by TensorFlow during the training process. We shall see how that happens in more detail when we actually implement linear regression. The point about w and b being variables is an important one, because these are going to change during the training process internally, within the graph. The objective of the training process is going to be to find the best values of w and b for each neuron. And how are those best values going to be found? Well, you as the programmer are going to have to specify a cost function, an optimizer and a number of steps.
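As a quick, hedged sketch of that distinction (using the current tf.Variable API; the course itself works with the older 1.x graph-and-placeholder style, but the idea is the same):

```python
import tensorflow as tf

# w and b are variables: TensorFlow is allowed to change them during training.
w = tf.Variable(0.0, name="w")
b = tf.Variable(0.0, name="b")

# A constant, by contrast, never changes once defined.
c = tf.constant(3.0)

# In the 1.x API, a placeholder is just a slot for the training data we feed in;
# it is neither trainable nor fixed inside the graph.
```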

Besides the cost function, the optimizer and the number of steps, you also have to point TensorFlow to your corpus of training data. Then TensorFlow will take care of a good deal of the plumbing for you. It will run the optimizer in order to minimize the cost function, it will do so with different subsets of the corpus of training data, and it will execute as many optimization steps as you allow it to. Each of these steps will cause the values of the variables w and b to change, and at the end of it all those values will represent the best possible ones. This entire process is the training process. The training process might be simpler or more complex depending on the configuration of our neural network, because neural networks can have different configurations and interconnections, and some of these can get very sophisticated.

The training process can also take a long time, but the basic idea is the same: TensorFlow is going to run an optimizer (usually specified by us) to minimize a cost function (also specified by us) over a corpus of training data (specified by us), and do so for a number of iterations that, once again, we specify. Now, clearly, during the training process there has got to be a feedback loop. The error at the output of the deeper layers in the neural network needs to be fed back to the earlier layers in order to find the values of w and b in those earlier layers which work best. This process is called backpropagation, and it’s pretty much the norm, the standard way of training neural networks.
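Here is a minimal, illustrative training loop that ties those pieces together, written with the eager-style TensorFlow 2 API rather than the 1.x sessions the course describes. The data, learning rate and step count are all made up for the example; the point is that we supply the cost function, the optimizer, the data and the number of steps, and TensorFlow adjusts the variables w and b by feeding gradients of the cost back to them.

```python
import tensorflow as tf

# Made-up training data lying roughly on the line y = 3x + 1.
xs = tf.constant([0.0, 1.0, 2.0, 3.0, 4.0])
ys = tf.constant([1.1, 3.9, 7.2, 9.8, 13.1])

w = tf.Variable(0.0)
b = tf.Variable(0.0)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.05)   # the optimizer we specify
num_steps = 500                                           # the number of steps we specify

for step in range(num_steps):
    with tf.GradientTape() as tape:
        y_pred = w * xs + b                               # the affine transformation
        cost = tf.reduce_mean(tf.square(y_pred - ys))     # the cost function we specify
    # Gradients of the cost flow back to w and b (backpropagation),
    # and the optimizer nudges both variables towards better values.
    grads = tape.gradient(cost, [w, b])
    optimizer.apply_gradients(zip(grads, [w, b]))

print(w.numpy(), b.numpy())   # should end up close to 3 and 1
```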

By the end of this training process, the values of the weights and the biases in every neuron in our neural network will have converged to the best values. And in this way the training algorithm will have told our neuron which of the inputs matter and which do not. That’s because a high value on a particular weight wi implies that our neuron is giving a high amount of weightage to that particular input xi. On the other hand, if one of the w’s has a value of zero, that just means that the neuron is completely ignoring the corresponding x input. The bias is added on exogenously to the product of w and x, so it acts as a correction.

Now, if we specify the cost function for the affine transformation to be the sum of the squared errors, that is, the residuals, we basically just get linear regression. This shows that it’s entirely possible for an affine transformation to learn a linear function. But in order for this neuron to learn a nonlinear function, in other words, for the output of the neuron as a whole to be nonlinear, we’ve got to generalize it. Thankfully, this generalization is pretty easy, and this is where the activation function comes into play. There are a bunch of common activation functions which have been found to work well in different applications. One common one is the ReLU.

This is the max of zero and its input. ReLU, which stands for rectified linear unit, is extremely commonly used as an activation function. Later, when we are talking about implementing logistic regression, we shall consider another common activation function called softmax. As the designer of the neural network, you get to pick what activation functions you would like your neurons to have, but all of them have the same basic idea. The activation function is needed for the neural network to model nonlinear functions, and this is accomplished by chaining the output of the affine transformation into the input of the activation function.
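Both of these activation functions are available as ready-made TensorFlow operations; a quick, illustrative comparison (the input values here are arbitrary):

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 3.0])   # pretend outputs of an affine transformation

relu_out = tf.nn.relu(z)            # element-wise max(z, 0) -> [0., 0., 3.]
softmax_out = tf.nn.softmax(z)      # rescales the values into probabilities that sum to 1

print(relu_out.numpy(), softmax_out.numpy())
```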

Let’s return to the question we posed at the start of this video. This statement is false. Hopefully, we’ve learned enough to see this for ourselves. Neural networks consist of building blocks called neurons. Each neuron is actually pretty simple: it just carries out an affine transformation followed by an activation function. Each of these neurons is individually pretty simple, but it’s the interconnections between large numbers of neurons which give neural networks the ability to reverse engineer even complicated functions. And so this statement is false. It gets it exactly backwards: the individual units are simple; it’s the interconnections which are complex.

25. Learning Regression

Here is a question that I’d like us to ponder over as we go through the contents of this video: learning, or reverse engineering, linear regression requires just one neuron and just an affine transformation. In other words, an affine (linear) transformation is all that’s required, and only one neuron is needed to learn or reverse engineer linear regression. True or false? Let’s now jump into a simple example. Let’s use linear regression as the function that needs to be learned by our neural network. We’ve discussed already that neural networks can learn arbitrarily complex functions. They can really reverse engineer any piece of code at all.

All that you need to do is to add enough layers to your neural network and wire them up correctly. Of course, this is sometimes easier said than done, but there is little ambiguity about learning linear regression. Linear regression is a straight-line relationship of the form y = wx + b, and this pretty neatly fits in with the affine transformation that is performed by a single neuron. So we pass a set of points into the neuron, and the output, once the neuron has been trained, is the regression line in the form of those constants w and b. Let’s now double-click on the neuron in this process. So the inputs are still all of the x values, exactly as before.

But now these are passed into an affine transformation unit inside a neuron of the sort we just looked at. Here, because the output of the affine transformation itself exactly mirrors the regression equation, we don’t even need an activation function. As long as the values of w and b are indeed the best values, that is all that is required: that is the best-fitting line for this set of points. This point is worth repeating: because the function that our neuron needs to learn is a linear one, we don’t need an activation function. The affine transformation alone will suffice.
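One way to express “a single neuron performing only an affine transformation” in code is a Keras Dense layer with one unit and no activation. This is only a hedged sketch with made-up data, not the implementation the course builds later, but it shows a lone neuron recovering the slope and intercept of a line:

```python
import numpy as np
import tensorflow as tf

# Made-up points scattered around the line y = 2x + 5.
xs = np.arange(0.0, 10.0, 0.5).reshape(-1, 1)
ys = 2.0 * xs + 5.0 + np.random.normal(scale=0.3, size=xs.shape)

# A single neuron: one Dense unit with no activation, so its output is just w*x + b.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation=None)])
model.compile(optimizer="sgd", loss="mse")   # least-squares cost, gradient-descent optimizer
model.fit(xs, ys, epochs=1000, verbose=0)

w, b = model.layers[0].get_weights()
print(w[0, 0], b[0])                         # should land near 2 and 5
```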

Let’s very quickly go over once again how this affine transformation works. It’s just a weighted sum of the inputs with a bias added. The inputs are all of the x variables. These are the causal or explanatory variables which need to be linked to, or used to predict, the output. The constant w represents the weights. These measure how much weightage each of the inputs should be given by the neuron, and the constant b is the bias, which adjusts the output of that weighted sum of w and x either up or down. We’ve already answered the question of where the weights w and the bias b come from: they are determined during the training process. Remember also that in TensorFlow these are variables.

So the input values, the x values, are placeholders, but the weights and the bias are variables which will be determined for us during the training process. That training process is managed by TensorFlow, but we do need to specify an optimizer, a cost function, the corpus of training data (that is, the input values) and the number of steps. We’ll have more to say on the mechanics of the training process and on the optimization in just a bit. We are going to actually implement linear and logistic regression, and for that we need to understand what cost function we are going to go with.

Let’s just go with simple regression, where we have one x variable. So we want to represent the values of the y variable in terms of a constant a (the intercept) and the slope b. Hopefully, in a perfect world, one set of constants a and b would be enough to perfectly fit all of the y values given the x values. But in reality we have n equations and just two unknowns, so perfect equality will not be possible. We will have to live with a bunch of error terms, or residuals. Let’s call them e1 through en. This equation can now be represented in simple matrix form. We have a vector of the y values on the left-hand side.

This is equal to the constant a multiplied by a vector of ones, plus the constant b multiplied by a vector of the x values, and then we’ve got to add in a vector of all of the residuals. This is the matrix notation behind simple linear regression. The best values of a and b are those which minimize the sum of the squares of those residuals. This method is known as minimizing the least square error. Here the error that corresponds to any one point (xi, yi) is obtained by dropping a vertical line from that point onto our regression line. The point of intersection will have the same x value, since the line we drop is vertical, but it will have a different y value. And this difference between the actual and the fitted y values constitutes the error, or the residual.
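Written out in symbols, the matrix form just described, and the least-squares objective behind it, look like this (using a for the intercept and b for the slope, as in the narration):

```latex
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= a \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
+ b \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
+ \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix},
\qquad
\min_{a,\,b} \sum_{i=1}^{n} e_i^2
= \min_{a,\,b} \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2
```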

The terms “error” and “residual” are subtly different in a mathematical sense, but they can often just be used synonymously. The objective of linear regression is to minimize the sum of the squares of all of these errors, which implies finding the best values of the constants a and b. There is an infinite number of candidate lines. Let’s just, for the sake of example, consider two. Let’s call these line one and line two, with parameters a1 and b1, and a2 and b2, respectively. Coming back to a single neuron, the training process needs some way, some mathematical procedure, for choosing the best such line.

If there were only these two lines, it needs to know mathematically why it should pick line one rather than line two. And the answer is an optimization to minimize the least square error. So the optimizer will drop vertical lines from each point to each of the candidate lines. Then, for each candidate line, we calculate the sum of the squares of the lengths of these dotted lines. These dotted lines are, of course, the errors or the residuals. So what the optimizer is seeking to do is to minimize these errors, as measured by the sums of the squares of their lengths, and to find, of all candidate lines, the one with the lowest sum of squared errors.

This is the best fit, or the regression line. Now, there are many cookie-cutter methods specifically designed for solving this regression problem. These include the method of moments, the method of least squares and maximum likelihood estimation. But in reality, none of these is used quite directly by TensorFlow. When we implement linear regression in TensorFlow, we have a training process. That training process will go through an iterative set of steps, and those steps will determine the values of the weights w and the bias b. Also, just to be clear, this training process is global, as in it applies to the entire computation graph as a whole and not to an individual neuron.
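To make the comparison of line one and line two concrete, here is a small illustrative snippet (the points and the two candidate lines are invented for the example) that computes the sum of squared errors for each candidate; the optimizer is, in effect, searching for the line that makes this number as small as possible:

```python
import numpy as np

# Made-up points lying roughly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 7.2, 8.8, 11.1])

def sum_squared_errors(a, b):
    # Vertical distance from each point to the candidate line y = a + b*x, squared and summed.
    residuals = y - (a + b * x)
    return np.sum(residuals ** 2)

sse_line_1 = sum_squared_errors(a=1.0, b=2.0)   # close to the true relationship
sse_line_2 = sum_squared_errors(a=0.0, b=3.0)   # a worse candidate

print(sse_line_1, sse_line_2)   # the line with the smaller value is the better fit
```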

We will go through this process in more detail when we actually implement linear regression in TensorFlow, in multiple different ways. Let’s return to the question we posed at the start of the video. This statement is true. An affine transformation of any variable involves multiplying that variable by a constant and then adding a constant to it. So if you have a variable x, an affine transformation will be a function of the form wx + b. This is exactly what one neuron accomplishes in its affine transformation portion. We do not need multiple neurons, nor do we need any nonlinearity. So this statement on screen is true.
