**20. One-hot Notation and L1 Distance**

As we go through this video, here is a question that I’d like you to think about. We’ve discussed the one hot notation. The question is, can this one hot notation or representation be used to hold continuous data items such as heights or weights? Or is this only restricted to categorical data items? Do think over this question as we go through the contents of the video. Let’s quickly take a look at some measures which are used to calculate the distance between a pair of points. In an n dimensional hyperspace, the most common way of calculating the distance between a pair of points is the Euclidean distance.

This is the distance formula which we probably learnt in coordinate geometry in high school. The Euclidean distance is a prototype for other distance metrics. So let’s parse it. Given a pair of points x and xi, the Euclidean distance between these two points is calculated by first taking the difference between all of the coordinates of these two respective points. Notice here that each of these points could have any number of dimensions. So these are vectors that we are dealing with. We then square all of these differences, compute their sum, and finally calculate the square root of this quantity.
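The steps just described can be sketched in a few lines of plain Python. The helper name here is our own, not from the lecture’s code:

```python
import math

def euclidean_distance(x, xi):
    """Straight-line distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xi)))

# Two points in 2-D space forming a 3-4-5 right triangle.
print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```

The same function works unchanged for points with any number of dimensions, since it just zips the two coordinate lists together.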

What we get this way is the Euclidean distance, which is also the same as the straight line distance between a pair of points. For instance, in a two dimensional space, this is the as-the-crow-flies distance. This is what we would get if we were to apply the Pythagoras theorem to a pair of points. In other words, the Euclidean distance is equal to the square root of the sum of the squares of the differences between the x and the y coordinates. In our example on screen now, the Euclidean distance is the length of that line segment between the two highlighted points.

When dealing with points represented in a discrete space rather than a continuous space, the L one distance is a preferred metric. This also has various other names, such as the snake distance, the city block distance, or the Manhattan distance. In any event, the idea is simple. The Manhattan distance corresponds to the number of grid blocks, either horizontal or vertical, that would have to be traversed in order to go from one point to the other. To make this real, let’s consider an example. Let’s find the Manhattan distance between the points one comma zero and five comma four. These are the coordinates of the two highlighted points. Here, the number of horizontal blocks which will have to be traversed is equal to five minus one, which is equal to four.

This is simply the difference between the x coordinates of the two points. In the same manner, the number of vertical blocks which will have to be traversed is given by four minus zero. That’s the difference between the y coordinates of these two points. Notice that we effectively calculate the absolute value of the differences. That is because the number of steps has always got to be a positive number. So calculating the L one distances between a pair of points involves finding the differences between all of the coordinates of those points, converting all of those differences to their absolute values to get a series of positive numbers, and then going ahead and adding up all of those positive numbers.

That sum is the L one distance. In this way, we are able to add up both of these quantities and get the total number of blocks that will have to be traversed, which is equal to eight. The L one distance between the points one comma zero and five comma four is equal to eight because that is the total number of steps that we will need to take in order to go from the other. Notice that this is only the case when our steps are constrained to be either horizontal or vertical. We cannot go in a straight line between the two points as we did in the case of the Euclidean distance. Next, we are going to turn our attention to a demo.
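The worked example above can be checked with a short sketch. Again, the helper name is our own:

```python
def l1_distance(x, xi):
    """City-block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, xi))

# The lecture's example: from (1, 0) to (5, 4) takes 4 horizontal + 4 vertical blocks.
print(l1_distance([1, 0], [5, 4]))  # → 8
```

Note that the absolute value does exactly what the lecture describes: it turns every coordinate difference into a positive step count before the sum.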

This will involve actually coding up a digit recognition program in TensorFlow. We will do so using the K nearest neighbors algorithm. We will make use of the L one distance in order to measure the distances between pairs of points. And we will end this implementation by testing the accuracy on some test data. Before switching over to the code demo, I’d like to outline the approach that we are going to take. The first step in our implementation is going to consist of accessing the images in the MNIST data set. We will do so by using some TensorFlow libraries that are provided for this purpose. We will also be able to access these images in batches.

Once that’s done, we will set about actually calculating the L one distance. Here we are going to need to calculate the distance between a test digit, which is what we are going to accept as input, and all of the training digits in our data set. Once we have these distances calculated, it’s relatively easy to implement the K nearest neighbors algorithm. We will predict the label for our test data on the basis of the nearest neighbors in the training data. Before we plunge into the implementation, let’s also clearly understand the MNIST data set and its attributes, that is, the feature vectors. This is going to consist of a set of images.

Each image is represented by pixels. These are standardized images, where every image is 28 by 28, that is, a total of 784 pixels. This is going to be the case for every image in the MNIST data set. The labels associated with these images will correspond to one hot vectors. This is a new term and it’s an important one, so please pay attention. What we have here is an image which clearly represents the digit four. Now, the label of this image is represented in the form of a vector. That vector has ten elements in total. Why ten? Well, because there are ten possible output labels, the digits zero through nine. So each label is represented using a vector which has ten indices.

Now, all of the elements in that vector are going to be zero, with the exception of the one corresponding to the digit which it actually represents. So here in our ten element vector, all of the elements are going to be zero, other than the one with index four. So the index of the one element which is equal to one, that’s the hot element, corresponds to the label of our image. In exactly the same manner, an image representing the digit five would also be encoded using a ten element vector, but in this vector, it’s only the element with index equal to five which is going to be equal to one. This is the one hot notation which is used for representing the labels of these images.
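A minimal sketch of this encoding (the helper name is our own, not from the lecture’s code):

```python
def one_hot(label, num_classes=10):
    """Ten-element label vector: all zeros except a 1 at the label's index."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# The digit four: only index 4 is "hot".
print(one_hot(4))  # → [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
```

Exactly one element is ever nonzero, which is what makes the representation "one hot".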

Let’s now turn back to the question we had posed at the start of the video. The one hot representation cannot be used to hold continuous data items. Think about it. The one hot representation of a digit, for instance, consists of a vector or a list with ten possible values, out of which only one is equal to one. The one element which is nonzero is called the one hot element. All of the other elements are zero. Clearly, this is only possible when we have individual columns for each possible value of our variable. That is not at all possible if our variable is continuous in nature. Continuous variables can take on an infinite set of values, while categorical variables can only take on a finite set of values. And that’s why only categorical variables can be represented in one hot notation.

**21. Steps in the K-Nearest-Neighbors Implementation**

As we go through this lecture, here is a question that I’d like you to think about. Let’s say that we have an RGB image of 784 pixels. Does it make sense, is it ideal or optimal, to represent this as a tensor with shape 784 comma three? Is this an optimal representation of the pixels in an image that is represented in RGB format? Now that we’ve understood the one hot notation for representing the labels of these MNIST images, let’s also understand the feature vectors that we are going to go with. There are different possible ways of expressing an image using a feature vector. We will go with a representation based on the 784 pixels which comprise each image.

And do recall that each image is a gray scale image, 28 pixels in height and width. So 784 pixels in total. Each one of those pixels is a gray scale value, which means that it is represented using a single channel. That channel is going to correspond to a number between zero and one. One represents white and zero represents black. All of the numbers in between represent shades of gray. With this representation, let’s take a look at how we will create a feature vector corresponding to one image. In effect, we are going to make use of a 784 element vector. We will do so by flattening all of the pixels in an image and adding them or packing them on to one single long vector. So our feature vector is going to consist of a tensor.
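As a sketch, the flattening step might look like this in numpy (the image here is random stand-in data, not a real MNIST digit):

```python
import numpy as np

# A hypothetical 28x28 grayscale image with values in [0, 1]:
# 0 is black, 1 is white, values in between are shades of gray.
image = np.random.rand(28, 28)

# Flatten the grid into a single long 784-element feature vector.
feature_vector = image.reshape(-1)
print(feature_vector.shape)  # → (784,)
```

The flattening happens row by row, so the first 28 elements of the vector are exactly the first row of the image.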

A tensor with just one dimension. That tensor is going to have a shape of 784. Now, you should be aware that this particular representation of an image is actually a lossy one. It’s not the best representation. This is not the representation which we would use with a more advanced technique, such as a convolutional neural network. Convolutional neural networks represent the state of the art in image recognition problems like this one, and they work with feature vectors that are grids. Here, we are not using a grid. Rather, we are using a single long vector with just one dimension. A grid is a better representation because it captures information about the artifacts of the image.

It, for instance, will take into account that pixels which are at the edges or the boundaries are grouped together in a specific way. That information is lost in this flat representation. To see how, consider that the first 28 elements in our 784 element vector correspond to the first row. But now when we switch to the second row or all of the rows in between, we are losing the fact that those rows are stacked together. We are losing the fact that some of those pixels are very close together when represented in a two dimensional format, but very far away from each other when represented in a flattened one dimensional format. In any case, this flattened one dimensional tensor format is easy to use for our purposes.
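A quick illustration of this information loss, using a hypothetical helper that maps a grid position to its flattened index: two vertically adjacent pixels end up 28 positions apart in the flat vector.

```python
def flat_index(row, col, width=28):
    """Position of grid pixel (row, col) in the flattened 784-element vector."""
    return row * width + col

# (0, 0) and (1, 0) touch each other in the 2-D grid...
print(flat_index(0, 0))  # → 0
# ...but are 28 positions apart once the rows are laid end to end.
print(flat_index(1, 0))  # → 28
```

The flat representation has no way of knowing that index 0 and index 28 were neighbors in the original grid.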

If we were going to use a more advanced algorithm, and if we wanted a high level of accuracy, we would have gone with a two dimensional representation rather than this flattened one dimensional one. Instead, the next step in our algorithm is going to involve calculating the L one distance between a test digit and all of the training digits. Doing this requires jumping through a few hoops in TensorFlow, so let’s understand exactly how we accomplish this. Let’s say we have a single training digit. This is a 784 element vector, so it has just one dimension with 784 elements. We want to calculate the L one distance of this training digit from a test digit, which is also a 784 element vector.

So what we are going to do is to first compute the negative of the test digit. We’ll do so using the function `tf.negative`. This is a standard TensorFlow function. Once that’s done, we add up the two vectors. This has the effect of calculating an element wise sum of the two vectors, and because one of these is the negation of the test digit, what we’ve done, in effect, is to perform an element wise subtraction of the test digit from the training digit. Notice again that this element wise subtraction works because we negated the test digit using `tf.negative`, but we did not negate the training digit. That is exactly as it was in our data set. We will now go ahead and repeat this set of steps for every image in our training data.
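The negate-then-add trick can be sketched with a numpy stand-in for the TensorFlow ops (and toy four-element vectors instead of 784-element ones): negating one vector and then adding is just element-wise subtraction.

```python
import numpy as np

# Toy stand-ins for a training digit and a test digit.
training_digit = np.array([0.0, 0.5, 1.0, 0.2])
test_digit = np.array([0.0, 0.1, 0.9, 0.6])

# tf.negative(test_digit) followed by tf.add is equivalent to:
difference = training_digit + (-test_digit)
print(difference)
```

Only the test digit is negated; the training digit is used exactly as it appears in the data set, so the result is the training digit minus the test digit, element by element.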

The MNIST data comes in batches of 5000 training images at a time. So we will go ahead and calculate 5000 vectors, each of which corresponds to the element wise difference between the one test digit and each of the 5000 training images. At the end of this process, we are going to be left with 5000 vectors, each with 784 elements. Now, while calculating the L one distance, we really focus on the absolute value, that is, the number of steps that we need to take in order to go from one vector to the other. So we’ve got to apply the TensorFlow `tf.abs` method to convert any negative values into positive values, because, after all, we are going to add all of the distances irrespective of their sign.

So we go ahead and invoke the `tf.abs` method on each of the difference vectors which we obtained by comparing our one test digit to each of the 5000 training digits. The values highlighted in orange are the ones which flipped sign: they were negative, but while calculating the L one distance, because we need to focus on the absolute values, we consider them as positive instead. Next, we will go ahead and add up each of these vectors element wise, reducing each one to just one number per distance vector. We do so using the `tf.reduce_sum` method, which simply adds all of the elements in a vector. So, for instance, the value one is equal to the sum of all of the elements in the absolute vector which was previously calculated.
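As a sketch of this step, here is a numpy equivalent of the `tf.abs` plus `tf.reduce_sum` pipeline, using randomly generated stand-ins for the 5000 training digits and the one test digit:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((5000, 784))  # stand-in: 5000 flattened training digits
test = rng.random(784)           # stand-in: one flattened test digit

# Element-wise difference (the negate-and-add step), then tf.abs,
# then tf.reduce_sum along each 784-element row.
distances = np.sum(np.abs(train + (-test)), axis=1)
print(distances.shape)  # → (5000,)
```

Summing along axis 1 is what collapses each 784-element absolute-difference vector down to a single L one distance, leaving one distance per training digit.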

So where we previously had 5000 vectors, each with 784 elements, after applying `tf.reduce_sum` to each of these, we are going to be left with 5000 vectors, each with just one element. Each element corresponds to the distance between that training digit and the test digit. Because there are 5000 training digits, we are going to end up with 5000 distances. The next step for us is to sort all of these distances and then take into account some kind of average of the k neighbors which are nearest to our test digit. Now, the simplest case of the k nearest neighbors algorithm is where we just consider the one nearest neighbor. In that case, we don’t even really need to sort at all.

All we need to do is to find the value which has the smallest distance. That would correspond to the digit which is closest to our test digit in the training data. So let’s take stock of where we stand. We’ve calculated the L one distance between each of the 5000 training digits and the one test digit. Now all that’s left for us to do is to actually run the algorithm and take into account the k nearest neighbors to find what this test digit is. In general, for values of k greater than one, we will need to sort by distance, by L one distance, find the k nearest neighbors, and perform some kind of averaging or voting within those k nearest neighbors.

But if we use just k equal to one, all we need to do is to find the one image in our training data which is nearest to our test data. This can be accomplished with just one command: NumPy’s `np.argmin`. That is enough to find the index of the nearest image in our training data set. We can then figure out the label corresponding to this nearest neighbor and use that as our prediction for the test digit. Let’s now turn back to the question that we had posed at the start of this class. Let’s say we have an RGB image with 784 pixels. It is not optimal, not really the most efficient way, to represent this as a tensor of shape 784 comma three.
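Putting the pieces together, a minimal k equal to one sketch might look like this, with random stand-in data rather than the real MNIST images and labels:

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.random((5000, 784))           # stand-in flattened training digits
train_labels = rng.integers(0, 10, 5000)  # stand-in digit labels, 0 through 9
test = rng.random(784)                    # one stand-in test digit

# L1 distance from the test digit to every training digit.
distances = np.sum(np.abs(train - test), axis=1)

# With k = 1 there is no sorting: argmin finds the single nearest neighbor.
nearest = np.argmin(distances)
prediction = train_labels[nearest]
print(prediction)
```

The nearest neighbor’s label is used directly as the prediction; for k greater than one we would instead take the k smallest distances and vote among their labels.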

The second element of the shape, the three, represents the number of channels. That is fine, because that is indeed the number of channels in an RGB image. But the first element of the shape vector, which is 784, indicates that we have represented all of the pixels as one long one dimensional array. That causes information to be lost, for instance, information about which pixels are adjacent to each other when stacked vertically. While using convolutional neural networks, which are the most popular type of neural network for working with images, we would typically not use one long single array with all of the pixels laid out one after another. We would most likely go with a 28 by 28 array, if those are the width and the height of the image respectively.