Introduction to Deep Learning for Self Driving Cars (Part — 1)

Foundational Concepts in the field of Deep Learning and Machine Learning

One of the coolest things that happened in the last decade is that Google released a framework for deep learning called TensorFlow. TensorFlow makes much of that hard work superfluous: with it, you can easily configure and train deep networks, and it can run on many machines at the same time. So, in this Medium article, we’ll focus on TensorFlow, because if you become a machine learning expert, these are the tools that people in the trade use every day.

A convolutional neural network is a specialized type of deep neural network that turns out to be particularly important for self-driving cars.

What is Deep Learning?

Deep Learning is an exciting branch of machine learning (ML) that uses data, lots of data, to teach computers how to do things that previously only humans could do. I’m personally very interested in solving the problem of perception: recognizing what’s in an image, understanding what people are saying when they talk on their phones, helping robots explore the world and interact with it. Deep Learning has emerged as a central tool for solving perception problems in recent years. It’s the state of the art in everything having to do with computer vision and speech recognition. But there is more. Increasingly, people are finding that Deep Learning is a much better tool for solving complex problems, like discovering new medicines, understanding natural language (NLP), understanding documents (OCR), and, for example, ranking them for search.

Solving Problems — Big & Small

Many companies today have made deep learning a central part of their machine learning toolkit. Facebook, Baidu, Microsoft, and Google are all using deep learning in their products and pushing the research forward. It’s easy to understand why: deep learning shines wherever there is lots of data and complex problems to solve, and all these companies are facing lots of complicated problems, like understanding what’s in an image to help you find it, or translating a document into another language that you speak.

Now, I will explore a continuum of complexity, from very simple models to very large ones, which you will still be able to train in minutes on a personal computer, applied to tasks as elaborate as predicting the meaning of words or classifying images. One of the nice things about deep learning is that it’s really a family of techniques that adapts to all sorts of data and all sorts of problems, all of them using a common infrastructure and a common language to describe things.

Supervised Classification

In this entire article, I’m going to focus on the problem of classification. Classification is the task of taking an input, like a letter, and giving it a label that says, this is a B. The typical setting is that you have a lot of examples, called the training set, that have already been sorted: this is an A, this is a B, and so on.

Now here is a completely new example, and our goal is going to be to figure out which of those classes it belongs to. There is a lot more to machine learning than just classification, but classification, or more generally prediction, is the central building block of machine learning. Once you know how to classify things, it’s very easy, for example, to learn how to detect them or to rank them.

Training a Logistic Classifier

So let’s get started by training a logistic classifier. A logistic classifier is what’s called a linear classifier. It takes the input, for example, the pixels in an image, and applies a linear function to them to generate its predictions.

A linear function is just a giant matrix multiply. It takes all the inputs as a big vector, which we’ll denote x, and multiplies them with a matrix to generate its predictions, one per output class. Throughout, we’ll denote the inputs by x, the weights by W and the bias term by b.
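As a minimal sketch of that linear function, here is the matrix multiply in NumPy, with toy shapes (4 input features, 3 classes) chosen purely for illustration:

```python
import numpy as np

# Toy linear classifier: 4 input features (e.g. flattened pixels), 3 classes.
# The names W, x, b follow the notation in the text.
np.random.seed(0)
W = np.random.randn(3, 4)   # one row of weights per output class
b = np.zeros(3)             # one bias per output class
x = np.random.randn(4)      # the input, as one big vector

scores = W @ x + b          # one score per output class
print(scores.shape)         # (3,)
```

These raw scores are what we will turn into probabilities next.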

The weights of that matrix and the bias are where the machine learning comes into play. We’re going to train that model, which means we’re going to try to find the values for the weights and bias that are good at performing those predictions. How are we going to use scores to perform the classification?

Well, let’s recap our task.

Each image that we have as an input can have one and only one possible label. So we’re going to turn those scores into probabilities. We’re going to want the probability of the correct class to be very close to 1, and the probability of every other class to be close to 0. The best way to turn scores into probabilities is to use a softmax function, which I’ll denote here by S.

What’s important to know is that it can take any kind of scores and turn them into proper probabilities. Proper probabilities sum to 1, and they will be large when the scores are large and small when the scores are comparatively smaller. Scores, in the context of logistic regression, are often also called logits.
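A small NumPy sketch of the softmax makes those properties concrete:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores (logits) into probabilities that sum to 1."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)          # the largest score gets the largest probability
print(probs.sum())    # 1.0
```

Note how the ordering of the scores is preserved: the highest logit always maps to the highest probability.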

Cross Entropy

To represent the labels, we’ll use one-hot encoding: each label becomes a vector with a 1 for the correct class and 0s everywhere else. One-hot encoding works very well for most problems, until we get into situations where we have tens of thousands, or even millions, of classes. In that case, our vector becomes really, really large, with mostly zeros everywhere, and that becomes very inefficient. What’s nice about this approach is that we can now measure how well we’re doing by simply comparing two vectors: one that comes out of our classifier and contains the probabilities of our classes, and the one-hot encoded vector that corresponds to our labels. Let’s see how we can do this in practice.

The natural way to measure the distance between those two probability vectors is called the cross entropy, which I’ll denote by D here. Be careful: the cross entropy is not symmetric, so we have to make sure that our labels and our distributions are in the right place. Our labels, because they’re one-hot encoded, will have a lot of zeros in them, and we don’t want to take the log of zero. For our distribution, the softmax will always guarantee that we have a little bit of probability going everywhere, so we never actually take the log of zero.
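The cross entropy D(S, L) = −Σᵢ Lᵢ log(Sᵢ) is easy to sketch in NumPy; the example probabilities below are made up to show how the distance behaves:

```python
import numpy as np

def cross_entropy(probs, one_hot_label):
    """D(S, L) = -sum_i L_i * log(S_i).
    probs is the softmax output S, one_hot_label is the label vector L."""
    # Because the label is one-hot, only the log of the correct class's
    # probability survives the sum; the softmax keeps it above zero.
    return -np.sum(one_hot_label * np.log(probs))

label = np.array([0.0, 1.0, 0.0])    # one-hot: the correct class is class 1
good  = np.array([0.1, 0.8, 0.1])    # confident and correct -> small distance
bad   = np.array([0.8, 0.1, 0.1])    # confident and wrong   -> large distance
print(cross_entropy(good, label))    # ~0.22
print(cross_entropy(bad, label))     # ~2.30
```

The asymmetry mentioned above is visible in the code: swapping `probs` and `one_hot_label` would put the zeros of the label inside the log.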

So let’s recap, as we have a lot of pieces already. We have an input, which is turned into logits using a linear model, which is basically our matrix multiply plus a bias. We then feed the logits, which are scores, into a softmax to turn them into probabilities. And then we compare those probabilities to the one-hot encoded labels using the cross entropy function. This entire setting is often called multinomial logistic classification.
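Stringing the pieces together, the whole multinomial logistic classification pipeline fits in a few lines; shapes and values here are toy assumptions, not a real model:

```python
import numpy as np

# End-to-end sketch: input x -> logits -> softmax -> cross entropy vs label.
np.random.seed(1)
W, b = np.random.randn(3, 4), np.zeros(3)
x = np.random.randn(4)
label = np.array([0.0, 0.0, 1.0])        # one-hot encoded label

logits = W @ x + b                       # linear model
exps = np.exp(logits - logits.max())
probs = exps / exps.sum()                # softmax
loss = -np.sum(label * np.log(probs))    # cross entropy D(S, L)
print(loss)                              # a single positive number
```

This per-example loss is exactly the quantity we will average over the training set and minimize next.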

Minimizing Cross Entropy

Okay, so now we have all the pieces of our puzzle. The question, of course, is how we’re going to find the weights w and the biases b that will get our classifier to do what we want it to do. That is, have a low distance for the correct class and a high distance for the incorrect classes.

One thing we can do is measure that distance, averaged over the entire training set, for all the inputs and all the labels that we have available. That’s called the training loss. This loss, which is the average cross entropy over our entire training set, is one humongous function. Every example in our training set gets multiplied by this one big matrix W, and then they all get added up in one big sum. We want all the distances to be small, which would mean we’re doing a good job at classifying every example in the training set, so we want the loss to be small. The loss is a function of the weights and the biases, so we are simply going to try to minimize that function. Imagine that the loss is a function of two weights, weight one and weight two, just for the sake of argument. It’s going to be a function which will be large in some areas and small in others. We’re going to try to find the weights which cause this loss to be the smallest.

We’ve just turned the machine learning problem into one of numerical optimization, and there are lots of ways to solve a numerical optimization problem. The simplest way is one we’ve probably encountered before: gradient descent. Take the derivative of the loss with respect to the parameters, follow that derivative by taking a step in the opposite direction, and repeat until we get to the bottom. Gradient descent is conceptually simple, especially when we have powerful numerical tools that compute the derivatives for us. Remember, I’m describing the derivative for a function of just two parameters here, but for a typical problem it could be a function of thousands, millions or even billions of parameters.
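Here is a minimal gradient descent loop on a made-up loss of exactly two weights, as in the picture above; the quadratic bowl is an assumption chosen so the minimum is known:

```python
import numpy as np

# Toy loss of two weights, minimized at (3, -1).
def loss(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    # Derivative of the loss with respect to each weight.
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])    # starting point
learning_rate = 0.1
for _ in range(200):
    w = w - learning_rate * gradient(w)   # step against the derivative

print(w)   # close to [3, -1], the bottom of the bowl
```

For a real network the gradient is of course not hand-written; frameworks like TensorFlow compute it automatically, but the update rule is the same.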

Measuring Performance

After we have trained our first model, there is something very important to discuss. We have a training set, as well as a validation set, and a test set. What is that all about? Don’t skip this part. It has to do with measuring how well we’re doing without accidentally shooting ourselves in the foot, and it is a lot more subtle than we might initially think. It’s also very important because, as we will discover later, once we know how to measure our performance on a problem, we’ve already solved half of it. Let me explain why measuring performance is subtle.

Let’s go back to our classification task. We’ve got a whole lot of images with labels. We could say, okay, I’m going to run my classifier on those images and see how many I get right. That’s my error measure. And then we go out and use our classifier on new images, images that we’ve never seen in the past, and we measure how many we get right, and our performance gets worse. The classifier doesn’t do as well. So what happened?

Well, imagine I construct a classifier that simply compares the new image to any of the other images that I’ve already seen in my training set, and just returns the label. By the measure we defined earlier, it’s a great classifier. It would get 100% accuracy on the training set. But as soon as it sees a new image, it’s lost. It has no idea what to do. It’s not a great classifier. The problem is that our classifier has memorized the training set, and it fails to generalize to new examples. It’s not just a theoretical problem. Every classifier that we will build will tend to try and memorize the training set. And it will usually do that very, very well.

Our job though, is to help it generalize to new data instead. So, how do we measure generalization instead of measuring how well the classifier
memorized the data?

The simplest way is to take a small subset of the training set, not use it in training, and measure the error on that test data. Problem solved, now our classifier cannot cheat because it never sees the test data, so it can’t memorize it. But there is still a problem, because training a classifier is usually a process of trial and error. We try a classifier, we measure its performance and then we try another one and we measure again. And another, and another, we tweak the model, we explore the parameters, we measure, and finally, we have what we think is the perfect classifier. And then after all this care we’ve taken to separate our test data from our training data and only measuring our performance on the test data, now we deploy our system in a real production environment.

And we get more data, and we score our performance on that new data, and it doesn’t do nearly as well. What can possibly have happened? What happened is that our classifier has seen our test data, indirectly, through our own eyes. Every time we made a decision about which classifier to use or which parameter to tune, we actually gave information to our classifier about the test set. Just a tiny bit, but it adds up. So over time, as we run many, many experiments, our test data bleeds into our training data. So what can we do?

There are many ways to deal with this; I’ll give you the simplest one. Take another chunk of the training set and hide it under a rock. Never look at it until you have made your final decision. We can use our validation set to measure our actual error, and maybe the validation set will bleed into the training set. But that’s okay, because we’ll always have this test set that we can rely on to actually measure our real performance.
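The three-way split described above can be sketched in a few lines; the sizes (80/10/10) are just a common convention, not a rule from the text:

```python
import numpy as np

# Shuffle once, then carve off a validation set and a test set
# that training never sees.
np.random.seed(42)
data = np.arange(1000)            # stand-in for 1000 labeled examples
np.random.shuffle(data)

test  = data[:100]                # hidden under a rock until the final decision
valid = data[100:200]             # used to compare models while we iterate
train = data[200:]                # the only part the classifier trains on

print(len(train), len(valid), len(test))   # 800 100 100
```

The crucial property is that the three sets are disjoint: no example used for tuning or final scoring ever appears in training.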

Stochastic Gradient Descent

The problem with scaling gradient descent is simple: we need to compute the gradients. Here’s a rule of thumb: if computing the loss takes n floating point operations, computing its gradient takes about three times that compute. And this loss function is huge; it depends on every single element in the training set. That can be a lot of compute if the data set is big, and we want to be able to train on lots of data because in practice, on real problems, we’ll always get more gains the more data we use. And because gradient descent is iterative, we have to do that for many steps. That means going through our data tens or hundreds of times. That’s not good. So instead, we’re going to cheat. Instead of computing the loss, we’re going to compute an estimate of it, a very bad estimate, a terrible estimate, in fact.

That estimate is going to be so bad, you might wonder why it works at all. And you would be right, because we’re going to also have to spend some time making it less terrible. The estimate we’re going to use is simply computing the average loss for a very small random fraction of the training data. Think between one and a thousand training samples each time. I say random because it’s very important: if the way you pick your samples isn’t random enough, it no longer works at all. So we’re going to take a very small sliver of the training data, compute the loss for that sample, compute the derivative for that sample, and pretend that that derivative is the right direction to use to do gradient descent. It is not at all the right direction, and in fact, at times, it might increase the real loss, not reduce it. But we’re going to compensate by doing this many, many times, taking very, very small steps each time. So each step is a lot cheaper to compute, but we pay a price: we have to take many more small steps instead of one large step. On balance, though, we win by a lot. In fact, doing this is vastly more efficient than doing gradient descent. This technique is called stochastic gradient descent, and it is at the core of deep learning.
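A minimal sketch of SGD, assuming a deliberately simple made-up problem (fit a scalar weight w so that w·x matches y = 2x), shows the random-sliver idea:

```python
import numpy as np

# Stochastic gradient descent on a toy problem: each step uses a small
# random minibatch instead of all 10,000 examples.
np.random.seed(0)
X = np.random.randn(10000)
Y = 2.0 * X                        # the true relationship we want to recover

w, learning_rate, batch_size = 0.0, 0.05, 32
for _ in range(500):
    idx = np.random.randint(0, len(X), size=batch_size)  # random sliver
    xb, yb = X[idx], Y[idx]
    # Gradient of the mean squared error, on the minibatch only.
    grad = np.mean(2.0 * (w * xb - yb) * xb)
    w -= learning_rate * grad      # small, cheap, noisy step

print(w)   # close to 2.0
```

Each individual step uses a noisy, sometimes wrong gradient, yet the many small steps still drive w toward the right answer, which is the whole bet behind SGD.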

That’s because stochastic gradient descent scales well with both data size and model size, and we want both big data and big models. Stochastic gradient descent, SGD for short, is nice and scalable. But because it’s fundamentally a pretty bad optimizer that happens to be the only one that’s fast enough, it comes with a lot of issues in practice.

Parameter Hyperspace

Learning rate tuning can be very strange. For example, you might think that using a higher learning rate means you learn more, but that’s just not true. In fact, you can often take a model, lower the learning rate, and get to a better model faster. It gets even worse. You might be tempted to look at the curve that shows the loss over time to see how quickly you learn. Here, the higher learning rate starts faster, but then it plateaus, while the lower learning rate keeps going and gets better. It’s a very familiar picture for anyone who has trained neural networks. Never trust how quickly you learn; it often has little to do with how well you train.

This is where SGD gets its reputation for being Black Magic.

There are many, many hyperparameters that we can play with: initialization parameters, learning rate parameters, decay, momentum. And we have to get them right. In practice, it’s not that bad, but if you have to remember just one thing, it’s that when things don’t work, always try lowering your learning rate first. There are lots of good solutions for small models, but sadly, none that’s completely satisfactory, so far, for the very large models that we really care about.
