Yoong Kang Lim

fast.ai v3 Lesson 2: Data cleaning, production, SGD from scratch

I’m doing some 5-minute summaries of fast.ai Practical Deep Learning for Coders lessons.

This is lesson 2.

Click here for the rest of the summaries.


How to build your image dataset

The lesson shows you a way to grab image URLs from Google Image Search using JavaScript that you run in your browser's developer console. The output is a text file with one URL per line, which you can then use to download each image programmatically. The fastai library gives you a convenience function to do exactly this.
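To make the "download from a list of URLs" step concrete, here is a minimal sketch using only the Python standard library. It is not the fastai function, just an illustration of the idea; the names read_urls and download_all, the .jpg extension, and the file naming scheme are all my own assumptions.

```python
from pathlib import Path
from urllib.request import urlopen


def read_urls(url_file):
    """Return the non-empty lines of url_file, one URL per line."""
    text = Path(url_file).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def download_all(urls, dest_dir, timeout=10):
    """Download each URL into dest_dir, skipping any that fail."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        # Assumption: every image is saved as .jpg; real URLs may be other formats.
        target = dest / f"{i:05d}.jpg"
        try:
            with urlopen(url, timeout=timeout) as response:
                target.write_bytes(response.read())
            saved.append(target)
        except (OSError, ValueError):
            # Broken link, timeout, or malformed URL: skip it and carry on.
            continue
    return saved
```

You would call it with something like download_all(read_urls("urls.txt"), "data/images"), where both paths are placeholders. Expect some links to fail, which is why errors are skipped rather than raised.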

Deploying models to production

Basically, for most applications you can deploy a model to standard infrastructure using any Python web framework. The lesson specifically mentions Tom Christie’s Starlette, which has the benefit of being built on top of ASGI, allowing you to use async/await.

(On a side note, I did exactly this, but with Django. Check out my project here: Toxic Comment Classification)

SGD from scratch

This is really old hat if you’ve done any machine learning, but for the sake of completeness, here’s the summary:

SGD stands for “stochastic gradient descent”. The lesson shows you how to do this from scratch using the simplest model possible, a straight line.

The idea is that the measure of how well a model fits the data can be expressed using a “loss function”. A common loss function is mean squared error, or MSE. It’s basically this:

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

The arguments to that function are vectors (NumPy arrays work, and so do PyTorch tensors, which is what the lesson's notebook uses).

The name actually tells you exactly what the function means. The error refers to y_hat - y. Squaring it means the order doesn’t matter: we could just as well have defined the error as y - y_hat. Squaring also stops positive and negative errors from cancelling each other out. And then we simply take the mean.
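Here is a small worked example of that function, assuming NumPy arrays as inputs. The numbers are made up purely for illustration.

```python
import numpy as np


def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()


y_hat = np.array([1.0, 2.0, 3.0])  # predictions (made-up values)
y = np.array([1.0, 2.0, 5.0])      # targets (made-up values)

# Errors are [0, 0, -2], squared errors are [0, 0, 4], so the mean is 4 / 3.
loss = mse(y_hat, y)

# Swapping the arguments flips the sign of each error, but not its square.
loss_swapped = mse(y, y_hat)
```

Both loss and loss_swapped come out to 4/3, which is the "order doesn't matter" point from above.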

The loss function gives you a “score”, where the lower your score the better your model.

The goal of deep learning (or any machine learning, really) is to find the parameters that minimize that loss function. One technique for finding them is called gradient descent.

In practice you do this numerically, i.e. you let the computer do the work. You start at some random point in parameter space and find the derivative of the loss with respect to your parameters at that point (actually you find the gradient, but in 1-dimension they’re equivalent).

This derivative tells you which direction you should shift your parameters (i.e. whether your parameters should be larger or smaller). The gradient points in the direction where the loss increases, so you subtract it from your parameters, scaled by a constant that you choose. That constant is called the “learning rate”.

The idea is that if you iterate this enough times over the data set (each pass over the whole data set is called an “epoch”), you’ll eventually converge on a solution. In practice, we don’t compute the gradient over the whole data set for every update. Instead we split the data into minibatches and update the weights after each batch. When we use minibatches, we call this algorithm stochastic gradient descent (as opposed to regular gradient descent).
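A single update step for the straight-line model can be sketched like this. Note the assumptions: the lesson lets PyTorch compute the gradient for you, whereas here I've written the gradient of MSE out by hand in NumPy, and the data, starting point and learning rate are all made-up values.

```python
import numpy as np


def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()


def mse_gradient(a, b, x, y):
    """Gradient of MSE with respect to (a, b) for the line y_hat = a * x + b."""
    error = (a * x + b) - y
    grad_a = 2 * (error * x).mean()
    grad_b = 2 * error.mean()
    return grad_a, grad_b


# Toy data lying exactly on the line y = 3x + 2 (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3 * x + 2

a, b = 0.0, 0.0        # arbitrary starting point
learning_rate = 0.05   # hypothetical value, picked by hand

loss_before = mse(a * x + b, y)
grad_a, grad_b = mse_gradient(a, b, x, y)

# Both gradients are negative here, so subtracting them makes a and b larger.
a -= learning_rate * grad_a
b -= learning_rate * grad_b
loss_after = mse(a * x + b, y)
```

After this one step the loss is lower than before, and both parameters have moved towards the true values of 3 and 2.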
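Putting epochs and minibatches together, a rough from-scratch sketch might look like the following. It assumes NumPy, a hand-derived gradient instead of the automatic differentiation used in the lesson, and synthetic data; the batch size, learning rate and epoch count are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data scattered around the line y = 3x + 2 (made up for illustration).
n = 200
x = rng.uniform(-1, 1, size=n)
y = 3 * x + 2 + rng.normal(0, 0.1, size=n)

a, b = 0.0, 0.0       # arbitrary starting point
learning_rate = 0.1   # hypothetical values, picked by hand
batch_size = 20
epochs = 100

for epoch in range(epochs):
    # One epoch = one pass over the whole data set, in a fresh random order.
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        # Gradient of MSE on this minibatch only, then one update step.
        error = (a * x[idx] + b) - y[idx]
        a -= learning_rate * 2 * (error * x[idx]).mean()
        b -= learning_rate * 2 * error.mean()

final_loss = ((a * x + b - y) ** 2).mean()
```

By the end, a and b should sit close to 3 and 2. Each update sees only 20 points, so the path there is noisier than full gradient descent, but each step is much cheaper.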

Learning rate problems

You’ll need to ensure you choose the right learning rate. A sign that your learning rate is too high is a validation loss that is very large or keeps increasing. If your learning rate is too low, your loss will go down only very slowly and training will take too long.
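You can see both failure modes on a toy problem. This is only a sketch under my own assumptions: it uses plain gradient descent on four made-up points and reports the training loss (there is no validation set here), and the three learning rates are chosen by hand to show the effect.

```python
import numpy as np


def final_loss(learning_rate, steps=100):
    """Fit y_hat = a * x + b by gradient descent and return the final MSE."""
    x = np.array([0.0, 1.0, 2.0, 3.0])   # toy data on the line y = 3x + 2
    y = 3 * x + 2
    a, b = 0.0, 0.0
    for _ in range(steps):
        error = (a * x + b) - y
        grad_a = 2 * (error * x).mean()
        grad_b = 2 * error.mean()
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return ((a * x + b - y) ** 2).mean()


too_low = final_loss(0.001)     # barely moves in 100 steps
reasonable = final_loss(0.1)    # converges
too_high = final_loss(0.5)      # overshoots further on every step and blows up
```

With the same number of steps, the reasonable rate ends up with a loss near zero, the low rate is still far from converged, and the high rate ends up with an astronomically large loss.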


Overfitting and underfitting were discussed very briefly. Techniques that reduce overfitting are called “regularisation”. Specific methods for regularisation will be discussed later in the course (but basically, use dropout).

It’s also important to be able to identify when your model is overfitting or underfitting. To do that you need a validation set, i.e. data that you hold back and don’t train on.


  • If your training loss and validation loss both go down, and then at some point your validation loss goes up, you’re starting to overfit.
  • If your training loss is higher than your validation loss, you might be underfitting. (hint: you should generally do better on data your model is using to train compared to data your model hasn’t seen)

If you use too many epochs, you might end up overfitting (but it was mentioned in the lesson that it’s pretty hard to do).

Too few epochs means you might underfit. It will look similar to using a learning rate that is too low, so you’ll need to tweak both of these things to see how the model changes.

The lesson says there is a misconception that if the training loss is lower than the validation loss it means you are overfitting. That’s not true – you’re only overfitting when your training loss continues to go down while your validation loss goes up.
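The rule of thumb above can be written as a tiny check over per-epoch losses. This is a hypothetical helper of mine, not something from fastai, and the loss values are invented for illustration. Real loss curves are noisy, so in practice you would look at the trend rather than a single epoch.

```python
def overfitting_starts_at(train_losses, val_losses):
    """Return the first epoch index where validation loss rises while
    training loss still falls, or None if that never happens."""
    for epoch in range(1, len(val_losses)):
        train_falling = train_losses[epoch] < train_losses[epoch - 1]
        val_rising = val_losses[epoch] > val_losses[epoch - 1]
        if train_falling and val_rising:
            return epoch
    return None


# Invented losses: training loss is below validation loss the whole time,
# but validation loss only starts climbing at epoch index 4.
train = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val = [0.95, 0.70, 0.55, 0.50, 0.53, 0.58]
start = overfitting_starts_at(train, val)

# Training loss below validation loss, both still falling: not overfitting.
still_fine = overfitting_starts_at([0.9, 0.6, 0.4], [1.0, 0.8, 0.7])
```

Here start is 4 and still_fine is None, which matches the point about the misconception: a training loss below the validation loss is not by itself a sign of overfitting.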
