fast.ai v3 Lesson 3: Multi-label classification; segmentation
I’m doing some 5-minute summaries of fast.ai Practical Deep Learning for Coders lessons.
This is lesson 3.
This lesson continues with the same two-step process described in lesson one, but applied to several other domains.
There are several differences (described later in this post), chiefly that some of the problems are no longer multi-class classification problems.
Turns out the only meaningful difference in most of these is the way you get the data into a form that you can pass into a fastai learner.
There was some discussion of the fastai library’s data block API, which uses something similar to the fluent interface pattern. This API lets you apply any number of configurations to your data (source of data, transforms, normalisation, validation sets, etc.) and ends with you calling databunch() and passing in the class of the DataBunch you’re interested in. “Learners” in the fastai library expect a DataBunch, as previously mentioned.
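As a rough sketch of that fluent-interface style, here is a toy pipeline in plain Python. All class and method names below are made up for illustration; the real fastai data block API uses different classes, but the chaining style (each step returns the object, the final call produces the bundle) is the same.

```python
# Toy illustration of a fluent ("method chaining") data pipeline.
# All names here are hypothetical, not the real fastai API.
class ToyDataPipeline:
    def __init__(self, items):
        self.items = list(items)
        self.steps = []

    def split_by_pct(self, valid_pct=0.2):
        self.steps.append(f"split {valid_pct}")
        return self  # returning self is what enables the chaining

    def label_from_folder(self):
        self.steps.append("label")
        return self

    def transform(self, size=224):
        self.steps.append(f"transform {size}")
        return self

    def databunch(self, bs=64):
        # Final call: hand everything over to a "DataBunch"-like object
        return {"items": self.items, "steps": self.steps, "bs": bs}

bunch = (ToyDataPipeline(["img1", "img2"])
         .split_by_pct(0.2)
         .label_from_folder()
         .transform(128)
         .databunch(bs=32))
```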
Multi-label classification (satellite images)
This is from a Kaggle competition where each satellite image has multiple labels, making it a multi-label classification problem.
There was some discussion on how to get the data into the right format: unlike in multi-class classification, we can’t just put the images into per-class folders, because each image belongs to several labels.
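Multi-label datasets are instead commonly shipped as a CSV mapping each file to a space-separated list of tags, which you then turn into multi-hot target vectors. A minimal sketch in plain Python (the rows below are made-up examples in that shape):

```python
# Turn "filename -> space-separated tags" rows into multi-hot target vectors.
# The rows below are invented examples, not real competition data.
rows = [
    ("img_0", "agriculture clear primary"),
    ("img_1", "clear water"),
]

# Build the vocabulary of all labels seen in the data
labels = sorted({tag for _, tags in rows for tag in tags.split()})

def multi_hot(tags, labels):
    """One position per label; 1 if the image has that label, else 0."""
    present = set(tags.split())
    return [1 if l in present else 0 for l in labels]

targets = {name: multi_hot(tags, labels) for name, tags in rows}
```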
Image segmentation (CamVid)
This is an image segmentation problem, where the dataset consists of stills from video feeds, along with hand-labelled ground-truth categories for every pixel.
Using something called a U-Net (which can still use a ResNet as its encoder), the two-step process achieved results that beat the 2015 state of the art.
Human pose finder
There was another dataset where the model finds the coordinates of the centre of a human face.
As coordinates are continuous values, this is actually a regression problem.
IMDB sentiment classifier
This is a dataset of IMDB reviews. The only new part is a step called tokenisation, which splits text into tokens; each token is then mapped to a number (numericalisation).
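A minimal sketch of tokenisation plus numericalisation (the real fastai pipeline uses a proper tokeniser with special tokens; this toy version just lowercases and splits on whitespace):

```python
# Toy tokenisation + numericalisation, for illustration only.
def tokenize(text):
    return text.lower().split()

def build_vocab(texts):
    vocab = {"<unk>": 0}  # reserve an id for unknown tokens
    for t in texts:
        for tok in tokenize(t):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

vocab = build_vocab(["This movie was great", "This movie was awful"])
ids = numericalize("this movie was great", vocab)
```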
Progressively increase resolution
Jeremy talks about a technique he found for computer vision which gives really good results.
The way it works is you first train on a lower-resolution dataset to the point of nearly overfitting. Then you use transfer learning on the model you just trained: replace your learner’s DataBunch with a DataBunch of larger-resolution images, and repeat the two-step process.
As far as the CNN is concerned, this is an entirely new dataset. Training on these larger images has a regularising effect – it generalises better.
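One way to picture the first stage: the low-resolution dataset can be produced by simply average-pooling the original images. A NumPy sketch (fastai does the resizing for you when you build the new DataBunch with a larger size; this is just to show the idea):

```python
import numpy as np

def downsample(img, factor):
    """Average-pool an HxW image by an integer factor,
    producing the 'lower resolution' version for the first training stage."""
    h = img.shape[0] // factor * factor
    w = img.shape[1] // factor * factor
    img = img[:h, :w]  # crop so the dimensions divide evenly
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for a real image
small = downsample(img, 2)  # 2x2 version for the low-resolution stage
```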
One-cycle learning rate schedule
This was alluded to in my lesson 1 summary, but the technique was explained in detail here. It looks like the particular algorithm has changed since the last course.
The learning rate changes within the epoch (or a cycle, if you only have one cycle): it starts low, ramps up to the maximum learning rate you specify, and then decays to a lower learning rate.
The reason this works well is that the initial ramp up allows you to (a) jump quickly to a place near the global minimum, and (b) jump out of any local minima. Then the learning rate starts to decay, as it should, preventing your model from diverging.
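A piecewise-linear sketch of that schedule (fastai’s actual fit_one_cycle uses cosine annealing and also cycles momentum; the divisors and warmup fraction below are illustrative, not the library’s defaults):

```python
# Piecewise-linear sketch of a one-cycle learning rate schedule:
# ramp up from max_lr/start_div to max_lr, then decay to max_lr/end_div.
def one_cycle_lr(step, total_steps, max_lr,
                 start_div=10.0, end_div=100.0, pct_warmup=0.3):
    warmup = int(total_steps * pct_warmup)
    if step < warmup:
        t = step / warmup  # 0 -> 1 over the ramp-up phase
        return max_lr / start_div + t * (max_lr - max_lr / start_div)
    t = (step - warmup) / max(1, total_steps - warmup)  # 0 -> 1 over the decay
    return max_lr + t * (max_lr / end_div - max_lr)

schedule = [one_cycle_lr(s, 100, 0.1) for s in range(101)]
```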
Mixed precision training
This is actually really interesting. If you use lower precision for your mathematical computations (16-bit floating point rather than 32-bit), you can sometimes end up with better results: it turns out that using less precision occasionally improves generalisation.
Other benefits are lower GPU memory usage and faster training. (This looks like a technique I really need to try out.)
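Two of those effects are easy to see directly with NumPy: float16 halves the memory per value, and it has far fewer mantissa bits, so small differences get rounded away:

```python
import numpy as np

# Half the memory per element
a32 = np.zeros(1000, dtype=np.float32)
a16 = np.zeros(1000, dtype=np.float16)
print(a32.nbytes, a16.nbytes)  # 4000 vs 2000 bytes

# Fewer mantissa bits: near 1.0, float16 steps are about 0.001,
# so 1.0001 rounds to exactly 1.0
x = np.float16(1.0001)
print(float(x))  # 1.0
```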
What a neural network really is
The lesson then talks about what a neural network actually is.
Short answer: basically a higher-order function consisting of linear functions interleaved with non-linear functions.
If f is a linear function and g is a non-linear function, a neural network is basically something like g(f(g(f(g(f(x)))))), depending on how many “layers” you have.
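That composition is easy to write out directly. A tiny two-layer network in NumPy, with made-up random weights and ReLU as the non-linearity g:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first linear layer: 3 inputs -> 4 hidden units
W2 = rng.normal(size=(2, 4))  # second linear layer: 4 hidden -> 2 outputs

f1 = lambda x: W1 @ x           # linear
f2 = lambda h: W2 @ h           # linear
g = lambda z: np.maximum(0, z)  # non-linear (ReLU)

x = np.array([1.0, -2.0, 0.5])
y = f2(g(f1(x)))  # g(f(x)) fed through another linear layer: a 2-layer network
```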
The reason this works is due to something called the universal approximation theorem.
Historically, we used sigmoid functions for the non-linear functions. Recently, we almost always want to use something called the “rectified linear unit” or ReLU. It’s a silly name, and is basically this:
import numpy as np
def relu(x): return np.maximum(0, x)  # element-wise max of each value and 0
The reason we need non-linear functions (sometimes called non-linear activation functions) is because a series of linear functions can only approximate linear functions.
This is actually quite easy to prove with linear algebra (not covered in the lesson).
A linear function is a fancy name for a matrix multiplication. If you have several linear transformations, it’s basically this function:
y = A @ B @ x
The @ operator performs a matrix multiplication (in Python 3.5+ and NumPy).
Since matrix multiplication is “associative”, A @ (B @ x) is equal to (A @ B) @ x. And A @ B simply gives you another matrix, C, so this becomes C @ x, which is just another matrix multiplication, i.e. a linear function.
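The collapse is easy to check numerically with random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
x = rng.normal(size=5)

C = A @ B  # the two "linear layers" collapse into a single matrix

# Associativity: grouping doesn't matter...
assert np.allclose(A @ (B @ x), (A @ B) @ x)
# ...so the stacked linear layers are equivalent to one linear function C @ x
assert np.allclose(C @ x, A @ B @ x)
```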
Introducing non-linearity solves the problem.