Practical Deep Learning - Lesson 3

This lesson covers how to train a digit classifier from scratch using the MNIST dataset.
Categories: Deep Learning, Python, Fastai, Course

Author: David Gwyer

Published: May 29, 2025

This chapter is all about the building blocks of creating a successful model: in this case, a computer vision model that can recognise hand-written digits. We’ll cover the individual components that make up the overall model, and how they are used together to form a working system that can perform accurate inference.

We’ll be using the MNIST dataset in this lesson for making predictions about hand-written digits. The full dataset contains 60,000 training images and 10,000 test images. However, to keep things simple in this lesson we’ll use a sample of the full dataset and try to predict only the digits ‘3’ and ‘7’ (rather than the full range of digits ‘0’ through ‘9’).

Downloading the Image Dataset

Downloading the dataset is pretty straightforward using the untar_data() function from fast.ai. Once the images are downloaded we can view some of the folders and filenames.

URLs.MNIST_SAMPLE
'https://s3.amazonaws.com/fast-ai-sample/mnist_sample.tgz'
path = untar_data(URLs.MNIST_SAMPLE)
path
Path('/home/dgwyer/.fastai/data/mnist_sample')
Path.BASE_PATH = path
path
Path('.')
path.ls()
(#3) [Path('labels.csv'),Path('valid'),Path('train')]
(path/'train').ls()
(#2) [Path('train/7'),Path('train/3')]
(path/'train/7').ls()
(#6265) [Path('train/7/7420.png'),Path('train/7/9878.png'),Path('train/7/47453.png'),Path('train/7/18966.png'),Path('train/7/27005.png'),Path('train/7/31957.png'),Path('train/7/14379.png'),Path('train/7/5811.png'),Path('train/7/33104.png'),Path('train/7/43686.png'),Path('train/7/58687.png'),Path('train/7/46356.png'),Path('train/7/4242.png'),Path('train/7/50455.png'),Path('train/7/54561.png'),Path('train/7/20105.png'),Path('train/7/2814.png'),Path('train/7/17185.png'),Path('train/7/38776.png'),Path('train/7/22313.png')...]
(path/'train/3').ls()
(#6131) [Path('train/3/47123.png'),Path('train/3/21559.png'),Path('train/3/17103.png'),Path('train/3/59660.png'),Path('train/3/59408.png'),Path('train/3/20738.png'),Path('train/3/8195.png'),Path('train/3/15109.png'),Path('train/3/54568.png'),Path('train/3/21075.png'),Path('train/3/20705.png'),Path('train/3/16811.png'),Path('train/3/43816.png'),Path('train/3/20869.png'),Path('train/3/42951.png'),Path('train/3/54020.png'),Path('train/3/48064.png'),Path('train/3/59996.png'),Path('train/3/44596.png'),Path('train/3/1476.png')...]
total_dataset_images = len((path/'train/7').ls()) + len((path/'train/3').ls())
total_dataset_images
12396

As you can see, the MNIST_SAMPLE dataset only contains the digits ‘3’ and ‘7’, and the training set has 12,396 images in total. Let’s take a look at a sample hand-written ‘7’ image.

img_path = path/'train'/'7'/os.listdir(path/'train'/'7')[0]
img = PILImage.create(img_path)
img.show(figsize=(2,2));

Let’s take a look at a few more random sample ‘7’ digits.

digit_dir = path/'train'/'7'
sampled_files = random.sample(digit_dir.ls(), 9)
sampled_files
[Path('train/7/17833.png'),
 Path('train/7/24882.png'),
 Path('train/7/56828.png'),
 Path('train/7/56543.png'),
 Path('train/7/10940.png'),
 Path('train/7/23192.png'),
 Path('train/7/19257.png'),
 Path('train/7/30153.png'),
 Path('train/7/10795.png')]
imgs = [PILImage.create(f) for f in sampled_files]
show_images(imgs, nrows=3, figsize=(5,5))

We can do the same for the ‘3’ hand-written digits.

digit_dir = path/'train'/'3'
sampled_files = random.sample(digit_dir.ls(), 9)
imgs = [PILImage.create(f) for f in sampled_files]
show_images(imgs, nrows=3, figsize=(5,5))

As you can see, these are all clearly hand-written ‘3’ and ‘7’ digits that are easily recognisable by humans, but how well can we train a deep learning model to achieve the same task? We’ll get to this a little later on.

Analyzing Pixel Data

Let’s store all ‘3’ and ‘7’ hand-drawn images into variables for convenience.

threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes[1]
Path('train/3/10000.png')

Neural network models are interested in numbers only, so let’s take a closer look at the numerical structure of our dataset. We’ll take an image path for a ‘3’ digit and convert it into an image format via the Python Imaging Library (PIL).

im3_path = threes[1]
im3 = Image.open(im3_path)
im3.shape
(28, 28)
array(im3)[4:10,4:10]
array([[  0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,  29],
       [  0,   0,   0,  48, 166, 224],
       [  0,  93, 244, 249, 253, 187],
       [  0, 107, 253, 253, 230,  48],
       [  0,   3,  20,  20,  15,   0]], dtype=uint8)

Each image is made up of 28 x 28 pixels (784 in total), each with a value from 0 to 255, which defines a grayscale image. In the NumPy array above we’re only showing pixel values for a small region near the top left of the image. We can display the image data as a PyTorch tensor too.

tensor(im3)[4:10, 4:10]
tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)

We can use a Pandas DataFrame to ‘color’ the grayscale values for a more intuitive visualization. In the plot below white pixels are stored as the number 0, black is the number 255, and shades of gray are between the two.

im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 29 150 195 254 255 254 176 193 150 96 0 0 0
2 0 0 0 48 166 224 253 253 234 196 253 253 253 253 233 0 0 0
3 0 93 244 249 253 187 46 10 8 4 10 194 253 253 233 0 0 0
4 0 107 253 253 230 48 0 0 0 0 0 192 253 253 156 0 0 0
5 0 3 20 20 15 0 0 0 0 0 43 224 253 245 74 0 0 0
6 0 0 0 0 0 0 0 0 0 0 249 253 245 126 0 0 0 0
7 0 0 0 0 0 0 0 14 101 223 253 248 124 0 0 0 0 0
8 0 0 0 0 0 11 166 239 253 253 253 187 30 0 0 0 0 0
9 0 0 0 0 0 16 248 250 253 253 253 253 232 213 111 2 0 0
10 0 0 0 0 0 0 0 43 98 98 208 253 253 253 253 187 22 0

Pixel Similarity

One approach to modelling the problem of digit classification is to take the average of each dataset (all the ‘3’ and ‘7’ images), and use them to determine how similar individual images are to the average ‘ideal’ images. Hopefully, this will lead to a simple but workable image classification of ‘3’ or ‘7’ digits.

First, we need to convert all the digits into PyTorch tensors, then stack all the individual ‘3’ and ‘7’ images, and finally take the average of each stack. Remember, to convert a single image into a tensor we can do the following (just a slice is shown).

tensor(Image.open(threes[1]))[4:10, 4:10]
tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)

The shape of this tensor is:

tensor(Image.open(threes[1])).shape
torch.Size([28, 28])

So we have just one ‘3’ image in our tensor, but now let’s add all the other ‘3’ digits. We’ll do this by converting all the individual images from the ‘3’ dataset to PyTorch tensors one at a time and then storing them all in a standard Python list.

three_tensors = [tensor(Image.open(o)) for o in threes]

Here we make use of a Python list comprehension to do the tensor conversion and list generation.

len(threes), len(three_tensors)
(6131, 6131)

So our newly generated list of ‘3’ digits has the same number of entries as the image paths list, and we can confirm that the entry at index 1 is the same as before.

three_tensors[1][4:10, 4:10]
tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)

Let’s do the same for the ‘7’ digits too before moving on.

seven_tensors = [tensor(Image.open(o)) for o in sevens]
len(sevens), len(seven_tensors)
(6265, 6265)

We can display a ‘3’ and ‘7’ image from each of our generated lists to make sure they look okay. Note that since the images are now PyTorch tensors we need to use the show_image() or show_images() function, otherwise Jupyter will just output numerical values.

matplotlib.rc('image', cmap='Greys')
show_images([three_tensors[1], seven_tensors[1]], figsize=(3,3));

To complete the calculation of the ideal (average) ‘3’ and ‘7’ digit we need to stack all three tensors and all seven tensors into two new tensors, and then take the average of each one. We’ll use the PyTorch stack() and mean() functions for this.

So, to convert our list of ‘3’ and ‘7’ tensors into individual stacked tensors we can use the stack function. While we’re at it, we’ll cast the pixel values from integers to floats (required when calculating means), and also normalize them to a number between 0 and 1. This is pretty standard practice when image data is in float format.

stacked_threes = torch.stack(three_tensors).float()/255
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes.shape, stacked_sevens.shape
(torch.Size([6131, 28, 28]), torch.Size([6265, 28, 28]))

You can think of each new tensor as a vertical stack of 28 x 28 digits all on top of one another. Using the mean() function we can ‘collapse’ these into a single tensor by taking the mean of all the pixel values at each location in the image.

mean3 = stacked_threes.mean(0)
mean7 = stacked_sevens.mean(0)
show_images([mean3, mean7], figsize=(3,3));

The zero in stacked_threes.mean(0) is the dimension along which the mean is calculated. Dimension 0 indexes the individual images in the stack, so averaging over it leaves a single 28 x 28 image.

The resulting images above represent what the ‘ideal’ image for a ‘3’ and ‘7’ looks like. They appear chunkier than the individual digits as the darker areas are where the pixels align and are common to most images. The blurrier areas are where the pixels are less consistent over all images in the dataset.
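To make the mean(0) step concrete, here is a minimal sketch in plain Python (no PyTorch) that averages a stack of tiny 2 x 2 "images" pixel by pixel. The names stack and mean_over_stack are illustrative only and are not part of the lesson code.

```python
# Three tiny 2x2 "images" piled on top of one another (shape 3 x 2 x 2).
stack = [
    [[0.0, 1.0], [0.5, 0.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 0.5], [0.0, 0.0]],
]

def mean_over_stack(images):
    """Average corresponding pixels across all images (a mean over dimension 0)."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]

mean_image = mean_over_stack(stack)
print(mean_image)  # a single 2x2 image: one average value per pixel location
```

Pixels that are dark in most images end up with a high average, while pixels that vary from image to image land somewhere in between, which is what gives the ‘ideal’ digits their blurry edges.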

Starting with the threes, we can calculate the distance between a sample digit selected from our dataset and the ‘ideal’ digit, averaged over every pixel.

a_3 = stacked_threes[1]
show_image(a_3);

dist_3_abs = (a_3 - mean3).abs().mean()
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()
dist_3_abs,dist_3_sqr
(tensor(0.1114), tensor(0.2021))

Here we are using two different ways to measure the distance between our ‘3’ and the ideal three:

  1. Mean absolute difference, or L1 norm.
  2. Root mean squared error (RMSE), or L2 norm.

Both give us a measure of how close the selected digit is to the ideal average. Let’s now compare the same digit with the ideal seven using both metrics.

dist_7_abs = (a_3 - mean7).abs().mean()
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
dist_7_abs,dist_7_sqr
(tensor(0.1586), tensor(0.3021))

We can use the calculated distances to try and determine if the selected digit was a ‘3’ or a ‘7’, i.e. was it closer to the ideal three or the ideal seven? With both measures the distance to the ideal ‘3’ is smaller, so we can ‘predict’ that the selected digit is in fact a three.

PyTorch provides (as you might expect!) ready-made loss functions for both measures, F.l1_loss and F.mse_loss, which produce the same results (taking the square root of the MSE to get the RMSE).

Note: Intuitively, the difference between L1 norm and mean squared error (MSE) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes).

F.l1_loss(a_3.float(),mean7), F.mse_loss(a_3,mean7).sqrt()
(tensor(0.1586), tensor(0.3021))
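To see the difference between the two measures on numbers small enough to check by hand, here is a sketch in plain Python. The helper names mean_abs_diff and rmse, and the toy vectors, are made up for illustration; they are not fastai or PyTorch functions.

```python
import math

def mean_abs_diff(a, b):
    """Mean absolute difference (the L1-style distance used above)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def rmse(a, b):
    """Root mean squared error (the L2-style distance used above)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

ideal = [0.0, 0.0, 0.0, 0.0]
many_small = [0.25, 0.25, 0.25, 0.25]  # four small errors
one_large = [1.0, 0.0, 0.0, 0.0]       # one big error, same total error

print(mean_abs_diff(many_small, ideal), mean_abs_diff(one_large, ideal))  # 0.25 0.25
print(rmse(many_small, ideal), rmse(one_large, ideal))                    # 0.25 0.5
```

Both inputs have the same mean absolute difference, but the RMSE is twice as large for the vector with a single big error, because squaring weights large differences more heavily.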

Computing Metrics Using Broadcasting

Next we’ll look at using metrics to evaluate our predictions. We have already encountered two measures, mean squared error and mean absolute error. While both are useful for making predictions, the values themselves are not that intuitive. So in practice we use other metrics such as accuracy, which measures how often the model gets the correct label and is often the most direct way to evaluate classification performance.

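When evaluating metrics we always use data that the model was NOT trained on. In the pixel similarity model we don’t yet have any trained components, but it is still good practice. Let’s compile ‘stacked’ tensors of ‘3’ and ‘7’ digits as we did before, but this time using the validation data. Note that, unlike the training stacks, the code below leaves the validation tensors as raw 0-255 integer pixel values; dividing by 255 as before would keep both sets on the same scale, and would change the figures reported below.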

path.ls()
(#3) [Path('labels.csv'),Path('valid'),Path('train')]
valid_3_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'3').ls()])
valid_7_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'7').ls()])
valid_3_tens.shape, valid_7_tens.shape
(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))

To calculate the accuracy metric we’ll need to do the following:

  - Define a function to calculate the mean distance for a specific image (or stack of images).
  - Confirm it works for a single image (compare to the previous value).
  - Define a function to predict if a digit is a ‘3’ or not.
  - Calculate the accuracy of predicting a ‘3’ or ‘7’, and the overall accuracy for all digits in the validation dataset.

def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
mnist_distance(a_3, mean3)
tensor(0.1114)

That’s the first two tasks done. We now have a general function to calculate the distance between a sample digit and the ‘ideal’ digit, and have confirmed it matches the value from before. Now we need another function to make the prediction about a digit.

def is_3(x): return mnist_distance(x, mean3) < mnist_distance(x, mean7)
is_3(a_3)
tensor(True)
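Because mnist_distance averages over the last two axes, the same call also works on a whole stack of images: PyTorch broadcasts the single 28 x 28 ideal image against every image in the stack and returns one distance per image. The plain-Python sketch below spells that behaviour out with explicit loops on flattened 4-pixel toy images. All names here (distance, batch_distance, ideal_a, ideal_b) are hypothetical stand-ins, not the lesson’s functions.

```python
def distance(image, ideal):
    """Mean absolute difference between one image and the ideal image."""
    return sum(abs(p - q) for p, q in zip(image, ideal)) / len(image)

def batch_distance(images, ideal):
    """Compare every image in the batch against the same ideal image."""
    return [distance(image, ideal) for image in images]

ideal_a = [1.0, 0.0, 1.0, 0.0]
ideal_b = [0.0, 1.0, 0.0, 1.0]
batch = [
    [1.0, 0.0, 0.5, 0.0],  # close to ideal_a
    [0.0, 1.0, 0.0, 0.5],  # close to ideal_b
    [1.0, 0.0, 1.0, 0.0],  # identical to ideal_a
]

dist_a = batch_distance(batch, ideal_a)
dist_b = batch_distance(batch, ideal_b)
is_a = [da < db for da, db in zip(dist_a, dist_b)]  # one prediction per image

labels = [True, False, True]
accuracy = sum(p == t for p, t in zip(is_a, labels)) / len(labels)
print(dist_a, dist_b, is_a, accuracy)
```

Broadcasting gives the same per-image result without the Python loop, which is why the accuracy calculation can pass an entire validation stack to is_3 in one go.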

And finally, let’s calculate the accuracies:

accuracy_3s = is_3(valid_3_tens).float().mean()
accuracy_7s = 1 - is_3(valid_7_tens).float().mean()
overall_acc = (accuracy_3s + accuracy_7s)/2
accuracy_3s, accuracy_7s, overall_acc
(tensor(0.9436), tensor(0.9815), tensor(0.9625))

So we have 94% accuracy for the ‘3’ digits, 98% accuracy(!) for the ‘7’ digits, and 96% accuracy overall for the two digits combined. We’ll compare these accuracy results with stochastic gradient descent in the next section.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) can be defined as the following set of steps. Here, weights are just some initially random parameters that we want to optimise to somehow improve our model predictions:

  1. Initialize the weights.
  2. For each image, use these weights to predict whether it appears to be a 3 or a 7.
  3. Based on these predictions, calculate how good the model is (its loss).
  4. Calculate the gradient, which measures, for each weight, how changing that weight would change the loss.
  5. Step (that is, change) all the weights based on that calculation.
  6. Go back to step 2, and repeat the process.
  7. Iterate until you decide to stop the training process (for instance, because the model is good enough or you don’t want to wait any longer).
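The steps above can be sketched on a toy problem with a single weight. This is a simplified illustration, not the lesson’s training code: it fits y = w * x to four made-up points, uses the whole dataset on every step rather than mini-batches, and works out the gradient by hand instead of relying on autograd. The names xs, ys, loss_fn and grad_fn are assumptions for the example.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data generated by y = 2 * x

def loss_fn(w):
    """Mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad_fn(w):
    """Derivative of the loss with respect to w: mean(2 * x * (w * x - y))."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0                  # step 1: initialise the weight
lr = 0.01                # learning rate (step size)
initial_loss = loss_fn(w)

for step in range(200):  # steps 2-6, repeated
    g = grad_fn(w)       # steps 2-4: predict, measure the loss, get the gradient
    w -= lr * g          # step 5: move the weight against the gradient

final_loss = loss_fn(w)  # step 7: stop once the loss is small enough
print(w, initial_loss, final_loss)
```

The weight ends up very close to 2, the value used to generate the data. The rest of this section applies the same loop to 784 weights and a bias, with PyTorch calculating the gradients.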
stacked_threes.shape, stacked_sevens.shape
(torch.Size([6131, 28, 28]), torch.Size([6265, 28, 28]))
torch.cat([stacked_threes, stacked_sevens]).shape
torch.Size([12396, 28, 28])

This concatenates the two separate stacks of ‘3’ digits and ‘7’ digits into one larger stacked tensor. The shape of this new rank-3 tensor is now 12396 x 28 x 28. That is, 12396 layers of 28 x 28 digits if you want to think of it that way.

However, we also want the image data to be a long vector of 28*28=784 values rather than a 28 x 28 matrix. The data doesn’t change at all, we’re simply ‘flattening’ the data to make it easier to feed as input to the neural network. We can use the PyTorch view function to do the rank-3 to rank-2 conversion.

train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
train_x.shape
torch.Size([12396, 784])
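As a small illustration of what flattening means, the plain-Python sketch below turns a 2 x 3 ‘image’ into a vector of 6 values, row by row, and then does the same for a batch. The flatten helper is a made-up name for this example; in the lesson the view call does the equivalent job on tensors.

```python
def flatten(image):
    """Lay the rows of a 2D image end to end to form one long vector."""
    return [pixel for row in image for pixel in row]

image = [[1, 2, 3],
         [4, 5, 6]]
flat = flatten(image)
print(flat)  # [1, 2, 3, 4, 5, 6]

# A batch of images (rank-3) becomes a batch of vectors (rank-2).
batch = [image, [[7, 8, 9], [10, 11, 12]]]
flat_batch = [flatten(img) for img in batch]
print(len(flat_batch), len(flat_batch[0]))  # 2 6
```

No pixel values change; only the way they are grouped does.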

We also need a label for each image. We’ll use 1 for 3s and 0 for 7s.

train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
train_y.shape
torch.Size([12396, 1])

Here, we build a list of 1s (one per ‘3’ image) followed by 0s (one per ‘7’ image) and convert it to a tensor. unsqueeze(1) is used to convert the rank-1 tensor of shape [12396] to a rank-2 tensor of shape [12396, 1]. A Dataset in PyTorch is required to return a tuple of (x,y) when indexed. Python provides a zip function which, when combined with list, provides a simple way to get this functionality.

dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y
(torch.Size([784]), tensor([1]))

dset stores all the image/label pairs in a list of 12396 tuples. Remember, the first tuple item is a 784 long rank-1 tensor representing the original 28 x 28 image in flattened form, and the second tuple item is the label associated with the image.

x[265:270], y # small part of the image data and the label
(tensor([0.8078, 0.9961, 0.9961, 0.9961, 0.9961]), tensor([1]))

We need to do the same for the validation set.

valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))

Now, let’s define a function to create the initial set of random weights for our model (one for every pixel).

def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()
weights = init_params((28*28,1))
weights.shape
torch.Size([784, 1])

A sample of our generated weights looks like this for the first few pixels. Notice the grad_fn in the output, which shows that gradient tracking is enabled for these weights.

weights[0: 10]
tensor([[ 1.9269],
        [ 1.4873],
        [ 0.9007],
        [-2.1055],
        [ 0.6784],
        [-1.2345],
        [-0.0431],
        [-1.6047],
        [-0.7521],
        [ 1.6487]], grad_fn=<SliceBackward0>)

We also need to generate the initial value for the bias. This is a single number (stored in a rank-1 tensor with one element), as the model has one output and the same bias is added to the prediction for every image.

bias = init_params(1)
bias
tensor([0.3472], requires_grad=True)
train_x[0].shape, weights.shape, weights.T.shape
(torch.Size([784]), torch.Size([784, 1]), torch.Size([1, 784]))

We now have enough data defined that we can make a first prediction calculation. That is, multiply the first image in the training set by the random weights, sum them, and add the bias.

(train_x[0]*weights.T).sum() + bias
tensor([-6.2330], grad_fn=<AddBackward0>)

We basically want to do a dot product between the image data vector and the weights and add the bias value. But multiplying vectors in PyTorch using the * operator is an element-wise operation only, so we need to perform the summation manually too.

However, we can use the @ operator instead to do a full matrix multiplication (i.e. dot product in this case) to achieve the same result.

train_x[0] @ weights + bias
tensor([-6.2330], grad_fn=<AddBackward0>)
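The equivalence between ‘multiply element-wise, then sum’ and a dot product is easy to check on a three-pixel toy example. This is a plain-Python sketch with made-up values; dot is a hypothetical helper standing in for what the @ operator does for a vector and a single column of weights.

```python
def dot(a, b):
    """Dot product: multiply matching elements and add up the results."""
    return sum(x * y for x, y in zip(a, b))

pixels = [0.0, 0.5, 1.0]     # a tiny flattened "image"
weights = [0.2, -0.4, 0.6]   # one weight per pixel
bias = 0.1

elementwise = [p * w for p, w in zip(pixels, weights)]
pred_manual = sum(elementwise) + bias   # the * then sum approach
pred_dot = dot(pixels, weights) + bias  # the dot product approach
print(pred_manual, pred_dot)

# The same weights and bias give one prediction per image in a batch.
batch = [pixels, [1.0, 1.0, 0.0]]
batch_preds = [dot(image, weights) + bias for image in batch]
print(batch_preds)
```

Both approaches perform the same arithmetic; the dot product simply packages the multiply and the sum into one operation, which is what makes it easy to extend to every image at once.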

And because it’s just matrix multiplication we can calculate the dot product (i.e. initial predictions) for all images in the training dataset just as easily. The first value corresponds to the first image in the dataset, so with the same weights and bias it would match the single-image result above; the numbers printed here differ, presumably because the cells were run with different random weights.

def linear1(xb): return xb.float() @ weights + bias
preds = linear1(train_x)
preds
tensor([[-12.6396],
        [ -0.3468],
        [  9.8740],
        ...,
        [-19.9456],
        [ -1.6718],
        [-23.8029]], grad_fn=<AddBackward0>)
len(preds), preds.shape, train_y.shape
(12396, torch.Size([12396, 1]), torch.Size([12396, 1]))

And now we can determine how many of the predictions were correct. We’re treating every prediction above zero as a three and everything else as a seven; if the prediction matches the label the result is true.

corrects = (preds>0.0).float() == train_y
corrects.shape, corrects
(torch.Size([12396, 1]),
 tensor([[False],
         [False],
         [ True],
         ...,
         [ True],
         [ True],
         [ True]]))

And the overall accuracy is the fraction of predictions that are correct.

corrects.float().mean().item()
0.6535172462463379

As expected this is not very good since we’re starting with a random set of weights. To improve we will need to use a loss function, calculate the gradients of the loss with respect to each weight, and use them to update the weights and hopefully improve the loss. We will put all this together into a complete training loop in the next section!

A Complete Training Loop

The training loop will consist of predictions, a loss calculation, gradient calculations, and weight updates. This will be repeated until the loss converges to an acceptable value. In order to calculate the loss we need a loss function.

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()

The mnist_loss function defines a simple custom loss for binary classification by first applying a sigmoid to the model’s raw outputs to convert them into probabilities (values between 0.0 and 1.0). We use torch.where to compute the loss. For targets that are 1 (positive class), it returns 1 - prediction (penalizing underconfidence), and for targets that are 0 (negative class), it returns prediction (penalizing false positives). This creates a loss that encourages high probabilities for the correct class and low probabilities for the incorrect class. Finally, the loss is averaged over the batch.
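A quick way to get a feel for this loss is to evaluate it on a few hand-picked predictions. The sketch below re-creates the same calculation in plain Python; mnist_loss_py and the example values are illustrative names and numbers, not part of the lesson code.

```python
import math

def sigmoid(x):
    """Squash any real number into the range 0 to 1."""
    return 1 / (1 + math.exp(-x))

def mnist_loss_py(predictions, targets):
    """Plain-Python version of the loss: 1 - p for targets of 1, p for targets of 0."""
    losses = []
    for raw, target in zip(predictions, targets):
        p = sigmoid(raw)
        losses.append(1 - p if target == 1 else p)
    return sum(losses) / len(losses)

targets = [1, 0]
confident_right = mnist_loss_py([4.0, -4.0], targets)  # both predictions on the correct side
unsure = mnist_loss_py([0.0, 0.0], targets)            # sigmoid(0) = 0.5 for both
confident_wrong = mnist_loss_py([-4.0, 4.0], targets)  # both predictions on the wrong side
print(confident_right, unsure, confident_wrong)
```

The loss is close to 0 when the model is confidently right, exactly 0.5 when it sits on the fence, and close to 1 when it is confidently wrong. Unlike accuracy, it changes smoothly as the raw predictions change, which is what makes its gradient useful.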

DataLoader and Dataset

Before moving on it’s useful to clarify what Dataset and DataLoader are and how they work. For instance, we can pass any Python collection to a DataLoader and it will return an iterator over mini-batches.

coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)
[tensor([ 8,  0, 13,  3,  2]),
 tensor([14,  4,  6,  7,  9]),
 tensor([ 5,  1, 10, 12, 11])]

This is very convenient, and powerful! However, for training a model, we require a collection containing independent and dependent variables (inputs and targets of the model). A collection that contains tuples of independent and dependent variables is known in PyTorch as a Dataset. Here’s an example of an extremely simple Dataset:

ds = L(enumerate(string.ascii_lowercase))
ds
(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j'),(10, 'k'),(11, 'l'),(12, 'm'),(13, 'n'),(14, 'o'),(15, 'p'),(16, 'q'),(17, 'r'),(18, 's'),(19, 't')...]

When we pass a Dataset to a DataLoader we will get back mini-batches which are themselves tuples of tensors representing batches of independent and dependent variables. You can think of a DataLoader as yielding batches of data.

dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
[(tensor([25, 11,  4,  1,  7, 21]), ('z', 'l', 'e', 'b', 'h', 'v')),
 (tensor([19,  0,  8, 13, 16, 23]), ('t', 'a', 'i', 'n', 'q', 'x')),
 (tensor([ 3,  6, 12, 17, 18,  2]), ('d', 'g', 'm', 'r', 's', 'c')),
 (tensor([14,  9, 10, 15, 22,  5]), ('o', 'j', 'k', 'p', 'w', 'f')),
 (tensor([24, 20]), ('y', 'u'))]
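To demystify what is happening here, this is a rough plain-Python sketch of the same idea: shuffle the indices, slice them into chunks, and regroup each chunk of (x, y) tuples into a pair of batches. The function name iter_batches and its seed argument are assumptions for illustration; the real DataLoader does considerably more (converting to tensors and so on).

```python
import random
import string

def iter_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield (inputs, targets) mini-batches from a list of (x, y) tuples."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = [dataset[i] for i in indices[start:start + batch_size]]
        xs, ys = zip(*chunk)  # regroup pairs into one batch of x and one of y
        yield list(xs), list(ys)

ds = list(enumerate(string.ascii_lowercase))
batches = list(iter_batches(ds, batch_size=6, shuffle=True, seed=0))
for xb, yb in batches:
    print(xb, yb)
```

As in the output above, 26 items with a batch size of 6 give four full batches and a final batch of 2, and each input stays paired with its own target even after shuffling.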

Building a Training Loop

Let’s start to work on the training loop now. First we’ll re-initialize our parameters.

weights = init_params((28*28,1))
bias = init_params(1)

If you remember from earlier we already created the training and validation datasets. These are lists of tuples, with each tuple containing a vector of 784 pixel values and a label of 1 or 0 indicating whether the image is a ‘3’ or a ‘7’. For example, the first tuple in the training dataset dset is (image data cropped):

(dset[0][0][125:135],  dset[0][1])
(tensor([0.8588, 0.6510, 0.4627, 0.4627, 0.0235, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]),
 tensor([1]))

We can now create the training and validation DataLoader objects.

dl = DataLoader(dset, batch_size=256)
valid_dl = DataLoader(valid_dset, batch_size=256)
xb,yb = first(dl)
xb.shape,yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))
train_x.shape
torch.Size([12396, 784])

Next, we define functions to calculate the gradient, and train for one epoch. That is, cycle through all mini-batches that our DataLoader yields.

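def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()  # accumulates gradients into each parameter's .grad

def train_epoch(model, lr, params):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr  # step against the gradient
            p.grad.zero_()       # reset so gradients don't accumulate across batches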

We’ll also need functions to evaluate a single batch accuracy and to average this over all batches:

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()
def validate_epoch(model):
    accs = [batch_accuracy(model(xb.float()), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
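One detail worth knowing about validate_epoch is that it averages the per-batch accuracies, which is not quite the same as the accuracy over all images when the final batch is smaller. With 2038 validation images and a batch size of 256, the last batch holds 246 images, so the effect here is tiny. The exaggerated plain-Python example below (made-up correctness flags, hypothetical accuracy helper) shows the difference.

```python
def accuracy(correct_flags):
    """Fraction of True values in a list of correct/incorrect flags."""
    return sum(correct_flags) / len(correct_flags)

batches = [
    [True, True, True, False],   # batch accuracy 0.75
    [True, True, False, False],  # batch accuracy 0.5
    [True],                      # smaller final batch, accuracy 1.0
]

# What validate_epoch does: average the per-batch accuracies.
mean_of_batches = sum(accuracy(b) for b in batches) / len(batches)

# Accuracy over every individual prediction.
all_flags = [flag for b in batches for flag in b]
overall = accuracy(all_flags)

print(mean_of_batches, overall)
```

Here the single-item final batch counts as much as a full batch, so the two figures differ noticeably. With nearly equal batch sizes, as in this lesson, averaging per batch is a reasonable shortcut.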

Let’s try this out and train for one epoch.

lr = 1.
params = weights,bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)

Let’s repeat for a few more epochs.

for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')
0.5031 0.6842 0.7503 0.9013 0.9457 0.956 0.9618 0.9647 0.9667 0.9681 0.9701 0.9706 0.972 0.9725 0.973 0.9735 0.974 0.9745 0.975 0.975 

Not bad. Our accuracy is over 97% after 20 epochs, which is already better than the pixel similarity approach!

Going Further - Creating an Optimizer!

So far we’ve created almost all training and validation code from scratch. While this is very important for our understanding, PyTorch provides some useful classes that make it easier to implement. The first thing we’ll do is use the nn.Linear module, which does the same thing as our init_params and linear1 together. It contains both the weights and biases in a single class.

linear_model = nn.Linear(28*28,1)
linear_model
Linear(in_features=784, out_features=1, bias=True)

We can see what parameters are available in our linear module.

w,b = linear_model.parameters()
w.shape,b.shape
(torch.Size([1, 784]), torch.Size([1]))

We can use this to create a basic optimizer.

class BasicOptim:
    def __init__(self,params,lr): self.params,self.lr = list(params),lr

    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None
opt = BasicOptim(linear_model.parameters(), lr)

And thus our single epoch training loop can be simplified to:

def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()

We can try this out for multiple epochs using a simple loop inside a function.

def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')
train_model(linear_model, 20)
0.4932 0.83 0.8598 0.9194 0.9394 0.9536 0.9629 0.9662 0.9682 0.9706 0.9726 0.9741 0.9745 0.9755 0.977 0.977 0.9775 0.9775 0.9779 0.9784 

We can generalize this more by using the SGD optimizer provided by fastai, which essentially does the same as the BasicOptim class.

linear_model = nn.Linear(28*28,1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)
0.4932 0.791 0.8696 0.9223 0.9419 0.9546 0.9633 0.9672 0.9692 0.9721 0.9726 0.9741 0.9736 0.9755 0.977 0.977 0.9779 0.9779 0.9779 0.9784 

Another abstraction fastai provides is Learner.fit, which we can use instead of train_model. To create a ‘Learner’ we first need to create a DataLoaders object, by passing in our training and validation DataLoaders.

dls = DataLoaders(dl, valid_dl)

We can then pass this into Learner() using some of the functionality we defined earlier, and then call the fit() method to begin training for a specified number of epochs.

learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=lr)
epoch train_loss valid_loss batch_accuracy time
0 0.636727 0.503329 0.495584 00:00
1 0.471677 0.220506 0.803729 00:00
2 0.175191 0.163141 0.854269 00:00
3 0.077810 0.100827 0.915604 00:00
4 0.041945 0.074958 0.935231 00:00
5 0.027924 0.060686 0.947988 00:00
6 0.022141 0.051695 0.955348 00:00
7 0.019544 0.045649 0.963690 00:00
8 0.018206 0.041354 0.965653 00:00
9 0.017388 0.038157 0.967125 00:00

Adding a Nonlinearity

There are just a couple more things we need to do to make this a ‘proper’ neural network: add a non-linear activation function, and another layer. This adds enough (non-linear) complexity to help the model learn better weights and hence achieve better accuracy. Instead of using a single nn.Linear we use nn.Sequential so we can easily ‘compose’ neural networks made up of multiple layers.

simple_net = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,1)
)
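To see why the ReLU between the two linear layers matters, consider a one-dimensional sketch in plain Python. Two linear functions applied back to back are equivalent to a single linear function, so without the nonlinearity the extra layer adds nothing new. The weights below are arbitrary made-up values, and the function names are illustrative only.

```python
def linear(x, w, b):
    return w * x + b

def relu(x):
    """Replace negative values with zero, leave positive values unchanged."""
    return max(0.0, x)

w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def stacked_linear(x):
    """Two linear layers with nothing in between."""
    return linear(linear(x, w1, b1), w2, b2)

def collapsed(x):
    """A single linear layer with combined weight and bias."""
    return linear(x, w1 * w2, w2 * b1 + b2)

def with_relu(x):
    """Two linear layers with a ReLU in between."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

inputs = [-2.0, -1.0, 0.0, 1.0]
print([stacked_linear(x) for x in inputs])  # identical to the collapsed version
print([collapsed(x) for x in inputs])
print([with_relu(x) for x in inputs])       # flat for negative inputs, then sloped
```

The first two lists match exactly, while the ReLU version bends: it is flat where the first layer’s output is negative and sloped elsewhere. That kink is something no single linear layer can represent.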

Then we create our Learner and call the fit function as before, except this time we are using an additional layer.

learn = Learner(dls, simple_net, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)
epoch train_loss valid_loss batch_accuracy time
0 0.317320 0.410194 0.504416 00:00
1 0.148899 0.226347 0.809127 00:00
2 0.082341 0.113844 0.916585 00:00
3 0.054091 0.077327 0.942591 00:00
4 0.040992 0.060660 0.955839 00:00
5 0.034286 0.051260 0.964181 00:00
6 0.030415 0.045308 0.965653 00:00
7 0.027889 0.041224 0.966634 00:00
8 0.026063 0.038250 0.968106 00:00
9 0.024646 0.035980 0.969087 00:00
10 0.023498 0.034179 0.971541 00:00
11 0.022539 0.032703 0.973013 00:00
12 0.021721 0.031463 0.973994 00:00
13 0.021015 0.030397 0.974485 00:00
14 0.020396 0.029464 0.974975 00:00
15 0.019849 0.028638 0.974975 00:00
16 0.019360 0.027900 0.976448 00:00
17 0.018920 0.027235 0.976448 00:00
18 0.018522 0.026633 0.976938 00:00
19 0.018158 0.026085 0.977429 00:00
20 0.017824 0.025587 0.977429 00:00
21 0.017515 0.025131 0.977429 00:00
22 0.017229 0.024712 0.978410 00:00
23 0.016962 0.024328 0.978901 00:00
24 0.016712 0.023974 0.979392 00:00
25 0.016477 0.023646 0.979392 00:00
26 0.016256 0.023343 0.979392 00:00
27 0.016046 0.023061 0.979882 00:00
28 0.015848 0.022799 0.980864 00:00
29 0.015659 0.022555 0.980864 00:00
30 0.015480 0.022328 0.980864 00:00
31 0.015309 0.022115 0.981845 00:00
32 0.015146 0.021915 0.982336 00:00
33 0.014990 0.021727 0.981845 00:00
34 0.014840 0.021551 0.981845 00:00
35 0.014697 0.021384 0.982336 00:00
36 0.014560 0.021228 0.982336 00:00
37 0.014427 0.021079 0.981845 00:00
38 0.014300 0.020939 0.981845 00:00
39 0.014178 0.020806 0.981845 00:00
plt.plot(L(learn.recorder.values).itemgot(2));

This pushes us up to 98% accuracy with a two-layer neural network! What would happen if we used, say, an 18-layer network?

dls = ImageDataLoaders.from_folder(path)
learn = vision_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
epoch train_loss valid_loss accuracy time
0 0.093842 0.011492 0.996075 00:10

Now we have achieved almost 100% accuracy, which demonstrates the power of deep learning neural networks! Even though this is a fairly simple model (by today’s standards) it’s still a useful exercise to get an early feel for training models from scratch on datasets and generating high quality results.
