Practical Deep Learning - Lesson 3

This lesson is all about the building blocks of a successful model: in this case, a computer vision model that can recognise hand-written digits. We’ll cover the individual components that make up the overall model, and how they work together to form a system that can make accurate predictions.

We’ll be using the MNIST dataset in this lesson to make predictions about hand-written digits. The full dataset has 60,000 training images and 10,000 test images. However, to keep things simple we’ll use a sample of the full dataset and try to predict only the digits ‘3’ and ‘7’ (rather than the full range of digits ‘0’ through ‘9’).

Downloading the Image Dataset

Downloading the dataset is pretty straightforward using the untar_data() function from fast.ai. Once the images are downloaded we can view some of the folders and filenames.

URLs.MNIST_SAMPLE

'https://s3.amazonaws.com/fast-ai-sample/mnist_sample.tgz'

path = untar_data(URLs.MNIST_SAMPLE)

path

Path('/home/dgwyer/.fastai/data/mnist_sample')

Path.BASE_PATH = path

path

Path('.')

path.ls()

(#3) [Path('labels.csv'),Path('valid'),Path('train')]

(path/'train').ls()

(#2) [Path('train/7'),Path('train/3')]

(path/'train/7').ls()

(#6265) [Path('train/7/7420.png'),Path('train/7/9878.png'),Path('train/7/47453.png'),Path('train/7/18966.png'),Path('train/7/27005.png'),Path('train/7/31957.png'),Path('train/7/14379.png'),Path('train/7/5811.png'),Path('train/7/33104.png'),Path('train/7/43686.png'),Path('train/7/58687.png'),Path('train/7/46356.png'),Path('train/7/4242.png'),Path('train/7/50455.png'),Path('train/7/54561.png'),Path('train/7/20105.png'),Path('train/7/2814.png'),Path('train/7/17185.png'),Path('train/7/38776.png'),Path('train/7/22313.png')...]

(path/'train/3').ls()

(#6131) [Path('train/3/47123.png'),Path('train/3/21559.png'),Path('train/3/17103.png'),Path('train/3/59660.png'),Path('train/3/59408.png'),Path('train/3/20738.png'),Path('train/3/8195.png'),Path('train/3/15109.png'),Path('train/3/54568.png'),Path('train/3/21075.png'),Path('train/3/20705.png'),Path('train/3/16811.png'),Path('train/3/43816.png'),Path('train/3/20869.png'),Path('train/3/42951.png'),Path('train/3/54020.png'),Path('train/3/48064.png'),Path('train/3/59996.png'),Path('train/3/44596.png'),Path('train/3/1476.png')...]

total_dataset_images = len((path/'train/7').ls()) + len((path/'train/3').ls())
total_dataset_images

As you can see, the MNIST_SAMPLE dataset only contains the digits ‘3’ and ‘7’, and the training set has 12396 images in total (6265 sevens and 6131 threes). Let’s take a look at a sample hand-written ‘7’ image.

img_path = path/'train'/'7'/os.listdir(path/'train'/'7')[0]
img = PILImage.create(img_path)
img.show(figsize=(2,2));

Let’s take a look at a few more random sample ‘7’ digits.

digit_dir = path/'train'/'7'
sampled_files = random.sample(digit_dir.ls(), 9)
sampled_files

[Path('train/7/17833.png'),
 Path('train/7/24882.png'),
 Path('train/7/56828.png'),
 Path('train/7/56543.png'),
 Path('train/7/10940.png'),
 Path('train/7/23192.png'),
 Path('train/7/19257.png'),
 Path('train/7/30153.png'),
 Path('train/7/10795.png')]

imgs = [PILImage.create(f) for f in sampled_files]
show_images(imgs, nrows=3, figsize=(5,5))

We can do the same for the ‘3’ hand-written digits.

digit_dir = path/'train'/'3'
sampled_files = random.sample(digit_dir.ls(), 9)
imgs = [PILImage.create(f) for f in sampled_files]
show_images(imgs, nrows=3, figsize=(5,5))

As you can see, these are all clearly hand-written ‘3’ and ‘7’ digits that humans recognise easily. But how well can we train a deep learning model to do the same task? We’ll get to this a little later on.

Analyzing Pixel Data

Let’s store all ‘3’ and ‘7’ hand-drawn images into variables for convenience.

threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()

threes[1]

Path('train/3/10000.png')

Neural network models are interested in numbers only, so let’s take a closer look at the numerical structure of our dataset. We’ll take an image path for a ‘3’ digit and convert it into an image format via the Python Imaging Library (PIL).

im3_path = threes[1]
im3 = Image.open(im3_path)
im3.shape

(28, 28)

array(im3)[4:10,4:10]

array([[  0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,  29],
       [  0,   0,   0,  48, 166, 224],
       [  0,  93, 244, 249, 253, 187],
       [  0, 107, 253, 253, 230,  48],
       [  0,   3,  20,  20,  15,   0]], dtype=uint8)

Each image is made up of 28 x 28 pixels (784 in total), each with a value from 0 to 255, which together define a grayscale image. In the NumPy array above we’re only showing pixel values for a small region near the top left of the image. We can display the image data as a PyTorch tensor too.

tensor(im3)[4:10, 4:10]

tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)

We can use a Pandas DataFrame to ‘color’ the grayscale values for a more intuitive visualization. In the plot below white pixels are stored as the number 0, black is the number 255, and shades of gray are between the two.

im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	29	150	195	254	255	254	176	193	150	96	0	0
2	0	0	48	166	224	253	253	234	196	253	253	253	253	233	0	0
3	93	244	249	253	187	46	10	8	4	10	194	253	253	233	0	0
4	107	253	253	230	48	0	0	0	0	0	192	253	253	156	0	0
5	3	20	20	15	0	0	0	0	0	43	224	253	245	74	0	0
6	0	0	0	0	0	0	0	0	0	249	253	245	126	0	0	0
7	0	0	0	0	0	0	14	101	223	253	248	124	0	0	0	0
8	0	0	0	0	11	166	239	253	253	253	187	30	0	0	0	0
9	0	0	0	0	16	248	250	253	253	253	253	232	213	111	2	0
10	0	0	0	0	0	0	43	98	98	208	253	253	253	253	187	22

Pixel Similarity

One approach to modelling the problem of digit classification is to take the average of each dataset (all the ‘3’ and ‘7’ images), and use them to determine how similar individual images are to the average ‘ideal’ images. Hopefully, this will lead to a simple but workable image classification of ‘3’ or ‘7’ digits.

First, we need to convert all the digits into PyTorch tensors, then stack all the individual ‘3’ and ‘7’ images, and finally take the average of each stack. Remember, to convert a single image into a tensor we can do the following (just a slice is shown).

tensor(Image.open(threes[1]))[4:10, 4:10]

tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)

The shape of this tensor is:

tensor(Image.open(threes[1])).shape

torch.Size([28, 28])

So far our tensor holds just one ‘3’ image, so now let’s add all the other ‘3’ digits. We’ll do this by converting all the individual images from the ‘3’ dataset to PyTorch tensors one at a time and storing them in a standard Python list.

three_tensors = [tensor(Image.open(o)) for o in threes]

Here we make use of a Python list comprehension to do the tensor conversion and list generation.

len(threes), len(three_tensors)

(6131, 6131)

So our newly generated list of ‘3’ digits has the same number of entries as the image paths list, and we can confirm that the entry at index 1 is the same image as before.

three_tensors[1][4:10, 4:10]

tensor([[  0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,  29],
        [  0,   0,   0,  48, 166, 224],
        [  0,  93, 244, 249, 253, 187],
        [  0, 107, 253, 253, 230,  48],
        [  0,   3,  20,  20,  15,   0]], dtype=torch.uint8)

Let’s do the same for the ‘7’ digits too before moving on.

seven_tensors = [tensor(Image.open(o)) for o in sevens]

len(sevens), len(seven_tensors)

(6265, 6265)

We can display a ‘3’ and ‘7’ image from each of our generated lists to make sure they look okay. Note that since the images are now PyTorch tensors we need to use the show_image() or show_images() function, otherwise Jupyter will just output numerical values.

matplotlib.rc('image', cmap='Greys')
show_images([three_tensors[1], seven_tensors[1]], figsize=(3,3));

To complete the calculation of the ideal (average) ‘3’ and ‘7’ digit we need to stack all three tensors and all seven tensors into two new tensors, and then take the average of each one. We’ll use the PyTorch stack() and mean() functions for this.

So, to convert our list of ‘3’ and ‘7’ tensors into individual stacked tensors we can use the stack function. While we’re at it, we’ll cast the pixel values from integers to floats (required when calculating means), and also normalize them to a number between 0 and 1. This is pretty standard practice when image data is in float format.

stacked_threes = torch.stack(three_tensors).float()/255
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes.shape, stacked_sevens.shape

(torch.Size([6131, 28, 28]), torch.Size([6265, 28, 28]))

You can think of each new tensor as a vertical stack of 28 x 28 digits all on top of one another. Using the mean() function we can ‘collapse’ these into a single tensor by taking the mean of all the pixel values at each location in the image.

mean3 = stacked_threes.mean(0)
mean7 = stacked_sevens.mean(0)
show_images([mean3, mean7], figsize=(3,3));

The zero in stacked_threes.mean(0) is the dimension along which the mean is calculated. Dimension 0 indexes the images in the stack, so averaging along it leaves a single 28 x 28 tensor.

The resulting images above represent what the ‘ideal’ image for a ‘3’ and ‘7’ looks like. They appear chunkier than the individual digits as the darker areas are where the pixels align and are common to most images. The blurrier areas are where the pixels are less consistent over all images in the dataset.
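To make the per-pixel averaging concrete, here is a minimal plain-Python sketch of what taking the mean along dimension 0 does. The three 2 x 2 ‘images’ and the helper name `mean_over_images` are made up for illustration; PyTorch does the same thing far more efficiently.

```python
# Three tiny 2x2 "images" with pixel values already scaled to 0..1.
stack = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.5], [1.0, 0.0]],
]


def mean_over_images(images):
    """Average a list of equally sized 2-D images pixel by pixel."""
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / n for c in range(cols)]
        for r in range(rows)
    ]


mean_image = mean_over_images(stack)
print(mean_image)
```

Pixels that are ‘on’ in most images end up close to 1, pixels that are rarely ‘on’ end up close to 0, and the in-between values are what make the average digits look blurry.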

Starting with the threes, we can calculate the mean distance between each pixel in a sample digit from our dataset and the ‘ideal’ digit.

a_3 = stacked_threes[1]
show_image(a_3);

dist_3_abs = (a_3 - mean3).abs().mean()
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()
dist_3_abs,dist_3_sqr

(tensor(0.1114), tensor(0.2021))

Here we are using two variations of the mean to calculate the distance between our ‘3’ and the ideal three:

1. Mean absolute difference, or L1 norm
2. Root mean squared error (RMSE), or L2 norm

Both measure how close the selected digit is to the ideal average. Let’s now compare the same digit with the ideal seven using both metrics.

dist_7_abs = (a_3 - mean7).abs().mean()
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
dist_7_abs,dist_7_sqr

(tensor(0.1586), tensor(0.3021))

We can use the calculated distances to try and determine whether the selected digit was a ‘3’ or a ‘7’, i.e. was it closer to the ideal three or the ideal seven? With both measures the digit is closer to the ideal ‘3’, so we can ‘predict’ that it is in fact a three.

PyTorch provides (as you might expect!) ready-made loss functions for both: F.l1_loss for the mean absolute difference and F.mse_loss for the mean squared error (take the square root to get RMSE). They produce the same results as our manual calculations.

Note: Intuitively, the difference between L1 norm and mean squared error (MSE) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes).

F.l1_loss(a_3.float(),mean7), F.mse_loss(a_3,mean7).sqrt()

(tensor(0.1586), tensor(0.3021))
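The note above about squared error penalizing bigger mistakes more heavily can be checked on a small example. This is a plain-Python sketch with made-up four-pixel ‘images’; `mean_abs_diff` and `rmse` are illustrative helper names rather than PyTorch functions.

```python
import math


def mean_abs_diff(a, b):
    """Mean absolute difference (the L1-style distance)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def rmse(a, b):
    """Root mean squared error (the L2-style distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


target = [0.0, 0.0, 0.0, 0.0]
many_small = [0.25, 0.25, 0.25, 0.25]  # four small errors
one_large = [1.0, 0.0, 0.0, 0.0]       # one big error, same total error

l1_small = mean_abs_diff(many_small, target)
l1_large = mean_abs_diff(one_large, target)
l2_small = rmse(many_small, target)
l2_large = rmse(one_large, target)
print(l1_small, l1_large, l2_small, l2_large)
```

Both candidates have the same mean absolute difference, but the RMSE of the candidate with one large error is twice that of the candidate with four small errors.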

Computing Metrics Using Broadcasting

Next we’ll look at using metrics to evaluate our predictions. We have already encountered two distance measures, mean absolute difference and root mean squared error. While both are useful for making predictions, the values themselves are not that intuitive. So in practice we use other metrics such as accuracy, which measures how often the model gets the correct label and is often the most direct way to evaluate classification performance.

When evaluating metrics we always use data that the model was NOT trained on. The pixel similarity model doesn’t have any trained components yet, but it is still good practice. Let’s compile ‘stacked’ tensors of ‘3’ and ‘7’ digits as we did before, but this time using the validation data. Note that, unlike the training stacks, the code below does not divide by 255, so the validation pixels stay in the 0-255 range; appending .float()/255 would put both sets on the same scale, and the accuracy figures could then differ from those shown here.

path.ls()

(#3) [Path('labels.csv'),Path('valid'),Path('train')]

valid_3_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'3').ls()])
valid_7_tens = torch.stack([tensor(Image.open(o)) for o in (path/'valid'/'7').ls()])
valid_3_tens.shape, valid_7_tens.shape

(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))

To calculate the accuracy metric we’ll need to do the following:

- Define a function to calculate the mean distance for a specific image (or stack of images)
- Confirm it works for a single image (compare to the previous value)
- Define a function to predict whether a digit is a ‘3’ or not
- Calculate the accuracy of predicting a ‘3’ or a ‘7’, and the overall accuracy for all digits in the validation dataset

def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
mnist_distance(a_3, mean3)

tensor(0.1114)

That’s the first two tasks done. We now have a general function to calculate the distance between a sample digit and the ‘ideal’ digit, and we’ve confirmed it matches the value from before. Now we need another function to make the prediction about a digit.

def is_3(x): return mnist_distance(x, mean3) < mnist_distance(x, mean7)
is_3(a_3)

tensor(True)
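The same two functions also work on a whole stack of images thanks to broadcasting: PyTorch compares the single ‘ideal’ image against every image in the stack and returns one distance per image. Conceptually that is the loop below, written in plain Python with hypothetical 2 x 2 images and helper names (`distance`, `distances_for_stack`) that are not part of the lesson code.

```python
def distance(image, ideal):
    """Mean absolute difference between two equally sized 2-D images."""
    diffs = [
        abs(p - q)
        for image_row, ideal_row in zip(image, ideal)
        for p, q in zip(image_row, ideal_row)
    ]
    return sum(diffs) / len(diffs)


def distances_for_stack(stack, ideal):
    """One distance per image: the same 'ideal' is reused for every image."""
    return [distance(image, ideal) for image in stack]


ideal_a = [[1.0, 1.0], [0.0, 0.0]]  # hypothetical 'ideal' for class A
ideal_b = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical 'ideal' for class B
stack = [
    [[1.0, 0.5], [0.0, 0.0]],       # resembles class A
    [[0.0, 0.0], [0.5, 1.0]],       # resembles class B
]

dist_a = distances_for_stack(stack, ideal_a)
dist_b = distances_for_stack(stack, ideal_b)
is_a = [da < db for da, db in zip(dist_a, dist_b)]
print(dist_a, dist_b, is_a)
```

With tensors there is no explicit loop: the subtraction is broadcast across the stack, and taking the mean over the last two dimensions leaves one value per image.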

And finally, let’s calculate the accuracies:

accuracy_3s = is_3(valid_3_tens).float().mean()
accuracy_7s = 1 - is_3(valid_7_tens).float().mean()
overall_acc = (accuracy_3s + accuracy_7s)/2

accuracy_3s, accuracy_7s, overall_acc

(tensor(0.9436), tensor(0.9815), tensor(0.9625))

So we have 94% accuracy for the ‘3’ digits, 98% accuracy(!) for the ‘7’ digits, and 96% accuracy overall for the two digits combined. We’ll compare these results with a model trained using stochastic gradient descent in the next section.
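One small detail: averaging the two per-class accuracies weights both classes equally. Because the validation set holds slightly more sevens than threes, the accuracy pooled over all images is marginally different. A quick check using the rounded figures printed above (so the result is only approximate):

```python
n_threes, n_sevens = 1010, 1028          # validation set sizes shown earlier
acc_threes, acc_sevens = 0.9436, 0.9815  # rounded per-class accuracies

# Simple average of the two class accuracies (what the lesson computes).
averaged = (acc_threes + acc_sevens) / 2

# Accuracy pooled over every validation image, weighted by class size.
pooled = (acc_threes * n_threes + acc_sevens * n_sevens) / (n_threes + n_sevens)

print(averaged, pooled)
```

The difference is tiny here because the classes are almost balanced, but it can matter a lot on datasets where one class dominates.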

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) can be defined as the following set of steps. Here, weights are just some initially random parameters that we want to optimise to somehow improve our model predictions:

1. Initialize the weights.
2. For each image, use these weights to predict whether it appears to be a 3 or a 7.
3. Based on these predictions, calculate how good the model is (its loss).
4. Calculate the gradient, which measures, for each weight, how changing that weight would change the loss.
5. Step (that is, change) all the weights based on that calculation.
6. Go back to step 2, and repeat the process.
7. Iterate until you decide to stop the training process (for instance, because the model is good enough or you don’t want to wait any longer).
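The steps listed above can be sketched on a deliberately tiny problem: a single weight `w` and the loss `(w - 3)^2`, whose gradient `2 * (w - 3)` is written by hand rather than computed by PyTorch. The loss function, starting value, and learning rate are all illustrative assumptions, and with no mini-batches involved this is plain gradient descent rather than true SGD.

```python
def loss_fn(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad_fn(w):
    """Derivative of (w - 3)^2 with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0    # initialise the weight (fixed here instead of random)
lr = 0.1   # learning rate
history = []
for step in range(50):     # repeat, then stop after a fixed number of steps
    loss = loss_fn(w)      # measure how good the current weight is
    grad = grad_fn(w)      # how the loss changes as w changes
    w -= lr * grad         # step the weight against the gradient
    history.append(loss)

print(w, history[0], history[-1])
```

The loss shrinks on every step and `w` ends up very close to 3, which is exactly the behaviour we want from the weights of our digit model.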

stacked_threes.shape, stacked_sevens.shape

(torch.Size([6131, 28, 28]), torch.Size([6265, 28, 28]))

torch.cat([stacked_threes, stacked_sevens]).shape

torch.Size([12396, 28, 28])

This concatenates the two separate stacks of ‘3’ digits and ‘7’ digits into one larger stacked tensor. The shape of this new rank-3 tensor is 12396 x 28 x 28. That is, 12396 layers of 28 x 28 digits, if you want to think of it that way.

However, we also want the image data to be a long vector of 28*28=784 values rather than a 28 x 28 matrix. The data doesn’t change at all, we’re simply ‘flattening’ the data to make it easier to feed as input to the neural network. We can use the PyTorch view function to do the rank-3 to rank-2 conversion.

train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
train_x.shape

torch.Size([12396, 784])
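Flattening keeps every pixel and only changes how it is addressed: the pixel at row r, column c of a 28 x 28 image ends up at position r*28 + c in the 784-long vector. Here is a plain-Python sketch with a hypothetical 3 x 4 grid, assuming row-major order (which is how a contiguous PyTorch tensor is laid out); `flat_index` is an illustrative helper.

```python
rows, cols = 3, 4
# Encode each pixel's position in its value, e.g. row 2, col 1 -> 21.
image = [[r * 10 + c for c in range(cols)] for r in range(rows)]

# Row-major flattening: the rows are laid end to end.
flat = [pixel for row in image for pixel in row]


def flat_index(r, c, n_cols):
    """Position of pixel (r, c) in the flattened vector."""
    return r * n_cols + c


# The same pixel is reachable through either addressing scheme.
assert flat[flat_index(2, 1, cols)] == image[2][1]
last_index_28 = flat_index(27, 27, 28)
print(len(flat), last_index_28)
```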

We also need a label for each image. We’ll use 1 for 3s and 0 for 7s.

train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
train_y.shape

torch.Size([12396, 1])

Here, we create a tensor of 1s followed by 0s, with as many 1s as there are threes and as many 0s as there are sevens. unsqueeze() is used to convert a rank-1 tensor of shape [12396] into a rank-2 tensor of shape [12396, 1]. A Dataset in PyTorch is required to return a tuple of (x,y) when indexed. Python provides a zip function which, when combined with list, provides a simple way to get this functionality.

dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y

(torch.Size([784]), tensor([1]))

dset stores all the image/label pairs as a list of 12396 tuples. Remember, the first item in each tuple is a rank-1 tensor of length 784 representing the original 28 x 28 image in flattened form, and the second item is the label associated with the image.

x[265:270], y # small part of the image data and the label

(tensor([0.8078, 0.9961, 0.9961, 0.9961, 0.9961]), tensor([1]))

We need to do the same for the validation set.

valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))

Now, let’s define a function to create the initial set of random weights for our model (one for every pixel).

def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()
weights = init_params((28*28,1))
weights.shape

torch.Size([784, 1])

A sample of our generated weights looks like this for the first few pixels. Notice the grad_fn in the output, which shows that gradient tracking is enabled for these weights.

weights[0: 10]

tensor([[ 1.9269],
        [ 1.4873],
        [ 0.9007],
        [-2.1055],
        [ 0.6784],
        [-1.2345],
        [-0.0431],
        [-1.6047],
        [-0.7521],
        [ 1.6487]], grad_fn=<SliceBackward0>)

We also need to generate the initial value for the bias. This is a single number (stored here as a rank-1 tensor with one element), as the model has just one output and therefore just one bias.

bias = init_params(1)
bias

tensor([0.3472], requires_grad=True)

train_x[0].shape, weights.shape, weights.T.shape

(torch.Size([784]), torch.Size([784, 1]), torch.Size([1, 784]))

We now have everything we need to make a first prediction. That is, multiply each pixel of the first training image by its weight, sum the products, and add the bias.

(train_x[0]*weights.T).sum() + bias

tensor([-6.2330], grad_fn=<AddBackward0>)

We basically want to take the dot product of the image vector and the weights, then add the bias. But multiplying tensors in PyTorch with the * operator is an element-wise operation, so we need to perform the summation ourselves.

However, we can use the @ operator instead to do a full matrix multiplication (i.e. dot product in this case) to achieve the same result.

train_x[0] @ weights + bias

tensor([-6.2330], grad_fn=<AddBackward0>)

And because it’s just matrix multiplication we can calculate the initial predictions for all images in the training dataset just as easily. The first value in the result corresponds to the first image in the dataset, so for a given set of weights and bias it equals the single-image calculation above (the numbers printed here differ, presumably because the parameters were re-initialized between runs).

def linear1(xb): return xb.float() @ weights + bias
preds = linear1(train_x)
preds

tensor([[-12.6396],
        [ -0.3468],
        [  9.8740],
        ...,
        [-19.9456],
        [ -1.6718],
        [-23.8029]], grad_fn=<AddBackward0>)

len(preds), preds.shape, train_y.shape

(12396, torch.Size([12396, 1]), torch.Size([12396, 1]))

And now we can determine how many of the predictions were correct. We’re treating every prediction above zero as a three and everything else as a seven; where that matches the label, the result is True.

corrects = (preds>0.0).float() == train_y
corrects.shape, corrects

(torch.Size([12396, 1]),
 tensor([[False],
         [False],
         [ True],
         ...,
         [ True],
         [ True],
         [ True]]))

And the overall accuracy is the fraction of predictions that are correct.

corrects.float().mean().item()

0.6535172462463379

As expected this is not very good, since we’re starting with a random set of weights. To improve we will need to use a loss function, calculate the gradients of the loss with respect to each weight, and use them to update the weights and hopefully improve the loss. We will put all this together into a complete training loop in the next section!
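The prediction-and-scoring steps from this section can be traced by hand on a toy batch. The three-pixel ‘images’, weights, bias, and labels below are made up for illustration; `linear` mirrors what `linear1` does with `@`, one row at a time.

```python
def linear(batch, weights, bias):
    """Weighted sum of each row plus a bias, i.e. batch @ weights + bias."""
    return [sum(x * w for x, w in zip(row, weights)) + bias for row in batch]


batch = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 1.0],
    [0.0, 0.0, 1.0],
]
labels = [1, 0, 1, 1]        # 1 = 'three', 0 = 'seven'
weights = [2.0, -3.0, -1.0]
bias = 0.5

preds = linear(batch, weights, bias)
# A prediction above zero is read as a 'three' (label 1).
corrects = [(p > 0.0) == bool(y) for p, y in zip(preds, labels)]
accuracy = sum(corrects) / len(corrects)
print(preds, corrects, accuracy)
```

Half of the toy predictions are right, which is roughly what you would expect from arbitrary weights on a two-class problem.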

A Complete Training Loop

The training loop will consist of predictions, a loss calculation, gradient calculations, and weight updates. This is repeated until the loss converges to an acceptable value (or we decide to stop). In order to calculate the loss we need a loss function.

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()

The mnist_loss function defines a simple custom loss for binary classification by first applying a sigmoid to the model’s raw outputs to convert them into probabilities (values between 0.0 and 1.0). We use torch.where to compute the loss. For targets that are 1 (positive class), it returns 1 - prediction (penalizing underconfidence), and for targets that are 0 (negative class), it returns prediction (penalizing false positives). This creates a loss that encourages high probabilities for the correct class and low probabilities for the incorrect class. Finally, the loss is averaged over the batch.
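To get a feel for the numbers, here is a plain-Python counterpart of the loss applied to three made-up cases: confidently right, unsure, and confidently wrong. `mnist_loss_plain` is an illustrative name and works on lists of numbers rather than tensors.

```python
import math


def sigmoid(x):
    """Squash any real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def mnist_loss_plain(predictions, targets):
    """Mean of (1 - p) where the target is 1, and p where the target is 0."""
    probs = [sigmoid(p) for p in predictions]
    per_item = [1.0 - p if t == 1 else p for p, t in zip(probs, targets)]
    return sum(per_item) / len(per_item)


targets = [1, 0]
confident_right = mnist_loss_plain([4.0, -4.0], targets)
unsure = mnist_loss_plain([0.0, 0.0], targets)
confident_wrong = mnist_loss_plain([-4.0, 4.0], targets)
print(confident_right, unsure, confident_wrong)
```

The loss is close to 0 when the model is confidently right, exactly 0.5 when it sits on the fence, and close to 1 when it is confidently wrong. Unlike accuracy, it changes smoothly as the raw predictions change, which is what makes it usable for gradient descent.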

DataLoader and Dataset

Before moving on it’s useful to clarify what Dataset and DataLoader are and how they work. For instance, we can pass any Python collection to a DataLoader and it will return an iterator over mini-batches.

coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)

[tensor([ 8,  0, 13,  3,  2]),
 tensor([14,  4,  6,  7,  9]),
 tensor([ 5,  1, 10, 12, 11])]

This is very convenient, and powerful! However, for training a model, we require a collection containing independent and dependent variables (inputs and targets of the model). A collection that contains tuples of independent and dependent variables is known in PyTorch as a Dataset. Here’s an example of an extremely simple Dataset:

ds = L(enumerate(string.ascii_lowercase))
ds

(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j'),(10, 'k'),(11, 'l'),(12, 'm'),(13, 'n'),(14, 'o'),(15, 'p'),(16, 'q'),(17, 'r'),(18, 's'),(19, 't')...]

When we pass a Dataset to a DataLoader we will get back mini-batches which are themselves tuples of tensors representing batches of independent and dependent variables. You can think of a DataLoader as yielding batches of data.

dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)

[(tensor([25, 11,  4,  1,  7, 21]), ('z', 'l', 'e', 'b', 'h', 'v')),
 (tensor([19,  0,  8, 13, 16, 23]), ('t', 'a', 'i', 'n', 'q', 'x')),
 (tensor([ 3,  6, 12, 17, 18,  2]), ('d', 'g', 'm', 'r', 's', 'c')),
 (tensor([14,  9, 10, 15, 22,  5]), ('o', 'j', 'k', 'p', 'w', 'f')),
 (tensor([24, 20]), ('y', 'u'))]
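Conceptually, a DataLoader shuffles the indices of a dataset, walks through them in consecutive slices, and collates each slice into a tuple of columns. The generator below is a simplified stand-in written for illustration: `iter_batches` is a hypothetical helper, not a fastai or PyTorch function, and it returns plain tuples rather than tensors.

```python
import random
import string


def iter_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield (inputs, targets) tuples holding at most batch_size items each."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        chunk = [dataset[i] for i in indices[start:start + batch_size]]
        xs, ys = zip(*chunk)  # split the (x, y) pairs into two columns
        yield xs, ys


ds = list(enumerate(string.ascii_lowercase))
batches = list(iter_batches(ds, batch_size=6, shuffle=True, seed=0))
batch_sizes = [len(xs) for xs, _ in batches]
print(batch_sizes)
```

As with the real DataLoader output above, the last batch is smaller because 26 items do not divide evenly into batches of 6.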

Building a Training Loop

Let’s start to work on the training loop now. First we’ll re-initialize our parameters.

weights = init_params((28*28,1))
bias = init_params(1)

If you remember from earlier we already created the training and validation datasets. These are lists of tuples, with each tuple containing a 784-element vector of pixel values and a label (1 for a ‘3’, 0 for a ‘7’). For example the first tuple in the training dataset dset is (image data cropped):

(dset[0][0][125:135],  dset[0][1])

(tensor([0.8588, 0.6510, 0.4627, 0.4627, 0.0235, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]),
 tensor([1]))

We can now create the training and validation DataLoader objects.

dl = DataLoader(dset, batch_size=256)
valid_dl = DataLoader(valid_dset, batch_size=256)
xb,yb = first(dl)
xb.shape,yb.shape

(torch.Size([256, 784]), torch.Size([256, 1]))

train_x.shape

torch.Size([12396, 784])

Next, we define functions to calculate the gradient, and train for one epoch. That is, cycle through all mini-batches that our DataLoader yields.

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

def train_epoch(model, lr, params):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr
            p.grad.zero_()

We’ll also need functions to evaluate a single batch accuracy and to average this over all batches:

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()

def validate_epoch(model):
    accs = [batch_accuracy(model(xb.float()), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
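`batch_accuracy` thresholds the sigmoid output at 0.5, whereas the earlier accuracy check thresholded the raw prediction at 0. The two are equivalent, because sigmoid(0) is 0.5 and the sigmoid is strictly increasing. A quick plain-Python check on some made-up raw outputs:

```python
import math


def sigmoid(x):
    """Squash any real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


raw_outputs = [-6.2, -0.3, 0.0, 0.3, 9.9]

# Threshold the probabilities at 0.5 ...
by_probability = [sigmoid(x) > 0.5 for x in raw_outputs]
# ... or threshold the raw outputs at 0: the decisions are identical.
by_raw_value = [x > 0.0 for x in raw_outputs]

print(by_probability == by_raw_value)
```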

Let’s try this out and train for one epoch.

lr = 1.
params = weights,bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)

Let’s repeat for a few more epochs.

for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')

0.5031 0.6842 0.7503 0.9013 0.9457 0.956 0.9618 0.9647 0.9667 0.9681 0.9701 0.9706 0.972 0.9725 0.973 0.9735 0.974 0.9745 0.975 0.975

Not bad. Our accuracy is over 97% after roughly 20 epochs, which is already better than the pixel similarity approach!

Going Further - Creating an Optimizer!

So far we’ve created almost all of the training and validation code from scratch. While this is very important for our understanding, PyTorch provides some useful classes that make it easier to implement. The first thing we’ll do is use the nn.Linear module, which does the same thing as our init_params and linear1 functions together. It contains both the weights and the bias in a single class.

linear_model = nn.Linear(28*28,1)
linear_model

Linear(in_features=784, out_features=1, bias=True)

We can see what parameters are available in our linear module.

w,b = linear_model.parameters()
w.shape,b.shape

(torch.Size([1, 784]), torch.Size([1]))

We can use this to create a basic optimizer.

class BasicOptim:
    def __init__(self,params,lr): self.params,self.lr = list(params),lr

    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None

opt = BasicOptim(linear_model.parameters(), lr)

And thus our single epoch training loop can be simplified to:

def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
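The division of labour here is worth spelling out: something else computes the gradients, and the optimizer only applies them and then clears them. A plain-Python analogue makes that visible. `Param` and `TinyOptim` are hypothetical stand-ins for a tensor and for BasicOptim, and the gradient of the toy loss `(w - 3)^2` is written by hand instead of coming from `loss.backward()`.

```python
class Param:
    """Minimal stand-in for a tensor that holds a value and its gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = None


class TinyOptim:
    """Same shape as BasicOptim: step() applies gradients, zero_grad() clears them."""

    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr

    def step(self):
        for p in self.params:
            p.value -= p.grad * self.lr

    def zero_grad(self):
        for p in self.params:
            p.grad = None


w = Param(0.0)
opt = TinyOptim([w], lr=0.1)
for _ in range(100):
    w.grad = 2.0 * (w.value - 3.0)  # hand-written gradient of (w - 3)^2
    opt.step()
    opt.zero_grad()

print(w.value)
```

Because the optimizer only ever sees parameters and their gradients, the same class works for any model, which is why swapping in a library optimizer later requires no change to the training loop.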

We can try this out for multiple epochs using a simple loop inside a function.

def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')

train_model(linear_model, 20)

0.4932 0.83 0.8598 0.9194 0.9394 0.9536 0.9629 0.9662 0.9682 0.9706 0.9726 0.9741 0.9745 0.9755 0.977 0.977 0.9775 0.9775 0.9779 0.9784

We can go a step further and replace our BasicOptim class with the fastai SGD class, which does essentially the same thing.

linear_model = nn.Linear(28*28,1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)

0.4932 0.791 0.8696 0.9223 0.9419 0.9546 0.9633 0.9672 0.9692 0.9721 0.9726 0.9741 0.9736 0.9755 0.977 0.977 0.9779 0.9779 0.9779 0.9784

Another abstraction fastai provides is Learner.fit, which we can use instead of train_model. To create a ‘Learner’ we first need to create a DataLoaders object, by passing in our training and validation DataLoaders.

dls = DataLoaders(dl, valid_dl)

We can then pass this into Learner() using some of the functionality we defined earlier, and then call the fit() method to begin training for a specified number of epochs.

learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=lr)

epoch	train_loss	valid_loss	batch_accuracy	time
0	0.636727	0.503329	0.495584	00:00
1	0.471677	0.220506	0.803729	00:00
2	0.175191	0.163141	0.854269	00:00
3	0.077810	0.100827	0.915604	00:00
4	0.041945	0.074958	0.935231	00:00
5	0.027924	0.060686	0.947988	00:00
6	0.022141	0.051695	0.955348	00:00
7	0.019544	0.045649	0.963690	00:00
8	0.018206	0.041354	0.965653	00:00
9	0.017388	0.038157	0.967125	00:00

Adding a Nonlinearity

There are just a couple more things we need to do to make this a ‘proper’ neural network: add a non-linear activation function and another layer. The non-linearity lets the model represent functions that a single linear layer cannot, which can lead to better accuracy. Instead of using nn.Linear on its own we use nn.Sequential, so we can easily ‘compose’ neural networks made up of multiple layers.

simple_net = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,1)
)
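Why does the ReLU in the middle matter? Without it, two linear layers in a row collapse into a single linear layer, so the extra layer would add nothing. The one-dimensional sketch below uses made-up weights to show this; `layer1`, `layer2`, and the other function names are purely illustrative.

```python
def relu(x):
    """Replace negative values with zero."""
    return max(0.0, x)


def layer1(x):
    return 2.0 * x + 1.0    # hypothetical first linear layer


def layer2(x):
    return -3.0 * x + 0.5   # hypothetical second linear layer


def stacked_linear(x):
    """Two linear layers with nothing in between."""
    return layer2(layer1(x))


def collapsed(x):
    """-3 * (2x + 1) + 0.5 simplifies to one linear function."""
    return -6.0 * x - 2.5


def with_relu(x):
    """The same two layers with a ReLU between them."""
    return layer2(relu(layer1(x)))


xs = [-2.0, -1.0, 0.0, 1.0]
same_as_single_layer = all(stacked_linear(x) == collapsed(x) for x in xs)
relu_outputs = [with_relu(x) for x in xs]
print(same_as_single_layer, relu_outputs)
```

With the ReLU in place the output is flat for negative inputs and sloped for positive ones, a ‘bent’ shape that no single linear layer can produce.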

Then we create our Learner and call the fit function as before, except this time we are using the two-layer network and a lower learning rate of 0.1.

learn = Learner(dls, simple_net, opt_func=SGD, loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)

epoch	train_loss	valid_loss	batch_accuracy	time
0	0.317320	0.410194	0.504416	00:00
1	0.148899	0.226347	0.809127	00:00
2	0.082341	0.113844	0.916585	00:00
3	0.054091	0.077327	0.942591	00:00
4	0.040992	0.060660	0.955839	00:00
5	0.034286	0.051260	0.964181	00:00
6	0.030415	0.045308	0.965653	00:00
7	0.027889	0.041224	0.966634	00:00
8	0.026063	0.038250	0.968106	00:00
9	0.024646	0.035980	0.969087	00:00
10	0.023498	0.034179	0.971541	00:00
11	0.022539	0.032703	0.973013	00:00
12	0.021721	0.031463	0.973994	00:00
13	0.021015	0.030397	0.974485	00:00
14	0.020396	0.029464	0.974975	00:00
15	0.019849	0.028638	0.974975	00:00
16	0.019360	0.027900	0.976448	00:00
17	0.018920	0.027235	0.976448	00:00
18	0.018522	0.026633	0.976938	00:00
19	0.018158	0.026085	0.977429	00:00
20	0.017824	0.025587	0.977429	00:00
21	0.017515	0.025131	0.977429	00:00
22	0.017229	0.024712	0.978410	00:00
23	0.016962	0.024328	0.978901	00:00
24	0.016712	0.023974	0.979392	00:00
25	0.016477	0.023646	0.979392	00:00
26	0.016256	0.023343	0.979392	00:00
27	0.016046	0.023061	0.979882	00:00
28	0.015848	0.022799	0.980864	00:00
29	0.015659	0.022555	0.980864	00:00
30	0.015480	0.022328	0.980864	00:00
31	0.015309	0.022115	0.981845	00:00
32	0.015146	0.021915	0.982336	00:00
33	0.014990	0.021727	0.981845	00:00
34	0.014840	0.021551	0.981845	00:00
35	0.014697	0.021384	0.982336	00:00
36	0.014560	0.021228	0.982336	00:00
37	0.014427	0.021079	0.981845	00:00
38	0.014300	0.020939	0.981845	00:00
39	0.014178	0.020806	0.981845	00:00

plt.plot(L(learn.recorder.values).itemgot(2));

This pushes us up to 98% accuracy with a two-layer neural network! What would happen if we used, say, an 18-layer network?

dls = ImageDataLoaders.from_folder(path)
learn = vision_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)

epoch	train_loss	valid_loss	accuracy	time
0	0.093842	0.011492	0.996075	00:10

We have now reached 99.6% accuracy after a single epoch, which demonstrates the power of deep neural networks! Even though this is a fairly simple model (by today’s standards), it’s still a useful exercise for getting an early feel for training models from scratch on a dataset and producing high-quality results.