Identify MNIST Digits With 97% Accuracy By Using A 2 Layer Neural Network

Adding a hidden layer increased my accuracy by over 20%
Grokking Deep Learning
Deep Learning
Python
Author

Leo Gau

Published

March 16, 2021

What I’m Building

In this post, I’ll show you how I implemented a 2 layer neural network that achieves over 97% accuracy on the MNIST test set.

This network builds on the work in my previous posts. If you’d like a refresher, they are:

- How I Implemented The Most Simple Neural Network Using Python
- How I Identify Handwritten Digits Using Only Python
- How I Go From 70 Lines Of Code To Only 26 Using The NumPy Library

Now, I’ll show you the code first and then explain some concepts you need to understand what’s going on.

The Code

import numpy as np
np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

def flatten_image(image):
    return np.array(image).reshape(1, 28*28)

class NeuralNet:
    def __init__(self):
        self.alpha = 0.00001
        self.hidden_size = 1000
        self.weights_0_1 = np.random.random((28 * 28, self.hidden_size)) * 0.0001
        self.weights_1_2 = np.random.random((self.hidden_size, 10)) * 0.0001

    def predict(self, input):
        layer_0 = input
        layer_1 = relu(np.dot(layer_0, self.weights_0_1))
        layer_2 = np.dot(layer_1, self.weights_1_2)
        return layer_2

    def train(self, input, labels, epochs):
        for i in range(epochs):
            layer_2_error = 0
            for j in range(len(input)):
                # forward pass: input -> hidden layer (with ReLU) -> output layer
                layer_0 = input[j]
                layer_1 = relu(np.dot(layer_0, self.weights_0_1))
                layer_2 = np.dot(layer_1, self.weights_1_2)

                # one-hot encode the correct digit as the goal
                label = labels[j]
                goal = np.zeros(10)
                goal[label] = 1

                layer_2_error = np.sum((layer_2 - goal) ** 2)

                # backpropagate: push the output delta back to the hidden layer
                layer_2_delta = (layer_2 - goal)
                layer_1_delta = layer_2_delta.dot(self.weights_1_2.T) * relu2deriv(layer_1)

                # update both sets of weights by their weight deltas, scaled by alpha
                self.weights_1_2 = self.weights_1_2 - (self.alpha * layer_1.T.dot(layer_2_delta))
                self.weights_0_1 = self.weights_0_1 - (self.alpha * layer_0.T.dot(layer_1_delta))

            print("Error: " + str(layer_2_error))
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images = x_train
labels = y_train

prepared_images = [flatten_image(image) for image in images]
prepared_labels = np.array(labels)

nn = NeuralNet()
nn.train(prepared_images, prepared_labels, 5)

test_set = x_test
test_labels = y_test
num_correct = 0
for i in range(len(test_set)):
    prediction = nn.predict(flatten_image(test_set[i]))
    correct = test_labels[i]
    if np.argmax(prediction) == int(correct):
        num_correct += 1

print(str(num_correct/len(test_set) * 100) + "%")

Why Do I Need More Than One Layer?

At their core, neural networks find correlations between the input data and target data. Sometimes there’s just no correlation to be found using the number of weights given.

One way to increase the accuracy is to give the network more weights to use. And the way to give it more weights is to add more layers.

We might not have correlation between the input and output layers but we can create an extra layer to help us out.

Now, even with more weights, there’s another problem. If the new layer only does a weighted sum, then for any 3 layer network there is a 2 layer network which can do the exact same thing.

Since there is no special processing at the extra layer, it’s not contributing any new information to the network. It correlates 1:1 with the input layer. What we need is for the middle layer to sometimes correlate and sometimes not correlate with the input layer. We need it to have its own processing.

This is called conditional correlation. One way to create conditional correlation is to turn off the node when its value would be negative. If the value is negative, it would normally be negatively correlated with the input. However, if we turn it off (set the value to 0), then it doesn’t affect the output at all.

This means a node can be selectively correlated with inputs.

Let me flesh this out with an example.

Let’s say a node has 2 inputs, left and right. The left input is 1 and the right input is -1. If the node uses both inputs, its value would be 0. They cancel each other out. However, if we set the right input to 0, then the node is only correlated with the left input value. The node is now adding additional information to the network by saying, “Make me perfectly correlated with the left input, but only if the right input is 0.”
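Here’s a tiny sketch of that example in code. The relu function is the same one from the full listing above, and the weights of 1 on both inputs are my own assumption, just to make the arithmetic obvious.

import numpy as np

def relu(x):
    return (x > 0) * x

weights = np.array([1, 1])                        # one weight for the left input, one for the right

print(relu(np.dot(np.array([1, -1]), weights)))   # 0 -> the inputs cancel out and the node turns off
print(relu(np.dot(np.array([1, 0]), weights)))    # 1 -> the node mirrors the left input exactly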

This wasn’t possible earlier. This is the power of adding layers to the network.

The technical term for a situation where 2 variables are not predictable from a straight line is “nonlinearity”. The functions we use to create nonlinearities are called activation functions. The one I use in my network - turning off the node when its value would be negative - is called the Rectified Linear Unit (ReLU).

So that’s one piece of the puzzle. I add another layer to give the neural network more weights to play with, and I use an activation function to create conditional correlation.

Another problem you might be thinking about now is: how do we adjust the weights of the new layer? In a single layer network, the weight update comes directly from the delta and the input, but now we have weights that don’t connect directly to the output.

Backpropagation

The process of updating the weights in intermediate layers is called backpropagation.

How do you use the delta from the final layer (layer_2) to figure out how to change the weights feeding an intermediate layer (layer_1)? It turns out, through some calculus, I can push the layer_2 delta backwards through the weights between the two layers to get a delta for layer_1, and then multiply each layer’s delta by that layer’s inputs to get the weight updates.

If I had more layers, I could keep pushing the delta back one layer at a time, multiplying each layer’s delta by its inputs to get the weight_deltas.
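To make that concrete, here’s a rough, self-contained sketch of what the backward pass would look like with one extra hidden layer. The layer sizes and the third set of weights (weights_2_3) are hypothetical and only exist for this illustration; my actual network stops at weights_1_2.

import numpy as np
np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

alpha = 0.01

# hypothetical layer sizes, just for the sketch: 4 -> 5 -> 3 -> 2
weights_0_1 = np.random.random((4, 5)) * 0.0001
weights_1_2 = np.random.random((5, 3)) * 0.0001
weights_2_3 = np.random.random((3, 2)) * 0.0001

layer_0 = np.random.random((1, 4))
goal = np.array([[0, 1]])

# forward pass
layer_1 = relu(np.dot(layer_0, weights_0_1))
layer_2 = relu(np.dot(layer_1, weights_1_2))
layer_3 = np.dot(layer_2, weights_2_3)

# backward pass: push the delta back through each set of weights
layer_3_delta = layer_3 - goal
layer_2_delta = layer_3_delta.dot(weights_2_3.T) * relu2deriv(layer_2)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

# each weight update is that layer's input (transposed) times the delta above it
weights_2_3 -= alpha * layer_2.T.dot(layer_3_delta)
weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)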

With knowledge of activation functions and backpropagation, I can now break down what I did in code.

Import Libraries And Get Data

For an explanation of this code, look at some of my previous posts.

import numpy as np
np.random.seed(1)
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

images = x_train
labels = y_train

Activation Functions

Like I mentioned above, the activation function I used is called ReLU. ReLU will set any values that would be negative to 0. Any positive values remain as is.

def relu(x):
    return (x > 0) * x

During backpropagation, I don’t want to adjust a weight if ReLU set its node to 0. Therefore, I need a function to tell me whether ReLU did that or not.

The relu2deriv function will be used to cancel the weight adjustment for any node that ReLU zeroed out during the forward pass. If ReLU set the node’s value to 0, the weight should not be adjusted at all.

def relu2deriv(output):
    return output > 0
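A quick sanity check of both functions on a made-up array shows how they work together:

sample = np.array([-2, 0, 3])
print(relu(sample))              # [0 0 3] -- negative values get zeroed out
print(relu2deriv(relu(sample)))  # [False False True] -- only the node that stayed positive gets a weight update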

Finally, I have my trusty flatten_image function to prepare the data.

def flatten_image(image):
    return np.array(image).reshape(1, 28*28)
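For example, running it on the first training image (this assumes the MNIST data loaded above) turns the 28x28 grid into a single row of 784 pixel values:

print(np.array(images[0]).shape)       # (28, 28)
print(flatten_image(images[0]).shape)  # (1, 784)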

The NeuralNet class has three new features to look at.

First, it has 2 sets of weights and a hidden_size which determines the size of the hidden layer.

self.hidden_size = 1000
self.weights_0_1 = np.random.random((28 * 28, self.hidden_size)) * 0.0001
self.weights_1_2 = np.random.random((self.hidden_size, 10)) * 0.0001

A hidden_size of 1000 is going to give the network a lot more weights to use when figuring out the correlation between the images and labels.
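To put a rough number on “a lot more weights”:

print(28 * 28 * 1000)              # 784000 weights between the input pixels and the hidden layer
print(1000 * 10)                   # 10000 weights between the hidden layer and the 10 output nodes
print(28 * 28 * 1000 + 1000 * 10)  # 794000 weights in total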

Second, the prediction function now does 2 weighted sums.

layer_0 = input
layer_1 = relu(np.dot(layer_0, self.weights_0_1))
layer_2 = np.dot(layer_1, self.weights_1_2)

After setting layer_0 to the input, I take the weighted sum of layer_0 and the first set of weights, self.weights_0_1, and then apply the ReLU activation function to the result to get layer_1. The final layer, layer_2, is the weighted sum of layer_1 and the second set of weights, self.weights_1_2.
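If you want to convince yourself the shapes line up, here’s a quick standalone check. The weight arrays below are throwaway copies with the same shapes as the ones in __init__, using the relu, flatten_image, and images defined above.

check_weights_0_1 = np.random.random((28 * 28, 1000)) * 0.0001
check_weights_1_2 = np.random.random((1000, 10)) * 0.0001

layer_0 = flatten_image(images[0])                   # (1, 784)
layer_1 = relu(np.dot(layer_0, check_weights_0_1))   # (1, 1000)
layer_2 = np.dot(layer_1, check_weights_1_2)         # (1, 10)
print(layer_0.shape, layer_1.shape, layer_2.shape)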

Finally, let’s look at how the weights get updated.

layer_2_delta = (layer_2 - goal)
layer_1_delta = layer_2_delta.dot(self.weights_1_2.T) * relu2deriv(layer_1)

layer_2_weight_delta = layer_1.T.dot(layer_2_delta)
layer_1_weight_delta = layer_0.T.dot(layer_1_delta)

self.weights_1_2 = self.weights_1_2 - (self.alpha * layer_2_weight_delta)
self.weights_0_1 = self.weights_0_1 - (self.alpha * layer_1_weight_delta)

I get the layer_2_delta like before, by taking the difference between the prediction and the goal. The layer_1_delta is derived by taking the weighted sum of the layer_2_delta and the weights connected to that layer, self.weights_1_2. I also need to use the relu2deriv function at this point so that nodes ReLU turned off during the forward pass don’t get their weights adjusted.

A note on the .T syntax. .T in NumPy is shorthand for transpose. It lets me reshape the matrix so everything lines up correctly to do matrix math.
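For example, layer_1 has shape (1, 1000) and layer_2_delta has shape (1, 10), which can’t be multiplied directly. Transposing layer_1 first makes the result land exactly on the shape of self.weights_1_2. The arrays below are just zero-filled placeholders to show the shapes.

layer_1_example = np.zeros((1, 1000))
layer_2_delta_example = np.zeros((1, 10))
print(layer_1_example.T.shape)                             # (1000, 1)
print(layer_1_example.T.dot(layer_2_delta_example).shape)  # (1000, 10) -- same shape as weights_1_2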

So that’s the deltas, but how much do I adjust the weights by?

I find the layer_2_weight_delta by calculating the weighted sum of layer_1 and layer_2_delta. So it’s the inputs into the layer and the delta at that layer. The layer_1_weight_delta is the same. I calculate the weighted sum of the inputs into that layer, layer_0, and the delta, layer_1_delta.

The weights are still adjusted by subtracting the weight deltas multiplied by some learning rate (self.alpha).

class NeuralNet:
    def __init__(self):
        self.alpha = 0.00001
        self.hidden_size = 1000
        self.weights_0_1 = np.random.random((28 * 28, self.hidden_size)) * 0.0001
        self.weights_1_2 = np.random.random((self.hidden_size, 10)) * 0.0001

    def predict(self, input):
        layer_0 = input
        layer_1 = relu(np.dot(layer_0, self.weights_0_1))
        layer_2 = np.dot(layer_1, self.weights_1_2)
        return layer_2

    def train(self, input, labels, epochs):
        for i in range(epochs):
            layer_2_error = 0
            for j in range(len(input)):
                layer_0 = input[j]
                layer_1 = relu(np.dot(layer_0, self.weights_0_1))
                layer_2 = np.dot(layer_1, self.weights_1_2)

                label = labels[j]
                goal = np.zeros(10)
                goal[label] = 1

                layer_2_error = np.sum((layer_2 - goal) ** 2)

                layer_2_delta = (layer_2 - goal)
                layer_1_delta = layer_2_delta.dot(self.weights_1_2.T) * relu2deriv(layer_1)

                layer_2_weight_delta = layer_1.T.dot(layer_2_delta)
                layer_1_weight_delta = layer_0.T.dot(layer_1_delta)

                self.weights_1_2 = self.weights_1_2 - (self.alpha * layer_2_weight_delta)
                self.weights_0_1 = self.weights_0_1 - (self.alpha * layer_1_weight_delta)

            print("Error: " + str(layer_2_error))
prepared_images = [flatten_image(image) for image in images]
prepared_labels = np.array(labels)

nn = NeuralNet()
nn.train(prepared_images, prepared_labels, 5)

test_set = x_test
test_labels = y_test
num_correct = 0
for i in range(len(test_set)):
    prediction = nn.predict(flatten_image(test_set[i]))
    correct = test_labels[i]
    if np.argmax(prediction) == int(correct):
        num_correct += 1

print(str(num_correct/len(test_set) * 100) + "%")
Error: 0.10415136838556452
Error: 0.07391762691913144
Error: 0.0634422993538635
Error: 0.05133955942375137
Error: 0.039768839003841615
97.48%

If I train this network over 5 epochs like the other networks, I see the error go down to about 0.04 and I get 97.48% accuracy.

Not too shabby! And what an improvement over the 1 layer network which only got 76% correct!

What’s Next?

Next I’ll dive into regularization and try to increase the accuracy even further.

See you then!

Find me on Twitter if you want to discuss any of what I’ve written!