# Identify MNIST Digits With 97% Accuracy By Using A 2 Layer Neural Network

Adding a hidden layer increased my accuracy by over 20 percentage points.

- What I'm Building
- The Code
- Why Do I Need More Than One Layer?
- Backpropagation
- Import Libraries And Get Data
- Activation Functions
- What's Next?

In this post, I'll show you how I implemented a 2 layer neural network that achieves over 97% accuracy on the MNIST test set.

This network builds on the work in my previous posts. If you'd like a refresher, they are:

- How I Implemented The Most Simple Neural Network Using Python
- How I Identify Handwritten Digits Using Only Python
- How I Go From 70 Lines Of Code To Only 26 Using The NumPy Library

Now, I'll show you the code first and then explain some concepts you need to understand what's going on.

```
import numpy as np

np.random.seed(1)


def relu(x):
    return (x > 0) * x


def relu2deriv(output):
    return output > 0


def flatten_image(image):
    return np.array(image).reshape(1, 28*28)


class NeuralNet:
    def __init__(self):
        self.alpha = 0.00001
        self.hidden_size = 1000
        self.weights_0_1 = np.random.random((28 * 28, self.hidden_size)) * 0.0001
        self.weights_1_2 = np.random.random((self.hidden_size, 10)) * 0.0001

    def predict(self, input):
        layer_0 = input
        layer_1 = relu(np.dot(layer_0, self.weights_0_1))
        layer_2 = np.dot(layer_1, self.weights_1_2)
        return layer_2

    def train(self, input, labels, epochs):
        for i in range(epochs):
            layer_2_error = 0
            for j in range(len(input)):
                layer_0 = input[j]
                layer_1 = relu(np.dot(layer_0, self.weights_0_1))
                layer_2 = np.dot(layer_1, self.weights_1_2)
                label = labels[j]
                goal = np.zeros(10)
                goal[label] = 1
                layer_2_error = np.sum((layer_2 - goal) ** 2)
                layer_2_delta = (layer_2 - goal)
                layer_1_delta = layer_2_delta.dot(self.weights_1_2.T) * relu2deriv(layer_1)
                self.weights_1_2 = self.weights_1_2 - (self.alpha * layer_1.T.dot(layer_2_delta))
                self.weights_0_1 = self.weights_0_1 - (self.alpha * layer_0.T.dot(layer_1_delta))
            print("Error: " + str(layer_2_error))
```

```
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
images = x_train
labels = y_train

prepared_images = [flatten_image(image) for image in images]
prepared_labels = np.array(labels)

nn = NeuralNet()
nn.train(prepared_images, prepared_labels, 5)

test_set = x_test
test_labels = y_test
num_correct = 0
for i in range(len(test_set)):
    prediction = nn.predict(flatten_image(test_set[i]))
    correct = test_labels[i]
    if np.argmax(prediction) == int(correct):
        num_correct += 1
print(str(num_correct/len(test_set) * 100) + "%")
```

At their core, neural networks find correlations between the input data and target data. Sometimes there's just no correlation to be found using the number of weights given.

One way to increase the accuracy is to give the network more weights to use. And the way to give it more weights is to add more layers.

We might not have correlation between the input and output layers but we can create an extra layer to help us out.

Now, even with more weights, there's another problem. If the extra layer is nothing more than another weighted sum, then for any 3 layer network there is a 2 layer network which can do the exact same thing.
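Here's a quick sketch of why that is. The layer sizes and variable names below are made up just for this illustration. If nothing happens between two weighted sums, the two weight matrices can be multiplied together into a single one that gives the same output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes, chosen only for illustration.
x = rng.random((1, 4))            # one input row with 4 values
weights_0_1 = rng.random((4, 3))  # input -> middle layer
weights_1_2 = rng.random((3, 2))  # middle layer -> output

# Two weighted sums in a row, with no processing in between.
two_step = x.dot(weights_0_1).dot(weights_1_2)

# A single weight matrix that does the same job in one step.
combined = weights_0_1.dot(weights_1_2)
one_step = x.dot(combined)

print(np.allclose(two_step, one_step))  # True
```

So without something extra happening at the middle layer, the additional weights don't let the network do anything new.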

Since there is no special processing at the extra layer, it’s not contributing any new information to the network. It correlates 1:1 with the input layer. What we need is for the middle layer to sometimes correlate and sometimes not correlate with the input layer. We need it to have its own processing.

This is called *conditional correlation*. One way to create conditional correlation is to turn off the node when the value would be negative. If the value is negative, it would normally be negatively correlated with the input. However, if we turn it off (set the value to 0) then it doesn’t affect the output at all.

This means a node can be selectively correlated with inputs.

Let me flesh this out with an example.

Let’s say a node has 2 inputs, left and right. The left input is 1 and the right input is -1. If we use both inputs, the node would be 0 because they cancel each other out. However, if we set the right input to 0, then the node would only be correlated with the left input value. The node is now adding additional information to the network by saying, “Make me perfectly correlated with the left input, but only if the right input is 0.”

This wasn’t possible earlier. This is the power of adding layers to the network.

The technical term for a situation where 2 variables are not predictable from a straight line is “nonlinearity”. The functions we use to create nonlinearities are called *activation functions*. The one I use in my network, which turns off the node when its value would be negative, is called the Rectified Linear Unit (ReLU).
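To make that concrete, here's a small sketch of ReLU acting on a single node. The node, its weights, and the three input rows are hypothetical values I picked for illustration.

```python
import numpy as np

def relu(x):
    # Keep positive values, replace negative values with 0.
    return (x > 0) * x

# Hypothetical node with two inputs (left, right) and a weight of 1 on each.
weights = np.array([1.0, 1.0])
inputs = np.array([
    [1.0,  0.0],   # weighted sum is 1: the node passes the value through
    [1.0, -1.0],   # weighted sum is 0: the inputs cancel out
    [1.0, -3.0],   # weighted sum is -2: ReLU turns the node off
])

raw = inputs.dot(weights)   # weighted sums before the activation
activated = relu(raw)       # what the node actually outputs

print(raw)
print(activated)
```

The last row is the interesting one: the raw value is negative, so the node outputs 0 and stops influencing the next layer for that input.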

So that's one piece of the puzzle. I add another layer to give the neural network more weights to play with and create conditional correlation with activation functions.

Another problem you might be thinking about now is, how do we adjust the weights of the new layers? In a single layer network, we multiply the delta by the input to get the weight update, but now we have weights that didn’t directly contribute to the loss.

The process of updating the weights in intermediate layers is called backpropagation.

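How do you use the delta from the final layer (layer_2) to figure out how to change the weights in an intermediate layer (layer_1)? It turns out, through some calculus, I can multiply the layer_2 delta by the weights connecting layer_1 to layer_2 to get a delta for layer_1. Then, for each set of weights, I multiply the delta at that layer with the inputs into that layer.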

If I had more layers, I could keep passing the delta back one layer at a time and multiplying it with the node inputs to get the weight deltas.
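If you'd rather not take the calculus on faith, here's a rough sanity check on a tiny made-up network. It compares the backpropagated weight delta for one weight against a numerical estimate obtained by nudging that weight. The sizes, the `half_squared_error` helper, and the choice of weight are all assumptions for this sketch. I use half the squared error here because its derivative is exactly `layer_2 - goal`.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

def half_squared_error(w01, w12, x, goal):
    # Hypothetical helper: forward pass followed by 0.5 * squared error.
    layer_1 = relu(x.dot(w01))
    layer_2 = layer_1.dot(w12)
    return 0.5 * np.sum((layer_2 - goal) ** 2)

rng = np.random.default_rng(0)
x = rng.random((1, 5))
goal = np.array([[0.0, 1.0, 0.0]])
w01 = rng.random((5, 4))
w12 = rng.random((4, 3))

# Backpropagation: pass the delta back, then multiply by the layer's inputs.
layer_1 = relu(x.dot(w01))
layer_2 = layer_1.dot(w12)
layer_2_delta = layer_2 - goal
layer_1_delta = layer_2_delta.dot(w12.T) * relu2deriv(layer_1)
layer_1_weight_delta = x.T.dot(layer_1_delta)

# Numerical estimate for a single weight: nudge it and watch the error change.
eps = 1e-6
bumped = w01.copy()
bumped[2, 1] += eps
numeric = (half_squared_error(bumped, w12, x, goal)
           - half_squared_error(w01, w12, x, goal)) / eps

print(np.isclose(numeric, layer_1_weight_delta[2, 1], atol=1e-4))
```

If the two numbers agree, the delta passed back through the weights really is the slope of the error with respect to that first-layer weight.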

With knowledge of activation functions and backpropagation, I can now break down what I did in code.

```
import numpy as np
np.random.seed(1)
```

```
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
images = x_train
labels = y_train
```

```
def relu(x):
    return (x > 0) * x
```

During backpropagation, I don’t want to adjust the weights feeding a node if ReLU set that node to 0. Therefore, I need a function to tell me if ReLU did that or not.

The `relu2deriv` function will be used to cancel the weight adjustment if the node's value was altered at prediction time. If ReLU set the value to 0, the weight should not be adjusted at all.

```
def relu2deriv(output):
    return output > 0
```
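Here's a small sketch of what that looks like in practice. The `weighted_sum` and `incoming_delta` values are hypothetical numbers I chose so the effect is easy to see.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

# Hypothetical weighted sums for a hidden layer with 3 nodes.
weighted_sum = np.array([[-1.5, 0.2, 3.0]])
layer_1 = relu(weighted_sum)

# True where the node stayed on, False where ReLU set it to 0.
mask = relu2deriv(layer_1)

# Hypothetical delta arriving from the next layer.
incoming_delta = np.array([[0.4, 0.4, 0.4]])

# Multiplying by the mask zeroes the delta for the node ReLU turned off.
layer_1_delta = incoming_delta * mask

print(mask)
print(layer_1_delta)
```

The first node was turned off by ReLU, so its delta becomes 0 and the weights feeding it are left alone for this example.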

Finally, I have my trusty `flatten_image` function to prepare the data.

```
def flatten_image(image):
    return np.array(image).reshape(1, 28*28)
```
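As a quick check of what this does to the shape, here's a sketch that uses an all-zeros array as a stand-in for a real MNIST image (the `fake_image` name is mine).

```python
import numpy as np

def flatten_image(image):
    return np.array(image).reshape(1, 28*28)

# Stand-in for one 28x28 MNIST image; the real data comes from mnist.load_data().
fake_image = np.zeros((28, 28), dtype=np.uint8)

flat = flatten_image(fake_image)
print(fake_image.shape)  # (28, 28)
print(flat.shape)        # (1, 784)
```

Each image becomes a single row of 784 values, which is the shape the first set of weights expects.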

The `NeuralNet` class has three new features to look at.

First, it has 2 sets of weights and a `hidden_size` which determines the size of the hidden layer.

```
self.hidden_size = 1000
self.weights_0_1 = np.random.random((28 * 28, self.hidden_size)) * 0.0001
self.weights_1_2 = np.random.random((self.hidden_size, 10)) * 0.0001
```

A `hidden_size=1000` is going to give the network a lot more weights to figure out the correlation between the images and labels.
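To put a number on "a lot more", here's a rough count. I'm assuming the single layer network connected the 784 inputs directly to the 10 outputs; the variable names are just for this sketch.

```python
input_size = 28 * 28   # 784 pixels per image
hidden_size = 1000
output_size = 10

# Assumed single layer network: inputs connect straight to outputs.
single_layer_weights = input_size * output_size

# Network in this post: inputs -> hidden layer -> outputs.
two_layer_weights = input_size * hidden_size + hidden_size * output_size

print(single_layer_weights)  # 7840
print(two_layer_weights)     # 794000
```

Under that assumption, the hidden layer gives the network roughly 100 times as many weights to work with.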

Second, the prediction function now does 2 weighted sums:

```
layer_0 = input
layer_1 = relu(np.dot(layer_0, self.weights_0_1))
layer_2 = np.dot(layer_1, self.weights_1_2)
```

After setting `layer_0` as the inputs, I calculate `layer_1` by taking the weighted sum of `layer_0` and the first set of weights, `self.weights_0_1`. I then use our activation function on the result to get `layer_1`.

The final layer, `layer_2`, is the weighted sum of `layer_1` and the second set of weights, `self.weights_1_2`.
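Here's a sketch of the shapes at each step of the forward pass. It uses a random row as a stand-in for one flattened image, and local variables instead of the class attributes.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

rng = np.random.default_rng(1)

# Stand-in for one flattened image; real pixel values would come from MNIST.
layer_0 = rng.random((1, 28 * 28))
weights_0_1 = rng.random((28 * 28, 1000)) * 0.0001
weights_1_2 = rng.random((1000, 10)) * 0.0001

layer_1 = relu(np.dot(layer_0, weights_0_1))  # hidden layer values
layer_2 = np.dot(layer_1, weights_1_2)        # one score per digit

print(layer_0.shape, layer_1.shape, layer_2.shape)  # (1, 784) (1, 1000) (1, 10)
print(np.argmax(layer_2))  # index of the highest score, used as the predicted digit
```

The network's guess is whichever of the 10 outputs has the highest value, which is why the test loop uses `np.argmax`.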

Finally, let's look at how the weights get updated.

```
layer_2_delta = (layer_2 - goal)
layer_1_delta = layer_2_delta.dot(self.weights_1_2.T) * relu2deriv(layer_1)
layer_2_weight_delta = layer_1.T.dot(layer_2_delta)
layer_1_weight_delta = layer_0.T.dot(layer_1_delta)
self.weights_1_2 = self.weights_1_2 - (self.alpha * layer_2_weight_delta)
self.weights_0_1 = self.weights_0_1 - (self.alpha * layer_1_weight_delta)
```

I get the `layer_2_delta` like before, getting the difference between the prediction and the goal. The `layer_1_delta` is derived by taking the weighted sum between the `layer_2_delta` and the weights connected to that layer, `self.weights_1_2`. I also need to use the `relu2deriv` function at this point to tell me if ReLU adjusted the node values or not.

A note on the `.T` syntax. `.T` in NumPy is shorthand for transpose. It flips the rows and columns of the matrix so everything lines up correctly to do matrix math.
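Here's a sketch of how the shapes line up, using arrays of ones in place of real values since only the shapes matter here.

```python
import numpy as np

# Placeholder arrays with the same shapes as in the network.
layer_1 = np.ones((1, 1000))
layer_2_delta = np.ones((1, 10))
weights_1_2 = np.ones((1000, 10))

# Transposing swaps the rows and columns.
print(weights_1_2.T.shape)  # (10, 1000)

# (1, 10) dot (10, 1000) gives one delta per hidden node.
layer_1_delta_shape = layer_2_delta.dot(weights_1_2.T).shape
print(layer_1_delta_shape)  # (1, 1000)

# (1000, 1) dot (1, 10) gives one adjustment per weight in weights_1_2.
layer_2_weight_delta_shape = layer_1.T.dot(layer_2_delta).shape
print(layer_2_weight_delta_shape)  # (1000, 10)
```

Without the transposes, the inner dimensions wouldn't match and NumPy would raise an error.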

So that's the deltas, but how much do I adjust the weights by?

I find the `layer_2_weight_delta` by calculating the weighted sum of `layer_1` and `layer_2_delta`. So it's the inputs into the layer and the delta at that layer.

The `layer_1_weight_delta` is the same. I calculate the weighted sum of the inputs into that layer, `layer_0`, and the delta, `layer_1_delta`.

The weights are still adjusted by subtracting the weight deltas multiplied by some learning rate (`self.alpha`).
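To see the update doing its job, here's a sketch of a single training step on a tiny made-up network. The sizes, the `error` helper, and the learning rate of `0.01` are assumptions for this toy example, not the values from my network.

```python
import numpy as np

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

rng = np.random.default_rng(0)
alpha = 0.01                       # hypothetical learning rate for this toy example
layer_0 = rng.random((1, 4))       # one made-up input row
goal = np.array([[1.0, 0.0]])
weights_0_1 = rng.random((4, 3))
weights_1_2 = rng.random((3, 2))

def error(w01, w12):
    # Hypothetical helper: squared error of the prediction for this one input.
    layer_2 = relu(layer_0.dot(w01)).dot(w12)
    return np.sum((layer_2 - goal) ** 2)

error_before = error(weights_0_1, weights_1_2)

# Forward pass, then the deltas.
layer_1 = relu(layer_0.dot(weights_0_1))
layer_2 = layer_1.dot(weights_1_2)
layer_2_delta = layer_2 - goal
layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

# Weight deltas: inputs into each layer times the delta at that layer.
layer_2_weight_delta = layer_1.T.dot(layer_2_delta)
layer_1_weight_delta = layer_0.T.dot(layer_1_delta)

# Update both sets of weights.
weights_1_2 = weights_1_2 - alpha * layer_2_weight_delta
weights_0_1 = weights_0_1 - alpha * layer_1_weight_delta

error_after = error(weights_0_1, weights_1_2)
print(error_before, error_after)
```

With a small enough learning rate, the error on that example should come out lower after the step than before it.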

```
class NeuralNet:
    def __init__(self):
        self.alpha = 0.00001  # learning rate
        self.hidden_size = 1000
        # Small random starting weights for both sets of connections
        self.weights_0_1 = np.random.random((28 * 28, self.hidden_size)) * 0.0001
        self.weights_1_2 = np.random.random((self.hidden_size, 10)) * 0.0001

    def predict(self, input):
        layer_0 = input
        layer_1 = relu(np.dot(layer_0, self.weights_0_1))
        layer_2 = np.dot(layer_1, self.weights_1_2)
        return layer_2

    def train(self, input, labels, epochs):
        for i in range(epochs):
            layer_2_error = 0
            for j in range(len(input)):
                # Forward pass
                layer_0 = input[j]
                layer_1 = relu(np.dot(layer_0, self.weights_0_1))
                layer_2 = np.dot(layer_1, self.weights_1_2)
                # One-hot goal: 1 at the index of the correct digit
                label = labels[j]
                goal = np.zeros(10)
                goal[label] = 1
                layer_2_error = np.sum((layer_2 - goal) ** 2)
                # Backpropagate the deltas
                layer_2_delta = (layer_2 - goal)
                layer_1_delta = layer_2_delta.dot(self.weights_1_2.T) * relu2deriv(layer_1)
                layer_2_weight_delta = layer_1.T.dot(layer_2_delta)
                layer_1_weight_delta = layer_0.T.dot(layer_1_delta)
                # Update the weights
                self.weights_1_2 = self.weights_1_2 - (self.alpha * layer_2_weight_delta)
                self.weights_0_1 = self.weights_0_1 - (self.alpha * layer_1_weight_delta)
            # Error of the last training example in this epoch
            print("Error: " + str(layer_2_error))
```

```
prepared_images = [flatten_image(image) for image in images]
prepared_labels = np.array(labels)

nn = NeuralNet()
nn.train(prepared_images, prepared_labels, 5)

test_set = x_test
test_labels = y_test
num_correct = 0
for i in range(len(test_set)):
    prediction = nn.predict(flatten_image(test_set[i]))
    correct = test_labels[i]
    if np.argmax(prediction) == int(correct):
        num_correct += 1
print(str(num_correct/len(test_set) * 100) + "%")
```

If I train this network over 5 epochs like the other networks, I see the error go down to `0.03` and I get `97.48%` accuracy.

Not too shabby! And what an improvement over the 1 layer network which only got `76%` correct!

Next I'll dive into regularization and try to increase the accuracy even further.

See you then!

Find me on Twitter if you want to discuss any of what I've written!