I recently remade my own personal Java neural network framework, mainly for educational purposes. After some trial and error with simple cases, it was definitely learning! So all good!… Right?!?

## The subtle errors

Backpropagation can be arduous to implement correctly. This is not necessarily because the theory or math behind it is too hard to understand, but because an implementation with subtle errors can still return promising results, just not as good as those of a correct implementation.

Some examples of such implementation errors:

- **The off by 1:** An off-by-one error could mean that every node except the first node in a layer is trained. Most of the network will still be trained and can learn to “ignore” the input of the first node, but this is a waste of resources.

  The same applies to the first weight connecting to every node, and to other variations.

- **The forgotten bias:** As the name suggests, this is when you forget to apply backpropagation to the bias weights in addition to the “traditional” weights.

  A neural network does not strictly require biases, but they certainly improve its performance.

- **The neglected term:** This was my personal pitfall. When calculating the partial derivative, I made the huge mistake of leaving out the activation prime (e.g. sigmoidPrime) term.

  Since the sign of the gradient wasn’t flipped in most cases at this stage, the neural network mostly kept learning in the correct direction, but the learning steps were of the wrong size, and sometimes even in the completely wrong direction! And the error kept building up with every layer.

These are just some of the things you might get wrong. Off the top of my head, and without any empirical data, I believe that even an implementation containing all of these errors would still be able to train a simple neural network to some extent.
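To make the neglected term concrete, here is a minimal sketch of the delta for a single sigmoid output neuron. It assumes a squared error cost \(J = \frac{1}{2}(a - target)^2\) with \(a = sigmoid(z)\); the class and method names are hypothetical and not taken from my framework.

```java
// Illustrative sketch only: one sigmoid output neuron with an assumed
// squared error cost J = 0.5 * (a - target)^2, where a = sigmoid(z).
final class OutputDelta {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double sigmoidPrime(double z) {
        double s = sigmoid(z);
        return s * (1.0 - s);
    }

    // Correct: dJ/dz = (a - target) * sigmoidPrime(z)
    static double correctDelta(double z, double target) {
        return (sigmoid(z) - target) * sigmoidPrime(z);
    }

    // Faulty: the activation prime term is left out
    static double neglectedDelta(double z, double target) {
        return sigmoid(z) - target;
    }
}
```

With `z = 0` and `target = 0`, the correct delta is 0.125 while the faulty one is 0.5: the same sign, but a step four times too large, which is the kind of error that still lets a network appear to learn.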

## How to detect a faulty implementation?

First, I’d like to explain the process before we move into the uncharted territory of math and equations.

To detect these faulty implementations we use a method named numerical gradient checking, which validates that the partial derivatives calculated by your model are equal to, or at least very close to, the actual derivatives.

It’s important to stress that numerical gradient checking **won’t** in any way help you find where the error is. It can only increase your certainty that your implementation is correct or incorrect, and to what degree.

The gradient is really just a glorified way of describing the slope of a function at any given point. So what we are really doing is comparing our calculated derivative with an approximation of that slope.

## The Math

Suppose we have a function \(J(\theta)\) of \(\theta\) that calculates the error of our neural network, and which we therefore want to minimize.

Now also suppose we have a function \(g(\theta)\) which supposedly computes the derivative \(\frac{d}{d\theta}J(\theta)\). But how would we verify our implementation of \(g(\theta)\)?

We know that the slope of a function is given by the change in y relative to the change in x:

$$slope = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$

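If we pick two points on either side of \(\theta\), namely \(\theta - \epsilon\) and \(\theta + \epsilon\), the change in x is \(2\epsilon\), which means that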

$$slope \approx \frac{J(\theta + \epsilon)\ -\ J(\theta\, -\, \epsilon)}{2\epsilon}$$

And since this is exactly what \(g(\theta)\) is supposed to compute, the following should hold:

$$g(\theta) \approx \frac{J(\theta + \epsilon)\ -\ J(\theta\, -\, \epsilon)}{2\epsilon}$$

In practice \(\epsilon\) is set to a small constant, often around \(10^{-4}\). Making it much smaller causes problems, because floating-point rounding errors start to dominate the tiny difference in the numerator.
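The slope formula translates almost directly into code. The following is a minimal sketch for a function of a single parameter; `ScalarCost` and `SlopeEstimate` are hypothetical names introduced only for this illustration.

```java
// Illustrative sketch: two-sided (central) difference for a one-parameter function.
final class SlopeEstimate {

    /** Hypothetical stand-in for the error function J(theta). */
    interface ScalarCost {
        double at(double theta);
    }

    // slope ~ (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)
    static double slope(ScalarCost cost, double theta, double epsilon) {
        double plus = cost.at(theta + epsilon);
        double minus = cost.at(theta - epsilon);
        return (plus - minus) / (2.0 * epsilon);
    }
}
```

As a sanity check, take \(J(\theta) = \theta^2\), whose derivative is \(g(\theta) = 2\theta\). Calling `SlopeEstimate.slope(t -> t * t, 3.0, 1e-4)` should return a value very close to 6.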

## In practice

So to perform numerical gradient checking in practice, we first use our backpropagation algorithm (\(g(\theta)\)) to compute the partial derivatives of the error with respect to our weights. Then, for every weight in our network, we estimate the slope of our error/cost function \(J(\theta)\) by changing that single weight’s value by a very small amount \(\epsilon\) in both directions and applying the slope formula.
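The per-weight procedure could be sketched roughly as follows. This assumes all weights of the network are available as one flat `double[]` and that the error can be recomputed through a `Cost` callback; both are assumptions made for the example, not a description of any particular framework.

```java
// Illustrative sketch: estimate the slope of the error for every weight in turn.
final class NumericalGradient {

    /** Hypothetical stand-in for "run the network and return the error J". */
    interface Cost {
        double evaluate(double[] weights);
    }

    static double[] estimateSlopes(Cost cost, double[] weights, double epsilon) {
        double[] slopes = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double original = weights[i];

            // Nudge only this weight upwards and measure the error
            weights[i] = original + epsilon;
            double errorPlus = cost.evaluate(weights);

            // Nudge only this weight downwards and measure the error
            weights[i] = original - epsilon;
            double errorMinus = cost.evaluate(weights);

            // Restore the weight before touching the next one
            weights[i] = original;

            slopes[i] = (errorPlus - errorMinus) / (2.0 * epsilon);
        }
        return slopes;
    }
}
```

Note that this needs two full error evaluations per weight, which is why gradient checking is something you run as a test and not during normal training.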

We are now able to compare our estimated slopes with our calculated derivatives. By subtracting them you might get low values around 0.00003 or higher values around 0.007. But how small is small enough?

An \(\epsilon\) value of \(10^{-4}\) should give you agreement to at least 4 digits, and usually more. That is great information, so now bugger off and check all 1 million weights and their derivatives by hand.

Clearly, this is only viable if your network is small. So another method is to compute a single value that represents how accurate your derivatives are.

First, gather the calculated derivatives into one vector (gradients) and the estimated slopes into another (slopes). Then divide the norm of the difference by the norm of the sum of the vectors:

$$Error = \frac{norm(gradients - slopes)}{norm(gradients + slopes)}$$

The Error should be around \(10^{-8}\) or less if you have calculated your derivatives correctly.
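A minimal sketch of this single-value check is shown below. It assumes the Euclidean (L2) norm, since the formula does not say which norm is meant, and the handling of a zero denominator is my own choice for the example. All names are hypothetical.

```java
// Illustrative sketch: relative error between backprop gradients and estimated slopes.
final class GradientError {

    // Euclidean (L2) norm; an assumption, the formula only says "norm"
    static double norm(double[] v) {
        double sumOfSquares = 0.0;
        for (double x : v) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    static double relativeError(double[] gradients, double[] slopes) {
        if (gradients.length != slopes.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double[] difference = new double[gradients.length];
        double[] sum = new double[gradients.length];
        for (int i = 0; i < gradients.length; i++) {
            difference[i] = gradients[i] - slopes[i];
            sum[i] = gradients[i] + slopes[i];
        }
        double numerator = norm(difference);
        double denominator = norm(sum);
        if (denominator == 0.0) {
            // Assumed convention: identical zero vectors count as a perfect match
            return numerator == 0.0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return numerator / denominator;
    }
}
```

Dividing by the norm of the sum makes the result independent of the overall scale of the gradients, so the same threshold can be used for networks with very large or very small derivatives.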

## Common mistakes

While implementing numerical gradient checking isn’t that hard a task, people often make one of these common mistakes on their first try:

- **Calculating the slope of the output and not the error.** Sometimes when implementing gradient checking, people use a simple network with a single output. Then, when they calculate the slope of \(J(\theta)\), they mistake \(J(\theta)\) for the output of the neural network rather than the error/cost. Remember: we want to minimize the error, **not** the output.
- **Forgetting to reset the weights.** After changing a weight by a small amount in both directions, people sometimes forget to reset the weight to its initial value before moving on to the next weight, which skews their results.