Hello

I started coding a library in C to run experiments on neural network size, activation function, batch size etc. I am currently facing two problems.

**1.** My NN trains on the MNIST data set and only learns the training data after about 200 epochs, which as I understand it is a lot. I also only get down to an error rate of about 20%, and anything lower takes extremely long. This does not make sense, because in the TensorFlow tutorials I have seen the error rate drops really low (<10%) after just 5 epochs. My NN uses stochastic gradient descent, i.e. the weight values are updated after every sample. I thought this could be the problem, so I tried changing my code to implement mini-batch gradient descent, which led me to my second problem.

**2.** I don't understand how to update the weights with mini-batch gradient descent. When I feed a training sample into the network I can calculate the cost of that specific sample and the overall error that must be propagated back. I save this error into a variable that accumulates over my mini-batch loop. Once done, I have a variable that contains the accumulated error of all the samples in my mini-batch. Now I must update my weights, and this is where I get stuck. To update the weights, the error must be multiplied by the hidden unit values, which depend on the input sample you feed the network. Which values for the hidden units do I use? I hope my question makes sense.

1. Assuming that your NN has no bugs in it and that you're using a fairly robust optimizer (Adam, for example), the first thing I would try is a larger learning rate. However, if you're converging at around a 20% error rate (i.e. significantly higher than other people's results with the same network and the same dataset), this suggests that either your loss function is buggy or your optimizer is stuck in some local minimum. The latter is very unlikely on a high-dimensional surface.

2. Mini-batch gradient descent is similar to SGD, except that you take a larger sample (rather than just one) to estimate the gradient. So you could simply "for each batch, average the gradients you get before back propagation".

Posted by: cl4ptp

1. Assuming that your NN has no bugs in it and that you're using a fairly robust optimizer (Adam, for example), the first thing I would try is a larger learning rate. However, if you're converging at around a 20% error rate (i.e. significantly higher than other people's results with the same network and the same dataset), this suggests that either your loss function is buggy or your optimizer is stuck in some local minimum. The latter is very unlikely on a high-dimensional surface.

My network uses standard stochastic gradient descent. No optimizer is used. I believe my network does not have any bugs, but there obviously are some, because it does not work properly. My cost function is as follows:

    /* Squared-error cost for a single output: 0.5 * error^2. */
    float cost(float error) {
        return 0.5f * error * error;
    }

The error is the correct answer minus the network's guess.
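To make the relationship between this cost and "the gradient" concrete, here is a minimal sketch of the cost together with its derivative with respect to the network output. The function names `cost_value` and `cost_gradient` are hypothetical and are not part of the library discussed here; the sketch only assumes the squared-error cost and the error definition given above.

```c
/* Hypothetical helper: squared-error cost for one output,
   with error = target - output as defined in the post. */
float cost_value(float target, float output)
{
    float error = target - output;
    return 0.5f * error * error;
}

/* Hypothetical helper: derivative of that cost with respect to the output.
   d/dO [0.5 * (t - O)^2] = -(t - O) = O - t. */
float cost_gradient(float target, float output)
{
    return output - target;
}
```

Note the sign: with the error defined as target minus output, the derivative with respect to the output is output minus target, so it is this value (not the cost itself) that gets propagated back.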

Posted by: cl4ptp

2. Mini-batch gradient descent is similar to SGD, except that you take a larger sample (rather than just one) to estimate the gradient. So you could simply "for each batch, average the gradients you get before back propagation".

Can you please clarify what the gradient is? Is the gradient the derivative of the cost function? Do you average the cost over all batch samples, then differentiate the average cost and propagate that back?

Whether or not this is the case, when the gradient propagates back and you calculate the change you must make to the weights, you multiply the gradient by either the hidden unit value or, on a simple perceptron model, the input value. If the gradient was calculated over many inputs, which input value/hidden unit value do you multiply the gradient by to get the delta weight value?

Let me try to clarify what I am asking. I have attached a picture of a simple network. To calculate the change in w1 we use the formula:

∂C/∂w1= ∂C/∂O*∂O/∂h*∂h/∂w1

which is equal to

∂C/∂w1= (Output-expected answer) * (w2) * (input)

This equation includes the input value to calculate the delta for weight 1. So if I use an average gradient and propagate it back to calculate the delta for weight 1, which sample's input value do I use? I did not use a single input value to calculate the average gradient; I used the inputs of all the batch samples. I hope this makes sense.

OK, so I figured it out. When using mini-batches you should not accumulate and average out the error at the output of the network. Each training example's error gets propagated back as you normally would, except that instead of updating the weights you accumulate the changes you would have made to each weight. When you have looped through the mini-batch, you average the accumulated changes and update the weights accordingly.
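For the simple network in the formula earlier in the thread, this can be written out as follows. This is only a sketch: the batch size $B$, the learning rate $\eta$, the target $y_i$, and the per-sample subscript $i$ are notation assumed here, not taken from the posts above.

```latex
% Batch cost as the average of the per-sample costs C_i
C_{\text{batch}} = \frac{1}{B} \sum_{i=1}^{B} C_i

% Each sample contributes its own gradient, using its own input x_i
\frac{\partial C_{\text{batch}}}{\partial w_1}
  = \frac{1}{B} \sum_{i=1}^{B} \frac{\partial C_i}{\partial w_1}
  = \frac{1}{B} \sum_{i=1}^{B} (O_i - y_i)\, w_2\, x_i

% One weight update per mini-batch
w_1 \leftarrow w_1 - \eta \, \frac{\partial C_{\text{batch}}}{\partial w_1}
```

So there is no single input value to choose: every term in the sum is multiplied by the input of the sample that produced it, and only the finished per-weight gradients are averaged.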

I was under the impression that when using mini-batches you do not have to propagate any error back until you have looped through the whole mini-batch. I was wrong: you still need to do that for every sample. The only difference is that you update the weights just once, after you have looped through the mini-batch.
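The procedure described above can be sketched in C roughly as follows. This is a minimal illustration, not the actual library code: it assumes the two-weight network from the formula earlier in the thread (`h = w1 * x`, `O = w2 * h`, no activation function), and the names `TinyNet` and `minibatch_step` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical two-weight network: h = w1 * x, O = w2 * h (no activation). */
typedef struct {
    float w1;
    float w2;
} TinyNet;

/* One mini-batch update: backpropagate each sample on its own,
   accumulate the per-weight gradients, then apply their average once. */
void minibatch_step(TinyNet *net, const float *x, const float *y,
                    size_t batch_size, float learning_rate)
{
    float grad_w1_sum = 0.0f;
    float grad_w2_sum = 0.0f;

    for (size_t i = 0; i < batch_size; ++i) {
        /* Forward pass for sample i. */
        float h = net->w1 * x[i];
        float out = net->w2 * h;

        /* dC/dO for the cost C = 0.5 * (y - O)^2. */
        float delta = out - y[i];

        /* Backward pass: uses this sample's own input and hidden value. */
        grad_w2_sum += delta * h;
        grad_w1_sum += delta * net->w2 * x[i];
    }

    /* The weights change only here, once per mini-batch. */
    net->w1 -= learning_rate * grad_w1_sum / (float)batch_size;
    net->w2 -= learning_rate * grad_w2_sum / (float)batch_size;
}
```

Calling `minibatch_step(&net, x, y, batch_size, learning_rate)` once per mini-batch replaces the per-sample weight update of plain SGD. The weights stay fixed inside the loop, so every sample in the batch is evaluated against the same network.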