Like any other machine learning model, a neural network is meant to make good predictions. It stacks units called 'neurons' in multiple layers, each layer containing one or more neurons. Each neuron acts like a miniature model with its own parameters: its weights and a bias term. The neural net is a complex structure built from these miniature models. Training a neural network involves three basic steps: forward propagation, calculating the loss, and backward propagation. Let's analyse them further:
1.1 Forward Propagation
Say we have L hidden layers, so the network has L + 2 layers in total once the input and output layers are counted. Given an input X, we feed it to the first hidden layer. Each element of the input vector is connected to every activation unit in the first hidden layer. Each activation unit therefore has as many weights as X has elements, and it receives the weighted sum of the inputs. The unit then adds its bias term and applies the activation function. The output of hidden layer 1 is fed to the next hidden layer, and so on.
Thus, in this step we pass the input through the model, multiplying by the weights, adding the bias, and applying the activation function at every layer, to obtain the model's output.
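The forward pass described above can be sketched in a few lines of NumPy. This is a minimal sketch rather than a definitive implementation: the layer sizes, the sigmoid activation, the random weights, and the names `forward`, `weights`, and `biases` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=sigmoid):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b          # weighted sum of the previous layer plus bias
        activations.append(activation(z))    # apply the activation function
    return activations

# Illustrative network: 4 inputs, one hidden layer of 5 units, 1 output (assumed sizes).
rng = np.random.default_rng(0)
sizes = [4, 5, 1]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

x = rng.normal(size=(4, 1))
activations = forward(x, weights, biases)
y_hat = activations[-1]   # the model's predicted value
```

Keeping every layer's activation in a list is a design choice made here because back propagation needs those intermediate values later.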
1.2 Calculating loss
Once the input has propagated forward through the whole network, the model outputs a value for it. This is called the predicted value. We want this predicted value to be as close as possible to the output labelled in our training set. So we calculate the error between the predicted and the actual output values; this error is called the loss. It measures how wrong our predictions are relative to the given output.
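As a small illustration of a loss, the sketch below assumes the binary cross-entropy that is used for the worked example in section 2. The function name `binary_cross_entropy` and the clipping constant `eps` (added only to avoid taking the log of zero) are assumptions, not part of the original text.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one example: -y*log(y_hat) - (1 - y)*log(1 - y_hat)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # assumed safeguard against log(0)
    return -y * math.log(y_hat) - (1.0 - y) * math.log(1.0 - y_hat)

# For a true label of 1, a confident correct prediction gives a small loss,
# and a confident wrong prediction gives a large one.
loss_good = binary_cross_entropy(1.0, 0.9)
loss_bad = binary_cross_entropy(1.0, 0.1)
```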
1.3 Back Propagation
After we have calculated the loss, it's time to train our model. The goal is to use an optimization algorithm such as gradient descent to update the parameters so that the loss is minimized. The gradient of the loss function is the vector that points in the direction of steepest increase, so we repeatedly take steps in the opposite direction of the gradient to eventually arrive at a minimum. In back prop, we move the error backwards through the model, via the same weights and biases that were used in forward propagation. The goal is to calculate the error attributed to each neuron, passing it back layer by layer, and adjust that neuron's weights accordingly. The larger the error attributed to a neuron, the more its weights and bias need to change.
Therefore, at each step the goal is to calculate the error in the weights and bias and update the parameters using gradient descent (we use this algorithm here, although there are many others). At each step of moving back, we calculate the partial derivative of the loss w.r.t. each parameter and then update the parameters, changing the weights on every iteration and moving towards the goal of minimum error.
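The update rule itself is easiest to see on a toy problem. The sketch below assumes a one-parameter loss l(w) = (w - 3)^2, whose derivative is 2(w - 3); the loss, the learning rate, and the name `gradient_descent` are illustrative assumptions, not the network from this document.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Toy loss l(w) = (w - 3)^2 has its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

In a neural network the same update is applied to every weight and bias, with back propagation supplying the partial derivatives.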
2 Calculating Forward and Back Prop
2.0.1 Calculating forward and back prop for given diagram:
Say we have an input layer with 4 elements, a hidden layer with 5 activation units, and the predicted output $\hat{y}$. Let's first write the forward prop equations of the hidden layer, where weight $w_{i,j}$ connects input unit $i$ to hidden unit $j$. For the hidden units $a_0, \dots, a_4$ we have:
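$$
\begin{aligned}
z_0 &= w_{0,0}x_0 + w_{1,0}x_1 + w_{2,0}x_2 + w_{3,0}x_3 + b_0, & a_0 &= \sigma(z_0) \\
z_1 &= w_{0,1}x_0 + w_{1,1}x_1 + w_{2,1}x_2 + w_{3,1}x_3 + b_1, & a_1 &= \sigma(z_1) \\
z_2 &= w_{0,2}x_0 + w_{1,2}x_1 + w_{2,2}x_2 + w_{3,2}x_3 + b_2, & a_2 &= \sigma(z_2) \\
z_3 &= w_{0,3}x_0 + w_{1,3}x_1 + w_{2,3}x_2 + w_{3,3}x_3 + b_3, & a_3 &= \sigma(z_3) \\
z_4 &= w_{0,4}x_0 + w_{1,4}x_1 + w_{2,4}x_2 + w_{3,4}x_3 + b_4, & a_4 &= \sigma(z_4)
\end{aligned}
$$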
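Then we calculate the activation of the output layer, $\hat{y}$, as:
$$
z^{[2]} = \sum_{i=0}^{4} w^{[2]}_{i,0}\, a_i + b^{[2]}, \qquad \hat{y} = \sigma(z^{[2]})
$$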
Then we calculate the loss of the output layer:
$$
l = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y})
$$
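Then we calculate the derivative of the loss w.r.t. $\hat{y}$ and then w.r.t. $z^{[2]}$, the pre-activation of the output unit. Writing $dz^{[2]}$ for $\partial l / \partial z^{[2]}$ and using the chain rule with the sigmoid derivative $\hat{y}(1 - \hat{y})$:
$$
dz^{[2]} = \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \hat{y}\,(1 - \hat{y}) = \hat{y} - y
$$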
Propagating backwards, we have to calculate the derivative w.r.t. the individual parameters. From the output layer to the hidden layer, we have:
$$
dw^{[2]}_{i,0} = dz^{[2]}\, a_i \quad \text{for } i = 0, \dots, 4
$$
$$
db^{[2]} = dz^{[2]}
$$
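Then, moving backwards from this layer, where $z^{[1]}_i$ is the pre-activation $z_i$ of hidden unit $i$ and $\sigma'$ is the derivative of the activation function:
$$
dz^{[1]}_i = w^{[2]}_{i,0}\, dz^{[2]}\, \sigma'(z^{[1]}_i) \quad \text{for } i = 0, \dots, 4
$$
$$
dw^{[1]}_{i,j} = dz^{[1]}_j\, x_i \quad \text{for } i = 0, \dots, 3,\ j = 0, \dots, 4
$$
$$
db^{[1]}_i = dz^{[1]}_i \quad \text{for } i = 0, \dots, 4
$$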
These formulas give each unit's equations for forward and backward propagation. We then update the weights and biases with these gradients and repeat the process for each of the training examples.
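The per-unit formulas can be checked numerically. The sketch below implements them with plain Python loops for the 4-input, 5-hidden-unit, 1-output network and compares one analytic gradient against a finite-difference estimate. The concrete input, label, and weight values, and the function names, are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    # w1[i][j] connects input i to hidden unit j; w2[i] connects hidden unit i to the output.
    z1 = [sum(w1[i][j] * x[i] for i in range(4)) + b1[j] for j in range(5)]
    a1 = [sigmoid(z) for z in z1]
    z2 = sum(w2[i] * a1[i] for i in range(5)) + b2
    return a1, sigmoid(z2)

def loss(y, y_hat):
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def backward(x, y, w2, a1, y_hat):
    dz2 = y_hat - y
    dw2 = [dz2 * a1[i] for i in range(5)]
    db2 = dz2
    # For the sigmoid, sigma'(z) = a * (1 - a).
    dz1 = [w2[i] * dz2 * a1[i] * (1 - a1[i]) for i in range(5)]
    dw1 = [[dz1[j] * x[i] for j in range(5)] for i in range(4)]
    db1 = dz1
    return dw1, db1, dw2, db2

# Illustrative values (assumed).
x = [0.5, -1.0, 2.0, 0.1]
y = 1.0
w1 = [[0.1 * (i - j) for j in range(5)] for i in range(4)]
b1 = [0.0] * 5
w2 = [0.3, -0.2, 0.5, 0.1, -0.4]
b2 = 0.05

a1, y_hat = forward(x, w1, b1, w2, b2)
dw1, db1, dw2, db2 = backward(x, y, w2, a1, y_hat)

# Central finite difference on a single weight, w1[2][3].
h = 1e-6
w1[2][3] += h
loss_plus = loss(y, forward(x, w1, b1, w2, b2)[1])
w1[2][3] -= 2 * h
loss_minus = loss(y, forward(x, w1, b1, w2, b2)[1])
w1[2][3] += h   # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * h)
```

If the formulas are implemented correctly, `numeric` and `dw1[2][3]` should agree to several decimal places.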
2.0.2 Vectorizing the above equations
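We have the input $X$ of dimension (4, 1) and a hidden layer with bias
$$
b^{[1]} = \begin{bmatrix} b^{[1]}_0 \\ b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \\ b^{[1]}_4 \end{bmatrix}
$$
of dimension (5, 1) and weight matrix $W^{[1]}$ of dimension (5, 4), whose row $j$ holds the weights $w_{i,j}$ going into hidden unit $j$. Then, calculating forward propagation, we have: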
$$
Z^{[1]} = W^{[1]} X + b^{[1]}
$$
$$
A^{[1]} = \sigma(Z^{[1]})
$$
For the next layer, $A^{[1]}$ is the input, $Z^{[2]}$ has dimension (1, 1), and $A^{[2]} = \hat{y}$ is our predicted output. The weight matrix $W^{[2]}$ has dimension (1, 5). Then we have:
$$
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}
$$
$$
A^{[2]} = \hat{y} = \sigma(Z^{[2]})
$$
Since there is a single output unit, the loss function is the same as above, and propagating backwards we have:
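$$
dZ^{[2]} = A^{[2]} - Y
$$
$$
dW^{[2]} = dZ^{[2]} \, {A^{[1]}}^{T}, \qquad db^{[2]} = dZ^{[2]}
$$
$$
dZ^{[1]} = \left( {W^{[2]}}^{T} dZ^{[2]} \right) \odot \sigma'(Z^{[1]})
$$
$$
dW^{[1]} = dZ^{[1]} \, X^{T}, \qquad db^{[1]} = dZ^{[1]}
$$
Here $\odot$ is the element-wise product; all other products are matrix products.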
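The vectorized equations map almost one-to-one onto NumPy. The following is a minimal sketch for a single training example, assuming a sigmoid activation in both layers; the random initial values, the learning rate, the number of iterations, and the name `forward_backward` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(X, Y, W1, b1, W2, b2):
    # Forward propagation
    Z1 = W1 @ X + b1                # (5, 4) @ (4, 1) -> (5, 1)
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2               # (1, 5) @ (5, 1) -> (1, 1)
    A2 = sigmoid(Z2)
    # Back propagation
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T                # (1, 5)
    db2 = dZ2
    dZ1 = (W2.T @ dZ2) * A1 * (1.0 - A1)   # element-wise product with sigma'(Z1)
    dW1 = dZ1 @ X.T                 # (5, 4)
    db1 = dZ1
    return A2, {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 1))
Y = np.array([[1.0]])
W1 = rng.normal(size=(5, 4)) * 0.1
b1 = np.zeros((5, 1))
W2 = rng.normal(size=(1, 5)) * 0.1
b2 = np.zeros((1, 1))

learning_rate = 0.1   # assumed value
losses = []
for _ in range(100):
    A2, grads = forward_backward(X, Y, W1, b1, W2, b2)
    losses.append(-(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)).item())
    # Gradient descent update of every parameter
    W1 -= learning_rate * grads["dW1"]
    b1 -= learning_rate * grads["db1"]
    W2 -= learning_rate * grads["dW2"]
    b2 -= learning_rate * grads["db2"]
```

With several training examples stacked as columns of X, the bias gradients would need to be summed or averaged over the columns; this sketch deliberately stays with one example to match the dimensions given in the text.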
3 Activation functions and derivatives
Below are the activation functions and their derivatives respectively.
3.1 Sigmoid
3.1.1 Function:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
3.1.2 Derivative:
$$
\sigma'(x) = \sigma(x)\,(1 - \sigma(x))
$$
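A minimal scalar sketch of the sigmoid and its derivative, using the standard definitions $\sigma(x) = 1/(1 + e^{-x})$ and $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The function names are illustrative, and this simple form is not protected against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative can be sanity-checked against a central finite difference.
h = 1e-6
numeric = (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h)
```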
3.2 Relu
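3.2.1 Function:
$$
f(x) = \max(0, x)
$$
i.e.
$$
f(x) = \begin{cases} x & x > 0 \\ 0 & x \le 0 \end{cases}
$$
3.2.2 Derivative:
$$
f'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}
$$
The derivative is undefined at $x = 0$.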
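A minimal scalar sketch of ReLU, $f(x) = \max(0, x)$, and its derivative. Because the derivative is undefined at x = 0, this sketch returns 0 there, which is an assumed convention rather than something stated in the text.

```python
def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # Undefined at x = 0; returning 0 there is a convention chosen for this sketch.
    return 1.0 if x > 0 else 0.0
```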
3.3 Leaky Relu
3.3.1 Function:
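In this, a factor $a$ multiplies $x$ when $x$ is negative, i.e. $f(x) = \max(ax, x)$. This $a$ has to be less than 1; otherwise the maximum would pick $x$ when $x$ is negative and $ax$ when $x$ is positive. Say $a = 0.01$, then:
$$
f(x) = \begin{cases} x & x \ge 0 \\ 0.01x & x < 0 \end{cases}
$$
3.3.2 Derivative:
$$
f'(x) = \begin{cases} 1 & x \ge 0 \\ 0.01 & x < 0 \end{cases}
$$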
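A minimal scalar sketch of Leaky ReLU with the slope a = 0.01 used above. The function names are illustrative. The last two lines show why a must stay below 1: with an assumed a = 2, the maximum picks x for negative inputs and ax for positive ones.

```python
def leaky_relu(x, a=0.01):
    return max(a * x, x)

def leaky_relu_derivative(x, a=0.01):
    return 1.0 if x >= 0 else a

# With a > 1 the roles flip: x is picked when negative, a*x when positive.
flipped_negative = leaky_relu(-1.0, a=2.0)   # max(-2.0, -1.0)
flipped_positive = leaky_relu(1.0, a=2.0)    # max(2.0, 1.0)
```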
3.4 Tanh
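3.4.1 Function:
$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$
3.4.2 Derivative:
$$
\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)
$$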
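A minimal scalar sketch of tanh written out from its exponential definition, $(e^{x} - e^{-x})/(e^{x} + e^{-x})$, together with its derivative $1 - \tanh^2(x)$. The name `tanh_explicit` is illustrative; in practice the standard library's `math.tanh` would normally be used directly.

```python
import math

def tanh_explicit(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_derivative(x):
    return 1.0 - math.tanh(x) ** 2

# The explicit formula should agree with the library function.
difference = abs(tanh_explicit(0.5) - math.tanh(0.5))
```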
3.5 Softmax
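3.5.1 Function:
For an input vector $x = (x_1, \dots, x_n)$:
$$
S_i = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}
$$
3.5.2 Derivative:
Since softmax is a vector-valued function of a vector input, the most general derivative we compute for it is the Jacobian matrix:
$$
J = \begin{bmatrix}
\frac{\partial S_1}{\partial x_1} & \cdots & \frac{\partial S_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial S_n}{\partial x_1} & \cdots & \frac{\partial S_n}{\partial x_n}
\end{bmatrix}
$$
Thus the derivative:
$$
\frac{\partial S_i}{\partial x_j} = \begin{cases} S_i\,(1 - S_j) & i = j \\ -S_i\,S_j & i \ne j \end{cases}
$$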
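The softmax Jacobian can be built compactly as diag(S) minus the outer product of S with itself, which gives $S_i(1 - S_i)$ on the diagonal and $-S_i S_j$ elsewhere. The sketch below assumes a 1-D input vector; the function names are illustrative, and subtracting the maximum before exponentiating is an added numerical safeguard that does not change the result.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)      # assumed safeguard against overflow; output is unchanged
    e = np.exp(shifted)
    return e / e.sum()

def softmax_jacobian(x):
    s = softmax(x)
    # Entry (i, j) is s_i * (1 - s_j) when i == j and -s_i * s_j otherwise.
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 3.0])
s = softmax(x)
J = softmax_jacobian(x)
```

Because the softmax outputs always sum to 1, each column of the Jacobian sums to zero, which makes a handy sanity check.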