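To build a machine learning algorithm, you usually define an architecture (e.g. logistic regression, support vector machine, neural network) and train it to learn its parameters. A common training process for neural networks is to initialize the parameters, choose an optimization algorithm, and then repeat these steps: forward propagate an input, compute the cost function, compute the gradients of the cost with respect to the parameters using backpropagation, and update each parameter using the gradients.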
Then, given a new data point, you can use the model to predict its class.
The initialization step can be critical to the model's ultimate performance, and it requires the right method. To illustrate this, consider the three-layer neural network below. You can try initializing this network with different methods and observe the impact on the learning.
Select a training dataset.
This legend details the color scheme for labels, and the values of the weights/gradients.
Select an initialization method for the values of your neural network parameters.
Select whether to visualize the weights or gradients of the network above.
Observe the cost function and the decision boundary.
Consider this 9-layer neural network.
At every iteration of the optimization loop (forward, cost, backward, update), we observe that backpropagated gradients are either amplified or attenuated as we move from the output layer towards the input layer. This result makes sense if you consider the following example.
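Assume all the activation functions are linear (identity function) and all biases are zero. Then the output activation is:
$$\hat{y} = a^{[L]} = W^{[L]} W^{[L-1]} W^{[L-2]} \dots W^{[3]} W^{[2]} W^{[1]} x$$
where $W^{[1]}, W^{[2]}, \dots, W^{[L-1]}$ are all matrices of size $(2, 2)$ because layers $[1]$ to $[L-1]$ have 2 neurons and receive 2 inputs. With this in mind, and for illustrative purposes, if we assume $W^{[1]} = W^{[2]} = \dots = W^{[L-1]} = W$, the output prediction is $\hat{y} = W^{[L]} W^{L-1} x$ (where $W^{L-1}$ takes the matrix $W$ to the power of $L-1$, while $W^{[L]}$ denotes the $L$th matrix).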
What would be the outcome of initialization values that were too small, too large or appropriate?
Consider the case where every weight is initialized slightly larger than the identity matrix, for example:
$$W^{[1]} = W^{[2]} = \dots = W^{[L-1]} = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$$
This simplifies to $\hat{y} = W^{[L]} 1.5^{L-1} x$, and the values of $a^{[l]}$ increase exponentially with $l$. When these activations are used in backward propagation, this leads to the exploding gradient problem. That is, the gradients of the cost with respect to the parameters are too big. This leads the cost to oscillate around its minimum value.
Similarly, consider the case where every weight is initialized slightly smaller than the identity matrix, for example:
$$W^{[1]} = W^{[2]} = \dots = W^{[L-1]} = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$$
This simplifies to $\hat{y} = W^{[L]} 0.5^{L-1} x$, and the values of the activation $a^{[l]}$ decrease exponentially with $l$. When these activations are used in backward propagation, this leads to the vanishing gradient problem. The gradients of the cost with respect to the parameters are too small, leading to convergence of the cost before it has reached the minimum value.
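The two cases above can be reproduced numerically. The following is a minimal sketch, assuming a linear network with zero biases whose 2x2 weight matrices all equal a scalar multiple of the identity. The function name `linear_forward_norms`, the scales 1.5 and 0.5, the depth of 9, and the input vector are illustrative choices, not part of the original demo.

```python
import numpy as np

def linear_forward_norms(scale, num_layers=9, x=None):
    """Norm of the activation after each layer of a linear network
    whose weight matrices are all `scale * I` (size 2x2, zero bias)."""
    if x is None:
        x = np.array([1.0, 1.0])  # illustrative input
    W = scale * np.eye(2)
    a = x
    norms = []
    for _ in range(num_layers):
        a = W @ a  # identity activation: a = z = W a_prev
        norms.append(float(np.linalg.norm(a)))
    return norms

large = linear_forward_norms(1.5)  # grows by a factor of 1.5 per layer
small = linear_forward_norms(0.5)  # shrinks by a factor of 0.5 per layer

for layer, (hi, lo) in enumerate(zip(large, small), start=1):
    print(f"layer {layer}: scale 1.5 -> {hi:.4g}, scale 0.5 -> {lo:.4g}")
```

Because each layer multiplies the signal by the same factor, the norm after layer $l$ is the input norm times that factor raised to the power $l$, which is the exponential behavior described above.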
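To prevent the gradients of the network's activations from vanishing or exploding, we will stick to the following rules of thumb:
1. The mean of the activations should be zero.
2. The variance of the activations should stay the same across every layer.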
Under these two assumptions, the backpropagated gradient signal should not be multiplied by values too small or too large in any layer. It should travel to the input layer without exploding or vanishing.
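More concretely, consider a layer $l$. Its forward propagation is:
$$a^{[l-1]} = g^{[l-1]}(z^{[l-1]})$$
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = g^{[l]}(z^{[l]})$$
We would like the following to hold:
$$E[a^{[l-1]}] = E[a^{[l]}]$$
$$\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$$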
Ensuring zero-mean and maintaining the value of the variance of the input of every layer guarantees no exploding/vanishing signal, as we'll explain in a moment. This method applies both to the forward propagation (for activations) and backward propagation (for gradients of the cost with respect to activations). The recommended initialization is Xavier initialization (or one of its derived methods), for every layer $l$:
$$W^{[l]} \sim \mathcal{N}\left(\mu = 0, \sigma^2 = \frac{1}{n^{[l-1]}}\right)$$
$$b^{[l]} = 0$$
In other words, all the weights of layer $l$ are picked randomly from a normal distribution with mean $\mu = 0$ and variance $\sigma^2 = \frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of neurons in layer $l-1$. Biases are initialized with zeros.
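As a rough illustration of this rule, here is a minimal NumPy sketch. It assumes weights drawn from a normal distribution with zero mean and variance 1 / (number of neurons in the previous layer), and zero biases. The function name `xavier_initialize`, the parameter dictionary layout, and the layer sizes are hypothetical choices made for this example.

```python
import numpy as np

def xavier_initialize(layer_sizes, seed=0):
    """Return a dict of parameters for a fully-connected network.

    layer_sizes[0] is the input size; layer_sizes[l] is the number of
    neurons in layer l. Weights of layer l are drawn from N(0, 1/n_prev)
    and biases are set to zero.
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
        std = np.sqrt(1.0 / n_prev)  # standard deviation = sqrt(variance)
        params[f"W{l}"] = rng.normal(0.0, std, size=(n_curr, n_prev))
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

# Illustrative layer sizes (not taken from the article).
params = xavier_initialize([784, 256, 128, 10])
print(params["W1"].shape, params["W1"].var() * 784)  # second value should be close to 1
```

Note that `rng.normal` takes a standard deviation rather than a variance, which is why the square root appears.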
The visualization below illustrates the influence of the Xavier initialization on each layer’s activations for a five-layer fully-connected neural network.
Load 10,000 handwritten digits images (MNIST).
Among the distributions below, select the one to use to initialize parameters.
The grid below refers to the input images. Blue squares represent correctly classified images. Red squares represent misclassified images.
Input batch of 100 images
Output predictions of 100 images
You can find the theory behind this visualization in Glorot et al. (2010). The next section presents the mathematical justification for Xavier initialization and explains more precisely why it is an effective initialization.
In this section, we will show that Xavier Initialization keeps the variance the same across every layer. We will assume that our layer’s activations are normally distributed around zero. Sometimes it helps to understand the mathematical justification to grasp the concept, but you can understand the fundamental idea without the math.
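Let’s work on the layer described in part (III) and assume the activation function is $\tanh$. The forward propagation is:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = \tanh(z^{[l]})$$
The goal is to derive a relationship between $\mathrm{Var}(a^{[l-1]})$ and $\mathrm{Var}(a^{[l]})$. We will then understand how we should initialize our weights such that: $\mathrm{Var}(a^{[l-1]}) = \mathrm{Var}(a^{[l]})$.
Assume we initialized our network with appropriate values and the input is normalized. Early on in the training, we are in the linear regime of $\tanh$. Values are small enough and thus $\tanh(z^{[l]}) \approx z^{[l]}$, meaning that:
$$\mathrm{Var}(a^{[l]}) = \mathrm{Var}(z^{[l]})$$
Moreover, $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} = \mathrm{vector}(z_1^{[l]}, z_2^{[l]}, \dots, z_{n^{[l]}}^{[l]})$ where $z_k^{[l]} = \sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]} + b_k^{[l]}$. For simplicity, let’s assume that $b^{[l]} = 0$ (it will end up being true given the initialization we will choose). Thus, looking element-wise at the previous equation now gives:
$$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \mathrm{Var}\left(\sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]}\right)$$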
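A common math trick is to extract the summation outside the variance. To do this, we must make the following three assumptions:
1. Weights are independent and identically distributed.
2. Inputs are independent and identically distributed.
3. Weights and inputs are mutually independent.
Thus, now we have:
$$\mathrm{Var}(a_k^{[l]}) = \mathrm{Var}(z_k^{[l]}) = \mathrm{Var}\left(\sum_{j=1}^{n^{[l-1]}} w_{kj}^{[l]} a_j^{[l-1]}\right) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}(w_{kj}^{[l]} a_j^{[l-1]})$$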
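Another common math trick is to convert the variance of a product into a product of variances. Here is the formula for it, valid for independent $X$ and $Y$:
$$\mathrm{Var}(XY) = E[X]^2 \mathrm{Var}(Y) + \mathrm{Var}(X) E[Y]^2 + \mathrm{Var}(X) \mathrm{Var}(Y)$$
Using this formula with $X = w_{kj}^{[l]}$ and $Y = a_j^{[l-1]}$, we get:
$$\mathrm{Var}(w_{kj}^{[l]} a_j^{[l-1]}) = E[w_{kj}^{[l]}]^2 \mathrm{Var}(a_j^{[l-1]}) + \mathrm{Var}(w_{kj}^{[l]}) E[a_j^{[l-1]}]^2 + \mathrm{Var}(w_{kj}^{[l]}) \mathrm{Var}(a_j^{[l-1]})$$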
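The variance-of-a-product formula for independent variables can be checked with a quick Monte Carlo experiment. This is a sketch under assumed distributions: the means, standard deviations, and sample count below are arbitrary illustrative values, deliberately chosen with nonzero means so that all three terms contribute.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1_000_000

# Independent X ~ N(2, 3^2) and Y ~ N(-1, 0.5^2) (illustrative values).
mean_x, std_x = 2.0, 3.0
mean_y, std_y = -1.0, 0.5
X = rng.normal(mean_x, std_x, n_samples)
Y = rng.normal(mean_y, std_y, n_samples)

# Var(XY) = E[X]^2 Var(Y) + Var(X) E[Y]^2 + Var(X) Var(Y)
predicted = (mean_x**2) * (std_y**2) + (std_x**2) * (mean_y**2) + (std_x**2) * (std_y**2)
empirical = float(np.var(X * Y))

print(f"predicted: {predicted:.3f}, empirical: {empirical:.3f}")
```

When both means are zero, the first two terms vanish and only the product of variances remains, which is the simplification used in the next step of the derivation.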
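We’re almost done! The first assumption leads to $E[w_{kj}^{[l]}]^2 = 0$ and the second assumption leads to $E[a_j^{[l-1]}]^2 = 0$ because weights are initialized with zero mean, and inputs are normalized. Thus:
$$\mathrm{Var}(z_k^{[l]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}(w_{kj}^{[l]}) \mathrm{Var}(a_j^{[l-1]}) = \sum_{j=1}^{n^{[l-1]}} \mathrm{Var}(W^{[l]}) \mathrm{Var}(a^{[l-1]}) = n^{[l-1]} \mathrm{Var}(W^{[l]}) \mathrm{Var}(a^{[l-1]})$$
The equality above results from our first assumption stating that:
$$\mathrm{Var}(w_{kj}^{[l]}) = \mathrm{Var}(w_{11}^{[l]}) = \mathrm{Var}(w_{12}^{[l]}) = \dots = \mathrm{Var}(W^{[l]})$$
Similarly the second assumption leads to:
$$\mathrm{Var}(a_j^{[l-1]}) = \mathrm{Var}(a_1^{[l-1]}) = \mathrm{Var}(a_2^{[l-1]}) = \dots = \mathrm{Var}(a^{[l-1]})$$
With the same idea:
$$\mathrm{Var}(z^{[l]}) = \mathrm{Var}(z_k^{[l]})$$
Wrapping up everything, we have:
$$\mathrm{Var}(a^{[l]}) = n^{[l-1]} \mathrm{Var}(W^{[l]}) \mathrm{Var}(a^{[l-1]})$$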
Voilà! If we want the variance to stay the same across layers ($\mathrm{Var}(a^{[l]}) = \mathrm{Var}(a^{[l-1]})$), we need $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}$. This justifies the choice of variance for Xavier initialization.
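The single-layer relationship derived here (output variance equals the number of inputs times the weight variance times the input variance) can be checked empirically. The sketch below assumes a linear layer with zero bias and standard normal inputs. The function name `layer_variance_ratio`, the layer sizes, and the batch size are illustrative assumptions.

```python
import numpy as np

def layer_variance_ratio(n_prev, n_curr, var_w, batch=2000, seed=0):
    """Empirical Var(z) / Var(a_prev) for one linear layer z = W a_prev,
    with i.i.d. zero-mean weights of variance var_w and zero bias."""
    rng = np.random.default_rng(seed)
    a_prev = rng.normal(0.0, 1.0, size=(n_prev, batch))
    W = rng.normal(0.0, np.sqrt(var_w), size=(n_curr, n_prev))
    z = W @ a_prev
    return float(z.var() / a_prev.var())

n_prev, n_curr = 500, 400  # illustrative sizes

# The derivation predicts a ratio of n_prev * var_w.
ratio_xavier = layer_variance_ratio(n_prev, n_curr, 1.0 / n_prev)  # predicted: 1
ratio_double = layer_variance_ratio(n_prev, n_curr, 2.0 / n_prev)  # predicted: 2

print(f"var_w = 1/n: {ratio_xavier:.3f}, var_w = 2/n: {ratio_double:.3f}")
```

With a weight variance of one over the number of inputs, the ratio stays close to 1, so the layer neither amplifies nor shrinks the variance of its input.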
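Notice that in the previous steps we did not choose a specific layer $l$. Thus, we have shown that this expression holds for every layer of our network. Let $L$ be the output layer of our network. Using this expression at every layer, we can link the output layer's variance to the input layer's variance:
$$\mathrm{Var}(a^{[L]}) = n^{[L-1]} \mathrm{Var}(W^{[L]}) \mathrm{Var}(a^{[L-1]}) = n^{[L-1]} \mathrm{Var}(W^{[L]}) \, n^{[L-2]} \mathrm{Var}(W^{[L-1]}) \mathrm{Var}(a^{[L-2]}) = \dots = \left[\prod_{l=1}^{L} n^{[l-1]} \mathrm{Var}(W^{[l]})\right] \mathrm{Var}(x)$$
Depending on how we initialize our weights, the relationship between the variance of our output and input will vary dramatically. Notice the following three cases.
$n^{[l-1]} \mathrm{Var}(W^{[l]}) < 1$: vanishing signal.
$n^{[l-1]} \mathrm{Var}(W^{[l]}) = 1$: $\mathrm{Var}(a^{[L]}) = \mathrm{Var}(x)$.
$n^{[l-1]} \mathrm{Var}(W^{[l]}) > 1$: exploding signal.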
Thus, in order to avoid the vanishing or exploding of the forward propagated signal, we must set $n^{[l-1]} \mathrm{Var}(W^{[l]}) = 1$ by initializing $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l-1]}}$.
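The three regimes can be observed by stacking several layers. The following sketch makes simplifying assumptions: every layer has the same width, activations are the identity (standing in for the linear regime of tanh), and biases are zero. The function name `output_to_input_variance`, the width, the depth, and the tested values of the per-layer factor `c` (the number of inputs times the weight variance) are hypothetical choices.

```python
import numpy as np

def output_to_input_variance(c, width=256, depth=10, batch=1000, seed=0):
    """Ratio Var(a_L) / Var(x) for a linear network in which every layer
    has weights of variance c / width, so that n * Var(W) = c."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(width, batch))
    a = x
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(c / width), size=(width, width))
        a = W @ a  # identity activation, zero bias
    return float(a.var() / x.var())

# The derivation predicts a ratio of roughly c ** depth.
ratios = {c: output_to_input_variance(c) for c in (0.5, 1.0, 2.0)}
for c, r in ratios.items():
    print(f"n * Var(W) = {c}: Var(a_L) / Var(x) = {r:.4g}")
```

Only the middle case keeps the output variance on the same order as the input variance. The other two drift away from it exponentially with depth.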
Throughout the justification, we worked on activations computed during the forward propagation. The same result can be derived for the backpropagated gradients. Doing so, you will see that in order to avoid the vanishing or exploding gradient problem, we must set $n^{[l]} \mathrm{Var}(W^{[l]}) = 1$ by initializing $\mathrm{Var}(W^{[l]}) = \frac{1}{n^{[l]}}$.
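In practice, machine learning engineers using Xavier initialization would either initialize the weights as $\mathcal{N}\left(0, \frac{1}{n^{[l-1]}}\right)$ or as $\mathcal{N}\left(0, \frac{2}{n^{[l-1]} + n^{[l]}}\right)$. The variance term of the latter distribution is the harmonic mean of $\frac{1}{n^{[l-1]}}$ and $\frac{1}{n^{[l]}}$.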
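The two variants and the harmonic-mean remark can be written out in a few lines. This is a sketch only: the function names and the mode labels `"fan_in"` and `"fan_avg"` are hypothetical names chosen for this example, and the layer sizes are arbitrary.

```python
def xavier_variance(n_in, n_out, mode="fan_in"):
    """Weight variance for the two Xavier variants described above.

    "fan_in":  1 / n_in            (forward-pass condition only)
    "fan_avg": 2 / (n_in + n_out)  (compromise between forward and backward)
    """
    if mode == "fan_in":
        return 1.0 / n_in
    if mode == "fan_avg":
        return 2.0 / (n_in + n_out)
    raise ValueError(f"unknown mode: {mode}")

def harmonic_mean(x, y):
    """Harmonic mean of two positive numbers."""
    return 2.0 / (1.0 / x + 1.0 / y)

n_in, n_out = 300, 100  # illustrative layer sizes
v_avg = xavier_variance(n_in, n_out, "fan_avg")
h = harmonic_mean(1.0 / n_in, 1.0 / n_out)

# Both expressions evaluate to 2 / (n_in + n_out).
print(v_avg, h)
```

The second variant sits between the forward-pass requirement (one over the number of inputs) and the backward-pass requirement (one over the number of outputs), which is why it is a common compromise.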
This is a theoretical justification for Xavier initialization. Xavier initialization works with tanh activations. Myriad other initialization methods exist. If you are using ReLU, for example, a common initialization is He initialization (He et al., Delving Deep into Rectifiers), in which the weights are initialized with twice the variance of the Xavier initialization, that is $\mathrm{Var}(W^{[l]}) = \frac{2}{n^{[l-1]}}$. While the justification for this initialization is slightly more complicated, it follows the same thought process as the one for tanh.
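To see why the factor of 2 matters for ReLU, the sketch below compares the variance of the pre-activations across layers under both choices. It assumes a fully-connected ReLU network with equal layer widths, zero biases, and standard normal inputs. The function name `preactivation_variances` and all sizes are illustrative assumptions, not taken from He et al.

```python
import numpy as np

def preactivation_variances(variance_factor, width=512, depth=10, batch=512, seed=0):
    """Variance of z at each layer of a ReLU network whose weights have
    variance variance_factor / width (1.0 for Xavier, 2.0 for He)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(width, batch))
    variances = []
    for _ in range(depth):
        std = np.sqrt(variance_factor / width)
        W = rng.normal(0.0, std, size=(width, width))
        z = W @ a                # zero bias
        variances.append(float(z.var()))
        a = np.maximum(z, 0.0)   # ReLU
    return variances

xavier = preactivation_variances(1.0)
he = preactivation_variances(2.0)

for layer, (vx, vh) in enumerate(zip(xavier, he), start=1):
    print(f"layer {layer}: Xavier {vx:.4g}, He {vh:.4g}")
```

ReLU zeroes out roughly half of each layer's pre-activations, which halves the second moment of the signal. Doubling the weight variance compensates for that loss, so the He column stays roughly constant while the Xavier column shrinks layer after layer.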
Learn more about how to effectively initialize parameters in Course 2 of the Deep Learning Specialization.