Batch Normalization | Summary

References

  • Sergey Ioffe, Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. [ICML][arXiv]
  • Lecture 6: Training Neural Networks, Part 1. CS231n: Convolutional Neural Networks for Visual Recognition. 48:52–1:04:39. [YouTube]
  • Choung young jae (2017-07-02). PR-021: Batch Normalization. YouTube. [YouTube]
  • tf.nn.batch_normalization. TensorFlow. [LINK]
  • Rui Shu (2016-12-27). A Gentle Guide to Using Batch Normalization in TensorFlow. [LINK]
  • tf.contrib.layers.batch_norm. TensorFlow. [LINK]
  • Deeplearning.ai (2017-08-25). Normalizing Activations in a Network (C2W3L04). YouTube. [YouTube]
  • Deeplearning.ai (2017-08-25). Fitting Batch Norm Into Neural Networks (C2W3L05). YouTube. [YouTube]
  • Deeplearning.ai (2017-08-25). Why Does Batch Norm Work? (C2W3L06). YouTube. [YouTube]
  • Deeplearning.ai (2017-08-25). Batch Norm at Test Time (C2W3L07). YouTube. [YouTube]

Keywords

internal covariate shift, data normalization, scaling, shifting


Summary

How to solve internal covariate shift?
  • Careful initialization: difficult
  • Small learning rate: slow
  • Batch normalization
What problem does batch normalization solve?
  • Internal covariate shift
Where to apply batch normalization?
  • Right before the activation function, i.e., to the pre-activation values rather than to the activations
  • Applied separately to each node (unit) of the layer
The learnable parameters introduced by batch normalization
  • Scale parameter \gamma
  • Shift parameter \beta
Advantages of batch normalization (BN)
  • Faster learning: normalizing the activation inputs speeds up training.
    • BN helps a model fit the data faster.
  • BN allows the optimizer to use a higher learning rate.
  • BN makes a model less dependent on weight initialization.
  • BN has a slight regularization effect, so dropout may be less necessary.
Batch normalization on forward propagation
  • n^{[l]}: #units at layer l
  • m: #examples in a mini-batch. mini-batch size.
  • i=1,2,...,n^{[l]}
  • j=1,2,...,m
  • w_i^{[l]}\in \mathbb{R}^{1\times n^{[l-1]}}: the weight (row) vector of unit i at layer l
  • b_i^{[l]}, \beta_i^{[l]}, \gamma_i^{[l]} \in \mathbb{R}: the bias, shift, and scale of unit i at layer l
  • x_j^{[l-1]}\in \mathbb{R}^{n^{[l-1]}\times 1}: the input to layer l (the activations of layer l-1) for example j

(Step 0: z_{i,j}^{[l]}\leftarrow w_i^{[l]}x_j^{[l-1]}+ \cancel{b_i^{[l]}})

Step 1: \mu_{B,i}^{[l]}\leftarrow \frac{1}{m}\sum^{m}_{j=1}z_{i,j}^{[l]}

Step 2: (\sigma_{B,i}^{[l]})^2\leftarrow \frac{1}{m}\sum_{j=1}^{m}(z_{i,j}^{[l]}-\mu_{B,i}^{[l]})^2

Step 3: \hat{z}_{i,j}^{[l]} \leftarrow \frac{z_{i,j}^{[l]}-\mu_{B,i}^{[l]}}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}

Step 4: y_{i,j}^{[l]}\leftarrow\gamma_i^{[l]}\hat{z}_{i,j}^{[l]}+\beta_i^{[l]}=\textup{BN}_{\gamma_i^{[l]},\beta_i^{[l]}}(z_{i,j}^{[l]})

(Step 5: a_{i,j}^{[l]} \leftarrow \textup{activation}(y_{i,j}^{[l]}))
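
A minimal NumPy sketch of Steps 1–4 for one layer, vectorized over its units (the function name batch_norm_forward, the (n^{[l]}, m) array layout, and eps = 1e-5 are illustrative assumptions, not taken from the references):

import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    # z     : (n_l, m) pre-activations of one layer for a mini-batch (Step 0, bias omitted)
    # gamma : (n_l, 1) scale parameters, one per unit
    # beta  : (n_l, 1) shift parameters, one per unit
    mu = z.mean(axis=1, keepdims=True)        # Step 1: per-unit mini-batch mean
    var = z.var(axis=1, keepdims=True)        # Step 2: per-unit mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)     # Step 3: normalize
    return gamma * z_hat + beta               # Step 4: scale and shift

# Step 5: feed the result into the activation, e.g. np.maximum(0.0, batch_norm_forward(z, gamma, beta)) for ReLU.

The referenced tf.nn.batch_normalization op performs the same Step 3–4 transform once the mini-batch mean, variance, offset (\beta), and scale (\gamma) are supplied.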

b_i^{[l]} has no effect on the computation.
  • That is because b_i^{[l]} is canceled out when the mini-batch mean \mu_{B,i}^{[l]} is subtracted, as shown below.

(Step 0: z_{i,j}^{[l]}\leftarrow w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})

Step 1: \mu_{B,i}^{[l]}\leftarrow \frac{1}{m}\sum^{m}_{j=1}z_{i,j}^{[l]}=\frac{1}{m}\sum^{m}_{j=1}(w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})=\frac{1}{m}\sum^{m}_{j=1}(w_i^{[l]}x_j^{[l-1]})+b_i^{[l]}

Step 3:
\hat{z}_{i,j}^{[l]} \leftarrow \frac{z_{i,j}^{[l]}-\mu_{B,i}^{[l]}}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}
= \frac{(w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})-(\frac{1}{m}\sum^{m}_{j'=1}(w_i^{[l]}x_{j'}^{[l-1]})+b_i^{[l]})}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}
= \frac{w_i^{[l]}x_j^{[l-1]}-\frac{1}{m}\sum^{m}_{j'=1}(w_i^{[l]}x_{j'}^{[l-1]})}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}

  • \beta_i^{[l]} takes the role of the bias b_i^{[l]}.
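
The cancellation of b_i^{[l]} can also be checked numerically; a small sketch (the shapes and the helper name normalize are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 32))          # w x for 4 units over a mini-batch of 32 examples
b = rng.normal(size=(4, 1))           # a constant per-unit bias b_i

def normalize(z, eps=1e-5):           # Steps 1-3: subtract the mini-batch mean, divide by the std
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

# Adding b shifts the mini-batch mean by exactly b, so it cancels in Step 3.
print(np.allclose(normalize(z), normalize(z + b)))   # True
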
Some models do not need batch normalization, e.g., those using
  • ELU (Exponential Linear Unit)
  • SELU (Scaled Exponential Linear Unit), used in self-normalizing neural networks
Learning parameters with batch normalization
  • w_i^{[l]}, \cancel{b_i^{[l]}}, \beta_i^{[l]}, \gamma_i^{[l]}
    • b_i^{[l]} has no effect on the computation, so it is dropped.
  • \beta_i^{[l]}, \gamma_i^{[l]} are additional parameters to learn.
Gradient descent with batch normalization

for t=1,2,...,(\textup{number of mini-batches})

Run forward propagation with batch normalization on the mini-batch X_t\in \mathbb{R}^{n^{[0]} \times m}.

Compute \frac{\partial J}{\partial w_i^{[l]}}, \frac{\partial J}{\partial \beta_i^{[l]}}, \frac{\partial J}{\partial \gamma_i^{[l]}} using backpropagation.  (J: loss function)

Update parameters.

w_i^{[l]}:=w_i^{[l]}-\alpha \frac{\partial J}{\partial w_i^{[l]}}

\beta_i^{[l]}:=\beta_i^{[l]}-\alpha \frac{\partial J}{\partial \beta_i^{[l]}}

\gamma_i^{[l]}:=\gamma_i^{[l]}-\alpha \frac{\partial J}{\partial \gamma_i^{[l]}}
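
A runnable toy sketch of this training loop; the tiny two-layer network, the mean-squared-error cost, and the use of finite-difference gradients in place of a hand-coded backward pass are all illustrative assumptions, not the procedure from the references:

import numpy as np

rng = np.random.default_rng(1)
m, batch_size = 64, 16
X = rng.normal(size=(3, m))                         # n^[0] = 3 input features
Y = np.sin(X.sum(axis=0, keepdims=True))            # toy regression target

params = {"W1": 0.5 * rng.normal(size=(2, 3)),      # layer 1 weights (no bias b)
          "gamma": np.ones((2, 1)),                 # BN scale, one per unit
          "beta": np.zeros((2, 1)),                 # BN shift, one per unit
          "W2": 0.5 * rng.normal(size=(1, 2))}      # output layer weights

def cost(p, X_t, Y_t):
    z = p["W1"] @ X_t                               # Step 0 (bias omitted)
    mu = z.mean(axis=1, keepdims=True)              # Step 1
    var = z.var(axis=1, keepdims=True)              # Step 2
    z_hat = (z - mu) / np.sqrt(var + 1e-5)          # Step 3
    a = np.maximum(0.0, p["gamma"] * z_hat + p["beta"])  # Steps 4-5 (ReLU)
    return np.mean((p["W2"] @ a - Y_t) ** 2)        # loss J on the mini-batch

alpha, eps = 0.1, 1e-6                              # learning rate, finite-difference step
for epoch in range(20):
    for t in range(m // batch_size):                # t = 1, ..., number of mini-batches
        X_t = X[:, t * batch_size:(t + 1) * batch_size]
        Y_t = Y[:, t * batch_size:(t + 1) * batch_size]
        for name, value in list(params.items()):    # update w, gamma, beta
            grad = np.zeros_like(value)
            for idx in np.ndindex(value.shape):     # numerical dJ/d(parameter element)
                value[idx] += eps
                j_plus = cost(params, X_t, Y_t)
                value[idx] -= 2 * eps
                j_minus = cost(params, X_t, Y_t)
                value[idx] += eps
                grad[idx] = (j_plus - j_minus) / (2 * eps)
            params[name] = value - alpha * grad     # e.g. gamma := gamma - alpha * dJ/dgamma

Finite differences only keep the sketch short; in practice the gradients with respect to w, \beta, and \gamma come from backpropagation through the batch-norm step.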
