Batch Normalization | Summary

References

  • Sergey Ioffe, Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. [ICML][arXiv]
  • Lecture 6: Training Neural Networks, Part 1. CS231n: Convolutional Neural Networks for Visual Recognition. 48:52–1:04:39. [YouTube]
  • Choung young jae (2017. 7. 2.). PR-021: Batch Normalization. [YouTube]
  • tf.nn.batch_normalization. TensorFlow. [LINK]
  • Rui Shu (27 Dec 2016). A Gentle Guide to Using Batch Normalization in TensorFlow. [LINK]
  • tf.contrib.layers.batch_norm. TensorFlow. [LINK]
  • Deeplearning.ai (2017. 8. 25.). Normalizing Activations in a Network (C2W3L04). [YouTube]
  • Deeplearning.ai (2017. 8. 25.). Fitting Batch Norm Into Neural Networks (C2W3L05). [YouTube]
  • Deeplearning.ai (2017. 8. 25.). Why Does Batch Norm Work? (C2W3L06). [YouTube]
  • Deeplearning.ai (2017. 8. 25.). Batch Norm At Test Time (C2W3L07). [YouTube]

Keywords

internal covariate shift, data normalization, scaling, shifting


Summary

How to solve internal covariate shift?
  • Careful initialization: difficult
  • Small learning rate: slow
  • Batch normalization
What problem does batch normalization solve?
  • Internal covariate shift
Where to apply batch normalization?
  • Just before the activation function, i.e., on the pre-activation $z^{[l]}$
  • Applied independently to each node (unit) of that pre-activation
The parameters introduced by batch normalization (see the sketch below)
  • Scale parameter $\gamma$
  • Shift parameter $\beta$
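
A minimal placement sketch, assuming a TensorFlow 2 setting with made-up layer sizes: the pre-activation $z$ is computed without a bias, normalized per unit with mini-batch statistics (here via tf.nn.moments and the tf.nn.batch_normalization op listed in the references), scaled by $\gamma$, shifted by $\beta$, and only then passed through the activation.

    import tensorflow as tf

    # Hypothetical sizes: m examples per mini-batch, n_prev inputs, n units in layer l.
    m, n_prev, n = 32, 64, 128
    x = tf.random.normal([m, n_prev])                      # output of the previous layer
    W = tf.Variable(tf.random.normal([n_prev, n]) * 0.01)  # weights (no bias term)
    gamma = tf.Variable(tf.ones([n]))                      # scale parameter, one per unit
    beta = tf.Variable(tf.zeros([n]))                      # shift parameter, one per unit

    z = tf.matmul(x, W)                                    # pre-activation z
    mu, var = tf.nn.moments(z, axes=[0])                   # per-unit mini-batch mean/variance
    y = tf.nn.batch_normalization(z, mu, var, beta, gamma, variance_epsilon=1e-5)
    a = tf.nn.relu(y)                                      # activation is applied after BN
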
Advantages of batch normalization (BN)
  • Faster learning: normalizing the activation inputs speeds up training.
    • BN helps a model fit the data faster.
  • BN allows the optimizer to use a higher learning rate.
  • BN makes a model less dependent on weight initialization.
  • BN has a slight regularization effect, so dropout can sometimes be reduced or omitted.
Batch normalization on forward propagation
  • $n^{[l]}$: number of units at layer $l$
  • $m$: number of examples in a mini-batch (the mini-batch size)
  • $i=1,2,…,n^{[l]}$
  • $j=1,2,…,m$
  • $x_j^{[l-1]}\in \mathbb{R}^{n^{[l-1]}\times 1}$: input to layer $l$ for example $j$
  • $w_i^{[l]}\in \mathbb{R}^{1\times n^{[l-1]}}$
  • $b_i^{[l]}, \beta_i^{[l]}, \gamma_i^{[l]} \in \mathbb{R}$

(Step 0: $z_{i,j}^{[l]}\leftarrow w_i^{[l]}x_j^{[l-1]}+ \cancel{b_i^{[l]}} $)

Step 1: $\mu_{B,i}^{[l]}\leftarrow \frac{1}{m}\sum^{m}_{j=1}z_{i,j}^{[l]}$

Step 2: $(\sigma_{B,i}^{[l]})^2\leftarrow \frac{1}{m}\sum_{j=1}^{m}(z_{i,j}^{[l]}-\mu_{B,i}^{[l]})^2$

Step 3: $\hat{z}_{i,j}^{[l]} \leftarrow \frac{z_{i,j}^{[l]}-\mu_{B,i}^{[l]}}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}$

Step 4: $y_{i,j}^{[l]}\leftarrow\gamma_i^{[l]}\hat{z}_{i,j}^{[l]}+\beta_i^{[l]}=\textup{BN}_{\gamma_i^{[l]},\beta_i^{[l]}}(z_{i,j}^{[l]})$

(Step 5: $a_{i,j}^{[l]} \leftarrow \textup{activation}(y_{i,j}^{[l]})$)
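
The steps above can be written out directly; a from-scratch sketch assuming NumPy, a ReLU activation, and made-up shapes (rows index the units $i$, columns the examples $j$):

    import numpy as np

    rng = np.random.default_rng(0)
    n_prev, n, m = 4, 3, 8                       # n^{[l-1]}, n^{[l]}, mini-batch size m
    eps = 1e-5

    x = rng.standard_normal((n_prev, m))         # x_j^{[l-1]}: one column per example j
    W = rng.standard_normal((n, n_prev)) * 0.1   # row i is w_i^{[l]}
    gamma = np.ones((n, 1))                      # gamma_i^{[l]}
    beta = np.zeros((n, 1))                      # beta_i^{[l]}

    z = W @ x                                    # Step 0: z_{i,j}^{[l]}, bias omitted
    mu = z.mean(axis=1, keepdims=True)           # Step 1: mu_{B,i}^{[l]}
    var = z.var(axis=1, keepdims=True)           # Step 2: (sigma_{B,i}^{[l]})^2
    z_hat = (z - mu) / np.sqrt(var + eps)        # Step 3: normalize
    y = gamma * z_hat + beta                     # Step 4: scale and shift
    a = np.maximum(y, 0.0)                       # Step 5: ReLU activation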

$b_i^{[l]}$ does not affect the computation at all.
  • That is because $b_i^{[l]}$ cancels when $\mu_{B,i}^{[l]}$ is subtracted, as the derivation below shows.

(Step 0: $z_{i,j}^{[l]}\leftarrow w_i^{[l]}x_j^{[l-1]}+b_i^{[l]}$)

Step 1: $\mu_{B,i}^{[l]}\leftarrow \frac{1}{m}\sum^{m}_{j=1}z_{i,j}^{[l]}=\frac{1}{m}\sum^{m}_{j=1}(w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})=\frac{1}{m}\sum^{m}_{j=1}(w_i^{[l]}x_j^{[l-1]})+b_i^{[l]}$

Step 3:
$\hat{z}_{i,j}^{[l]}
\leftarrow
\frac{z_{i,j}^{[l]}-\mu_{B,i}^{[l]}}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}
=
\frac{(w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})-\left(\frac{1}{m}\sum^{m}_{j'=1}(w_i^{[l]}x_{j'}^{[l-1]})+b_i^{[l]}\right)}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}
=
\frac{w_i^{[l]}x_j^{[l-1]}-\frac{1}{m}\sum^{m}_{j'=1}(w_i^{[l]}x_{j'}^{[l-1]})}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}$

  • $\beta_i^{[l]}$ takes over the role of the bias $b_i^{[l]}$.
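
The cancellation is easy to check numerically; a small NumPy sketch (shapes and values are made up):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, eps = 3, 8, 1e-5
    z = rng.standard_normal((n, m))              # pre-activations computed without a bias
    b = rng.standard_normal((n, 1))              # some per-unit bias b_i^{[l]}

    def normalize(z):
        mu = z.mean(axis=1, keepdims=True)
        var = z.var(axis=1, keepdims=True)
        return (z - mu) / np.sqrt(var + eps)

    # The bias shifts z and mu by the same amount, so it cancels in z - mu.
    print(np.allclose(normalize(z), normalize(z + b)))   # prints True
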
Some activation functions reduce the need for batch normalization
  • ELU (Exponential Linear Unit)
  • SELU (Scaled Exponential Linear Unit), used in self-normalizing networks
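
For instance, a self-normalizing network typically combines SELU activations with LeCun-normal initialization and omits BN layers entirely; a hedged Keras-style sketch with made-up layer sizes:

    import tensorflow as tf

    # Self-normalizing setup: SELU activations with LeCun-normal initialization,
    # and no BatchNormalization layers in between.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="selu",
                              kernel_initializer="lecun_normal", input_shape=(64,)),
        tf.keras.layers.Dense(128, activation="selu",
                              kernel_initializer="lecun_normal"),
        tf.keras.layers.Dense(10),
    ])
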
Learning parameters with batch normalization
  • $w_i^{[l]}, \cancel{b_i^{[l]}}, \beta_i^{[l]}, \gamma_i^{[l]}$
    • $b_i^{[l]}$ has no effect on the computation, so it is dropped.
  • $\beta_i^{[l]}, \gamma_i^{[l]}$ are additional parameters to learn.
Gradient descent with batch normalization

for $t=1,2,…,(\textup{number of mini-batches})$

Compute forward propagation on the mini-batch $X_t\in \mathbb{R}^{n^{[0]} \times m}$.

Compute $\frac{\partial J}{\partial w_i^{[l]}}$, $\frac{\partial J}{\partial \beta_i^{[l]}}$, $\frac{\partial J}{\partial \gamma_i^{[l]}}$ using backpropagation. ($J$: the loss on the mini-batch)

Update parameters.

$w_i^{[l]}:=w_i^{[l]}-\alpha \frac{\partial J}{\partial w_i^{[l]}}$

$\beta_i^{[l]}:=\beta_i^{[l]}-\alpha \frac{\partial J}{\partial \beta_i^{[l]}}$

$\gamma_i^{[l]}:=\gamma_i^{[l]}-\alpha \frac{\partial J}{\partial \gamma_i^{[l]}}$
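
A minimal sketch of this training loop, assuming TensorFlow's automatic differentiation for the backward pass, a single hidden BN layer, a squared-error loss, and made-up data and sizes:

    import tensorflow as tf

    m, n_prev, n, n_batches = 32, 64, 16, 10               # made-up sizes
    alpha = 0.1                                             # learning rate
    X = tf.random.normal([n_batches * m, n_prev])           # hypothetical inputs
    T = tf.random.normal([n_batches * m, 1])                # hypothetical targets

    W = tf.Variable(tf.random.normal([n_prev, n]) * 0.01)   # w^{[l]} (no bias b)
    gamma = tf.Variable(tf.ones([n]))                       # gamma^{[l]}
    beta = tf.Variable(tf.zeros([n]))                       # beta^{[l]}
    W_out = tf.Variable(tf.random.normal([n, 1]) * 0.01)    # output-layer weights

    for t in range(n_batches):                              # loop over mini-batches X_t
        x_t, y_t = X[t * m:(t + 1) * m], T[t * m:(t + 1) * m]
        with tf.GradientTape() as tape:
            z = tf.matmul(x_t, W)                           # forward prop with BN
            mu, var = tf.nn.moments(z, axes=[0])
            a = tf.nn.relu(tf.nn.batch_normalization(z, mu, var, beta, gamma, 1e-5))
            J = tf.reduce_mean(tf.square(tf.matmul(a, W_out) - y_t))   # loss J
        grads = tape.gradient(J, [W, gamma, beta, W_out])   # dJ/dw, dJ/dgamma, dJ/dbeta, ...
        for p, g in zip([W, gamma, beta, W_out], grads):
            p.assign_sub(alpha * g)                         # p := p - alpha * dJ/dp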
