Batch Normalization | Summary

References

  • Sergey Ioffe, Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. [ICML][arXiv]
  • Lecture 6: Training Neural Networks, Part 1. CS231n: Convolutional Neural Networks for Visual Recognition. 48:52–1:04:39. [YouTube]
  • Choung young jae (2017-07-02). PR-021: Batch Normalization. YouTube. [YouTube]
  • tf.nn.batch_normalization. TensorFlow. [LINK]
  • Rui Shu (2016-12-27). A Gentle Guide to Using Batch Normalization in TensorFlow. [LINK]
  • tf.contrib.layers.batch_norm. TensorFlow. [LINK]
  • Deeplearning.ai (2017-08-25). Normalizing Activations in a Network (C2W3L04). YouTube. [YouTube]
  • Deeplearning.ai (2017-08-25). Fitting Batch Norm Into Neural Networks (C2W3L05). YouTube. [YouTube]
  • Deeplearning.ai (2017-08-25). Why Does Batch Norm Work? (C2W3L06). YouTube. [YouTube]
  • Deeplearning.ai (2017-08-25). Batch Norm at Test Time (C2W3L07). YouTube. [YouTube]

Keywords

internal covariate shift, data normalization, scaling, shifting


Summary

How to solve internal covariate shift?
  • Careful initialization: difficult
  • Small learning rate: slow
  • Batch normalization
What problem does batch normalization solve?
  • Internal covariate shift
Where to apply batch normalization?
  • Right before the activation function, i.e., to the pre-activation values rather than to the activations
  • Applied separately to each node (unit) of the layer
The learnable parameters introduced by batch normalization
  • Scale parameter \gamma
  • Shift parameter \beta
Advantages of batch normalization (BN)
  • Faster learning: normalizing the activation inputs speeds up training.
    • BN helps a model fit the data faster.
  • BN allows the optimizer to use a higher learning rate.
  • BN makes a model less dependent on weight initialization.
  • BN has a slight regularization effect, so dropout may be less necessary.
Batch normalization on forward propagation
  • n^{[l]}: #units at layer l
  • m: #examples in a mini-batch. mini-batch size.
  • i=1,2,...,n^{[l]}
  • j=1,2,...,m
  • w_i^{[l]}\in \mathbb{R}^{1\times n^{[l-1]}}: the weight (row) vector of unit i at layer l
  • b_i^{[l]}, \beta_i^{[l]}, \gamma_i^{[l]} \in \mathbb{R}: the bias, shift, and scale of unit i at layer l
  • x_j^{[l-1]}\in \mathbb{R}^{n^{[l-1]}\times 1}: the input to layer l (the activations of layer l-1) for example j

(Step 0: z_{i,j}^{[l]}\leftarrow w_i^{[l]}x_j^{[l-1]}+ \cancel{b_i^{[l]}})

Step 1: \mu_{B,i}^{[l]}\leftarrow \frac{1}{m}\sum^{m}_{j=1}z_{i,j}^{[l]}

Step 2: (\sigma_{B,i}^{[l]})^2\leftarrow \frac{1}{m}\sum_{j=1}^{m}(z_{i,j}^{[l]}-\mu_{B,i}^{[l]})^2

Step 3: \hat{z}_{i,j}^{[l]} \leftarrow \frac{z_{i,j}^{[l]}-\mu_{B,i}^{[l]}}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}

Step 4: y_{i,j}^{[l]}\leftarrow\gamma_i^{[l]}\hat{z}_{i,j}^{[l]}+\beta_i^{[l]}=\textup{BN}_{\gamma_i^{[l]},\beta_i^{[l]}}(z_{i,j}^{[l]})

(Step 5: a_{i,j}^{[l]} \leftarrow \textup{activation}(y_{i,j}^{[l]}))
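
A minimal NumPy sketch of Steps 1–4 for one layer, vectorized over its units (the function name batch_norm_forward, the (n^{[l]}, m) array layout, and eps = 1e-5 are illustrative assumptions, not taken from the references):

import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    # z     : (n_l, m) pre-activations of one layer for a mini-batch (Step 0, bias omitted)
    # gamma : (n_l, 1) scale parameters, one per unit
    # beta  : (n_l, 1) shift parameters, one per unit
    mu = z.mean(axis=1, keepdims=True)        # Step 1: per-unit mini-batch mean
    var = z.var(axis=1, keepdims=True)        # Step 2: per-unit mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)     # Step 3: normalize
    return gamma * z_hat + beta               # Step 4: scale and shift

# Step 5: feed the result into the activation, e.g. np.maximum(0.0, batch_norm_forward(z, gamma, beta)) for ReLU.

The referenced tf.nn.batch_normalization op performs the same Step 3–4 transform once the mini-batch mean, variance, offset (\beta), and scale (\gamma) are supplied.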

b_i^{[l]} has no effect on the computation.
  • That is because b_i^{[l]} is canceled out when the mini-batch mean \mu_{B,i}^{[l]} is subtracted, as shown below.

(Step 0: z_{i,j}^{[l]}\leftarrow w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})

Step 1: \mu_{B,i}^{[l]}\leftarrow \frac{1}{m}\sum^{m}_{j=1}z_{i,j}^{[l]}=\frac{1}{m}\sum^{m}_{j=1}(w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})=\frac{1}{m}\sum^{m}_{j=1}(w_i^{[l]}x_j^{[l-1]})+b_i^{[l]}

Step 3:
\hat{z}_{i,j}^{[l]} \leftarrow \frac{z_{i,j}^{[l]}-\mu_{B,i}^{[l]}}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}
= \frac{(w_i^{[l]}x_j^{[l-1]}+b_i^{[l]})-(\frac{1}{m}\sum^{m}_{j'=1}(w_i^{[l]}x_{j'}^{[l-1]})+b_i^{[l]})}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}
= \frac{w_i^{[l]}x_j^{[l-1]}-\frac{1}{m}\sum^{m}_{j'=1}(w_i^{[l]}x_{j'}^{[l-1]})}{\sqrt{(\sigma_{B,i}^{[l]})^2+\epsilon}}

  • \beta_i^{[l]} takes the role of the bias b_i^{[l]}.
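
The cancellation of b_i^{[l]} can also be checked numerically; a small sketch (the shapes and the helper name normalize are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 32))          # w x for 4 units over a mini-batch of 32 examples
b = rng.normal(size=(4, 1))           # a constant per-unit bias b_i

def normalize(z, eps=1e-5):           # Steps 1-3: subtract the mini-batch mean, divide by the std
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

# Adding b shifts the mini-batch mean by exactly b, so it cancels in Step 3.
print(np.allclose(normalize(z), normalize(z + b)))   # True
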
Some models do not need batch normalization, e.g., those using
  • ELU (Exponential Linear Unit)
  • SELU (Scaled Exponential Linear Unit), used in self-normalizing neural networks
Learning parameters with batch normalization
  • w_i^{[l]}, \cancel{b_i^{[l]}}, \beta_i^{[l]}, \gamma_i^{[l]}
    • b_i^{[l]} has no effect on the computation, so it is dropped.
  • \beta_i^{[l]}, \gamma_i^{[l]} are additional parameters to learn.
Gradient descent with batch normalization

for t=1,2,...,(\textup{number of mini-batches})

Run forward propagation with batch normalization on the mini-batch X_t\in \mathbb{R}^{n^{[0]} \times m}.

Compute \frac{\partial J}{\partial w_i^{[l]}}, \frac{\partial J}{\partial \beta_i^{[l]}}, \frac{\partial J}{\partial \gamma_i^{[l]}} using backpropagation.  (J: loss function)

Update parameters.

w_i^{[l]}:=w_i^{[l]}-\alpha \frac{\partial J}{\partial w_i^{[l]}}

\beta_i^{[l]}:=\beta_i^{[l]}-\alpha \frac{\partial J}{\partial \beta_i^{[l]}}

\gamma_i^{[l]}:=\gamma_i^{[l]}-\alpha \frac{\partial J}{\partial \gamma_i^{[l]}}
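
A runnable toy sketch of this training loop; the tiny two-layer network, the mean-squared-error cost, and the use of finite-difference gradients in place of a hand-coded backward pass are all illustrative assumptions, not the procedure from the references:

import numpy as np

rng = np.random.default_rng(1)
m, batch_size = 64, 16
X = rng.normal(size=(3, m))                         # n^[0] = 3 input features
Y = np.sin(X.sum(axis=0, keepdims=True))            # toy regression target

params = {"W1": 0.5 * rng.normal(size=(2, 3)),      # layer 1 weights (no bias b)
          "gamma": np.ones((2, 1)),                 # BN scale, one per unit
          "beta": np.zeros((2, 1)),                 # BN shift, one per unit
          "W2": 0.5 * rng.normal(size=(1, 2))}      # output layer weights

def cost(p, X_t, Y_t):
    z = p["W1"] @ X_t                               # Step 0 (bias omitted)
    mu = z.mean(axis=1, keepdims=True)              # Step 1
    var = z.var(axis=1, keepdims=True)              # Step 2
    z_hat = (z - mu) / np.sqrt(var + 1e-5)          # Step 3
    a = np.maximum(0.0, p["gamma"] * z_hat + p["beta"])  # Steps 4-5 (ReLU)
    return np.mean((p["W2"] @ a - Y_t) ** 2)        # loss J on the mini-batch

alpha, eps = 0.1, 1e-6                              # learning rate, finite-difference step
for epoch in range(20):
    for t in range(m // batch_size):                # t = 1, ..., number of mini-batches
        X_t = X[:, t * batch_size:(t + 1) * batch_size]
        Y_t = Y[:, t * batch_size:(t + 1) * batch_size]
        for name, value in list(params.items()):    # update w, gamma, beta
            grad = np.zeros_like(value)
            for idx in np.ndindex(value.shape):     # numerical dJ/d(parameter element)
                value[idx] += eps
                j_plus = cost(params, X_t, Y_t)
                value[idx] -= 2 * eps
                j_minus = cost(params, X_t, Y_t)
                value[idx] += eps
                grad[idx] = (j_plus - j_minus) / (2 * eps)
            params[name] = value - alpha * grad     # e.g. gamma := gamma - alpha * dJ/dgamma

Finite differences only keep the sketch short; in practice the gradients with respect to w, \beta, and \gamma come from backpropagation through the batch-norm step.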
