- Sergey Ioffe, Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. [ICML][arXiv]
- Lecture 6: Training Neural Networks, Part 1. CS231n: Convolutional Neural Networks for Visual Recognition. 48:52–1:04:39. [YouTube]
- Choung young jae (2017. 7. 2.). PR-021: Batch Normalization. [YouTube]
- tf.nn.batch_normalization. TensorFlow. [LINK]
- Rui Shu (27 Dec 2016). A Gentle Guide to Using Batch Normalization in TensorFlow. [LINK]
- tf.contrib.layers.batch_norm. TensorFlow. [LINK]
- Deeplearning.ai (2017. 8. 25.). Normalizing Activations in a Network (C2W3L04). [YouTube]
- Deeplearning.ai (2017. 8. 25.). Fitting Batch Norm Into Neural Networks (C2W3L05). [YouTube]
- Deeplearning.ai (2017. 8. 25.). Why Does Batch Norm Work? (C2W3L06). [YouTube]
- Deeplearning.ai (2017. 8. 25.). Batch Norm At Test Time (C2W3L07). [YouTube]
internal covariate shift, data normalization, scaling, shifting
How to solve internal covariate shift?
- Careful initialization: difficult
- Small learning rate: slow
- Batch normalization
What problem does batch normalization solve?
- Internal covariate shift
Where to apply batch normalization?
- Right before the activation function, i.e., on the pre-activation values $z$ (as sketched below)
- Applied independently to each node (unit) of that pre-activation step
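A minimal TensorFlow 2 sketch of this placement, assuming a single layer's pre-activation `z` (shapes are arbitrary; in practice $\gamma$ and $\beta$ are trained, and at test time running averages replace the batch statistics):

```python
import tensorflow as tf

z = tf.random.normal([64, 128])                # hypothetical pre-activations, (batch, units)
mean, variance = tf.nn.moments(z, axes=[0])    # per-unit statistics over the mini-batch
gamma = tf.Variable(tf.ones([128]))            # scale parameter (learned)
beta = tf.Variable(tf.zeros([128]))            # shift parameter (learned)
z_tilde = tf.nn.batch_normalization(z, mean, variance, beta, gamma, 1e-5)
a = tf.nn.relu(z_tilde)                        # activation is applied after BN
```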
The configuration of batch normalization
- Scale parameter $\gamma$
- Shift parameter $\beta$
Advantages of batch normalization (BN)
- Faster learning: normalizing the activation inputs speeds up learning.
- BN helps a model fit the data faster.
- BN allows the optimizer to use a higher learning rate.
- BN makes a model less dependent on weight initialization.
- BN has a regularizing effect, so dropout may not be necessary.
Batch normalization on forward propagation
- $n^{[l]}$: the number of units at layer $l$
- $m$: the number of examples in a mini-batch (the mini-batch size)
- Step 0: $z^{(i)} = W x^{(i)} + b$ (affine transform of the layer input)
- Step 1: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} z^{(i)}$ (mini-batch mean)
- Step 2: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} \left(z^{(i)} - \mu_B\right)^2$ (mini-batch variance)
- Step 3: $\hat{z}^{(i)} = \dfrac{z^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$ (normalize)
- Step 4: $\tilde{z}^{(i)} = \gamma \hat{z}^{(i)} + \beta$ (scale and shift)
- Step 5: $a^{(i)} = g(\tilde{z}^{(i)})$ (activation)
- The bias $b$ does not affect anything in terms of computation.
  - That is because $b$ is subtracted out when $\mu_B$ is subtracted in Step 3.
  - So Step 0 reduces to $z^{(i)} = W x^{(i)}$.
  - $\beta$ takes the role of the bias $b$.
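A minimal NumPy sketch of the forward pass above for one layer on a mini-batch (the function name, the shapes, and the choice of ReLU for $g$ are illustrative assumptions):

```python
import numpy as np

def bn_forward(x, W, gamma, beta, eps=1e-8):
    """One layer's forward pass with batch normalization.
    x: (m, n_prev) mini-batch of inputs to the layer
    W: (n_prev, n) weights; gamma, beta: (n,) learned scale and shift."""
    z = x @ W                                   # Step 0: affine transform (no bias b)
    mu = z.mean(axis=0)                         # Step 1: per-unit mini-batch mean
    var = z.var(axis=0)                         # Step 2: per-unit mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)       # Step 3: normalize
    z_tilde = gamma * z_hat + beta              # Step 4: scale and shift
    a = np.maximum(z_tilde, 0.0)                # Step 5: activation g (ReLU here)
    cache = (z_hat, gamma, np.sqrt(var + eps))  # saved for back propagation
    return a, cache
```

At test time the mini-batch statistics would be replaced by running averages (see "Batch Norm At Test Time" in the references).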
Some models do not need batch normalization
- ELU (Exponential Linear Unit)
- SELU (Scaled Exponential Linear Unit)
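For reference, a sketch of SELU (the constants are the approximate values from the self-normalizing networks paper); its fixed scale and $\alpha$ are what keep activations roughly normalized without an explicit BN step:

```python
import numpy as np

def selu(x, alpha=1.6733, scale=1.0507):
    # scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise;
    # np.minimum keeps exp() from overflowing in the branch that is not used.
    return scale * np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))
```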
Learning parameters with batch normalization
- The bias $b$ does not affect anything in terms of computation.
- $\gamma$ and $\beta$ are additional parameters to learn.
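A small sketch of the resulting parameter set per layer (the helper name and the He-style initialization are assumptions for illustration): the bias is dropped, while $\gamma$ and $\beta$ are learned alongside $W$.

```python
import numpy as np

def init_bn_layer(n_prev, n):
    # With batch normalization there is no bias b; gamma and beta are the
    # additional learnable parameters, commonly initialized to 1 and 0.
    return {
        "W": np.random.randn(n_prev, n) * np.sqrt(2.0 / n_prev),  # He-style init
        "gamma": np.ones(n),
        "beta": np.zeros(n),
    }
```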
Gradient descent with batch normalization
Compute forward propagation on the mini-batch $X^{\{t\}}$.
Compute $\frac{\partial \mathcal{L}}{\partial W}$, $\frac{\partial \mathcal{L}}{\partial \gamma}$, $\frac{\partial \mathcal{L}}{\partial \beta}$ using back propagation. ($\mathcal{L}$: loss function)
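Under the same assumptions as the forward sketch above, a NumPy sketch of back propagation through the BN transform, giving the gradients for $\gamma$ and $\beta$ used in the mini-batch update (the update lines and learning rate alpha are illustrative):

```python
import numpy as np

def bn_backward(dz_tilde, cache):
    """Back-propagate through the batch-norm transform of one layer.
    dz_tilde: gradient of the loss w.r.t. the scaled/shifted output z~.
    cache:    (z_hat, gamma, std) saved by bn_forward above."""
    z_hat, gamma, std = cache
    dgamma = np.sum(dz_tilde * z_hat, axis=0)   # gradient for the scale parameter
    dbeta = np.sum(dz_tilde, axis=0)            # gradient for the shift parameter
    dz_hat = dz_tilde * gamma
    # Gradient w.r.t. the pre-BN values z, going through the mean and variance as well
    dz = (dz_hat - dz_hat.mean(axis=0) - z_hat * (dz_hat * z_hat).mean(axis=0)) / std
    return dz, dgamma, dbeta

# One gradient-descent step per mini-batch (alpha is the learning rate):
#   W     -= alpha * dW
#   gamma -= alpha * dgamma
#   beta  -= alpha * dbeta
```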