References
 Sergey Ioffe, Christian Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015. [ICML][arXiv]
 Lecture 6: Training Neural Networks, Part 1. CS231n: Convolutional Neural Networks for Visual Recognition. 48:52~1:04:39 [YouTube]
 Choung young jae (2017-07-02). PR-021: Batch Normalization. [YouTube]
 tf.nn.batch_normalization. TensorFlow. [LINK]
 Rui Shu (2016-12-27). A Gentle Guide to Using Batch Normalization in TensorFlow. [LINK]
 tf.contrib.layers.batch_norm. TensorFlow. [LINK]
 Deeplearning.ai (2017-08-25). Normalizing Activations in a Network (C2W3L04). [YouTube]
 Deeplearning.ai (2017-08-25). Fitting Batch Norm Into Neural Networks (C2W3L05). [YouTube]
 Deeplearning.ai (2017-08-25). Why Does Batch Norm Work? (C2W3L06). [YouTube]
 Deeplearning.ai (2017-08-25). Batch Norm At Test Time (C2W3L07). [YouTube]
Keywords
internal covariate shift, data normalization, scaling, shifting
Summary
How to solve internal covariate shift?
 Careful initialization: difficult
 Small learning rate: slow
 Batch normalization
What problem does batch normalization solve?
 Internal covariate shift
Where to apply batch normalization?
 Right before the activation function, i.e., to the pre-activation values
 Applied independently to each unit (node) of the layer
The configurations of batch normalization
 Scale parameter
 Shift parameter
Advantages of batch normalization (BN)
 Faster learning: normalizing the activation inputs speeds up training.
 BN helps a model fit the data faster.
 BN allows the optimizer to use a higher learning rate.
 BN makes a model less dependent on weight initialization.
 BN has a slight regularization effect, so dropout becomes less necessary.
Batch normalization on forward propagation
 n: number of units at the layer
 m: number of examples in a minibatch (minibatch size)
(Step 0: z^(i) = W x^(i) + b)
Step 1: mu = (1/m) * sum_i z^(i)
Step 2: sigma^2 = (1/m) * sum_i (z^(i) - mu)^2
Step 3: z_hat^(i) = (z^(i) - mu) / sqrt(sigma^2 + epsilon)
Step 4: z_tilde^(i) = gamma * z_hat^(i) + beta
(Step 5: a^(i) = g(z_tilde^(i)))
The bias b does not affect anything in terms of computation.
 That is because b is subtracted out when the mean mu is subtracted in Step 3.
(Step 0: z^(i) = W x^(i), with the bias b removed)
Step 1: mu = (1/m) * sum_i z^(i)
Step 3: z_hat^(i) = (z^(i) - mu) / sqrt(sigma^2 + epsilon)
 beta takes the role of the bias b.
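The forward steps above can be sketched in NumPy (the function and variable names are my own, and epsilon is the usual small constant for numerical stability); the final assertion illustrates why a bias b added before BN cancels out in Step 3:

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Batch-normalize pre-activations z of shape (m, n):
    m examples in the minibatch, n units in the layer."""
    mu = z.mean(axis=0)                    # Step 1: per-unit minibatch mean
    var = z.var(axis=0)                    # Step 2: per-unit minibatch variance
    z_hat = (z - mu) / np.sqrt(var + eps)  # Step 3: normalize
    return gamma * z_hat + beta            # Step 4: scale and shift

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))    # minibatch: m=8 examples, n=4 units
b = rng.normal(size=(1, 4))    # a per-unit bias
gamma, beta = np.ones(4), np.zeros(4)

# b is subtracted out together with the mean in Step 3,
# so adding it before batch normalization changes nothing:
assert np.allclose(batch_norm_forward(z, gamma, beta),
                   batch_norm_forward(z + b, gamma, beta))
```

With gamma = 1 and beta = 0, the output of each unit has (approximately) zero mean and unit variance over the minibatch.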
Some models do not need batch normalization
 ELU (Exponential Linear Unit)
 SELU (Scaled Exponential Linear Unit, used in self-normalizing networks)
Learning parameters with batch normalization
 For each layer, the parameters to learn are W, gamma, and beta.
 The bias b does not affect anything in terms of computation (it cancels during normalization), so it can be removed.
 gamma and beta are additional parameters to learn.
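A minimal initialization sketch under these assumptions (the function name and He-style weight scale are my own choices): the bias b is simply not allocated, gamma starts at 1 and beta at 0, so the layer initially passes the normalized values through unchanged:

```python
import numpy as np

def init_bn_layer(n_in, n_out, seed=0):
    """Parameters of a fully connected layer followed by batch normalization.
    No bias b: beta takes its role."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)),  # He-style init
        "gamma": np.ones(n_out),   # learned scale, starts as identity
        "beta": np.zeros(n_out),   # learned shift, starts at zero
    }

params = init_bn_layer(3, 5)
```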
Gradient descent with batch normalization
for each minibatch X^{t}, t = 1, ..., T:
 Compute forward propagation on the minibatch X^{t}.
 Compute dW, dgamma, dbeta using back propagation. (J: loss function)
 Update the parameters: W := W - alpha * dW, gamma := gamma - alpha * dgamma, beta := beta - alpha * dbeta.
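As a sketch of the back-propagation step (my own NumPy code, using the standard compact form of the batch-norm gradient; np.var uses the 1/m convention, matching Step 2):

```python
import numpy as np

def bn_forward(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                     # Step 1
    var = z.var(axis=0)                     # Step 2
    z_hat = (z - mu) / np.sqrt(var + eps)   # Step 3
    out = gamma * z_hat + beta              # Step 4
    return out, (z_hat, var, gamma, eps)

def bn_backward(dout, cache):
    """Given dJ/d(out), return dJ/dz, dJ/dgamma, dJ/dbeta."""
    z_hat, var, gamma, eps = cache
    m = dout.shape[0]
    dbeta = dout.sum(axis=0)                 # beta enters additively
    dgamma = (dout * z_hat).sum(axis=0)      # gamma multiplies z_hat
    dz_hat = dout * gamma
    # Compact chain rule through Steps 1-3: mu and sigma^2 depend on
    # every example in the minibatch, hence the summed correction terms.
    dz = (1.0 / (m * np.sqrt(var + eps))) * (
        m * dz_hat
        - dz_hat.sum(axis=0)
        - z_hat * (dz_hat * z_hat).sum(axis=0)
    )
    return dz, dgamma, dbeta

# One gradient-descent update (alpha: learning rate), as in the loop above:
# W := W - alpha * dW; gamma := gamma - alpha * dgamma; beta := beta - alpha * dbeta
```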