Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization | Deep Learning Specialization | Coursera

Brief information
  • Course name: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
  • Instructor: Andrew Ng
  • Institution:
  • Media: Coursera
  • Specialization: Deep Learning
  • Duration: 3 weeks

About this Course

This course will teach you the “magic” of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.

After 3 weeks, you will:

  • Understand industry best-practices for building deep learning applications.
  • Be able to effectively use the common neural network “tricks”, including initialization, L2 and dropout regularization, Batch normalization, and gradient checking.
  • Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence.
  • Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance
  • Be able to implement a neural network in TensorFlow.

This is the second course of the Deep Learning Specialization.


Week 1: Practical aspects of Deep Learning
  1. C2W1L01 Video: Train / Dev / Test sets
  2. C2W1L02 Video: Bias / Variance
  3. C2W1L03 Video: Basic Recipe for Machine Learning
  4. C2W1L04 Video: Regularization
  5. C2W1L05 Video: Why regularization reduces overfitting?
  6. C2W1L06 Video: Dropout Regularization
  7. C2W1L07 Video: Understanding Dropout
  8. C2W1L08 Video: Other regularization methods
  9. C2W1L09 Video: Normalizing inputs
  10. C2W1L10 Video: Vanishing / Exploding gradients
  11. C2W1L11 Video: Weight Initialization for Deep Networks
  12. C2W1L12 Video: Numerical approximation of gradients
  13. C2W1L13 Video: Gradient checking
  14. C2W1L14 Video: Gradient Checking Implementation Notes
  15. C2W1L15 Video: Yoshua Bengio interview
  • C2W1Q1 Graded: Practical aspects of deep learning
Programming assignments
  • C2W1P1 Notebook: Initialization
  • C2W1P1 Graded: Initialization
  • C2W1P2 Notebook: Regularization
  • C2W1P2 Graded: Regularization
  • C2W1P3 Notebook: Gradient Checking
  • C2W1P3 Graded: Gradient Checking
Week 2: Optimization algorithms
  1. C2W2L01 Video: Mini-batch gradient descent
  2. C2W2L02 Video: Understanding mini-batch gradient descent
  3. C2W2L03 Video: Exponentially weighted averages
  4. C2W2L04 Video: Understanding exponentially weighted averages
  5. C2W2L05 Video: Bias correction in exponentially weighted averages
  6. C2W2L06 Video: Gradient descent with momentum
  7. C2W2L07 Video: RMSprop
  8. C2W2L08 Video: Adam optimization algorithm
  9. C2W2L09 Video: Learning rate decay
  10. C2W2L10 Video: The problem of local optima
  11. C2W2L11 Video: Yuanqing Lin interview
  • C2W2Q1 Graded: Optimization algorithms
Programming assignment
  • C2W2P1 Notebook: Optimization
  • C2W2P1 Graded: Optimization
Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks
  1. Video: Tuning process
  2. Video: Using an appropriate scale to pick hyperparameters
  3. Video: Hyperparameters tuning in practice: Pandas vs. Caviar
  4. Video: Normalizing activations in a network
  5. Video: Fitting Batch Norm into a neural network
  6. Video: Why does Batch Norm work?
  7. Video: Batch Norm at test time
  8. Video: Softmax Regression
  9. Video: Training a softmax classifier
  10. Video: Deep learning frameworks
  11. Video: TensorFlow
  • Graded: Hyperparameter tuning, Batch Normalization, Programming Frameworks
Programming assignments
  • Notebook: Tensorflow
  • Graded: Tensorflow

C2W1 Practical aspects of Deep Learning

Learning objectives
  • Recall that different types of initializations lead to different results.
  • Recognize the importance of initialization in complex neural networks.
  • Recognize the difference between train/dev/test sets.
  • Diagnose the bias and variance issues in your model.
  • Learn when and how to use regularization methods such as dropout or L2 regularization.
  • Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them.
  • Use gradient checking to verify the correctness of your backpropagation implementation.

C2W1L01 Train / Dev / Test sets

Applied machine learning is a highly iterative process.

  • What I’ve seen is that intuitions from one domain or from one application area often do not transfer to other application areas.
  • So one of the things that determine how quickly you can make progress is how efficiently you can go around this cycle.
Train/dev/test sets
  • Traditionally: Size(TotalSet) = about 100~10,000
    • (train) : dev : test = 70 : 30 : 0 or 60 : 20 : 20
  • Big data: Size(TotalSet) \ge 1,000,000
    • (train) : dev : test = 98 : 1 : 1 or 99 : 0.5 : 0.5 or 99.5 : 0.25 : 0.25
  • It might be okay not to have a test set.
    • In this case, the dev set is commonly called ‘test set’.
    • This term is not quite right, because we tune our models against the dev set and so they end up fitted to it. A true test set must stay unseen by the models under development.
Mismatch train and dev distributions
  • Training set: High resolution images from web pages
  • Dev set: Low resolution images from mobile phones
  • Then, the distributions of the training and dev sets are different.

C2W1L02 Bias / Variance

  • Bayes optimal error: the theoretical minimum error any model can attain
  • Bias ≈ (training error) – (Bayes optimal error); if the Bayes optimal error is close to 0, bias ≈ (training error)
    • Bias from the true ideal model
  • Variance = (dev error) – (training error)
    • Variance of the models we designed
High/low bias/variance means?
  • High bias
  • Low bias
  • High variance
  • Low variance

C2W1L03 Basic Recipe for Machine Learning

The diagram of the basic recipe for machine learning

(Diagram in this lecture)

Bias-variance tradeoff in the pre-deep learning era and the deep learning era
  • In the pre-deep learning era, we didn’t have many tools that just reduce bias or that just reduce variance without hurting the other one.
  • Bias-variance tradeoff says that
    • bias increases if and only if variance decreases, and
    • bias decreases if and only if variance increases.
  • In the modern deep learning, big data era, we now have tools to drive down bias and just drive down bias, or drive down variance and just drive down variance, without really hurting the other thing that much. This has been one of the big reasons that deep learning has been useful for supervised learning.
    • Getting a bigger network almost always just reduces your bias without necessarily hurting your variance.
    • Getting more data pretty much always reduces your variance and doesn’t hurt your bias much.

C2W1L04 Regularization

L2 regularization
Loss with L2 regularization

    \[ J(\textbf{w},b)=\frac{1}{m}\sum^{m}_{i=1} L\left ( \hat{y}^{(i)},{y}^{(i)} \right ) + \frac{\lambda}{2m} \left \| \textbf{w} \right \|_2^2 \]

  • With L2 regularization, the weights of the model decay at every update.
  • So, the method of L2 regularization is also called ‘weight decay’.
  • \lambda is called a ‘regularization parameter’.
Gradient descent update rule with L2 regularization

    \[ w:=\left (1-\alpha \frac{\lambda}{m} \right )w-\alpha\left [\frac{1}{m}\sum_{i=1}^{m} \frac{\partial L^{(i)}}{\partial w} \right ] \]

  • \alpha > 0 and \lambda > 0, with 0 < \alpha \frac{\lambda}{m} < 1.
  • Mostly, \alpha \frac{\lambda}{m} is close to 0 rather than 1, so each update shrinks w only slightly.
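
The weight-decay view of the update can be sketched in a few lines of NumPy. This is only an illustration: the function names and the toy numbers are assumptions, and `dW_data` is assumed to be the gradient of the averaged data loss (the 1/m factor already applied).

```python
import numpy as np

def l2_cost(data_loss, W, lam, m):
    """Averaged data loss plus the L2 penalty (lambda / 2m) * ||W||^2."""
    return data_loss + (lam / (2 * m)) * np.sum(np.square(W))

def l2_update(W, dW_data, alpha, lam, m):
    """One gradient step: shrink W by (1 - alpha*lambda/m), then apply the data gradient."""
    return (1 - alpha * lam / m) * W - alpha * dW_data

# Toy values (hypothetical): with a zero data gradient only the decay acts.
W = np.array([1.0, -2.0])
W_new = l2_update(W, dW_data=np.zeros(2), alpha=0.1, lam=0.5, m=10)
cost = l2_cost(0.3, W, lam=0.5, m=10)
```

With a zero data gradient, every weight is multiplied by the same factor slightly below 1, which is why the method is called weight decay.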
L1 regularization
Loss with L1 regularization

    \[ J(\textbf{w},b)=\frac{1}{m}\sum^{m}_{i=1} L\left ( \hat{y}^{(i)},{y}^{(i)} \right ) + \frac{\lambda}{m} \left \| \textbf{w} \right \|_1 \]

  • After L1 regularization, the model ends up with sparse weights, which results in compressing the model.
Gradient descent update rule with L1 regularization

    \[ w:=w-\alpha\left [\frac{1}{m}\sum_{i=1}^{m} \frac{\partial L^{(i)}}{\partial w} \right ] -\alpha \frac{\lambda}{m} \textup{sign}(w) \]

C2W1L05 Why regularization reduces overfitting?

C2W1L06 Dropout regularization

  • Inverted dropout ensures that the expected value of a3 remains the same
Dropout implementation: Inverted dropout

  • Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched.
  • “we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix”.

from the CS231n notes
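
A minimal sketch of inverted dropout for one layer, assuming NumPy and a hypothetical activation matrix `a3`. The function names are illustrative, not taken from the course code.

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob, rng):
    """Drop units with probability 1 - keep_prob and rescale the survivors."""
    mask = rng.random(a.shape) < keep_prob   # True for units that are kept
    return a * mask / keep_prob, mask        # scaling keeps E[a] unchanged

def inverted_dropout_backward(da, mask, keep_prob):
    """Apply the same mask and the same scaling to the upstream gradient."""
    return da * mask / keep_prob

rng = np.random.default_rng(0)
a3 = np.ones((50, 2000))                     # hypothetical activations
a3_dropped, mask = inverted_dropout_forward(a3, keep_prob=0.8, rng=rng)
print(a3_dropped.mean())                     # stays close to the original mean of 1.0
```

Because the scaling happens at training time, the test-time forward pass needs no change at all.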

C2W1L07 Understanding dropout

Intuition of why dropout works

Dropout makes a neural network not rely on any one feature.

Dropout prevents a neural network from setting a high weight to any one feature. Thus, dropout makes the weights shrink, which finally results in regularization. Dropout can be thought of as an adaptive form of L2 regularization.

Different keep probability for each layer
  • A layer that has a large weight matrix tends to overfit, so use a lower keep_prob, for example about 0.5.
  • A layer that has a small weight matrix tends not to overfit, so use a higher keep_prob, for example about 0.7.
  • We usually do not apply dropout to the input layer, i.e., we set keep_prob to 1.0. Sometimes we set keep_prob to 0.9, close to 1.0.
Benefit of dropout
  • There are few reasons not to apply dropout.
Downside of dropout
  • We cannot clearly define the objective function for the training model.
Observe whether the loss decreases
  • While training, to observe the loss, we calculate the loss with keep_prob 1.0. This loss is the thing to observe.

C2W1L08 Other regularization methods

Method 1: Data augmentation
  • Data augmentation ⊂ getting more data
Data augmentation examples
  • Right-left flipping
  • Rotation
  • Zoom in/out
  • Distortion that can happen in the real application


Method 2: Early stopping
Downside of early stopping
  • We cannot follow orthogonalization of reducing bias and variance. The orthogonalization consists of two steps. (Orthogonalization: To think one task at a time)
    • First, reduce bias to Bayes error or human-level performance.
    • Second, reduce variance to perform well on the development set.
Benefit of early stopping
  • Training might take a shorter time.
  • We do not need to spend a long time searching for good regularization hyperparameters.

C2W1L09 Normalizing inputs

Normalizing training sets
  • Normalize the training, development, and test sets with the same \mu and \sigma^2, computed on the training set; do not normalize each set with its own statistics.

Why normalize inputs?
  • Normalizing inputs is one method to speed up optimization.
  • If the scales, i.e., variances, of the features are very different, normalizing inputs is important.

  • Normalizing inputs never hurts your optimization. Therefore, normalize inputs in all the cases.
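
A small sketch of input normalization, assuming NumPy, the course layout of X as (features, examples), and that the statistics are estimated on the training set and reused for the other sets. The helper names and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation, computed on the training set only."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Shift and scale X with the given statistics; eps guards against sigma == 0."""
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales.
X_train = rng.normal(loc=[[5.0], [-300.0]], scale=[[0.1], [40.0]], size=(2, 500))
X_test = rng.normal(loc=[[5.0], [-300.0]], scale=[[0.1], [40.0]], size=(2, 100))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)      # same mu and sigma as for training
```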

C2W1L10 Vanishing / exploding gradients

When do the problems of vanishing gradients or exploding gradients happen?
  • The problems happen when we train a deep neural network.
Why do the problems occur?
  • The gradients of lower layers, which are close to the inputs, are computed from multiplying the gradients of the higher layers. In this setting,
    • if most gradients are between 0 and 1, the gradients of lower layers will vanish (vanishing gradients), and
    • if most gradients are greater than 1, the gradients of lower layers will explode (exploding gradients).
  • There exists a neural network with about 152 layers.
Why are the problems problems?
  • Vanishing gradients slow down optimization.
  • Exploding gradients lead to failing to find a local optimum.

C2W1L11 Weight Initialization for Deep Networks

Weight Initialization is a partial solution to the vanishing/exploding gradients problem.

Initializing the weights
For a tanh activation function,

  • This can be used if E[x_i]=E[w_i]=0 and x_i and w_i
For a ReLU activation function,

For the Xavier initialization,

Initializing the biases
  • It is possible and common to initialize the biases to be zero, since the asymmetry breaking is provided by the small random numbers in the weights.
  • It is common to simply use 0 bias initialization.
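
As a hedged illustration of variance-scaled initialization, the sketch below draws Gaussian weights with variance 2/n^{[l-1]} for ReLU layers (He initialization) and 1/n^{[l-1]} for tanh layers (one common form of Xavier initialization), and sets all biases to zero. The function name, its arguments, and the layer sizes are assumptions, not the notes' own code.

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu", seed=0):
    """Return a dict with W1, b1, ..., WL, bL for the given layer sizes."""
    rng = np.random.default_rng(seed)
    numerator = 2.0 if activation == "relu" else 1.0   # He vs. Xavier-style scaling
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_cur = layer_dims[l - 1], layer_dims[l]
        scale = np.sqrt(numerator / n_prev)
        params[f"W{l}"] = rng.standard_normal((n_cur, n_prev)) * scale
        params[f"b{l}"] = np.zeros((n_cur, 1))          # zero biases are fine
    return params

params = initialize_parameters([1000, 500, 1], activation="relu")
print(params["W1"].std())   # should be near sqrt(2 / 1000)
```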
Supplementary for understanding the weight initialization methods
  • Weight Initialization – CS231n [LINK]
    • Good material!
  • Understanding Neural Network Weight Initialization – Intoli [LINK]
  • On weight initialization in deep neural networks. S. K. Kumar. arXiv. 2017 [LINK]

C2W1L12 Numerical approximation of gradients

This topic is for gradient checking.

The first-order approximate gradient

    \[ f'(\theta)\approx\frac{f(\theta + \epsilon)-f(\theta)}{\epsilon} \]

  • Error: \left | f'(\theta)-\frac{f(\theta + \epsilon)-f(\theta)}{\epsilon} \right | \in O(\epsilon)
The second-order approximate gradient

    \[ f'(\theta)\approx\frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{2\epsilon} \]

  • Error: \left | f'(\theta)-\frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{2\epsilon} \right | \in O(\epsilon^2)
    • The order of the error is \epsilon^2. Thus, the error term diminishes faster as \epsilon approaches 0 than in the first-order case.
    • Therefore, we use the second-order approximate gradient as the numerical approximation of gradients.
Numerical approximation of gradients

    \[ f'(\theta)\approx\frac{f(\theta + \epsilon)-f(\theta - \epsilon)}{2\epsilon} \]

  • My question: How small do we have to set \epsilon?
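
To see how the two approximations behave, here is a small experiment with the assumed test function f(\theta) = \theta^3 at \theta = 1, where the true derivative is 3. The function names are illustrative.

```python
def forward_difference(f, theta, eps):
    """One-sided approximation (f(theta + eps) - f(theta)) / eps."""
    return (f(theta + eps) - f(theta)) / eps

def central_difference(f, theta, eps):
    """Two-sided approximation (f(theta + eps) - f(theta - eps)) / (2 * eps)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

def f(theta):
    return theta ** 3

theta, true_grad = 1.0, 3.0
for eps in (1e-1, 1e-2, 1e-3):
    err_one = abs(forward_difference(f, theta, eps) - true_grad)
    err_two = abs(central_difference(f, theta, eps) - true_grad)
    print(f"eps={eps:g}  one-sided error={err_one:.2e}  two-sided error={err_two:.2e}")
```

For every \epsilon in the loop, the two-sided error is far smaller than the one-sided error and shrinks faster as \epsilon decreases.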

C2W1L13 Gradient checking

  • Gradient checking (grad check) is used to verify whether your implementation of backpropagation is correct. Andrew Ng said it really helps to find bugs in your implementation of backpropagation.
The metric to measure how the implemented and approximated gradients are different.

    \[ p=\frac{\left \| d\theta - d\theta_{\textup{aprx}} \right \|}{\left \| d\theta \right \|+\left \| d\theta_{\textup{aprx}} \right \|}\in[0,1] \]

derived from

    \[ 0\le{\left \| d\theta - d\theta_{\textup{aprx}} \right \|}\le{\left \| d\theta \right \|+\left \| d\theta_{\textup{aprx}} \right \|} \]

  • p \approx 10^{-7} ⇒ Great! Nothing more to do.
  • p \approx 10^{-5} ⇒ Okay, but double-check the implementation for bugs.
  • p \approx 10^{-3} ⇒ Bad. Find the bugs in the implementation.
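
A minimal sketch of the check, assuming NumPy, a cost function J that takes a flat parameter vector, and a toy quadratic cost whose gradient is known in closed form. All names here are illustrative assumptions.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-7):
    """Two-sided numerical gradient of J, one component at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        grad[i] = (J(plus) - J(minus)) / (2 * eps)
    return grad

def relative_difference(d_theta, d_theta_approx):
    """The metric p; assumes the two gradients are not both zero."""
    numerator = np.linalg.norm(d_theta - d_theta_approx)
    denominator = np.linalg.norm(d_theta) + np.linalg.norm(d_theta_approx)
    return numerator / denominator

def J(theta):
    return 0.5 * np.sum(theta ** 2)          # toy cost; its exact gradient is theta

theta = np.array([1.0, -2.0, 3.0])
approx = numerical_gradient(J, theta)
p_correct = relative_difference(theta, approx)       # correct "backprop" gradient
p_buggy = relative_difference(2 * theta, approx)     # deliberately wrong gradient
print(p_correct, p_buggy)
```

The correct gradient gives a tiny p, while the deliberately doubled gradient gives a p far above the 10^{-3} level.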

C2W1L14 Gradient checking implementation notes

  • Don’t use gradient checking in training. Use it to debug.
  • If gradients have a problem, then look at the components that construct d\theta in order to try to identify bugs.
  • Don’t forget regularization terms.
  • Gradient checking does not work with dropout. Turn dropout off (set keep_prob=1) while gradient checking.
  • (Rarely used) Run at random initialization and do gradient checking.

C2W1Q1 Practical aspects of deep learning

8/10 correct.

C2W1 Programming assignments

Welcome to the first assignment of the hyper parameters tuning specialization. It is very important that you regularize your model properly because it could dramatically improve your results.

By completing this assignment you will:

  • Understand the different regularization methods that could help your model.
  • Implement dropout and see it work on data.
  • Recognize that a model without regularization gives you a better accuracy on the training set but not necessarily on the test set.
  • Understand that you could use both dropout and regularization on your model.

This assignment prepares you well for the upcoming assignment. Take your time to complete it and make sure you get the expected outputs when working through the different exercises. In some code blocks, you will find a “#GRADED FUNCTION: functionName” comment. Please do not modify it. After you are done, submit your work and check your results. You need to score 80% to pass. Good luck 🙂 !

C2W1P1: Initialization

Initialization 1: Zero initialization
  • The weights W^{[l]} should be initialized randomly to break symmetry.
  • It is however okay to initialize the biases b^{[l]} to zeros. Symmetry is still broken so long as W^{[l]} is initialized randomly.
Initialization 2: Large random weight initialization
  • With large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples. This incurs a very high loss.
  • Initializing weights to very large random values does not work well.
  • Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!
Initialization 3: Small random weight initialization
  • The model with He initialization separates the blue and the red dots very well in a small number of iterations.
Comparing 3 types of initialization
Initialization | Train accuracy | Problem/Comment
3-layer NN with zeros initialization | 50% | fails to break symmetry
3-layer NN with large random initialization | 83% | too large weights
3-layer NN with He initialization | 99% | recommended method
What you should remember from this notebook:
  • Different ways of initialization lead to different results.
  • Random initialization is used to break symmetry and make sure different hidden units can learn different things.
  • Don’t initialize to values that are too large.
  • He initialization works well for networks with ReLU activations.

30/30. passed.

C2W1P2: Regularization

1 – Non-regularized model
  • \lambda=0
  • keep_prob=1.0
2 – L2 Regularization
What is L2-regularization actually doing?:
  • L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights.
  • Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values.
  • This leads to a smoother model in which the output changes more slowly as the input changes.
3.1 – Forward propagation with dropout

Dropout randomly shuts down some neurons in each iteration

  • The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons.
  • With dropout, your neurons thus become less sensitive to the activation of one other specific neuron.
3.2 – Backward propagation with dropout
What you should remember about dropout:
  • Dropout is a regularization technique.
  • You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
  • Apply dropout both during forward and backward propagation. You should store the dropped units for backward propagation.
  • [Inverted dropout] During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
  • Score: 40/40
  • Passed?: Yes

C2W1P3: Gradient checking

1) How does gradient checking work?
  • As described in the previous lectures.
2) 1-dimensional gradient checking
  • \epsilon = 10^{-7}
  • The threshold of difference: 10^{-7}
3) N-dimensional gradient checking
  • \epsilon = 10^{-7}
  • The threshold of difference: 10^{-7}
  • Matrix computation
  • Score: 40/40
  • Passed?: Yes

C2W2 Optimization algorithms

Learning Objectives
  • Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam.
  • Use random minibatches to accelerate the convergence and improve the optimization
  • Know the benefits of learning rate decay and apply it to your optimization.

C2W2L01 Mini-batch gradient descent

  • x^{(i)}: the i-th training example
  • z^{[l]}: the value z of the l-th layer
  • x^{\{ t \}}: the t-th mini-batch
Gradient descent algorithms
  • Batch gradient descent
  • Mini-batch gradient descent
  • Stochastic gradient descent
  • Batch
  • Epoch
Mini-batch gradient descent
  • When we have a large training set, mini-batch gradient descent is faster than batch gradient descent.
  • To fill

C2W2L02 Understanding mini-batch gradient descent

3 types of gradient descent algorithms

Let m be the number of the total training examples.

 Property | Batch | Mini-batch | Stochastic
 Batch size | m | 1 < (batch size) < m | 1
 Vectorizing / distributed computing (O: yes, X: no) | O | O | X
 One iteration time | too long | tolerable | short
 Convergence | O | oscillates close to the local minimum | oscillates far from the local minimum

Guideline: How to choose mini-batch size
  1. training set is small (m \le 2000) ⇒ batch_size = m
  2. training set is big ⇒ typical batch_size = 64, 128, 256, 512
    • If batch_size is a power of 2, it makes code run fast because of fast memory access.
    • You have to choose batch_size among those sizes. batch_size is a hyperparameter that affects the speed of training. Try several values and find out which batch_size makes your training fastest. Then, use it.
    • Make sure the mini-batch fits in your CPU/GPU memory.

C2W2L03 Exponentially weighted averages

This lecture introduces exponentially weighted averages, a building block of optimization algorithms that are faster than gradient descent.

Exponentially weighted averages: Definition

Let \theta_t be an optimizing parameter at time step t. t=0,1,2,...

Let v_t be the exponentially weighted average at time step t. v_t is defined as follows.

    \[ v_t = \beta v_{t-1} + (1-\beta)\theta_t \]

\beta (0<\beta<1) controls how fast the weight of each previous \theta decreases exponentially. \beta is one of the hyperparameters.

    \[ v_t = (1-\beta)\theta_t + \beta (1-\beta) \theta_{t-1} + \beta^2 (1-\beta) \theta_{t-2} + \cdots + \beta^{t - 1} (1-\beta) \theta_{1} + \beta^{t} v_{0} \]

    \[ = (1-\beta) \left [ \theta_t + \beta  \theta_{t-1} + \beta^2 \theta_{t-2} + \cdots + \beta^{t - 1} \theta_{1} \right ] \quad \textup{when } v_0 = 0 \]

Exponentially weighted averages: Example
  • β = 0.90
    • Moderately adaptive to short-term variation.
  • β = 0.98
    • Slowly adaptive to short-term variation (a smoother curve).
    • Not adaptive enough.
  • β = 0.50
    • Very adaptive to short-term variation (a noisier curve).
    • Too adaptive.

  • On the graphs, you should think \theta_t is the average temperature at day t.

C2W2L04 Understanding exponentially weighted averages

\frac{1}{1-\beta} estimates the number of steps t after which the weight \beta^{t} decays to about 1/e. So v_t is roughly an average over the last \frac{1}{1-\beta} values.


v_{\theta} is a general form of v_{t}

Repeat {
..Get next \theta_{t}
..v_{\theta} :=  \beta v_{\theta} + (1-\beta) \theta_{t}
}

C2W2L05 Bias correction in exponentially weighted averages

The problem in the initial states

The purple curve has a large \beta, say, 0.98. So, its v_t is very small during the first steps, because v_0 = 0 and 1 - \beta = 0.02 is small.

v_0 = 0

v_1 = (0.98) v_0 + (1 - 0.98) \theta_1 = (0.98) v_0 + (0.02) \theta_1 = (0.02) \theta_1

v_2 = (0.98) v_1 + (1 - 0.98) \theta_2 = (0.98)(0.02)\theta_1 + (0.02) \theta_2 = (0.0196)\theta_1 + (0.02) \theta_2

Bias correction
  • To solve the problem, we multiply v_t by \frac{1}{1 - \beta^{t}}.
  • Bias correction is often omitted in practice because the problem of the initial states fades after a few iterations.
Exponentially weighted averages with bias correction

Repeat {
..Get next \theta_{t}
..v_{t} :=  \beta v_{t-1} + (1-\beta) \theta_{t}
..v_{t}^{\textup{corrected}} :=  \frac{v_{t}}{1 - \beta^{t}} # bias correction
}

  • Plot and observe v_{t}^{\textup{corrected}} instead of v_{t}.
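
The loop above translates almost line by line into Python. In this sketch the function name and the constant input series are assumptions chosen to make the effect of the correction easy to see.

```python
def exponentially_weighted_average(thetas, beta, bias_correction=True):
    """Return the running average for each time step t = 1, 2, ..."""
    v = 0.0                                   # v_0 = 0
    averages = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages

thetas = [10.0] * 5                           # hypothetical constant "temperatures"
raw = exponentially_weighted_average(thetas, beta=0.98, bias_correction=False)
corrected = exponentially_weighted_average(thetas, beta=0.98)
print(raw[0], corrected[0])
```

On a constant series the uncorrected average starts far below the true value, while the corrected average matches it from the first step.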

C2W2L06 Gradient descent with momentum

There’s an algorithm called momentum, or gradient descent with momentum that almost always works faster than the standard gradient descent algorithm.

Key idea of momentum optimization algorithm

The key idea of momentum optimization algorithm is to replace the current gradient with the exponentially weighted average of the previous gradients.

Momentum algorithm

Repeat {
..Get next dW, db
..v_{dW} := \beta v_{dW} + (1-\beta) dW
..v_{db} := \beta v_{db} + (1-\beta) db

..W := W - \alpha v_{dW}
..b := b - \alpha v_{db}
}

  • Usually, \beta = 0.9, which means v_{dW} and v_{db} average over approximately the last 10 gradients.
  • \alpha is the learning rate.
  • Since usually many iterations are required to converge, we do not apply bias correction to the momentum algorithm.
Understanding in the physics point of view
  • dW, db: acceleration
  • v_{dW}, v_{db}: velocity
  • \beta: friction
The momentum algorithm can solve oscillation problems
  • The momentum algorithm solves the oscillation problem since the algorithm averages out oscillating direction and helps move to the non-oscillating direction.
  • As we see below, momentum improves convergence from the blue to the red.
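
A hedged toy experiment for the oscillation argument, assuming NumPy and an elongated quadratic cost J(w) = 0.5 (w_0^2 + 50 w_1^2). The cost, the learning rate, and the function names are all illustrative assumptions.

```python
import numpy as np

def grad(w):
    """Gradient of the assumed toy cost J(w) = 0.5 * (w[0]**2 + 50 * w[1]**2)."""
    return np.array([1.0, 50.0]) * w

def momentum_step(w, v, dw, alpha, beta):
    """v is the exponentially weighted average of the gradients."""
    v = beta * v + (1 - beta) * dw
    w = w - alpha * v
    return w, v

alpha, beta = 0.1, 0.9

# Plain gradient descent: at this learning rate it overshoots along the steep axis.
w_gd = np.array([10.0, 1.0])
for _ in range(20):
    w_gd = w_gd - alpha * grad(w_gd)

# Gradient descent with momentum, same learning rate.
w_m, v = np.array([10.0, 1.0]), np.zeros(2)
for _ in range(300):
    w_m, v = momentum_step(w_m, v, grad(w_m), alpha, beta)

print(w_gd, w_m)
```

In this setting the sign-flipping gradients along the steep axis largely cancel inside the average, so the momentum run settles near the minimum while the plain run grows along that axis.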

C2W2L07 RMSprop

Key idea of RMSprop

The key idea of RMSprop is to divide the current gradient by the square root of the exponentially weighted average of the previous squared gradients.

RMSprop algorithm

Repeat {
..Compute dW, db on the current mini-batch.
..S_{dW} := \beta_{2} S_{dW} + (1 - \beta_{2})(dW)^2
..S_{db}  := \beta_{2} S_{db}  + (1 - \beta_{2})(db)^2

..W := W - \alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon} where \epsilon > 0
..b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon} where \epsilon > 0
}

  • \epsilon is a small positive real number. \epsilon is added to \sqrt{S_{dW}} and \sqrt{S_{db}} so that the update does not explode when either square root is close to zero.
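
One RMSprop step for a single parameter array can be sketched as follows. The default values for `alpha` and `beta2` are assumptions for illustration, not values given in these notes.

```python
import numpy as np

def rmsprop_step(w, s, dw, alpha=0.01, beta2=0.9, eps=1e-8):
    """Update the squared-gradient average s, then scale the step by 1 / sqrt(s)."""
    s = beta2 * s + (1 - beta2) * dw ** 2    # element-wise square
    w = w - alpha * dw / (np.sqrt(s) + eps)
    return w, s

w0 = np.array([1.0, 1.0])
dw = np.array([1.0, 100.0])                  # hypothetical gradients on very different scales
w1, s1 = rmsprop_step(w0, np.zeros(2), dw)
print(w0 - w1)                               # both components move by about the same amount
```

Dividing by the root of the squared-gradient average damps the step along directions with large gradients, which is what reduces the oscillation.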

C2W2L08 Adam optimization algorithm

Adam = Adaptive moment estimation

Key idea of Adam optimization algorithm

The key idea of Adam optimization algorithm is to combine momentum and RMSprop together with bias correction.

Adam optimization algorithm

Repeat {
..The current time step is t.
..Compute dW, db on the current mini-batch.

..v_{dW} := \beta_{1} v_{dW} + (1-\beta_{1}) dW
..v_{db} := \beta_{1} v_{db} + (1-\beta_{1}) db

..S_{dW} := \beta_{2} S_{dW} + (1 - \beta_{2})(dW)^2
..S_{db}  := \beta_{2} S_{db}  + (1 - \beta_{2})(db)^2

..[Bias correction]
..v_{dW}^{\textup{corrected}} := v_{dW} / (1-\beta_{1}^{t})
..v_{db}^{\textup{corrected}} := v_{db} / (1-\beta_{1}^{t})
..S_{dW}^{\textup{corrected}} := S_{dW} / (1-\beta_{2}^{t})
..S_{db}^{\textup{corrected}}  := S_{db}  / (1-\beta_{2}^{t})

..[Parameter update]
..W := W - \alpha \frac{v_{dW}^{\textup{corrected}}}{\sqrt{S_{dW}^{\textup{corrected}}} + \epsilon} where \epsilon > 0
..b := b - \alpha \frac{v_{db}^{\textup{corrected}}}{\sqrt{S_{db}^{\textup{corrected}}} + \epsilon} where \epsilon > 0
}

Hyperparameter choice
  • \alpha: needs to be tuned.
  • \beta_1: 0.9 is the default introduced in the Adam paper.
  • \beta_2: 0.999 is the default introduced in the Adam paper.
  • \epsilon: 10^{-8} is the default introduced in the Adam paper.

The only hyperparameter you should consider is the learning rate \alpha.
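
The Adam loop for a single parameter array can be written as the short sketch below, using the default \beta_1, \beta_2, and \epsilon listed above. The function name, the default \alpha, and the toy gradient are assumptions.

```python
import numpy as np

def adam_step(w, v, s, dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter used for bias correction."""
    v = beta1 * v + (1 - beta1) * dw          # momentum-style average
    s = beta2 * s + (1 - beta2) * dw ** 2     # RMSprop-style average
    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)
    w = w - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return w, v, s

w = np.array([1.0, -1.0])
dw = np.array([0.5, -20.0])                   # hypothetical gradient
w_new, v, s = adam_step(w, np.zeros(2), np.zeros(2), dw, t=1)
print(w_new)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by about \alpha regardless of the gradient's scale.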

C2W2L09 Learning rate decay

  • Small learning rate can allow optimization to approach closer to the local minimum than large learning rate.
  • However, small learning rate can make optimization very slow. Large learning rate can be better than small one if the current point is far from the local minimum.
  • Therefore, making learning rate decay as learning proceeds is the key idea of learning rate decay.
Implementations of learning rate decay
  • \alpha_0: initial learning rate
  • (\textup{epoch num})=0,1,2,...
Implementation 1

    \[ \alpha = \frac{1}{1 + (\textup{decay rate})(\textup{epoch num})} \alpha_0 \]

  • where 0 < (\textup{decay rate})
  • decay rate ↑ ⇔ ‘Learning rate decays faster.’
  • decay rate ↓ ⇔ ‘Learning rate decays slower’
Implementation 2

    \[ \alpha = (\textup{decay rate})^{(\textup{epoch num})} \alpha_0 \]

  • where 0 < (\textup{decay rate}) < 1
  • For example, \alpha = (0.95)^{(\textup{epoch num})} \alpha_0
  • For example, \alpha = (0.5)^{(\textup{epoch num})} \alpha_0
Implementation 3

    \[ \alpha = \frac{(\textup{decay rate})}{\sqrt{(\textup{epoch num})}} \alpha_0 \]

  • where 0 < (\textup{decay rate})
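
Implementations 1 to 3 are one-liners in Python. The function names are assumptions, and the third schedule is only defined for epoch numbers of at least 1, since it divides by the square root of the epoch number.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch_num):
    """Implementation 1: alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, decay_rate, epoch_num):
    """Implementation 2: decay_rate ** epoch_num * alpha0, with 0 < decay_rate < 1."""
    return (decay_rate ** epoch_num) * alpha0

def sqrt_decay(alpha0, decay_rate, epoch_num):
    """Implementation 3: decay_rate / sqrt(epoch_num) * alpha0, for epoch_num >= 1."""
    return decay_rate / math.sqrt(epoch_num) * alpha0

for epoch_num in range(1, 5):
    print(epoch_num,
          inverse_decay(0.2, 1.0, epoch_num),
          exponential_decay(0.2, 0.95, epoch_num),
          sqrt_decay(0.2, 1.0, epoch_num))
```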
Implementation 4: Manual decay
  • Observe the trend of loss.
  • Decide whether to decrease learning rate.
Guideline from Andrew Ng
  • “For me, I would say that learning rate decay is usually lower down on the list of things I try.”
    • Andrew considers learning rate decay as a low priority to tune among hyperparameters.

C2W2L10 The problem of local optima

The problem of local optima
  • The problem of local optima is not a real problem in high dimensional optimization, which is common in deep learning. In a high dimensional space, most points with zero gradient are saddle points, so it is rare to have many local optima.
  • Roughly, suppose that at a zero-gradient point each direction curves upward with probability 0.5, independently of the others. Then, the probability that the point is a local minimum in a d dimensional space becomes (0.5)^d.
  • Suppose d is 50. Then, 0.5^d \approx 8.8817842 \times 10^{-16}.
The problem of plateaus

Plateaus can make learning slow.

Adam and the problem of plateaus

Adam can actually speed up the rate at which you could move down the plateau and then get off the plateau. In short, Adam can help with the problem of plateaus.

Andrew’s thoughts

To be honest, I don’t think anyone has great intuitions about what these high dimensional spaces really look like, and our understanding of them is still evolving.

C2W2L11 Yuanqing Lin interview

C2W2Q1 Optimization algorithms

  • Score: 8/10
  • Passed!

C2W2P1 Optimization


By completing this assignment you will:

  • Understand the intuition behind Adam and RMSprop
  • Recognize the importance of mini-batch gradient descent
  • Learn the effects of momentum on the overall performance of your model

1 – Gradient Descent

Nothing special. L = 123

The gradient descent rule is, for l = 1, ..., L:

(1)   \[ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \]

(2)   \[ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \]

2 – Mini-Batch Gradient descent
How to build mini-batches from the training set (X, Y).
  • Shuffle
    • At every epoch, shuffle the training data again.
  • Partition
    • The last mini-batch does not have to be the same size as the previous ones.
The mini-batch size
  • Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
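
The shuffle and partition steps can be sketched as follows, assuming NumPy and the course layout of X as (features, examples) and Y as (1, examples). The function name and signature are assumptions, not the graded notebook code.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns of (X, Y) together, then cut them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)                 # same order for X and Y
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    return [
        (X_shuffled[:, k:k + batch_size], Y_shuffled[:, k:k + batch_size])
        for k in range(0, m, batch_size)             # the last batch may be smaller
    ]

X = np.arange(3 * 148, dtype=float).reshape(3, 148)  # hypothetical data
Y = X[0:1, :].copy()                                 # labels tied to the first feature
batches = random_mini_batches(X, Y, batch_size=64)
print([xb.shape[1] for xb, _ in batches])
```

Passing a different seed at every epoch is one way to reshuffle the data each time.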
3 – Momentum

The momentum update rule is, for l = 1, ..., L:

(3)   \[ \begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]} \\ W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \end{cases}\]

(4)   \[\begin{cases} v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]} \\ b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}} \end{cases}\]

where L is the number of layers, \beta is the momentum and \alpha is the learning rate.

  • Using momentum can reduce these oscillations.
  • Momentum takes into account the past gradients to smooth out the update.
  • Formally, this will be the exponentially weighted average of the gradient on previous steps.
  • If \beta = 0 , then this just becomes standard gradient descent without momentum.
  • Common values for \beta range from 0.8 to 0.999. If you don’t feel inclined to tune this, \beta=0.9 is often a reasonable default.
4 – Adam

The update rule is, for l = 1, ..., L:

    \[\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J} }{ \partial W^{[l]} } \\ v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (\frac{\partial \mathcal{J} }{\partial W^{[l]} })^2 \\ s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\ W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \end{cases}\]

where: t counts the number of steps taken of Adam

  • L is the number of layers
  • \beta_1 and \beta_2 are hyperparameters that control the two exponentially weighted averages.
  • \alpha is the learning rate
  • \varepsilon is a very small number to avoid dividing by zero
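
A single Adam step might look like the sketch below. It assumes the same dictionary layout for parameters and gradients as a typical NumPy implementation (matching keys such as "W1", "b1"), which is an assumption of this example. Note that the bias correction of the second moment s divides by 1 - \beta_2^t, as in the standard Adam formulation.

```python
import numpy as np

def adam_step(params, grads, v, s, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    for key in params:
        # First moment: exponentially weighted average of gradients.
        v[key] = beta1 * v[key] + (1 - beta1) * grads[key]
        # Second moment: exponentially weighted average of squared gradients.
        s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2
        # Bias correction compensates for the zero initialization of v and s.
        v_corrected = v[key] / (1 - beta1 ** t)
        s_corrected = s[key] / (1 - beta2 ** t)
        params[key] = params[key] - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
    return params, v, s
```

On the very first step (t = 1) the corrected moments equal the gradient and its square, so each parameter moves by roughly alpha in the direction opposite to the sign of its gradient.
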
Some advantages of Adam include:
  • In this assignment, Adam converged much faster than mini-batch gradient descent and gradient descent with momentum.
  • The cost oscillates less during optimization (visible in the cost-versus-epoch graph).
  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
  • Usually works well even with little tuning of hyperparameters (except \alpha)
  • Score: 100/100
  • Passed?: Yes

C2W3L1 Tuning process | Hyperparameter tuning

Hyperparameters to tune with their priority

(default values are given.)

  • Priority 1
    • learning rate \alpha=0.001
  • Priority 2
    • momentum  \beta=0.9
    • #(hidden units)
    • mini-batch size=64,128,256
  • Priority 3
    • Learning rate decay
    • #(hidden layers)
  • Priority 4
    • Adam \beta_1 = 0.9, \beta_2 =0.999, \epsilon = 10^{-8}
How to find a good hyperparameter: random search
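  • Grid search: with only a few hyperparameters, it is reasonable to try every combination of values on a grid.
  • However, with more hyperparameters a grid generates too many combinations to try, and it tests only a few distinct values of each one. Sampling combinations at random tries more distinct values of the important hyperparameters.
Strategy: a coarse-to-fine search using random search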
  1. Experiment for randomly picked combinations of hyperparameters in a particular region A.
  2. Take the combinations performing best.
  3. Is the random search similar to the grid search?
    • No: Select a smaller region A’ to explore that contains the best combinations.
      • Let A be A’. Go to Step 1.
    • Yes: End this algorithm.
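
The loop above can be sketched roughly as follows. This is a simplified illustration, not the course's code: `evaluate` is a hypothetical function that trains a model for a hyperparameter vector and returns a score to maximize, the region is an axis-aligned box, and the stopping test of step 3 is replaced by a fixed number of rounds (an assumption made to keep the sketch short).

```python
import numpy as np

def random_search(evaluate, low, high, n_trials, rng):
    """Step 1 and 2: sample points uniformly in the box and keep the best."""
    best_point, best_score = None, -np.inf
    for _ in range(n_trials):
        point = rng.uniform(low, high)
        score = evaluate(point)
        if score > best_score:
            best_point, best_score = point, score
    return best_point, best_score

def coarse_to_fine(evaluate, low, high, rounds=3, n_trials=20,
                   shrink=0.5, rng=None):
    """Repeat random search on a region that shrinks around the best point."""
    rng = np.random.default_rng() if rng is None else rng
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    best_point, best_score = None, -np.inf
    for _ in range(rounds):
        point, score = random_search(evaluate, low, high, n_trials, rng)
        if score > best_score:
            best_point, best_score = point, score
        # Step 3 (No branch): choose a smaller region A' around the best point.
        half_width = shrink * (high - low) / 2
        low = np.maximum(low, best_point - half_width)
        high = np.minimum(high, best_point + half_width)
    return best_point, best_score
```

For hyperparameters that should be explored on a log scale, the box can be defined over the exponents rather than the raw values.
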

C2W3L2 Using an appropriate scale to pick hyperparameters | Hyperparameter tuning

Opening remarks

In the last video, you saw how sampling at random, over the range of hyperparameters, can allow you to search over the space of hyperparameters more efficiently. But it turns out that sampling at random doesn’t mean sampling uniformly at random, over the range of valid values. Instead, it’s important to pick the appropriate scale on which to explore the hyperparameters. In this video, I want to show you how to do that.

Sampling hyperparameters with different scales: linear and log scales
  • Linear scale
    • #(hidden units) n^{[l]}
    • #layers
  • Log scale
    • Learning rate \alpha
      • scale levels: 0.1 – 0.01 – 0.001 – 0.0001
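      • \alpha = 10^{-n/m}: n is the step index and m is the number of steps per decade, so consecutive candidates differ by a constant factor of 10^{-1/m}.
    • Parameters of exponentially weighted averages \beta, \beta_1, \beta_2
      • scale levels: 0.99 – 0.999 – 0.9999
      • \beta = 1 - 10^{-n/m}, i.e., 1 - \beta is explored on a log scale.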
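
A continuous version of the same idea samples the exponent uniformly at random instead of stepping through n. The sketch below assumes the ranges listed in the scale levels above (10^{-4} to 10^{-1} for \alpha, and 0.99 to 0.9999 for \beta); the function names are illustrative.

```python
import numpy as np

def sample_learning_rate(rng, low_exp=-4.0, high_exp=-1.0):
    """Sample alpha so that every decade in [1e-4, 1e-1] is equally likely."""
    r = rng.uniform(low_exp, high_exp)   # uniform in the exponent
    return 10.0 ** r

def sample_beta(rng, low_exp=-4.0, high_exp=-2.0):
    """Sample beta in [0.99, 0.9999] by putting 1 - beta on a log scale."""
    r = rng.uniform(low_exp, high_exp)
    return 1.0 - 10.0 ** r

rng = np.random.default_rng(0)
alphas = [sample_learning_rate(rng) for _ in range(5)]
betas = [sample_beta(rng) for _ in range(5)]
```

Sampling 1 - \beta rather than \beta matters because the behavior of an exponentially weighted average changes far more between 0.999 and 0.9995 than between 0.9 and 0.9005.
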
Closing remark

So I hope this helps you select the right scale on which to sample the hyperparameters. In case you don’t end up making the right scaling decision on some hyperparameter choice, don’t worry too much about it. Even if you sample on the uniform scale, where some other scale would have been superior, you might still get okay results. Especially if you use a coarse to fine search, so that in later iterations, you focus in more on the most useful range of hyperparameter values to sample. I hope this helps you in your hyperparameter search. In the next video, I also want to share with you some thoughts of how to organize your hyperparameter search process. That I hope will make your workflow a bit more efficient.

C2W3L3 Hyperparameters tuning in practice: Pandas vs. Caviar | Hyperparameter tuning

Two hyperparameter tuning approaches
Approach 1: Panda

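Pandas raise their babies one at a time, with dedication. Likewise, you babysit a single model and adjust its hyperparameters as training progresses.

  • This approach is common in real-world industry settings, where models are large, high performance is required, and there is not enough compute to train many models at once.

Approach 2: Caviar

Like fish laying caviar, spawn a large number of eggs (models trained in parallel with different hyperparameters) and keep only the ones that do well.

  • This approach is usually used when the expected models are small and computing resources are large enough.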

C2W3L4 Normalizing activations in a network | Batch normalization

  • Batch normalization makes your hyperparameter search problem much easier.
  • Batch normalization makes your neural network much more robust.
  • Can we normalize either a^{[l-1]} or z^{[l-1]} so as to train w^{[l]}, b^{[l]} faster?
    • In practice, normalizing z^{[l-1]} is done more often than normalizing a^{[l-1]}.
Implementing batch normalization

Suppose a mini-batch contains m examples.

z^{(i)} is the value of a given hidden unit for the i-th example in the mini-batch (1 \leq i \leq m).

For every unit z,

\mu = \frac{1}{m} \sum_{i=1}^{m}{z^{(i)}}

\sigma^2 = \frac{1}{m} \sum_{i=1}^{m}{(z^{(i)}-\mu)^2}

z_{norm}^{(i)} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2 + \epsilon}}

\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta

  • \epsilon > 0 is a small constant that keeps the denominator nonzero.
  • \gamma and \beta exist for each unit.
  • \gamma and \beta are learnable parameters. This \beta is unrelated to the momentum hyperparameter \beta.
  • \tilde{z}^{(i)} is the batch normalized value of z^{(i)}.
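
The four equations above translate almost line by line into NumPy. The sketch below assumes Z holds one layer's pre-activations with shape (n_units, m), one column per example in the mini-batch, and that gamma and beta have shape (n_units, 1); these shapes are assumptions of the example.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize Z of shape (n_units, m) over the mini-batch axis."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over m examples
    var = Z.var(axis=1, keepdims=True)        # per-unit variance (1/m form)
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta           # learnable scale and shift
    return Z_tilde, mu, var
```

With gamma = sqrt(var + eps) and beta = mu, Z_tilde recovers Z exactly, so the layer can learn to undo the normalization if that is useful.
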

C2W3L5 Fitting Batch Norm into a neural network | Batch normalization

Where to put batch norm operations?
  • Before the activation function: the batch-normalized value \tilde{z}^{[l]} is used as the input of the activation function.
Gradient descent with batch norm
  • Batch norm subtracts the mean of z^{[l]}, so any constant bias b^{[l]} is cancelled out, and \beta^{[l]} takes over the role of the bias. Thus, the biases can be deleted.
  • Update rules for 3 parameter sets
    • w
    • \beta
    • \gamma
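
A quick numerical check of why the bias can be dropped: normalizing W a + b gives the same result as normalizing W a, because subtracting the per-unit mean removes any constant offset. The shapes and random values below are arbitrary choices for this illustration.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """Zero-mean, unit-variance normalization over the mini-batch axis."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))        # weights of a layer with 4 units
A_prev = rng.standard_normal((3, 32))  # activations for 32 examples
b = rng.standard_normal((4, 1))        # a bias that batch norm will cancel

with_bias = normalize(W @ A_prev + b)
without_bias = normalize(W @ A_prev)
bias_has_no_effect = np.allclose(with_bias, without_bias)
```
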

C2W3L6 Why does Batch Norm work? | Batch normalization

Covariate shift

Andrew Ng does not use batch norm as a regularizer. The slight regularization effect of batch norm is a side effect, not its intent.

Batch norm handles one mini-batch at a time. It computes the mean and variance on each mini-batch. At test time, predictions may be made on one example at a time, so the mean and variance cannot be computed from a mini-batch; estimates obtained during training are used instead.

C2W3L7 Batch Norm at test time | Batch normalization

At test time, use estimates of \mu and \sigma^2 obtained from the training set. Usually, we estimate them with an exponentially weighted average across the training mini-batches, which puts more weight on the latest values of \mu and \sigma^2. There are other ways to estimate them.
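
One way to keep such running estimates is sketched below. The class name `BatchNormStats`, the decay value 0.9, and the initial values are assumptions for this illustration; deep learning frameworks implement the same idea with their own defaults.

```python
import numpy as np

class BatchNormStats:
    """Track exponentially weighted averages of per-unit mean and variance."""

    def __init__(self, n_units, decay=0.9):
        self.decay = decay
        self.running_mu = np.zeros((n_units, 1))
        self.running_var = np.ones((n_units, 1))

    def update(self, Z):
        """Call once per training mini-batch; Z has shape (n_units, m)."""
        mu = Z.mean(axis=1, keepdims=True)
        var = Z.var(axis=1, keepdims=True)
        self.running_mu = self.decay * self.running_mu + (1 - self.decay) * mu
        self.running_var = self.decay * self.running_var + (1 - self.decay) * var

    def normalize_for_test(self, Z, gamma, beta, eps=1e-8):
        """Use the stored estimates, so even a single example can be processed."""
        Z_norm = (Z - self.running_mu) / np.sqrt(self.running_var + eps)
        return gamma * Z_norm + beta
```

Because the statistics come from training, `normalize_for_test` gives the same output for an example regardless of which other examples are in the test batch.
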

C2W3L8 Softmax Regression | Multi-class classification

  • The equations of the last layers
  • How the softmax regression classifies multiple classes. With diagrams.
    • Softmax regression is a generalized version of logistic regression.

C2W3L9 Training a softmax classifier | Multi-class classification

  • Hard-max [1 0 … 0]
  • Soft-max [0.8 0.1 … 0.05]
  • Softmax regression generalizes logistic regression to C classes.
    • If C=2, softmax regression reduces to logistic regression. This can be proved.
How to train a softmax classifier
  • Maximum likelihood estimation
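
With one-hot labels, maximizing the likelihood is the same as minimizing the average negative log-probability assigned to the correct class, which is the cross-entropy loss. The sketch below assumes Z has shape (C, m) with one column per example and Y is one-hot with the same shape; subtracting the column maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (C, m)."""
    Z_shifted = Z - Z.max(axis=0, keepdims=True)   # avoids overflow in exp
    exp_Z = np.exp(Z_shifted)
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

def cross_entropy(A, Y, eps=1e-12):
    """Average negative log-likelihood; Y is one-hot with shape (C, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(A + eps)) / m

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
```

For C = 2, the first softmax output equals sigmoid(z_1 - z_2), which is one way to see the reduction to logistic regression numerically.
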
