Brief information
- Course name: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization
- Instructor: Andrew Ng
- Institution: deeplearning.ai
- Media: Coursera
- Specialization: Deep Learning
- Duration: 3 weeks
About this Course
This course will teach you the “magic” of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.
After 3 weeks, you will:
- Understand industry best-practices for building deep learning applications.
- Be able to effectively use the common neural network “tricks”, including initialization, L2 and dropout regularization, Batch normalization, and gradient checking.
- Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence.
- Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance
- Be able to implement a neural network in TensorFlow.
This is the second course of the Deep Learning Specialization.
Syllabus
Week 1: Practical aspects of Deep Learning
Videos
- C2W1L01 Video: Train / Dev / Test sets
- C2W1L02 Video: Bias / Variance
- C2W1L03 Video: Basic Recipe for Machine Learning
- C2W1L04 Video: Regularization
- C2W1L05 Video: Why regularization reduces overfitting?
- C2W1L06 Video: Dropout Regularization
- C2W1L07 Video: Understanding Dropout
- C2W1L08 Video: Other regularization methods
- C2W1L09 Video: Normalizing inputs
- C2W1L10 Video: Vanishing / Exploding gradients
- C2W1L11 Video: Weight Initialization for Deep Networks
- C2W1L12 Video: Numerical approximation of gradients
- C2W1L13 Video: Gradient checking
- C2W1L14 Video: Gradient Checking Implementation Notes
- C2W1L15 Video: Yoshua Bengio interview
Quiz
- C2W1Q1 Graded: Practical aspects of deep learning
Programming assignments
- C2W1P1 Notebook: Initialization
- C2W1P1 Graded: Initialization
- C2W1P2 Notebook: Regularization
- C2W1P2 Graded: Regularization
- C2W1P3 Notebook: Gradient Checking
- C2W1P3 Graded: Gradient Checking
Week 2: Optimization algorithms
Videos
- C2W2L01 Video: Mini-batch gradient descent
- C2W2L02 Video: Understanding mini-batch gradient descent
- C2W2L03 Video: Exponentially weighted averages
- C2W2L04 Video: Understanding exponentially weighted averages
- C2W2L05 Video: Bias correction in exponentially weighted averages
- C2W2L06 Video: Gradient descent with momentum
- C2W2L07 Video: RMSprop
- C2W2L08 Video: Adam optimization algorithm
- C2W2L09 Video: Learning rate decay
- C2W2L10 Video: The problem of local optima
- C2W2L11 Video: Yuanqing Lin interview
Quiz
- C2W2Q1 Graded: Optimization algorithms
Programming assignment
- C2W2P1 Notebook: Optimization
- C2W2P1 Graded: Optimization
Week 3: Hyperparameter tuning, Batch Normalization and Programming Frameworks
Videos
- Video: Tuning process
- Video: Using an appropriate scale to pick hyperparameters
- Video: Hyperparameters tuning in practice: Pandas vs. Caviar
- Video: Normalizing activations in a network
- Video: Fitting Batch Norm into a neural network
- Video: Why does Batch Norm work?
- Video: Batch Norm at test time
- Video: Softmax Regression
- Video: Training a softmax classifier
- Video: Deep learning frameworks
- Video: TensorFlow
Quiz
- Graded: Hyperparameter tuning, Batch Normalization, Programming Frameworks
Programming assignments
- Notebook: Tensorflow
- Graded: Tensorflow
C2W1 Practical aspects of Deep Learning
Learning objectives
- Recall that different types of initializations lead to different results.
- Recognize the importance of initialization in complex neural networks.
- Recognize the difference between train/dev/test sets.
- Diagnose the bias and variance issues in your model.
- Learn when and how to use regularization methods such as dropout or L2 regularization.
- Understand experimental issues in deep learning such as Vanishing or Exploding gradients and learn how to deal with them.
- Use gradient checking to verify the correctness of your backpropagation implementation.
C2W1L01 Train / Dev / Test sets
Applied machine learning is a highly iterative process.
- What I’ve seen is that intuitions from one domain or from one application area often do not transfer to other application areas.
- So one of the things that determine how quickly you can make progress is how efficiently you can go around this cycle.
Train/dev/test sets
- Traditionally: Size(TotalSet) = about 100~10,000
- (train) : dev : test = 70 : 30 : 0 or 60 : 20 : 20
- Big data: Size(TotalSet) = about 1,000,000 or more
- (train) : dev : test = 98 : 1 : 1 or 99 : 0.5 : 0.5 or 99.5 : 0.25 : 0.25
- It might be okay not to have a test set.
- In this case, the dev set is commonly called ‘test set’.
- This term is not quite right because we use the dev set to make our models fit to the dev set. The test set should be unseen to the developing models.
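The split ratios above can be sketched in a few lines of NumPy. This is only an illustration: the function name `split_train_dev_test`, the column-per-example layout, and the toy data are assumptions, not part of the course material.

```python
import numpy as np

def split_train_dev_test(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples (columns) and split them into train/dev/test.

    X has shape (n_features, m) and Y has shape (1, m), i.e. one example
    per column. The fractions are hypothetical defaults (98 : 1 : 1).
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)

    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]

    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))

# Toy data: 1000 examples with 2 features each.
X = np.arange(2000, dtype=float).reshape(2, 1000)
Y = (np.arange(1000) % 2).reshape(1, 1000)
train, dev, test = split_train_dev_test(X, Y, dev_frac=0.01, test_frac=0.01)
print(train[0].shape, dev[0].shape, test[0].shape)
```

Setting `test_frac=0.0` gives the "no test set" case mentioned above.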
Mismatch train and dev distributions
- Training set: High resolution images from web pages
- Dev set: Low resolution images from mobile phones
- Then, the distributions of the training and dev sets are different.
C2W1L02 Bias / Variance
Definitions
- Bayes optimal error: the theoretical minimum error any model can attain
- Bias = (training error)
- Bias from the true ideal model
- Variance = (dev error) – (training error)
- Variance of the models we designed
High/low bias/variance means?
- High bias
- Low bias
- High variance
- Low variance
C2W1L03 Basic Recipe for Machine Learning
The diagram of the basic recipe for machine learning
(Diagram in this lecture)
Bias-variance tradeoff in the pre-deep learning era and the deep learning era
- In the pre-deep learning era, we didn’t have many tools that just reduce bias or that just reduce variance without hurting the other one.
- Bias-variance tradeoff says that
- bias increases if and only if variance decreases, and
- bias decreases if and only if variance increases.
- In the modern deep learning, big data era, we now have tools to drive down bias and just drive down bias, or drive down variance and just drive down variance, without really hurting the other thing that much. This has been one of the big reasons that deep learning has been useful for supervised learning.
- Getting a bigger network almost always just reduces your bias without necessarily hurting your variance.
- Getting more data pretty much always reduces your variance and doesn’t hurt your bias much.
C2W1L04 Regularization
L2 regularization
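Loss with L2 regularization
$J(W, b) = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m}\sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2$
- With L2 regularization, the weights of the model decay a little at every update.
- So, the method of L2 regularization is also called ‘weight decay’.
- $\lambda$ is called the ‘regularization parameter’.
Gradient descent update rule with L2 regularization
$W^{[l]} := W^{[l]} - \alpha\left[(\text{from backprop}) + \frac{\lambda}{m} W^{[l]}\right] = \left(1 - \frac{\alpha\lambda}{m}\right) W^{[l]} - \alpha\,(\text{from backprop})$
- $0 < \frac{\alpha\lambda}{m} < 1$, but mostly close to 0.
- Mostly, $\frac{\alpha\lambda}{m}$ is close to 0 rather than 1, so the factor $1 - \frac{\alpha\lambda}{m}$ is slightly less than 1.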
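A minimal sketch of the L2-regularized cost and the resulting ‘weight decay’ update, assuming the course's usual form of the penalty, $\frac{\lambda}{2m}\sum_l \lVert W^{[l]} \rVert_F^2$. The function names `l2_cost` and `l2_update` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def l2_cost(unregularized_cost, weights, lambd, m):
    """Add the L2 penalty (lambd / (2m)) * sum of squared weights."""
    penalty = sum(np.sum(np.square(W)) for W in weights)
    return unregularized_cost + lambd / (2 * m) * penalty

def l2_update(W, dW_backprop, alpha, lambd, m):
    """One gradient step; dW_backprop is the gradient without the penalty."""
    dW = dW_backprop + (lambd / m) * W   # gradient of the penalty term
    return W - alpha * dW

W = np.array([[1.0, -2.0], [0.5, 0.0]])
zero_grad = np.zeros_like(W)
W_new = l2_update(W, zero_grad, alpha=0.1, lambd=0.7, m=10)

# With a zero backprop gradient, the step is a pure decay
# by the factor (1 - alpha * lambd / m).
print(np.allclose(W_new, (1 - 0.1 * 0.7 / 10) * W))
```

The decay factor here is 0.993, which illustrates why the weights shrink only slightly per step.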
L1 regularization
Loss with L1 regularization
- After L1 regularization, the model ends up with sparse weights, which results in compressing the model.
Gradient descent update rule with L1 regularization
C2W1L05 Why regularization reduces overfitting?
C2W1L06 Dropout regularization
- Inverted dropout ensures that the expected value of a3 remains the same
Dropout implementation: Inverted dropout
```python
"""
Inverted Dropout: Recommended implementation example.
We drop and scale at train time and don't do anything at test time.
"""
import numpy as np

p = 0.5  # probability of keeping a unit active. higher = less dropout

# W1, b1, W2, b2, W3, b3 are the network parameters, assumed to be defined elsewhere.

def train_step(X):
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p!
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p!
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3

    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # no scaling necessary
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out
```
- Since test-time performance is so critical, it is always preferable to use inverted dropout, which performs the scaling at train time, leaving the forward pass at test time untouched.
- “we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix”.
(from the CS231n notes)
C2W1L07 Understanding dropout
Intuition of why dropout works
Dropout makes a neural network not rely on any one feature.
Dropout prevents a neural network from setting a high weight to any one feature. Thus, dropout makes the weights shrink, which finally results in regularization. Dropout can be thought of as an adaptive form of L2 regularization.
Different keep probability for each layer
- A layer that has a large weight matrix tends to overfit. For example, set keep_prob about 0.5.
- A layer that has a small weight matrix tends not to overfit. For example, set keep_prob about 0.7.
- We usually do not apply dropout to the input layer, i.e., we set keep_prob to 1.0. Sometimes we set keep_prob to a value close to 1.0, such as 0.9.
Benefit of dropout
- There are few reasons not to apply dropout.
Downside of dropout
- We cannot clearly define the objective function for the training model.
Observe whether the loss decreases
- While training, to observe the loss, we calculate the loss with keep_prob 1.0. This loss is the thing to observe.
C2W1L08 Other regularization methods
Method 1: Data augmentation
- Data augmentation ⊂ getting more data
Data augmentation examples
- Right-left flipping
- Rotation
- Zoom in/out
- Distortion that can happen in the real application
Method 2: Early stopping
Downside of early stopping
- We cannot follow the orthogonalization of reducing bias and reducing variance, because early stopping couples the two. The orthogonalization consists of two steps. (Orthogonalization: to think about one task at a time)
- First, reduce bias to Bayes error or human-level performance.
- Second, reduce variance to perform well on the development set.
Benefit of early stopping
- Training may take less time.
- We do not need to spend a long time searching for good regularization hyperparameters.
C2W1L09 Normalizing inputs
Normalizing training sets
- Normalize all sets with the same mean and variance, computed on the training set. Do not normalize the training, development, and test sets separately with their own statistics.
Why normalize inputs?
- Normalizing inputs is one method to speed up optimization.
- If the scales, i.e., variances, of the features are very different, normalizing inputs is important.
Guideline
- Normalizing inputs pretty much never hurts your optimization. Therefore, it is reasonable to normalize inputs in all cases.
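A small sketch of input normalization, assuming one example per column and statistics estimated on the training set only. The helper names `fit_normalizer` and `normalize`, the `eps` guard, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute the per-feature mean and standard deviation on the training set."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Apply the training-set statistics; eps (an assumption) avoids division by zero."""
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
# Two features with very different scales.
X_train = rng.normal(loc=[[5.0], [-3.0]], scale=[[10.0], [0.1]], size=(2, 1000))
X_test = rng.normal(loc=[[5.0], [-3.0]], scale=[[10.0], [0.1]], size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # same mu and sigma as for training
print(X_train_n.mean(axis=1), X_train_n.std(axis=1))
```

The test set is transformed with the training statistics, so both sets go through exactly the same mapping.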
C2W1L10 Vanishing / exploding gradients
When do the problems of vanishing gradients or exploding gradients happen?
- The problems happen when we train a deep neural network.
Why do the problems occur?
- The gradients of lower layers, which are close to the inputs, are computed from multiplying the gradients of the higher layers. In this setting,
- if most gradients are between 0 and 1, the gradients of lower layers will vanish (vanishing gradients), and
- if most gradients are greater than 1, the gradients of lower layers will explode (exploding gradients).
- There exists a neural network with about 152 layers.
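A toy illustration of the multiplication argument above: if every layer scales a signal (or a gradient) by the same factor, the result after many layers is that factor raised to the number of layers. The function name and the choice of 150 layers are assumptions for illustration.

```python
def repeated_scaling(factor, n_layers):
    """Scale of a quantity after passing through n_layers layers that each
    multiply it by `factor` (a toy stand-in for W^[l] = factor * I)."""
    return factor ** n_layers

for factor in (0.9, 1.1):
    # Slightly below 1 vanishes, slightly above 1 explodes.
    print(factor, repeated_scaling(factor, 150))
```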
Why do these problems matter?
- Vanishing gradients slow down optimization.
- Exploding gradients lead to failing to find a local optimum.
C2W1L11 Weight Initialization for Deep Networks
Weight Initialization is a partial solution to the vanishing/exploding gradients problem.
Initializing the weights
For a tanh activation function,
```python
# n: the number of inputs to the unit
w = np.random.randn(n) / np.sqrt(n)
```
- This can be used if and and
For a ReLU activation function,
```python
w = np.random.randn(n) * np.sqrt(2.0 / n)
```
For the Xavier initialization,
```python
w = np.random.randn(n_in) * np.sqrt(2.0 / (n_in + n_out))
# x and w in the input layer
# n_in: the number of units in the input layer
# n_out: the number of units in the output layer
```
Initializing the biases
- It is possible and common to initialize the biases to be zero, since the asymmetry breaking is provided by the small random numbers in the weights.
- It is common to simply use 0 bias initialization.
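The three weight scalings and the zero bias initialization can be gathered into one helper. This is a sketch under assumptions: the function name `initialize_layer`, the `(n_out, n_in)` weight shape, and the method labels are hypothetical and not taken from the course code.

```python
import numpy as np

def initialize_layer(n_in, n_out, method="he", seed=0):
    """Return (W, b) for a layer with n_in inputs and n_out units."""
    rng = np.random.default_rng(seed)
    if method == "he":            # sqrt(2 / n_in), suggested for ReLU
        scale = np.sqrt(2.0 / n_in)
    elif method == "tanh":        # sqrt(1 / n_in), suggested for tanh
        scale = np.sqrt(1.0 / n_in)
    elif method == "xavier":      # uses both fan-in and fan-out
        scale = np.sqrt(2.0 / (n_in + n_out))
    else:
        raise ValueError(f"unknown method: {method}")
    W = rng.standard_normal((n_out, n_in)) * scale
    b = np.zeros((n_out, 1))      # zero biases; the random W breaks symmetry
    return W, b

W, b = initialize_layer(n_in=400, n_out=100, method="he")
print(W.shape, b.shape, W.std())
```

The printed standard deviation should be close to `sqrt(2 / 400)`, about 0.07.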
Supplementary for understanding the weight initialization methods
- Weight Initialization – CS231n [LINK]
- Good material!
- Understanding Neural Network Weight Initialization – Intoli [LINK]
- On weight initialization in deep neural networks. S. K. Kumar. arXiv. 2017 [LINK]
C2W1L12 Numerical approximation of gradients
This topic is for gradient checking.
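The first-order approximate gradient
$f'(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}$
- Error: $O(\epsilon)$
The second-order approximate gradient
$f'(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$
- Error: $O(\epsilon^2)$
- The order of the error is $O(\epsilon^2)$. Thus, the error term diminishes faster as $\epsilon$ gets close to 0 than in the first-order case.
- Therefore, we use the second-order approximate gradient as the numerical approximation of gradients.
Numerical approximation of gradients
- My question: How small do we have to set $\epsilon$?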
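To make the difference between the one-sided and the two-sided approximation concrete, here is a small sketch on $f(\theta) = \theta^3$ at $\theta = 1$, where the true derivative is 3. The function names are illustrative.

```python
def one_sided(f, theta, eps):
    """First-order (one-sided) difference; error is O(eps)."""
    return (f(theta + eps) - f(theta)) / eps

def two_sided(f, theta, eps):
    """Second-order (two-sided) difference; error is O(eps^2)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

f = lambda theta: theta ** 3   # true derivative at theta = 1 is 3
eps = 0.01
err_one = abs(one_sided(f, 1.0, eps) - 3.0)
err_two = abs(two_sided(f, 1.0, eps) - 3.0)
print(err_one, err_two)        # roughly 0.0301 versus 0.0001
```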
C2W1L13 Gradient checking
- Gradient checking (grad check) is used to verify whether your implementation of backpropagation is correct. Andrew Ng said it really helps to find bugs in your implementation of backpropagation.
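The metric to measure how different the implemented and approximated gradients are: $\frac{\lVert d\theta_{approx} - d\theta \rVert_2}{\lVert d\theta_{approx} \rVert_2 + \lVert d\theta \rVert_2}$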
derived from
Guideline
- About $10^{-7}$ or smaller (with $\epsilon = 10^{-7}$) ⇒ Great! Nothing more to do.
- About $10^{-5}$ ⇒ Okay. But check for bugs in the implementation.
- About $10^{-3}$ or larger ⇒ Bad. Find bugs in the implementation.
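A minimal sketch of gradient checking on a toy cost, assuming the relative-difference metric from the lecture, $\lVert d\theta_{approx} - d\theta \rVert_2 / (\lVert d\theta_{approx} \rVert_2 + \lVert d\theta \rVert_2)$. The cost $J(\theta) = \sum_i \theta_i^2$, the deliberately buggy gradient, and the function names are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-7):
    """Two-sided numerical gradient of a scalar cost J at the vector theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

def relative_difference(grad, grad_approx):
    """Normalized distance between analytic and numerical gradients."""
    num = np.linalg.norm(grad - grad_approx)
    den = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return num / den

J = lambda theta: np.sum(theta ** 2)              # toy cost
theta = np.array([1.0, -2.0, 3.0])
correct = 2 * theta                               # analytic gradient
buggy = 2 * theta + np.array([0.0, 0.0, 0.5])     # gradient with a planted bug

approx = numerical_gradient(J, theta)
diff_correct = relative_difference(correct, approx)
diff_buggy = relative_difference(buggy, approx)
print(diff_correct, diff_buggy)
```

The correct gradient gives a tiny difference, while the planted bug pushes the metric well above $10^{-3}$.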
C2W1L14 Gradient checking implementation notes
- Don’t use gradient checking in training. Use it to debug.
- If the gradients fail the check, look at the individual components of the gradient to try to identify the bug.
- Don’t forget regularization terms.
- Gradient checking does not work with dropout. Turn dropout off (set keep_prob=1) while gradient checking.
- (Rarely used) Run at random initialization and do gradient checking.
C2W1Q1 Practical aspects of deep learning
8/10 correct.
C2W1 Programming assignments
Welcome to the first assignment of the hyper parameters tuning specialization. It is very important that you regularize your model properly because it could dramatically improve your results.
By completing this assignment you will:
- Understand the different regularization methods that could help your model.
- Implement dropout and see it work on data.
- Recognize that a model without regularization gives you a better accuracy on the training set but not necessarily on the test set.
- Understand that you could use both dropout and regularization on your model.
This assignment prepares you well for the upcoming assignment. Take your time to complete it and make sure you get the expected outputs when working through the different exercises. In some code blocks, you will find a “#GRADED FUNCTION: functionName” comment. Please do not modify it. After you are done, submit your work and check your results. You need to score 80% to pass. Good luck 🙂 !
C2W1P1: Initialization
Initialization 1: Zero initialization
- The weights should be initialized randomly to break symmetry.
- It is however okay to initialize the biases to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly.
Initialization 2: Large random weight initialization
- With large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples. This incurs a very high loss.
- Initializing weights to very large random values does not work well.
- Hopefully initializing with small random values does better. The important question is: how small should these random values be? Let's find out in the next part!
Initialization 3: Small random weight initialization
- The model with He initialization separates the blue and the red dots very well in a small number of iterations.
Comparing 3 types of initialization
Initialization | Train accuracy | Problem/Comment |
3-layer NN with zeros initialization | 50% | fails to break symmetry |
3-layer NN with large random initialization | 83% | too large weights |
3-layer NN with He initialization | 99% | recommended method |
What you should remember from this notebook:
- Different ways of initialization lead to different results.
- Random initialization is used to break symmetry and make sure different hidden units can learn different things.
- Don’t initialize to values that are too large.
- He initialization works well for networks with ReLU activations.
Score
30/30. passed.
C2W1P2: Regularization
1 – Non-regularized model
- keep_prob=1.0
2 – L2 Regularization
What is L2-regularization actually doing?:
- L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights.
- Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values.
- This leads to a smoother model in which the output changes more slowly as the input changes.
3.1 – Forward propagation with dropout
Dropout randomly shuts down some neurons in each iteration
- The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons.
- With dropout, your neurons thus become less sensitive to the activation of one other specific neuron.
3.2 – Backward propagation with dropout
What you should remember about dropout:
- Dropout is a regularization technique.
- You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation. You should store the dropped units for backward propagation.
- [Inverted dropout] During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.
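The points above (store the mask, reuse it in backpropagation, divide by keep_prob) can be sketched as follows. The function names `dropout_forward` and `dropout_backward` and the all-ones activations are assumptions; this is not the assignment's graded code.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Inverted dropout on activations A; returns the result and the mask."""
    mask = rng.random(A.shape) < keep_prob    # True = keep the unit
    A_drop = A * mask / keep_prob             # scale to keep the expected value
    return A_drop, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the same mask and scaling to the upstream gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(1)
A = np.ones((50, 2000))
A_drop, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
dA = dropout_backward(np.ones_like(A), mask, keep_prob=0.8)

# The mean activation stays close to 1.0 thanks to the division by keep_prob.
print(A_drop.mean())
```

At test time neither function is called, which matches the "only during training" rule above.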
Score
- Score: 40/40
- Passed?: Yes
C2W1P3: Gradient checking
1) How does gradient checking work?
- As described in the previous lectures.
2) 1-dimensional gradient checking
- The threshold of difference:
3) N-dimensional gradient checking
- The threshold of difference:
- Matrix computation
Score
- Score: 40/40
- Passed?: Yes
C2W2 Optimization algorithms
Learning Objectives
- Remember different optimization methods such as (Stochastic) Gradient Descent, Momentum, RMSProp and Adam.
- Use random minibatches to accelerate the convergence and improve the optimization
- Know the benefits of learning rate decay and apply it to your optimization.
C2W2L01 Mini-batch gradient descent
Notations
Gradient descent algorithms
- Batch gradient descent
- Mini-batch gradient descent
- Stochastic gradient descent
Terms
- Batch
- Epoch
Mini-batch gradient descent
- When we have a large training set, mini-batch gradient descent is faster than batch gradient descent.
Algorithm
- To fill
C2W2L02 Understanding mini-batch gradient descent
3 types of gradient descent algorithms
Let $m$ be the number of the total training examples.
Type | Batch | Mini-batch | Stochastic |
Batch size | m | 1 < (batch size) < m | 1 |
Vectorizing / Distributed computing | O | O | X |
One iteration time | too long | tolerable | short |
Convergence | O | oscillates close to the local minimum | oscillates farther from the local minimum |
Guideline: How to choose mini-batch size
- training set is small ($m \le 2000$) ⇒ batch_size = m
- training set is big ⇒ typical batch_size = 64, 128, 256, 512
- If batch_size is a power of 2, the code often runs faster because of the way memory is laid out and accessed.
- You have to choose batch_size among those sizes. batch_size is a hyperparameter that affects the speed of training. Try several values of batch_size, find out which one makes your training fastest, and use it.
- Make sure the mini-batch fits in your CPU/GPU memory.
C2W2L03 Exponentially weighted averages
This lecture introduces exponentially weighted averages, the building block of optimization algorithms that are faster than gradient descent.
Exponentially weighted averages: Definition
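Let $\theta_t$ be an optimizing parameter at time step $t$.
Let $v_t$ be the exponentially weighted average at time step $t$. $v_t$ is defined as the following.
$v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$
$\beta$ determines how quickly the weight on each previous $\theta$ decreases exponentially. $\beta$ is one of the hyperparameters.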
Exponentially weighted averages: Example
- β = 0.90
- Moderately adaptive to short-term variation.
- β = 0.98
- Slowly adaptive to short-term variation.
- Not adaptive enough.
- β = 0.50
- Very adaptive to short-term variation.
- Too adaptive.
- On the graphs, you should think of $v_t$ as the average temperature at day $t$.
C2W2L04 Understanding exponentially weighted averages
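$\frac{1}{1-\beta}$ is used to estimate how many time steps $v_t$ effectively averages over: the weight $\beta^k$ on a value from $k$ steps ago falls to about $1/e$ after $k \approx \frac{1}{1-\beta}$ steps.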
Algorithm
$v_\theta$ is a general form of $v_t$.
Repeat {
..Get next $\theta_t$
..$v_\theta := \beta v_\theta + (1 - \beta)\,\theta_t$
}
C2W2L05 Bias correction in exponentially weighted averages
The problem in the initial states
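The purple curve has large $\beta$, say, 0.98. So, its initial $v_t$ is very small because $1 - \beta$ is small.
Bias correction
- To solve the problem, we multiply $\frac{1}{1-\beta^t}$ on $v_t$.
- Bias correction is not widely used because the problem of the initial states diminishes after a few iterations.
Exponentially weighted averages with bias correction
Repeat {
..Get next $\theta_t$
..$v_\theta := \beta v_\theta + (1 - \beta)\,\theta_t$
..$v_\theta^{corrected} := \frac{v_\theta}{1 - \beta^t}$ # bias correction
}
- Plot and observe $v_\theta^{corrected}$ instead of $v_\theta$.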
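The loop above can be written as a short function. This is a sketch under the assumption that the average starts at zero; the function name `ewa` and the constant toy "temperature" series are illustrative.

```python
import numpy as np

def ewa(values, beta, bias_correction=True):
    """Exponentially weighted average of a sequence, starting from v = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Dividing by (1 - beta**t) removes the bias toward the zero start.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

temps = np.full(5, 40.0)                  # a constant series of 40 degrees
raw = ewa(temps, beta=0.98, bias_correction=False)
corrected = ewa(temps, beta=0.98)
print(raw[0], corrected[0])               # 0.8 without correction, 40.0 with it
```

On a constant series the corrected average equals the series at every step, while the uncorrected one starts far too low.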
C2W2L06 Gradient descent with momentum
There’s an algorithm called momentum, or gradient descent with momentum that almost always works faster than the standard gradient descent algorithm.
Key idea of momentum optimization algorithm
The key idea of momentum optimization algorithm is to replace the current gradient with the exponentially weighted average of the previous gradients.
Momentum algorithm
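Repeat {
..Get next $dW$, $db$
..$v_{dW} := \beta v_{dW} + (1 - \beta)\,dW$
..$v_{db} := \beta v_{db} + (1 - \beta)\,db$
..$W := W - \alpha v_{dW}$
..$b := b - \alpha v_{db}$
}
- Usually, $\beta = 0.9$. 0.9 means $v_{dW}$ and $v_{db}$ average approximately the previous 10 gradients.
- $\alpha$ is the learning rate.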
- Since usually many iterations are required to converge, we do not apply bias correction to the momentum algorithm.
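A minimal sketch of the momentum update, assuming the form $v_{dW} := \beta v_{dW} + (1-\beta)\,dW$ followed by $W := W - \alpha v_{dW}$ and no bias correction. The function name, the elongated quadratic cost, and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One step of gradient descent with momentum (no bias correction)."""
    v_dW = beta * v_dW + (1 - beta) * dW   # exponentially weighted average of gradients
    W = W - alpha * v_dW
    return W, v_dW

# Toy cost J(W) = 0.5 * (W[0]**2 + 25 * W[1]**2): an elongated bowl.
curvature = np.array([1.0, 25.0])
W = np.array([10.0, 1.0])
v = np.zeros_like(W)

# First step from v = 0: the velocity is (1 - beta) times the gradient.
W_first, v_first = momentum_step(W, curvature * W, v, alpha=0.05, beta=0.9)

for _ in range(200):
    dW = curvature * W                     # gradient of the toy cost
    W, v = momentum_step(W, dW, v, alpha=0.05, beta=0.9)
print(W)                                   # close to the minimum at the origin
```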
Understanding from the physics point of view
- $dW$, $db$: acceleration
- $v_{dW}$, $v_{db}$: velocity
- $\beta$: friction
The momentum algorithm can solve oscillation problems
- The momentum algorithm solves the oscillation problem since the algorithm averages out oscillating direction and helps move to the non-oscillating direction.
- As we see below, momentum improves convergence from the blue to the red.
C2W2L07 RMSprop
Key idea of RMSprop
The key idea of RMSprop is to divide the current gradient by the square root of the exponentially weighted average of the previous squared gradients.
RMSprop algorithm
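Repeat {
..Compute $dW$, $db$ on the current mini-batch.
..$s_{dW} := \beta s_{dW} + (1 - \beta)\,dW^2$
..$s_{db} := \beta s_{db} + (1 - \beta)\,db^2$
..$W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$ where $\epsilon > 0$
..$b := b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$ where $\epsilon > 0$
}
- $\epsilon$ is a small positive real number. $\epsilon$ is added to $\sqrt{s_{dW}}$ and $\sqrt{s_{db}}$ so that the latter term will not explode.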
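A sketch of one RMSprop step, assuming element-wise squares and the update $W := W - \alpha\, dW / (\sqrt{s_{dW}} + \epsilon)$. The function name and the default values (in particular `eps=1e-8`) are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop step; squares and square roots are element-wise."""
    s_dW = beta * s_dW + (1 - beta) * np.square(dW)
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    return W, s_dW

W = np.array([1.0, 1.0])
s = np.zeros_like(W)
dW = np.array([100.0, 0.01])    # gradients with very different scales
W_new, s = rmsprop_step(W, dW, s)

# Both coordinates move by about the same amount despite the scale gap.
print(W - W_new)
```

Dividing by the running root-mean-square is what equalizes the step sizes across directions.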
C2W2L08 Adam optimization algorithm
Adam = Adaptive moment estimation
Key idea of Adam optimization algorithm
The key idea of Adam optimization algorithm is to combine momentum and RMSprop together with bias correction.
Adam optimization algorithm
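Repeat {
..The current time step is $t$.
..Compute $dW$, $db$ on the current mini-batch.
..[Momentum]
..$v_{dW} := \beta_1 v_{dW} + (1 - \beta_1)\,dW$
..$v_{db} := \beta_1 v_{db} + (1 - \beta_1)\,db$
..[RMSprop]
..$s_{dW} := \beta_2 s_{dW} + (1 - \beta_2)\,dW^2$
..$s_{db} := \beta_2 s_{db} + (1 - \beta_2)\,db^2$
..[Bias correction]
..$v_{dW}^{corrected} := \frac{v_{dW}}{1 - \beta_1^t}$
..$v_{db}^{corrected} := \frac{v_{db}}{1 - \beta_1^t}$
..$s_{dW}^{corrected} := \frac{s_{dW}}{1 - \beta_2^t}$
..$s_{db}^{corrected} := \frac{s_{db}}{1 - \beta_2^t}$
..[Parameter update]
..$W := W - \alpha \frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}} + \epsilon}$ where $\epsilon > 0$
..$b := b - \alpha \frac{v_{db}^{corrected}}{\sqrt{s_{db}^{corrected}} + \epsilon}$ where $\epsilon > 0$
}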
Hyperparameter choice
- $\alpha$: needs to be tuned.
- $\beta_1$: 0.9 is the default introduced in the Adam paper.
- $\beta_2$: 0.999 is the default introduced in the Adam paper.
- $\epsilon$: $10^{-8}$ is the default introduced in the Adam paper.
The only hyperparameter you usually need to tune is the learning rate $\alpha$.
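A minimal sketch of one Adam step for a single parameter array, combining the momentum average, the RMSprop average, and bias correction. The function name `adam_step` and the toy gradient are assumptions; the default values follow the ones listed above.

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at time step t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dW                 # momentum part
    s = beta2 * s + (1 - beta2) * np.square(dW)      # RMSprop part
    v_hat = v / (1 - beta1 ** t)                     # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)   # parameter update
    return W, v, s

W = np.array([1.0, -1.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
dW = np.array([0.5, -2.0])
W1, v, s = adam_step(W, dW, v, s, t=1)

# At t = 1 the corrected averages equal dW and dW**2, so each coordinate
# moves by about alpha in the direction opposite to its gradient sign.
print(W1)
```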
C2W2L09 Learning rate decay
- A small learning rate allows optimization to settle closer to the local minimum than a large learning rate does.
- However, a small learning rate can make optimization very slow. A large learning rate can be better than a small one if the current point is far from the local minimum.
- Therefore, the key idea of learning rate decay is to decrease the learning rate as learning proceeds.
Implementations of learning rate decay
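- $\alpha_0$: initial learning rate
Implementation 1
- $\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0$ where epoch_num is the index of the current epoch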
- decay rate ↑ ⇔ ‘Learning rate decays faster.’
- decay rate ↓ ⇔ ‘Learning rate decays slower’
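As a small sketch of this first implementation, assuming the lecture's form $\alpha = \alpha_0 / (1 + \text{decay\_rate} \times \text{epoch\_num})$; the function name and the example values $\alpha_0 = 0.2$, decay rate 1 are illustrative.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Inverse-time decay of the learning rate, evaluated once per epoch."""
    return alpha0 / (1.0 + decay_rate * epoch_num)

for epoch in range(1, 5):
    # Prints 0.1, 0.067, 0.05, 0.04 for epochs 1 to 4.
    print(epoch, round(decayed_learning_rate(0.2, 1.0, epoch), 3))
```

A larger `decay_rate` makes the sequence fall faster, as the two bullets above state.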
Implementation 2
- where
- For example,
- For example,
Implementation 3
- where
Implementation 4: Manual decay
- Observe the trend of loss.
- Decide whether to decrease learning rate.
Guideline from Andrew Ng
- “For me, I would say that learning rate decay is usually lower down on the list of things I try.”
- Andrew considers learning rate decay as a low priority to tune among hyperparameters.
C2W2L10 The problem of local optima
The problem of local optima
- The problem of local optima is not a real problem in high dimensional optimization, which is common in deep learning. In a high dimensional space, it is rare to have many local optima.
- In one dimension, let the probability that some point is a local optimum be 0.5. Then, the probability that some point is a local optimum in an $n$-dimensional space becomes $0.5^n$.
- Let us think $n$ is 50. Then, $0.5^{50} \approx 8.9 \times 10^{-16}$.
The problem of plateaus
Plateaus can make learning slow.
Adam and the problem of plateaus
Adam can actually speed up the rate at which you could move down the plateau and then get off the plateau. In short, Adam can mitigate the problem of plateaus.
Andrew’s thoughts
To be honest, I don’t think anyone has great intuitions about what these high dimensional spaces really look like, and our understanding of them is still evolving.
C2W2L11 Yuanqing Lin interview
C2W2Q1 Optimization algorithms
- Score: 8/10
- Passed!
C2W2P1 Optimization
Introduction
By completing this assignment you will:
- Understand the intuition between Adam and RMS prop
- Recognize the importance of mini-batch gradient descent
- Learn the effects of momentum on the overall performance of your model
This assignment prepares you well for the upcoming assignment. Take your time to complete it and make sure you get the expected outputs when working through the different exercises. In some code blocks, you will find a “#GRADED FUNCTION: functionName” comment. Please do not modify it. After you are done, submit your work and check your results. You need to score 80% to pass. Good luck 🙂 !
1 – Gradient Descent
Nothing special.
The gradient descent rule is, for $l = 1, \dots, L$:
(1) $W^{[l]} = W^{[l]} - \alpha\, dW^{[l]}$
(2) $b^{[l]} = b^{[l]} - \alpha\, db^{[l]}$
2 – Mini-Batch Gradient descent
How to build mini-batches from the training set $(X, Y)$.
- Shuffle
- At every epoch, shuffle the training data again.
- Partition
- The last mini-batch does not need to be the same size as the previous ones.
The mini-batch size
- Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.
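The shuffle and partition steps can be sketched as below. This is not the assignment's graded function: the name `random_mini_batches`, the seed handling, and the toy data are assumptions, with one example per column as in the course.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle (X, Y) with one shared permutation, then partition into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)              # same permutation for X and Y
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size      # the last slice may be shorter
        batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return batches

X = np.arange(300, dtype=float).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, mini_batch_size=64)
sizes = [bx.shape[1] for bx, _ in batches]
print(sizes)                               # 150 examples give sizes 64, 64, 22
```

Calling the function with a different `seed` at every epoch gives the per-epoch reshuffling mentioned above.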
3 – Momentum
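The momentum update rule is, for $l = 1, \dots, L$:
(3) $v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta)\, dW^{[l]}$, $W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}$
(4) $v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta)\, db^{[l]}$, $b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}}$
where L is the number of layers, $\beta$ is the momentum and $\alpha$ is the learning rate.
- Using momentum can reduce the oscillations of mini-batch gradient descent.
- Momentum takes into account the past gradients to smooth out the update.
- Formally, this will be the exponentially weighted average of the gradient on previous steps.
- If $\beta = 0$, then this just becomes standard gradient descent without momentum.
- Common values for $\beta$ range from 0.8 to 0.999. If you don’t feel inclined to tune this, $\beta = 0.9$ is often a reasonable default.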
4 – Adam
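The update rule is, for $l = 1, \dots, L$:
$v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1)\, dW^{[l]}$
$v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t}$
$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2)\, (dW^{[l]})^2$
$s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t}$
$W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \epsilon}$
where: $t$ counts the number of steps taken of Adam
- L is the number of layers
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages.
- $\alpha$ is the learning rate
- $\epsilon$ is a very small number to avoid dividing by zero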
Some advantages of Adam include:
- Adam converges a lot faster than mini-batch gradient descent and gradient descent with momentum.
- Less oscillating cost while optimization. (cost-epoch graph)
- Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
- Usually works well even with little tuning of hyperparameters (except $\alpha$)
Results
- Score: 100/100
- Passed?: Yes
C2W3L1 Tuning process | Hyperparameter tuning
Hyperparameters to tune with their priority
(default values are given.)
- Priority 1
- learning rate $\alpha$
- Priority 2
- momentum $\beta = 0.9$
- #(hidden units)
- mini-batch size=64,128,256
- Priority 3
- Learning rate decay
- #(hidden layers)
- Priority 4
- Adam $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
How to find a good hyperparameter: random search
- It is natural to try all possible combinations of hyperparameters on a grid. – grid search
- However, it generates too many combinations to experiment with.
Strategy: a coarse to fine search – random search
- Step 1: Experiment with randomly picked combinations of hyperparameters in a particular region A.
- Step 2: Take the combinations performing best.
- Step 3: Is the random search similar to the grid search?
- No: Select a region A’ to explore that contains the best combination. Let A be A’. Go to Step 1.
- Yes: End this algorithm.
C2W3L2 Using an appropriate scale to pick hyperparameters | Hyperparameter tuning
Opening remarks
In the last video, you saw how sampling at random, over the range of hyperparameters, can allow you to search over the space of hyperparameters more efficiently. But it turns out that sampling at random doesn’t mean sampling uniformly at random, over the range of valid values. Instead, it’s important to pick the appropriate scale on which to explore the hyperparamaters. In this video, I want to show you how to do that.
Sampling hyperparameters with different scales: linear and log scales
- Linear scale
- #(hidden units)
- #layers
- Log scale
- Learning rate
- scale levels: 0.1 – 0.01 – 0.001 – 0.0001
- : is step number. is for scaling steps.
- Exponentially weighted averages ($\beta$)
- scale levels: 0.99 – 0.999 – 0.9999
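Sampling on a log scale can be sketched as follows, assuming the lecture's recipe of drawing an exponent $r$ uniformly and using $10^r$ for the learning rate and $1 - 10^r$ for $\beta$. The function names and the exponent ranges are assumptions chosen for illustration.

```python
import numpy as np

def sample_learning_rate(rng, low_exp=-4, high_exp=-1):
    """Sample alpha so that each decade between 1e-4 and 1e-1 is equally likely."""
    r = rng.uniform(low_exp, high_exp)
    return 10.0 ** r

def sample_beta(rng, low_exp=-3, high_exp=-1):
    """Sample beta in (0.9, 0.999) by sampling 1 - beta on a log scale."""
    r = rng.uniform(low_exp, high_exp)
    return 1.0 - 10.0 ** r

rng = np.random.default_rng(0)
alphas = np.array([sample_learning_rate(rng) for _ in range(1000)])
betas = np.array([sample_beta(rng) for _ in range(1000)])

# Roughly a third of the learning rates fall into the lowest decade,
# which a uniform sample over [1e-4, 1e-1] would almost never visit.
print(np.mean(alphas < 1e-3))
```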
Closing remark
So I hope this helps you select the right scale on which to sample the hyperparameters. In case you don’t end up making the right scaling decision on some hyperparameter choice, don’t worry too much about it. Even if you sample on the uniform scale, where some other scale would have been superior, you might still get okay results. Especially if you use a coarse to fine search, so that in later iterations, you focus in more on the most useful range of hyperparameter values to sample. I hope this helps you in your hyperparameter search. In the next video, I also want to share with you some thoughts of how to organize your hyperparameter search process. That I hope will make your workflow a bit more efficient.
C2W3L3 Hyperparameters tuning in practice: Pandas vs. Caviar | Hyperparameter tuning
Two hyperparameter tuning approaches
Approach 1: Panda
Pandas raise their babies one by one with dedication.
- This approach is usually used in real-world industry settings, which require high performance and large models.
Approach 2: Caviar
Like caviar, lay a large number of eggs and let only the good ones survive.
- This approach is usually used when the expected models are small and the computing resources are large enough.
C2W3L4 Normalizing activations in a network | Batch normalization
- Batch normalization makes your hyperparameter search problem much easier.
- Batch normalization makes your neural network much more robust.
- Can we normalize either $a^{[l]}$ or $z^{[l]}$ so as to train $W^{[l+1]}$, $b^{[l+1]}$ faster?
- Normalizing $z^{[l]}$ is done more often than normalizing $a^{[l]}$.
Implementing batch normalization
Suppose there are mini-batches.
is a unit of the -th mini-batch ().
For every unit ,
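$\mu = \frac{1}{m}\sum_{i} z^{(i)}$
$\sigma^2 = \frac{1}{m}\sum_{i} (z^{(i)} - \mu)^2$
$z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
$\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$
- $\epsilon$ is for the nonzero denominator.
- $\gamma$ and $\beta$ exist for each unit.
- $\gamma$ and $\beta$ are learnable parameters.
- $\tilde{z}$ is the batch normalized value of $z$.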
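A sketch of the batch norm forward computation for one layer and one mini-batch, assuming the standard steps: per-unit mean and variance over the mini-batch, normalization with $\epsilon$ in the denominator, then scaling by $\gamma$ and shifting by $\beta$. The function name, the shapes (one row per unit, one column per example), and the toy data are assumptions.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize pre-activations Z of shape (n_units, batch_size)."""
    mu = Z.mean(axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta           # learnable scale and shift
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(3.0, 5.0, size=(4, 256))
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), 3.0)
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)

# Each unit now has mean beta and standard deviation gamma.
print(Z_tilde.mean(axis=1), Z_tilde.std(axis=1))
```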
C2W3L5 Fitting Batch Norm into a neural network | Batch normalization
Where to put batch norm operations?
- Before activation functions. As an input of activation functions.
Gradient descent with batch norm
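- $\beta^{[l]}$ and the bias $b^{[l]}$ are both added to the unit, and batch norm subtracts the mean, which cancels any constant bias. Thus, it is okay to delete one of them. Delete the biases $b^{[l]}$.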
- Update rules for 3 parameter sets
C2W3L6 Why does Batch Norm work? | Batch normalization
Covariate shift
Andrew Ng does not use batch norm as a regularizer. The regularization effect of batch norm is not the intent of batch norm.
Batch norm handles one mini-batch at a time. It computes the mean and variance on each mini-batch. At test time, we may process one example at a time, so the mean and variance cannot be computed on a mini-batch and have to be estimated separately.
C2W3L7 Batch Norm at test time | Batch normalization
At test time, use $\mu$, $\sigma^2$ estimated on the training set. Usually, we use an exponentially weighted average across mini-batches to estimate them. That means we estimate them while putting high importance on the latest $\mu$, $\sigma^2$. There are other ways to estimate them.
C2W3L8 Softmax Regression | Multi-class classification
- The equations of the last layers
- How the softmax regression classifies multiple classes. With diagrams.
- Generalized version of logistic regression. True?
C2W3L9 Training a softmax classifier | Multi-class classification
- Hard-max [1 0 … 0]
- Soft-max [0.8 0.1 … 0.05]
- Softmax regression generalizes logistic regression to $C$ classes.
- If $C = 2$, softmax regression reduces to logistic regression. This can be proved.
How to train a softmax classifier
- Maximum likelihood estimation
Loss
- fill in
In Deep Learning frameworks, you only need to focus on correct forward propagation.
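A minimal sketch of the softmax activation and the cross-entropy loss that maximum likelihood estimation leads to, assuming one-hot labels and one example per column. The function names and the 4-class example are illustrative; subtracting the column maximum is a common numerical-stability trick that the notes do not mention.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax of logits Z with shape (n_classes, m)."""
    Z_shift = Z - Z.max(axis=0, keepdims=True)   # stability; the result is unchanged
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=0, keepdims=True)

def cross_entropy(Y_hat, Y):
    """Average of -sum_j y_j * log(y_hat_j) over the m examples."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])      # logits for one example
Y = np.array([[0.0], [1.0], [0.0], [0.0]])       # one-hot label: class 2
Y_hat = softmax(Z)
loss = cross_entropy(Y_hat, Y)
print(Y_hat.ravel(), loss)
```

The probabilities come out near 0.842, 0.042, 0.002, and 0.114, so the loss is large because the true class received little probability.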
C2W3L10 Deep learning frameworks | Introduction to programming frameworks
Criteria to choose deep learning frameworks
- Ease of programming (development and deployment)
- Running speed
- Truly open (open source with good governance)
C2W3L11 Tensorflow | Introduction to programming frameworks