Structuring Machine Learning Projects | Deep Learning Specialization | Coursera

Brief information
  • Course name: Structuring Machine Learning Projects
  • Instructor: Andrew Ng
  • Institution:
  • Media: Coursera
  • Specialization: Deep Learning
  • Duration: 2 weeks

About this Course

You will learn how to build a successful machine learning project. If you aspire to be a technical leader in AI, and know how to set direction for your team’s work, this course will show you how.

Much of this content has never been taught elsewhere, and is drawn from my experience building and shipping many deep learning products. This course also has two “flight simulators” that let you practice decision-making as a machine learning project leader. This provides “industry experience” that you might otherwise get only after years of ML work experience.

After 2 weeks, you will:

  • Understand how to diagnose errors in a machine learning system, and
  • Be able to prioritize the most promising directions for reducing error
  • Understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance
  • Know how to apply end-to-end learning, transfer learning, and multi-task learning.

I’ve seen teams waste months or years through not understanding the principles taught in this course. I hope this two week course will save you months of time.

This is a standalone course, and you can take this so long as you have basic machine learning knowledge. This is the third course in the Deep Learning Specialization.

Week 1: ML Strategy (1)
  1. Video: Why ML Strategy
  2. Video: Orthogonalization
  3. Video: Single number evaluation metric
  4. Video: Satisficing and Optimizing metric
  5. Video: Train/dev/test distributions
  6. Video: Size of the dev and test sets
  7. Video: When to change dev/test sets and metrics
  8. Video: Why human-level performance?
  9. Video: Avoidable bias
  10. Video: Understanding human-level performance
  11. Video: Surpassing human-level performance
  12. Video: Improving your model performance
  13. Reading: Machine Learning flight simulator
  14. Video: Andrej Karpathy interview
  15. Graded: Bird recognition in the city of Peacetopia (case study)
Week 2: ML Strategy (2)
  1. Video: Carrying out error analysis
  2. Video: Cleaning up incorrectly labeled data
  3. Video: Build your first system quickly, then iterate
  4. Video: Training and testing on different distributions
  5. Video: Bias and Variance with mismatched data distributions
  6. Video: Addressing data mismatch
  7. Video: Transfer learning
  8. Video: Multi-task learning
  9. Video: What is end-to-end deep learning?
  10. Video: Whether to use end-to-end deep learning
  11. Video: Ruslan Salakhutdinov interview
  12. Graded: Autonomous driving (case study)

C3W1L01 Why ML Strategy?

For efficient development

C3W1L02 Orthogonalization

  • Orthogonalization means designing controls so that each one affects only a single aspect of the system, in the way that a car has one control for the steering angle and a separate control for the speed.
  • [My view] Orthogonalization is a way of thinking that separates independent components in order to achieve some result.
    • orthogonalize = separate independent components
Chain of assumptions in machine learning
  1. Fit training set well on cost function.
    • Tune the optimizer. (e.g. Adam)
    • Tune the size of the network.
  2. Fit dev set well on cost function.
    • Regularization
    • Not good performance on dev set. → Get bigger training set.
  3. Fit test set well on cost function.
    • Not good performance on test set. → Get bigger dev set.
  4. Perform well in real world.
    • Not good performance in the real world. → Change dev set or cost function.
      • That is because it is probable that
      • dev set does not have representative distribution of the real world distribution, or
      • the cost function does not measure the performance in the real world.

When training a network, Andrew Ng tends not to use early stopping, because it affects the fit to the training set and the performance on the dev set simultaneously, which makes its effect difficult to reason about.

C3W1L03 Single number evaluation metric

With multiple metrics, it is hard to choose the best model among several candidates. Therefore, we should select a single evaluation metric. A good approach is to create a new metric that combines the individual metrics. For example, the F1 score combines precision and recall by taking their harmonic mean. Averaging is one of the simplest ways to create a single number evaluation metric.
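As a minimal sketch of the idea, the F1 score can be computed and used to rank models as follows. The classifier names and their precision/recall values are made-up illustration values, not results from the course.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifiers: name -> (precision, recall)
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

# One number per model makes the comparison unambiguous.
scores = {name: f1_score(p, r) for name, (p, r) in classifiers.items()}
best = max(scores, key=scores.get)
```

With a single number per model, picking the best candidate reduces to taking a maximum.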

A well-defined dev set and a single number evaluation metric speed up the development of machine learning algorithms.

C3W1L04 Satisficing and Optimizing metric

Satisficing metric

A satisficing metric is a metric that only needs to meet some condition; once the condition is met, its exact value is no longer of interest. For example, running time < 100 (for example, in milliseconds) is a threshold condition. Threshold conditions are common satisficing metrics.

Optimizing metric

An optimizing metric is a metric that is desired to be maximized or minimized.
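A small sketch of how the two kinds of metric combine when choosing a model: filter by the satisficing condition, then maximize the optimizing metric. The function name `pick_model`, the millisecond unit, and the model figures are assumptions made for illustration.

```python
def pick_model(models, max_runtime_ms=100.0):
    """Return the most accurate model whose running time is under the threshold."""
    # Satisficing metric: running time only has to be below the threshold.
    feasible = [m for m in models if m["runtime_ms"] < max_runtime_ms]
    if not feasible:
        return None
    # Optimizing metric: among feasible models, maximize accuracy.
    return max(feasible, key=lambda m: m["accuracy"])

# Hypothetical candidates
models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]
chosen = pick_model(models)
```

Model C is the most accurate but violates the running-time condition, so it is never considered.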

C3W1L05 Train/dev/test distributions

  • The dev set and the test set are assumed to have the same distribution.
  • Assumption of the distributions: (entire set) = (dev set) = (test set)
Guideline for dev set and test set

Choose a development set and test set to reflect data you expect to get in the future and consider important to do well on.

C3W1L06 Size of the dev and test sets

Old way of splitting data
  • (all data) : (train data) : (test data) = 100 : 70 : 30
  • (all data) : (train data) : (dev data) : (test data) = 100 : 60 : 20 : 20
  • Size(all data) = 100 ~10,000
New way of splitting data for deep learning / big data
  • (all data) : (train data) : (dev data) : (test data) = 100 : 98 : 1 : 1
  • Size(all data) = 1,000,000
  • Size(dev data) = Size(test data) = 10,000

1% of the entire data is enough to validate or test the generality of the model.
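The arithmetic behind the two splitting schemes can be sketched as below. The helper name `split_sizes` is hypothetical; the fractions are the ones listed above.

```python
def split_sizes(n_total, dev_frac=0.01, test_frac=0.01):
    """Return (train, dev, test) sizes; whatever is left over goes to training."""
    n_dev = round(n_total * dev_frac)
    n_test = round(n_total * test_frac)
    return n_total - n_dev - n_test, n_dev, n_test

big_data_split = split_sizes(1_000_000)          # 98 : 1 : 1
small_data_split = split_sizes(1_000, 0.2, 0.2)  # 60 : 20 : 20
```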

How to set the size of the dev set

Set your dev set to be big enough to detect differences in algorithm/models you’re trying out.

How to set the size of the test set

Set your test set to be big enough to give high confidence in the overall performance of your system.

C3W1L07 When to change dev/test sets and metrics

When to change dev/test sets and metrics
  • Case 1: When the model with the lower dev error gives more undesirable results than a model with a higher dev error,
    • To do: Change the metric so that it ranks the model with the desirable results higher.
  • Case 2: When doing well on your metric + dev/test set does not correspond to doing well on your application,
    • To do: Change your metric and/or dev/test set.
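For Case 1, one possible way to change the metric is to weight unacceptable mistakes more heavily than ordinary ones. The sketch below is only an illustration under that assumption; the function name `weighted_error` and the weight of 10 are hypothetical choices.

```python
def weighted_error(y_true, y_pred, weights):
    """Weighted misclassification rate: sum(w * [pred != true]) / sum(w)."""
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / sum(weights)

y_true = [1, 0, 0, 1]
y_pred = [1, 1, 0, 0]
# The second example is a mistake we consider unacceptable, so it gets weight 10.
weights = [1, 10, 1, 1]

plain = weighted_error(y_true, y_pred, [1, 1, 1, 1])  # ordinary error rate
weighted = weighted_error(y_true, y_pred, weights)
```

Under the weighted metric, a model that makes the unacceptable mistake is penalized far more than its plain error rate suggests.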

Case 2: User images of low quality

  • How to define a metric
  • How to do well on this metric

The two processes above are independent(= orthogonalized).

Guideline from Andrew Ng
  • So my recommendation is, even if you can’t define the perfect evaluation metric and dev set, just set something up quickly and use that to drive the speed of your team iterating. And if later down the line you find out that it wasn’t a good one, you have better idea, change it at that time, it’s perfectly okay.
  • What I do not recommend is running for too long without any evaluation metric and dev set, because that can slow down how efficiently your team can iterate and improve your algorithm.

C3W1L08 Why human-level performance?

Why do we compare the machine learning systems to human level performance?
  • Because of advances in deep learning, machine learning algorithms become competitive with human-level performance.
  • The workflow of designing and building a machine learning system is much more efficient when you’re trying to do something that humans can also do. So in those settings, it becomes natural to talk about comparing, or trying to mimic human-level performance.
The typical performance graph of a machine learning algorithm

  • Bayes optimal error = Bayesian optimal error = Bayes error
  • Human level performance
Why does the performance of a machine learning algorithm improve only slowly after it surpasses human-level performance?
  • The Bayes error and the human-level performance are very close. Once an ML algorithm surpasses human-level performance, there is little room left to improve it.
  • If the performance of a ML algorithm is lower than the human level performance, there are many things to apply to improve the algorithm. For example,
    • to get more labeled data from humans.
    • to get insight from thinking why a person gets this right.
    • to analyze bias/variance.

C3W1L08B Avoidable bias

Case study

| | Case 1 | Case 2 |
|---|---|---|
| Human-level error | 1% | 7.5% |
| Train error | 8% | 8% |
| Dev error | 10% | 10% |
| To do | Reduce avoidable bias, because the avoidable bias (7%) is high. | Reduce variance, because [1] the avoidable bias (0.5%) is low and [2] the dev error is high relative to the train error (variance 2%). |

  • avoidable bias = (train error) – (Bayes error)
    • bias = (train error)
  • variance = (dev error) – (train error)
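A minimal sketch of this comparison, using the human-level error as a proxy for the Bayes error. The function name `diagnose` and the rule of simply focusing on the larger gap are simplifying assumptions; all values are in percent.

```python
def diagnose(human_err, train_err, dev_err):
    """Return (avoidable_bias, variance, focus), with errors given in percent."""
    avoidable_bias = train_err - human_err  # gap to the Bayes-error proxy
    variance = dev_err - train_err          # gap between train and dev
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

case_1 = diagnose(1.0, 8.0, 10.0)
case_2 = diagnose(7.5, 8.0, 10.0)
```

The same train and dev errors lead to opposite priorities depending on the human-level error.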

C3W1L09 Understanding human-level performance

The definition of the human-level error
  • The lowest error that can be reached by the human
The Bayes error and the human-level error
  • (Bayes error) ≤ (human-level error)
  • The human-level error is a proxy of the Bayes error.
  • Mostly, the Bayes error cannot be known.

C3W1L10 Surpassing human-level performance

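Case study

| | Case 1 | Case 2 |
|---|---|---|
| Team human error | 0.5% | 0.5% |
| One human error | 1% | 1% |
| Train error | 0.6% | 0.3% |
| Dev error | 0.8% | 0.4% |
| Avoidable bias | 0.6% – 0.5% = 0.1% | Unknown (at most 0.3%) |

  • In Case 2, the train error is already below the best human-level error, so we no longer have a good estimate of the Bayes error. We do not know whether the avoidable bias can be reduced. In other words, we do not know whether the model can still be improved.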
Problems where ML significantly surpasses human-level performance
  • Online advertising
  • Product recommendations
  • Logistics: predicting transit time, etc.
  • Loan approvals

These are from structured data.

  • Speech recognition
  • Image recognition
  • Medical tasks: ECG, diagnosing skin cancers

These are from natural perception.

C3W1L11 Improving your model performance

Summary of this week.

The two fundamental assumptions of supervised learning
  1. To have a low avoidable bias: you can fit the training set well.
  2. To have a low or acceptable variance: the training set performance generalizes well to the development set and test set.

How to reduce avoidable bias and variance

C3W1Quiz Bird recognition in the city of Peacetopia (case study)

The following exercise is a “flight simulator” for machine learning. Rather than you needing to spend years working on a machine learning project before you get to experience certain scenarios, you’ll get to experience them right here.

Personal note from Andrew: I’ve found practicing with scenarios like these to be useful for training PhD students and advanced Deep Learning researchers. This is the first time this type of “airplane simulator” for machine learning strategy has ever been made broadly available. I hope this helps you gain “real experience” with machine learning much faster than even full-time machine learning researchers typically do from work experience.

  • 11/15 = 83% correct

C3W2L01 Carrying out error analysis

Error analysis: to manually examine mistakes that your algorithm is making
  • Get about 100 dev examples that the algorithm got wrong.
  • Inspect each example and find the cause of the errors.

For example, through error analysis we might learn what percentage of the errors comes from a given cause, such as mislabeled examples. That percentage gives a ceiling on how much the error could be reduced by fixing that cause.

In this lecture, cat image classification was the example case. Candidate causes of errors were dog images that look like cats, big cats such as lions, and blurry images. During error analysis, each misclassified image from the dev set was inspected to decide which of the candidate causes applied to it. In the end, we could see which candidates were the real causes. The instructor, Andrew Ng, usually uses a spreadsheet for error analysis. Each row of the spreadsheet has an image ID, a true/false cell for each candidate cause, and a free-text comment.
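The spreadsheet described above can be sketched as a list of rows that is tallied per cause. The column names, the four example rows, and the 10% dev error are made-up illustration values, not data from the lecture.

```python
CAUSES = ("dog", "big_cat", "blurry")

def cause_fractions(rows):
    """Fraction of inspected errors attributed to each candidate cause."""
    n = len(rows)
    return {c: sum(1 for r in rows if r[c]) / n for c in CAUSES}

def max_error_reduction(dev_error, fraction):
    """Ceiling on the dev-error reduction from fully fixing one cause."""
    return dev_error * fraction

# Hypothetical rows: image ID, one true/false cell per cause, and a comment.
rows = [
    {"id": 1, "dog": True,  "big_cat": False, "blurry": False, "comment": "pitbull"},
    {"id": 2, "dog": False, "big_cat": True,  "blurry": True,  "comment": "lion, rainy day"},
    {"id": 3, "dog": False, "big_cat": False, "blurry": True,  "comment": ""},
    {"id": 4, "dog": False, "big_cat": True,  "blurry": False, "comment": "panther"},
]
fractions = cause_fractions(rows)
ceiling_dog = max_error_reduction(10.0, fractions["dog"])  # dev error in percent
```

One image can have several causes, so the fractions do not need to sum to 1.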

C3W2L02 Cleaning up incorrectly labeled data

Incorrectly labeled data in the training set

Deep learning algorithms are quite robust to random errors in the training set. If errors occurred at random, it would be okay to leave the errors. However, the algorithms are not robust to systematic errors. Therefore, systematic errors should be resolved.

Incorrectly labeled data in the development and test sets
  • Analyze how much of the overall dev set error is due to incorrectly labeled examples.
  • If incorrect labels account for a large share of the dev error, they should be fixed, because the dev error is no longer a credible basis for comparing models. (dev : incorrect : other = 2% : 0.6% : 1.4%, so incorrect labels cause 30% of the errors)
  • Otherwise, it is okay to leave the incorrect labels in place for deep learning algorithms. (dev : incorrect : other = 10% : 0.6% : 9.4%, so incorrect labels cause only 6% of the errors)

Correcting incorrect dev/test set examples
  • Apply the same process to your dev and test sets
    • to make sure they continue to come from the same distribution.
  • Consider examining [1] examples that your algorithm got right as well as [2] ones it got wrong.
    • But if correct examples take up most of examples, say, 98%, examining the correct examples will not contribute to improving performance.
  • After correcting mislabeled data, train and dev/test data may now come from slightly different distributions.
Andrew Ng’s advice
  • Do not be reluctant to manually examine why errors occur. Manual error analysis really helps.

C3W2L03 Build your first system quickly, then iterate

Andrew Ng’s advice
  • If you’re working on a brand new machine learning application, one of the pieces of advice I often give people is that you should build your first system quickly and then iterate.
When you work on a brand new machine learning application
  • Set up dev/test set and metric.
    • Set up a target/goal.
  • Build initial system quickly.
    • Training set: Fit the parameters quickly.
    • Development set: Tune the hyperparameters.
    • Test set: Assess the performance.
  • Use bias/variance analysis and error analysis to prioritize next steps.

C3W2L04 Training and testing on different distributions

There are some subtleties and best practices for dealing with training and test distributions that differ from each other.

Case study: To develop a mobile application to classify cat or non-cat pictures

Data set
  1. Web data set
    • Size: 200,000
    • Well framed
    • In high resolution
  2. Mobile data set
    • Size: 10,000
    • In low resolution: blurred images
Build training/dev/test sets
  • Training set:
    1. 200,000 web + 5,000 mobile
    2. 200,000 web
  • Dev set:
    1. 2,500 mobile
    2. 5,000 mobile
  • Test set:
    1. 2,500 mobile
    2. 5,000 mobile

If we just shuffle the web data and the mobile data together, mobile images make up only 10,000 / 210,000 = 1/21 of the dev and test sets. The dev and test sets would then be dominated by web images, and their distribution would not resemble the distribution of the real mobile images.

Data from the mobile app should be in the dev and test sets.
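Option 1 above can be sketched as follows. The function name `build_splits` and the placeholder records are assumptions; only the set sizes come from the case study.

```python
import random

def build_splits(web, mobile, n_mobile_train, seed=0):
    """Train on web data plus some mobile data; dev and test use mobile data only."""
    rng = random.Random(seed)
    mobile = list(mobile)       # copy so the caller's list is not shuffled
    rng.shuffle(mobile)
    train = list(web) + mobile[:n_mobile_train]
    rest = mobile[n_mobile_train:]
    half = len(rest) // 2
    dev, test = rest[:half], rest[half:]
    rng.shuffle(train)          # mix web and mobile examples within training
    return train, dev, test

# Placeholder records standing in for images: (source, index)
web = [("web", i) for i in range(200_000)]
mobile = [("mobile", i) for i in range(10_000)]
train, dev, test = build_splits(web, mobile, n_mobile_train=5_000)
```

Setting `n_mobile_train=0` gives option 2, where all mobile data goes to the dev and test sets.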

C3W2L05 Bias and Variance with mismatched data distributions

General formulation

Training-development set

A training-development set

  • has the same distribution as the training set, and
  • is not used for training the neural network.
  • can be created by randomly selecting training examples with the same size of the development set.
Variance problem

A variance problem occurs when

  • the training and training-development sets have the same distribution, and
  • the difference between the training and training-development errors is large,
    • which means training does not generalize well.
Data mismatch
  • Data mismatch means mismatch between the distributions of the training-development and development sets.
  • There is no systematic solution to data mismatch problems. There are some things to try, which are given in the next lecture.
Degree of overfitting to the development set
  • This is the gap between the development error and the test error, and it should be very small.
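The quantities in this section can be read off as successive gaps between error levels, with the training-development set separating variance from data mismatch. The sketch below assumes human-level error as a proxy for the Bayes error; the function name `error_gaps` and the example numbers (in percent) are hypothetical.

```python
def error_gaps(human, train, train_dev, dev, test):
    """Decompose the errors (in percent) into the four gaps discussed above."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfitting to dev set": test - dev,
    }

gaps = error_gaps(human=0.0, train=1.0, train_dev=1.5, dev=10.0, test=10.5)
largest = max(gaps, key=gaps.get)
```

In this made-up example the jump happens between the training-development error and the development error, which points to data mismatch rather than variance.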

C3W2L06 Addressing data mismatch

General guide for addressing data mismatch
  1. Perform manual error analysis to understand the error differences between training, development/test sets.
  2. Make training data similar to dev/test sets.
    • Artificial data synthesis: To synthesize training examples to become similar to development/test examples.
      • If dev/test examples have car noises, then we can add car noise to the training examples.
      • Make a computer graphic system generate images similar to dev/test sets.
  3. Collect more training data similar to dev/test sets.

C3W2L07 Transfer learning

  • Transfer learning: reusing part of a model trained for Task A to help train a model for Task B.
  • Pre-training: training the model for Task A
  • Fine-tuning: adapting the Task A model to Task B by training it further on Task B data
How to do fine-tuning: Guideline
  1. Delete the last (output) layer of the neural network trained for Task A.
  2. Delete the weights feeding into that output layer.
  3. If the training data for Task B is small,
    • just add one or a few new layers that
      • take the output of the truncated Task A model as input and
      • learn the output for Task B.
  4. If the training data for Task B is large,
    • depending on the size of the training set for Task B,
    • also retrain some of the last layers of the truncated Task A model while training the new layers for Task B.
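The small-data case (step 3) can be illustrated with a toy, dependency-free sketch: the layers kept from Task A stay frozen and only a new logistic output layer is trained on Task B. The weights `W1`, the toy data, and all function names are hypothetical; a real project would use a deep learning framework and an actually pre-trained network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def frozen_features(x, W1):
    """Layers kept from Task A (here one tanh layer); W1 is never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def log_loss(feats, y, w, b):
    """Mean logistic loss of the Task B output layer."""
    total = 0.0
    for h, t in zip(feats, y):
        p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y)

def train_new_head(feats, y, epochs=200, lr=0.5):
    """Gradient descent on the new logistic output layer only."""
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for h, t in zip(feats, y):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - t
            gw = [g + err * hi for g, hi in zip(gw, h)]
            gb += err
        w = [wi - lr * g / len(y) for wi, g in zip(w, gw)]
        b -= lr * gb / len(y)
    return w, b

W1 = [[1.0, -1.0], [0.5, 0.5]]                 # hypothetical pre-trained weights
X_b = [[0, 1], [1, 0], [1, 1], [0, 0]]         # toy Task B inputs
y_b = [0, 1, 1, 0]                             # toy Task B labels
feats = [frozen_features(x, W1) for x in X_b]  # computed once, since W1 is frozen
loss_before = log_loss(feats, y_b, [0.0, 0.0], 0.0)
w, b = train_new_head(feats, y_b)
loss_after = log_loss(feats, y_b, w, b)
```

For the large-data case, the update step would also include some or all of the weights in the kept layers instead of leaving them frozen.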
When to use transfer learning: Guideline

Transferring from Task A to Task B.

  • Task A and B have the same input x.
  • You have a lot more data for Task A than Task B.
  • Low level features from Task A could be helpful for Task B.

C3W2L08 Multi-task learning

Multi-task learning
  • Multi-task learning: to have one neural network learn to do several tasks simultaneously.
  • How often it is used: transfer learning > multi-task learning
When to use multi-task learning
  • When training on a set of tasks that could benefit from having shared lower-level features. [Principle]
  • When the amount of data you have for each task is quite similar.
    • For example, there are 100 tasks and each task has about 1,000 examples. Each task can learn not only from 1,000 examples for its task but also from 99,000 examples for the other tasks.
  • When we can train a big enough neural network to do well on all tasks.
    • If the neural network for multi-task learning is quite small, separate neural networks for each task perform better than a single multi-task neural network.
Example: Simplified autonomous vehicle
  • In this case, the performance of the system is better when one neural network is trained to do four tasks than training four separate neural networks since some of the earlier features in the neural network could be shared between the different types of objects.
  • Use multiple labels at the last layer. Each unit of the last layer represents true or false of each label. So to speak, the last layer consists of multiple logistic classifiers.
Multi-task architecture can be learned with some partially unlabeled examples

Leave the terms for unlabeled entries out of the loss sum (equivalently, give them a weight of 0), so that they do not contribute to the overall loss.
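A minimal sketch of such a loss for one example, assuming one logistic output per label and using `None` to mark a missing label. The function name and the numbers are illustrative only.

```python
import math

def multitask_loss(y_true, y_prob):
    """Sum of per-label logistic losses; labels given as None are skipped."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        if t is None:  # unlabeled entry: contributes nothing to the loss
            continue
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total

# Four labels for one image; the second and fourth are unlabeled.
y_true = [1, None, 0, None]
y_prob = [0.9, 0.2, 0.1, 0.7]
loss = multitask_loss(y_true, y_prob)
```

Only the two labeled entries affect the loss, so the predictions for the unlabeled ones receive no gradient from this example.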

C3W2L09 What is end-to-end deep learning?

End-to-end deep learning

End-to-end deep learning refers to replacing multiple processing steps with a single deep neural network.

  • End-to-end is not always great.
  • Breaking an end-to-end architecture down into separate steps can give better results.
End-to-end vs. step-by-step

| | Step-by-step | End-to-end |
|---|---|---|
| Overall steps | >1 | 1 |
| Pros | Requires little data; the processes can be interpreted | No need to consider multiple processing steps |
| Cons | Each step must be designed manually; data sets must be acquired for all steps | Requires a lot of data; the process is hard to interpret |

C3W2L10 Whether to use end-to-end deep learning

Self-question before applying end-to-end deep learning

Do you have enough data to learn a function of the complexity needed to map x to y?

Pros and cons of end-to-end deep learning
  • Pros
    • Let the data speak.
      • It will be able to find which statistics are in the data, rather than being forced to reflect human preconceptions.
    • Less hand-designing of components needed.
  • Cons
    • Requires a large amount of labeled data.
    • Excludes potentially useful hand-designed components.
      • If the data set is small, a hand-designed system is a way to inject human knowledge into the algorithm.
      • BIG data, little hand-design
      • little data, BIG hand-design
My thoughts
  • I think knowing cognitive models of humans would help in designing the entire information-processing system, and would give me insight into where to place deep learning components within it.

C3W2Quiz Autonomous driving (case study)
