Neural Networks for Machine Learning | by Geoffrey Hinton | Coursera

Brief Information
  • Course name: Neural Networks for Machine Learning
  • Lecturer : Geoffrey Hinton
  • Duration:
  • Syllabus
  • Record
  • Certificate
  • Learning outcome
  • About this course
    • Learn about artificial neural networks and how they’re being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We’ll emphasize both the basic algorithms and the practical tricks needed to get them to work well.
    • [Sungjae’s opinion] This course is not appropriate for students who have not studied machine learning at all. Many machine learning terms appear in this course without explanation. It is recommended for anyone who has already studied machine learning and wants to deepen their understanding of artificial neural networks.

Brief Summary

  • Week 1) Introduction
    • 1-a) Why do we need machine learning?
    • 1-b) What are neural networks?
    • 1-c) Some simple models of neurons
    • 1-d) A simple example of learning
    • 1-e) Three types of learning
    • Quiz 1
  • Week 2) The perceptron learning procedure
    • 2-a) An overview of the main types of neural network architecture
    • 2-b) Perceptrons: the first generation of neural networks
    • 2-c) A geometrical view of perceptrons
    • 2-d) Why the learning works
    • 2-e) What perceptrons can’t do
    • Quiz 2
  • Week 3) The backpropagation learning procedure
    • Reading: D. E. Rumelhart, G. E. Hinton & R. J. Williams (1985) Learning internal representations by error propagation. Parallel distributed processing: explorations in the microstructure of cognition, vol. 1. MIT Press Cambridge: MA. pp.318-362. [LINK]
    • 3-a) Learning the weights of a linear neuron
    • 3-b) The error surface for a linear neuron
    • 3-c) Learning the weights of a logistic output neuron
    • 3-d) The backpropagation algorithm
    • 3-e) How to use the derivatives computed by the backpropagation algorithm
    • Quiz 3
    • Programming Assignment 1: The perceptron learning algorithm
  • Week 4) Learning feature vectors for words
    • Reading: Y. Bengio, R. Ducharme, P. Vincent & C. Jauvin (2003) A Neural Probabilistic Language Model. Journal of Machine Learning Research vol. 3. pp. 1137-1155
    • 4-a) Learning to predict the next word
    • 4-b) A brief diversion into cognitive science
    • 4-c) Another diversion: The softmax output function
    • 4-d) Neuro-probabilistic language models
    • 4-e) Ways to deal with the large number of possible outputs
    • Quiz 4
  • Week 5) Object recognition with neural nets
    • Reading: Y. LeCun, L. Bottou, Y. Bengio & P. Haffner (Nov. 1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11). pp. 2278-2324
    • Reading: Y. LeCun & Y. Bengio (1998) Convolutional Networks for Images, Speech, and Time Series. The handbook of brain theory and neural networks. MIT Press Cambridge. pp. 255-258.
    • 5-a) Why object recognition is difficult
    • 5-b) Achieving viewpoint invariance
    • 5-c) Convolutional nets for digit recognition
    • 5-d) Convolutional nets for object recognition
    • Quiz 5
    • Programming Assignment 2: Learning Word Representations
      • Complete the Octave/Matlab code for Bengio’s neural probabilistic language model, which appeared in the following paper.
      • Y. Bengio, R. Ducharme, P. Vincent & C. Jauvin (2003) A Neural Probabilistic Language Model. Journal of Machine Learning Research vol. 3. pp. 1137-1155
  • Week 6) Optimization: how to make the learning go faster
    • 6-a) Overview of mini-batch gradient descent
    • 6-b) A bag of tricks for mini-batch gradient descent
    • 6-c) The momentum method
    • 6-d) Adaptive learning rates for each connection
    • 6-e) Rmsprop: Divide the gradient by a running average of its recent magnitude
    • Quiz 6
  • Week 7) Recurrent neural networks
    • 7-a) Modeling sequences: A brief overview
    • 7-b) Training RNNs with back propagation
    • 7-c) A toy example of training an RNN
    • 7-d) Why it is difficult to train an RNN
    • 7-e) Long-term Short-term-memory
    • Quiz 7
  • Week 8) More recurrent neural networks
    • 8-a) Modeling character strings with multiplicative connections
    • 8-b) Learning to predict the next character using HF
    • 8-c) Echo State Networks
    • Quiz 8
  • Week 9) Ways to make neural networks generalize better
    • 9-a) Overview of ways to improve generalization
    • 9-b) Limiting the size of the weights
    • 9-c) Using noise as a regularizer
    • 9-d) Introduction to the full Bayesian approach
    • 9-e) The Bayesian interpretation of weight decay
    • 9-f) MacKay’s quick and dirty method of setting weight costs
    • Quiz 9
    • Programming assignment 3: Optimization and generalization
  • Week 10) Combining multiple neural networks to improve generalization
    • 10-a) Why it helps to combine models
    • 10-b) Mixtures of Experts
    • 10-c) The idea of full Bayesian learning
    • 10-d) Making full Bayesian learning practical
    • 10-e) Dropout
    • Quiz 10
  • Week 11) Hopfield nets and Boltzmann machines
    • 11-a) Hopfield Nets
    • 11-b) Dealing with spurious minima
    • 11-c) Hopfield nets with hidden units
    • 11-d) Using stochastic units to improve search
    • 11-e) How a Boltzmann machine models data
    • Quiz 11
  • Week 12) Restricted Boltzmann machines (RBMs)
    • 12-a) Boltzmann machine learning
    • 12-b) OPTIONAL VIDEO: More efficient ways to get the statistics
    • 12-c) Restricted Boltzmann Machines
    • 12-d) An example of RBM learning
    • 12-e) RBMs for collaborative filtering
    • Quiz 12
  • Week 13) Stacking RBMs to make Deep Belief Nets
    • Video: The ups and downs of back propagation
    • Video: Belief Nets
    • Video: The wake-sleep algorithm
    • Programming Assignment 4: Restricted Boltzmann Machines
    • Quiz 13
  • Week 14) Deep neural nets with generative pre-training
    • Video: Learning layers of features by stacking RBMs
    • Video: Discriminative learning for DBNs
    • Video: What happens during discriminative fine-tuning?
    • Video: Modeling real-valued data with an RBM
    • Video: OPTIONAL VIDEO: RBMs are infinite sigmoid belief nets
    • Quiz 14
  • Week 15) Modeling hierarchical structure with neural nets
    • Video: From PCA to autoencoders
    • Video: Deep auto encoders
    • Video: Deep auto encoders for document retrieval
    • Video: Semantic Hashing
    • Video: Learning binary codes for image retrieval
    • Video: Shallow autoencoders for pre-training
    • Quiz 15
    • Final Exam
  • Week 16) Recent applications of deep neural nets
    • Video: OPTIONAL: Learning a joint model of images and captions
    • Video: OPTIONAL: Hierarchical Coordinate Frames
    • Video: OPTIONAL: Bayesian optimization of hyper-parameters

Week 2) The Perceptron Learning Procedure

2-a) An Overview of the Main Types of Neural Network Architecture
  • Feed-forward neural networks
  • Recurrent neural networks
    • RNNs are hard to train.
    • Can remember information in their hidden state for a long time.
    • Can be used for modeling sequences.
    • Can predict the next character in a sequence.
  • Symmetrically connected networks (= Hopfield networks)
    • Symmetrical connections have the same weight in both directions.
    • Symmetric networks are much easier to analyze than recurrent networks. (‘analyze’?)
  • Symmetrically connected networks with hidden units (= Boltzmann machines)
2-b) Perceptrons: the First Generation of Neural Networks
Binary Threshold Neurons
  • \mathbf{x}=(1,x_1,\cdots,x_n)
  • \mathbf{w}=(b,w_1,\cdots,w_n)
  • z=\mathbf{w}\cdot\mathbf{x}
  • g(z)=1 if z\geq 0; g(z)=0 otherwise.
Perceptron Learning Algorithm (How to Train Binary Threshold Neurons)
  • If g(z)=y(\mathbf{x}), do nothing.
  • If g(z)\neq y(\mathbf{x})=1, increase z by \mathbf{w}\leftarrow \mathbf{w}+\mathbf{x}. Then, z\leftarrow z + \left \| \mathbf{x} \right \|^2
  • If g(z)\neq y(\mathbf{x})=0, decrease z by \mathbf{w}\leftarrow \mathbf{w}-\mathbf{x}. Then, z\leftarrow z - \left \| \mathbf{x} \right \|^2
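The three rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code of the course's programming assignment: the function names, the epoch limit, and the logical-AND toy data are assumptions chosen for the example.

```python
def predict(w, x):
    """Binary threshold neuron: g(z) = 1 if z >= 0 else 0, with z = w . x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z >= 0 else 0


def train_perceptron(cases, n_inputs, max_epochs=100):
    """cases: list of (inputs, target) pairs with target in {0, 1}."""
    w = [0.0] * (n_inputs + 1)            # w[0] plays the role of the bias b
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in cases:
            x = [1.0] + list(inputs)      # x = (1, x_1, ..., x_n)
            if predict(w, x) == target:
                continue                  # correct output: do nothing
            # target 1 but output 0: w <- w + x; target 0 but output 1: w <- w - x
            sign = 1.0 if target == 1 else -1.0
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            mistakes += 1
        if mistakes == 0:                 # a full pass without errors
            break
    return w


# Toy data (an assumption for illustration): logical AND, which is linearly separable.
and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(and_cases, n_inputs=2)
and_outputs = [predict(w_and, [1.0, a, b]) for (a, b), _ in and_cases]
```

Because AND is linearly separable, the loop stops once a full pass produces no mistakes.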
2-c) A Geometrical View of Perceptrons
2-d) Why the Learning Works
2-e) What Perceptrons Can’t Do

Week 3) The Backpropagation Learning Procedure

3-a) Learning the Weights of a Linear Neuron
  • There are two different methods for finding the solution: the iterative method and the analytic method.
  • The iterative method generalizes much more easily than the analytic method. That is, it can be applied to many more problems.
The Iterative Learning Method
  • If two input dimensions are highly correlated, the iterative method can be very slow.
The Online Delta Rule
  • w_i \leftarrow w_i + \Delta w_i
    • \Delta w_i = \epsilon x_i (t-y)
    • \epsilon >0
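A minimal sketch of the online delta rule for a linear neuron follows. The toy data, the learning rate of 0.05, and the number of epochs are assumptions chosen so that the example converges; they do not come from the lecture.

```python
def online_delta_rule(cases, n_inputs, epsilon=0.05, epochs=500):
    """Update the weights after every single training case."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, t in cases:
            y = sum(wi * xi for wi, xi in zip(w, x))           # linear neuron output
            # Delta w_i = epsilon * x_i * (t - y)
            w = [wi + epsilon * xi * (t - y) for wi, xi in zip(w, x)]
    return w


# Toy targets generated by the weights (2, -1); this data is an assumption.
toy_cases = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w_online = online_delta_rule(toy_cases, n_inputs=2)
```

With consistent targets such as these, the learned weights approach the generating weights (2, -1).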
The Batch Delta Rule
  • w_i \leftarrow w_i + \Delta w_i
  • \Delta w_i = -\epsilon \frac{\partial E}{\partial w_i}=-\epsilon \sum_{n}\frac{\partial y^n}{\partial w_i}\frac{\partial E}{\partial y^n}=\sum_{n}\epsilon x_i^n(t^n-y^n)
    • (1) \epsilon: learning rate
    • (2) \frac{\partial E}{\partial y^n}= -(t^n-y^n)
      • where E=\frac{1}{2}\sum_{n\in\textup{training set}}(t^n-y^n)^2: squared residual summed over all training cases
    • (3) \frac{\partial y^n}{\partial w_i} = x_i^n, because y^n=\sum_{i} w_i x_i^n for a linear neuron; x_i^n is the i-th input of training case n
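The batch rule can be sketched the same way: the contributions of all training cases are summed before the weights change. The toy data, learning rate, and step count below are assumptions for illustration only.

```python
def squared_error(w, cases):
    """E = 1/2 * sum over training cases of (t - y)^2 for a linear neuron."""
    return 0.5 * sum(
        (t - sum(wi * xi for wi, xi in zip(w, x))) ** 2 for x, t in cases
    )


def batch_delta_rule(cases, n_inputs, epsilon=0.05, steps=500):
    """Update the weights once per pass over the whole training set."""
    w = [0.0] * n_inputs
    for _ in range(steps):
        delta = [0.0] * n_inputs
        for x, t in cases:
            y = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n_inputs):
                delta[i] += epsilon * x[i] * (t - y)   # sum over n of eps * x_i^n * (t^n - y^n)
        w = [wi + di for wi, di in zip(w, delta)]
    return w


# Toy targets generated by the weights (2, -1); this data is an assumption.
batch_cases = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
error_before = squared_error([0.0, 0.0], batch_cases)
w_batch = batch_delta_rule(batch_cases, n_inputs=2)
error_after = squared_error(w_batch, batch_cases)
```

Comparing `error_before` with `error_after` shows the squared error shrinking as the weights approach (2, -1).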
3-b) The Error surface for a Linear Neuron
  • Vertical cross-sections are parabolas.
  • Horizontal cross-sections are ellipses.
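Both shapes follow from the error being quadratic in the weights. As a sketch of the reasoning, using the squared error E of a linear neuron and the shorthand A, \mathbf{b}, c (these three symbols are introduced here only for illustration):

```latex
E(\mathbf{w})
  = \frac{1}{2}\sum_{n}\left(t^n - \mathbf{w}^\top \mathbf{x}^n\right)^2
  = \frac{1}{2}\,\mathbf{w}^\top A\,\mathbf{w} - \mathbf{b}^\top \mathbf{w} + c,
\qquad
A = \sum_{n} \mathbf{x}^n (\mathbf{x}^n)^\top,\quad
\mathbf{b} = \sum_{n} t^n \mathbf{x}^n,\quad
c = \frac{1}{2}\sum_{n} (t^n)^2 .
```

Restricting \mathbf{w} to a line gives a quadratic in one variable (a parabola), and setting E to a constant gives an ellipse whenever A is positive definite.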
Batch vs. Online Learning
  • Batch learning
    • [+]: Follows the steepest-descent direction toward the minimum of the error surface.
    • [-]: Computationally heavy, because every step needs all training examples to compute the weight update.
  • Online learning
    • [+]: Computationally light, because every step needs only one training example to compute the weight update.
    • [-]: Takes a zig-zag route toward the minimum.
3-c) Learning the Weights of a Logistic Output Neuron (Learning the Weight of a Nonlinear Neuron)
  • Goal: extend the learning rule for a linear neuron to a rule for a nonlinear neuron.
  • A logistic output neuron serves as the example of a nonlinear neuron. The logistic neuron in this lesson could be replaced by another nonlinear neuron whose output is differentiable.
The Derivatives for Extension
  • \frac{\partial z}{\partial w_i}=x_i: the derivative of the logit z with respect to the weight w_i
  • \frac{\partial z}{\partial x_i}=w_i: the derivative of the logit z with respect to the input x_i
  • \frac{\partial y}{\partial z}=y(1-y): the derivative of the output y with respect to the logit z
  • \frac{\partial y^n}{\partial w_i} = \frac{\partial z^n}{\partial w_i} \frac{\partial y^n}{\partial z^n} = x_i^n y^n(1-y^n), where the superscript n indexes the training case
The Extension of the Delta Rule to a Nonlinear Neuron
  • w_i \leftarrow w_i + \Delta w_i
  • \Delta w_i=-\epsilon \frac{\partial E}{\partial w_i}=-\epsilon \sum_{n}\frac{\partial y^n}{\partial w_i}\frac{\partial E}{\partial y^n}\\=-\epsilon \sum_{n} x_i^n y^n(1-y^n)[-(t^n-y^n)]\\= \sum_{n}\epsilon x_i^n(t^n-y^n)y^n(1-y^n)

Compared with the delta rule for a linear neuron, the extra term y^n(1-y^n) is the slope of the logistic function.
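As a rough sketch, the delta rule for a logistic neuron differs from the linear one only by the slope factor y(1-y). The logical-OR toy data, the learning rate, and the step count below are assumptions for illustration.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(w, inputs):
    x = [1.0] + list(inputs)                       # constant 1 carries the bias
    return logistic(sum(wi * xi for wi, xi in zip(w, x)))


def train_logistic_neuron(cases, n_inputs, epsilon=0.5, steps=2000):
    """Batch delta rule with squared error and a logistic output."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(steps):
        delta = [0.0] * len(w)
        for inputs, t in cases:
            x = [1.0] + list(inputs)
            y = neuron_output(w, inputs)
            slope = y * (1.0 - y)                  # dy/dz for the logistic function
            for i in range(len(w)):
                delta[i] += epsilon * x[i] * (t - y) * slope
        w = [wi + di for wi, di in zip(w, delta)]
    return w


# Toy data (an assumption): logical OR with targets in {0, 1}.
or_cases = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 1.0)]
w_or = train_logistic_neuron(or_cases, n_inputs=2)
or_outputs = [neuron_output(w_or, inputs) for inputs, _ in or_cases]
```

The outputs stay strictly between 0 and 1, so they are compared with the targets after thresholding at 0.5.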

3-d) The Backpropagation Algorithm
3-e) How to Use the Derivatives Computed by the Backpropagation Algorithm
Optimization Issue 1: How Often to Update the Weights
  • Online learning: after each training case.
  • Full batch learning: after all training cases.
  • Mini-batch learning: after a small number of training cases.
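The three schedules differ only in how many training cases are processed between weight updates. A small sketch, where the helper name `minibatches` and the ten placeholder cases are assumptions for illustration:

```python
def minibatches(cases, batch_size):
    """Yield consecutive chunks of `cases`; the weights would be updated once per chunk."""
    for start in range(0, len(cases), batch_size):
        yield cases[start:start + batch_size]


cases = list(range(10))   # stand-ins for 10 training cases

online_updates = len(list(minibatches(cases, 1)))               # one update per case
full_batch_updates = len(list(minibatches(cases, len(cases))))  # one update per pass
mini_batch_updates = len(list(minibatches(cases, 4)))           # chunks of 4, 4, and 2
```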
Optimization Issue 2: How Much to Update the Weights (Lecture 6)
  • Use a fixed learning rate?
  • Training error: very low ⇒ the model fits the training data very closely
  • Test error: large ⇒ the model is overfitted
Ways to Reduce Overfitting (Lecture 7)
  • Weight-decay
  • Weight-sharing
  • Early stopping
  • Model averaging
  • Bayesian fitting of neural nets
  • Dropout
  • Generative pre-training


Week 4) Learning Feature Vectors for Words

Week 10) Combining Multiple Neural Networks to Improve Generalization

10-a) Why It Helps to Combine Models
10-b) Mixtures of Experts
10-c) The Idea of Full Bayesian Learning
10-d) Making Full Bayesian Learning Practical
10-e) Dropout: An Efficient Way to Combine Neural Nets
Two ways to average models
  • Mixture: to combine models by arithmetically averaging output probabilities
  • Product: to combine models by geometrically averaging output probabilities
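A small sketch of the two averaging schemes for models that each output a probability distribution over the same classes. The function names and the two example distributions are assumptions. The geometric means are renormalized here because they generally sum to less than one.

```python
import math


def mixture(dists):
    """Arithmetic mean of the output probabilities, class by class."""
    k = len(dists)
    return [sum(d[c] for d in dists) / k for c in range(len(dists[0]))]


def product(dists):
    """Geometric mean of the output probabilities, renormalized to sum to 1."""
    k = len(dists)
    geo = [math.prod(d[c] for d in dists) ** (1.0 / k) for c in range(len(dists[0]))]
    total = sum(geo)
    return [g / total for g in geo]


model_a = [0.9, 0.1]   # hypothetical outputs of two models for two classes
model_b = [0.5, 0.5]
mixed = mixture([model_a, model_b])
multiplied = product([model_a, model_b])
```

In this example the product gives the confident model more influence than the mixture does.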
Dropout: an efficient way to average many large neural nets
  • Consider a neural net with one hidden layer.
Dropout: at training
  • Let H be the number of units in the hidden layer.
  • Each time we are given a training example,
    we randomly omit each hidden unit with probability p = 0.5.
  • Randomly omitting hidden units is equivalent to randomly sampling from 2^H different architectures.
    • All architectures share weights.
    • 2^H is generally a huge number, so only a few of the models ever get trained, and each of those is trained on only one example.
  • Dropout is a much better regularizer than L2 or L1 penalties.
    • L2 or L1 penalties pull the weights towards zero.
    • Dropout does not pull the weights towards zero. Instead, the shared weights pull each model towards what the other models want.
Dropout: at testing
  • Use all of the hidden units, but halve their outgoing weights.
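The training-time and test-time procedures can be sketched for a net with one hidden layer. To keep the check exact, this example assumes a single linear output unit, so averaging over all 2^H architectures is an arithmetic mean. All names, weights, and inputs below are made up for illustration.

```python
import itertools
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_activities(x, w_in):
    """Logistic hidden units; w_in[j] holds the incoming weights of hidden unit j."""
    return [logistic(sum(w * xi for w, xi in zip(row, x))) for row in w_in]


def sample_mask(n_hidden, p_drop=0.5, rng=random):
    """Training: omit each hidden unit independently with probability p_drop."""
    return [0 if rng.random() < p_drop else 1 for _ in range(n_hidden)]


def output_with_mask(h, w_out, mask):
    """Training-time output: omitted hidden units contribute nothing."""
    return sum(m * w * hj for m, w, hj in zip(mask, w_out, h))


def output_at_test(h, w_out):
    """Test-time output: keep every hidden unit but halve its outgoing weight."""
    return sum(0.5 * w * hj for w, hj in zip(w_out, h))


x = [1.0, -2.0]
w_in = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]   # H = 3 hidden units
w_out = [1.0, -2.0, 0.5]
h = hidden_activities(x, w_in)

one_mask = sample_mask(len(h), rng=random.Random(0))    # one sampled architecture
all_masks = list(itertools.product([0, 1], repeat=len(h)))   # all 2^H architectures
mean_over_architectures = sum(
    output_with_mask(h, w_out, m) for m in all_masks
) / len(all_masks)
halved_output = output_at_test(h, w_out)
```

For this linear output, `halved_output` equals the mean over all eight architectures up to floating-point rounding.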
Dropout: the input layer
  • It helps to use dropout in the input layer
  • However, you should apply dropout with a higher probability of keeping an input unit.
How well does dropout work?
  • If your deep neural net is overfitting, then dropout will usually reduce errors.
  • If your deep neural net uses ‘early stopping’, then dropout will usually reduce errors.
  • If your deep neural net is NOT overfitting, then you should be using a bigger net.
  • (This seems to be an important concept. But I don’t understand it.)

Week 12) Restricted Boltzmann machines (RBMs)

12-a) Boltzmann machine learning
  • Topic: the learning algorithm of a Boltzmann machine
    • In practice, the algorithm turned out to be extremely slow and noisy, so it was not practical.
  • The Boltzmann machine learning algorithm is an unsupervised learning algorithm.
  • What the algorithm is trying to do is build a model of a set of input vectors.
  • The goal of learning
    • To maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set.
    • = To maximize the sum of the log probabilities that the Boltzmann machine assigns to the binary vectors in the training set.
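The two goals are equivalent because the logarithm is strictly increasing, so it does not change which parameters maximize the objective. Writing \theta for the weights and biases and \mathbf{v} for a training vector (both symbols are introduced here for illustration):

```latex
\theta^{*}
  = \arg\max_{\theta} \prod_{\mathbf{v} \in \textup{training set}} p(\mathbf{v} \mid \theta)
  = \arg\max_{\theta} \sum_{\mathbf{v} \in \textup{training set}} \log p(\mathbf{v} \mid \theta)
```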
12-c) Restricted Boltzmann Machines
  • A restricted Boltzmann machine is a Boltzmann machine that
    • has only one layer of hidden units, and
    • has no connections between hidden units.
  • An RBM looks like a bipartite graph.
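One consequence of the bipartite structure is that, given a visible vector, each hidden unit can be sampled independently of the others. The sketch below assumes binary units with logistic activation probabilities. The names W, b_hid, and the toy numbers are assumptions, not notation from these notes.

```python
import math
import random


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_probs(v, W, b_hid):
    """p(h_j = 1 | v) for every hidden unit j; W[i][j] links visible i to hidden j."""
    return [
        logistic(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(b_hid))
    ]


def sample_hidden(v, W, b_hid, rng=random):
    """No hidden-to-hidden connections, so each unit is sampled on its own."""
    return [1 if rng.random() < p else 0 for p in hidden_probs(v, W, b_hid)]


v = [1, 0, 1]                                  # a binary visible vector
W = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]     # 3 visible units, 2 hidden units
b_hid = [0.0, 0.1]
probs = hidden_probs(v, W, b_hid)
h_sample = sample_hidden(v, W, b_hid, rng=random.Random(0))
```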
