Neural Network for Machine Learning | by Geoffrey Hinton | Coursera

Brief Information

Course name: Neural Network for Machine Learning
Lecturer: Geoffrey Hinton
Duration:
Syllabus
Record
Certificate
Learning outcome
About this course
- Learn about artificial neural networks and how they’re being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We’ll emphasize both the basic algorithms and the practical tricks needed to get them to work well.
- [Sungjae’s opinion] This course is not appropriate for students who have not studied machine learning at all. Quite a few machine learning terms appear in this course without explanation. It is recommended for anyone who has already studied machine learning and wants to deepen their understanding of artificial neural networks.

Brief Summary

Week 1) Introduction
- 1-a) Why do we need machine learning?
- 1-b) What are neural networks?
- 1-c) Some simple models of neurons
- 1-d) A simple example of learning
- 1-e) Three types of learning
- Quiz 1
Week 2) The perceptron learning procedure
- 2-a) An overview of the main types of neural network architecture
- 2-b) Perceptrons: the first generation of neural networks
- 2-c) A geometrical view of perceptrons
- 2-d) Why the learning works
- 2-e) What perceptrons can’t do
- Quiz 2
Week 3) The backpropagation learning procedure
- Reading: D. E. Rumelhart, G. E. Hinton & R. J. Williams (1985) Learning internal representations by error propagation. Parallel distributed processing: explorations in the microstructure of cognition, vol. 1. MIT Press, Cambridge, MA. pp. 318-362.
- 3-a) Learning the weights of a linear neuron
- 3-b) The error surface for a linear neuron
- 3-c) Learning the weights of a logistic output neuron
- 3-d) The backpropagation algorithm
- 3-e) How to use the derivatives computed by the backpropagation algorithm
- Quiz 3
- Programming Assignment 1: The perceptron learning algorithm
Week 4) Learning feature vectors for words
- Reading: Y. Bengio, R. Ducharme, P. Vincent & C. Jauvin (2003) A Neural Probabilistic Language Model. Journal of Machine Learning Research vol. 3. pp. 1137-1155
- 4-a) Learning to predict the next word
- 4-b) A brief diversion into cognitive science
- 4-c) Another diversion: The softmax output function
- 4-d) Neuro-probabilistic language models
- 4-e) Ways to deal with the large number of possible outputs
- Quiz 4
Week 5) Object recognition with neural nets
- Reading: Y. LeCun, L. Bottou, Y. Bengio & P. Haffner (Nov. 1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11). pp. 2278-2324
- Reading: Y. LeCun & Y. Bengio (1998) Convolutional Networks for Images, Speech, and Time Series. The handbook of brain theory and neural networks. MIT Press Cambridge. pp. 255-258.
- 5-a) Why object recognition is difficult
- 5-b) Achieving viewpoint invariance
- 5-c) Convolutional nets for digit recognition
- 5-d) Convolutional nets for object recognition
- Quiz 5
- Programming Assignment 2: Learning Word Representations
  - Complete the Octave/Matlab code for Bengio’s neural probabilistic language model, which appeared in the following paper.
  - Y. Bengio, R. Ducharme, P. Vincent & C. Jauvin (2003) A Neural Probabilistic Language Model. Journal of Machine Learning Research vol. 3. pp. 1137-1155
Week 6) Optimization: how to make the learning go faster
- 6-a) Overview of mini-batch gradient descent
- 6-b) A bag of tricks for mini-batch gradient descent
- 6-c) The momentum method
- 6-d) Adaptive learning rates for each connection
- 6-e) Rmsprop: Divide the gradient by a running average of its recent magnitude
- Quiz 6
Week 7) Recurrent neural networks
- 7-a) Modeling sequences: A brief overview
- 7-b) Training RNNs with back propagation
- 7-c) A toy example of training an RNN
- 7-d) Why it is difficult to train an RNN
- 7-e) Long-term Short-term-memory
- Quiz 7
Week 8) More recurrent neural networks
- 8-a) Modeling character strings with multiplicative connections
- 8-b) Learning to predict the next character using HF
- 8-c) Echo State Networks
- Quiz 8
Week 9) Ways to make neural networks generalize better
- 9-a) Overview of ways to improve generalization
- 9-b) Limiting the size of the weights
- 9-c) Using noise as a regularizer
- 9-d) Introduction to the full Bayesian approach
- 9-e) The Bayesian interpretation of weight decay
- 9-f) MacKay’s quick and dirty method of setting weight costs
- Quiz 9
- Programming assignment 3: Optimization and generalization
Week 10) Combining multiple neural networks to improve generalization
- 10-a) Why it helps to combine models
- 10-b) Mixtures of Experts
- 10-c) The idea of full Bayesian learning
- 10-d) Making full Bayesian learning practical
- 10-e) Dropout
- Quiz 10
Week 11) Hopfield nets and Boltzmann machines
- 11-a) Hopfield Nets
- 11-b) Dealing with spurious minima
- 11-c) Hopfield nets with hidden units
- 11-d) Using stochastic units to improve search
- 11-e) How a Boltzmann machine models data
- Quiz 11
Week 12) Restricted Boltzmann machines (RBMs)
- 12-a) Boltzmann machine learning
- 12-b) OPTIONAL VIDEO: More efficient ways to get the statistics
- 12-c) Restricted Boltzmann Machines
- 12-d) An example of RBM learning
- 12-e) RBMs for collaborative filtering
- Quiz 12
Week 13) Stacking RBMs to make Deep Belief Nets
- Video: The ups and downs of back propagation
- Video: Belief Nets
- Video: The wake-sleep algorithm
- Programming Assignment 4: Restricted Boltzmann Machines
- Quiz 13
Week 14) Deep neural nets with generative pre-training
- Video: Learning layers of features by stacking RBMs
- Video: Discriminative learning for DBNs
- Video: What happens during discriminative fine-tuning?
- Video: Modeling real-valued data with an RBM
- Video: OPTIONAL VIDEO: RBMs are infinite sigmoid belief nets
- Quiz 14
Week 15) Modeling hierarchical structure with neural nets
- Video: From PCA to autoencoders
- Video: Deep auto encoders
- Video: Deep auto encoders for document retrieval
- Video: Semantic Hashing
- Video: Learning binary codes for image retrieval
- Video: Shallow autoencoders for pre-training
- Quiz 15
- Final Exam
Week 16) Recent applications of deep neural nets
- Video: OPTIONAL: Learning a joint model of images and captions
- Video: OPTIONAL: Hierarchical Coordinate Frames
- Video: OPTIONAL: Bayesian optimization of hyper-parameters

Week 2) The Perceptron Learning Procedure

2-a) An Overview of the Main Types of Neural Network Architecture

Feed-forward neural networks
Recurrent neural networks
- RNNs are hard to train.
- Can remember information in their hidden state for a long time.
- Can be used for modeling sequences.
- Can predict the next character in a sequence.
Symmetrically connected networks (= Hopfield networks)
- Symmetrical connections have the same weight in both directions.
- Symmetric networks are much easier to analyze than recurrent networks. (‘analyze’?)
Symmetrically connected networks with hidden units (= Boltzmann machines)

2-b) Perceptrons: the First Generation of Neural Networks

Binary Threshold Neurons

$latex \mathbf{x}=(1,x_1,\cdots,x_n)$
$latex \mathbf{w}=(b,w_1,\cdots,w_n)$
$latex z=\mathbf{w}\cdot\mathbf{x}$
$latex g(z)=1$ if $latex z\geq 0$. $latex g(z)=0$ otherwise.

Perceptron Learning Algorithm (How to Train Binary Threshold Neurons)

If $latex g(z)=y(\mathbf{x})$, do nothing.
If $latex g(z)\neq y(\mathbf{x})=1$, increase $latex z$ by $latex \mathbf{w}\leftarrow \mathbf{w}+\mathbf{x}$. Then, $latex z\leftarrow z + \left \| \mathbf{x} \right \|^2$.
If $latex g(z)\neq y(\mathbf{x})=0$, decrease $latex z$ by $latex \mathbf{w}\leftarrow \mathbf{w}-\mathbf{x}$. Then, $latex z\leftarrow z - \left \| \mathbf{x} \right \|^2$.
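
The three cases above can be sketched in a few lines of Python. This is only an illustration, not code from the course: the function names, the zero initialization, the `max_epochs` cap, and the AND-gate data are all assumptions made for the example.

```python
import numpy as np

def threshold(z):
    """Binary threshold neuron: g(z) = 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

def train_perceptron(X, y, max_epochs=100):
    """X: (N, n) inputs without the bias component; y: (N,) targets in {0, 1}."""
    # Prepend the constant 1 so that w[0] plays the role of the bias b.
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X.shape[1])  # assumed initialization
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, y):
            if threshold(w @ x) == target:
                continue          # correct output: do nothing
            if target == 1:
                w = w + x         # output was 0, target is 1: raises z by ||x||^2
            else:
                w = w - x         # output was 1, target is 0: lowers z by ||x||^2
            mistakes += 1
        if mistakes == 0:         # a full pass without errors: stop
            break
    return w

# Hypothetical toy data: the logical AND of two binary inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([0, 0, 0, 1])
w = train_perceptron(X, y_and)
predictions = [threshold(w @ np.concatenate(([1.0], x))) for x in X]
```

AND is linearly separable, so the loop stops after a pass with no mistakes. On data that is not linearly separable the loop would simply run until `max_epochs`.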

2-c) A Geometrical View of Perceptrons

2-d) Why the Learning Works

2-e) What Perceptrons Can’t Do

Week 3) The Backpropagation Learning Procedure

3-a) Learning the Weights of a Linear Neuron

There are two different methods for finding the solution: the iterative method and the analytic method.
The iterative method is much easier to generalize than the analytic method; that is, it can be applied to many more kinds of problems.

The Iterative Learning Method

If two input dimensions are highly correlated, then the iterative method can be very slow.

The Online Delta Rule

$latex w_i \leftarrow w_i + \Delta w_i$
- $latex \Delta w_i = \epsilon x_i (t-y)$
- $latex \epsilon >0$

The Batch Delta Rule

$latex w_i \leftarrow w_i + \Delta w_i$
$latex \Delta w_i = -\epsilon \frac{\partial E}{\partial w_i}=-\epsilon \sum_{n}\frac{\partial y^n}{\partial w_i}\frac{\partial E}{\partial y^n}=\sum_{n}\epsilon x_i^n(t^n-y^n)$
- (1) $latex \epsilon$: learning rate
- (2) $latex \frac{\partial E}{\partial y^n}= -(t^n-y^n)$
  - where $latex E=\frac{1}{2}\sum_{n\in\textup{training set}}(t^n-y^n)^2$: squared residual summed over all training cases
- (3) $latex \frac{\partial y^n}{\partial w_i} = \frac{\partial z^n}{\partial w_i} \frac{\partial y^n}{\partial z^n} = x_i^n$, because $latex y^n = z^n$ for a linear neuron
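
Both delta rules can be tried out on a small synthetic problem. The sketch below is a minimal illustration under assumed settings: the data, the learning rates, the iteration counts, and the function names are hypothetical and were chosen only so that both loops converge on noise-free targets.

```python
import numpy as np

def online_delta_step(w, x, t, eps):
    """Online delta rule: update after one training case (x, t)."""
    y = w @ x                       # output of the linear neuron
    return w + eps * x * (t - y)

def batch_delta_step(w, X, t, eps):
    """Batch delta rule: sum eps * x_i^n * (t^n - y^n) over all cases n."""
    y = X @ w
    return w + eps * X.T @ (t - y)

# Hypothetical noise-free data generated by a known linear neuron.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
t = X @ true_w

w_batch = np.zeros(3)
for _ in range(500):
    w_batch = batch_delta_step(w_batch, X, t, eps=0.005)

w_online = np.zeros(3)
for _ in range(200):                # 200 passes over the training set
    for x_n, t_n in zip(X, t):
        w_online = online_delta_step(w_online, x_n, t_n, eps=0.05)
```

With noise-free targets both loops recover `true_w`. The batch step needs a smaller learning rate here because it sums the contributions of all 50 cases before moving.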

3-b) The Error Surface for a Linear Neuron

Cross-sections

Vertical cross-sections are parabolas.
Horizontal cross-sections are ellipses.

Batch vs. Online Learning

Batch learning
- [+]: Takes a direct route to the minimum of the error surface.
- [-]: Computationally heavy, because every step requires all training examples to compute the weight update.
Online learning
- [+]: Computationally light, because every step requires only one training example to compute the weight update.
- [-]: Takes a zig-zag route to the minimum of the error surface.

3-c) Learning the Weights of a Logistic Output Neuron (Learning the Weight of a Nonlinear Neuron)

The goal is to extend the learning rule for a linear neuron to a rule for a nonlinear neuron.
A logistic output neuron is used as an example of a nonlinear neuron. In this lesson, the logistic output neuron can be replaced by another nonlinear neuron.

The Derivatives for Extension

$latex \frac{\partial z}{\partial w_i}=x_i$: the derivative of the logit $latex z$ with respect to the weight $latex w_i$
$latex \frac{\partial z}{\partial x_i}=w_i$: the derivative of the logit $latex z$ with respect to the input $latex x_i$
$latex \frac{\partial y}{\partial z}=y(1-y)$: the derivative of the output $latex y$ with respect to the logit $latex z$
⇒ $latex \frac{\partial y^n}{\partial w_i} = \frac{\partial z^n}{\partial w_i} \frac{\partial y^n}{\partial z^n} = x_i^n y^n(1-y^n)$ for training case $latex n$

The Extension of the Delta Rule to a Nonlinear Neuron

$latex w_i \leftarrow w_i + \Delta w_i$
$latex \Delta w_i=-\epsilon \frac{\partial E}{\partial w_i}=-\epsilon \sum_{n}\frac{\partial y^n}{\partial w_i}\frac{\partial E}{\partial y^n}\\=-\epsilon \sum_{n} x_i^n y^n(1-y^n)[-(t^n-y^n)]\\= \sum_{n}\epsilon x_i^n(t^n-y^n)y^n(1-y^n)$

Compared with the delta rule for a linear neuron, $latex y^n(1-y^n)$ is an extra term, called the slope of the logistic.
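
One way to convince yourself that the extra slope term is right is to compare the formula with a numerical derivative of the squared error. The sketch below does that on random data. The data, the step size `h`, and the function names are assumptions made for the check, not part of the lecture.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(w, X, t):
    """E = 1/2 * sum_n (t^n - y^n)^2 for a logistic output neuron."""
    y = logistic(X @ w)
    return 0.5 * np.sum((t - y) ** 2)

def delta_w(w, X, t, eps):
    """Delta rule with the slope term: sum_n eps * x_i^n (t^n - y^n) y^n (1 - y^n)."""
    y = logistic(X @ w)
    return eps * X.T @ ((t - y) * y * (1.0 - y))

# Hypothetical random data, used only to check the derivative.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
t = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)
eps = 0.1

# Central finite differences approximate dE/dw_i one weight at a time.
h = 1e-6
numeric_grad = np.array([
    (squared_error(w + h * e, X, t) - squared_error(w - h * e, X, t)) / (2 * h)
    for e in np.eye(3)
])
analytic_step = delta_w(w, X, t, eps)
```

If the derivation is correct, `analytic_step` should match `-eps * numeric_grad` up to the finite-difference error.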

3-d) The Backpropagation Algorithm

3-e) How to Use the Derivatives Computed by the Backpropagation Algorithm

Optimization Issue 1: How Often to Update the Weights

Online learning: after each training case.
Full batch learning: after all training cases.
Mini-batch learning: after a small number of training cases.
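
The three schedules differ only in how many training cases are used per update. The sketch below makes that explicit with a single hypothetical `batch_size` parameter, reusing the delta rule for a linear neuron. The data and the learning rate are assumptions for illustration.

```python
import numpy as np

def run_epoch(w, X, t, eps, batch_size):
    """One pass over the data, updating the weights after every `batch_size` cases."""
    n_updates = 0
    for start in range(0, len(X), batch_size):
        Xb = X[start:start + batch_size]
        tb = t[start:start + batch_size]
        w = w + eps * Xb.T @ (tb - Xb @ w)   # delta rule summed over the batch
        n_updates += 1
    return w, n_updates

# Hypothetical data: 12 training cases with 3 inputs each.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
t = X @ np.array([1.0, -2.0, 0.5])
w0 = np.zeros(3)

_, online_updates = run_epoch(w0, X, t, eps=0.01, batch_size=1)          # online
_, minibatch_updates = run_epoch(w0, X, t, eps=0.01, batch_size=4)       # mini-batch
_, fullbatch_updates = run_epoch(w0, X, t, eps=0.01, batch_size=len(X))  # full batch
```

With 12 training cases, one epoch gives 12 updates for online learning, 3 for mini-batches of 4, and 1 for full batch learning.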

Optimization Issue 2: How Much to Update the Weights (Lecture 6)

Use a fixed learning rate?

Overfitting

Training error: very low ⇒ very fitted
Test error: large ⇒ overly fitted

Ways to Reduce Overfitting (Lecture 7)

Weight-decay
Weight-sharing
Early stopping
Model averaging
Bayesian fitting of neural nets
Dropout
Generative pre-training

Week 4) Learning Feature Vectors for Words

Week 10) Combining Multiple Neural Networks to Improve Generalization

10-a) Why It Helps to Combine Models

10-b) Mixtures of Experts

10-c) The Idea of Full Bayesian Learning

10-d) Making Full Bayesian Learning Practical

10-e) Dropout: An Efficient Way to Combine Neural Nets

Two ways to average models

Mixture: to combine models by arithmetically averaging output probabilities
Product: to combine models by geometrically averaging output probabilities
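
A small numeric example may make the difference concrete. The sketch below averages the output distributions of two hypothetical models both ways. The function names and the numbers are made up for illustration. Note that a geometric mean of distributions generally sums to less than 1, so it is renormalized here.

```python
import numpy as np

def mixture(probs):
    """Arithmetic mean of the models' output probabilities."""
    return np.mean(probs, axis=0)

def product(probs):
    """Geometric mean of the models' output probabilities, renormalized to sum to 1."""
    geo = np.exp(np.mean(np.log(probs), axis=0))
    return geo / geo.sum()

# Hypothetical outputs of two models over two classes.
probs = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
mix = mixture(probs)    # [0.7, 0.3]
prod = product(probs)   # [0.75, 0.25]
```

Here the product leans further towards the confident model than the mixture does.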

Dropout: an efficient way to average many large neural nets

Consider a neural net with one hidden layer.

Dropout: at training

Let $latex H$ be the number of units in the hidden layer.
Each time we are given a training example,
we randomly omit each hidden unit with probability $latex p$, e.g., 0.5.
Randomly omitting hidden units is equivalent to randomly sampling from $latex 2^H$ different architectures.
- All architectures share weights.
- $latex 2^H$ is generally a huge number, so only a few of the models ever get trained, and almost all of those are trained on only one example.
Dropout is a much better regularizer than L2 or L1 penalties.
- L2 or L1 penalties pull the weights towards zero.
- However, dropout does not pull the weights towards zero.

Dropout: at testing

Use the full architecture with all of the hidden units, but halve their outgoing weights (for $latex p=0.5$).
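
The training-time and test-time procedures can be compared on a tiny net. The sketch below assumes a single hidden layer of logistic units, a softmax output, no biases, and $latex p=0.5$. All names, sizes, and random weights are hypothetical. With so few hidden units it is possible to enumerate all $latex 2^H$ dropped-out architectures and check that halving the outgoing weights gives the same prediction as the normalized geometric mean of all of them.

```python
import itertools
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sample_mask(rng, H, p=0.5):
    """Training: each hidden unit is omitted independently with probability p."""
    return (rng.random(H) >= p).astype(float)

def logits_train(x, W1, W2, mask):
    """Forward pass of one sampled architecture (omitted units output 0)."""
    h = logistic(W1 @ x) * mask
    return W2 @ h

def logits_test(x, W1, W2, p=0.5):
    """Testing: keep all hidden units, but scale their outgoing weights by 1 - p."""
    h = logistic(W1 @ x)
    return ((1.0 - p) * W2) @ h

# Hypothetical tiny net: 4 inputs, H = 3 hidden units, 2 output classes.
rng = np.random.default_rng(0)
H = 3
x = rng.normal(size=4)
W1 = rng.normal(size=(H, 4))
W2 = rng.normal(size=(2, H))

# What one training case sees: a single randomly sampled architecture.
one_training_pass = softmax(logits_train(x, W1, W2, sample_mask(rng, H)))

# Enumerate all 2^H architectures and take the normalized geometric mean.
all_masks = [np.array(m, dtype=float) for m in itertools.product([0, 1], repeat=H)]
probs = np.array([softmax(logits_train(x, W1, W2, m)) for m in all_masks])
geo = np.exp(np.mean(np.log(probs), axis=0))
geo_mean = geo / geo.sum()

test_probs = softmax(logits_test(x, W1, W2))
```

The agreement is exact for this one-hidden-layer softmax setting. The sketch makes no claim about deeper nets.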

Dropout: the input layer

It helps to use dropout in the input layer
However, you should apply dropout with a higher probability of keeping an input unit.

How well does dropout work?

If your deep neural net is overfitting, then dropout will usually reduce errors.
If your deep neural net uses ‘early stopping’, then dropout will usually reduce errors.
If your deep neural net is NOT overfitting, then you should be using a bigger net.

Co-adaptations

(This seems to be an important concept. But I don’t understand it.)

Week 12) Restricted Boltzmann machines (RBMs)

12-a) Boltzmann machine learning

Topic: the learning algorithm of a Boltzmann machine
- In practice, the algorithm turned out to be extremely slow and noisy, and was not practical.
The Boltzmann machine learning algorithm is an unsupervised learning algorithm.
What the algorithm is trying to do is build a model of a set of input vectors.
The goal of learning
- To maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set.
- = To maximize the sum of the log probabilities that the Boltzmann machine assigns to the binary vectors in the training set.
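
The two goals are equivalent because the logarithm is monotonically increasing, so both are maximized by the same weights. Written out, with $latex W$ standing for the weights and $latex \mathbf{v}$ for a training vector (both symbols are assumed here for illustration):

$latex \arg\max_{W} \prod_{\mathbf{v}\in\textup{training set}} p(\mathbf{v}\mid W) = \arg\max_{W} \sum_{\mathbf{v}\in\textup{training set}} \log p(\mathbf{v}\mid W)$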

12-c) Restricted Boltzmann Machines

A restricted Boltzmann machine is a Boltzmann machine that
- has only one layer of hidden units,
- has no connections between hidden units.
An RBM looks like a bipartite graph.
