Brief Information
 Course name : Neural Networks for Machine Learning
 Lecturer : Geoffrey Hinton
 Duration:
 Syllabus
 Record
 Certificate
 Learning outcome
 About this course
 Learn about artificial neural networks and how they’re being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We’ll emphasize both the basic algorithms and the practical tricks needed to get them to work well.
 [Sungjae’s opinion] This course is not appropriate for students who have not studied machine learning at all: many machine-learning terms appear without explanation. It is recommended for anyone who has already studied machine learning and wants to deepen their understanding of artificial neural networks.
Brief Summary
 Week 1) Introduction
 1a) Why do we need machine learning?
 1b) What are neural networks?
 1c) Some simple models of neurons
 1d) A simple example of learning
 1e) Three types of learning
 Quiz 1
 Week 2) The perceptron learning procedure
 2a) An overview of the main types of neural network architecture
 2b) Perceptrons: the first generation of neural networks
 2c) A geometrical view of perceptrons
 2d) Why the learning works
 2e) What perceptrons can’t do
 Quiz 2
 Week 3) The backpropagation learning procedure
 Reading: D. E. Rumelhart, G. E. Hinton & R. J. Williams (1985) Learning internal representations by error propagation. Parallel distributed processing: explorations in the microstructure of cognition, vol. 1. MIT Press, Cambridge, MA. pp. 318–362. [LINK]
 3a) Learning the weights of a linear neuron
 3b) The error surface for a linear neuron
 3c) Learning the weights of a logistic output neuron
 3d) The backpropagation algorithm
 3e) How to use the derivatives computed by the backpropagation algorithm
 Quiz 3
 Programming Assignment 1: The perceptron learning algorithm
 Week 4) Learning feature vectors for words
 Reading: Y. Bengio, R. Ducharme, P. Vincent & C. Janvin (2003) A Neural Probabilistic Language Model. The Journal of Machine Learning Research vol. 3. pp.1137–1155
 4a) Learning to predict the next word
 4b) A brief diversion into cognitive science
 4c) Another diversion: The softmax output function
 4d) Neural probabilistic language models
 4e) Ways to deal with the large number of possible outputs
 Quiz 4
 Week 5) Object recognition with neural nets
 Reading: Y. LeCun, L. Bottou, Y. Bengio & P. Haffner (Nov. 1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11). pp. 2278–2324
 Reading: Y. LeCun & Y. Bengio (1998) Convolutional Networks for Images, Speech, and Time-Series. The handbook of brain theory and neural networks. MIT Press, Cambridge. pp. 255–258.
 5a) Why object recognition is difficult
 5b) Achieving viewpoint invariance
 5c) Convolutional nets for digit recognition
 5d) Convolutional nets for object recognition
 Quiz 5
 Programming Assignment 2: Learning Word Representations
 Complete the Octave/MATLAB code for Bengio’s neural probabilistic language model, which appeared in:
 Y. Bengio, R. Ducharme, P. Vincent & C. Jauvin (2003) A Neural Probabilistic Language Model. Journal of Machine Learning Research vol. 3. pp. 1137–1155
 Week 6) Optimization: how to make the learning go faster
 6a) Overview of minibatch gradient descent
 6b) A bag of tricks for minibatch gradient descent
 6c) The momentum method
 6d) Adaptive learning rates for each connection
 6e) Rmsprop: Divide the gradient by a running average of its recent magnitude
 Quiz 6
 Week 7) Recurrent neural networks
 7a) Modeling sequences: A brief overview
 7b) Training RNNs with backpropagation
 7c) A toy example of training an RNN
 7d) Why it is difficult to train an RNN
 7e) Long Short-Term Memory
 Quiz 7
 Week 8) More recurrent neural networks
 8a) Modeling character strings with multiplicative connections
 8b) Learning to predict the next character using HF
 8c) Echo State Networks
 Quiz 8
 Week 9) Ways to make neural networks generalize better
 9a) Overview of ways to improve generalization
 9b) Limiting the size of the weights
 9c) Using noise as a regularizer
 9d) Introduction to the full Bayesian approach
 9e) The Bayesian interpretation of weight decay
 9f) MacKay’s quick and dirty method of setting weight costs
 Quiz 9
 Programming assignment 3: Optimization and generalization
 Week 10) Combining multiple neural networks to improve generalization
 10a) Why it helps to combine models
 10b) Mixtures of Experts
 10c) The idea of full Bayesian learning
 10d) Making full Bayesian learning practical
 10e) Dropout
 Quiz 10
 Week 11) Hopfield nets and Boltzmann machines
 11a) Hopfield Nets
 11b) Dealing with spurious minima
 11c) Hopfield nets with hidden units
 11d) Using stochastic units to improve search
 11e) How a Boltzmann machine models data
 Quiz 11
 Week 12) Restricted Boltzmann machines (RBMs)
 12a) Boltzmann machine learning
 12b) OPTIONAL VIDEO: More efficient ways to get the statistics
 12c) Restricted Boltzmann Machines
 12d) An example of RBM learning
 12e) RBMs for collaborative filtering
 Quiz 12
 Week 13) Stacking RBMs to make Deep Belief Nets
 Video: The ups and downs of backpropagation
 Video: Belief Nets
 Video: The wake-sleep algorithm
 Programming Assignment 4: Restricted Boltzmann Machines
 Quiz 13
 Week 14) Deep neural nets with generative pretraining
 Video: Learning layers of features by stacking RBMs
 Video: Discriminative learning for DBNs
 Video: What happens during discriminative fine-tuning?
 Video: Modeling real-valued data with an RBM
 Video: OPTIONAL VIDEO: RBMs are infinite sigmoid belief nets
 Quiz 14
 Week 15) Modeling hierarchical structure with neural nets
 Video: From PCA to autoencoders
 Video: Deep autoencoders
 Video: Deep autoencoders for document retrieval
 Video: Semantic Hashing
 Video: Learning binary codes for image retrieval
 Video: Shallow autoencoders for pretraining
 Quiz 15
 Final Exam
 Week 16) Recent applications of deep neural nets
 Video: OPTIONAL: Learning a joint model of images and captions
 Video: OPTIONAL: Hierarchical Coordinate Frames
 Video: OPTIONAL: Bayesian optimization of hyperparameters
Week 2) The Perceptron Learning Procedure
2a) An Overview of the Main Types of Neural Network Architecture
 Feedforward neural networks
 Recurrent neural networks
 Hard to train RNNs.
 Can remember information in their hidden state for a long time.
 Can be used for modeling sequences.
 Can predict the next character in a sequence.
 Symmetrically connected networks (= Hopfield networks)
 Symmetrical connections have the same weight in both directions.
 Symmetric networks are much easier to analyze than recurrent networks, because their behavior is governed by an energy function.
 Symmetrically connected networks with hidden units (= Boltzmann machines)
2b) Perceptrons: the First Generation of Neural Networks
Binary Threshold Neurons
 y = 1 if z = b + Σ_i x_i w_i ≥ 0; y = 0 otherwise.
Perceptron Learning Algorithm (How to Train Binary Threshold Neurons)
 If the output is correct, do nothing.
 If the output is 0 but the target is 1, add the input vector to the weight vector: w ← w + x. Then the logit for this case increases by x·x ≥ 0.
 If the output is 1 but the target is 0, subtract the input vector from the weight vector: w ← w − x. Then the logit for this case decreases by x·x.
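The procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's code: the AND dataset and epoch count are arbitrary choices, and the bias is folded in as an always-on extra input.

```python
import numpy as np

def train_perceptron(X, t, epochs=10):
    """Perceptron learning procedure for a binary threshold neuron.

    X: (n_cases, n_features) inputs; t: targets in {0, 1}.
    The bias is handled as an extra input that is always 1.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append the bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, t):
            y = 1 if x @ w >= 0 else 0   # binary threshold output
            if y == target:
                continue                 # correct: do nothing
            elif target == 1:
                w += x                   # output 0 but should be 1: add input
            else:
                w -= x                   # output 1 but should be 0: subtract input
    return w

# AND is linearly separable, so the procedure converges to a separating w
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
preds = [1 if np.append(x, 1) @ w >= 0 else 0 for x in X]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop stops changing the weights once every case is classified correctly.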
2c) A Geometrical View of Perceptrons
2d) Why the Learning Works
2e) What Perceptrons Can’t Do
Week 3) The Backpropagation Learning Procedure
3a) Learning the Weights of a Linear Neuron
 There are two different methods for finding the solution: the iterative method and the analytic method.
 The iterative method is much easier to generalize than the analytic method; that is, it can be applied to many more problems.
The Iterative Learning Method
 If two input dimensions are highly correlated, the iterative method can be very slow.
The Online Delta Rule
The Batch Delta Rule
 Δw_i = ε Σ_n x_i^(n) (t^(n) − y^(n))
 (1) ε: the learning rate
 (2) E = ½ Σ_n (t^(n) − y^(n))²: the squared residual summed over all training cases
 (3) ∂E/∂w_i = −Σ_n x_i^(n) (t^(n) − y^(n)), so Δw_i = −ε ∂E/∂w_i
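A minimal NumPy sketch of the batch delta rule, fitting a made-up linear target so the result can be checked; the learning rate and epoch count are illustrative choices.

```python
import numpy as np

def batch_delta_rule(X, t, lr=0.05, epochs=200):
    """Batch delta rule for a linear neuron y = w . x.

    Each step moves the weights against the gradient of
    E = 1/2 * sum_n (t_n - y_n)^2, i.e.
    delta_w_i = lr * sum_n x_i_n * (t_n - y_n).
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = X @ w                  # outputs for all training cases
        w += lr * X.T @ (t - y)    # accumulate the rule over the whole batch
    return w

# Recover the weights of a known linear function y = 2*x1 - 3*x2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
t = X @ np.array([2.0, -3.0])
w = batch_delta_rule(X, t)
```

Since the targets are exactly linear in the inputs, the iterative method converges to the same weights the analytic (least-squares) method would give.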
3b) The Error surface for a Linear Neuron
Cross-sections
 Vertical crosssections are parabolas.
 Horizontal crosssections are ellipses.
Batch vs. Online Learning
 Batch learning
 [+]: Moves perpendicular to the contour lines, taking a fairly direct route to the minimum.
 [−]: Computationally heavy, because each step uses all training examples to compute the gradient.
 Online learning
 [+]: Computationally light, because each step uses only one training example.
 [−]: Zigzags around the direction of steepest descent on its way to the minimum.
3c) Learning the Weights of a Logistic Output Neuron (Learning the Weight of a Nonlinear Neuron)
 To extend the learning rule for a linear neuron to the rule for a nonlinear neuron
 A logistic output neuron is used as the example of a nonlinear neuron; the same derivation applies to other differentiable nonlinear neurons.
The Derivatives for Extension
 ∂z/∂w_i = x_i: the derivative of the logit with respect to the weight
 ∂z/∂x_i = w_i: the derivative of the logit with respect to the input
 dy/dz = y(1 − y): the derivative of the output with respect to the logit
 ⇒ ∂y/∂w_i = (∂z/∂w_i)(dy/dz) = x_i y(1 − y)
The Extension of the Delta Rule to a Nonlinear Neuron
 Δw_i = ε Σ_n x_i^(n) y^(n)(1 − y^(n)) (t^(n) − y^(n))
 Compared with the delta rule for a linear neuron, y(1 − y) is an extra factor: the slope of the logistic.
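The extended rule can be sketched the same way as the linear case; the OR dataset, learning rate, and epoch count here are illustrative, and the bias is again folded in as a constant input.

```python
import numpy as np

def logistic_delta_rule(X, t, lr=0.5, epochs=5000):
    """Delta rule extended to a logistic neuron y = sigmoid(w . x).

    Compared with the linear case, the gradient picks up the extra
    factor y*(1-y): the slope of the logistic at the current logit.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        z = X @ w                          # logit
        y = 1.0 / (1.0 + np.exp(-z))       # logistic output
        slope = y * (1.0 - y)              # dy/dz
        w += lr * X.T @ (slope * (t - y))  # delta_w_i = lr * sum_n x_i * y(1-y) * (t-y)
    return w

# Learn OR, with the bias folded in as a constant input of 1
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])
w = logistic_delta_rule(X, t)
y = 1.0 / (1.0 + np.exp(-(X @ w)))
```

Note that learning slows down where the slope y(1 − y) is small, i.e. when the neuron's output saturates near 0 or 1.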
3d) The Backpropagation Algorithm
3e) How to Use the Derivatives Computed by the Backpropagation Algorithm
Optimization Issue 1: How Often to Update the Weights
 Online learning: after each training case.
 Full batch learning: after all training cases.
 Minibatch learning: after each small batch of training cases.
Optimization Issue 2: How Much to Update the Weights (Lecture 6)
 Use a fixed learning rate?
Overfitting
 Training error: very low ⇒ very fitted
 Test error: large ⇒ overfitted
Ways to Reduce Overfitting (Lecture 7)
 Weightdecay
 Weightsharing
 Early stopping
 Model averaging
 Bayesian fitting of neural nets
 Dropout
 Generative pretraining
Week 4) Learning Feature Vectors for Words
Week 10) Combining Multiple Neural Networks to Improve Generalization
10a) Why It Helps to Combine Models
10b) Mixtures of Experts
10c) The Idea of Full Bayesian Learning
10d) Making Full Bayesian Learning Practical
10e) Dropout: An Efficient Way to Combine Neural Nets
Two ways to average models
 Mixture: to combine models by arithmetically averaging output probabilities
 Product: to combine models by geometrically averaging output probabilities
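A tiny numeric illustration of the two averaging schemes, using made-up output distributions from two hypothetical models:

```python
import numpy as np

# Two models' predicted distributions over 3 classes (illustrative numbers)
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.3, 0.4, 0.3])

# Mixture: arithmetic mean; already sums to 1
mixture = (p1 + p2) / 2

# Product: geometric mean, renormalized so it is again a distribution
product = np.sqrt(p1 * p2)
product /= product.sum()
```

The product combination punishes a class more harshly if any one model assigns it low probability, whereas the mixture only needs one model to assign it high probability.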
Dropout: an efficient way to average many large neural nets
 Consider a neural net with one hidden layer.
Dropout: at training
 Let H be the number of units in the hidden layer.
 Each time we are given a training example, we randomly omit each hidden unit with probability 0.5.
 Randomly omitting hidden units is equivalent to randomly sampling from 2^H different architectures.
 All 2^H architectures share weights.
 2^H is generally a huge number, so only a few of the architectures ever get trained, and each of those sees only one training example.
 Dropout is a much better regularizer than L2 or L1 penalties.
 L2 or L1 penalties pull the weights towards zero.
 However, dropout does not pull the weights towards zero.
Dropout: at testing
 Use all of the hidden units, but halve their outgoing weights. This approximates the geometric mean of the predictions of all 2^H trained models.
Dropout: the input layer
 It helps to use dropout in the input layer as well, but with a higher probability of keeping each input unit.
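The train/test asymmetry can be sketched as follows for a single hidden layer of logistic units (the weight shapes and inputs are illustrative). Halving the activities at test time, as done here, is equivalent to halving the outgoing weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_train(x, W, b):
    """Training pass: keep each hidden unit with probability 0.5.

    Sampling the mask is equivalent to sampling one of the 2^H
    shared-weight architectures; omitted units output 0.
    """
    h = 1.0 / (1.0 + np.exp(-(x @ W + b)))  # logistic hidden activities
    mask = rng.random(h.shape) < 0.5        # keep each unit with p = 0.5
    return h * mask

def hidden_test(x, W, b):
    """Test pass: use all units but halve their activities,
    which is equivalent to halving the outgoing weights."""
    return 0.5 / (1.0 + np.exp(-(x @ W + b)))

x = np.ones(3)
W = rng.normal(scale=0.5, size=(3, 4))      # 3 inputs, H = 4 hidden units
b = np.zeros(4)
h_train = hidden_train(x, W, b)
h_test = hidden_test(x, W, b)
```

At test time every unit contributes, scaled so the expected input to the next layer matches what it saw during training.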
How well does dropout work?
 If your deep neural net is overfitting, then dropout will usually reduce errors.
 If your deep neural net uses ‘early stopping’, then dropout will usually reduce errors.
 If your deep neural net is NOT overfitting, then you should be using a bigger net.
Coadaptations
 Because any hidden unit may be omitted, a unit cannot rely on specific other hidden units being present; each must learn a feature that is useful on its own. Dropout thus prevents complex co-adaptations among hidden units.
Week 12) Restricted Boltzmann machines (RBMs)
12a) Boltzmann machine learning
 Topic: the learning algorithm of a Boltzmann machine
 In practice, the algorithm turned out to be extremely slow and noisy, and was not practical.
 The Boltzmann machine learning algorithm is an unsupervised learning algorithm.
 What the algorithm is trying to do is build a model of a set of input vectors.
 The goal of learning
 To maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set.
 = To maximize the sum of the log probabilities that the Boltzmann machine assigns to the binary vectors in the training set.
12c) Restricted Boltzmann Machines
 A restricted Boltzmann machine is a Boltzmann machine that
 has only one layer of hidden units,
 has no connections between hidden units.
 An RBM looks like a bipartite graph.
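For concreteness, here is a rough sketch of one RBM weight update using contrastive divergence with a single Gibbs step (CD-1), the kind of shortcut referred to in 12b. Biases are omitted and all sizes are illustrative; this is not the course's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, lr=0.1):
    """One CD-1 weight update for an RBM (biases omitted for brevity).

    Because there are no hidden-hidden connections, all hidden units
    can be sampled in parallel given the visible vector, and vice versa.
    """
    # Positive phase: hidden probabilities given the data
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # One Gibbs step: reconstruct the visibles, then the hiddens
    v1_prob = sigmoid(h0 @ W.T)
    h1_prob = sigmoid(v1_prob @ W)
    # CD-1 gradient estimate: <v h>_data - <v h>_reconstruction
    return W + lr * (np.outer(v0, h0_prob) - np.outer(v1_prob, h1_prob))

v = np.array([1.0, 0.0, 1.0])           # a 3-unit visible vector
W = rng.normal(scale=0.1, size=(3, 2))  # 3 visible x 2 hidden weights
W = cd1_update(v, W)
```

The bipartite structure is what makes both conditional distributions factorize, so each phase is a single matrix product rather than a long sequential settling process.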