#### Lecture 6 | Training Neural Networks I

###### Sigmoid

- Problems of the sigmoid activation function
- Problem 1: Saturated neurons kill the gradients.
- Problem 2: Sigmoid outputs are not zero-centered.
- Suppose a feed-forward network whose hidden layers all use sigmoid activations.
- Then every layer except the first receives only positive inputs, because sigmoid outputs lie in (0, 1).
- If all inputs to a neuron are positive, the gradients on its weights are either all positive or all negative (they share the sign of the upstream gradient).
- With gradients restricted to one sign, the update direction is very constrained (weight updates zig-zag toward the optimum).

- Problem 3: exp() is somewhat expensive to compute. – (a minor problem)
- Modern numerical libraries handle this well.
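Problem 1 is easy to see numerically. A minimal sketch (using numpy; the function names here are just for illustration) of how the sigmoid's local gradient vanishes in the saturated regions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # local gradient: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Near x = 0 the gradient is at its maximum (0.25); far from 0 the
# neuron saturates and the gradient it passes back is nearly zero.
print(sigmoid_grad(0.0))    # 0.25
print(sigmoid_grad(10.0))   # tiny: the gradient is effectively killed
print(sigmoid_grad(-10.0))  # same on the negative side
```

During backprop this tiny local gradient multiplies the upstream gradient, so almost nothing flows to earlier layers.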

###### tanh (tangent hyperbolic)

- Zero centered
- Problem 2 is solved.

- Problems 1 and 3 remain: tanh still saturates and still uses exp().
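A quick numpy sketch of the contrast: sigmoid outputs are all positive, tanh outputs are zero-centered, but tanh still saturates at large |x|:

```python
import numpy as np

x = np.linspace(-3, 3, 7)

s = 1.0 / (1.0 + np.exp(-x))   # sigmoid: outputs in (0, 1), all positive
t = np.tanh(x)                 # tanh: outputs in (-1, 1), symmetric around 0

print(s.min(), s.max())
print(t.min(), t.max())

# But tanh still saturates: its gradient 1 - tanh(x)^2 dies for large |x|.
print(1.0 - np.tanh(10.0) ** 2)
```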

###### ReLU (rectified linear unit)

- Problem 1 is solved in the positive region: ReLU does not saturate for x > 0.
- Actually more biologically plausible than sigmoid. (The details were not covered in this lecture.)
- AlexNet used ReLU.
- Problems
- Problem 1: Not zero-centered
- The gradient of each weight is zero or positive.
- The update direction is always a combination of zeros and positives.
- The update direction is restricted, which makes optimization inefficient.

- Problem 2: dead ReLU
- Roughly 20% of units can end up never active and never updated; these are called dead ReLUs.

- Mitigation for dead ReLUs: initialization
- People like to initialize ReLU neurons with slightly positive biases (e.g. 0.01) so the units are more likely to be active at the start.

- Leaky ReLU
- PReLU (Parametric Rectifier)
- ELU (Exponential Linear Unit)
- In between ReLU and leaky ReLU: linear in the positive region, but saturates in the negative region, which adds some robustness to noise.
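The three ReLU-family variants above can be sketched in a few lines of numpy (default slopes here, alpha = 0.01 for leaky ReLU and alpha = 1.0 for ELU, are the common choices, not something this lecture fixes):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small non-zero slope for x < 0, so the gradient never fully dies
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # linear for x > 0; smoothly saturates toward -alpha for very negative x
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # zeros out the negative region
print(leaky_relu(x))  # keeps a small negative signal
print(elu(x))         # negative side between leaky ReLU and ReLU
```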

###### Maxout

- Nonlinear
- A generalized form of ReLU and leaky ReLU
- Benefits
- Linear regimes
- Its output does not saturate.
- Its gradient does not die.

- Drawback
- Doubles the number of parameters per neuron.
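A sketch of a maxout unit with k = 2 pieces (the layer sizes here are arbitrary, just for illustration): each output takes the elementwise max of two linear maps, which is exactly why the parameter count doubles.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input vector

# Two full sets of weights and biases per layer (k = 2).
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)

# Maxout: elementwise max of the two linear maps.
out = np.maximum(W1 @ x + b1, W2 @ x + b2)

# ReLU is the special case W2 = 0, b2 = 0: max(w.x + b, 0).
relu_case = np.maximum(W1 @ x + b1, 0.0)
```

Because the output is the max of linear functions, it operates in a linear regime and never saturates, so its gradient does not die.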

###### In practice

- Use ReLU first.
- Try out Leaky ReLU, Maxout, and ELU.
- Try out tanh but don’t expect much.
- Don’t use sigmoid.

#### Lecture 11: Detection and segmentation

- segmentation, localization, detection
- semantic segmentation, instance segmentation
- downsampling, upsampling
- unpooling by nearest neighbor, unpooling by ‘bed of nails’
- max unpooling
- transpose convolution, upconvolution, fractionally strided convolution, backward convolution
- upsampling: unpooling, strided transpose convolution
- Treat localization as a regression problem!
- Use L2 loss on the predicted box coordinates for localization.
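The two simple unpooling schemes listed above can be sketched in numpy for a 2x2 upsampling factor:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Nearest-neighbor unpooling: copy each value into its 2x2 output block.
nn = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# "Bed of nails" unpooling: put each value in the top-left corner of its
# 2x2 block and fill the rest with zeros.
nails = np.zeros((4, 4), dtype=x.dtype)
nails[::2, ::2] = x

print(nn)
print(nails)
```

Max unpooling works like the bed of nails, except each value is placed at the position the corresponding max-pooling layer originally took it from (the saved switch/argmax), rather than always at the corner.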

##### Object detection

###### Sliding window

- Apply a CNN to many different crops of the image. The CNN classifies each crop as object or background.
- Sliding window is very computationally expensive!

###### Region proposal

- Find “blobby” image regions that are likely to contain objects.
- Relatively fast to run; e.g. Selective Search gives ~2000 region proposals in a few seconds on CPU.
- R-CNN, Fast R-CNN, Faster R-CNN

###### R-CNN

- Ad hoc training objectives
- Training is slow and takes a lot of disk space.
- Inference (detection) is slow.

###### Fast R-CNN

- Run the whole image through the ConvNet once and crop features for each region proposal (RoI pooling), so all regions share computation.
- Problem: Runtime dominated by region proposal

###### Faster R-CNN

- Make a CNN do the proposals!
- Insert Region Proposal Network (RPN) to predict proposals from features.

###### Detection without Proposals

- YOLO / SSD
- Use grid cells
- Faster R-CNN is slower than these single-shot detectors but more accurate.
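As a concrete sketch of the grid-cell idea, using the YOLO v1 defaults (which this lecture does not spell out, so take the specific numbers as assumptions): for each of S×S grid cells, the network predicts B boxes, each with (x, y, w, h, confidence), plus C class scores.

```python
# YOLO v1 defaults (assumed): 7x7 grid, 2 boxes per cell, 20 classes
S, B, C = 7, 2, 20

# Each box predicts (x, y, w, h, confidence) -> 5 numbers per box.
out_per_cell = B * 5 + C
out_total = S * S * out_per_cell

print(out_per_cell)  # 30
print(out_total)     # 1470
```

So the detection head is a single regression from the image to one S×S×(5B + C) tensor, with no separate proposal stage.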

###### Dense captioning

- Dense Captioning = object detection + captioning

##### Instance segmentation

- Mask R-CNN
- Very good results!
- Can also do pose estimation (predicting joint keypoints)!