#### Lecture 6 | Training Neural Networks I

###### Sigmoid

- Problems of the sigmoid activation function
- Problem 1: Saturated neurons kill the gradients.
- Problem 2: Sigmoid outputs are not zero-centered.
- Suppose a feed-forward network whose hidden layers all use sigmoid activations.
- Then every layer except the first receives only positive inputs, because sigmoid outputs lie in (0, 1).
- If all inputs to a neuron are positive, the gradients on its weights are either all positive or all negative (they share the sign of the upstream gradient).
- With gradients restricted to one sign, the update direction is very constrained (weight updates zig-zag toward the optimum).

- Problem 3: exp() is somewhat expensive to compute. – (a minor problem)
- Modern numerical libraries handle this well.
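Problem 1 is easy to see numerically. A minimal sketch (using numpy; the function names here are just for illustration) of how the sigmoid's local gradient vanishes in the saturated regions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # local gradient: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Near x = 0 the gradient is at its maximum (0.25); far from 0 the
# neuron saturates and the gradient it passes back is nearly zero.
print(sigmoid_grad(0.0))    # 0.25
print(sigmoid_grad(10.0))   # tiny: the gradient is effectively killed
print(sigmoid_grad(-10.0))  # same on the negative side
```

During backprop this tiny local gradient multiplies the upstream gradient, so almost nothing flows to earlier layers.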

###### tanh (tangent hyperbolic)

- Zero centered
- Problem 2 is solved.

- Problems 1 and 3 remain: tanh still saturates and still uses exp().
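A quick numpy sketch of the contrast: sigmoid outputs are all positive, tanh outputs are zero-centered, but tanh still saturates at large |x|:

```python
import numpy as np

x = np.linspace(-3, 3, 7)

s = 1.0 / (1.0 + np.exp(-x))   # sigmoid: outputs in (0, 1), all positive
t = np.tanh(x)                 # tanh: outputs in (-1, 1), symmetric around 0

print(s.min(), s.max())
print(t.min(), t.max())

# But tanh still saturates: its gradient 1 - tanh(x)^2 dies for large |x|.
print(1.0 - np.tanh(10.0) ** 2)
```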

###### ReLU (rectified linear unit)

- Problem 1 is solved in the positive region: ReLU does not saturate for x > 0.
- Actually more biologically plausible than sigmoid. (The details were not covered in this lecture.)
- AlexNet used ReLU.
- Problems
- Problem 1: Not zero-centered
- The gradient of each weight is zero or positive.
- The update direction is always a combination of zeros and positives.
- The update direction is restricted, which makes optimization inefficient.

- Problem 2: dead ReLU
- Roughly 20% of units can end up never active and never updated; these are called dead ReLUs.

- Mitigation for dead ReLUs: initialization
- People like to initialize ReLU neurons with slightly positive biases (e.g. 0.01) so the units are more likely to be active at the start.

- Leaky ReLU
- PReLU (Parametric Rectifier)
- ELU (Exponential Linear Unit)
- In between ReLU and leaky ReLU: linear in the positive region, but saturates in the negative region, which adds some robustness to noise.
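The three ReLU-family variants above can be sketched in a few lines of numpy (default slopes here, alpha = 0.01 for leaky ReLU and alpha = 1.0 for ELU, are the common choices, not something this lecture fixes):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small non-zero slope for x < 0, so the gradient never fully dies
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # linear for x > 0; smoothly saturates toward -alpha for very negative x
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # zeros out the negative region
print(leaky_relu(x))  # keeps a small negative signal
print(elu(x))         # negative side between leaky ReLU and ReLU
```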

###### Maxout

- Nonlinear
- A generalized form of ReLU and leaky ReLU
- Benefits
- Linear regimes
- Its output does not saturate.
- Its gradient does not die.

- Drawback
- Doubles the number of parameters per neuron.
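A sketch of a maxout unit with k = 2 pieces (the layer sizes here are arbitrary, just for illustration): each output takes the elementwise max of two linear maps, which is exactly why the parameter count doubles.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input vector

# Two full sets of weights and biases per layer (k = 2).
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)

# Maxout: elementwise max of the two linear maps.
out = np.maximum(W1 @ x + b1, W2 @ x + b2)

# ReLU is the special case W2 = 0, b2 = 0: max(w.x + b, 0).
relu_case = np.maximum(W1 @ x + b1, 0.0)
```

Because the output is the max of linear functions, it operates in a linear regime and never saturates, so its gradient does not die.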

###### In practice

- Use ReLU first.
- Try out Leaky ReLU, Maxout, and ELU.
- Try out tanh but don’t expect much.
- Don’t use sigmoid.

#### Lecture 11: Detection and segmentation

- segmentation, localization, detection
- semantic segmentation, instance segmentation
- downsampling, upsampling
- unpooling by nearest neighbor, unpooling by ‘bed of nails’
- max unpooling
- transpose convolution, upconvolution, fractionally strided convolution, backward convolution
- upsampling: unpooling, strided transpose convolution
- Treat localization as a regression problem!
- Use L2 loss on the predicted box coordinates for localization.
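The two simple unpooling schemes listed above can be sketched in numpy for a 2x2 upsampling factor:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Nearest-neighbor unpooling: copy each value into its 2x2 output block.
nn = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# "Bed of nails" unpooling: put each value in the top-left corner of its
# 2x2 block and fill the rest with zeros.
nails = np.zeros((4, 4), dtype=x.dtype)
nails[::2, ::2] = x

print(nn)
print(nails)
```

Max unpooling works like the bed of nails, except each value is placed at the position the corresponding max-pooling layer originally took it from (the saved switch/argmax), rather than always at the corner.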

##### Object detection

###### Sliding window

- Apply a CNN to many different crops of the image. The CNN classifies each crop as object or background.
- Sliding window is very computationally expensive!

###### Region proposal

- Find “blobby” image regions that are likely to contain objects.
- Relatively fast to run; e.g. Selective Search gives ~2000 region proposals in a few seconds on CPU.
- R-CNN, Fast R-CNN, Faster R-CNN

###### R-CNN

- Ad hoc training objectives
- Training is slow and takes a lot of disk space.
- Inference (detection) is slow.

###### Fast R-CNN

- Run the whole image through the ConvNet once and crop features for each region proposal (RoI pooling), so all regions share computation.
- Problem: Runtime dominated by region proposal

###### Faster R-CNN

- Make a CNN do the proposals!
- Insert Region Proposal Network (RPN) to predict proposals from features.

###### Detection without Proposals

- YOLO / SSD
- Use grid cells
- Faster R-CNN is slower than these single-shot detectors but more accurate.
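As a concrete sketch of the grid-cell idea, using the YOLO v1 defaults (which this lecture does not spell out, so take the specific numbers as assumptions): for each of S×S grid cells, the network predicts B boxes, each with (x, y, w, h, confidence), plus C class scores.

```python
# YOLO v1 defaults (assumed): 7x7 grid, 2 boxes per cell, 20 classes
S, B, C = 7, 2, 20

# Each box predicts (x, y, w, h, confidence) -> 5 numbers per box.
out_per_cell = B * 5 + C
out_total = S * S * out_per_cell

print(out_per_cell)  # 30
print(out_total)     # 1470
```

So the detection head is a single regression from the image to one S×S×(5B + C) tensor, with no separate proposal stage.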

###### Dense captioning

- Dense Captioning = object detection + captioning

##### Instance segmentation

- Mask R-CNN
- Very good results!
- Can also do pose estimation (predicting joint keypoints)!