Detection & Segmentation
1. Classification & Localization
Difference between object detection and classification & localization:
- In classification & localization, you know in advance which objects you are looking for and how many of them there are; in object detection, the number of objects is not known ahead of time
Basic structure to tackle the task of classification & localization:
Question:
- Is it ok to do the two subtasks (classification & localization) together?
- Generally speaking, yes, it works well. Some people compute the loss for each class separately.
- Multi-task loss (two kinds of loss)
- Use a hyperparameter to weight the two losses into a total loss (the weight is difficult to tune)
- Or use a final performance metric, rather than the value of the loss alone, to choose between settings
- How to do it starting from a model pre-trained on ImageNet?
- Freeze the pre-trained model first
- Train your task-specific layers
- Then train them together
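The multi-task loss above can be sketched as a weighted sum of a classification loss and a localization loss. This is only a minimal sketch: the choice of softmax cross-entropy and squared L2 distance, the (x, y, w, h) box format, and the names `multitask_loss` and `box_weight` are assumptions made for illustration, not something fixed by these notes.

```python
import math

def softmax_cross_entropy(scores, label):
    """Classification loss: cross-entropy of softmax(scores) at the true label."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def l2_box_loss(pred_box, true_box):
    """Localization loss: squared L2 distance between two (x, y, w, h) boxes."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))

def multitask_loss(scores, label, pred_box, true_box, box_weight=1.0):
    """Weighted total loss; box_weight is the (assumed) hyperparameter to tune."""
    cls = softmax_cross_entropy(scores, label)
    loc = l2_box_loss(pred_box, true_box)
    return cls + box_weight * loc

# Example: two classes with equal scores, box off by 1 in x, weight 2.
loss = multitask_loss([0.0, 0.0], 0, [1, 0, 0, 0], [0, 0, 0, 0], box_weight=2.0)
```

Because `box_weight` changes the value of the total loss itself, comparing loss values across different weights is not meaningful, which is one reason to select it with a final performance metric instead.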
2. Object detection
2.1 Sliding window
Object detection as classification
Big problem: how to choose the location to perform the classification
Brute force: computationally expensive
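To get a feel for why brute force is expensive, the sketch below enumerates every window position for a given window size and counts them. The image size, window sizes, and function names are hypothetical, chosen only for illustration; each counted window would need its own classifier forward pass.

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride=1):
    """Yield (top, left, bottom, right) for every window that fits in the image."""
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            yield (top, left, top + win_h, left + win_w)

def count_windows(img_h, img_w, sizes, stride=1):
    """Total number of crops to classify over several window sizes."""
    return sum(
        1
        for win_h, win_w in sizes
        for _ in sliding_windows(img_h, img_w, win_h, win_w, stride)
    )

# A single 64x64 window size on a 224x224 image, moved one pixel at a time.
n = count_windows(224, 224, [(64, 64)], stride=1)
```

Adding more window sizes and aspect ratios multiplies this count further, which is the motivation for proposing a small set of candidate regions instead.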
2.2 Region Proposals
Use proposals instead of searching for all regions.
How to propose?
2.2.1 R-CNN
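R-CNN crops each proposed region from the image and runs a conv net on every crop. Fast R-CNN instead takes the crops from the convolutional feature map, so the convolutional computation is shared across proposals.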
2.2.2 YOLO / SSD
Detection without proposals.
Single-shot detection: do all of the detections with a single forward pass (compared with performing detection separately for each proposal in R-CNN)
3. Instance Segmentation
3.1 Mask R-CNN
4. Semantic Segmentation
Paired training data: for each training image, each pixel is labeled with a semantic category.
4.1 Convolution
An intuitive idea: encode the entire image with a conv net, and do semantic segmentation on top
Problem: classification architectures often reduce the spatial size of the features as they go deeper, but semantic segmentation requires the output size to be the same as the input size.
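The size mismatch can be checked with the standard output-size formula for a convolution or pooling layer, out = (in + 2 * pad - kernel) // stride + 1. The layer stack below is a hypothetical example (3x3 convolutions with padding 1, 2x2 pooling with stride 2), not a specific architecture from these notes.

```python
def conv_out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

def trace_sizes(size, layers):
    """Return the spatial size after each (kernel, stride, pad) layer."""
    sizes = [size]
    for kernel, stride, pad in layers:
        size = conv_out_size(size, kernel, stride, pad)
        sizes.append(size)
    return sizes

# conv 3x3 (pad 1), pool 2x2 (stride 2), conv 3x3 (pad 1), pool 2x2 (stride 2)
layers = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]
sizes = trace_sizes(224, layers)  # [224, 224, 112, 112, 56]
```

After only two pooling layers a 224-pixel side has shrunk to 56, so the feature map can no longer be read as one prediction per input pixel.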
4.2 Fully Convolutional
Design a network that has only convolutional layers and no downsampling operators, so that it makes predictions for all pixels at once!
Problem: convolutions at the original image resolution will be very expensive ...
Solution: Design the network as a stack of convolutional layers, with downsampling and upsampling inside the network!
Unpooling:
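Two common fixed (non-learned) ways to unpool are nearest-neighbor, which copies each value into a block, and "bed of nails", which puts each value in the top-left corner of its block and fills the rest with zeros. The notes do not say which variant is meant, so the sketch below is an assumption that shows both on a 2x2 input upsampled by a factor of 2.

```python
def nearest_neighbor_unpool(x, factor=2):
    """Repeat every value into a factor x factor block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]   # repeat along width
        out.extend(list(wide) for _ in range(factor))    # repeat along height
    return out

def bed_of_nails_unpool(x, factor=2):
    """Place every value at the top-left of its block; zeros elsewhere."""
    h, w = len(x), len(x[0])
    out = [[0] * (w * factor) for _ in range(h * factor)]
    for i in range(h):
        for j in range(w):
            out[i * factor][j * factor] = x[i][j]
    return out

x = [[1, 2],
     [3, 4]]
nn = nearest_neighbor_unpool(x)   # [[1,1,2,2],[1,1,2,2],[3,3,4,4],[3,3,4,4]]
bn = bed_of_nails_unpool(x)       # [[1,0,2,0],[0,0,0,0],[3,0,4,0],[0,0,0,0]]
```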
4.3 Transpose Convolution (Learnable Sampling)
Issue: checkerboard artifacts
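A 1-D sketch helps show both how transpose convolution upsamples and where the artifacts can come from. Each input value scales a copy of the filter, the copies are placed `stride` positions apart, and overlapping positions are summed. The function name and the example values are assumptions for illustration; a real layer would learn the filter weights.

```python
def transpose_conv1d(x, kernel, stride=2):
    """1-D transpose convolution: scaled filter copies, summed where they overlap."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out

# Input [2, 3] with filter [1, 10, 100]: the copies overlap at position 2.
y = transpose_conv1d([2, 3], [1, 10, 100], stride=2)   # [2, 20, 203, 30, 300]

# Constant input and constant filter: the output is still not constant,
# because some positions receive two contributions and others only one.
overlap = transpose_conv1d([1, 1, 1], [1, 1, 1], stride=2)  # [1, 1, 2, 1, 2, 1, 1]
```

The uneven overlap pattern in the second example, which appears when the filter size is not divisible by the stride, is one commonly cited source of checkerboard artifacts in 2-D.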