Detection & Segmentation

Click on a tile to change the color scheme:

1. Classification & Localization

Difference between object detection and classification & localization:

For classification & localization, you know there are objects that you are looking for, and you know the number of it

Basic structure to tackle the task of classification & localization:

Screen Shot 2021-04-19 at 4.35.57 PM

Question:

Is it ok to do the two subtasks (classification & localization) together?
Some people may compute the loss for each class separately, but generally speaking it works well.
Multi-task loss (two kinds of loss)
Use hyper parameters to generate weighted total loss (it is difficult)
Or, use some final performance metric rather than just the value of loss to make choices
How to do it based on pre-trained models like ImageNet?
Freeze the pre-trained models first
Train your specific model
Train them together

2. Object detection

2.1 Sliding window

Object detection as classification

Big problem: how to choose the location to perform the classification

Brute force: computational expensive

2.2 Region Proposals

Screen Shot 2021-04-19 at 4.51.35 PM

Use proposals instead of searching for all regions.

How to propose?

2.2.1 R-CNN

Screen Shot 2021-04-19 at 4.56.54 PM

Screen Shot 2021-04-19 at 5.05.10 PM

Taking crops from the convolutional feature map.

Screen Shot 2021-04-19 at 5.07.05 PM

Screen Shot 2021-04-19 at 5.14.32 PM

2.2.2 YOLO / SSD

without proposals

Single-shot Detection: do all of the detections with a single forward pass (compared with performing detections for each proposals in R-CNN)

3. Instance Segmentation

3.1 Mask R-CNN

4. Semantic Segmentation

Paired training data: for each training image, each pixel is labeled with a semantic category.

4.1 Convolution

An intuitive idea: encode the entire image with conv net, and do semantic segmentation on top

Problem: classification architectures often reduce feature spatial sizes to go deeper, but semantic segmentation requires the output size to be the same as input size.