Detection & Segmentation

1. Classification & Localization

Difference between object detection and classification & localization:

  • For classification & localization, you know in advance which objects you are looking for and how many of them there are. For object detection, the number of objects is not known ahead of time.

Basic structure for classification & localization: one network handles two subtasks, predicting the class and predicting the location of the object, each with its own loss.

Questions:

  • Is it OK to do the two subtasks (classification & localization) together?
      • Generally speaking it works well, although some people compute the loss for each class separately.
      • The result is a multi-task loss (two kinds of loss).
      • A hyperparameter weights the two losses into a total loss. Choosing it is difficult, because changing the weight also changes the scale of the loss itself.
      • Alternatively, make choices based on a final performance metric rather than just the value of the loss.
  • How to do it starting from a model pre-trained on ImageNet?
      • Freeze the pre-trained model first.
      • Train your task-specific layers.
      • Then train everything together.
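
The weighted multi-task loss can be sketched in a few lines. This is only an illustration: it assumes softmax cross-entropy for the class and a squared-error loss on (x, y, w, h) box coordinates, and the names `multitask_loss` and `box_weight` are made up for this example.

```python
import math

def softmax_cross_entropy(scores, label):
    """Cross-entropy of softmax(scores) against an integer class label."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def l2_box_loss(pred_box, true_box):
    """Sum of squared differences between two (x, y, w, h) boxes."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))

def multitask_loss(scores, label, pred_box, true_box, box_weight=1.0):
    """Weighted total loss; box_weight is the hyperparameter to tune (assumed name)."""
    cls = softmax_cross_entropy(scores, label)
    box = l2_box_loss(pred_box, true_box)
    return cls + box_weight * box

loss = multitask_loss(
    scores=[2.0, 0.5, -1.0], label=0,
    pred_box=[0.5, 0.5, 0.2, 0.3], true_box=[0.4, 0.5, 0.2, 0.3],
    box_weight=0.5,
)
```

Because `box_weight` rescales the total, loss values from runs with different weights are not directly comparable, which is one reason to select the weight with a separate performance metric.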

2. Object detection

2.1 Sliding window

Object detection as classification

Big problem: how to choose the locations at which to run the classifier.

Brute force (trying every location and size): computationally expensive.
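
To get a feel for the cost, the sketch below counts how many crops a brute-force sliding window would send to the classifier. The image size, window sizes, and stride are arbitrary values chosen for illustration.

```python
def count_windows(image_h, image_w, win_h, win_w, stride=1):
    """Number of win_h x win_w crops that fit in the image at a given stride."""
    if win_h > image_h or win_w > image_w:
        return 0
    rows = (image_h - win_h) // stride + 1
    cols = (image_w - win_w) // stride + 1
    return rows * cols

def count_all_windows(image_h, image_w, sizes, stride=1):
    """Total crops over several window sizes; each one needs a classifier pass."""
    return sum(count_windows(image_h, image_w, h, w, stride) for h, w in sizes)

# Illustrative setup: a 224x224 image, four window shapes, stride 8.
sizes = [(64, 64), (128, 128), (128, 64), (64, 128)]
total = count_all_windows(224, 224, sizes, stride=8)
print(total)  # 1156 classifier evaluations for a single image
```

Even this coarse grid needs over a thousand forward passes per image, and a finer stride or more window shapes grows the count quickly.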

2.2 Region Proposals

Use region proposals (a small set of candidate regions) instead of searching over all regions.

How to propose?

2.2.1 R-CNN

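Taking crops from the convolutional feature map rather than from the image lets all proposals share one pass through the convolutional layers (the idea behind Fast R-CNN).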
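
A rough sketch of cropping a proposal from a feature map: the image box is divided by the network's total stride to find the matching feature-map cells. The stride of 16, the toy feature map, and the helper names are assumptions for illustration. Real detectors also resize each crop to a fixed size (RoI pooling), which is left out here.

```python
def project_box(box, stride):
    """Map an (x1, y1, x2, y2) image box onto feature-map cell indices."""
    x1, y1, x2, y2 = box
    # Floor the top-left and ceil the bottom-right so the crop covers the box.
    fx1, fy1 = x1 // stride, y1 // stride
    fx2, fy2 = -(-x2 // stride), -(-y2 // stride)
    return fx1, fy1, fx2, fy2

def crop_features(feature_map, box, stride):
    """Slice the feature-map cells covered by the projected box."""
    fx1, fy1, fx2, fy2 = project_box(box, stride)
    return [row[fx1:fx2] for row in feature_map[fy1:fy2]]

# Toy 4x4 single-channel feature map standing in for a 64x64 image at stride 16.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
crop = crop_features(feature_map, box=(16, 0, 48, 32), stride=16)
print(crop)  # [[1, 2], [5, 6]]
```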

2.2.2 YOLO / SSD

Detection without proposals.

Single-shot detection: make all of the detections in a single forward pass, instead of running detection separately for each proposal as in R-CNN.

3. Instance Segmentation

3.1 Mask R-CNN

4. Semantic Segmentation

Paired training data: for each training image, each pixel is labeled with a semantic category.

4.1 Convolution

An intuitive idea: encode the entire image with a convolutional network and do semantic segmentation on top.

Problem: classification architectures often reduce the spatial size of the features as they go deeper, but semantic segmentation requires the output to be the same size as the input.
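
The shrinking can be checked with the standard convolution output-size formula, floor((size + 2 * padding - kernel) / stride) + 1. The five stride-2 layers below are an arbitrary illustrative stack, not a specific architecture.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
sizes = [size]
for _ in range(5):  # five 3x3, stride-2, padding-1 layers (illustrative)
    size = conv_output_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14, 7]
```

A 7x7 output is fine for predicting one label per image, but far too coarse for one label per pixel.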

4.2 Fully Convolutional

Design a network with only convolutional layers and no downsampling operators, so that it makes predictions for all pixels at once.

Problem: convolutions at the original image resolution are very expensive.

Solution: design the network as a stack of convolutional layers, with downsampling and upsampling inside the network.
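
A back-of-the-envelope count of multiply-accumulate operations shows why working at lower resolution inside the network helps. The channel counts and resolutions are illustrative assumptions.

```python
def conv_macs(h, w, c_in, c_out, kernel):
    """Multiply-accumulates for one kernel x kernel conv layer that keeps h x w."""
    return h * w * c_in * c_out * kernel * kernel

full = conv_macs(224, 224, 64, 64, 3)    # layer at full image resolution
quarter = conv_macs(56, 56, 64, 64, 3)   # same layer after 4x downsampling
ratio = full // quarter

print(ratio)  # 16
```

Downsampling by 4 along each axis makes the same layer 16 times cheaper, so most of the depth can sit in the low-resolution middle of the network.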

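Unpooling: upsample with a fixed rule that has no learnable parameters, such as nearest-neighbor copying or max unpooling (each value goes back to the position its max came from in the matching pooling layer).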
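
Two common unpooling rules, sketched on plain nested lists. Both are fixed rules with nothing to learn. The function names are made up for this example, and the input height and width are assumed to be divisible by the pooling size.

```python
def nearest_neighbor_unpool(x, factor=2):
    """Repeat every value factor times along both axes."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def max_pool_with_indices(x, size=2):
    """Max pooling that also remembers where each max came from."""
    pooled, indices = [], []
    for r in range(0, len(x), size):
        prow, irow = [], []
        for c in range(0, len(x[0]), size):
            cells = [(x[r + dr][c + dc], (r + dr, c + dc))
                     for dr in range(size) for dc in range(size)]
            value, pos = max(cells, key=lambda cell: cell[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices

def max_unpool(pooled, indices, out_h, out_w):
    """Place each value at its remembered position; everything else is zero."""
    out = [[0] * out_w for _ in range(out_h)]
    for prow, irow in zip(pooled, indices):
        for value, (r, c) in zip(prow, irow):
            out[r][c] = value
    return out

x = [[1, 2, 6, 3],
     [3, 5, 2, 1],
     [1, 2, 2, 1],
     [7, 3, 4, 8]]
pooled, indices = max_pool_with_indices(x)
restored = max_unpool(pooled, indices, 4, 4)
print(pooled)    # [[5, 6], [7, 8]]
print(restored)  # [[0, 0, 6, 0], [0, 5, 0, 0], [0, 0, 0, 0], [7, 0, 0, 8]]
```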

4.3 Transpose Convolution (Learnable Upsampling)

Issue: checkerboard artifacts, which appear when copies of the kernel overlap unevenly in the output.
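
A 1D sketch of transpose convolution: each input value scales a copy of the kernel, the copies are placed `stride` positions apart in the output, and overlapping entries are summed. In a real network the kernel weights are learned. The all-ones inputs and kernels here are chosen only to make the overlap pattern visible.

```python
def transpose_conv1d(x, kernel, stride):
    """1D transpose convolution with no padding."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            # Each input value adds a scaled copy of the kernel to the output.
            out[i * stride + k] += value * weight
    return out

# Kernel size 3 is not divisible by stride 2: the overlap is uneven.
uneven = transpose_conv1d([1, 1, 1, 1], [1, 1, 1], stride=2)
# Kernel size 4 is divisible by stride 2: the interior overlap is uniform.
even = transpose_conv1d([1, 1, 1, 1], [1, 1, 1, 1], stride=2)

print(uneven)  # [1.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0]
print(even)    # [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0]
```

The alternating 2, 1, 2, 1 pattern in `uneven` is the 1D version of a checkerboard. A kernel size that is a multiple of the stride is one common way to reduce it.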


Last update: June 16, 2023
Authors: Colin