In this project, you will design and train deep convolutional networks for semantic segmentation.
Dataset
The dataset used in this assignment is CamVid, a small dataset of 701 images for self-driving perception. It was first introduced in 2008 by researchers at the University of Cambridge [1]. You can read more about it at the original dataset page or in the paper describing it. The images are typically around 720 × 960 pixels; we will downsample them for training, since even at 240 × 320 px most of the scene detail is still recognizable.
Today there are much larger semantic segmentation datasets for self-driving, such as Cityscapes, WildDashV2, and Audi A2D2, but they are too large to work with in a homework assignment.
The original CamVid dataset has 32 ground truth semantic categories, but most works evaluate on just an 11-class subset, so we will do the same. These 11 classes are ‘Building’, ‘Tree’, ‘Sky’, ‘Car’, ‘SignSymbol’, ‘Road’, ‘Pedestrian’, ‘Fence’, ‘Column Pole’, ‘Sidewalk’, and ‘Bicyclist’. A sample collection of CamVid images is shown below:
Figure 1: Example scenes from the CamVid dataset: (a) Image A, RGB; (b) Image A, ground truth; (c) Image B, RGB; (d) Image B, ground truth. The RGB image is shown on the left, and the corresponding ground truth “label map” is shown on the right.
1 Implementation
For this project, the majority of the details are provided in two separate Jupyter notebooks. The first, proj6_local.ipynb, includes unit tests to help guide you through the local implementation. After finishing that, upload proj6_colab.ipynb to Colab. Next, zip up the files for Colab with our script zip_for_colab.py and upload them to your Colab environment.
We will be implementing the PSPNet [3] architecture. You can read the original paper here. This network uses a ResNet [2] backbone with dilation to increase the receptive field, and it aggregates context over different portions of the image with a “Pyramid Pooling Module” (PPM).
Figure 2: PSPNet architecture. The Pyramid Pooling Module (PPM) splits the H×W feature map into K×K grids. Here, 1×1, 2×2, 3×3, and 6×6 grids are formed, and features are average-pooled within each grid cell. Afterwards, the 1×1, 2×2, 3×3, and 6×6 grids are upsampled back to the original H×W feature map resolution and stacked together along the channel dimension.
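As a concrete reference, here is a minimal PyTorch sketch of a PPM along these lines. The class name, channel counts, and bin sizes are our own illustrative choices, not the notebook's starter code, which may structure this differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module sketch: pool the feature map at several grid
    sizes, project each pooled map with a 1x1 conv, upsample back to H x W,
    and concatenate with the input along the channel dimension."""

    def __init__(self, in_channels, reduction_channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),  # K x K grid of averaged features
                nn.Conv2d(in_channels, reduction_channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(reduction_channels),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        out = [x]
        for branch in self.branches:
            pooled = branch(x)
            # Upsample each K x K grid back to the original H x W resolution.
            out.append(F.interpolate(pooled, size=(h, w),
                                     mode='bilinear', align_corners=True))
        return torch.cat(out, dim=1)
```

With these assumed sizes, a 2048-channel input and four bins of 512 channels each yield 2048 + 4 × 512 = 4096 output channels before the final classification layers.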
You can read more about dilated convolution in the Dilated Residual Network paper here, from which PSPNet borrows some ideas. You can also watch a helpful animation of dilated convolution here.
Figure 3: Dilated convolution (panels a–c). Figure source: https://github.com/vdumoulin/conv_arithmetic#dilated-convolution-animations
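Before wiring dilation into the network, it can help to see its effect in isolation. The snippet below is our own illustrative example (not part of the starter code): a 3×3 convolution with dilation=2 and padding=2 keeps the spatial size while sampling a wider window.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 60, 60)

# A standard 3x3 convolution samples a 3x3 window around each position.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# With dilation=2 the same 3x3 kernel samples every other pixel of a 5x5
# window, enlarging the receptive field with no extra parameters.
# Setting padding equal to the dilation keeps the spatial size unchanged.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, dilated(x).shape)  # both torch.Size([1, 64, 60, 60])
```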
Suggested order of experimentation
1. Start with just ResNet-50, without any dilation or PPM; end with a 7×7 feature map, and add a 1×1 convolution as a classifier. Report the mean intersection over union (mIoU). A minimal sketch of this baseline is given after this list.
2. Now, add in data augmentation (a sketch of a joint image/label transform is given after this list). Report the mIoU. (You should get around 48% mIoU in 50 epochs, 56% mIoU in 100 epochs, or 58-60% in 200 epochs.)
3. Now add in dilation. Report the mIoU.
4. Now add in the PPM module. Report the mIoU.
5. Try adding in the auxiliary loss (see the loss-weighting sketch after this list). Report the mIoU. (You should get around 65% mIoU over 100 epochs, or 67% in 200 epochs.)
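For step 1, one way to picture the baseline is a plain ResNet-50 feature extractor followed by a 1×1 convolution classifier, with the coarse logits upsampled back to input resolution. This is a hedged sketch under our own naming and channel assumptions; adapt it to the interfaces in the notebooks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class SimpleSegNet(nn.Module):
    """Baseline sketch: ResNet-50 backbone (no dilation, no PPM) followed by a
    1x1 convolution classifier, with low-resolution logits upsampled to the
    input size."""

    def __init__(self, num_classes=11):
        super().__init__()
        backbone = resnet50(pretrained=True)  # or weights=... in newer torchvision
        # Keep everything up to (and including) layer4; drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = self.features(x)          # roughly 1/32-resolution feature map
        logits = self.classifier(feats)   # (N, num_classes, H', W')
        # Upsample so the logits can be compared against the per-pixel labels.
        return F.interpolate(logits, size=(h, w),
                             mode='bilinear', align_corners=True)
```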
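For step 2, keep in mind that geometric augmentations must be applied identically to the image and its label map so that pixels stay aligned, while photometric jitter is applied to the image only. A minimal sketch using torchvision's functional API; the helper name and jitter range are our own assumptions, not the starter code's.

```python
import random
import torchvision.transforms.functional as TF

def joint_augment(image, label):
    """Apply the same geometric transform to the RGB image and its label map;
    photometric jitter goes on the image only (illustrative values)."""
    # Shared random horizontal flip keeps image and labels aligned.
    if random.random() < 0.5:
        image, label = TF.hflip(image), TF.hflip(label)
    # Brightness jitter on the image only.
    image = TF.adjust_brightness(image, brightness_factor=random.uniform(0.8, 1.2))
    return image, label
```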
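For step 5, PSPNet attaches a second classifier to an intermediate backbone stage and adds its cross-entropy loss with a small weight (0.4 in the paper). A minimal sketch of combining the two terms, assuming the model returns both sets of logits and that unlabeled pixels are marked with ignore_index=255 (an assumption; check the notebook's label convention):

```python
import torch.nn.functional as F

def pspnet_loss(main_logits, aux_logits, target, aux_weight=0.4, ignore_index=255):
    """Combine the main loss with the auxiliary loss from the intermediate
    classifier. aux_weight=0.4 follows the PSPNet paper; ignore_index marks
    unlabeled pixels so they do not contribute to the loss."""
    main_loss = F.cross_entropy(main_logits, target, ignore_index=ignore_index)
    aux_loss = F.cross_entropy(aux_logits, target, ignore_index=ignore_index)
    return main_loss + aux_weight * aux_loss
```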