PART 1
The goal of this project is to create stereo depth estimation algorithms, both classical and deep-learning-based. For the classical algorithms, you will use deterministic functions to compare patches and compute a disparity map. For the deep-learning-based algorithms, you will train a model to estimate the disparity map. There are two parts in this project, the first of which is described in this handout. You will implement functions in part1_*.py to generate random patches, evaluate the similarity of those patches, and then compute the disparity map for several images. The corresponding notebook for this section is part1_simple_stereo.ipynb. Part 2 of this project (including its corresponding handout) will be released separately.
1 Simple stereo by matching patches
Introduction
We know that there is some encoding of depth when images are captured using a stereo rig, much like human eyes. You can try a simple experiment to see the stereo effect in action. Look at a scene with only your left eye open, then quickly switch to only your right eye. You should notice a horizontal shift in the perceived image. Can you comment on how the shift differs between objects when you do this experiment? Is it related to the depth of the objects in some way?
In this section, we will generate a disparity map, which is the map of horizontal shifts estimated at each pixel. We will start working on a simple algorithm, which will then be improved to calculate more accurate disparity maps.
The notebook corresponding to this part is part1_simple_stereo.ipynb.
1.1 Random dot stereogram
It was once believed that in order to perceive depth, one must either match feature points (like SIFT) between the left and right images, or rely on cues such as shadows.
A random dot stereogram eliminates all other depth cues, and hence proves that a stereo setup is sufficient to get an idea of the depth of the scene. A random dot stereogram is generated by the following steps:
Create the left image with random dots at each pixel (0/1 values).
Create the right image as a copy of the left image.
Select a region in the right image and shift it horizontally.
Add a random pattern in the right image in the empty region created after the shift.
In part1a_random_stereogram.py, you will implement generate_random_stereogram() to generate a random dot stereogram for the given image size.
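As a concrete illustration of the four steps above, here is a minimal NumPy sketch. The graded function in part1a_random_stereogram.py has its own signature (and may use torch tensors), so treat every name and default below as illustrative only:

import numpy as np

def random_stereogram_sketch(h=51, w=51, block=10, shift=2, seed=0):
    # Illustrative random dot stereogram: a left/right pair whose only
    # depth cue is a horizontally shifted central block.
    rng = np.random.default_rng(seed)
    left = rng.integers(0, 2, size=(h, w)).astype(np.float32)  # random 0/1 dots
    right = left.copy()

    # Shift a central block of the right image left by `shift` pixels.
    r0, c0 = h // 2 - block // 2, w // 2 - block // 2
    right[r0:r0 + block, c0 - shift:c0 - shift + block] = \
        left[r0:r0 + block, c0:c0 + block]

    # Fill the vacated strip with fresh random dots so it carries no cue.
    right[r0:r0 + block, c0 + block - shift:c0 + block] = \
        rng.integers(0, 2, size=(block, shift))
    return left, right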
1.2 Similarity measure
To compare patches between left and right images, we will need two kinds of similarity functions:
Sum of squared differences (SSD):
$$\mathrm{SSD}(A,B) = \sum_{i \in [0,H),\, j \in [0,W)} (A_{ij} - B_{ij})^2 \tag{1}$$
Sum of absolute differences (SAD):
$$\mathrm{SAD}(A,B) = \sum_{i \in [0,H),\, j \in [0,W)} |A_{ij} - B_{ij}| \tag{2}$$
where A and B are two patches of height H and width W.
In part1b_similarity_measures.py, you will implement the following:
ssd_similarity_measure(): Calculate SSD distance.
sad_similarity_measure(): Calculate SAD distance.
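Equations (1) and (2) translate almost directly into code. A minimal NumPy sketch follows; the graded functions in part1b_similarity_measures.py may operate on torch tensors instead:

import numpy as np

def ssd(patch_a, patch_b):
    # Sum of squared differences over all pixels, eq. (1).
    return float(np.sum((patch_a - patch_b) ** 2))

def sad(patch_a, patch_b):
    # Sum of absolute differences over all pixels, eq. (2).
    return float(np.sum(np.abs(patch_a - patch_b)))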
1.3 Disparity maps
We are now ready to write code for a simple algorithm for stereo matching. You will need to follow the steps visualized in Figure 1:
Figure 1: Example of a stereo algorithm.
Pick a patch in the left image (red block), P1.
Place the patch at the same (x,y) coordinates in the right image (red block). Because this is binocular stereo with a rectified pair, the match for P1 in the right image will lie at or to the left of this position, so we search leftward starting from it. Make sure you understand this point well before proceeding further.
Slide the block of candidates to the left (indicated by the different pink blocks). The search area is restricted by the parameter max_search_bound in the code. The candidates will overlap.
We will pick the candidate patch with the minimum similarity error (green block). The horizontal shift from the red block to the green block in this image is the disparity value for the center of P1 in the left image.
Note: the images have already been rectified, so we only need to search along a single horizontal scan line.
In part1c_disparity_map.py, you will implement calculate_disparity_map() (please read the documentation carefully!) to calculate the disparity value at each pixel by searching a small patch around a pixel from the left image in the right image.
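To make the search procedure concrete, here is a minimal (and deliberately slow) NumPy sketch for single-channel images. The real calculate_disparity_map() has its own signature and edge-handling requirements, so all names below are illustrative:

import numpy as np

def disparity_map_sketch(left, right, block=9, max_search_bound=50, sim=None):
    # For each left-image patch, search leftward along the same scan line
    # in the right image and record the shift with the lowest error.
    if sim is None:
        sim = lambda a, b: np.sum(np.abs(a - b))  # SAD by default
    h, w = left.shape
    half = block // 2
    disparity = np.zeros((h, w), dtype=np.int64)
    for y in range(half, h - half):
        for x in range(half, w - half):
            p1 = left[y - half:y + half + 1, x - half:x + half + 1]
            errors = []
            # Candidates lie to the LEFT of (x, y), up to max_search_bound.
            for d in range(min(max_search_bound, x - half) + 1):
                p2 = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                errors.append(sim(p1, p2))
            disparity[y, x] = int(np.argmin(errors))  # shift with minimum error
    return disparity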
Figure 2: (a) Convex error profile. (b) Non-convex error profile.
1.4 Error profile analysis
Before computing the full disparity map, we will first analyze the similarity error distribution between patches. You will have to find two examples which display a close-to-convex error profile, and a highly non-convex profile, respectively. For reference, we provide the plots we obtained (see Figure 2). Based on your output visualizations and understanding of the process, answer the reflection questions in the report.
1.5 Real-life stereo images
You will iterate through pairs of images from the dataset and calculate the disparity map for each pair. The code is already given to you; you just need to compare the disparity maps and answer the reflection questions in the report.
1.6 Smoothing
One issue with the above results is that they aren’t very smooth. Pixels next to each other on the same surface can have vastly different disparities, making the results look very noisy and patchy in some areas. Intuitively, pixels next to each other should have a smooth transition in disparity (unless at an object boundary or occlusion).
In this part, we try to improve our results through the use of a smoothing constraint. The smoothing method we use is called semi-global matching (SGM), or semi-global block matching. Previously, we picked the disparity for a pixel based on the minimum matching cost of the block under some metric (SSD or SAD). The basic idea of SGM is to penalize disparity estimates that are very different from those of their pixel-wise neighbors by adding a penalty term on top of the matching cost term. SGM approximately minimizes the global (over the entire image) energy function:
$$E(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q} P_T(|D_p - D_q|) \Big)$$
C(p,Dp) is the matching cost for a pixel with disparity Dp, q is a neighboring pixel, and PT(·) is some penalty function penalizing the difference in disparities. You can read more about how this method works and is optimized here: Semi-Global Matching - Motivation, Developments, and Applications and Stereo Processing by Semi-Global Matching and Mutual Information.
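For intuition, the penalty used in Hirschmüller's original SGM formulation (the second reference above) is piecewise constant: a small penalty $P_1$ for one-pixel disparity changes, which tolerates slanted surfaces, and a larger penalty $P_2$ for bigger jumps, which preserves depth discontinuities:

$$P_T(|D_p - D_q|) = \begin{cases} 0 & \text{if } |D_p - D_q| = 0 \\ P_1 & \text{if } |D_p - D_q| = 1 \\ P_2 & \text{if } |D_p - D_q| > 1 \end{cases}$$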
You will not need to implement SGM yourself, but to help you understand it, you will implement a function which computes the cost volume. You have already written code to compute the disparity map; now you will extend that code to compute the cost volume. Instead of taking the argmin of the similarity error profile, we will store the entire error profile at each pixel location along the third dimension. If we have an input image of dimension (H,W,C) and a max search bound of D, the cost_volume will be a tensor of dimension (H,W,D). The cost volume at pixel (i,j) is the error profile obtained for the patch in the left image centered at (i,j).
In part1c_disparity_map.py, you will implement calculate_cost_volume() to compute the cost volume described above.
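Concretely, the change from the disparity-map code is small: keep the whole error profile instead of its argmin. A hedged NumPy sketch, with the same caveat that the real signature in part1c_disparity_map.py will differ:

import numpy as np

def cost_volume_sketch(left, right, block=9, D=50, sim=None):
    # Store the full error profile at every pixel: output shape (H, W, D).
    if sim is None:
        sim = lambda a, b: np.sum(np.abs(a - b))
    h, w = left.shape
    half = block // 2
    cost_volume = np.full((h, w, D), np.inf)  # inf marks infeasible shifts
    for y in range(half, h - half):
        for x in range(half, w - half):
            p1 = left[y - half:y + half + 1, x - half:x + half + 1]
            for d in range(min(D, x - half + 1)):
                p2 = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost_volume[y, x, d] = sim(p1, p2)
    return cost_volume  # argmin over the last axis recovers the disparity map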
PART 2
The goal of this project is to create stereo depth estimation algorithms, both classical and deep-learning-based. In part 1, you implemented classical stereo depth estimation algorithms, using deterministic functions to evaluate patches and obtain a disparity map. In part 2, you will implement a deep-learning-based algorithm to estimate the disparity map. Specifically, you will 1) implement patch generation and the architecture of the MC-CNN model in part2_*.py and work through part2_disparity.ipynb, making sure you pass all the sanity checks for part 2 before starting training; and 2) use part2_mc_cnn.ipynb to run the training and visualize the results of your model.
2 Learning-based stereo matching
In the previous section, you saw how we can use simple concepts like SAD and SSD to compute matching costs between two patches and produce disparity maps. Now let’s try something different – instead of using SAD or SSD to measure similarity, we will train a neural network and learn from the data directly.
Introduction
You’ll implement what was proposed in the paper [Zbontar & LeCun, 2015] and evaluate how it performs compared to classical cost matching approaches. The paper proposes several network architectures; we will use the accurate architecture for the Middlebury stereo dataset. This dataset provides a ground truth disparity map for each stereo pair, which means we know exactly where the match is supposed to be on the epipolar line. This allows us to extract many such matches and train the network to identify which pairs of patches should match and which shouldn’t. You should definitely read the paper in more detail if you’re curious about how it works.
You don’t have to worry about the dataset – we provide images in a ready-to-use format (already rectified). In fact, you won’t be doing much coding in this part. Rather, you should focus on experimenting and thinking about why you get the results you do. Your report carries a lot of weight in this part, so try to be as clear as possible.
Note: The network in Part 2.2.1 can take around 15-30 mins to train on Colab. We suggest you start early and don’t wait until the last minute.
2.1 PyTorch functions on CPU
In this part, we will implement the MCNET network architecture described in the paper (see Figure 1), generate patches for the training process, and calculate disparity using MCNET.
The corresponding notebook for this part is part2_disparity.ipynb.
2.1.1 Network architecture
MCNET
We will follow the description of the “accurate” network for the Middlebury dataset. The inputs to the network are two image patches, one from the left image and one from the right. Each passes through a series of convolution + ReLU layers. The extracted features are then concatenated and passed through additional fully connected + ReLU layers. The output is a single real number between 0 and 1, indicating the similarity between the two input patches [Zbontar & LeCun, 2015]. Since training from scratch would take a very long time to converge, you’ll start from our pre-trained network instead. In order to load the pre-trained network, you must first implement the architecture exactly as described below:
Figure 1: Visualization of network architecture.
For efficiency we will convolve both input images in the same batch (this means that the input to the network will be 2 × batch_size). After the convolutional layers, we will reshape the result into [batch_size, conv_out], where conv_out is the flattened output size of the convolutional layers. This will then be passed through a series of fully connected layers and finally a sigmoid layer to bound the output value to [0, 1].
Here is an example of a network with num_conv_layers = 1 and num_fc_layers = 2:
conv_layers = nn.Sequential(
    nn.Conv2d(in_channel, num_feature_map, kernel_size=kernel_size,
              stride=1, padding=(kernel_size // 2)),
    nn.ReLU(),
)

fully_connected_layers = nn.Sequential(
    nn.Linear(conv_out, num_hidden_units),
    nn.ReLU(),
    nn.Linear(num_hidden_units, 1),
    nn.Sigmoid(),
)

conv_feature_batch = conv_layers(input_batch)
conv_feature_batch = conv_feature_batch.reshape((batch_size, conv_out))
output_batch = fully_connected_layers(conv_feature_batch)
In part2a_network.py, you will implement the following network architecture:
MCNET: Implement the network architecture as described in the paper.
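To show how the pieces above might fit together, here is a hedged nn.Module sketch that generalizes the one-conv-layer example. Every hyperparameter name and default here is a placeholder; the exact interface expected by the autograder is documented in part2a_network.py:

import torch
import torch.nn as nn

class MCNetSketch(nn.Module):
    # Illustrative MC-CNN "accurate"-style network; names are placeholders.
    def __init__(self, in_channel=1, num_feature_map=112, kernel_size=3,
                 num_conv_layers=5, num_hidden_units=384, num_fc_layers=3,
                 window_size=11):
        super().__init__()
        convs, c = [], in_channel
        for _ in range(num_conv_layers):
            convs += [nn.Conv2d(c, num_feature_map, kernel_size=kernel_size,
                                stride=1, padding=kernel_size // 2), nn.ReLU()]
            c = num_feature_map
        self.conv_layers = nn.Sequential(*convs)

        # Left and right features are concatenated before the FC layers,
        # hence the factor of 2 in the flattened size.
        conv_out = 2 * num_feature_map * window_size * window_size
        fcs, n = [], conv_out
        for _ in range(num_fc_layers - 1):
            fcs += [nn.Linear(n, num_hidden_units), nn.ReLU()]
            n = num_hidden_units
        fcs += [nn.Linear(n, 1), nn.Sigmoid()]
        self.fully_connected_layers = nn.Sequential(*fcs)

    def forward(self, input_batch):
        # input_batch: (2 * batch_size, C, H, W), with each left/right pair
        # adjacent so a contiguous reshape groups the pair into one row.
        feats = self.conv_layers(input_batch)
        feats = feats.reshape(input_batch.shape[0] // 2, -1)
        return self.fully_connected_layers(feats)  # similarity in [0, 1]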
2.1.2 Patch generation
In part2b_patch.py, you will implement gen_patch() to extract a patch from an image.
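Since the images are tensors, extracting a patch amounts to a slice. A tiny sketch (the actual gen_patch() signature and boundary handling are specified in the starter code; names here are placeholders):

import torch

def gen_patch_sketch(image, x, y, half):
    # Square window of side 2*half + 1 centered at (x, y) in a (C, H, W) image.
    return image[:, y - half:y + half + 1, x - half:x + half + 1]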
2.1.3 Disparity map calculation with MCNET
The core logic for calculating disparity with MCNET remains the same, but we will have to do a few things differently. It will take around 1-2 minutes to generate the disparity map if implemented correctly. The steps required are as follows:
We will operate on convolutional features instead of raw pixels. Pass the images through the convolutional block of MCNET to obtain the features.
Pick a patch in the left image features, P1.
Calculate the search space of corresponding patches in the right image features: as before, place the patch in the corresponding location in the right image features, and slide it to obtain a sequence of candidate patches.
Concatenate these patches at the 0th dimension to form a batch of patches.
Compute the similarity values over the entire window using the similarity function provided to you. All the similarity values over the window will be stored in a (k × 1) tensor, where k is the number of candidate patches.
Pick the patch with the minimum similarity error.
Note: It is important that the similarity calculation happens in parallel over the entire search window. Otherwise, the disparity calculation will take a really long time in the subsequent part.
In part2c_disparity.py, you will implement mc_cnn_similarity() and calculate_mccnn_cost_volume() to calculate the disparity value at each pixel using MCNET.
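A hedged sketch of the batched search described above, operating on (C, H, W) feature tensors. Here similarity_fn stands in for the provided similarity function, and all other names are illustrative:

import torch

def window_scores_sketch(similarity_fn, left_feats, right_feats, x, y, half, D):
    # Score one left-image patch against all D candidate shifts in a single
    # batched call (assumes x - (D - 1) - half >= 0).
    p1 = left_feats[:, y - half:y + half + 1, x - half:x + half + 1]
    candidates = torch.stack(
        [right_feats[:, y - half:y + half + 1, x - d - half:x - d + half + 1]
         for d in range(D)], dim=0)                  # (D, C, h, w) batch
    p1_batch = p1.unsqueeze(0).expand(D, -1, -1, -1)
    scores = similarity_fn(p1_batch, candidates)     # (D, 1) similarity errors
    return int(torch.argmin(scores.view(-1)))        # best shift = disparity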
Note: before proceeding to the next part, you need to ensure that all sanity checks for this part pass, both by running part2_disparity.ipynb in Jupyter and by running pytest proj4_unit_tests.
2.2 Train and evaluation on Google Colab
In this part, we will train the MCNET architecture and evaluate the overall performance.
Setup
We will be using Google Colab, which is a cloud-based Jupyter notebook environment. You can choose to run this section locally as well, especially if you have a good GPU; however, the assignment is designed to run on Colab with a GPU (the project is doable without one, but a GPU makes the process much faster and frustration-free). These are the steps to follow:
Upload ipynb to Google Colab
Zip semiglobalmatching and proj4_code into a zip file and upload it to the Colab runtime.
Unzip the uploaded file using !unzip -qq uploaded_file.zip -d ./
In Colab, make sure you select “GPU” in the menu (“Runtime” → “Change runtime type” → “Hardware accelerator”).
You will need to follow the instructions in Setup, Compute Requirements, and DataLoader in the notebook to download the necessary data and set up the environment in Google Colab.
2.2.1 Train MCNET
In this part, we will train a neural network that learns to classify two patches as a positive vs. negative match. Your task is to train the best network you can by experimenting with the learning parameters. The following are the experiments you need to complete for this part:
Experiment with the learning rate: try using large (around 1) vs. small (< 1e-5) values. Based on your output visualizations, answer the reflection questions in the report.
Experiment with the window size: in the previous part, we used a window size of 11 as suggested in the paper, meaning that the input to the network is a pair of 11x11 patches. This corresponds to the block size that will be used when performing stereo matching later on. Experiment with other window sizes, namely 5x5, 9x9, and 15x15, and compare the performance.
Tune the training parameters and pick the combination of hyperparameters that yields the best disparity map visualization. You should show the training loss plot in the report and answer the reflection questions there. Typically, models with an average error of around 20 tend to pass the Gradescope tests.
2.2.2 Evaluate stereo matching
In this part, we will again generate the disparity map but this time from our newly trained matching cost network. We will use calculate_mc_cnn_disparity from part2c_disparity.py for this.
Note that all the required functions in Part 2.1 need to be implemented correctly before starting this part.
Hint: you don’t have to re-train the network every time you want to evaluate, as long as your saved model is in the Colab file system. Don’t forget to change load_path to your best model.
Then we will evaluate your trained network as a stereo matching cost using the metrics from the Middlebury stereo leaderboard. For the bicycle image, you should see an improvement when using the trained network vs. SAD cost matching:
avgerr: average absolute error in pixels (lower is better)
bad1: percentage of bad pixels whose error is > 1 (lower is better)
bad2: percentage of bad pixels whose error is > 2 (lower is better)
bad4: percentage of bad pixels whose error is > 4 (lower is better)
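These metrics reduce to simple array reductions over the absolute disparity error. A minimal NumPy sketch for intuition (illustrative only, not the grading code):

import numpy as np

def middlebury_metrics_sketch(pred, gt):
    # pred, gt: (H, W) disparity maps.
    err = np.abs(pred - gt)
    return {
        "avgerr": float(err.mean()),              # mean absolute error (px)
        "bad1": float((err > 1).mean() * 100.0),  # % of pixels off by > 1 px
        "bad2": float((err > 2).mean() * 100.0),
        "bad4": float((err > 4).mean() * 100.0),
    }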