$30
The goal of this project is to introduce you to camera and scene geometry. Specifically we will estimate the camera projection matrix, which maps 3D world coordinates to image coordinates, as well as the fundamental matrix, which relates points in one scene to epipolar lines in another. The camera projection matrix and fundamental matrix can each be estimated using point correspondences. To estimate the projection matrix (camera calibration), the input is corresponding 3D and 2D points. To estimate the fundamental matrix the input is corresponding 2D points across two images. You will start out by estimating the projection matrix and the fundamental matrix for a scene with ground truth correspondences. Then you will move on to estimating the fundamental matrix using point correspondences that are obtained using SIFT.
Remember these challenging images of Gaudi’s Episcopal Palace from project 2? By using RANSAC to find the fundamental matrix with the most inliers, we can filter away spurious matches and achieve near perfect point-to-point matching as shown below:
Figure 2: Gaudi’s Episcopal Palace.
1 Part 1: Camera projection matrix
Introduction
The goal is to compute the projection matrix that goes from world 3D coordinates to 2D image coordinates. Recall that using homogeneous coordinates the equation for moving from 3D world to 2D camera coordinates is:
(1)
Another way of writing this equation is:
(2)
→ (m31X + m32Y + m33Z + m34)u = m11X + m12Y + m13Z + m14
→ 0 = m11X + m12Y + m13Z + m14 − m31uX − m32uY − m33uZ − m34u
(3)
→ (m31X + m32Y + m33Z + m34)v = m21X + m22Y + m23Z + m24
→ 0 = m21X + m22Y + m23Z + m24 − m31vX − m32vY − m33vZ − m34v
At this point, you’re almost able to set up your linear regression to find the elements of the matrix M. There’s only one problem–the matrix M is only defined up to a scale. Therefore, these equations have many different possible solutions (in particular M = all zeros is a solution, which is not very helpful in our context). The way around this is to first fix a scale, and then do the regression. There are several options for doing this: 1) You can fix the last element, m34 = 1, and then find the remaining coefficients, or 2) you can use the singular value decomposition to directly solve the constrained optimization problem:
arg min kAxk
x (4)
s.t. kxk = 1
To make sure that your code is correct, we are going to give you a set of “normalized points” in the files pts2d-norm-pic_a.txt and pts3d-norm.txt. If you solve for M using all the points, you should get a matrix that is a scaled equivalent of the following:
−0.4583 0.2947 0.0139 −0.0040
For example, this matrix will take the last normalized 3D point, < 1.2323,1.4421,0.4506,1.0 , and project it to < u,v of < 0.1419,0.4518 , converting the homogeneous 2D point < us,vs,s to its in homogeneous version (the transformed pixel coordinate in the image) by dividing by s.
First, you will need to implement the least squares regression to solve for M given the corresponding normalized points. The starter code will load 20 corresponding normalized 2D and 3D points. You have to write the code to set up the linear system of equations, solve for the unknown entries of M, and reshape it into the estimated projection matrix. To validate that you’ve found a reasonable projection matrix, we’ve provided evaluation code which computes the total “residual” between the projected 2D location of each 3D point and the actual location of that point in the 2D image. The residual is just the distance (square root of the sum of squared differences in u and v). This should be very small.
Once you have an accurate projection matrix M, it is possible to tease it apart into the more familiar and more useful matrix K of intrinsic parameters and matrix [R|T] of extrinsic parameters. For this project we will only ask you to estimate one particular extrinsic parameter: the camera center in world coordinates. Let us define M as being composed of a 3 × 3 matrix, Q, and a 4th column, m4:
(5)
From class we said that the center of the camera C could be found by:
C = −Q−1m4 (6)
To debug your code, if you use you the normalized 3D points to get the M given above, you would get a camera center of:
CnormA =< −1.5125,−2.3515,0.2826
We’ve also provided a visualization which will show the estimated 3D location of the camera with respect to the normalized 3D point coordinates.
In part1_projection_matrix.py, you will implement the following:
• projection(): Projects homogeneous world coordinates [X,Y,Z,1] to non-homogeneous image coordinates (u,v). Given projection matrix M, the equations that accomplish this are (2) and (3).
• calculate_projection_matrix(): Solves for the camera projection matrix using a system of equations set up from corresponding 2D and 3D points.
• calculate_camera_center(): Computes the camera center location in world coordinates.
2 Part 2: Fundamental matrix
Figure 3: Two-camera setup. Reference: Szeliski, p. 682.
The next part of this project is estimating the mapping of points in one image to lines in another by means of the fundamental matrix. This will require you to use similar methods to those in part I. We will make use of the corresponding point locations listed in pts2d-pic_a.txt and pts2d-pic_b.txt. Recall that the definition of the fundamental matrix is:
= 0 (7)
for a point (u,v,1) in image A, and a point (u0,v0,1) in image B. See Appendix A for the full derivation. Note: the fundamental matrix is sometimes defined as the transpose of the above matrix with the left and right image points swapped. Both are valid fundamental matrices, but the visualization functions in the starter code assume you use the above form.
Another way of writing this matrix equations is:
= 0 (8)
Which is the same as:
= 0 (9)
Starting to see the regression equations? Given corresponding points you get one equation per point pair. With 8 or more points you can solve this (why 8?). Similar to part I, there’s an issue here where the matrix is only defined up to scale and the degenerate zero solution solves these equations. So you need to solve using the same method you used in part I of first fixing the scale and then solving the regression.
The least squares estimate of F is full rank; however, a proper fundamental matrix is a rank 2. As such we must reduce its rank. In order to do this, we can decompose F using singular value decomposition into the matrices UΣV 0 = F. We can then construct a rank 2 matrix by setting the smallest singular value in Σ to zero thus generating Σ2 . The fundamental matrix is then easily calculated as F = UΣ2V 0. You can check your fundamental matrix estimation by plotting the epipolar lines using the plotting function provided in the starter code.
Coordinate normalization
As discussed in lecture, your estimate of the fundamental matrix can be improved by normalizing the coordinates before computing the fundamental matrix (see [1] by Hartley). It is suggested for this project you perform the normalization through linear transformations as described below to make the mean of the points zero and the average magnitude 1.0.
(10)
The transform matrix T is the product of the scale and offset matrices. cu and cv are the mean coordinates. To compute a scale s you could estimate the standard deviation after subtracting the means. Then the scale factor s would be the reciprocal of whatever estimate of the scale you are using. You could use one scale matrix based on the statistics of the coordinates from both images or you could do it per image.
In part2_fundamental_matrix.py you will need to use the scaling matrices to adjust your fundamental matrix so that it can operate on the original pixel coordinates. This is performed as follows:
Forig = TbT ∗ Fnorm ∗ Ta (11)
In part2_fundamental_matrix.py, you will implement the following:
• normalize_points(): Normalizes the 2D coordinates.
• unnormalize_F(): Adjusts the fundamental matrix to account for the normalized coordinates. See Appendix B.
• estimate_fundamental_matrix(): Calculates the fundamental matrix.
3 Part 3: Fundamental matrix with RANSAC
For two photographs of a scene it’s unlikely that you’d have perfect point correspondence with which to do the regression for the fundamental matrix. So, next you are going to compute the fundamental matrix with point correspondences computed using SIFT. As discussed in class, least squares regression alone is not appropriate in this scenario due to the presence of multiple outliers. In order to estimate the fundamental matrix from this noisy data you’ll need to use RANSAC in conjunction with your fundamental matrix estimation.
You’ll use these putative point correspondences and RANSAC to find the “best” fundamental matrix. You will iteratively choose some number of point correspondences (8, 9, or some small number), solve for the fundamental matrix using the function you wrote for part II, and then count the number of inliers. Inliers in this context will be point correspondences that “agree” with the estimated fundamental matrix. In order to count how many inliers a fundamental matrix has, you’ll need a distance metric based on the fundamental matrix. (Hint: For a point correspondence (x,x0) what properties does the fundamental matrix have?). You’ll need to pick a threshold between inliers and outliers and your results are very sensitive to this threshold, so explore a range of values. You don’t want to be too permissive about what you consider an inlier, nor do you want to be too conservative. Return the fundamental matrix with the most inliers.
Recall from lecture the expected number of iterations of RANSAC to find the “right” solution in the presence of outliers. For example, if half of your input correspondences are wrong, then you have a (0.5)8 = 0.39% chance to randomly pick 8 correspondences when estimating the fundamental matrix. Hopefully that correct fundamental matrix will have more inliers than one created from spurious matches, but to even find it you should probably do thousands of iterations of RANSAC.
For many real images, the fundamental matrix that is estimated will be “wrong” (as in it implies a relationship between the cameras that must be wrong, e.g., an epipole in the center of one image when the cameras actually have a large translation parallel to the image plane). The estimated fundamental matrix can be wrong because the points are co-planar or because the cameras are not actually pinhole cameras and have lens distortions. Still, these “incorrect” fundamental matrices tend to do a good job at removing incorrect SIFT matches (and, unfortunately, many correct ones).
For this part, you will implement the following methods in part3_ransac.py:
• calculate_num_ransac_iterations(): Calculates the number of RANSAC iterations needed for a given guarantee of success.
• ransac_fundamental_matrix(): Uses RANSAC to find the best fundamental matrix.
4 Part 4: Performance comparison
In this part, you will compare the performance of using the direct least squares method against RANSAC. You can use the code in the notebook for visualization of the fundamental matrix estimation. The first one uses a subset of the matches and the direct linear method (which corresponds to the estimate_fundamental_matrix () in part 2) to compute the fundamental matrix. The second method uses RANSAC, which corresponds to ransac_fundamental_matrix() in part 3. Based on your output visualizations, answer the reflection questions in the report.
5 Part 5: Visual odometry
Visual odometry (VO) is an important part of the simultaneous localization and mapping (SLAM) problem. In this part, you can observe the implementation of VO on a real-world example from Argoverse. VO will allow us to recreate most of the ego-motion of a camera mounted on a robot – the relative translation (but only up to an unknown scale), and the relative rotation. See [2] for a more detailed understanding. Based on the output and your understanding of the process, answer the reflection questions in the report.