Several estimation algorithms in computer vision are called “linear algorithms”. These are typically computationally cheap and can provide an estimate of the quantity of interest without an initial guess. On the other hand, the estimate is usually not optimal in a geometrically meaningful sense, and may therefore be suboptimal in the presence of noise. Because of this, linear algorithms are most often used to obtain a rough initial estimate, possibly in combination with RANSAC, which is then further refined by non-linear optimization of a geometric and/or probabilistic objective function.
In this assignment you will learn how to derive linear algorithms using the direct linear transform, and implement an algorithm to estimate the pose of a planar object from 2D-3D point correspondences. As a simple test case, you will apply the algorithm to a paper sheet with fiducial markers. These markers have known positions on the paper and are designed to be easily detected and uniquely identified, which simplifies the problem of establishing correspondences.
Figure 1: Output from the algorithm you will implement, showing the estimated object coordinate frame, along with the detected and predicted location of the markers.
Background
Linear algorithms appear in several places in Szeliski (2010), including 6.1 (Feature-based alignment), 7.1 (Triangulation) and 7.2 (Two-frame structure from motion). The pose estimation algorithm in this assignment is not presented in Szeliski; however, its derivation and implementation are similar to those of other linear algorithms in the book. For the curious, the algorithm here is based on the camera calibration paper by Zhengyou Zhang, which has had a heavy influence on the development of calibration software:
- Zhengyou Zhang. A Flexible New Technique for Camera Calibration. 2000. (link)
Linear algorithms are based on solving linear systems of equations. A good overview of this topic can be found in Appendix 5 (Least-squares minimization) of Multiple View Geometry by Hartley and Zisserman, which is available on Blackboard.
The first step of the pose estimation algorithm is to estimate a homography between the object plane and the image. Homographies (or projective transformations) are briefly introduced in Szeliski 2.1.2. A more detailed treatment can be found in Hartley and Zisserman, but is not necessary for completing this assignment. In summary, a homography is a mapping between two (homogeneous) 2D coordinates:
$$\tilde{\mathbf{x}}' = H\tilde{\mathbf{x}} \tag{1}$$
where H is an arbitrary 3 × 3 matrix. Besides the perspective image of a planar object, other transformations that can be described by a homography include the mapping between two images of a planar scene and the mapping between two images taken by a rotating camera.
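To make the mapping concrete, here is a small Python example (not part of the assignment code; the matrix values are arbitrary) that applies a homography to a point in homogeneous coordinates and dehomogenizes the result:

```python
import numpy as np

# Any non-singular 3x3 matrix defines a homography (values here are arbitrary).
H = np.array([[1.0,   0.1,  2.0],
              [0.0,   1.2, -1.0],
              [0.001, 0.0,  1.0]])

x_tilde = np.array([3.0, 4.0, 1.0])    # homogeneous form of the 2D point (3, 4)
x_mapped = H @ x_tilde                 # apply Eq. (1)
x_prime = x_mapped[:2] / x_mapped[2]   # dehomogenize to get the mapped 2D point
print(x_prime)
```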
Part 1 Transformation between a plane and its image
Consider a perspective image of a planar object. Without loss of generality, let points on the object have coordinates of the form X = (X,Y,0), arbitrarily placing the object plane at Z = 0. The pixel coordinates of a given point on the object are
$$u = c_x + s_x f \, X_c / Z_c \tag{2}$$
$$v = c_y + s_y f \, Y_c / Z_c \tag{3}$$
where
$$(X_c, Y_c, Z_c) = R\mathbf{X} + \mathbf{t} \tag{4}$$
is the point transformed from object to camera coordinates by R and t, which we refer to as the "pose". For convenience, we define the calibrated image coordinates x = (x, y) as
$$x := (u - c_x)/(s_x f) \tag{5}$$
$$y := (v - c_y)/(s_y f) \tag{6}$$
or, more generally, $\tilde{\mathbf{x}} = K^{-1}\tilde{\mathbf{u}}$ for an arbitrary intrinsic matrix K, which can be thought of as "camera-agnostic" coordinates. Combining the above, we find the following non-linear relationship between points on the object (X, Y) and calibrated image coordinates (x, y):
$$x = \frac{r_{11}X + r_{12}Y + t_x}{r_{31}X + r_{32}Y + t_z}, \tag{7}$$
$$y = \frac{r_{21}X + r_{22}Y + t_y}{r_{31}X + r_{32}Y + t_z}. \tag{8}$$
If we use homogeneous coordinates, this relationship can be written in the linear form
$$\begin{bmatrix} \tilde{x} \\ \tilde{y} \\ \tilde{z} \end{bmatrix} = H \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} \quad \text{where} \quad H = \begin{bmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \end{bmatrix} \tag{9}$$
with $\tilde{\mathbf{x}} = (\tilde{x}, \tilde{y}, \tilde{z})$ being the homogeneous form of (x, y).
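As an illustration of Eq. (9) (not part of the required code), the following sketch builds H from a made-up pose (R, t) and maps an object point to calibrated image coordinates; the rotation angle, translation and point values are arbitrary:

```python
import numpy as np

# Made-up pose for illustration: rotation about the x-axis plus a translation.
theta = np.deg2rad(30)
R = np.array([[1, 0,              0            ],
              [0, np.cos(theta), -np.sin(theta)],
              [0, np.sin(theta),  np.cos(theta)]])
t = np.array([0.1, -0.2, 2.0])

# H consists of the first two columns of R and the translation vector, as in Eq. (9).
H = np.column_stack((R[:, 0], R[:, 1], t))

X, Y = 0.05, 0.10                       # a point on the object plane (Z = 0)
x_tilde = H @ np.array([X, Y, 1.0])     # homogeneous calibrated coordinates
x, y = x_tilde[:2] / x_tilde[2]         # dehomogenize
print(x, y)
```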
Task 1.1: We say that a homography is “unique up to a scaling factor” because any non-zero scalar multiple of the homography matrix defines the same mapping between the 2D coordinates (after dehomogenization). Show that this is the case for the matrix H in Eq. (9).
Task 1.2: A general 3 × 3 homography has eight degrees of freedom (nine arbitrary entries minus scale ambiguity). However, explain why H as defined in Eq. (9) has fewer than eight degrees of freedom, i.e. why the entries of H are more restricted than in an arbitrary homography.
Part 2 The direct linear transform
The direct linear transform (DLT) is a general technique for transforming a set of non-linear equations into a mathematically equivalent linear system that can be solved more easily. Here we will use the DLT to estimate H from a set of correspondences (xi,yi) ↔ (Xi,Yi). Note that although H as defined in Eq. (9) is more restricted, the algorithm here can be used to estimate an arbitrary homography.
The key idea of the DLT is to rearrange Eq. (7)-(8) by multiplying by the denominator on both sides, such that the pair of equations becomes
$$(r_{31}X_i + r_{32}Y_i + t_z)\,x_i = r_{11}X_i + r_{12}Y_i + t_x, \tag{10}$$
$$(r_{31}X_i + r_{32}Y_i + t_z)\,y_i = r_{21}X_i + r_{22}Y_i + t_y. \tag{11}$$
Notably, these equations are linear in the elements of H. Hence, if we collect all the unknowns that we wish to estimate into a vector, e.g.
$$\mathbf{h} = \begin{bmatrix} r_{11} & r_{12} & t_x & r_{21} & r_{22} & t_y & r_{31} & r_{32} & t_z \end{bmatrix}^T, \tag{12}$$
then you may verify that Eq. (10)-(11) can be written as the linear system
$$A_i \mathbf{h} = \mathbf{0}, \tag{13}$$
where
$$A_i = \begin{bmatrix} X_i & Y_i & 1 & 0 & 0 & 0 & -x_i X_i & -x_i Y_i & -x_i \\ 0 & 0 & 0 & X_i & Y_i & 1 & -y_i X_i & -y_i Y_i & -y_i \end{bmatrix}. \tag{14}$$
Each point correspondence gives rise to one pair of equations. By vertically stacking the equations from n correspondences, we obtain a linear system Ah = 0, where A is a 2n × 9 matrix. This is called a homogeneous system. Unlike inhomogeneous systems (e.g. Ah = b where b ≠ 0), homogeneous systems always have the trivial solution h = 0, which is of no interest. In order to have a non-trivial solution, A must have a null-space of dimension one or higher. There will then be an infinite number of solutions given by linear combinations of the null-space vectors. If the null-space is 1-dimensional, then there is a non-trivial solution that is also unique up to a scaling factor. (This reflects the scale ambiguity you showed in Task 1.1, and will be addressed in Part 3.)
The null-space can be made 1-dimensional by stacking the equations from n ≥ 4 correspondences. If n = 4, then there is always a unique (up to scale) exact non-trivial solution, as long as no three points on either side are collinear. If n > 4, then the system is over-determined and will generally not have an exact solution apart from h = 0 unless the point locations are free of noise. In the absence of an exact solution, we normally seek the "best" solution in the least-squares sense
$$\min_{\mathbf{h}} \|A\mathbf{h}\|^2 \quad \text{subject to} \quad \|\mathbf{h}\|^2 = 1 \tag{15}$$
where we arbitrarily impose the scale constraint $\|\mathbf{h}\|^2 = 1$ to avoid the trivial solution. The least-squares solution h can be obtained from the singular value decomposition (SVD) of $A = U\Sigma V^T$ as the column of V corresponding to the smallest singular value. (A short proof of this is in section A5.3 of the Hartley and Zisserman chapter on Blackboard.)
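As a rough illustration of this solution method (a sketch, not a prescribed implementation), the following Python snippet builds A from the rows in Eq. (14) and extracts h via NumPy's SVD. The array names xy (calibrated image coordinates) and XY (object plane coordinates), both n × 2, are assumptions for the example:

```python
import numpy as np

def estimate_H_sketch(xy, XY):
    """Estimate the homography H from calibrated image coordinates xy (n x 2)
    and object plane coordinates XY (n x 2)."""
    n = XY.shape[0]
    A = []
    for i in range(n):
        X, Y = XY[i]
        x, y = xy[i]
        # The two rows of A_i from Eq. (14).
        A.append([X, Y, 1, 0, 0, 0, -x*X, -x*Y, -x])
        A.append([0, 0, 0, X, Y, 1, -y*X, -y*Y, -y])
    A = np.array(A)

    # The minimizer of ||Ah|| subject to ||h|| = 1 is the right singular vector
    # of A associated with the smallest singular value (last row of V^T).
    _, _, VT = np.linalg.svd(A)
    h = VT[-1]
    return h.reshape(3, 3)   # rows of H in the same order as the entries of h
```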
The following data is included in this assignment:
- K.txt: Camera intrinsic matrix K.
- XY.txt: Markers’ paper coordinates (X,Y ).
- image0000.jpg...image0022.jpg: Image sequence.
- detections.txt: Detected markers (one row per image). Each row contains 24 tuples of the form (di,ui,vi), where di = 1 if the i’th marker was detected and (ui,vi) are its detected pixel coordinates. Note that only one corner of each marker is used.
The main.py/m script has some helper code for loading the data and generating the requested figures. Stub functions are provided in equivalently named files (Matlab) or in common.py (Python).
Task 2.1: Implement estimate_H. This should build the matrix A, solve for the vector h and reshape the entries of h into the 3×3 matrix H. Compute the predicted marker locations using H and run the main script, which should visualize the marker locations as in Fig. 1. (You will also get a 3D plot visualizing the camera and the object, but this will not work until Task 3.) Check that the predicted locations are close to the detected locations on all images and include the figure for image number 4.
Tip: Remember to convert the detected marker locations into calibrated image coordinates for use in estimate_H. These can be computed as $\tilde{\mathbf{x}}_i = K^{-1}\tilde{\mathbf{u}}_i$, where $\tilde{\mathbf{u}}_i = (u_i, v_i, 1)$. After estimating H, the predicted marker locations can be computed using Eq. (9), followed by conversion from calibrated image coordinates back to pixel coordinates: $\tilde{\mathbf{u}}_{i,\text{predicted}} = K\tilde{\mathbf{x}}_{i,\text{predicted}}$. Note that either conversion produces a homogeneous 3-vector, which must be divided by the last component (and sliced) to obtain the actual 2D coordinates that we measure in the image.
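For concreteness, a sketch of this conversion and prediction pipeline is given below. It assumes K, the detected pixel coordinates uv (n × 2) and the marker coordinates XY (n × 2) have already been loaded, and that estimate_H takes its arguments in the order shown; the stub in the hand-out may differ:

```python
import numpy as np

# Assumed already loaded: K (3x3), uv (n x 2 detected pixel coordinates for one
# image), XY (n x 2 marker paper coordinates).

# Pixel -> calibrated image coordinates: x~_i = K^{-1} u~_i.
uv1 = np.column_stack((uv, np.ones(len(uv))))
xy1 = (np.linalg.inv(K) @ uv1.T).T
xy = xy1[:, :2] / xy1[:, 2:]

H = estimate_H(xy, XY)   # Task 2.1 (argument order assumed here)

# Predict marker locations with Eq. (9), then convert back to pixel coordinates.
XY1 = np.column_stack((XY, np.ones(len(XY))))
uv_tilde = (K @ H @ XY1.T).T
uv_predicted = uv_tilde[:, :2] / uv_tilde[:, 2:]   # divide by last component, slice
```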
Task 2.2: The distance between a marker’s detected and predicted location is its reprojection error. It is measured in pixels and is defined as the Euclidean distance between pixel coordinates:
$$e_i = \|\mathbf{u}_i - \mathbf{u}_{i,\text{predicted}}\|_2 = \sqrt{(\mathbf{u}_i - \mathbf{u}_{i,\text{predicted}})^T(\mathbf{u}_i - \mathbf{u}_{i,\text{predicted}})} \tag{16}$$
where ui are the detected pixel coordinates of the i’th marker and ui,predicted are computed as above. Modify your script to compute the average, minimum and maximum reprojection error over all markers in each image. Include the numbers for image number 4 in your writeup. The average reprojection error should be less than 1 pixel on this image.
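For instance, with the detected and predicted pixel coordinates stored in n × 2 arrays uv and uv_predicted (as in the sketch above, restricted to the markers with $d_i = 1$), the statistics could be computed as:

```python
import numpy as np

# uv and uv_predicted: n x 2 arrays of detected and predicted pixel coordinates
# (only markers with d_i = 1 should be included).
errors = np.linalg.norm(uv - uv_predicted, axis=1)   # Eq. (16), one value per marker
print("avg: %.3f  min: %.3f  max: %.3f" % (errors.mean(), errors.min(), errors.max()))
```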
Task 2.3: The matrices K and H define a mapping from the object plane to the image. When this mapping is invertible, it is possible to go from pixel coordinates back to object coordinates. Use this to extract the texture of the object, i.e. the pattern of markers, into its own image. This is also called a “perspective-free” or “fronto-parallel” image and is highly useful for image processing. For example, parallel lines remain parallel, and circles and other shapes are preserved.
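One possible approach (a sketch only, with an assumed paper size and output resolution): sample a regular grid of object-plane coordinates, map each grid point into the original image via K and H, and read off the pixel values there.

```python
import numpy as np

def extract_texture(image, K, H, paper_size=(0.21, 0.297), scale=1000):
    """Resample the image onto a fronto-parallel view of the object plane.

    paper_size is an assumed (width, height) of the paper in the same units
    as the marker coordinates; scale is output pixels per unit."""
    w_out = int(scale * paper_size[0])
    h_out = int(scale * paper_size[1])
    # Object-plane coordinates of every output pixel.
    X, Y = np.meshgrid(np.linspace(0, paper_size[0], w_out),
                       np.linspace(0, paper_size[1], h_out))
    XY1 = np.stack((X.ravel(), Y.ravel(), np.ones(X.size)))
    # Map object coordinates into the original image (object -> pixel coordinates).
    uvw = K @ H @ XY1
    u = uvw[0] / uvw[2]
    v = uvw[1] / uvw[2]
    # Nearest-neighbour sampling, clipping points that fall outside the image.
    u = np.clip(np.round(u).astype(int), 0, image.shape[1] - 1)
    v = np.clip(np.round(v).astype(int), 0, image.shape[0] - 1)
    return image[v, u].reshape(h_out, w_out, -1)
```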
Part 3 Recover the pose
We can recover the pose (R and t) of the planar object by observing from Eq. (9) that H contains the translation vector and two of the rotation matrix columns. The last column of the rotation matrix is not present, but if we know any two columns, the third can always be obtained by the cross product, e.g.
r3 = r1 × r2 (17)
where ri = (r1i,r2i,r3i) is the i’th column of R. However, recall that the solution from Part 2 is only unique up to a scaling factor. (We chose the scale ||h||2 = 1 arbitrarily.) This means that the entries in the solved-for H are generally not equal to the rotation matrix and translation vector elements. Instead, there is an unknown scaling factor k such that
$$H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} = k \begin{bmatrix} r_{11} & r_{12} & t_x \\ r_{21} & r_{22} & t_y \\ r_{31} & r_{32} & t_z \end{bmatrix}. \tag{18}$$
We can determine the scaling factor k by imposing the constraint that the columns of the rotation matrix should be of unit length. Hence,
$$|k| = \sqrt{h_{11}^2 + h_{21}^2 + h_{31}^2}. \tag{19}$$
Because we can’t recover the sign of k, there are two possible poses (corresponding to +|k| or −|k|).
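A minimal sketch of such a decomposition, assuming H was estimated in calibrated coordinates as in Part 2 and using the first column of H for the normalization in Eq. (19) (one could equally use the second column, or an average of the two):

```python
import numpy as np

def decompose_H_sketch(H):
    """Return the two candidate poses as 4x4 transformation matrices."""
    k = np.linalg.norm(H[:, 0])    # |k| from the unit-length constraint on r1, Eq. (19)
    poses = []
    for sign in (+1, -1):          # the sign of k cannot be recovered
        M = H / (sign * k)         # candidate [r1 r2 t]
        r1, r2, t = M[:, 0], M[:, 1], M[:, 2]
        r3 = np.cross(r1, r2)      # Eq. (17)
        T = np.eye(4)
        T[:3, :3] = np.column_stack((r1, r2, r3))  # see Task 3.3 for making this a valid rotation
        T[:3, 3] = t
        poses.append(T)
    return poses
```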
Task 3.1: Implement decompose_H. It should take in the result from estimate_H and return its two possible pose decompositions as 4×4 transformation matrices. If you run the main script, the generated figure should now draw the object coordinate frame into the image and visualize the camera and the object in 3D. Include the figure for both possible poses on image number 4.
Task 3.2: Only one of the poses is physically plausible in the sense that it places the object in front of the camera. Specifically, every detected marker should have a positive Z coordinate when transformed into the camera frame. Use this criterion to automatically choose the correct pose. Verify that the correct pose is chosen on all images, and include the figure for image numbers 4, 5 and 21.
Task 3.3: Due to noise, the matrix formed by r1, r2 and r3 may not exactly satisfy the properties of a rotation matrix. In Appendix C of his paper (see Blackboard), Zhang suggests replacing this matrix with the "closest" valid rotation matrix in the sense of the Frobenius norm. Read Appendix C and implement the function closest_rotation_matrix. Modify decompose_H using this function to return a valid rotation matrix for both poses.
Describe the properties of a rotation matrix, and suggest a way to numerically quantify how well the properties are satisfied by a given 3 × 3 matrix. Quantify how well the properties are satisfied by the rotation matrix of the chosen pose, with and without the above correction, on image number 4.
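For reference, a sketch of the SVD-based correction described in Zhang's Appendix C, together with one possible (not prescribed) way to quantify how close a 3 × 3 matrix is to a valid rotation matrix:

```python
import numpy as np

def closest_rotation_matrix(Q):
    """Rotation matrix closest to Q in the Frobenius norm (Zhang, Appendix C).

    If Q = U S V^T is the SVD of Q, the minimizer is R = U V^T."""
    U, _, VT = np.linalg.svd(Q)
    return U @ VT

def rotation_matrix_error(R):
    """One possible measure of how far R is from a valid rotation matrix:
    deviation from orthonormality and deviation of the determinant from +1."""
    return np.linalg.norm(R.T @ R - np.eye(3)), abs(np.linalg.det(R) - 1.0)
```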
Part 4 Derive your own linear algorithm
Sometimes we have partial information about the pose, either through other sensors or by assumptions about the physically achievable motions of our robot. For example, aerial vehicles can often provide two rotational degrees of freedom from a gyroscope and accelerometer, and, if the ground is flat, one translational degree of freedom from a laser range finder. In such cases, we can derive specialized linear algorithms which may require fewer point correspondences than in the general case.
Task 4.1: Suppose you have a set of 2D-3D point correspondences ui ↔ Xi between an image and a general object (not necessarily planar), which satisfy the pinhole model:
$$\tilde{\mathbf{u}}_i = K(R\mathbf{X}_i + \mathbf{t}), \quad i = 1, \dots, n. \tag{20}$$
Assume that K and t are both known, and show (using the DLT) that n correspondences can be combined into a linear system of equations $A\mathbf{m} = \mathbf{b}$, where both $A \in \mathbb{R}^{2n \times 9}$ and $\mathbf{b} \in \mathbb{R}^{2n}$ are derived purely from known variables or constants, and m contains the unknown entries of R.
Task 4.2: Is the system homogeneous or inhomogeneous?
Task 4.3: Suggest a way to solve the system for m and recover R.