TK4255 Assignment 3: Geometric Image Formation

The pinhole camera model
A common model of geometric image formation is the perspective projection. The simplest imaging device that can be described by this model is the pinhole camera: a box with a small opening (the pinhole) at one end, through which light enters and reaches an imaging surface at the opposite end. The ideal pinhole camera is a theoretical device in which the opening is made infinitely small. This has the consequence that every point on the imaging surface receives light from a unique direction, all of which intersect at the pinhole. The resulting mapping from scene to image is a perspective projection.

Letting f denote the distance between the pinhole and the imaging surface in an ideal pinhole camera, the relationship between a point (X, Y, Z) in the scene and its projected location (X′, Y′, Z′) on the imaging surface can be derived by a consideration of similar triangles (see Fig. 1):

                                                  X′ = −f X/Z,                                              (1)

                                                  Y′ = −f Y/Z,                                              (2)

and finally Z′ = −f. These equations will differ based on how you place the camera’s axes. In this course we will follow the convention in which positive Z is in front of the camera, positive X is to the right and positive Y is downward, forming a right-handed coordinate system. Regardless of the chosen convention, these equations imply that the image will appear to be inverted (i.e. rotated 180 degrees), compared with what you would see through the pinhole. This inversion also occurs in a modern digital camera, but the camera firmware and your operating system work together to ensure that the rotation is undone when the image is presented on your screen.
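As a quick numerical illustration (the values are chosen arbitrarily): with f = 50 mm, a point at (X, Y, Z) = (0.2 m, 0.1 m, 2 m) projects to X′ = −f X/Z = −5 mm and Y′ = −f Y/Z = −2.5 mm, i.e. on the opposite side of the optical axis, consistent with the inversion described above.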

A modern digital camera similarly consists of a box with an opening at one end, which allows light to enter and hit an imaging surface (usually a CCD or CMOS sensor). The difference is that the opening is not infinitely small. Instead, it is made large to collect more light, and a lens is placed in front of the opening to focus the light. What is important to note is that the geometric transformation of points in the scene to points on the sensor can be—or at least, is often—modeled by the same equations (perspective projection). This is somewhat surprising when it is considered that modern cameras have complex optics which bend light at multiple stages. The fact that the model continues to be applicable is in part because camera and lens engineers strive to produce a perspective projection, as the resulting images appear most natural and pleasing to a human observer.


In a high-quality camera, the agreement can be good enough that no further modeling is required. Otherwise, the remaining errors can often be modeled during camera calibration and corrected to produce an equivalent perspective projection image. Because of this, the perspective projection model is applicable to the majority of commercially available cameras. However, it is worth mentioning that some cameras, such as fisheye lens cameras, do not aim to produce natural-looking images, and may deviate substantially from a perspective projection. In such cases, a different camera model may be warranted.

When working with a digital camera, we also need to model the transformation from locations on the physical sensor surface (in metric units) to the pixel coordinates (rows and columns) used to access the digital image. With few exceptions, the pixel coordinate system is placed in the upper left corner of the non-inverted image, with the horizontal axis pointing to the right in the scene and the vertical axis pointing down in the scene. Letting (u, v) denote the horizontal and vertical pixel coordinates, these are then related to (X′, Y′) by a negation, scale and offset:

                                                                 u = cx − sx X′           and     v = cy − sy Y′,

which can be simplified and written in terms of the original (X, Y, Z) coordinates as:

                                                  u = cx + sx f X/Z          (3)          and          v = cy + sy f Y/Z.          (4)

Note that the negation above is mathematically equivalent to placing the imaging surface in front of the camera (i.e. at Z = f). However, this makes no sense physically, and misleadingly suggests that points with Z ∈ [0,f] are behind the imaging surface and therefore not visible, which is false.
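Continuing the small numerical example from above (again with arbitrary values): if sx = sy = 100 pixels/mm and (cx, cy) = (512, 512), Eqs. 3–4 give u = 512 + 100·50·(0.2/2) = 1012 and v = 512 + 100·50·(0.1/2) = 762, i.e. a pixel to the right of and below the principal point, as expected for a point with positive X (right) and positive Y (down).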

The parameters cx,cy,sx,sy,f are given the following names and have the following geometric interpretation in the ideal pinhole camera model:

•    cx,cy: Principal point - the intersection point between the optical axis and the imaging surface (in pixels). The optical axis is the line that is orthogonal to the imaging surface and passes through the camera origin. In our chosen convention, this is the Z-axis.

•    sx,sy: Pixel density - the number of discrete sensing elements horizontally and vertically per unit length of the imaging surface (often in pixels per micrometer).

•    f: Focal length - the distance between the pinhole and the imaging surface (often in millimeters).

The principal point should not be confused with the optical center or center of projection, which is the 3D origin of the camera coordinate system, where the incoming light rays intersect. This distinction is made in the majority of computer vision literature, but Szeliski uses the term optical center for both quantities. In the ideal pinhole camera, the center of projection coincides with the pinhole. The center of projection in a real camera is less obvious, and may be behind, within or in front of the lens system.

You will more commonly see sx and sy as their inverse quantities, which represent the physical distance between horizontally and vertically adjacent pixels, respectively. In a sensor datasheet, these are usually specified as the pixel size or pitch in micrometers (microns). In modern digital cameras, pixels are usually square (sx = sy), so you may only see a single number listed.
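For example, a datasheet pixel pitch of 3.45 µm (a typical value, used here only for illustration) corresponds to sx = sy = 1/0.00345 mm ≈ 290 pixels per millimeter.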

 

Figure 2: A camera moving over an approximately planar surface, as described in Task 1.1-1.2. Note that the camera is drawn as a triangle with the imaging surface in front of the camera, which, as mentioned before, makes no sense. However, its use as a visual symbol is somewhat standard in robotics and computer vision literature. In 3D it is often drawn as a pyramid.

Part 1              Choosing a sensor and lens
You may at some point have to choose a camera sensor and lens combination to satisfy the imaging requirements of a particular application. Many camera vendors offer “calculators” that can help with this process. These are often based on the pinhole camera model, so it is possible to do this analysis yourself, using the equations presented previously.

Some key parameters for the sensor are its shutter speed, resolution and pixel size, which are specified in its datasheet. For the lens, the primary parameter is its focal length. The pixel size corresponds to 1/sx and 1/sy in the pinhole camera model. The focal length of a lens can to a first approximation be considered as equivalent to the identically named parameter f in the pinhole camera model. (Szeliski 2.2.3 provides a brief motivation for when and why this approximation is reasonable.)
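To make the connection explicit: from Eq. 3, a ground displacement ΔX at depth Z (with the camera looking straight down) maps to an image displacement of Δu = sx f ΔX/Z pixels, so a single pixel (Δu = 1) covers a ground distance of Z/(sx f). This is the relation you will need in Tasks 1.1 and 1.2 below.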

Task 1.1: Imagine you are designing an aerial vehicle to capture images of an agricultural crop as in Fig. 2. Suppose that you have a camera sensor with square pixels of size 10×10 microns and suppose that the camera is looking straight down from a height of 50 meters. Assume that the ground is flat and determine what focal length you need in order for one pixel in the image to cover a ground distance of one centimeter. Give your answer in millimeters.

Task 1.2: Having overlap between images is important for image stitching (e.g. creating one large image of the entire crop). Suppose that the camera captures 5 images per second at a resolution of 1024×1024 pixels, and suppose that the camera is moving 50 meters per second along one of the lateral camera axes (e.g. the X-axis, as in Fig. 2). Determine the area of overlap between two consecutive images, using the focal length you found above. Give your answer as a percentage of the total image area.

Hint: Compute the distance (in pixels) by which any pixel in the image is displaced from one image to the next, subtract it from the image size (1024), and divide the remainder by the image size.

 


Figure 3: Projection of points in Task 2.2 (left) and transformed points in Task 3.2 (right).

Part 2             Implementing the pinhole camera model
Although the pinhole camera model involves a non-linear operation (division by Z), it is often written as the following linear relationship:

  

                                                  [ũ]   [sx f    0     cx] [X]
                                                  [ṽ] = [ 0     sy f   cy] [Y]                              (5)
                                                  [w̃]   [ 0      0      1] [Z]

or simply

                                                                                             ũ = KX                                                           (6)

where K is called the calibration matrix or intrinsic matrix. The vector ũ = (ũ, ṽ, w̃) is the homogeneous form of the pixel coordinate vector u = (u, v). These are implicitly related by

                                                                             w̃ u = ũ     and   w̃ v = ṽ.                                          (7)

Like Szeliski, we denote homogeneous coordinates using the tilde “∼” symbol. The motivation for writing the equations in this form is probably only truly appreciated after taking a course in projective geometry. Nevertheless, you will frequently encounter the linear formulation in practice.

Task 2.1: Dividing a homogeneous vector by its last component is called dehomogenization. Show that the dehomogenized coordinates ũ/w̃ and ṽ/w̃ are equal to u and v in Eq. 4, when Z ≠ 0.

Task 2.2: The hand-out code has a stub function called project in project.m (Matlab) and common.py (Python). The function should take a calibration matrix K and a 3×N array of points (X,Y,Z), and return a 2×N array of pixel coordinates (u,v), computed with the pinhole model.

Implement the function and test it on the points and calibration matrix provided in task2points.txt and task2K.txt. Include a figure showing the pinhole projection of the 3D points. The task2 script provides helper code to load the data and generate the required figure. Your figure should look similar to the left figure in Fig. 3 (minus the grid and labels).
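For reference, a minimal NumPy sketch of such a function could look like the following (one possible implementation; the hand-out stub may expect a slightly different structure):

    import numpy as np

    def project(K, X):
        # K: 3x3 calibration matrix. X: 3xN array of camera-frame points (X, Y, Z).
        # Returns a 2xN array of pixel coordinates (u, v).
        u_tilde = K @ X                  # Eq. 6: homogeneous pixel coordinates (3xN)
        return u_tilde[:2] / u_tilde[2]  # dehomogenize, i.e. divide by w~ (Eq. 7)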

Part 3             Homogeneous coordinates and transformations
So far, (X,Y,Z) has referred to a point’s coordinates in the camera coordinate system. However, the point may be expressed in a different coordinate system, perhaps attached to the environment or to an object. If so, we need to include an additional transformation that relates the two coordinate systems. This transformation is called a rigid body motion or Euclidean transformation, and is described by a 3D rotation and 3D translation. Given a point Xo expressed in a coordinate system (or frame) “o”, its coordinate vector in the camera frame “c” is

                                                                                         Xc = RXo + t,                                                      (8)

where R is a rotation matrix and t is a translation vector. Note that we distinguish between coordinates in different frames with a superscript. Sometimes R and t are grouped into a single 4×4 matrix, such that the above can be written as

                                                  [Xc]   [R   t] [Xo]
                                                  [ 1] = [0ᵀ  1] [ 1]                                        (9)

with a scalar 1 appended to both vectors. More concisely we may write

                                                                                         X̃c = Tco X̃o                                          (10)

where Tco is read as “the transformation from o to c”. Using 4×4 matrices lets us combine sequential transformations by matrix multiplication. For example, given the transformation from a to b, and from b to c, the composite transformation from a to c is Tca = Tcb Tba.
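As a small illustration (a NumPy sketch; the function name is only illustrative), the 4×4 matrix in Eq. 9 can be assembled, and sequential transformations composed, like this:

    import numpy as np

    def homogeneous_transform(R, t):
        # Assemble the 4x4 matrix of Eq. 9 from a 3x3 rotation R and a translation 3-vector t.
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = t
        return T

    # Composition by matrix multiplication: given T_ba (a to b) and T_cb (b to c),
    # the transformation from a to c is T_ca = T_cb @ T_ba.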

Task 3.1: The vectors in Eq. 10 are generally homogeneous with the last component not necessarily equal to 1. They must therefore be divided by the last component to obtain the actual 3D coordinates. However, show that this step can be skipped when computing the projection (u, v) of a homogeneous vector X̃ = (X̃, Ỹ, Z̃, W̃), i.e. show that the last component (W̃) can simply be dropped.

Task 3.2: The points in Task 2 were in camera coordinates and had been pre-translated 6 units along the camera Z-axis to be in front of the camera. The points in task3points.txt are instead given in the object coordinate frame indicated in Fig. 3 (right). Here, the object-to-camera transformation Tco is a product of one translation by 6 units and two elemental rotations by 15° and 45°, respectively.

Find an expression for Tco. Check your answer by modifying task2 to load the new points and apply your transformation before projection. The generated figure should look identical to Fig. 3 (right). Note: You don’t need to expand the matrix product; you can give your answer in terms of the elemental rotation matrices Rx(·),Ry(·),Rz(·), and a translation matrix Tz(·).

Tip: You may want to modify the project function to take a 4×N array of homogeneous coordinates. You can find definitions of the elemental rotation matrices on Wikipedia. Finally, you may use the provided function draw_frame to visualize the coordinate frame associated with Tco, as in the figure.
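If it helps, the elemental matrices can be implemented as 4×4 homogeneous transformations along the following lines (a sketch assuming angles in radians and the standard right-handed, counter-clockwise definitions; double-check the signs against the definitions you use):

    import numpy as np

    def rotate_x(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[1, 0,  0, 0],
                         [0, c, -s, 0],
                         [0, s,  c, 0],
                         [0, 0,  0, 1]])

    def rotate_y(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[ c, 0, s, 0],
                         [ 0, 1, 0, 0],
                         [-s, 0, c, 0],
                         [ 0, 0, 0, 1]])

    def rotate_z(a):
        c, s = np.cos(a), np.sin(a)
        return np.array([[c, -s, 0, 0],
                         [s,  c, 0, 0],
                         [0,  0, 1, 0],
                         [0,  0, 0, 1]])

    def translate(x, y, z):
        T = np.eye(4)
        T[:3, 3] = [x, y, z]
        return T

Composing such helpers with @ in the appropriate order then gives Tco directly as a matrix product.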

 

Figure 4: Quanser 3-DOF helicopter and its three degrees of freedom. The arrows indicate the direction of increasing yaw, pitch and roll.

Part 4              Image formation model for the Quanser helicopter
The Quanser 3-DOF helicopter consists of an arm with a counterweight at the back and a rotor carriage with two rotors at the front. It has three rotational degrees of freedom (Fig. 4):

•    Yaw (ψ): Rotation around an axis perpendicular to the mounting platform

•    Pitch (θ): Rotation of the arm up or down

•    Roll (φ): Rotation of the rotor carriage around the arm

In the midterm project, you’ll estimate these angles from markers attached to the helicopter. To do this, you need a mathematical model that relates points on the helicopter to pixel coordinates in the image. A reasonable solution is to model the helicopter as a piecewise rigid body and attach a coordinate frame to each part of interest. The coordinate frames you should use are shown in Fig. 5. Your main task here is to define the 4 × 4 transformation matrices between the different coordinate frames, using the measurements indicated in Fig. 6.

Task 4.1: A square platform with four screw holes is shown in Fig. 7. The distance between adjacent screws is 11.45 cm. Define four coordinate vectors corresponding to the screw locations. The coordinate vectors should be expressed in the coordinate frame shown in Fig. 7a, which is aligned with the sides of the platform and has its origin on the closest screw.

Task 4.2: Verify that your answers above are correct by writing a script to reproduce Fig. 7a:

1.    Load and draw the image quanser.jpg. Adjust the limits of the figure to zoom in on the platform.

2.    Use the transformation Tcameraplatform in platform_to_camera.txt to transform your coordinate vectors into camera coordinates. Assume that the camera satisfies a pinhole model with the matrix K in heli_K.txt and project the transformed coordinate vectors into the image.

Include the figure in your writeup. The drawn points should coincide with the screws. You don’t have to reproduce Fig. 7 exactly, e.g. the labels and stippled lines can be omitted.
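Schematically, step 2 boils down to one transformation followed by one projection. A minimal sketch (assuming the .txt files load as plain whitespace-separated numbers, and reusing the project function from Part 2):

    import numpy as np

    K = np.loadtxt('heli_K.txt')                               # 3x3 calibration matrix
    T_camera_platform = np.loadtxt('platform_to_camera.txt')   # 4x4 platform-to-camera transformation

    def screws_in_image(X_platform):
        # X_platform: 4xN homogeneous screw coordinates from Task 4.1, in the platform frame.
        X_camera = T_camera_platform @ X_platform   # transform into camera coordinates
        return project(K, X_camera[:3])             # pinhole projection (Part 2)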

As in Task 3.2, for the following tasks you don’t need to multiply and write out the entries of the 4×4 matrix. You can express your answer as a product of the elemental rotation matrices Rx(·),Ry(·),Rz(·) and translation matrices Tx(·),Ty(·),Tz(·).

Task 4.3: Yaw motion is enabled by a shaft passing through the platform center. The “base” frame follows the shaft: Its z-axis is parallel to the platform z-axis and its origin is in the middle of the platform. The remaining axes are obtained by a counter-clockwise rotation by ψ around z. Find an expression for the transformation Tplatformbase (ψ) from the base frame to the platform frame.

Task 4.4: Pitch motion is enabled by a hinge that rotates with the shaft. The “hinge” frame is obtained by translating 32.5 cm along the base frame z-axis (i.e. up the shaft), and rotating counterclockwise by θ around the base frame y-axis. Find an expression for Tbasehinge(θ).

Task 4.5: The arm is rigidly attached to the hinge. As indicated in Fig. 5, the “arm” frame is centered in the cross-section of the arm with its x-axis pointing down the arm toward the rotors. It is obtained by translating −5 cm along the hinge frame z-axis. Find an expression for Thingearm .

Task 4.6: Finally, the rotor carriage is allowed to rotate around a shaft parallel to the arm, located slightly below the arm centerline. The “rotors” frame is obtained from the arm frame by translating 65 cm along the x-axis, −3 cm along the z-axis, and rotating counter-clockwise by φ around the x-axis. Find an expression for Tarmrotors(φ).

Task 4.7: Verify your answers by writing a script to reproduce Fig. 8:

1.   Load and draw the image quanser.jpg.

2.   Instantiate the transformation matrices using ψ = 11.6°, θ = 28.9° and φ = 0°. Use the draw_frame function to draw the coordinate frames. (Set the scale argument to about 0.05.)

3.   Load the marker coordinate vectors in heli_points.txt. The first three are given in the arm frame, and the last four are given in the rotors frame. Draw the points by applying the correct sequence of transformations, followed by projection.

Each projected point is meant to lie exactly on one of the corners of its associated marker. However, if you have done everything correctly, the projected points will be off by up to 10 pixels. This is due to measurement and modeling inaccuracies. (In the midterm project you will learn how to calibrate the model itself by tracking the helicopter through a long motion sequence.)
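Schematically, the point drawing in step 3 chains the frames defined above (a sketch only; the individual T factors are your answers to Tasks 4.3–4.6, and the variable names are illustrative):

    # Points given in the arm frame:
    T_camera_arm = (T_camera_platform @ T_platform_base(psi)
                    @ T_base_hinge(theta) @ T_hinge_arm)
    uv_arm = project(K, (T_camera_arm @ X_arm)[:3])

    # Points given in the rotors frame need one further transformation:
    T_camera_rotors = T_camera_arm @ T_arm_rotors(phi)
    uv_rotors = project(K, (T_camera_rotors @ X_rotors)[:3])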
