CMU 15-463 Assignment 4

The purpose of this assignment is to explore lightfields, focal stacks, and depth from defocus. As we discussed in class, having access to the full lightfield of a scene allows creating images that correspond to different viewpoint, aperture, and focus settings. We also discussed how having a focal stack of a scene allows creating an all-in-focus image and a depth map using depth from defocus.

Here, you will combine the two into a single pipeline: Instead of capturing a focal stack, you will synthesize one from a lightfield image captured with a plenoptic camera. Then, from the synthetic focal stack, you will compute an all-in-focus image and a depth map.

In the first part of the homework you will be using a lightfield image captured by us. In the second part, you will capture and refocus a lightfield of your own using a standard camera.

You are strongly encouraged to read the handheld plenoptic camera paper by Ng et al. [3] that we discussed in class. Sections 3.3 and 4, in particular, will be very helpful for understanding how to solve this homework assignment. Additionally, as always, there is a “Hints and Information” section at the end of this document that is likely to help.

1. Lightfield rendering, focal stacks, and depth from defocus (80 points)
For the first part of the homework, you will use a lightfield image of a chess board scene, obtained from the Stanford Light Field Archive [1]. (We also used this scene for related examples during the lightfield lecture.) The lightfield is available as file ./data/chessboard_lightfield.png in the homework ZIP archive. This image file is formatted in the same way as images captured by a plenoptic camera, with the pixels under neighboring lenslets corresponding to neighboring patches in the image. This format is described in detail in [3]. Figure 1 shows a crop from the center of the lightfield image, as well as a regular pinhole-camera view of the chessboard scene.

 

Figure 1: The chessboard scene lightfield. Left: Crop of the lightfield image. Right: A pinhole camera view of the scene.

Initials (5 points). Load the lightfield image in Matlab, and create from it a 5-dimensional array L(u,v,s,t,c). The first four dimensions correspond to the 4-dimensional lightfield representation we discussed in class, with u and v being the coordinates on the aperture, and s and t the coordinates on the lenslet array. The fifth dimension c = 3 corresponds to the 3 color channels. When creating this structure, you can use the fact that each lenslet covers a block of 16 × 16 pixels.
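For reference, a minimal MATLAB sketch of this step is shown below. It assumes the file path given above and that the pixels under each lenslet are stored as a contiguous 16 × 16 block; depending on the exact lenslet orientation in the file, you may need to flip the u and/or v axes.

```matlab
% Build L(u,v,s,t,c) from the raw lenslet image.
lensletSize = 16;
raw = im2double(imread('./data/chessboard_lightfield.png'));   % (16*S) x (16*T) x 3

[H, W, C] = size(raw);
S = H / lensletSize;    % number of lenslets vertically
T = W / lensletSize;    % number of lenslets horizontally

% Split each image dimension into (pixel within lenslet, lenslet index),
% then reorder so the aperture coordinates come first.
L = reshape(raw, [lensletSize, S, lensletSize, T, C]);   % dims: (u, s, v, t, c)
L = permute(L, [1, 3, 2, 4, 5]);                         % dims: (u, v, s, t, c)
```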

Sub-aperture views (15 points). As we discussed in class, by rearranging the pixels in the lightfield image, we can create images that correspond to views of the scene through a pinhole placed at different points on the camera aperture (a “sub-aperture”). This is equivalent to taking a slice of the lightfield of the form L(u = uo,v = vo,s,t,c), for some values of uo and vo corresponding to the point on the aperture where we place the pinhole. For the chessboard lightfield, we can generate 16 × 16 such images.

Create all of these sub-aperture views, and arrange them into a 2D mosaic, where the vertical dimension will correspond to increasing u values, and the horizontal dimension to increasing v values. Figure 2 shows the expected result. Submit the mosaic with your solution.
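A minimal MATLAB sketch of the mosaic, assuming the array L(u,v,s,t,c) built in the previous step:

```matlab
% Arrange all sub-aperture views into a single 2D mosaic:
% u increases downward, v increases to the right.
[U, V, S, T, C] = size(L);
mosaic = zeros(U * S, V * T, C);
for u = 1:U
    for v = 1:V
        rows = (u - 1) * S + (1:S);
        cols = (v - 1) * T + (1:T);
        mosaic(rows, cols, :) = squeeze(L(u, v, :, :, :));   % one S x T x 3 view
    end
end
imwrite(mosaic, 'mosaic.png');
```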

 

Figure 2: Mosaic of sub-aperture views.

Refocusing and focal-stack generation (30 points). A different effect that can be achieved by appropriately combining parts of the lightfield is refocusing at different depths. In particular, averaging all of the sub-aperture views results in an image that is focused at a depth near the top of the chess board. This corresponds to creating an image as:

\[
\int_{u} \int_{v} L(u, v, s, t, c) \, dv \, du. \tag{1}
\]

The resulting image is shown to the left of Figure 3. As we discussed in class and explained in detail in Section 4 of [3], focusing at different depths requires shifting the sub-aperture images before averaging them, with the shift of each image depending on the desired focus depth and the location of its sub-aperture. More concretely, to focus at depth d, we need to combine the sub-aperture images as:

\[
I(s, t, c, d) = \int_{u} \int_{v} L(u, v, s + d\,u, t + d\,v, c) \, dv \, du. \tag{2}
\]

For d = 0, the image we obtain is the same as in Equation (1). Figure 3 shows refocused images for two more settings of d.
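In the array representation built earlier, this d = 0 image is simply the average of L over its two aperture dimensions; a minimal sketch, assuming that layout:

```matlab
% Average over the aperture dimensions (u, v) of L(u, v, s, t, c);
% this reproduces Equation (1), i.e. the d = 0 refocused image.
refocusedZero = squeeze(mean(mean(L, 1), 2));   % S x T x 3
```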

Implement Equation (2), and use it to generate a focal stack I (s,t,c,d) for a range of values d. Make sure that your focal stack is long enough so that each part of the scene is in focus in at least one image in the stack. In your solution, make sure to show at least five different refocusings.
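One possible implementation is sketched below in MATLAB. It assumes the array L from above, measures shifts relative to the aperture center, and uses bilinear interpolation for non-integer shifts; the sign convention and the useful range of d values depend on how you oriented the u and v axes, so expect to experiment with both.

```matlab
% Generate a focal stack by shifting and averaging sub-aperture views (Eq. 2).
depths = linspace(-1.5, 0.5, 5);        % example d values; widen/adjust until
                                        % every part of the scene is in focus
[U, V, S, T, C] = size(L);
uc = (U + 1) / 2;  vc = (V + 1) / 2;    % aperture center
[tt, ss] = meshgrid(1:T, 1:S);          % (s, t) sample grid

focalStack = zeros(S, T, C, numel(depths));
for k = 1:numel(depths)
    d = depths(k);
    acc = zeros(S, T, C);
    for u = 1:U
        for v = 1:V
            du = d * (u - uc);          % shift along s for this sub-aperture
            dv = d * (v - vc);          % shift along t for this sub-aperture
            for c = 1:C
                sub = squeeze(L(u, v, :, :, c));
                acc(:, :, c) = acc(:, :, c) + ...
                    interp2(sub, tt + dv, ss + du, 'linear', 0);
            end
        end
    end
    focalStack(:, :, :, k) = acc / (U * V);
end
```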

 

Figure 3: Refocusing at different depths. The left image corresponds to using Equation (2) with d = 0 (or equivalently, using Equation (1)).

All-focus image and depth from defocus (30 points). As we saw in class, when we have access to a focal stack, we can merge the images into a new image in which all of the scene is in focus. In the process of doing so, we also obtain depth estimates for each part of the scene, a procedure known as depth from defocus.

To merge the focal stack into a single all-focus image, we first need to determine per-pixel and per-image weights. This is similar to the procedure used in Homework 4 for high-dynamic-range imaging, except that the weights are computed very differently: here, they measure how “sharp” the neighborhood of each pixel is in each image of the focal stack.

There are many possible sharpness weights. Here you will implement the following:

1. For every image in the focal stack, first convert it to the XYZ colorspace (making sure to account for tonemapping), and extract the luminance channel:

\[
I_{\text{luminance}}(s, t, d) = \text{get\_luminance}(\text{rgb2xyz}(I(s, t, c, d))). \tag{3}
\]

2. Create a low-frequency component by blurring the luminance channel with a Gaussian kernel of standard deviation σ1:

\[
I_{\text{low-freq}}(s, t, d) = G_{\sigma_1}(s, t) * I_{\text{luminance}}(s, t, d). \tag{4}
\]

3. Compute a high-frequency component by subtracting the blurry image from the original:

\[
I_{\text{high-freq}}(s, t, d) = I_{\text{luminance}}(s, t, d) - I_{\text{low-freq}}(s, t, d). \tag{5}
\]

4. Compute the sharpness weight by blurring the square of the high-frequency component with another Gaussian kernel of standard deviation σ2:

\[
w_{\text{sharpness}}(s, t, d) = G_{\sigma_2}(s, t) * \left(I_{\text{high-freq}}(s, t, d)\right)^2. \tag{6}
\]

Note that the weights are the same for each of the color channels.

Once you have the sharpness weights, you can compute the all-focus image as:
\[
I_{\text{all-focus}}(s, t, c) = \frac{\sum_{d} w_{\text{sharpness}}(s, t, d) \, I(s, t, c, d)}{\sum_{d} w_{\text{sharpness}}(s, t, d)}. \tag{7}
\]

In addition, you can create a per-pixel depth map by using the weights to merge depth values instead of pixel intensities, that is:

\[
\text{Depth}(s, t) = \frac{\sum_{d} w_{\text{sharpness}}(s, t, d) \, d}{\sum_{d} w_{\text{sharpness}}(s, t, d)}. \tag{8}
\]
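A minimal MATLAB sketch of Equations (3)–(8) is shown below. It assumes the focal stack is stored as focalStack(s,t,c,d) with a vector depths holding the corresponding d values (both names are placeholders), and uses imgaussfilt for the Gaussian blurs.

```matlab
sigma1 = 2;  sigma2 = 6;                  % example values; experiment with these
[S, T, C, D] = size(focalStack);

w = zeros(S, T, D);
for k = 1:D
    img = focalStack(:, :, :, k);
    % rgb2xyz expects gamma-encoded sRGB input by default; apply your own
    % tonemapping first if your stack is stored as linear values.
    xyz = rgb2xyz(img);
    lum = xyz(:, :, 2);                               % luminance (Y), Eq. (3)
    lowFreq  = imgaussfilt(lum, sigma1);              % Eq. (4)
    highFreq = lum - lowFreq;                         % Eq. (5)
    w(:, :, k) = imgaussfilt(highFreq .^ 2, sigma2);  % Eq. (6)
end

wSum = sum(w, 3) + eps;                               % avoid division by zero
allFocus = zeros(S, T, C);
for c = 1:C
    allFocus(:, :, c) = sum(squeeze(focalStack(:, :, c, :)) .* w, 3) ./ wSum;  % Eq. (7)
end
depthMap = sum(w .* reshape(depths, 1, 1, D), 3) ./ wSum;                      % Eq. (8)
```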

Figure 4 shows the all-focus image and depth map resulting from one set of σ1 and σ2 values used for sharpness evaluation. You should experiment with different values and report which ones work best, as well as show the corresponding all-focus image and depth map.

 

Figure 4: Left: All focus image for one set of σ1 and σ2 values. Right: Corresponding depth map.

2. Bonus: Better blending and depth map (20 points)
In the past several lectures, we have discussed a variety of techniques for computing “sharpness” and blending images. In this bonus question, you can experiment with different techniques for blending a focal stack into an all-in-focus image, as well as extracting a depth map for it.

This bonus part is intentionally left open-ended, to give you an opportunity to experiment and explore. You are also welcome to look into related literature for ideas—the references at the end of the lecture on focal stacks and lightfields should be a good starting point. How many points you will be awarded for this bonus part will depend on three factors: 1) the magnitude of the experiments you perform (e.g., just replacing the weighting method of Part 1 with running a Laplacian will not get you many points); 2) the novelty and soundness of the blending pipeline you come up with; and 3) the improvement in the resulting all-in-focus image and depth map.

3. Capture and refocus your own lightfield (70 points)
You will now capture and refocus your own lightfield. For this, you can use either the Nikon D3300 camera you borrowed at the start of the class, or your own cell phone camera.

Capturing an unstructured lightfield. (20 points) As we saw in class, in the absence of a plenoptic camera and a camera array, an easy way to capture a lightfield is to use a camera that we move around to capture multiple images. Ideally, we would move the camera at constant x-y intervals, to create a regularly sampled measurement of the true underlying lightfield. However, this is hard to do in two dimensions without specialized equipment.

Instead, here you will capture an unstructured lightfield [2]. The procedure for doing this is shown in Figure 5: Use a camera to capture video, while moving the camera along a plane. Doing so corresponds to sampling the aperture plane in an unstructured way, at irregular values (u,v) that depend on the trajectory of the camera, instead of the structured grid sampling performed by a camera moving at regular x-y intervals. In your homework submission, make sure to include the video you end up using.

Refocusing an unstructured lightfield. (50 points) In the first part of the homework, the fact that we knew the amount of shift corresponding to each viewpoint in the lightfield greatly simplified the refocusing process: We could refocus by aligning images based on the part of the scene we wanted to be in focus, with this alignment corresponding to shifts determined by the rectangular grid structure we used to sample the lightfield.

In the unstructured lightfield you captured, the sampling grid structure is not known. Therefore, you will need to infer the amount of shift that needs to be applied for images to be aligned at a specific point.

To see how to do this, let’s first look at Figure 6, which shows a few frames from an unstructured lightfield. Let’s say that we want to create an image that is focused on the bug eye marked with a red box. We will determine how to shift images using a template matching procedure.

 

Figure 5: Capturing an unstructured lightfield.

 

Figure 6: Frames from an unstructured lightfield.

In particular, in the middle frame of your video sequence, select a small square neighborhood around the part of the image that you want to be in focus. Then, use the corresponding image patch as a template, and perform template matching against every other frame of the video using the normalized cross-correlation method: Let your template be g[i,j] and a video frame be I_t[i,j], where t is used to index video frames. Then the normalized cross-correlation is computed as:

\[
h_t[i, j] = \frac{\sum_{k, l} \left(g[k, l] - \bar{g}\right) \left(I_t[i + k, j + l] - \bar{I}_t[i, j]\right)}{\sqrt{\sum_{k, l} \left(g[k, l] - \bar{g}\right)^2 \, \sum_{k, l} \left(I_t[i + k, j + l] - \bar{I}_t[i, j]\right)^2}}, \tag{9}
\]

where Ī_t[i,j] is a version of I_t[i,j] filtered with a box filter of the same size as the template g[i,j], and ḡ is the mean of the template. The left part of Figure 7 shows a visualization of this. Note that template matching should be performed using a grayscale template and on grayscale video frames (you can use the luminance channel of the template and frames for this).

Once you have computed the cross-correlation, for each frame of the video you can compute a shift as:

\[
[s_{t,x}, s_{t,y}] = \arg\max_{i, j} \, h_t[i, j]. \tag{10}
\]

Shift each frame by its corresponding amount [st,x,st,y], then sum all frames. The result should be an image that is focused around the template you used. The right part of Figure 7 shows the focusing result.

 

Figure 7: Focusing by template matching. Left: Template matching procedure for one frame. Right: Refocusing result.

Implement the above procedure, and show a few results of focusing at different parts of your captured video.
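A minimal MATLAB sketch of this procedure is shown below. The file name video.mp4 and the rectangle templateRect are placeholders for your own video and hand-picked template region; rgb2gray is used as a stand-in for the luminance channel, and normxcorr2 and imtranslate handle the matching and shifting.

```matlab
% Refocus an unstructured lightfield by template matching.
reader = VideoReader('video.mp4');
frames = {};
while hasFrame(reader)
    frames{end + 1} = im2double(readFrame(reader)); %#ok<AGROW>
end
nFrames = numel(frames);
mid = round(nFrames / 2);

% Template picked around the region that should stay in focus,
% e.g. templateRect = [x y w h] chosen with imcrop on the middle frame.
template = imcrop(rgb2gray(frames{mid}), templateRect);

% Peak location of the template in the reference (middle) frame.
nccMid = normxcorr2(template, rgb2gray(frames{mid}));
[~, idx] = max(nccMid(:));
[refY, refX] = ind2sub(size(nccMid), idx);

refocused = zeros(size(frames{mid}));
for t = 1:nFrames
    ncc = normxcorr2(template, rgb2gray(frames{t}));   % Eq. (9)
    [~, idx] = max(ncc(:));                            % Eq. (10)
    [peakY, peakX] = ind2sub(size(ncc), idx);
    shift = [refX - peakX, refY - peakY];              % [s_{t,x}, s_{t,y}]
    refocused = refocused + imtranslate(frames{t}, shift);
end
refocused = refocused / nFrames;
```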

4. Bonus: Capture and process your own focal stack (60 points)
As we discussed in class, you do not need a lightfield camera to do depth from defocus or create all-in-focus images. Instead, you can do the same by using your regular camera to capture a focal stack, simply by taking photographs at a sequence of different focus settings.

Use a camera with manual settings (e.g., the Nikon D3300 you borrowed for the class) to capture a dense focal stack, then process it to create a depth map and an all-in-focus image. You should make sure to capture a scene where there is significant depth variation (see the example focal stack in the lecture slides). Additionally, you should use the largest aperture available to your lens, as the shallow depth of field this will create will increase your depth resolution. We strongly recommend that you capture your focal stack by tethering your camera to your laptop, and using gphoto2 or some other software to control it remotely.

Additionally, depending on the depth range captured in your focal stack, you will likely need to first perform an alignment step on your focal stack, to account for the change in magnification as the focus changes. We recommend that you perform this alignment using a simple global scaling, as discussed in class. You can use the Gaussian lens formula and the aperture and focusing distance settings reported by the camera to figure out the exact scaling you need to use.
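One way to work out this scaling is sketched below, under the common approximation that the overall image magnification is proportional to the lens-to-sensor distance s_i given by the Gaussian lens formula 1/f = 1/s_o + 1/s_i. The names f, focusDist, and stackFiles are placeholders for your lens focal length, the per-frame focus distances read from the EXIF data (in the same units as f), and your image file names.

```matlab
% Global-scaling alignment of a focal stack.
f  = 50;                                   % example focal length (mm)
si = f .* focusDist ./ (focusDist - f);    % lens-to-sensor distance per frame (mm)

[~, refIdx] = max(si);                     % reference = most magnified frame,
scale = si(refIdx) ./ si;                  % so every scale factor is >= 1

aligned = cell(1, numel(stackFiles));
for k = 1:numel(stackFiles)
    img = im2double(imread(stackFiles{k}));
    big = imresize(img, scale(k));         % upscale to match the reference magnification
    % Center-crop the rescaled image back to the original size.
    r0  = floor((size(big, 1) - size(img, 1)) / 2);
    c0  = floor((size(big, 2) - size(img, 2)) / 2);
    aligned{k} = big(r0 + (1:size(img, 1)), c0 + (1:size(img, 2)), :);
end
```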
