16-385 Assignment 5: Scene Recognition with Bag of Words

Instructions
Integrity and collaboration: Students are encouraged to work in groups but each student must submit their own work. If you work as a group, include the names of your collaborators in your write-up. Code should NOT be shared or copied. Please DO NOT use external code unless permitted. Plagiarism is strongly prohibited and may lead to failure of this course.
Start early! Especially if you are not familiar with Matlab. Constructing the dictionary in Q1.3, as well as converting all the images to visual words in Q2.1, can easily take up to 30 minutes to run (if implemented well!). If you start late, you will be pressed for time while debugging.
Questions: If you have any questions, please look at Piazza first. Other students may have encountered the same problem, and it might have been solved already. If not, post your question on the discussion board. TAs will respond as soon as possible.
Write-up: Items to be included in the write-up are mentioned in each question, and summarized in the Writeup section. Please note that we DO NOT accept handwritten scans for your write-up in this assignment. Please type your answers to theory questions and discussions for experiments electronically.
Handout: The handout zip file contains 3 items. The pdf is the assignment handout. data/ contains 1491 image files from the SUN image database, divided into 8 category folders. It also contains traintest.mat, which contains all the training and testing image paths, as well as the labels. matlab/ contains 3 scripts that you will make use of in this project.
Code: Stick to the function prototypes mentioned in the handout. This makes verifying code easier for the TAs. If you do want to change a function prototype or add an extra parameter, please talk to the TAs. Also hand in any other helper functions you need.
Overview
Perhaps the most common problem in computer vision is classification. Given an image that comes from a few fixed categories, can you determine which category it belongs to?

Figure 1: The eight categories you have been given from the SUN Image database

For this assignment, you will be developing a system for scene classification. A system like this might be used by services such as Flickr or Instagram that host large numbers of user-uploaded photos and wish to automatically categorize them based on the type of scenery. It could also be used by a robotic system to determine what type of environment it is in, and perhaps change how it moves around.

You have been given a subset of the SUN Image database consisting of eight scene categories. See Figure 1. In this assignment, you will build an end to end system that will, given a new scene image, determine which type of scene it is.

This assignment is based on an approach to document classification called Bag of Words, which represents a document as a vector or histogram of counts for each word that occurs in the document, as shown in Figure 2. The hope is that different documents in the same class will have a similar collection and distribution of words, so that when we see a new document, we can find out which class it belongs to by comparing its histogram to the histograms already in that class. This approach has been very successful in Natural Language Processing, which is surprising given its relative simplicity. We will take the same approach to image classification. However, one major problem arises: with text documents we actually have a dictionary of words, but what "words" can we use to represent an image?

This assignment has 3 major parts. Part 1 involves building a dictionary of visual words from the training data. Part 2 involves building the recognition system using our visual word dictionary and our training images. In Part 3, you will evaluate the recognition system using the test images.

Figure 2: Bag of words representation of a text document

In Part 1, you will use the provided filter bank to convert each pixel of each image into a high dimensional representation that captures meaningful information, such as corners and edges. This will take each pixel from being a 3D vector of color values to an nD vector of filter responses. You will then take these nD pixels from all of the training images and run K-means clustering to find groups of pixels. Each resulting cluster center will become a visual word, and the whole set of cluster centers becomes our dictionary of visual words. In theory, we would like to use all pixels from all training images, but this is very computationally expensive. Instead, we will only take a small sample of pixels from each image. One option is to simply select α pixels from each one uniformly at random. Another option is to use some feature detector (Harris corners, for example) and take α feature points from each image. You will do both to produce two dictionaries, so that we can compare their relative performances. See Figure 3.

Figure 3: Flow chart of the first part of the project that involves building the dictionary of visual words

In Part 2, the dictionary of visual words you produced will be applied to each of the training images to convert them into a word map. This will take each of the nD pixels in all of the filtered training images and assign each one a single integer label, corresponding to the closest cluster center in the visual word dictionary. Then each image will be converted to a "bag of words": a histogram of visual word counts. You will then use these to build the classifier (Nearest Neighbors, and SVM for extra credit). See Figure 4.

Figure 4: Flow chart of the second part of the project that involves building the recognition system

In Part 3, you will evaluate the recognition system that you built. This will involve taking the test images and converting them to image histograms using the visual word dictionary and the function you wrote in Part 2. Next, for nearest neighbor classification, you will use a histogram distance function to compare the new test image histogram to the training image histograms in order to classify the new test image. Doing this for all the test images will give you an idea of how good your recognition system is. See Figure 5.

Figure 5: Flow chart of the third part of the project that involves evaluating the recognition system

In your write-up, we will ask you to include various intermediate result images, as well as metrics on the performance of your system.

Programming
Part 1: Build Visual Words Dictionary
Q1.1 Extract Filter Responses      (5 + 15 points) Write a function to extract filter responses.

function [filterResponses] = extractFilterResponses(I, filterBank)

We have provided the function createFilterBank(). This function will generate a set of image convolution filters. See Figure 6. There are 4 filters, and each one is made at 5 different scales, for a total of 20 filters. The filters are:

Gaussian: Responds strongly to constant regions, and suppresses edges, corners and noise
Laplacian of Gaussian: Responds strongly to blobs of similar size to the filter
X Gradient of Gaussian: Responds strongly to vertical edges
Y Gradient of Gaussian: Responds strongly to horizontal edges
Figure 6: The four types of filters we have provided you.

Pass this cell array of image convolution filters to your extractFilterResponses function, along with an image of size H × W × 3. Your function should use the provided RGB2Lab(..) function to convert the color space of I from RGB to Lab. Then it should apply all of the n filters to each of the 3 color channels of the input image. You should end up with 3n filter responses for the image. The final matrix that you return should have size H × W × 3n. See Figure 7.
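A minimal sketch of one way to structure this is shown below; the exact call signature of the provided RGB2Lab helper is an assumption, so adjust the conversion line to match the version in matlab/.

function [filterResponses] = extractFilterResponses(I, filterBank)
% Sketch only. Assumes RGB2Lab takes the three RGB channels and returns
% the three Lab channels; check the provided helper's actual interface.
    I = im2double(I);
    if size(I, 3) == 1
        I = repmat(I, [1 1 3]);            % handle grayscale inputs
    end
    [L, a, b] = RGB2Lab(I(:,:,1), I(:,:,2), I(:,:,3));
    lab = cat(3, L, a, b);

    [H, W, ~] = size(I);
    n = numel(filterBank);
    filterResponses = zeros(H, W, 3 * n);
    for i = 1:n
        for c = 1:3
            filterResponses(:, :, 3*(i-1) + c) = ...
                imfilter(lab(:, :, c), filterBank{i}, 'replicate');
        end
    end
end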

In your write-up: Show an image from the data set and 3 of its filter responses. Explain any artifacts you may notice. Also, briefly describe the CIE Lab color space, and why we would like to use it. We did not cover the CIE Lab color space in class, so you will need to look it up online.

Figure 7: Sample desert image and a few filter responses (in CIE Lab color space)

Q1.2 Collect sample of points from image         (5+15 points) Write two functions that return a list of points in an image that will then be clustered to generate visual words.

First, write a simple function that takes an image and an α, and returns a matrix of size α × 2 of random pixel locations inside the image.

[points] = getRandomPoints(I, alpha)
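A minimal sketch of this function is given below, assuming points are returned as (row, column) pairs; swap the ordering if the rest of your code expects (x, y) instead.

function [points] = getRandomPoints(I, alpha)
% Sketch only: sample alpha distinct pixel locations uniformly at random.
    [H, W, ~] = size(I);
    idx = randperm(H * W, alpha);          % alpha distinct linear indices
    [r, c] = ind2sub([H, W], idx);
    points = [r(:), c(:)];                 % alpha x 2, as (row, col)
end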

Next, write a function that uses the Harris corner detection algorithm to select key points from an input image, using a similar algorithm as shown in class.

[points] = getHarrisPoints(I, alpha, k)

Recall from class that the Harris corner detector finds corners by building a covariance matrix of edge gradients within a region around a point in the image. The eigenvectors of this matrix point in the two directions of greatest change; if the corresponding eigenvalues are both large, then this indicates a corner. See class slides for more details.

This function takes the input image I. I may be either a color or grayscale image, but the following operations should be done on the grayscale representation of the image. For each pixel, you need to compute the covariance matrix

M = Σ_{(u,v) ∈ window} [I_xx  I_xy; I_xy  I_yy]

where I_ab = (∂I/∂a)(∂I/∂b). You can use a 3×3 or 5×5 window. It is a good idea to precompute the image's X and Y gradients, instead of recomputing them in every iteration of the loop. For the sum, also think about how you could do it using a convolution filter.

You then want to detect corners by finding pixels whose covariance matrix eigenvalues are large. Since it is expensive to compute the eigenvalues explicitly, you should instead compute the response function

R = λ₁λ₂ − k(λ₁ + λ₂)² = det(M) − k tr(M)²

where det represents the determinant and tr denotes the trace of the matrix. Recall that when detecting corners with the Harris corner detector, corners are points where both eigenvalues are large. This is in contrast to edges (one eigenvalue is large while the other is small) and flat regions (both eigenvalues are small). In the response function, the first term becomes very small if one of the eigenvalues is small, making R < 0. Larger values of R indicate similarly large eigenvalues.

Instead of thresholding the response function, simply take the top α responses as the corners and return their coordinates. A good value for the k parameter is 0.04–0.06.

Note: The Harris detector can be implemented without any for loops, using basic Matlab functions and vectorization, particularly for the sums in computing M as well as the response function R. No marks will be taken off for slow code, but when computing the dictionary of visual words in Q1.3, this can be the difference between minutes and hours. Remember, you will be applying this function to about 1500 images.
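The sketch below shows one vectorized way to do this, assuming a 5×5 summation window, gradients from Matlab's gradient function, and (row, column) output; none of these choices are required.

function [points] = getHarrisPoints(I, alpha, k)
% Sketch only: vectorized Harris response, top-alpha responses as corners.
    if size(I, 3) == 3
        I = rgb2gray(I);
    end
    I = im2double(I);

    [Ix, Iy] = gradient(I);                % image gradients

    w = ones(5);                           % 5x5 summation window
    Sxx = imfilter(Ix .* Ix, w, 'replicate');
    Syy = imfilter(Iy .* Iy, w, 'replicate');
    Sxy = imfilter(Ix .* Iy, w, 'replicate');

    % R = det(M) - k * trace(M)^2 at every pixel
    R = (Sxx .* Syy - Sxy .^ 2) - k * (Sxx + Syy) .^ 2;

    [~, idx] = sort(R(:), 'descend');      % take the top-alpha responses
    [r, c] = ind2sub(size(R), idx(1:alpha));
    points = [r, c];
end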

These functions will be used in the next part to select α points from every training image.

In your write-up: Show the results of your corner detector on 3 different images.

Figure 8: Possible results for the getPoints functions

Q1.3 Compute Dictionary of Visual Words                                                            
You will now create the dictionary of visual words. Write the function:

function [dictionary] = getDictionary(imgPaths, alpha, K, method)

This function takes in a cell array of image paths. Load the provided traintest.mat to get a list of the training image paths. For each and every training image, load it and apply the filter bank to it. Then get α points for each image and put them into an array, where each row is the 3n-dimensional filter response at a single sampled pixel (n is the number of filters). If there are T training images total, then you will build a matrix of size αT × 3n, where each row corresponds to a single pixel, and there are αT total pixels collected.

method will be a string, either ‘random’ or ‘harris’, and will determine how the α points are taken from each image. If the method is ‘random’, then the α points will be selected using getRandomPoints(..), while method ‘harris’ will tell your function to use getHarrisPoints(..) to select the α points. Once you have done this, pass the matrix of sampled filter responses to Matlab’s kmeans function.

[~, dictionary] = kmeans(pixelResponses, K, 'EmptyAction', 'drop')

The result of Matlab’s kmeans function will be a matrix of size K × 3n, where each row represents the coordinates of a cluster center. This matrix will be your dictionary of visual words.
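A minimal sketch of getDictionary is shown below; the Harris k value of 0.05 and the assumption that imgPaths already contains loadable paths are illustrative choices, not requirements.

function [dictionary] = getDictionary(imgPaths, alpha, K, method)
% Sketch only. Collects alpha sampled filter responses per image into an
% (alpha*T) x 3n matrix, then clusters them with kmeans.
    filterBank = createFilterBank();
    n = numel(filterBank);
    T = numel(imgPaths);

    pixelResponses = zeros(alpha * T, 3 * n);
    for t = 1:T
        I = imread(imgPaths{t});
        filterResponses = extractFilterResponses(I, filterBank);

        if strcmp(method, 'random')
            points = getRandomPoints(I, alpha);
        else
            points = getHarrisPoints(I, alpha, 0.05);   % k = 0.05 assumed
        end

        for j = 1:alpha
            pixelResponses(alpha*(t-1) + j, :) = ...
                reshape(filterResponses(points(j,1), points(j,2), :), 1, []);
        end
    end

    [~, dictionary] = kmeans(pixelResponses, K, 'EmptyAction', 'drop');
end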

For testing purposes, use α = 50 and K = 100. Eventually you will want to use much larger numbers to get better results for the final part of the write-up.

This function can take a while to run. Start early and be patient. When it has completed, you will have your dictionary of visual words; save it and the filter bank you used in a .mat file. It may be helpful to write a computeDictionary.m that does the legwork of loading the training image paths, processing them, building the dictionary, and saving the mat file. This is not required, but will be helpful when you call it multiple times.

For this question, you must produce two dictionaries. One named dictionaryRandom.mat, which used the random method to select points and another named dictionaryHarris.mat which used the Harris method. Both must be handed in.

Part 2: Build Visual Scene Recognition System
Q2.1 Convert image to word map                                                                    
Write a function to map each pixel in the image to its closest word in the dictionary.

[wordMap] = getVisualWords(I, filterBank, dictionary)

I is the input image of size H × W × 3. dictionary is the dictionary computed previously. filterBank is the filter bank that was used to construct the dictionary. wordMap is an H × W matrix of integer labels, where each label corresponds to a word/cluster center in the dictionary.

Use the Matlab function pdist2 with Euclidean distance to do this efficiently. You can visualize your results with the function label2rgb.
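A minimal sketch using pdist2 is shown below; reshaping to one row per pixel keeps the distance computation to a single call.

function [wordMap] = getVisualWords(I, filterBank, dictionary)
% Sketch only: assign every pixel to its nearest dictionary word.
    [H, W, ~] = size(I);
    filterResponses = extractFilterResponses(I, filterBank);

    pixelResponses = reshape(filterResponses, H * W, []);   % (H*W) x 3n

    % distance from every pixel to every word, then take the argmin
    [~, labels] = min(pdist2(pixelResponses, dictionary, 'euclidean'), [], 2);
    wordMap = reshape(labels, H, W);
end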

Figure 9: Sample visual words plot, made from dictionary using α = 50 and K = 50

Once you are done, call the provided script batchToVisualWords. This function will apply your implementation of getVisualWords to every image in the training and testing set. The script will load traintest.mat, which contains the names and labels of all the data images. For each training image data/<category>/X.jpg, the script will save the word map file data/<category>/X.mat. For optimal speed, pass in the number of cores your computer has when calling this function. This will save you from having to keep rerunning getVisualWords in the later parts, unless you decide to change the dictionary.

In your write-up: Show the word maps for 3 different images from two different classes (6 images total). Do this for each of the two dictionary types (random and Harris). Are the visual words capturing semantic meanings? Which dictionary seems to be better in your opinion? Why?

Q2.2 Get Image Features                                                                                          
Create a function that extracts the histogram of visual words within the given image (i.e., the bag of visual words).

[ h ] = getImageFeatures(wordMap, dictionarySize)

h, the vector representation of the image, is a histogram of size 1×K, where dictionarySize is the number of clusters K in the dictionary. h(i) should be equal to the number of times visual word i occurred in the word map.

Since the images are of differing sizes, the total count in h will vary from image to image. To account for this, L1 normalize the histogram before returning it from your function.
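A minimal sketch is shown below, assuming a Matlab version with histcounts (older versions can use histc or accumarray instead).

function [h] = getImageFeatures(wordMap, dictionarySize)
% Sketch only: L1-normalized histogram of visual word counts.
    h = histcounts(wordMap(:), 1:(dictionarySize + 1));   % counts for words 1..K
    h = h / sum(h);                                       % L1 normalization
end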

Q2.3 Build Recognition System - Nearest Neighbors                                          
Now that you have built a way to get a vector representation for an input image, you are ready to build the visual scene classification system. This classifier will make use of nearest neighbor classification. Write a script buildRecognitionSystem.m that saves visionRandom.mat and visionHarris.mat and stores in them:

dictionary: your visual word dictionary, a matrix of size K × 3n.
filterBank: filter bank used to produce the dictionary. This is a cell array of image filters.
trainFeatures: T × K matrix containing all of the histograms of visual words of the T training images in the data set.
trainLabels: T × 1 vector containing the labels of each training image.
You will need to load train_imagenames and train_labels from traintest.mat. Load the dictionary from dictionaryRandom.mat and dictionaryHarris.mat that you saved in Q1.3.
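A sketch of the script for the random dictionary is below (repeat with dictionaryHarris.mat to produce visionHarris.mat). The variable names inside traintest.mat, the wordMap variable name saved by batchToVisualWords, and the relative data/ path are assumptions; adjust them to match the provided files.

% buildRecognitionSystem.m -- sketch for the random dictionary only.
load('traintest.mat', 'train_imagenames', 'train_labels');   % names assumed
load('dictionaryRandom.mat', 'dictionary', 'filterBank');

K = size(dictionary, 1);
T = numel(train_imagenames);

trainFeatures = zeros(T, K);
for t = 1:T
    wordMapFile = strrep(train_imagenames{t}, '.jpg', '.mat');
    tmp = load(fullfile('..', 'data', wordMapFile));          % path layout assumed
    trainFeatures(t, :) = getImageFeatures(tmp.wordMap, K);   % wordMap name assumed
end
trainLabels = train_labels(:);

save('visionRandom.mat', 'dictionary', 'filterBank', 'trainFeatures', 'trainLabels');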

Part 3: Evaluate Visual Scene Recognition System
Q3.1 Image Feature Distance                                                                                
For nearest neighbor classification you need a function to compute the distance between two image feature vectors. Write a function

[dist] = getImageDistance(hist1, hist2, method)

hist1 and hist2 are the two image histograms whose distance will be computed (with hist2 being the target) and returned in dist. The idea is that two images that are very similar should have a very small distance, while dissimilar images should have a larger distance.

method controls how the distance is computed, and will be set to either ‘euclidean’ or ‘chi2’. The first option tells the function to compute the Euclidean distance between the two histograms; the second uses the χ² distance.

Alternatively, you may also write the function

[dist] = getImageDistance(hist1, histSet, method)

which, instead of a second histogram, takes in a matrix of histograms and returns a vector of distances between hist1 and each histogram in histSet. This may make it possible to implement things more efficiently. Choose one or the other to hand in.
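A minimal sketch of the vectorized form is below; the χ² convention used here (with the 1/2 factor and eps to avoid division by zero) is one common choice, not the only acceptable one.

function [dist] = getImageDistance(hist1, histSet, method)
% Sketch only: hist1 is 1 x K, histSet is T x K, dist is T x 1.
    diff = bsxfun(@minus, histSet, hist1);
    if strcmp(method, 'euclidean')
        dist = sqrt(sum(diff .^ 2, 2));
    else                                   % 'chi2'
        denom = bsxfun(@plus, histSet, hist1) + eps;
        dist = 0.5 * sum((diff .^ 2) ./ denom, 2);
    end
end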

Q3.2 Evaluate Recognition System - NN and kNN                                            
Write a script evaluateRecognitionSystem_NN.m that evaluates your nearest neighbor recognition system on the test images. Nearest neighbor classification assigns the test image the same class as the "nearest" sample in your training set. "Nearest" is defined by your distance function.

Load traintest.mat and classify each of the test_imagenames files. Have the script report both the accuracy (# correct / # total), as well as the 8 × 8 confusion matrix C, where the entry C(i,j) records the number of times an image of actual class i was classified as class j.

The confusion matrix can help you identify which classes work better than others and quantify your results on a per-class basis. In a perfect classifier, you would only have entries on the diagonal of C (implying, for example, that an ‘auditorium’ always got correctly classified as an ‘auditorium’).

For each combination of dictionary (random or Harris) and distance metric (Euclidean and χ²), have your script print out the accuracy and the confusion matrix. Use disp/fprintf so we can tell which is which.
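For one dictionary/metric combination, the core of the evaluation might look like the sketch below; it assumes testFeatures (one row per test image) and testLabels have already been built from the test word maps, and that you are using the vectorized form of getImageDistance.

% sketch: nearest-neighbor classification plus confusion matrix
numClasses = 8;
C = zeros(numClasses);
for t = 1:size(testFeatures, 1)
    d = getImageDistance(testFeatures(t, :), trainFeatures, 'chi2');
    [~, nn] = min(d);                      % index of the nearest training image
    predicted = trainLabels(nn);
    C(testLabels(t), predicted) = C(testLabels(t), predicted) + 1;
end
accuracy = sum(diag(C)) / sum(C(:));
fprintf('chi2, Harris dictionary: accuracy = %.4f\n', accuracy);
disp(C);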

Now take the best combination of dictionary and distance metric, and write a script evaluateRecognitionSystem_kNN.m that classifies all the test images using k Nearest Neighbors. Have your script generate a plot of the accuracy for k from 1 to 40. For the best performing k value, print the confusion matrix. Note that this k is different from the dictionary size K; this k is the number of nearby points to consider when classifying a new test image.
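One way to classify a single test image with kNN is sketched below; mode breaks ties by returning the smallest label, which is one simple tie-breaking rule you could replace with something better (e.g., distance-weighted voting).

% sketch: k nearest neighbor vote for one test image
d = getImageDistance(testFeatures(t, :), trainFeatures, 'chi2');
[~, order] = sort(d, 'ascend');
kNearestLabels = trainLabels(order(1:k));
predicted = mode(kNearestLabels);          % ties resolved toward the smallest label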

In your write-up:

Include the output of evaluateRecognitionSystem_NN.m (4 confusion matrices and accuracies).
How do the performances of the two dictionaries compare? Is this surprising?
How do the two distance metrics affect the results? Which performed better? Why do you think this is?
Also include the output of evaluateRecognitionSystem_kNN.m (plot and confusion matrix). Comment on the best value of k. Is larger k always better? Why or why not? How did you choose to resolve ties?
Extra credit
QX.1 Evaluate Recognition System - Support Vector Machine                 (extra credit)
In class, you learned that the support vector machine (SVM) is a powerful classifier.

Write a script, evaluateRecognitionSystem_SVM.m, to do classification using SVM. Produce a new visionSVM.mat file with whatever parameters you need, and evaluate it on the test set.
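One possible sketch uses fitcecoc from the Statistics and Machine Learning Toolbox to train a multi-class SVM on the bag-of-words features; the linear kernel and box constraint below are starting points, not tuned values.

% sketch: train a multi-class SVM on the training histograms
load('visionHarris.mat', 'trainFeatures', 'trainLabels', 'dictionary', 'filterBank');
template = templateSVM('KernelFunction', 'linear', 'BoxConstraint', 1);
svmModel = fitcecoc(trainFeatures, trainLabels, 'Learners', template);
save('visionSVM.mat', 'svmModel', 'dictionary', 'filterBank');

% at test time: predictedLabels = predict(svmModel, testFeatures);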
