Objective
The objective of this homework is twofold:
(a) Implementing an image classification algorithm, and gaining experience in working with Python functions that manipulate image files along the way;
(b) Understanding important theoretical properties of linear discriminant analysis using the Bayesian decision rule.
Important Concepts
Linear Discriminant Analysis
Recall from the lectures that for classification problems, there are several approaches to constructing decision boundaries for classifiers. In homework 2, we studied one such approach: linear least squares. Linear least squares is a discriminative method, whereas the Bayesian decision rule is a generative method. When framed under linear models, both methods can be related to linear discriminant analysis.
Suppose we have an input vector x ∈ R^d and K classes with class labels {1,2,...,K}. The discriminant function g_j for the j-th class is a mapping which maps an input vector x to a number. We assign the class label j ∈ {1,2,...,K} to x if

j = \arg\max_{i \in \{1,2,\ldots,K\}} g_i(x).    (1)

This then gives us the hypothesis function h : R^d → {1,2,...,K}, which is defined as

h(x) = \arg\max_{i \in \{1,2,\ldots,K\}} g_i(x).    (2)
Notice that for two classes, (2) reduces to

h(x) = \begin{cases} 1, & g_1(x) \ge g_2(x), \\ 2, & \text{otherwise}, \end{cases}

which is what we had in homework 2.
Bayesian Decision Rule
In this homework, the general form of the discriminant functions we shall adopt is based on the maximum a posteriori (MAP) estimation method from Bayesian statistics.
Let us examine this concept in the context of our classifier. Notationally, consider X, with range R^d, and the class label Y as random variables. For convenience, denote by p_j(x) the class-conditional density of X in class j, i.e., p_j(x) = p_{X|Y}(x|j), and by π_j the prior probability of class j, i.e., π_j = p_Y(j), so that \sum_{j=1}^{K} \pi_j = 1. By Bayes' theorem, the posterior distribution of Y given X is

p_{Y|X}(j \mid x) = \frac{p_j(x)\,\pi_j}{\sum_{k=1}^{K} p_k(x)\,\pi_k},    (3)

which we can interpret as the probability that the feature vector x belongs to class j. We define our discriminant functions from the log of the posterior; since the denominator of (3) does not depend on j, it can be dropped without affecting the argmax, giving
g_j(x) = \log p_j(x) + \log \pi_j    (4)

for all j and x. Therefore, by (2), the class label of every x ∈ R^d is obtained using the MAP method.
The Gaussian Model
In this homework, we will focus on analyzing the behavior of the discriminant functions, and the decision boundaries they generate, when we model each class-conditional density p_j as a Gaussian, mainly due to its analytical tractability.
From the lectures, recall that if

p_j(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right),    (5)

where |\Sigma_j| = \det \Sigma_j, then (4) is written as

g_j(x) = -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) - \frac{1}{2} \log |\Sigma_j| + \log \pi_j - \frac{d}{2} \log(2\pi).    (6)
In practice we do not know the mean µ_j, covariance matrix Σ_j, or prior probability π_j of any class, and so we need to estimate them based on our training data. Let \{x_n^{(j)}\}_{n=1}^{N_j}, j = 1, \ldots, K, be the training datasets for the K classes. Then:

\hat{\mu}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} x_n^{(j)}    (7)

for the sample mean of class j,

\hat{\Sigma}_j = \frac{1}{N_j - 1} \sum_{n=1}^{N_j} \left( x_n^{(j)} - \hat{\mu}_j \right)\left( x_n^{(j)} - \hat{\mu}_j \right)^T    (8)

for the unbiased in-class sample covariance matrix of class j, and for simplicity we can choose

\hat{\pi}_j = \frac{N_j}{\sum_{k=1}^{K} N_k}    (9)

for the prior probability of class j.
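As a quick illustration of (7)-(9), here is a minimal NumPy sketch of the estimates for one class, assuming the training samples are stored as the columns of a d × N_j array; the function and variable names are illustrative.

import numpy as np

def estimate_class_params(X_j, N_total):
    # X_j: d x N_j array whose columns are the training samples of class j
    # N_total: total number of training samples over all K classes
    N_j = X_j.shape[1]
    mu_hat = np.mean(X_j, axis=1)          # sample mean, eq. (7)
    Sigma_hat = np.cov(X_j, bias=False)    # unbiased in-class covariance, eq. (8)
    pi_hat = N_j / N_total                 # sample prior, eq. (9)
    return mu_hat, Sigma_hat, pi_hat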
Exercise 1: Maximum Likelihood Estimation of Covariance Matrix
In lecture, we proved that the ML estimate of the mean vector is the empirical average of the training samples. In this exercise we want to show that the ML estimate of the covariance matrix is

\hat{\Sigma}_{\mathrm{MLE}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T,    (10)

where \{x_n\}_{n=1}^{N} is a set of i.i.d. data sampled from a random variable X, and the mean µ is assumed to be known. We focus on the multi-dimensional Gaussian, so that the likelihood function is

p(x_1, \ldots, x_N \mid \Sigma) = \prod_{n=1}^{N} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) \right).    (11)

To make things simpler, we assume that both Σ and \hat{\Sigma}_{\mathrm{MLE}} are invertible in this exercise.
Tasks:
(a) We will show that (10) is the maximizer of the likelihood function defined in (11); a numerical sanity check of this claim is sketched after this task list.
(i) Prove the matrix identity x^T A x = \mathrm{tr}[A x x^T], where A ∈ R^{d×d}, and the trace operator is defined as \mathrm{tr}[A] = \sum_{i=1}^{d} A_{ii}.
(ii) Show that the likelihood function can be written in the form:

p(x_1, \ldots, x_N \mid \Sigma) = \frac{1}{(2\pi)^{Nd/2} |\Sigma|^{N/2}} \exp\!\left( -\frac{N}{2}\, \mathrm{tr}\!\left[ \Sigma^{-1} \hat{\Sigma}_{\mathrm{MLE}} \right] \right)    (12)
(iii) Let A = \Sigma^{-1} \hat{\Sigma}_{\mathrm{MLE}}, and let λ_1, ..., λ_d be the eigenvalues of A. Show that the result from the previous part leads to:

p(x_1, \ldots, x_N \mid \Sigma) = \frac{1}{(2\pi)^{Nd/2} |\hat{\Sigma}_{\mathrm{MLE}}|^{N/2}} \left( \prod_{i=1}^{d} \lambda_i \right)^{N/2} \exp\!\left( -\frac{N}{2} \sum_{i=1}^{d} \lambda_i \right)    (13)

Hint: For a matrix A with eigenvalues λ_1, ..., λ_d, \mathrm{tr}[A] = \sum_{i=1}^{d} \lambda_i.
(iv) Show that the likelihood function is maximized by the choice λi = 1 for all i. Therefore, (10) maximizes (11).
(b) Alternatively, directly compute the derivative of the likelihood function (11) with respect to Σ, set it to (matrix-valued) zero, and verify that (10) still solves your equation.
Hint: You might find it more convenient to work with the precision matrix P = Σ^{-1}, and consider the following matrix calculus identities:

\frac{\partial\, \mathrm{tr}[AX]}{\partial X} = \frac{\partial\, \mathrm{tr}[XA]}{\partial X} = A, \qquad \frac{\partial |X|}{\partial X} = |X|\, X^{-1}    (14)
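Before the analytical derivation, a quick numerical sanity check of the identity in (a)(i) and of the claim that (10) maximizes (11) can be reassuring. The sketch below is only an illustration on synthetic data; the variable names and the chosen Σ are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 5000
mu = np.zeros(d)
Sigma_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 1.0, 0.3],
                       [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, Sigma_true, size=N)   # rows are the samples x_1, ..., x_N

# (a)(i): check x^T A x = tr[A x x^T] on a random example
A = rng.standard_normal((d, d))
x = X[0]
print(np.isclose(x @ A @ x, np.trace(A @ np.outer(x, x))))        # True

# (10): ML estimate of the covariance with the mean known
Sigma_mle = (X - mu).T @ (X - mu) / N

def log_likelihood(Sigma):
    # Log of (11), dropping the constant -N*d/2 * log(2*pi)
    diff = X - mu
    quad = np.einsum('ni,ij,nj->', diff, np.linalg.inv(Sigma), diff)
    return -0.5 * N * np.log(np.linalg.det(Sigma)) - 0.5 * quad

# The log-likelihood should not increase when Sigma is moved away from Sigma_mle
print(log_likelihood(Sigma_mle) >= log_likelihood(Sigma_mle + 0.1 * np.eye(d)))   # True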
Exercise 2: Cat Classification
Image classification is an important problem in computer vision and (probably) the most widely used test bed problem in artificial intelligence. In this exercise, we are going to study a simplified version of a common image classification problem by classifying foreground and background regions in an image, using a Gaussian classifier.
The image that we are going to classify consists of a cat and some grass.[1] The size of this image is 500 × 375 pixels. The left-hand side of Figure 1 shows the image, and the right-hand side of Figure 1 shows a manually labeled "ground truth". Your task is to do as much as you can to extract the cat from the grass, and to compare your result with the "ground truth". However, before proceeding to tackle the task, let us establish some conventions and understand further how our classifier is going to work.
We first need to understand what a patch is in an image. By patch we mean a small neighborhood surrounding the pixel. For example, when we say extract an 8 × 8 patch at pixel (i,j), we mean that we want to extract Y[i:i+8, j:j+8] from the image. In this exercise, we assume the size of our patches to be 8 × 8.
Figure 1: The “Cat and Grass” image.
Our classifier will be based on the MAP method discussed before, using the discriminant functions (6), i.e., we model the class-conditional density of each class as a Gaussian. More specifically, the MAP procedure is as follows: for any patch x ∈ R^64 (a vectorized 8 × 8 patch) in the image, decide according to

g_{\mathrm{cat}}(x) \;\underset{\mathrm{grass}}{\overset{\mathrm{cat}}{\gtrless}}\; g_{\mathrm{grass}}(x)    (15)
For purposes of testing, we adopt the convention of assigning the value 1 to the patch x if it is classified as "cat", and 0 otherwise (this is the convention the "ground truth" image is constructed with). Furthermore, you are given the training data files train_cat.txt and train_grass.txt. The sizes of the arrays in these files are 64 × N, where N is the number of training samples and 64 is the size of a vectorized 8 × 8 block. You will need to use these data to compute the necessary parameters for your Gaussian classifier via formulas (7), (8) and (9).
Now, let us start training and testing the classifier.
Tasks:
(a) We want to train the classifier first.
(i) Load the data using

import numpy as np
train_cat = np.matrix(np.loadtxt('train_cat.txt', delimiter=','))
train_grass = np.matrix(np.loadtxt('train_grass.txt', delimiter=','))

The data need to be stored as matrices for later use.
(ii) Now compute the sample mean vectors µ̂_cat, µ̂_grass, sample covariance matrices Σ̂_cat, Σ̂_grass, and sample priors π̂_cat, π̂_grass using the data arrays. Organize your code into a function for cleaner presentation.
Hint: If you want to convert an 8 × 8 patch z into a 64 × 1 column vector, you can type
z_vector = z.flatten('F')
To convert z_vector back to a patch, you can type
z = np.reshape(z_vector, (8, 8), order='F')
(b) Let us now test the classifier on cat_grass.jpg.
(i) Read the testing image cat_grass.jpg. The suggested way to read it is

import matplotlib.pyplot as plt
Y = plt.imread('cat_grass.jpg') / 255

We divide the whole array by 255 to convert the image into a floating-point array with values in the range [0, 1]. This conversion is necessary because the training data lie in the same range.
(ii) Write a function that outputs your classifier's classification result on every overlapping patch in the testing image cat_grass.jpg. Your function should have the following form (a hedged sketch of the intermediate steps is given after this task list):

M, N = Y.shape
Output = np.zeros((M-8, N-8))
for i in range(M-8):
    for j in range(N-8):
        z = Y[i:i+8, j:j+8]
        # intermediate steps
        # if patch z is classified as cat, set Output[i, j] = 1

You can plot the output using plt.imshow(Output*255, cmap='gray').
Remark 1: Since the double for loop runs over every i ∈ {0, 1, ..., M−9} and j ∈ {0, 1, ..., N−9}, this way of classification indeed has the patches overlapping.
Remark 2: By performing the above procedure, we leave out the boundary pixels. Of course, this can be fixed (how?), but we will not pursue it here.
(iii) Carry out the same task as in (ii), except now classify the patches non-overlappingly, i.e., independently of each other; in other words, your patches should now have top-left indices (i, j) ∈ {(0,0), (0,8), (8,0), ...}. Be cautious about the boundary pixels: we will still ignore those pixels in our classifier, i.e., for the pixels belonging to the set

(\{Y[i,j] \mid i > M-8\} \cup \{Y[i,j] \mid j > N-8\}) \cap (\{Y[i,j] \mid i \bmod 8 \neq 0\} \cup \{Y[i,j] \mid j \bmod 8 \neq 0\}),    (16)

leave the corresponding entries of your Output matrix at 0.
(iv) Let us examine the performance of the classifiers constructed in (ii) and (iii) by comparing the classification results to the "ground truth" image truth.png. To do so, we use the mean absolute error as the comparison metric. The mean absolute error between an estimate y and the ground truth y* is defined as

\mathrm{MAE}(y, y^{*}) = \frac{1}{N} \sum_{n=1}^{N} |y_n - y^{*}_n|.
Report on the accuracy of the two versions of the Gaussian classifier.
Hint: Remember to read truth.png the same way you did in (b)(i).
(v) Go to the internet and download an image with similar content: a cheetah on grass, a cat on grass, or something like that. Apply your classifier to the image, and submit your resulting mask. Does it perform well? If yes, why? If no, what could go wrong? Write one to two bullet points to explain your findings. Please be brief; one sentence for each bullet point.
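As referenced in (b)(ii), here is a minimal sketch of how the intermediate steps of the patch classifier might look, assuming the estimates µ̂, Σ̂, π̂ for each class have already been computed as in (a)(ii). The helper name make_discriminant and the precomputation strategy are illustrative assumptions, not a prescribed implementation.

import numpy as np

def make_discriminant(mu_hat, Sigma_hat, pi_hat):
    # Returns g(z) = -(1/2)(z-mu)^T Sigma^{-1} (z-mu) - (1/2)log|Sigma| + log(pi),
    # i.e., (6) with the common -(d/2)log(2*pi) term dropped, since it cancels in (15).
    Sigma = np.asarray(Sigma_hat)
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)               # numerically stable log|Sigma|
    const = -0.5 * logdet + np.log(pi_hat)
    def g(z_vector):
        diff = np.asarray(z_vector).ravel() - np.asarray(mu_hat).ravel()
        return float(-0.5 * diff @ Sigma_inv @ diff + const)
    return g

# Illustrative use inside the double loop of (b)(ii):
# g_cat = make_discriminant(mu_cat, Sigma_cat, pi_cat)
# g_grass = make_discriminant(mu_grass, Sigma_grass, pi_grass)
# z_vector = z.flatten('F')
# if g_cat(z_vector) > g_grass(z_vector):
#     Output[i, j] = 1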
Exercise 3: Connection to Linear Least-Squares
In this exercise, let us take a step back and study an important theoretical property of the Gaussian classifier: its connection to the linear least-squares (LLS) classifier we studied in homework 2, in the 2-class case. This will shed light on interesting geometric properties of the two classifiers. Before any derivations, we first state some assumptions and recap some theory.
Suppose that we have two classes of data, with datasets \{x_n^{(1)}\}_{n=1}^{N_1} and \{x_n^{(2)}\}_{n=1}^{N_2}, where x_n^{(j)} ∈ R^d. Recall from lecture that unless the two classes share the same in-class sample covariance matrix, i.e.,

\hat{\Sigma}_1 = \hat{\Sigma}_2 =: \hat{\Sigma},    (17)

the discriminant functions from (6) are not going to be linear. Recalling that the decision boundary of the LLS classifier is linear, we will limit ourselves to the equal-covariance case in this exercise, because only linear discriminant functions produce linear decision boundaries. Additionally, we assume \hat{\Sigma} is invertible.
Let us also recall some details of the LLS classifier. We encoded class 1 with class label +1, and class 2 with class label −1. The decision boundary of the classifier was described by the set {x ∈ R^d | w^T x + w_0 = 0}, where w and w_0 are characterized by the normal equation

A^T A \begin{bmatrix} w \\ w_0 \end{bmatrix} = A^T b,    (18)

where A ∈ R^{(N_1+N_2)×(d+1)} and b ∈ R^{N_1+N_2} are formed from the x_n^{(j)}'s and the class labels.
In this exercise, we are going to show that when N1 = N2, the Gaussian classifier produces the same decision boundary between the two classes as does the linear least square classifier.
Tasks:
(a) To warm up, write the expression for the decision boundary of the linear Gaussian classifier in the form {x ∈ Rd| βTx + β0 = 0} (i.e. find β and β0).
(b) Instead of adhering to the problem setup strictly, we prove a more general preliminary result first. But before that, we need to prove a technical lemma.
First, let us give a general encoding c_1 ∈ R to class 1 (replacing +1) and c_2 ∈ R to class 2 (replacing −1), where c_1 ≠ c_2. Note that the expression of b depends on c_1 and c_2. Also, for convenience, denote N = N_1 + N_2.
We want to show that for w satisfying (18), the following equation holds:
(19)
The derivation is somewhat complicated, so we will show the result in a few steps:
(i) Prove that
A (20)
Hint: Write \hat{\Sigma} in terms of the x_n^{(j)}'s and \hat{\mu}_j's using its definition, and observe its relation to [A^T A]_{1,1}.
(ii) Prove that
A (21)
(iii) Using previous results and equation (18), show that
(iv) Prove that the left hand side of the equation from the last part can be simplified to the following:
(23)
Hint: This problem requires some patience. Expand the expression, and prove that the outer product terms
(24)
(v) Finally, prove that (19) holds.
(c) As a corollary of the lemma we showed in (b), show that w must be in the direction of β (i.e. w is a scalar multiple of β), even though our encoding of the classes for the LLS classifier is completely arbitrary.
(d) Show that in the case of c_1 = +1, c_2 = −1 (i.e., the original encoding), if N_1 = N_2, then the decision boundary generated by the Gaussian classifier is the same as that generated by the LLS classifier. In other words, show that there exists some scalar a such that aβ = w and aβ_0 = w_0.
(e) Now let us verify our theory with a simple experiment, carried out in a few steps (a rough code sketch is given after this task list):
(i) Generate two clusters of data, each with Gaussian distribution, equal covariance but different means, and each of size 1000. Choose whatever mean vectors you like, but try to make the two clusters at least visually separable. Make a scatter plot of the data.
(ii) Using the expression you obtained in (a), plot the decision boundary on top of the scatter plot of the two classes of data you generated in the previous part.
(iii) Now obtain the decision boundary by solving the linear least-squares problem in the same way you did in homework 2, i.e., solve for the optimal w and w_0 satisfying (18). Plot this decision boundary and compare it to the previous one. What do you see?
(f) Having gone through this exercise, make some further comments on how you would interpret (18) geometrically, in the specific case of N1 = N2 = N/2 and c1 = +1,c2 = −1.
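For part (e), the following is a rough sketch of the experiment, using arbitrary illustrative means and a shared covariance; the boundary of the Gaussian classifier from part (a) is left for you to overlay in the same way.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
N1 = N2 = 1000
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])              # shared covariance (arbitrary choice)
mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 3.0])   # visually separable means (arbitrary choice)
X1 = rng.multivariate_normal(mu1, Sigma, size=N1)       # class 1, label +1
X2 = rng.multivariate_normal(mu2, Sigma, size=N2)       # class 2, label -1

# Linear least squares as in (18): each row of A is [x^T, 1], and b holds the class labels
A = np.hstack([np.vstack([X1, X2]), np.ones((N1 + N2, 1))])
b = np.concatenate([np.ones(N1), -np.ones(N2)])
theta, *_ = np.linalg.lstsq(A, b, rcond=None)           # theta = [w, w0]
w, w0 = theta[:2], theta[2]

# Scatter plot of the data and the LLS decision boundary w^T x + w0 = 0
plt.scatter(X1[:, 0], X1[:, 1], s=5, label='class 1')
plt.scatter(X2[:, 0], X2[:, 1], s=5, label='class 2')
xs = np.linspace(-4, 8, 200)
plt.plot(xs, -(w[0] * xs + w0) / w[1], 'k-', label='LLS boundary')
# Overlay the Gaussian-classifier boundary beta^T x + beta_0 = 0 from part (a) the same way.
plt.legend()
plt.show()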
[1] Image Source: http://www.robots.ox.ac.uk/vgg/data/pets/