$34.99
1. Conceptual questions. [15 points]
1. (5 points) Please compare the pros and cons of KDE over histogram, and give at least one advantage and disadvantage to each.
2. (5 points) Why you cannot use maximum likelihood estimation to directly estimate GMM? Then how to estimate the model of GMM?
3. (5 points) For the EM algorithm for GMM, please show how to use the Bayes rule to drive τki in a closed-form expression.
2. Density estimation: Psychological experiments. [45 points]
Recall in this case, the kernel density estimator (KDE) for a density is given by
,
.
For two-dimensional KDE, use a two-dimensional Gaussian kernel: for
,
where x1 and x2 are the two dimensions respectively
.
(a) (5 points) Form the 1-dimensional histogram and KDE to estimate the distributions ofamygdala and acc, respectively. For this question, you can ignore the variable orientation. Decide on a suitable number of bins so you can see the shape of the distribution clearly. Set an appropriate kernel bandwidth h > 0.
(b) (5 points) Form 2-dimensional histogram for the pairs of variables (amygdala, acc). Decide on a suitable number of bins so you can see the shape of the distribution clearly.
(c) (20 points) Use kernel-density-estimation (KDE) to estimate the 2-dimensional densityfunction of (amygdala, acc) (this means for this question, you can ignore the variable orientation). Set an appropriate kernel bandwidth h > 0.
Please show the two-dimensional KDE (e.g., two-dimensional heat-map, two-dimensional contour plot, etc.)
Please explain what you have observed: is the distribution unimodal or bi-modal? Are there any outliers?
Are the two variables (amygdala, acc) likely to be independent or not? Please support your argument with reasonable investigations.
(d) (10 points) We will consider the variable orientation and consider conditional distributions. Please plot the estimated conditional distribution of amygdala conditioning on political orientation: p(amygdala|orientation = c), c = 2,...,5, using KDE. Set an appropriate kernel bandwidth h > 0. Do the same for the volume of the acc: plot p(acc|orientation = c), c = 2,...,5 using KDE. (Note that the conditional distribution can be understood as fitting a distribution for the data with the same orientation. Thus you should plot 8 one-dimensional distribution functions in total for this question.)
Now please explain based on the results, can you infer that the conditional distribution of amygdala and acc, respectively, are different from c = 2,...,5? This is a type of scientific question one could infer from the data: Whether or not there is a difference between brain structure and political view.
Now please also fill out the conditional sample mean for the two variables:
c = 2 c = 3 c = 4 c = 5
amygdala
acc
Remark: As you can see this exercise, you can extract so much more information from density estimation than simple summary statistics (e.g., the sample mean) in terms of explorable data analysis.
(e) (5 points) Again we will consider the variable orientation. We will estimate the conditional joint distribution of the volume of the amygdala and acc, conditioning on a function of political orientation: p(amygdala, acc|orientation = c), c = 2,...,5. You will use two-dimensional KDE to achieve the goal; et an appropriate kernel bandwidth h > 0. Please show the two-dimensional KDE (e.g., two-dimensional heat-map, two-dimensional contour plot, etc.).
Please explain based on the results, can you infer that the conditional distribution of two variables (amygdala, acc) are different from c = 2,...,5? This is a type of scientific question one could infer from the data: Whether or not there is a difference between brain structure and political view.
3. Implementing EM for MNIST dataset. [40 points]
Implement the EM algorithm for fitting a Gaussian mixture model for the MNIST handwritten digits dataset. For this question, we reduce the dataset to be only two cases, of digits “2” and “6” only. Thus, you will fit GMM with C = 2. Use the data file data.mat or data.dat. True label of the data are also provided in label.mat and label.dat.
The matrix images is of size 784-by-1990, i.e., there are 1990 images in total, and each column of the matrix corresponds to one image of size 28-by-28 pixels (the image is vectorized; the original image can be recovered by mapping the vector into a matrix).
First use PCA to reduce the dimensionality of the data before applying to EM. We will put all “6” and “2” digits together, to project the original data into 4-dimensional vectors.
Now implement EM algorithm for the projected data (with 4-dimensions).
(In this question, we use the same set of data from the provided data files for training and testing)
(a) (10 points) Implement EM algorithm yourself. Use the following initialization
• initialization for mean: random Gaussian vector with zero mean
• initialization for covariance: generate two Gaussian random matrix of size n-byn: S1 and S2, and initialize the covariance matrix for the two components are
Σ1 = S1S1T + In, and Σ , where In is an identity matrix of size n-by-n.
Plot the log-likelihood function versus the number of iterations to show your algorithm is converging.
(b) (20 points) Report, the fitted GMM model when EM has terminated in your algorithmsas follows. Report the weights for each component, and the mean of each component, by mapping them back to the original space and reformat the vector to make them into 28-by-28 matrices and show images. Ideally, you should be able to see these means corresponds to some kind of “average” images. You can report the two 4-by-4 covariance matrices by visualizing their intensities (e.g., using a gray scaled image or heat map).
4. De-bias review system using EM. [Bonus, 10 points]
In this question, we will develop an algorithm to remove individual reviewer’s bias from their score. Consider the following problem. There are P papers submitted to a machine learning conference. Each of R reviewers reads each paper, and gives it a score indicating how good he/she thought that paper was. We let x(pr) denote the score that reviewer r gave to paper p. A high score means the reviewer liked the paper, and represents a recommendation from that reviewer that it be accepted for the conference. A low score means the reviewer did not like the paper.
We imagine that each paper has some “intrinsic” true value that we denote by µp, where a large value means it’s a good paper. Each reviewer is trying to estimate, based on reading the paper, what µp is; the score reported x(pr) is then reviewer r’s guess of µp.
All sorts of different random factors influence the reviewing process, and hence we will use a model that incorporates several sources of noise. Specifically, we assume that reviewers’s scores are generated by a random process given as follows:
y(p) ∼ N(µp,σp2) z(r) ∼ N(νr,τr2)
x(pr)|y(p),z(r) ∼ N(y(p) + z(r),σ2).
The variables y(p) and z(r) are independent; the variables (x,y,z) for different paper-reviewer pairs are also jointly independent. Also, we only ever observe the x(pr)s; thus, the y(p)s and z(r)s are all latent random variables.
We would like to estimate the parameters µp, σp2, νr, τr2. If we obtain good estimates of the papers “intrinsic values” µp, these can then be used to make acceptance/rejection decisions for the conference.
We will estimate the parameters by maximizing the marginal likelihood of the data {x(pr);p = 1,...,P,r = 1,...,R}. This problem has latent variables y(p)s and z(r)s, and the maximum likelihood problem cannot be solved in closed form. So, we will use EM.
Your task is to derive the EM update equations. For simplicity, you need to treat only and as parameters, i.e. treat σ2 (the conditional
variance of x(pr) given y(p) and z(r)) as a fixed, known constant.
1. Derive the E-step (5 points)
(a) The joint distribution p(y(p),z(r),x(pr)) has the form of a multivariate Gaussian density. Find its associated mean vector and covariance matrix in terms of the parameters and σ2. [Hint: Recognize that x(pr) can be written as
x(pr) = y(p) + z(r) + ϵ(pr), where ϵ(pr) ∼ N(0,σ2) is independent Gaussian noise.
Remark: John Platt (whose SMO algorithm you’ve seen) implemented a method quite similar to this one to estimate the papers’ true scores. (There, the problem was a bit more complicated because not all reviewers reviewed every paper, but the essential ideas are the same.) Because the model tried to estimate and correct for reviewers’ biases, its estimates of the paper’s value were significantly more useful for making accept/reject decisions than the reviewers’ raw scores for a paper.