ELEN4720 Homework 2: Naive Bayes Classifier and Gaussian Process Model for Regression

Problem 1

In this problem you will derive a naive Bayes classifier. For a labeled set of data (y1,x1),...,(yn,xn), where for this problem y ∈ {0,1} and x is a D-dimensional vector of counts, the Bayes classifier observes a new x0 and predicts y0 as

$$y_0 = \arg\max_{y} \; p(y_0 = y \mid \pi) \prod_{d=1}^{D} p(x_{0,d} \mid \lambda_{y,d}).$$

The distribution p(y0 = y|π) = Bernoulli(y|π). What is “naive” about this classifier is the assumption that all D dimensions of x are independent. Assume that each dimension of x is Poisson distributed with a Gamma prior. The full generative process is

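$$\text{Data: } y_i \overset{iid}{\sim} \mathrm{Bern}(\pi), \quad x_{i,d} \mid y_i \sim \mathrm{Pois}(\lambda_{y_i,d}), \quad d = 1,\dots,D \qquad \text{Prior: } \lambda_{y,d} \overset{iid}{\sim} \mathrm{Gamma}(2,1)$$

Derive the solution for π and each λy,d by maximizing

$$\hat{\pi}, \hat{\lambda}_{0,1:D}, \hat{\lambda}_{1,1:D} = \arg\max_{\pi,\lambda} \; \sum_{i=1}^{n} \ln p(y_i \mid \pi) + \sum_{d=1}^{D} \Big( \ln p(\lambda_{0,d}) + \ln p(\lambda_{1,d}) + \sum_{i=1}^{n} \ln p(x_{i,d} \mid \lambda_{y_i,d}) \Big).$$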

Please separate your derivations as follows:

(a) Derive π̂ using the objective above.

(b) Derive λ̂y,d using the objective above, leaving y and d arbitrary in your notation.
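As a way to check a derivation numerically, the following is a minimal sketch of the classifier in Python with NumPy. It assumes the closed forms that setting the gradient of the objective to zero gives under a Gamma(a, b) prior with shape a and rate b, namely π̂ = (1/n) Σ yi and λ̂y,d = (a − 1 + Σ{i: yi=y} xi,d) / (b + Ny), where Ny is the number of training points in class y. Verify these against your own derivation before relying on them. The function names `fit_poisson_nb` and `predict_poisson_nb` are hypothetical, and the sketch assumes both classes appear in the training data.

```python
import numpy as np

def fit_poisson_nb(X, y, a=2.0, b=1.0):
    """MAP estimates for the Bernoulli-Poisson naive Bayes model.

    X: (n, D) array of counts; y: (n,) array of 0/1 labels.
    a, b: shape and rate of the Gamma prior (Gamma(2, 1) in the problem).
    """
    pi_hat = y.mean()                      # fraction of class-1 labels
    lam = np.empty((2, X.shape[1]))
    for c in (0, 1):
        Xc = X[y == c]                     # rows belonging to class c
        # Assumed closed form: (a - 1 + sum of counts) / (b + class size).
        lam[c] = (a - 1.0 + Xc.sum(axis=0)) / (b + Xc.shape[0])
    return pi_hat, lam

def predict_poisson_nb(X, pi_hat, lam):
    """Return argmax_y of ln p(y | pi) + sum_d ln Pois(x_d | lam[y, d])."""
    log_prior = np.log([1.0 - pi_hat, pi_hat])
    # ln Pois(x | lam) = x ln(lam) - lam - ln(x!); the ln(x!) term is the
    # same for both classes, so it is dropped from the comparison.
    log_lik = X @ np.log(lam).T - lam.sum(axis=1)
    return np.argmax(log_prior + log_lik, axis=1)
```

Working in log space avoids underflow from multiplying many small Poisson probabilities across the D dimensions.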

Problem 2

In this problem you will implement the naive Bayes classifier derived in Problem 1, as well as the kNN algorithm and logistic regression algorithm. The data consists of examples of spam and non-spam emails, of which there are 4600 labeled examples. The feature vector x is a 54-dimensional vector extracted from the email and y = 1 indicates a spam email.[1]

In every experiment below, randomly partition the data into 10 groups and run the algorithm 10 different times so that each group is held out as a test set one time. The final result you show should be the cumulative result across these 10 groups.
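One way to organize the hold-out procedure described above is sketched below in Python with NumPy. The helper name `ten_fold_predictions` and the `fit_predict(X_train, y_train, X_test)` callback signature are assumptions made for illustration, not part of the assignment. Each point is predicted exactly once, by a model trained on the other nine groups, so the returned vector is the cumulative result across all ten groups.

```python
import numpy as np

def ten_fold_predictions(X, y, fit_predict, n_folds=10, seed=0):
    """Predict every point once, using a model trained on the other folds.

    fit_predict is a hypothetical callback: it receives
    (X_train, y_train, X_test) and returns 0/1 predictions for X_test.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    # Random partition of the indices 0..n-1 into n_folds groups.
    folds = np.array_split(rng.permutation(n), n_folds)
    y_pred = np.empty(n, dtype=int)
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        y_pred[test_idx] = fit_predict(X[train_idx], y[train_idx], X[test_idx])
    return y_pred
```

With the predictions stored in the original row order, the overall accuracy is `np.mean(y_pred == y)`.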

(a)    Implement the naive Bayes classifier described above. In a 2 × 2 table, write the number of times that you predicted a class y data point (ground truth) as a class y0 data point (model prediction) in the (y,y0)-th cell of the table, where y and y0 can be either 0 or 1. There should be four values written in the table in your PDF. Next to your table, write the prediction accuracy—the sum of the diagonal divided by 4600. (The sum of all entries in the table should be 4600.)

(b)   In one figure, show a stem plot (stem() in Matlab) of the 54 Poisson parameters for each class averaged across the 10 runs. (This average is only used for plotting purposes on this homework. In practice you would relearn these parameters using the entire data set to find their final values.) Use the README file to make an observation about dimensions 16 and 52.

(c) Implement the k-NN classifier for k = 1,...,20. Use the ℓ1 distance for this problem. Plot the prediction accuracy as a function of k.
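For the k-NN part, a minimal sketch in Python with NumPy follows. It sorts the training points by ℓ1 distance once per test point and reads off the majority vote for every k from a running count, so all 20 values of k come from a single pass. The function name `knn_predict_all_k` is hypothetical, the tie-breaking rule (ties go to class 0) is an assumption since the problem does not specify one, and the sketch assumes there are at least `k_max` training points.

```python
import numpy as np

def knn_predict_all_k(X_train, y_train, X_test, k_max=20):
    """Return an (n_test, k_max) array; column k-1 holds the k-NN prediction.

    y_train must contain 0/1 integer labels.
    """
    ks = np.arange(1, k_max + 1)
    preds = np.empty((X_test.shape[0], k_max), dtype=int)
    for i, x in enumerate(X_test):
        # l1 (Manhattan) distance from x to every training point.
        dists = np.abs(X_train - x).sum(axis=1)
        nearest = np.argsort(dists, kind="stable")[:k_max]
        # ones[k-1] = number of class-1 labels among the k nearest points.
        ones = np.cumsum(y_train[nearest])
        # Predict 1 only on a strict majority; ties go to 0 (assumed rule).
        preds[i] = (2 * ones > ks).astype(int)
    return preds
```

The accuracy curve is then the column-wise mean of `preds == y_test[:, None]`, accumulated over the ten held-out groups.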

Problem 3

In this problem you will implement the Gaussian process model for regression. You will use the same data used for homework 1 to do this, which is again provided in the data zip file for this homework. Recall that the Gaussian process treats a set of N observations (x1,y1),...,(xN,yN), with xi ∈ ℝ^d and yi ∈ ℝ, as being generated from a multivariate Gaussian distribution as follows,

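$$y \sim \mathrm{Normal}(0, \sigma^2 I + K), \qquad K_{ij} = K(x_i, x_j) \overset{\text{use}}{=} \exp\Big\{-\tfrac{1}{b}\|x_i - x_j\|^2\Big\}.$$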

Here, y is an N-dimensional vector of outputs and K is an N × N kernel matrix. For this problem use the Gaussian kernel indicated above. In the lecture slides, we discuss making predictions for a new y0 given x0, which was Gaussian with mean µ(x0) and variance Σ(x0). The equations are shown in the slides.

There are two parameters that need to be set for this model as given above, σ² and b.

a)    Write code to implement the Gaussian process and to make predictions on test data.

b)   For b ∈ {5,7,9,11,13,15} and σ² ∈ {.1,.2,.3,.4,.5,.6,.7,.8,.9,1} (60 pairs (b,σ²) in total), calculate the RMSE on the 42 test points as you did in the first homework. Use the mean of the Gaussian process at the test point as your prediction. Show your results in a table.

c)    Which value was the best and how does this compare with the first homework? What might be a drawback of the approach in this homework (as given) compared with homework 1?

d)   To better understand what the Gaussian process is doing through visualization, re-run the algorithm using only the 4th dimension of xi (car weight). Set b = 5 and σ² = 2. Show a scatter plot of the data (x[4] versus y for each point). Also, plot as a solid line the predictive mean of the Gaussian process at each point in the training set. You can think of this problem as asking you to create a test set by duplicating xi[4] for each i in the training set and then to predict that test set.
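The prediction step in parts a) and b) can be sketched in Python with NumPy as follows. Two things are assumed here rather than taken from the text: the kernel has the form exp(−‖x − x′‖²/b), and the predictive mean is the standard Gaussian process regression expression µ(x0) = K(x0, Dn)(σ²I + Kn)⁻¹y, which should be checked against the lecture slides. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, b):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / b) (assumed form)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / b)

def gp_predict_mean(X_train, y_train, X_test, b, sigma2):
    """Predictive mean K(x0, D) (sigma2 * I + K)^{-1} y for each test row."""
    K = gaussian_kernel(X_train, X_train, b)
    K_star = gaussian_kernel(X_test, X_train, b)
    # Solve the linear system instead of forming the inverse explicitly.
    alpha = np.linalg.solve(sigma2 * np.eye(X_train.shape[0]) + K, y_train)
    return K_star @ alpha

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

For the grid in part b), calling `rmse(y_test, gp_predict_mean(X_train, y_train, X_test, b, sigma2))` inside a double loop over the listed values of b and σ² fills the 6 × 10 table. For part d), the same functions apply with `X_train[:, [3]]` (the 4th column, zero-indexed) passed as both the training and test inputs, with the points sorted by that column before drawing the line.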



[1] I’ve preprocessed the data. The original data is at https://archive.ics.uci.edu/ml/datasets/Spambase. More information about the meanings of the 54 dimensions of the data is provided in two accompanying files.
