EECS545 - Machine Learning -   Homework #2 - Solved

Source Code Instruction: Your source code should run under an environment with following libraries:

•   Python 3.6+

•   Numpy (for implementations of algorithms)

•   Matplotlib (for plots)

Please do not use any other library unless otherwise instructed. You should be able to load the data and execute your code on someone else's computer with the environment described above. In other words, work with the setup we provide and do not change the directory structure. Use relative paths instead of absolute paths to load the data. Note that the outputs of your source code must match your solution.

Credits

Stanford CS229 and Bishop PRML.

1           [21 points] Logistic regression
(a)    [8 points] Consider the log-likelihood function for logistic regression:

$$\ell(w) = \sum_{i=1}^{N} y^{(i)} \log h(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h(x^{(i)})\right),$$

where $h(x) = g(w^T x) = \frac{1}{1 + e^{-w^T x}}$ is the logistic (sigmoid) function.

Find the Hessian $H$ of this function, and show that it is negative semi-definite and thus ℓ is concave and has no local maxima other than the global one. That is, show that

$$z^T H z \leq 0$$

for any vector $z$. [Hint: You might want to start by showing that $\sum_i \sum_j z_i x_i x_j z_j = (x^T z)^2$. Note that $(x^T z)^2 \geq 0$.]

(b)    [8 points] Using the H you calculated in part (a), write down the update rule implied by Newton's method for optimizing ℓ(w). Now use this rule (and not a library function) to implement Newton's method and apply it to the binary classification problem specified in the files q1x.npy and q1y.npy. The two columns of q1x.npy represent the inputs ($x^{(i)}$) and q1y.npy represents the outputs ($y^{(i)} \in \{0,1\}$), with one training example per row. Initialize Newton's method with $w = 0$ (the vector of all zeros). What are the coefficients $w$, including the intercept term, resulting from your fit? [Hint: You might refer to the code in q1_hint.py to start programming.]
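For reference, a minimal sketch of such a Newton's-method loop is given below. It is an illustration only, not the official solution: it assumes q1x.npy is an (N, 2) input array and q1y.npy an (N,) array of {0, 1} labels, and it prepends the intercept column by hand.

```python
import numpy as np

# Minimal sketch of Newton's method for logistic regression (not the official
# solution). Assumes q1x.npy is an (N, 2) array of inputs and q1y.npy an (N,)
# array of {0, 1} labels, loaded with relative paths as required.
x = np.load('q1x.npy')                        # shape (N, 2)
y = np.load('q1y.npy')                        # shape (N,)
X = np.hstack([np.ones((x.shape[0], 1)), x])  # prepend intercept column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])                      # initialize w = 0
for _ in range(20):                           # Newton's method converges in a few steps
    h = sigmoid(X @ w)                        # predicted probabilities h(x^(i))
    grad = X.T @ (y - h)                      # gradient of the log-likelihood
    H = -(X.T * (h * (1 - h))) @ X            # Hessian (negative semi-definite)
    w = w - np.linalg.solve(H, grad)          # Newton update: w <- w - H^{-1} grad
print(w)                                      # coefficients, intercept first
```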

(c)     [5 points] Plot the training data. (Your axes should be x1 and x2, corresponding to the two coordinates of the inputs, and you should use a different symbol for each point plotted to indicate whether that example had label 1 or 0.) Also plot on the same figure the decision boundary fit by logistic regression. (This should be a straight line showing the boundary separating the region where h(x) > 0.5 from where h(x) ≤ 0.5.)
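A minimal plotting sketch along these lines, reusing the hypothetical x, y, and fitted w (intercept first) from the previous snippet:

```python
import numpy as np
import matplotlib.pyplot as plt

# Plotting sketch (an illustration, not the official solution). Assumes x (N, 2),
# y (N,) from part (b) and a fitted weight vector w = [w0, w1, w2], intercept first.
plt.scatter(x[y == 0, 0], x[y == 0, 1], marker='o', label='y = 0')
plt.scatter(x[y == 1, 0], x[y == 1, 1], marker='x', label='y = 1')

# Decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1) / w2
x1 = np.linspace(x[:, 0].min(), x[:, 0].max(), 100)
x2 = -(w[0] + w[1] * x1) / w[2]
plt.plot(x1, x2, 'k-', label='decision boundary')

plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.show()
```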

2           [27 points] Softmax Regression via Gradient Ascent
Gradient ascent is an algorithm used to find parameters that maximize a certain expression (contrary to gradient descent, which is used to minimize an expression). For some function $f(w)$, gradient ascent finds $w^* = \arg\max_w f(w)$ according to the following pseudo-code:

 

Algorithm 1 Gradient Ascent

w∗ ← random
repeat
    w∗ ← w∗ + α ∇w f(w∗)
until convergence
return w∗
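As a concrete illustration of Algorithm 1 (a toy sketch, not part of any assignment question), the loop below maximizes the simple concave function $f(w) = -\|w - c\|^2$, whose gradient is available in closed form; the target $c$, the learning rate, and the stopping rule are arbitrary choices.

```python
import numpy as np

# Toy illustration of Algorithm 1: maximize f(w) = -||w - c||^2, whose
# gradient is grad f(w) = -2 (w - c). The target c, learning rate alpha, and
# stopping rule are arbitrary choices for this sketch.
c = np.array([1.0, -2.0])
alpha = 0.1

w = np.random.randn(2)                 # w* <- random
for _ in range(1000):                  # "repeat ... until convergence"
    grad = -2.0 * (w - c)              # gradient of f at the current w*
    w_new = w + alpha * grad           # w* <- w* + alpha * grad f(w*)
    if np.linalg.norm(w_new - w) < 1e-8:
        break
    w = w_new
print(w)                               # converges to c, the maximizer of f
```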

 

Softmax regression is a multiclass classification algorithm. Given a labeled dataset D = {(x(1),y(1)), (x(2),y(2)),...,(x(N),y(N))}, where y(i) ∈ {1,2,...,K} (total K classes), softmax regression computes the probability that an example x belongs to a class k as:

$$p(y = k \mid x, w) = \frac{\exp\left(w_k^T \phi(x)\right)}{\sum_{j=1}^{K} \exp\left(w_j^T \phi(x)\right)},$$

where $w_k$ is a weight vector for class $k$ and $\phi(x)$ is the feature vector of example $x$. The above expression is over-parametrized, meaning that there is more than one unique $\{w_1, w_2, \ldots, w_K\}$ that gives identical probability measures for $p(y = k \mid x, w)$. A unique solution can be obtained using only $K-1$ weight vectors $w = \{w_1, w_2, \ldots, w_{K-1}\}$ and fixing $w_K = 0$:

$$p(y = k \mid x, w) = \frac{\exp\left(w_k^T \phi(x)\right)}{1 + \sum_{j=1}^{K-1} \exp\left(w_j^T \phi(x)\right)} \qquad \forall k \in \{1, 2, \ldots, K-1\}$$

 

We define the likelihood of the $i$-th training example $p(y^{(i)} \mid x^{(i)}, w)$ as:

$$p(y^{(i)} \mid x^{(i)}, w) = \prod_{k=1}^{K} \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}(y^{(i)} = k)}$$

where $\mathbb{I}(\cdot)$ is the indicator function. We define the likelihood of the full training set as:

$$L(w) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}, w) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}(y^{(i)} = k)}$$

Finally, we define the log-likelihood as:

$$\ell(w) = \log L(w) = \sum_{i=1}^{N} \sum_{k=1}^{K} \log\left( \left[ p(y^{(i)} = k \mid x^{(i)}, w) \right]^{\mathbb{I}(y^{(i)} = k)} \right)$$

(a)    [13 points] Derive the gradient ascent update rule for the log-likelihood of the training data. In other words, derive the expression $\nabla_{w_m} \ell(w)$ for $m = 1, \ldots, K-1$. Show that:

$$\nabla_{w_m} \ell(w) = \sum_{i=1}^{N} \phi(x^{(i)}) \left[ \mathbb{I}(y^{(i)} = m) - p(y^{(i)} = m \mid x^{(i)}, w) \right]$$

Remember that in this notation, $w_k$ is a weight vector for class $k$, NOT the $k$-th element of $w$.

[Hints: $\log a^b = b \log(a)$. Further, it helps to consider the two cases separately: one for $y^{(i)} = k = m$, and another for $y^{(i)} = k \neq m$; this is equivalent to using the Kronecker delta $\delta_{km}$.]

(b)    [14 points] Using the gradient computed in part (a), implement gradient ascent for softmax regression in q2.py. Use a learning rate α = 0.0005. Load q2_data.npz, which is a dictionary that contains q2x_train, q2y_train, q2x_test, and q2y_test. Train your classifier on the training data and report the accuracy on the test data, both in your writeup and as output of your submitted code.

Implement your code in q2.py. Recall that softmax regression classifies an example x as:

$$y = \arg\max_{y'} p(y' \mid x, w)$$

While you must implement your own softmax regression from scratch, you can use the logistic regression class from sklearn (sklearn.linear_model.LogisticRegression) to validate your results. Your accuracy should not be much lower than that of sklearn's LogisticRegression.
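A minimal sketch of the training loop described above is shown below. It is an illustration, not the required q2.py: it assumes q2_data.npz holds the arrays named in part (b), that labels are integers in {1, ..., K}, and that φ(x) simply appends a bias term.

```python
import numpy as np

# Sketch of softmax regression trained by gradient ascent (an illustration,
# not the required q2.py). Assumes q2_data.npz contains q2x_train, q2y_train,
# q2x_test, q2y_test, that labels are integers 1..K, and that phi(x) just
# appends a constant 1 (bias) to x.
data = np.load('q2_data.npz')
X_tr, y_tr = data['q2x_train'], data['q2y_train'].astype(int).ravel()
X_te, y_te = data['q2x_test'], data['q2y_test'].astype(int).ravel()

def phi(X):
    return np.hstack([X, np.ones((X.shape[0], 1))])    # append bias feature

K = int(max(y_tr.max(), y_te.max()))
Phi_tr, Phi_te = phi(X_tr), phi(X_te)
W = np.zeros((K - 1, Phi_tr.shape[1]))                  # w_K is fixed to 0
alpha = 0.0005

def probs(W, Phi):
    # p(y = k | x, w) for k = 1..K, with the K-th logit fixed to 0
    logits = np.hstack([Phi @ W.T, np.zeros((Phi.shape[0], 1))])
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(500):
    P = probs(W, Phi_tr)                                # (N, K)
    Y = np.eye(K)[y_tr - 1]                             # one-hot I(y_i = k)
    grad = (Y - P)[:, :K - 1].T @ Phi_tr                # gradient from part (a)
    W = W + alpha * grad                                # gradient ascent step

pred = probs(W, Phi_te).argmax(axis=1) + 1              # argmax over classes
print('test accuracy:', np.mean(pred == y_te))
```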


3           [22 points] Gaussian Discriminant Analysis
Suppose we are given a dataset $\{(x^{(i)}, y^{(i)}); i = 1, \ldots, N\}$ consisting of N independent examples, where $x^{(i)} \in \mathbb{R}^M$ are M-dimensional vectors and $y^{(i)} \in \{0, 1\}$. We will model the joint distribution of $(x^{(i)}, y^{(i)})$ using:

$$p(y^{(i)}) = \phi^{y^{(i)}} (1 - \phi)^{1 - y^{(i)}}$$

$$p(x^{(i)} \mid y^{(i)} = 0) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_0)^T \Sigma^{-1} (x^{(i)} - \mu_0) \right)$$

$$p(x^{(i)} \mid y^{(i)} = 1) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_1)^T \Sigma^{-1} (x^{(i)} - \mu_1) \right)$$

Here, the parameters of our model are ϕ,Σ,µ0, and µ1. (Note that while there are two different mean vectors µ0 and µ1, there is only one covariance matrix Σ.)

(a)    [8 points] Suppose we have already fit ϕ, Σ, µ0, and µ1, and now want to make a prediction at some new query point x. Show that the posterior distribution of the label (y) at x takes the form of a logistic function, and can be written as

$$p(y = 1 \mid x; \phi, \Sigma, \mu_0, \mu_1) = \frac{1}{1 + \exp(-w^T x)},$$

where $w$ is a function of $\phi, \Sigma, \mu_0$, and $\mu_1$. [Hint 1: For part (a) only, you may want to redefine $x$ to be an $(M+1)$-dimensional vector by adding an extra coordinate $x_0 = 1$.] [Hint 2: $w^T x$ is a scalar.]

(b)    [8 points] When M (the dimension of $x^{(i)}$) is 1, then $\Sigma = \left[\sigma^2\right]$ becomes a scalar, and its determinant is $|\Sigma| = \sigma^2$. Accordingly, the maximum likelihood estimates of the parameters are given by

$$\phi = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1), \qquad \mu_0 = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0)\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 0)}, \qquad \mu_1 = \frac{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)\, x^{(i)}}{\sum_{i=1}^{N} \mathbb{I}(y^{(i)} = 1)},$$

$$\Sigma = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)} - \mu_{y^{(i)}}\right) \left(x^{(i)} - \mu_{y^{(i)}}\right)^T.$$

The log-likelihood of the data is

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{N} p(x^{(i)} \mid y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \phi)$$

By maximizing ℓ with respect to the four parameters, prove that the maximum likelihood estimates of ϕ,Σ,µ0, and µ1 are indeed the ones described above. (You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of µ0 and µ1 above are non-zero.)
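For intuition, the closed-form estimates above can be checked numerically. The sketch below is an illustration only (not part of the required proof); X is a hypothetical (N, M) data matrix and y a synthetic {0, 1} label vector.

```python
import numpy as np

# Numerical sketch of the GDA maximum likelihood estimates above (for sanity-
# checking a derivation, not part of the required proof). X is a hypothetical
# (N, M) data matrix and y an (N,) array of {0, 1} labels.
rng = np.random.default_rng(0)
N, M = 200, 3
y = rng.integers(0, 2, size=N)
X = rng.normal(size=(N, M)) + y[:, None]     # synthetic data with class-shifted means

phi = np.mean(y == 1)                        # MLE of the class prior
mu0 = X[y == 0].mean(axis=0)                 # MLE of mu_0
mu1 = X[y == 1].mean(axis=0)                 # MLE of mu_1
diff = X - np.where(y[:, None] == 1, mu1, mu0)
Sigma = diff.T @ diff / N                    # shared covariance MLE
print(phi, mu0, mu1, Sigma, sep='\n')
```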

(c) [6 points] Prove part (b) for M ̸= 1, excluding the maximum likelihood estimate for Σ.

4           [30 points] Naive Bayes for classifying SPAM
In this problem, we will use the naive Bayes algorithm to build a SPAM classifier. Our classifier will distinguish between “real” (non-spam) emails and spam emails. For this problem, we obtained a set of spam emails and a set of non-spam newsgroup messages. Using only the subject line and body of each message, the classifier will learn to distinguish between the two groups.

In this problem, we’re going to call the features tokens.

In our data, the text emails have been pre-processed so that they can be used for naive Bayes. The preprocessing ensures that only the email body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR), and numbers (NUMBER) were also replaced by special tokens so that they are handled properly in the classification process. If you are interested in the preprocessing, two examples of spam emails and one example of a non-spam email, together with their pre-processed forms, are in the folder samples_FYI.

We have done the feature extraction for you, so you can just load the data matrices (called document-term matrices in text classification) which contain all the data. In a document-term matrix, the $i$-th row represents the $i$-th document/email, and the $j$-th column represents the $j$-th distinct token. Thus, the $(i,j)$-th entry of this matrix represents the number of occurrences of the $j$-th token in the $i$-th document.

For this problem, we chose the set of tokens (also called a vocabulary) to contain only medium-frequency tokens, as tokens that occur too often or too rarely do not have much classification value. (For example, tokens that occur very often are terms like "the," "and," and "of," which appear in both spam and non-spam emails.) Also, terms were stemmed using a standard stemming algorithm; basically, this means that "price," "prices," and "priced" have all been replaced with "price," so that they can be treated as the same token. For a list of the tokens used, see the file TOKENS_LIST.txt in the samples_FYI folder.

Since the document-term matrix is sparse (has lots of zero entries), we store it in an efficient format to save space. We provide starter code q4_hint.py, which contains the readMatrix function that reads in the document-term matrix, the correct class labels for all emails, and the full list of tokens.

(a)    [12 points] Implement a naive Bayes classifier for spam classification, using the multinomial event model and Laplace smoothing. You should use the functions provided in q4_hint.py to train your parameters with the training dataset MATRIX.TRAIN. Then, use these parameters to classify the test dataset MATRIX.TEST and report the error using the evaluate function. Implement your code in q4.py.

Remark. If you implement naive Bayes in the straightforward way, you'll note that the computed $p(x \mid y) = \prod_j p(x_j \mid y)$ often equals zero. This is because $p(x \mid y)$, being the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. You'll have to find a way to compute naive Bayes' predicted class labels without explicitly representing very small numbers such as $p(x \mid y)$. [Hint: Think about using logarithms.]
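A minimal sketch of Laplace-smoothed multinomial naive Bayes with log-space prediction is given below. It is an illustration, not the required q4.py; it assumes readMatrix from q4_hint.py returns the document-term matrix, token list, and {0, 1} labels in that order (with 1 meaning SPAM), which may differ from the actual starter code.

```python
import numpy as np
from q4_hint import readMatrix   # assumed helper; check its actual return order

# Sketch of multinomial naive Bayes with Laplace smoothing and log-space
# prediction (an illustration, not the required q4.py). Assumes readMatrix
# returns (document-term matrix, token list, labels), with labels in {0, 1}
# and 1 meaning SPAM; the actual starter code may order these differently.
matrix, tokens, labels = readMatrix('MATRIX.TRAIN')
V = matrix.shape[1]                               # vocabulary size

spam, ham = matrix[labels == 1], matrix[labels == 0]
# Laplace-smoothed token probabilities under the multinomial event model
log_p_spam = np.log((spam.sum(axis=0) + 1) / (spam.sum() + V))
log_p_ham = np.log((ham.sum(axis=0) + 1) / (ham.sum() + V))
log_prior_spam = np.log(np.mean(labels == 1))
log_prior_ham = np.log(np.mean(labels == 0))

test_matrix, _, test_labels = readMatrix('MATRIX.TEST')
# Work entirely in log space so p(x|y), a product of tiny numbers, never underflows
score_spam = test_matrix @ log_p_spam + log_prior_spam
score_ham = test_matrix @ log_p_ham + log_prior_ham
pred = (score_spam > score_ham).astype(int)
print('test error:', np.mean(pred != test_labels))
```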

(b)    [8 points] Some tokens may be particularly indicative of an email being spam or non-spam. One way to measure how indicative a token $i$ is of the SPAM class is to look at:

$$\log \frac{p(x_j = i \mid y = 1)}{p(x_j = i \mid y = 0)} = \log\left( \frac{P(\text{token}_i \mid \text{email is SPAM})}{P(\text{token}_i \mid \text{email is NOT SPAM})} \right)$$

Using the parameters you obtained in part (a), find the 5 tokens that are most indicative of the SPAM class (i.e., 5 tokens that have the highest positive value on the measure above). Implement your code in q4.py.
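Continuing the part (a) sketch (and reusing its hypothetical log_p_spam, log_p_ham, and tokens), the most indicative tokens can be ranked with an argsort:

```python
# Continuing the sketch above: rank tokens by the log-ratio measure and print
# the 5 most indicative of SPAM. log_p_spam, log_p_ham, and tokens come from
# the hypothetical training sketch in part (a).
indicativeness = log_p_spam - log_p_ham          # log P(token|SPAM) - log P(token|NOT SPAM)
top5 = np.argsort(indicativeness)[-5:][::-1]     # indices of the 5 largest values
print([tokens[j] for j in top5])
```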

(c)     [10 points] Repeat part (a), but with different naive Bayes classifiers, each trained on a different training set MATRIX.TRAIN.*. Evaluate the classifiers with MATRIX.TEST and report the classification error for each classifier. Plot a graph of the test error with respect to the size of the training set. Which training-set size gives you the best classification error? Implement your code in q4.py.
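A sketch of such a sweep is shown below. The size suffixes are hypothetical (check the actual MATRIX.TRAIN.* file names in the provided data), and readMatrix is again assumed to behave as in the part (a) sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from q4_hint import readMatrix     # assumed helper; see the part (a) sketch

# Sketch of the training-size sweep in part (c). The size suffixes below are
# hypothetical; check the actual MATRIX.TRAIN.* file names in the data folder.

def nb_error(train_file, test_file):
    # Train a Laplace-smoothed multinomial NB model and return its test error,
    # using the same log-space scoring as in the part (a) sketch.
    m, _, y = readMatrix(train_file)
    V = m.shape[1]
    lps = np.log((m[y == 1].sum(axis=0) + 1) / (m[y == 1].sum() + V))
    lph = np.log((m[y == 0].sum(axis=0) + 1) / (m[y == 0].sum() + V))
    prior = np.log(np.mean(y == 1)) - np.log(np.mean(y == 0))
    mt, _, yt = readMatrix(test_file)
    pred = (mt @ (lps - lph) + prior > 0).astype(int)
    return np.mean(pred != yt)

sizes = [50, 100, 200, 400, 800, 1400]             # hypothetical suffixes
errors = [nb_error('MATRIX.TRAIN.%d' % s, 'MATRIX.TEST') for s in sizes]

plt.plot(sizes, errors, marker='o')
plt.xlabel('training-set size')
plt.ylabel('test error')
plt.show()
print('best size:', sizes[int(np.argmin(errors))])
```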
