EECS545 Machine Learning - Homework #3

Source Code Instructions: Your source code should run in an environment with the following libraries:

•   Python 3.6+

•   Numpy (for implementations of algorithms)

•   Matplotlib (for plots)

•   Scikit-Learn (for Q5)

1           [10 points] MAP estimates and weight decay
Consider using a binary logistic regression model p(y = 1|x;w) = σ(wTϕ(x)), where σ is the sigmoid function and ϕ : Rd → RM is a feature transform. Assume that a training set {(x(i),y(i)); i = 1,...,N} with x(i) ∈ Rd, y(i) ∈ {0,1} is given as usual. The maximum likelihood estimate of the parameters w is given by

$$ w_{\mathrm{ML}} = \arg\max_{w} \prod_{i=1}^{N} p\big(y^{(i)} \mid x^{(i)}; w\big). \tag{1} $$

We regularize logistic regression by imposing a prior on the parameters. Suppose we choose the prior w ∼ N(0, τ²I) (here, τ > 0, and I is the M × M identity matrix), and obtain the MAP estimate of w as:

$$ w_{\mathrm{MAP}} = \arg\max_{w}\; p(w) \prod_{i=1}^{N} p\big(y^{(i)} \mid x^{(i)}; w\big). \tag{2} $$

Prove that:

$$ \|w_{\mathrm{MAP}}\|_2 \le \|w_{\mathrm{ML}}\|_2. \tag{3} $$
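For intuition on where the name "weight decay" comes from (this is just the standard expansion of the Gaussian log-prior, not the required proof), taking logarithms of the MAP objective in Eq. (2) gives

$$ \log p(w) + \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; w\big) = \sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}; w\big) - \frac{\|w\|_2^2}{2\tau^2} + \text{const}, $$

so the prior contributes an ℓ2 ("weight decay") penalty of strength 1/(2τ²) on top of the log-likelihood.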
2           [25 + 5 points] Direct construction of valid kernels
In class, we saw that by choosing a kernel k(x,z) = ϕ(x)Tϕ(z), we can implicitly map data to a high dimensional space, and have the SVM algorithm work in that space. One way to generate kernels is to explicitly define the mapping ϕ to a higher dimensional space, and then work out the corresponding k.

However, in this question we are interested in the direct construction of kernels. Suppose we have a function k(x,z) that gives an appropriate similarity measure (similar to an inner product) for our learning problem, and consider plugging k into a kernelized algorithm (like the SVM) as the kernel function. In order for k(x,z) to be a valid kernel, it must correspond to an inner product in some higher-dimensional space resulting from some feature mapping ϕ. Mercer's theorem tells us that k(x,z) is a (Mercer) kernel if and only if for any finite set {x(1),...,x(N)}, the matrix K is symmetric and positive semi-definite, where the square matrix K ∈ RN×N is given by Kij = k(x(i),x(j)).
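Concretely, positive semi-definiteness of K means that for every vector z ∈ RN,

$$ z^\top K z \;=\; \sum_{i=1}^{N} \sum_{j=1}^{N} z_i\, z_j\, k\big(x^{(i)}, x^{(j)}\big) \;\ge\; 0, $$

which is the condition to verify (or refute with a counterexample) in the parts below.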

Now here comes the question: let k1, k2 be kernels over RD × RD, let a > 0 be a positive real number, let f : RD → R be a real-valued function, let ϕ : RD → RM be a function mapping from RD to RM, let k3 be a kernel over RM × RM, and let p : R → R be a polynomial with positive coefficients.

For each of the functions k below, state whether it is necessarily a kernel. If you think it is a kernel, please prove it; if you think it is not, please give a counterexample.

(a)    [2 points] k(x,z) = k1(x,z) + k2(x,z)

(b)    [2 points] k(x,z) = k1(x,z) − k2(x,z)

(c)     [2 points] k(x,z) = ak1(x,z)

(d)    [2 points] k(x,z) = −ak1(x,z)

(e)    [5 points] k(x,z) = k1(x,z)k2(x,z)

(f)      [3 points] k(x,z) = f(x)f(z)

(g)    [3 points] k(x,z) = k3(ϕ(x),ϕ(z))

(h)    [3 points] k(x,z) = p(k1(x,z))

(i)      [3 points] For this subproblem you don’t have to check if k is a kernel or not. Instead, given a kernel k(x,z) = (xTz + 1)2, find a feature map ϕ associated with k(x,z) such that k(x,z) = ϕ(x)Tϕ(z). You may assume D = 2 for this subproblem.

(j)      [5 points extra credit] Prove that the Gaussian kernel, k(x,z) = exp(−∥x − z∥²/(2σ²)), can be expressed as k(x,z) = ϕ(x)Tϕ(z), where ϕ(·) is an infinite-dimensional vector. Specifically, find the closed form of an infinite-dimensional feature vector ϕ.

[Hint: ∥x−z∥2 = xTx + zTz − 2xTz. It might be useful to consider the power series e^a = Σ_{n=0}^∞ a^n/n!.]

3           [20 points] Kernelizing the Perceptron
Consider a binary classification problem with labels y ∈ {−1,1}. Assume that we have a (pre-processed) dataset D = {(x(1),y(1)),...,(x(N),y(N))} where x(n) ∈ Rd+1 with the first component equal to 1 (so that the bias term is absorbed into the weights). The perceptron is a linear classifier, given by sign(h(x;w)) where h(x;w) = wTx and w ∈ Rd+1 denotes the weights of the perceptron classifier. The perceptron is trained via Algorithm 1 described below.

Essentially, in each iteration, the algorithm picks a random sample (x(n),y(n)) ∈ D from the dataset and checks whether it is misclassified, i.e., sign(h(x(n);wt)) ̸= y(n). This is implemented by checking whether y(n)h(x(n);wt) is negative (Line 5). If a positive example is misclassified, then we update the weights with +x(n); if a negative example is misclassified, then we update the weights with −x(n) (Line 6). If the example is correctly classified, nothing is done. This process is repeated for T iterations.

 

Algorithm 1: Perceptron training algorithm

 

Assume that T is large and D is linearly separable. Frank Rosenblatt showed that this simple algorithm converges and outputs a wT that is able to perfectly separate the training data D.[1]
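For reference, here is a minimal NumPy sketch of the training loop described above (random sample, sign check, additive update); the exact sampling and bookkeeping in the original Algorithm 1 box may differ.

```python
import numpy as np

def perceptron_train(X, y, T, rng=None):
    """Sketch of Algorithm 1: X has shape (N, d+1), labels y are in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    w = np.zeros(D)
    for t in range(T):
        n = rng.integers(N)            # pick a random training example
        # The text checks for a strictly negative score; <= 0 also treats the
        # all-zero initialization as a mistake so training can get started.
        if y[n] * (w @ X[n]) <= 0:
            w = w + y[n] * X[n]        # +x for positives, -x for negatives
    return w
```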

What if the dataset D is not linearly separable? In this problem, you’ll show that we can kernelize the perceptron algorithm. This allows us to operate in a high-dimensional (possibly infinite-dimensional) feature space ϕ(x) where we can assume that the data is linearly separable.

Let ϕ : Rd+1 → RM be a high-dimensional feature transform that is intractable to compute. Assume that D is separable in the feature space i.e., Dϕ = {(ϕ(x(1)),y(1)),...(ϕ(x(N)),y(N))} is linearly separable. Note that the perceptron classifier is now given by sign(h(ϕ(x);w)) with h(ϕ(x);w) = wTϕ(x). Lastly, assume that we have a computationally-tractable kernel k such that k(x,x′) = ϕ(x)Tϕ(x′).

(a)    Define the feature matrix Φ = [ϕ(x(1)), ϕ(x(2)), ..., ϕ(x(N))]T ∈ RN×M as usual.

In this subproblem, we assume that we use Algorithm 1 in the feature space for analysis. Thus, assume the input dataset to Alg. 1 is Dϕ.

(i)     [3 points] Assuming wt is of the form ΦTαt where αt ∈ RN, show that wt+1 is also of the form ΦTαt+1 for some αt+1 ∈ RN.

(ii)   [3 points] Show that every intermediate weight vector wt for 0 ≤ t ≤ T (note that this includes the optimal weights wT) can be expressed as ΦTαt where αt ∈ RN. [Hint: induction!]

(b)    Note that by means of the previous subproblem, we have shown that the weights of the perceptron are a linear combination of the feature vectors.

(i)     [4 points] Observing your solution for part (a), can you provide an update rule for αt+1 in terms of αt and other known quantities? What is the maximum number of elements that differ between αt and αt+1?

(ii)   [4 points] Show that h(ϕ(x(n)); wt) can be expressed entirely in terms of αt and the kernel k (i.e., it involves neither ϕ nor Φ).

(c)     [6 points] Kernelize the perceptron algorithm. Write it exactly in the style of Alg. 1, i.e., each line (wherever applicable) in Alg. 1 should be replaced with its “kernelized” version. The algorithm should be computationally tractable and thus must not contain any ϕ or Φ terms.

You have trained the kernelized perceptron and are now given a new test data point x; how would you classify it?


4           [30 points] Implementing Soft Margin SVM by Optimizing Primal Objective
The support vector machine (SVM) is a discriminative model for classification. Although it is possible to develop SVMs that perform K-class classification, we are going to restrict ourselves to binary classification in this question, where the class label is either +1 (positive) or −1 (negative). The SVM is not a probabilistic algorithm; in other words, in its usual construction, it does not optimize a probability measure as a likelihood. The SVM tries to find the “best” hyperplane that maximally separates the positive class from the negative class.

Recall that the objective function for maximizing the soft margin is equivalent to the following minimization problem:

$$ \min_{w,\,b,\,\xi} \;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \quad i = 1,\dots,N. \tag{4} $$

The above is known as the primal objective of the SVM. Notice the two constraints on each slack variable ξi: ξi must satisfy both conditions while the sum of the ξi's, scaled by C, is minimized. The constrained minimization is equivalent to the following minimization involving the hinge loss term:

$$ E(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} \max\!\big(0,\; 1 - y^{(i)}\big(w^\top x^{(i)} + b\big)\big). \tag{5} $$

You will be working with minimization of the above objective in this problem.

(a)    [5 points] Briefly show that Eq.(4) and Eq.(5) are indeed equivalent optimization problems. Clearly show your work for eliminating slack variables.

(b)    [6 points] Find the derivatives of the loss function E(w,b) with respect to the parameters w and b. Show that:

$$ \nabla_{w} E(w, b) = w - C\sum_{i=1}^{N} I\big[y^{(i)}\big(w^\top x^{(i)} + b\big) < 1\big]\, y^{(i)} x^{(i)}, \tag{6} $$

$$ \frac{\partial E(w, b)}{\partial b} = -C\sum_{i=1}^{N} I\big[y^{(i)}\big(w^\top x^{(i)} + b\big) < 1\big]\, y^{(i)}, \tag{7} $$

where I[·] is the indicator function. If f(x) = max(0,x), then assume that

$$ \frac{df}{dx} = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases} $$
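As a sanity check on Eqs. (6)-(7), a vectorized NumPy computation of these gradients could look like the following (the function and variable names are illustrative, not part of the assignment):

```python
import numpy as np

def svm_gradients(w, b, X, y, C):
    """Gradients of E(w,b) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)).

    Only examples that violate the margin, i.e. y_i (w^T x_i + b) < 1,
    contribute to the sums in Eqs. (6) and (7).
    """
    margins = y * (X @ w + b)              # y^(i) (w^T x^(i) + b) for every i
    violated = margins < 1                 # the indicator I[.] as a boolean mask
    grad_w = w - C * ((violated * y) @ X)  # Eq. (6)
    grad_b = -C * np.sum(violated * y)     # Eq. (7)
    return grad_w, grad_b
```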

(c)     [9 points] Implement the SVM algorithm using batch gradient descent. In previous assignments, you implemented gradient descent while minimizing over one variable; minimizing over two variables (w, b) is no different. Both gradients are computed from the current parameter values, and then both parameters are updated simultaneously.[2]

Example pseudocode for optimizing w and b is given by Algorithm 2 below.

 

Algorithm 2: SVM Batch Gradient Descent

 

1   w∗ ← 0;

2   b∗ ← 0;

3   for j = 1 to NumIterations do

 

The learning rate α(·) is now a function of time (the iteration number) rather than a constant, which allows us to use a decaying learning rate α(j).

Throughout this question, set η0 = 0.5 and the slack cost C = 5. Loading the file q4_data.npy loads the arrays q4x_train, q4y_train, q4x_test, and q4y_test. Run gradient descent over the training data 6 times, once for each of NumIterations ∈ {5, 50, 100, 1000, 5000, 6000}. For each run, please report your trained parameters (w, b) and the test classification accuracy. Note that the parameters are initialized with zeros and that the learning rate for updating b∗ is 0.01α(j) (see Algorithm 2).
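A minimal sketch of the batch gradient descent loop is given below. The decay schedule α(j) = η0/(1 + j) is an assumption standing in for the schedule defined in the assignment, and svm_gradients is the illustrative helper shown after part (b).

```python
import numpy as np

def svm_train_batch(X, y, C=5.0, eta0=0.5, num_iterations=1000):
    """Sketch of Algorithm 2 (batch gradient descent on the primal objective)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for j in range(1, num_iterations + 1):
        alpha = eta0 / (1.0 + j)                       # assumed decay schedule
        grad_w, grad_b = svm_gradients(w, b, X, y, C)  # Eqs. (6) and (7)
        w = w - alpha * grad_w                         # simultaneous update of w...
        b = b - 0.01 * alpha * grad_b                  # ...and b (smaller step, per the text)
    return w, b
```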

(d)    [4 points] In stochastic gradient descent, we compute the gradients and update our parameters for every training example (rather than for the whole batch). To do stochastic gradient descent properly, we need to define a loss function per example (note that the loss function E(w,b) above is defined over the entire training set). We can rewrite the loss function above as:

$$ E(w, b) = \sum_{i=1}^{N} \left[ \frac{1}{2N}\|w\|^2 + C \max\!\big(0,\; 1 - y^{(i)}\big(w^\top x^{(i)} + b\big)\big) \right]. $$

Expressing the error function as one summation allows us to define an error term per training example:

$$ E^{(i)}(w, b) = \frac{1}{2N}\|w\|^2 + C \max\!\big(0,\; 1 - y^{(i)}\big(w^\top x^{(i)} + b\big)\big), \qquad \text{so that} \qquad E(w, b) = \sum_{i=1}^{N} E^{(i)}(w, b). $$

Now derive the gradients ∇wE(i)(w,b) and ∂E(i)(w,b)/∂b.

(e)    [6 points] Implement the SVM algorithm using stochastic gradient descent (SGD). The pseudo-code looks like:

 

Algorithm 3: SVM Stochastic Gradient Descent

 

1   w∗ ← 0;

2   b∗ ← 0;

3   for j = 1 to NumIterations do

 

Use the same α(·), η0, and C as in part (c). Run your implementation of SVM stochastic gradient descent over the training data 6 times, once for each of NumIterations ∈ {5, 50, 100, 1000, 5000, 6000}. For each run, report your trained parameters (w, b) and the test classification accuracy.
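A corresponding sketch of the stochastic update is shown below. It uses the per-example loss E(i)(w,b) from part (d); the decay schedule α(j) = η0/(1 + j) and the choice of one random example per iteration are again assumptions rather than the exact contents of the Algorithm 3 box.

```python
import numpy as np

def svm_train_sgd(X, y, C=5.0, eta0=0.5, num_iterations=1000, rng=None):
    """Sketch of Algorithm 3 (SGD on the per-example losses E^(i)(w, b))."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for j in range(1, num_iterations + 1):
        alpha = eta0 / (1.0 + j)                  # assumed decay schedule
        i = rng.integers(N)                       # one random training example
        violated = y[i] * (X[i] @ w + b) < 1      # margin-violation indicator
        grad_w = w / N - (C * y[i] * X[i] if violated else 0.0)
        grad_b = -C * y[i] if violated else 0.0
        w = w - alpha * grad_w
        b = b - 0.01 * alpha * grad_b
    return w, b
```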

 


5           [15 points] SciKit-Learn SVM for classifying SPAM
Recall Q4 from HW2. We will repeat the same problem using SciKit-Learn SVM, and compare naive Bayes and SVM.

NOTE: For your convenience, this is the same information about the data from Q4 in HW2.

In our data, the text emails have been pre-processed so that they can be used for naive Bayes. The preprocessing ensures that only the email body and subject remain in the dataset; email addresses (EMAILADDR), web addresses (HTTPADDR), currency (DOLLAR), and numbers (NUMBER) were also replaced by these special tokens so that they can be handled properly in the classification process. (In this problem, we are going to call the features “tokens”.)

We have done the feature extraction work for you, so you can just load the data matrices (called document-term matrices in text classification), which contain all the data. In a document-term matrix, the ith row represents the ith document/email, and the jth column represents the jth distinct token. Thus, the (i,j) entry of this matrix represents the number of occurrences of the jth token in the ith document.

For this problem, we chose the set of tokens (the vocabulary) to contain only the medium-frequency tokens, as tokens that occur too often or too rarely do not have much classification value. (For example, tokens that occur very often are terms like “the,” “and,” and “of,” which appear in both spam and non-spam emails.) Also, terms were stemmed using a standard stemming algorithm; basically, this means that “price,” “prices,” and “priced” have all been replaced with “price,” so that they can be treated as the same token.

Since the document-term matrix is sparse (it has lots of zero entries), we store it in an efficient format to save space. We provide the starter code q5.py, which contains the readMatrix function that reads in the document-term matrix, the correct class labels for all emails, and the full list of tokens.

(a)    [5 points] Implement an SVM classifier with a linear kernel for spam classification using the corresponding package from SciKit-Learn.[3] You may need to run pip install scikit-learn to install the package.

You should use the LinearSVC class[4] in this package (already imported in q5.py) to train your parameters with the training dataset MATRIX.TRAIN. Then use these parameters to classify the test dataset MATRIX.TEST and report the error using the evaluate function. Implement your code in q5.py.
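A minimal sketch of part (a) is shown below; readMatrix and evaluate come from the provided q5.py starter code, and the (matrix, token list, labels) return order assumed for readMatrix is an assumption rather than something specified in this handout.

```python
from sklearn.svm import LinearSVC

# readMatrix and evaluate are provided by the q5.py starter code; the
# (matrix, tokenlist, labels) return order below is an assumption.
train_matrix, tokenlist, train_labels = readMatrix('MATRIX.TRAIN')
test_matrix, _, test_labels = readMatrix('MATRIX.TEST')

clf = LinearSVC()                        # linear-kernel SVM from scikit-learn
clf.fit(train_matrix, train_labels)      # train on the document-term matrix
predictions = clf.predict(test_matrix)   # classify the test emails
evaluate(predictions, test_labels)       # report the test error
```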

(b)     [6 points] Repeat part (a), but with training sets of size ranging from 50, 100, 200, . . . , up to 1400, using the files MATRIX.TRAIN.*. Plot the test error for each size (use MATRIX.TEST as the test data) to obtain a learning curve (test set error vs. training set size). Also, report the number of support vectors for each of the learned models. Which training set size gives the best test set error?

(c)     [4 points] Refer to the instructor's results for Q4 in HW2 (see Canvas → Homework → HW2 → solution → hw2_sol.pdf). How do naive Bayes and support vector machines compare (in terms of test error) as a function of the training set size?

Credits
Some questions adopted/adapted from http://www.stanford.edu/class/cs229/ and from Bishop PRML.


 
[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.335.3398&rep=rep1&type=pdf
[2] Note: it is a common mistake, when implementing gradient descent over multiple parameters, to update the first parameter and then compute the derivative w.r.t. the second parameter using the updated value of the first. Doing so is a different optimization algorithm, known as coordinate descent.
[3] See https://scikit-learn.org/stable/ for more information about the package.
[4] See https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html for more information about the LinearSVC class.
