Foundations of Computer and Data Science
Problem 1: We would like to test the ability of the first kernel method (discussed in class) to approximate pdfs. To this end, we generate 1000 realizations of a random variable uniformly distributed in [0,1]. a) Approximate the corresponding pdf using the Gaussian kernel
K(x) = \frac{1}{\sqrt{2\pi}\,h}\, e^{-x^2/(2h^2)}.
Plot the resulting approximations for different values of h. Do not limit your graph to [0,1]; extend it beyond the two boundaries. What do you observe regarding how well the approximation captures the support of the random variable (the interval [0,1]) and the constant value of the pdf (equal to 1 on [0,1])? b) Repeat the same steps for the Laplacian kernel
K(x) = \frac{1}{2h}\, e^{-|x|/h}.
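A minimal Matlab sketch of the experiment, assuming the standard normalized forms of the two kernels with bandwidth h written above (if the class used a different parametrization, only the two kernel lines need to change):

    % Kernel pdf estimates from 1000 uniform [0,1] realizations (sketch).
    N = 1000;
    x = rand(N,1);                      % realizations of the uniform random variable
    t = linspace(-0.5, 1.5, 400);       % evaluation grid extending beyond [0,1]
    for h = [0.01 0.05 0.2]             % example bandwidths; try your own values
        D = t - x;                      % N-by-400 differences t_k - x_i (implicit expansion, R2016b+)
        fG = mean(exp(-D.^2/(2*h^2))/(sqrt(2*pi)*h), 1);  % Gaussian kernel estimate
        fL = mean(exp(-abs(D)/h)/(2*h), 1);               % Laplacian kernel estimate
        figure; plot(t, fG, t, fL); legend('Gaussian', 'Laplacian');
        title(sprintf('h = %.2f', h)); xlabel('t'); ylabel('estimated pdf');
    end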
Problem 2: The Matlab data file hw4-2data.mat contains two matrices, stars and circles, each being a list of 2-dimensional (2-D) vectors. Each 2-D vector identifies a point in the 2-D space which is labeled either as star or circle. We are interested in developing a classifier that distinguishes between the two sets, and we would like to use the kernel method to find a nonlinear separating boundary. To achieve this we assign the label “1” to stars and the label “−1” to circles and, if φ(X), X = [x_1, x_2]^T, is the transformation we would like to apply to the data, then we want to solve the following minimization problem to find the optimum φ(X)
\min_{\varphi \in V} \left\{ \sum_{X_i \in \mathrm{stars}} \bigl(1 - \varphi(X_i)\bigr)^2 + \sum_{X_j \in \mathrm{circles}} \bigl(-1 - \varphi(X_j)\bigr)^2 + \lambda\, \|\varphi(X)\|^2 \right\}, \qquad (1)
where V is the vector space of functions generated by the Gaussian kernel
K(X,Y) = e^{-\frac{1}{h}\|X-Y\|^2} = e^{-\frac{1}{h}\left\{(x_1-y_1)^2+(x_2-y_2)^2\right\}}.
a) Use the Representer theorem to prove that, for the first two sums, we can replace φ(X) by its orthogonal projection φ̂(X) onto the linear subspace generated by K(X, X_i), X_i ∈ stars, and K(X, X_j), X_j ∈ circles, where
\hat{\varphi}(X) = \sum_{X_i \in \mathrm{stars}} \alpha_i\, K(X, X_i) + \sum_{X_j \in \mathrm{circles}} \beta_j\, K(X, X_j). \qquad (2)
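For concreteness, a small Matlab illustration of how the expansion in (2) can be evaluated at a point X; the helper names kvec and phi_hat, and the coefficient vectors alpha and beta, are placeholders for illustration only (finding their optimal values is the subject of part c), and each row of stars/circles is assumed to hold one 2-D point:

    % K(X, Z_i) for a single 2-by-1 point X against a list Z (one point per row), kernel as above.
    kvec = @(X, Z, h) exp(-sum((Z - X').^2, 2)/h);
    % Expansion (2): phi_hat(X) for given (placeholder) coefficient vectors alpha and beta.
    phi_hat = @(X, alpha, beta, stars, circles, h) ...
        alpha' * kvec(X, stars, h) + beta' * kvec(X, circles, h);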
b) For the term ‖φ(X)‖² use the orthogonality principle to show that
\|\varphi(X)\|^2 = \|\hat{\varphi}(X)\|^2 + \|\varphi(X) - \hat{\varphi}(X)\|^2 \;\ge\; \|\hat{\varphi}(X)\|^2.
Does this suggest that we can replace φ(X) with φ̂(X) everywhere in the original cost function in (1)? If yes, then WHY? c) Using the form of φ̂(X) defined in (2), find the optimum coefficients α_i, β_j. d) Once you identify the optimum φ̂(X), explain how you are going to use it to classify a new point X_new as “star” or “circle”, given, of course, that φ̂(X_new) will not be exactly equal to 1 or −1. e) After you have specified your final classification rule in d), find (numerically) the separating boundary for the two classes in the 2-D space (also place the training points on the 2-D plane to verify the quality of your boundary). Repeat the process for different values of h and λ.
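As a hedged sketch of one common numerical route (treating (1) as regularized least squares over the expansion (2), stacking the coefficients into c = [alpha; beta] and the ±1 labels into y, and solving (K + λI)c = y with K the Gram matrix of all training points), which may or may not match the derivation expected in c); the classification rule here is taken to be the sign of φ̂(X_new), one natural choice for d); it is also assumed that each row of stars/circles is one point:

    load hw4-2data.mat                        % provides the matrices 'stars' and 'circles'
    Z = [stars; circles];                     % all training points, one per row (assumed layout)
    y = [ones(size(stars,1),1); -ones(size(circles,1),1)];
    h = 0.5; lambda = 0.1;                    % example values; repeat for several h and lambda
    sq = sum(Z.^2, 2);
    K = exp(-(sq + sq' - 2*(Z*Z'))/h);        % Gram matrix: exp(-||Z_i - Z_j||^2 / h)
    c = (K + lambda*eye(size(K,1))) \ y;      % candidate coefficients [alpha; beta] for (2)

    % Zero-level set of phi_hat as the separating boundary, plotted with the training data.
    [g1, g2] = meshgrid(linspace(min(Z(:,1))-1, max(Z(:,1))+1, 200), ...
                        linspace(min(Z(:,2))-1, max(Z(:,2))+1, 200));
    G  = [g1(:) g2(:)];
    d2 = sum(G.^2, 2) + sq' - 2*(G*Z');       % squared distances, grid points to training points
    phi = exp(-d2/h) * c;                     % phi_hat evaluated on the grid
    contour(g1, g2, reshape(phi, size(g1)), [0 0], 'k'); hold on
    plot(stars(:,1), stars(:,2), 'b*', circles(:,1), circles(:,2), 'ro'); hold off
    xlabel('x_1'); ylabel('x_2');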
Problem 3: Suppose that the scalar random variable y and the random vector X are related through the following model
y = {\theta^*}^{\top} X + w, \qquad (3)
where w is a scalar random variable with mean 0, independent of X, and θ∗ is a deterministic vector. Assuming that we have knowledge of the statistical behavior of the random pair (y, X), we would like to estimate θ∗ by solving the following minimization problem
\min_{\theta}\; E\bigl[(y - \theta^{\top} X)^2\bigr]. \qquad (4)
a) Find the optimum θ and prove that it is equal to θ∗. b) Assume now that the statistical behavior of the pair (y, X) is not available and, instead, you observe pairs (y_t, X_t) that are independent realizations of (y, X). Find the learning (adaptive) algorithm that provides a new estimate θ_t every time a new pair (y_t, X_t) becomes available. The resulting algorithm is known as Least Mean Squares (LMS) and is the most well-known algorithm in Adaptive Signal Processing. c) Pick your own favorite θ∗ of length 5 and generate pairs (y_t, X_t) using the model in (3), where X_t is a Gaussian vector of length 5 with independent and identically distributed elements of mean 0 and variance 1, and w_t is again Gaussian, independent of X_t, with mean 0 and variance 0.1. Apply your algorithm from b) and verify its convergence towards θ∗. Select a small learning rate so that your estimates are not very “noisy”. Plot the squared norm ‖θ_t − θ∗‖² as a function of the iteration t and examine whether it converges to something small. Use a logarithmic scale (“semilogy” in Matlab) to be able to observe differences between small quantities. d) Repeat the previous simulation but with a learning rate that is half the one you used in c). What do you observe as far as convergence rate and steady-state error are concerned? Can you compare the performance in the two cases and claim that one is “better” than the other?
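A minimal Matlab sketch of the simulations in c) and d), assuming the standard LMS recursion θ_t = θ_{t−1} + μ(y_t − θ_{t−1}ᵀX_t)X_t (check it against your own answer to b)); the particular θ∗, learning rate μ, and number of iterations below are arbitrary example choices:

    theta_star = [1; -2; 0.5; 3; -1];         % your own favorite theta*, length 5 (arbitrary here)
    T = 20000; mu = 0.01;                     % iterations and learning rate (example values)
    theta = zeros(5,1); err = zeros(T,1);
    for t = 1:T
        X = randn(5,1);                       % i.i.d. elements, mean 0, variance 1
        w = sqrt(0.1)*randn;                  % noise with mean 0 and variance 0.1
        y = theta_star'*X + w;                % model (3)
        theta = theta + mu*(y - theta'*X)*X;  % LMS update (replace with your answer to b) if it differs)
        err(t) = norm(theta - theta_star)^2;
    end
    semilogy(err); xlabel('iteration t'); ylabel('||\theta_t - \theta^*||^2');
    % For d): rerun with mu/2 and compare convergence speed and steady-state error.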