COMS4771- Homework 04 Solved

1   [Estimating parameters with complete and incomplete data] Consider the data generation process for observation pair (a,b) as follows:

–   a is the outcome of an independent six-faced (possibly loaded) dice-roll. That is, chance of rolling face ‘1’ is p1, rolling face ‘2’ is p2, etc., with a total of six distinct possibilities.

–   Given the outcome a, b is drawn independently from the exponential density qa·e^(−qa·b) (where qa > 0).
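For concreteness, the two-stage process above can be simulated directly. The sketch below (Python/NumPy) uses made-up values for the dice probabilities p1,...,p6 and the rates q1,...,q6, which are not specified in the problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, for illustration only: p[a-1] is the chance of
# rolling face a, and q[a-1] is the exponential rate used when face a comes up.
p = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])
q = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])

def sample_pair():
    """Draw one observation pair (a, b) from the two-stage process."""
    a = rng.choice(6, p=p) + 1                  # dice face in {1, ..., 6}
    b = rng.exponential(scale=1.0 / q[a - 1])   # density qa * exp(-qa * b)
    return a, b

pairs = [sample_pair() for _ in range(1000)]
```

With complete pairs (a, b), the MLE of part (iii) can be checked empirically; hiding the a-values reproduces the incomplete-data setting of parts (iv) and (v).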

(i)      List all the parameters of this process. We shall denote the collection of all the parameters as the variable θ (the parameter vector).

(ii)     Suppose we run this process n times independently, and get the sequence:

(a1,b1),(a2,b2),...,(an,bn).

What is the likelihood that this sequence was generated by a specific setting of the parameter vector θ?

(iii)    What is the most likely setting of the parameter vector θ given the complete observation sequence (a1,b1),(a2,b2),...,(an,bn)? That is, find the Maximum Likelihood Estimate of θ given the observations.

(iv)    What is the probability of the partial (incomplete) observation b1,b2,...,bn given a specific setting of the parameter vector θ?

(v)     Derive the Expectation Maximization (EM) algorithm to estimate the parameters given the incomplete observation sequence b1,b2,...,bn.

2   [kernelizing k-means] k-means with the Euclidean distance metric assumes that each pair of clusters is linearly separable. This may not be the case. A classic example is a dataset with two clusters corresponding to data points on two concentric circles in R².

(i)      Implement Lloyd’s method for the k-means algorithm and show the resulting cluster assignment for the concentric-circles dataset described above. Give two more examples of datasets in R² for which the optimal k-means setting results in an undesirable clustering. Show the resulting cluster assignments for the two additional example datasets.

Let φ denote an explicit non-linear feature space mapping in some inner product space. We will show how one can derive an implicit kernel-based version of the Lloyd’s method for k-means, where we only operate on data as φ(xi) · φ(xj) = K(xi,xj).

(ii)     Let zij be an indicator that is equal to 1 if φ(xi) is currently assigned to the jth cluster and 0 otherwise (1 ≤ i ≤ n and 1 ≤ j ≤ k). Show that the jth cluster center cj can be written as cj = Σi αij φ(xi), where αij only depends on the zij variables.

(iii)    Given any two data points φ(xi) and φ(xj), show that the squared distance ‖φ(xi) − φ(xj)‖² can be computed using only (linear combinations of) inner products.

(iv)    Given the results of parts (ii) and (iii), show how to compute the squared distance ‖φ(xi) − cj‖² using only (linear combinations of) inner products between the data points x1,...,xn.

(v)     Using the results of parts (ii), (iii), and (iv), propose an algorithm for a kernelized version of Lloyd’s method for finding k-means.

(vi)    Implement your proposed kernelized k-means method and run it on the three example datasets of part (i). Compare the resulting clusterings for various choices of kernels (e.g. linear kernel, quadratic kernel, RBF kernel).

(submit your datasets and kernelized code on Courseworks to receive full credit)
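One way parts (ii)–(iv) come together is that the point-to-center distance can be expanded entirely in Gram-matrix entries. The sketch below (Python/NumPy) is a minimal, illustrative version of kernelized Lloyd's method; the deterministic initialization and the empty-cluster handling are arbitrary choices, not part of the assignment:

```python
import numpy as np

def kernel_kmeans(K, k, n_iters=100):
    """Lloyd's method operating only on the Gram matrix K (n x n).

    Uses the expansion
      ||phi(x_i) - c_j||^2 = K_ii - (2/|C_j|) sum_{l in C_j} K_il
                                  + (1/|C_j|^2) sum_{l,m in C_j} K_lm,
    so the feature map phi is never evaluated explicitly.
    """
    n = K.shape[0]
    labels = np.arange(n) % k          # simple deterministic initialization
    for _ in range(n_iters):
        dist = np.zeros((n, k))
        for j in range(k):
            members = labels == j
            nj = members.sum()
            if nj == 0:                # keep empty clusters out of the argmin
                dist[:, j] = np.inf
                continue
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / nj
                          + K[np.ix_(members, members)].sum() / nj ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

With a linear kernel K = X·Xᵀ this reduces to ordinary k-means; for the concentric-circles data, an RBF kernel K(x, x') = exp(−‖x − x'‖²/2σ²) can be plugged in the same way.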

3   [Randomized decision rules] Consider a “reward-based” prediction framework, where given a continuous state-space X and a discrete action-space A, a learning algorithm f : X → A decides to perform action a ∈ A when it is in state x ∈ X. The learner then receives a numerical (possibly negative) reward R(x,a) for performing action a ∈ A in state x ∈ X. (Note: such reward-based frameworks are common in robotics where the learning agent, i.e., the robot, performs some possibly randomized action a in some situation x and receives reward R(x,a). The goal of the learning agent is to maximize its expected reward.)

(i)      Assume that f is allowed to take a randomized action on any state x. That is, in any state x ∈ X, f chooses to perform action a with probability P[a|x]. Show that the average (over the states) expected reward received by f, i.e., E_{x,f(x)}[R(x,f(x))], is

∫_X Σ_{a∈A} R(x,a) P[a|x] p(x) dx,

where p(x) denotes the density over the state-space X.
(ii)     Show that the expected reward is maximized by choosing P[a|x] = 1 for the action a associated with the maximal reward R(x,a) (for a given state x). This shows us that there is no benefit in randomizing the best decision rule.

(iii)    Can one benefit from randomizing a suboptimal rule? Briefly explain.
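To make the expectation in part (i) concrete, here is a small numerical sketch (Python/NumPy); the 4-state, 3-action reward table and the uniform state distribution are invented for illustration. It evaluates the average expected reward of a randomized rule and checks the claim of part (ii) that the deterministic argmax rule does at least as well:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy problem: 4 states, 3 actions, arbitrary rewards R[x, a],
# and a uniform distribution over the (here, discretized) states.
R = rng.normal(size=(4, 3))
px = np.full(4, 0.25)

def expected_reward(policy):
    """policy[x, a] = P[a | x]; returns E_x sum_a P[a|x] R(x, a)."""
    return float(px @ (policy * R).sum(axis=1))

# A randomized rule: uniform over the actions in every state.
uniform = np.full((4, 3), 1.0 / 3.0)

# The deterministic rule that puts P[a|x] = 1 on the maximal-reward action.
best = np.zeros_like(R)
best[np.arange(4), R.argmax(axis=1)] = 1.0

# Part (ii): the deterministic argmax rule dominates the randomized one.
assert expected_reward(best) >= expected_reward(uniform)
```

The same comparison holds for any randomized policy, since a convex combination of rewards can never exceed their maximum.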

4   [From distances to embeddings] Your friend from overseas is visiting you and asks you for the geographical locations of popular US cities on a map. Not having access to a US map, you realize that you cannot provide your friend accurate information. You recall that you have access to the relative distances between nine popular US cities, given by the following distance matrix D:

Distances (D)   BOS   NYC    DC   MIA   CHI   SEA    SF    LA   DEN
BOS               0   206   429  1504   963  2976  3095  2979  1949
NYC             206     0   233  1308   802  2815  2934  2786  1771
DC              429   233     0  1075   671  2684  2799  2631  1616
MIA            1504  1308  1075     0  1329  3273  3053  2687  2037
CHI             963   802   671  1329     0  2013  2142  2054   996
SEA            2976  2815  2684  3273  2013     0   808  1131  1307
SF             3095  2934  2799  3053  2142   808     0   379  1235
LA             2979  2786  2631  2687  2054  1131   379     0  1059
DEN            1949  1771  1616  2037   996  1307  1235  1059     0
Being a machine learning student, you believe that it may be possible to infer the locations of these cities from the distance data. To find an embedding of these nine cities on a two dimensional map, you decide to solve it as an optimization problem as follows.

You associate a two-dimensional variable xi as the unknown latitude and longitude value for each of the nine cities (that is, x1 is the lat/lon value for BOS, x2 is the lat/lon value for NYC, etc.). You write down the (unconstrained) optimization problem

minimize over x1,...,x9:   Σi,j (‖xi − xj‖ − Dij)²,

where Σi,j (‖xi − xj‖ − Dij)² denotes the embedding discrepancy function.

(i)      What is the derivative of the discrepancy function with respect to a location xi?

(ii)     Write a program in your preferred language to find an optimal setting of locations x1,...,x9. You must submit your code to Courseworks to receive full credit.

(iii)    Plot the result of the optimization showing the estimated locations of the nine cities.

(here is sample code to plot the city locations in Matlab)

cities = {'BOS','NYC','DC','MIA','CHI','SEA','SF','LA','DEN'};

locs = [x1; x2; x3; x4; x5; x6; x7; x8; x9];   % each xi is a 1-by-2 row vector

figure; text(locs(:,1), locs(:,2), cities);

What can you say about your result of the estimated locations compared to the actual geographical locations of these cities?
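One possible way to attack parts (i)–(iii) is plain gradient descent on the discrepancy function. The sketch below (Python/NumPy) hard-codes the distance matrix D from the table; the random initialization, learning rate, and iteration budget are arbitrary choices, so it is illustrative rather than definitive:

```python
import numpy as np

# Distance matrix D from the table above, cities in the same order.
cities = ['BOS', 'NYC', 'DC', 'MIA', 'CHI', 'SEA', 'SF', 'LA', 'DEN']
D = np.array([
    [   0,  206,  429, 1504,  963, 2976, 3095, 2979, 1949],
    [ 206,    0,  233, 1308,  802, 2815, 2934, 2786, 1771],
    [ 429,  233,    0, 1075,  671, 2684, 2799, 2631, 1616],
    [1504, 1308, 1075,    0, 1329, 3273, 3053, 2687, 2037],
    [ 963,  802,  671, 1329,    0, 2013, 2142, 2054,  996],
    [2976, 2815, 2684, 3273, 2013,    0,  808, 1131, 1307],
    [3095, 2934, 2799, 3053, 2142,  808,    0,  379, 1235],
    [2979, 2786, 2631, 2687, 2054, 1131,  379,    0, 1059],
    [1949, 1771, 1616, 2037,  996, 1307, 1235, 1059,    0],
], dtype=float)

def discrepancy_and_grad(X):
    """S(X) = sum_{i,j} (||x_i - x_j|| - D_ij)^2 and its gradient w.r.t. X."""
    diff = X[:, None, :] - X[None, :, :]        # diff[i, j] = x_i - x_j
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, 1.0)                 # avoid dividing by zero below
    dist = np.maximum(dist, 1e-12)              # guard for coincident points
    err = dist - D
    np.fill_diagonal(err, 0.0)
    S = (err ** 2).sum()
    # d/dx_i: 2 * sum_j err_ij * (x_i - x_j) / dist_ij, doubled again because
    # each pair appears as both (i, j) and (j, i) in the double sum.
    grad = 4.0 * ((err / dist)[:, :, None] * diff).sum(axis=1)
    return S, grad

rng = np.random.default_rng(0)
X = rng.normal(scale=1000.0, size=(9, 2))       # random initial layout
S0, _ = discrepancy_and_grad(X)
lr = 0.005
for _ in range(5000):
    S, g = discrepancy_and_grad(X)
    X -= lr * g
```

The optimized X can be plotted analogously to the Matlab snippet above. Since the discrepancy depends only on pairwise distances, the recovered layout is determined only up to rotation, reflection, and translation; this is worth keeping in mind when comparing the estimated locations to true geography in part (iii).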
