CS771 Introduction to ML 
Homework Assignment Number 2         

Solution 1
Let $\{(x_n, y_n)\}_{n=1}^{N}$ be the training data, where $x_n \in \mathbb{R}^D$ and $y_n \in \{0,1\}$.

For Logistic Regression, we have

$$p(y_n \mid w, x_n) = \mathrm{Ber}(y_n; \mu_n) = \mu_n^{y_n}(1 - \mu_n)^{1 - y_n}$$

where

$$\mu_n = \sigma(w^T x_n) = \frac{e^{w^T x_n}}{1 + e^{w^T x_n}} \qquad (1)$$

Using $p(y \mid w, X) = \prod_{n=1}^{N} p(y_n \mid w, x_n)$, we obtain the likelihood

$$p(y \mid w, X) = \prod_{n=1}^{N} \mu_n^{y_n}(1 - \mu_n)^{1 - y_n}$$

Taking the log, we get the log-likelihood $LL(w)$:

$$LL(w) = \sum_{n=1}^{N} \big[\, y_n \log(\mu_n) + (1 - y_n)\log(1 - \mu_n) \,\big]$$

So the loss function $L(w)$, which is the negative log-likelihood, is given by

$$\begin{aligned}
L(w) &= -\sum_{n=1}^{N} \big[\, y_n \log(\mu_n) + (1 - y_n)\log(1 - \mu_n) \,\big] \\
&= -\sum_{n=1}^{N} \Big[\, y_n \log\big(\sigma(w^T x_n)\big) + (1 - y_n)\log\big(1 - \sigma(w^T x_n)\big) \,\Big] \\
&= -\sum_{n=1}^{N} \left[\, y_n \log\!\left(\frac{e^{w^T x_n}}{1 + e^{w^T x_n}}\right) + (1 - y_n)\log\!\left(1 - \frac{e^{w^T x_n}}{1 + e^{w^T x_n}}\right) \right] \\
&= -\sum_{n=1}^{N} \Big[\, y_n w^T x_n - y_n \log(1 + e^{w^T x_n}) - (1 - y_n)\log(1 + e^{w^T x_n}) \,\Big] \\
&= -\sum_{n=1}^{N} \Big[\, y_n w^T x_n - \log(1 + e^{w^T x_n}) \,\Big] \qquad (2)
\end{aligned}$$

Therefore, we have

$$\begin{aligned}
\nabla_w L(w) &= -\sum_{n=1}^{N} \left( y_n - \frac{e^{w^T x_n}}{1 + e^{w^T x_n}} \right) x_n \\
&= -\sum_{n=1}^{N} (y_n - \mu_n)\, x_n \qquad (3)
\end{aligned}$$

The Hessian matrix is given by $\nabla^2_{ww^T} L(w)$, which is obtained as follows:

$$\begin{aligned}
\nabla^2_{ww^T} L(w) &= \frac{\partial}{\partial w^T} \left( -\sum_{n=1}^{N} (y_n - \mu_n)\, x_n \right) \\
&= \sum_{n=1}^{N} \mu_n (1 - \mu_n)\, x_n x_n^T \qquad (4)
\end{aligned}$$

where we used the fact that

$$\begin{aligned}
\frac{\partial \mu_n}{\partial w} &= \frac{\partial}{\partial w} \left( \frac{e^{w^T x_n}}{1 + e^{w^T x_n}} \right) \\
&= \frac{e^{w^T x_n}}{(1 + e^{w^T x_n})^2}\, x_n \\
&= \frac{e^{w^T x_n}}{1 + e^{w^T x_n}} \times \left( 1 - \frac{e^{w^T x_n}}{1 + e^{w^T x_n}} \right) \times x_n \\
&= \mu_n (1 - \mu_n)\, x_n \qquad (5)
\end{aligned}$$

Let

$$\begin{aligned}
y &= [\, y_1\ y_2\ \dots\ y_N \,]^T \in \mathbb{R}^{N \times 1} \\
\mu &= [\, \mu_1\ \mu_2\ \dots\ \mu_N \,]^T \in \mathbb{R}^{N \times 1} \\
X &= [\, x_1\ x_2\ \dots\ x_N \,]^T \in \mathbb{R}^{N \times D} \\
\gamma &= \mathrm{diag}\big(\mu_1(1-\mu_1),\ \mu_2(1-\mu_2),\ \dots,\ \mu_N(1-\mu_N)\big) \in \mathbb{R}^{N \times N}
\end{aligned}$$

The given second-order Newton-Raphson optimisation update is:

$$w^{(t+1)} = w^{(t)} - \big(H^{(t)}\big)^{-1} g^{(t)} \qquad (6)$$

where

$$\begin{aligned}
g &= \nabla_w L(w) = -\sum_{n=1}^{N} (y_n - \mu_n)\, x_n = -X^T (y - \mu) \qquad (7) \\
H &= \nabla^2_{ww^T} L(w) = \sum_{n=1}^{N} \mu_n (1 - \mu_n)\, x_n x_n^T = X^T \gamma X \qquad (8)
\end{aligned}$$

Using (6), (7) and (8), we get

$$\begin{aligned}
w^{(t+1)} &= w^{(t)} - \big(H^{(t)}\big)^{-1} g^{(t)} \\
&= w^{(t)} + \big(X^T \gamma^{(t)} X\big)^{-1} X^T (y - \mu^{(t)}) \\
&= \big(X^T \gamma^{(t)} X\big)^{-1} X^T \gamma^{(t)} X w^{(t)} + \big(X^T \gamma^{(t)} X\big)^{-1} X^T \gamma^{(t)} \big(\gamma^{(t)}\big)^{-1} (y - \mu^{(t)}) \\
&= \big(X^T \gamma^{(t)} X\big)^{-1} X^T \gamma^{(t)} \Big[ X w^{(t)} + \big(\gamma^{(t)}\big)^{-1} (y - \mu^{(t)}) \Big] \\
&= \big(X^T \gamma^{(t)} X\big)^{-1} X^T \gamma^{(t)} \hat{y}^{(t)} \qquad (9)
\end{aligned}$$

where $\hat{y}^{(t)} = X w^{(t)} + \big(\gamma^{(t)}\big)^{-1} (y - \mu^{(t)}) \in \mathbb{R}^{N \times 1}$

We know that (9) is the minimiser of the weighted least squares objective $(\hat{y}^{(t)} - Xw)^T \gamma^{(t)} (\hat{y}^{(t)} - Xw)$, and therefore

$$\begin{aligned}
w^{(t+1)} &= \arg\min_{w}\ (\hat{y}^{(t)} - Xw)^T \gamma^{(t)} (\hat{y}^{(t)} - Xw) \qquad (10) \\
&= \arg\min_{w}\ \sum_{n=1}^{N} \gamma^{(t)}_{nn} \big( \hat{y}^{(t)}_n - w^T x_n \big)^2 = \arg\min_{w}\ \sum_{n=1}^{N} \mu^{(t)}_n (1 - \mu^{(t)}_n) \big( \hat{y}^{(t)}_n - w^T x_n \big)^2 \qquad (11) \\
&= \arg\min_{w}\ \sum_{n=1}^{N} \gamma^{(t)}_n \big( \hat{y}^{(t)}_n - w^T x_n \big)^2 \qquad (12)
\end{aligned}$$

where $\gamma^{(t)}_n = \mu^{(t)}_n (1 - \mu^{(t)}_n)$. (13)

Hence, the required update is:

$$w^{(t+1)} = \arg\min_{w} \sum_{n=1}^{N} \gamma^{(t)}_n \big( \hat{y}^{(t)}_n - w^T x_n \big)^2 \qquad (14)$$

$$\text{where } \gamma^{(t)}_n = \mu^{(t)}_n (1 - \mu^{(t)}_n) \ \text{ and } \ \hat{y}^{(t)}_n = [\hat{y}^{(t)}]_n = \big[ X w^{(t)} + (\gamma^{(t)})^{-1} (y - \mu^{(t)}) \big]_n \qquad (15)$$

From this, we can conclude that the Newton update for $w$ corresponds to solving the importance-weighted regression problem given in (14), where $\gamma_n$ denotes the importance of the $n$-th example and $\hat{y}_n$ denotes a modified real-valued label, both of which are given in (15).

Note that $\hat{y}$ acts like an adjusted response in comparison to $y$, and the expression for $\gamma_n$ makes sense because when $\mu_n \in \{0,1\}$, $\gamma_n = 0$, so the weight on the term $\hat{y}^{(t)}_n - w^{(t)T} x_n$ becomes 0.
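To make the result concrete, here is a minimal NumPy sketch of the resulting iteratively reweighted least squares (IRLS) / Newton iteration, following equations (6)-(9); the function name, the zero initialisation, and the fixed iteration count are illustrative assumptions, not part of the assignment.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic_regression(X, y, num_iters=20):
    # X: (N, D) design matrix, y: (N,) labels in {0, 1}
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        mu = sigmoid(X @ w)              # mu_n = sigma(w^T x_n), eq. (1)
        gamma = mu * (1.0 - mu)          # diagonal entries of gamma, eq. (8)
        g = -X.T @ (y - mu)              # gradient, eq. (7)
        H = X.T @ (gamma[:, None] * X)   # Hessian X^T gamma X, eq. (8)
        w = w - np.linalg.solve(H, g)    # Newton step, eq. (6)
        # Equivalently, w solves the weighted least squares problem (14) with
        # targets y_hat = X @ w + (y - mu) / gamma and weights gamma.
    return w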


Solution 2
Given $\{(x_n, y_n)\}_{n=1}^{N}$ with $x_n \in \mathbb{R}^{D \times 1}$ and $y_n \in \{\pm 1\}$, we will discuss the kernelized Perceptron.

Ignoring the bias term $b \in \mathbb{R}$, let the hyperplane be of the form $w^T x = 0$, where $w \in \mathbb{R}^{D \times 1}$.

We know that the standard stochastic gradient descent update for the Perceptron is $w^{(t+1)} = w^{(t)} + y_n x_n$ whenever $y_n w^T x_n < 0$.

Due to this update, it is easy to see that

$$w = \sum_{n=1}^{N} \alpha_n y_n x_n \qquad (1)$$

where $\alpha_n$ denotes the number of times the algorithm makes a mistake on the $n$-th example $(x_n, y_n)$.

Let $\alpha = [\, \alpha_1\ \alpha_2\ \dots\ \alpha_N \,]^T \in \mathbb{R}^{N \times 1}$.

Let $K : \mathbb{R}^{D \times 1} \times \mathbb{R}^{D \times 1} \to \mathbb{R}$, $(x_i, x_j) \mapsto \phi(x_i)^T \phi(x_j)$, be an arbitrary kernel, where $\phi : \mathbb{R}^D \to \mathcal{X}$, $x \mapsto \phi(x)$, is the implicit feature mapping induced by $K$.

While working in the $\phi$-induced space $\mathcal{X}$, from (1) we will have

$$w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n)$$

Hence, the prediction $\hat{y}$ made on an example $(x, y)$ will be

$$\begin{aligned}
\hat{y} &= \mathrm{sign}\big( w^T \phi(x) \big) \\
&= \mathrm{sign}\!\left( \sum_{n=1}^{N} \alpha_n y_n \phi(x_n)^T \phi(x) \right) \qquad (2) \\
&= \mathrm{sign}\!\left( \sum_{n=1}^{N} \alpha_n y_n K(x_n, x) \right) \qquad (3)
\end{aligned}$$

Since we don’t explicitly know what φ is, we will have to solve the dual formulation for α as follows:

Kernelized Perceptron - Stochastic Gradient Descent

1. Initialise: $\alpha = \alpha^{(0)} = 0 \in \mathbb{R}^{N \times 1}$, $t = 0$, and set the learning rate $\eta_t = 1\ \forall t$
2. Pick a random example: $(x_n, y_n)$, $n \in \{1, 2, \dots, N\}$
3. Compute the prediction:
$$\hat{y}_n = \mathrm{sign}\!\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x_n) \right)$$
4. Check: if $y_n \hat{y}_n < 0$, update $\alpha_n = \alpha_n + 1$
(increase the number of mistakes made on $(x_n, y_n)$)
5. Stopping condition: go to step 2 if not yet converged
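A minimal Python sketch of this loop is given below; the RBF kernel, the function names, and the epoch-based stopping rule are illustrative assumptions (any valid kernel K and convergence check could be substituted).

import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    # illustrative kernel choice; any valid K(., .) can be plugged in
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def kernelized_perceptron(X, y, kernel=rbf_kernel, max_epochs=100):
    # X: (N, D) data, y: (N,) labels in {-1, +1}
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    alpha = np.zeros(N)                                # alpha_n = mistakes on example n
    for _ in range(max_epochs):
        mistakes = 0
        for n in np.random.permutation(N):             # step 2: visit examples in random order
            score = np.sum(alpha * y * K[:, n])        # sum_i alpha_i y_i K(x_i, x_n), step 3
            if y[n] * score <= 0:                      # step 4: mistake (a zero score counts as one)
                alpha[n] += 1
                mistakes += 1
        if mistakes == 0:                              # step 5: converged on the training set
            break
    return alpha

Prediction on a new point x is then sign(sum_n alpha[n] * y[n] * kernel(X[n], x)), exactly equation (3).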

 


Solution 3

Given $\{(x_n, y_n)\}_{n=1}^{N}$ with $x_n \in \mathbb{R}^{D \times 1}$ and $y_n \in \{\pm 1\}$.

The objective for the SVM with unequal class importance is

$$\begin{aligned}
\min_{w, b, \xi} \quad & \frac{\|w\|^2}{2} + \sum_{n=1}^{N} C_{y_n} \xi_n \\
\text{subject to} \quad & y_n (w^T x_n + b) \ge 1 - \xi_n \quad \forall n = 1, 2, \dots, N \\
& \xi_n \ge 0 \quad \forall n = 1, 2, \dots, N
\end{aligned} \qquad (1)$$

The Lagrangian (primal) problem for this is given by
$$\min_{w, b, \xi}\ \frac{\|w\|^2}{2} + \sum_{n=1}^{N} C_{y_n} \xi_n + \sum_{n=1}^{N} \alpha_n \big( (1 - \xi_n) - y_n (w^T x_n + b) \big) - \sum_{n=1}^{N} \beta_n \xi_n \qquad (2)$$

where $\alpha_n \ge 0$ and $\beta_n \ge 0$ are the Lagrange multipliers. Let

$$L_p(w, b, \xi) = \frac{\|w\|^2}{2} + \sum_{n=1}^{N} C_{y_n} \xi_n + \sum_{n=1}^{N} \alpha_n \big( (1 - \xi_n) - y_n (w^T x_n + b) \big) - \sum_{n=1}^{N} \beta_n \xi_n \qquad (3)$$

Setting the partial derivatives to 0, we obtain

$$\begin{aligned}
\nabla_w L_p = 0 \;&\Rightarrow\; w - \sum_{n=1}^{N} \alpha_n y_n x_n = 0 \;\Rightarrow\; w = \sum_{n=1}^{N} \alpha_n y_n x_n \qquad (4) \\
\nabla_b L_p = 0 \;&\Rightarrow\; \sum_{n=1}^{N} \alpha_n y_n = 0 \qquad (5) \\
\nabla_{\xi_n} L_p = 0 \;&\Rightarrow\; C_{y_n} - \alpha_n - \beta_n = 0 \;\Rightarrow\; \alpha_n = C_{y_n} - \beta_n \qquad (6)
\end{aligned}$$
Substituting (4)-(6) into (3) and using (2) gives

$$\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{n=1}^{N} \alpha_n \\
\text{subject to} \quad & 0 \le \alpha_n \le C_{y_n} \quad \forall n = 1, 2, \dots, N \\
& \sum_{n=1}^{N} \alpha_n y_n = 0
\end{aligned}$$

or, equivalently, we can state the dual formulation as a maximisation problem:

$$\begin{aligned}
\max_{\alpha} \quad & \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (7) \\
\text{subject to} \quad & 0 \le \alpha_n \le C_{y_n} \quad \forall n = 1, 2, \dots, N \\
& \sum_{n=1}^{N} \alpha_n y_n = 0
\end{aligned}$$

(7) gives the required dual (maximisation) formulation.

This differs from the standard dual problem because here we have separate upper limits on the $\alpha_n$ corresponding to $x_n$, depending on whether $y_n = +1$ (in which case the upper bound is $C_{+1}$) or $y_n = -1$ (in which case the upper bound is $C_{-1}$). This makes sense because we may want to penalise mistakes on the positive and negative classes differently (i.e., weight false negatives and false positives differently), since one type of error might cost more than the other in practical applications.
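As a quick aside (not part of the assignment), this per-class cost is what scikit-learn's SVC exposes through its class_weight argument, which multiplies the base C separately for each class; the toy data and the cost values below are made up for illustration.

import numpy as np
from sklearn.svm import SVC

# Hypothetical per-class costs C_{+1} and C_{-1}
C_pos, C_neg = 10.0, 1.0

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))                   # toy 2-D data
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# class_weight scales C per class, giving the box constraints
# 0 <= alpha_n <= C_{y_n} from the dual in (7).
clf = SVC(kernel="linear", C=1.0, class_weight={1: C_pos, -1: C_neg})
clf.fit(X, y)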



Solution 4
Here, we would like to improve upon the standard batch K-means algorithm and make it "online" using stochastic gradient descent.

Let $x_n \in \mathbb{R}^{D \times 1}$ denote the incoming data points, let $K$ denote the number of clusters (given), and let $\mu_k \in \mathbb{R}^{D \times 1}$, $k = 1, 2, \dots, K$, denote the centroids (to be learned).

Before giving the algorithm, purely for convenience, we rewrite the given K-means objective function.

We are given

$$L = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\, \|x_n - \mu_k\|^2$$

where $z_n = [\, z_{n1}\ \dots\ z_{nK} \,] \in \mathbb{R}^{1 \times K}$ is the one-hot representation of $x_n$, indicating which cluster it belongs to.

Equivalently, this objective can be rewritten as

$$L = \sum_{n=1}^{N} \|x_n - \mu_{z_n}\|^2$$

For a single vector $x_n$, this objective is given by:

$$L(x_n, \mu_{z_n}) = \|x_n - \mu_{z_n}\|^2$$

More generally, define

$$L(x_n, \mu_k) = \|x_n - \mu_k\|^2 \quad \text{for } n = 1, \dots, N,\ k = 1, \dots, K \qquad (1)$$

The key to designing the online algorithm is to observe the following. Let $\mu^{(n)}$ denote the mean of a sample of $n$ points $\{x_1, \dots, x_n\}$. Then we have

$$\mu^{(n)} = \frac{1}{n} \sum_{j=1}^{n} x_j = \frac{1}{n} \sum_{j=1}^{n-1} x_j + \frac{1}{n} x_n = \frac{n-1}{n}\, \mu^{(n-1)} + \frac{1}{n} x_n$$

Hence, $\mu^{(n)}$ is a linear, in fact a convex, combination of $\mu^{(n-1)}$ and $x_n$. Further, we can see that

$$\mu^{(n)} = \frac{n-1}{n}\, \mu^{(n-1)} + \frac{1}{n} x_n = \mu^{(n-1)} - \frac{1}{n}\, \mu^{(n-1)} + \frac{1}{n} x_n \qquad (2)$$

Therefore,

$$\mu^{(n)} = \mu^{(n-1)} + \frac{1}{n} \big( x_n - \mu^{(n-1)} \big) \qquad (3)$$

After observing this, we have the following online algorithm (4):

1. Initialise centroids: initialise $\{\mu_k\}_{k=1}^{K}$ randomly and set $n_k = 1$ for $k = 1, \dots, K$, where $n_k$ denotes the size of the cluster with centroid $\mu_k$
2. Input: let $x_n$ be the incoming vector
3. Find the closest centroid (greedy): $j = \arg\min_i L(x_n, \mu_i)$ (refer to (1))
4. Assign the input to that cluster: $z_n = [\, 0_1\ 0_2\ \dots\ 0_{j-1}\ 1_j\ 0_{j+1}\ \dots\ 0_K \,]$
5. Update the closest centroid: $\mu_j = \mu_j + \frac{1}{n_j + 1}\,(x_n - \mu_j)$
6. Update the cluster size: $n_j = n_j + 1$
7. Stopping condition: stop if there are no more incoming vectors, i.e., all have been assigned a cluster; else repeat steps 2-7
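A minimal NumPy sketch of steps 1-7 is given below; the function name, the random initialisation, and treating the data stream as a plain iterable are illustrative assumptions.

import numpy as np

def online_kmeans(stream, K, D, rng=np.random.default_rng(0)):
    # stream: any iterable of D-dimensional vectors arriving one at a time
    mu = rng.standard_normal((K, D))                     # step 1: random centroids
    n_k = np.ones(K)                                     # cluster sizes n_k
    for x in stream:                                     # step 2: incoming vector
        j = np.argmin(np.sum((mu - x) ** 2, axis=1))     # steps 3-4: closest centroid / assignment
        mu[j] += (x - mu[j]) / (n_k[j] + 1)              # step 5: SGD update, step size 1/(n_j + 1)
        n_k[j] += 1                                      # step 6: update cluster size
    return mu, n_k                                       # step 7: stop when the stream ends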

From the algorithm above, we can answer the questions asked in the assignment:

Q. How would you solve step 1 (step 1 as numbered in the assignment)?

A. We greedily assign the incoming vector $x_n$ to the cluster of the closest centroid, based on Euclidean distance.

Q. What will be the SGD-based cluster mean update equation for step 2? Intuitively, why does the update equation make sense?

A. The SGD-based cluster mean/centroid update is given in step 5 of the algorithm in (4):

$$\mu_j = \mu_j + \frac{1}{n_j + 1}\,(x_n - \mu_j)$$

which makes intuitive sense from (2) and (3), because we are taking a weighted average of the incoming vector with the vectors already present in the cluster.

Q. For your derived SGD update, suggest a good choice of the step size and mention why you think it is a good choice.

A. From step 5 we see that the step size is $\frac{1}{n_j + 1}$.

This is a good choice of step size for two reasons. First, it makes intuitive sense from (2) and (3), since in the weighted average we are supposed to divide by the new size of the cluster. Second, it is always $\le 1$ and monotonically decreasing as more data points (vectors) arrive.


Solution 5
Suppose we are given $\{x_n\}_{n=1}^{N}$ where $x_n \in \mathbb{R}^{D \times 1}$, and let $K$ be the number of clusters (given). We want to implement kernelized K-means clustering.

Let $K(\cdot,\cdot) : \mathbb{R}^{D \times 1} \times \mathbb{R}^{D \times 1} \to \mathbb{R}$ be an arbitrary kernel that implicitly induces the map $\phi(\cdot) : \mathbb{R}^{D \times 1} \to \mathcal{F}$, where $\mathcal{F}$ is an abstract, possibly infinite-dimensional space, and assume that we can use the kernel trick.

The challenge here is that, since $\mathcal{F}$ may be infinite dimensional, we cannot store $\phi(x_n)$ for $n = 1, \dots, N$, i.e., the kernel-induced feature-map representations of the data points, nor the cluster centroids, since these lie in $\mathcal{F}$.

To overcome this, we will use the kernel trick, which replaces the inner product $\langle \phi(\cdot), \phi(\cdot) \rangle$ by $K(\cdot,\cdot)$. This allows us to operate in the original space $\mathbb{R}^{D \times 1}$ without having to deal with $\mathcal{F}$ explicitly.

For convenience, for each $x_n \in \mathbb{R}^{D \times 1}$ we assign $z_n \in \mathbb{R}^{1 \times K}$, a one-hot representation of the cluster that $x_n$ belongs to; for example, if $x_n$ belongs to the cluster with centroid $\mu_j$, then $z_n = [\, 0_1\ 0_2\ \dots\ 0_{j-1}\ 1\ 0_{j+1}\ \dots\ 0_K \,]$.

Also, let Cj represent the set of data points that belong in the cluster with centroid µj

The standard K-means clustering algorithm randomly initialises the $K$ centroids $\mu_k \in \mathbb{R}^{D \times 1}$, for $k = 1, 2, \dots, K$.

Then, we assign a cluster to each data point by the rule

$$z_n = \arg\min_{k \in \{1, 2, \dots, K\}} \|x_n - \mu_k\|^2 \qquad (1)$$

After this, each centroid is updated by the rule

$$\mu_j = \frac{1}{|C_j|} \sum_{x_n \in C_j} x_n \qquad (2)$$

and we repeat this until no data points are re-assigned or some given convergence criterion is satisfied.

In the kernelized version of this, the centroids live in $\mathcal{F}$, so we instead initialise the $K$ clusters $C_k$, for $k = 1, 2, \dots, K$, using points from the original space $\mathbb{R}^{D \times 1}$ (see the algorithm below). The assignment step would require us to compute, for each $x_n$,

$$z_n = \arg\min_{k = 1, 2, \dots, K} \|\phi(x_n) - \mu_k\|^2 \qquad (3)$$

and the centroids would then be recomputed as

$$\mu_j = \frac{1}{|C_j|} \sum_{x_n \in C_j} \phi(x_n) \qquad (4)$$

Now note that we can compute (3) using (4) and the kernel trick:

$$\begin{aligned}
z_n &= \arg\min_{k = 1, 2, \dots, K} \|\phi(x_n) - \mu_k\|^2 \\
&= \arg\min_{k}\ \langle \phi(x_n), \phi(x_n) \rangle + \langle \mu_k, \mu_k \rangle - 2 \langle \phi(x_n), \mu_k \rangle \\
&= \arg\min_{k}\ \langle \mu_k, \mu_k \rangle - 2 \langle \phi(x_n), \mu_k \rangle \quad \text{(dropping } \langle \phi(x_n), \phi(x_n) \rangle \text{, which does not depend on } k\text{)} \\
&= \arg\min_{k}\ \left\langle \frac{1}{|C_k|} \sum_{x_i \in C_k} \phi(x_i),\ \frac{1}{|C_k|} \sum_{x_j \in C_k} \phi(x_j) \right\rangle - 2 \left\langle \phi(x_n),\ \frac{1}{|C_k|} \sum_{x_j \in C_k} \phi(x_j) \right\rangle \\
&= \arg\min_{k}\ \frac{1}{|C_k|^2} \sum_{x_i \in C_k} \sum_{x_j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle - \frac{2}{|C_k|} \sum_{x_j \in C_k} \langle \phi(x_n), \phi(x_j) \rangle \\
&= \arg\min_{k}\ \frac{1}{|C_k|^2} \sum_{x_i \in C_k} \sum_{x_j \in C_k} K(x_i, x_j) - \frac{2}{|C_k|} \sum_{x_j \in C_k} K(x_n, x_j) \qquad (5)
\end{aligned}$$

which does not require storing the centroids $\mu_k$ explicitly.

Hence, the kernelized K-means algorithm can be given as follows (6):

1. Initialise the clusters: initialise each cluster $C_k$, $k = 1, \dots, K$, with one of the given data points, chosen randomly
2. Assign clusters (greedy): for $n = 1, 2, \dots, N$,
$$z_n = \arg\min_{k = 1, 2, \dots, K}\ \frac{1}{|C_k|^2} \sum_{x_i \in C_k} \sum_{x_j \in C_k} K(x_i, x_j) - \frac{2}{|C_k|} \sum_{x_j \in C_k} K(x_n, x_j)$$
3. Update the clusters: for $j = 1, 2, \dots, K$,
$$C_j = \{\, x_n \mid z_n = [\, 0_1\ \dots\ 0_{j-1}\ 1\ 0_{j+1}\ \dots\ 0_K \,] \,\}$$
4. Stopping condition: stop if no data points are re-assigned, or if any other given convergence criterion is met; else repeat steps 2-4
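A minimal NumPy sketch of this algorithm is given below; it takes a precomputed kernel matrix (kernel_matrix[i, j] = K(x_i, x_j)), and the function name, the random seeding, and the iteration cap are illustrative assumptions. The kernel matrix could be built with, e.g., sklearn.metrics.pairwise.rbf_kernel.

import numpy as np

def kernel_kmeans(kernel_matrix, K, max_iters=50, rng=np.random.default_rng(0)):
    N = kernel_matrix.shape[0]
    z = np.full(N, -1)
    z[rng.choice(N, size=K, replace=False)] = np.arange(K)   # step 1: seed each cluster with one point
    for _ in range(max_iters):
        clusters = [np.where(z == k)[0] for k in range(K)]
        cost = np.full((N, K), np.inf)
        for k, idx in enumerate(clusters):
            if len(idx) == 0:
                continue                                     # leave empty clusters at +inf cost
            # (1/|C_k|^2) * sum_{i,j in C_k} K(x_i, x_j): the same value for every x_n
            within = kernel_matrix[np.ix_(idx, idx)].sum() / len(idx) ** 2
            # (2/|C_k|) * sum_{j in C_k} K(x_n, x_j), computed for all n at once
            cross = 2.0 * kernel_matrix[:, idx].sum(axis=1) / len(idx)
            cost[:, k] = within - cross                      # the objective in (5)
        z_new = cost.argmin(axis=1)                          # step 2: assignment via (5)
        if np.array_equal(z_new, z):                         # step 4: stop if nothing changed
            break
        z = z_new                                            # step 3: clusters follow from z
    return z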

Q. What is the difference between how the cluster centroids would need to be stored in kernel K-means versus how they are stored in standard K-means?

A. In standard K-means, we have to store the centroids explicitly at each iteration. In kernelized K-means, we do not have to store the centroids; we only have to maintain the clusters, i.e., keep track of which data points belong to which cluster.

Q. Assuming each input to be D-dimensional in the original feature space, and N to be the number of inputs, how does kernel K-means compare with standard K-means in terms of the cost of the input-to-cluster-mean distance calculation?

A. In standard K-means, the cost of computing the distance between a data point and a cluster centroid is O(D), since we are calculating the Euclidean distance between two D-dimensional points. In kernelized K-means, however, the distance between a data point and a cluster centroid is computed using the objective in (5), which is O(N^2) in the worst case (a cluster can contain up to N points, giving a double sum with up to N^2 kernel terms).

During the cluster assignment of $x_n$, standard K-means requires us to calculate the distance of $x_n$ from all K centroids, which takes O(KD); adding this up over all data points gives O(KDN) time in total. For cluster assignment in kernelized K-means, we have to evaluate step 2 of (6), which is O(KN^2) for each $x_n$; adding over all data points gives O(KN^3) in total. We see that the kernelized version is slower than the standard one.

Solution 6
Dataset : binclass.txt

Model : Positive and negative labeled examples are sampled from $\mathcal{N}(\mu_+, \sigma_+^2 I_2)$ and $\mathcal{N}(\mu_-, \sigma_-^2 I_2)$ respectively

Boundary : $\mathcal{N}(x; \mu_+, \sigma_+^2 I_2) - \mathcal{N}(x; \mu_-, \sigma_-^2 I_2) = 0$ (quadratic in general; linear only if $\sigma_+ = \sigma_-$)

(decision boundary plot omitted)

Dataset : binclass.txt

Model : Positive and negative labeled examples are sampled from $\mathcal{N}(\mu_+, \sigma^2 I_2)$ and $\mathcal{N}(\mu_-, \sigma^2 I_2)$ respectively

Boundary : $\mathcal{N}(x; \mu_+, \sigma^2 I_2) - \mathcal{N}(x; \mu_-, \sigma^2 I_2) = 0$, Linear

(decision boundary plot omitted)

Dataset : binclass libSVM.txt

Model : SVM with linear kernel and hyper-parameter C = 1

Boundary : Linear
Library : SVC from sklearn.svm

(decision boundary plot omitted)

Dataset : binclassv2.txt

Model : Positive and negative labeled examples are sampled from $\mathcal{N}(\mu_+, \sigma_+^2 I_2)$ and $\mathcal{N}(\mu_-, \sigma_-^2 I_2)$ respectively

Boundary : $\mathcal{N}(x; \mu_+, \sigma_+^2 I_2) - \mathcal{N}(x; \mu_-, \sigma_-^2 I_2) = 0$ (quadratic in general; linear only if $\sigma_+ = \sigma_-$)

(decision boundary plot omitted)

Dataset : binclassv2.txt

Model : Positive and negative labeled examples are sampled from $\mathcal{N}(\mu_+, \sigma^2 I_2)$ and $\mathcal{N}(\mu_-, \sigma^2 I_2)$ respectively

Boundary : $\mathcal{N}(x; \mu_+, \sigma^2 I_2) - \mathcal{N}(x; \mu_-, \sigma^2 I_2) = 0$, Linear

(decision boundary plot omitted)

Dataset : binclassv2 libSVM.txt

Model : SVM with linear kernel and hyper-parameter C = 1

Boundary : Linear
Library : SVC from sklearn.svm

(decision boundary plot omitted)

Conclusion
For the first dataset, binclass.txt, all three models, i.e., the Gaussian model with class-specific covariances, the Gaussian model with a shared covariance, and the linear SVM, seem to do an equally good job. However, the second dataset, binclassv2.txt, contains some outliers in the +ve (red) class, because of which the Gaussian model with class-specific covariances overfits the data, whereas the Gaussian model with shared covariance and the linear SVM are still able to generalise well.
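For reference, a minimal sketch of how the two Gaussian generative models and the linear SVM above could be fit is given below. The file-loading format (comma-separated columns x1, x2, label with labels in {-1, +1}), the function name, and the equal-prior assumption are all illustrative assumptions; plotting of the boundaries is omitted.

import numpy as np
from sklearn.svm import SVC

def fit_gaussian_score(X, y, shared_variance=False):
    # MLE for the two class-conditional Gaussians N(mu, sigma^2 I_2);
    # returns a score whose zero level set is the decision boundary
    # (equal class priors assumed).
    Xp, Xn = X[y == 1], X[y == -1]
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    var_p = np.mean(np.sum((Xp - mu_p) ** 2, axis=1)) / 2.0   # sigma_+^2 (D = 2)
    var_n = np.mean(np.sum((Xn - mu_n) ** 2, axis=1)) / 2.0   # sigma_-^2
    if shared_variance:                                       # pooled estimate for the shared case
        var_p = var_n = (len(Xp) * var_p + len(Xn) * var_n) / len(X)

    def score(x):
        # log N(x; mu_+, var_+ I_2) - log N(x; mu_-, var_- I_2); the log(2*pi) terms cancel
        lp = -np.sum((x - mu_p) ** 2) / (2 * var_p) - np.log(var_p)
        ln = -np.sum((x - mu_n) ** 2) / (2 * var_n) - np.log(var_n)
        return lp - ln

    return score

# Hypothetical loading; the actual format of binclass.txt may differ.
data = np.loadtxt("binclass.txt", delimiter=",")
X, y = data[:, :2], data[:, 2].astype(int)

score_diff_var = fit_gaussian_score(X, y)                        # class-specific variances
score_same_var = fit_gaussian_score(X, y, shared_variance=True)  # shared variance
svm = SVC(kernel="linear", C=1).fit(X, y)                        # linear SVM, as in the plots above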
