Machine-Learning-Homework 4

Ask questions in #ask-your-tutor-viktor
In the last exercise you implemented LDA yourself. In this exercise, you'll take a look under the hood of LDA: you will work through a step-by-step derivation of LDA from the Least Squares Error. In the second part of the exercise you will implement a Naive Bayes Classifier on the digits dataset.


1           LDA Derivation from the Least Squares Error
The goal of this exercise is to derive closed-form expressions for the optimal parameters $\hat{w}$ and $\hat{b}$ in Linear Discriminant Analysis, given some training set with two classes. Remember that the decision boundary in LDA is a $(D-1)$-dimensional hyperplane (where $D$ is the dimension of the feature space) that we parametrize via

$\hat{w}^T x + \hat{b} = 0.$                                                                                   (1)

$\hat{w}$ is the hyperplane's normal vector and $\hat{b}$ a scalar fixing its position in the $D$-dimensional space. Note that $w$ and $x$ are column vectors in this exercise. The decision rule for our two classes at query point $x$ is then

$\hat{y} = \operatorname{sign}\!\left(\hat{w}^T x + \hat{b}\right).$                                                                                   (2)

In the training phase we are given $N$ datapoints $\{x_i\}_{i=1,\dots,N}$ with $x_i \in \mathbb{R}^D$ and their respective labels $\{y_i\}_{i=1,\dots,N}$ with $y_i \in \{-1, 1\}$. We assume that the training set is balanced, i.e.

$N_1 = N_{-1} = \frac{N}{2},$                                                                                   (3)

with $N_k$ denoting the number of instances in either class. The optimal parameters $\hat{w}$ and $\hat{b}$ are now the ones minimizing the least squares error criterion:

$\hat{w}, \hat{b} \;=\; \operatorname*{argmin}_{w,\,b}\; \sum_{i=1}^{N} \left(w^T x_i + b - y_i\right)^2.$                                                         (4)

You shall solve this problem in three steps:

1.1           
First, compute $\hat{b}$ from

$\frac{\partial}{\partial b} \sum_{i=1}^{N} \left(w^T x_i + b - y_i\right)^2 \;=\; 0.$                                                                   (5)
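For orientation, a possible sketch of this step, assuming the reconstructed criterion (4) and the balanced-class condition (3) (one route, not necessarily the intended one):

```latex
% Sketch of step 1.1 under criterion (4) and balanced classes (3).
\begin{align*}
0 \;=\; \frac{\partial}{\partial b} \sum_{i=1}^{N} \left(w^T x_i + b - y_i\right)^2
  &\;=\; 2 \sum_{i=1}^{N} \left(w^T x_i + b - y_i\right) \\
\Rightarrow\quad \hat{b}
  &\;=\; \underbrace{\frac{1}{N}\sum_{i=1}^{N} y_i}_{=\,0 \text{ by (3)}}
      \;-\; w^T\, \frac{1}{N}\sum_{i=1}^{N} x_i
   \;=\; -\,\frac{1}{2}\, w^T \left(\mu_1 + \mu_{-1}\right).
\end{align*}
% The label sum vanishes because the classes are balanced, and the overall data
% mean of a balanced set is the average of the two class means (8) and (9).
```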

1.2           
Second, use this result to reshuffle

$\frac{\partial}{\partial w} \sum_{i=1}^{N} \left(w^T x_i + b - y_i\right)^2 \;=\; 0$                                                                  (6)

into the intermediate equation

$\left(S_W + \frac{N}{4}\, S_B\right) w \;=\; \frac{N}{2}\, \left(\mu_1 - \mu_{-1}\right).$                                                                   (7)

Here, $\mu_1$ and $\mu_{-1}$ are the class means

$\mu_1 = \frac{1}{N_1} \sum_{i:\, y_i = 1} x_i$                                                                                   (8)

$\mu_{-1} = \frac{1}{N_{-1}} \sum_{i:\, y_i = -1} x_i$                                                                                   (9)

and $S_B$ and $S_W$ are the between-class and within-class covariance matrices

$S_B = \left(\mu_1 - \mu_{-1}\right)\left(\mu_1 - \mu_{-1}\right)^T$                                                                                   (10)

$S_W = \sum_{i=1}^{N} \left(x_i - \mu_{y_i}\right)\left(x_i - \mu_{y_i}\right)^T.$                                                                                   (11)

The notation $\mu_{y_i}$ means

$\mu_{y_i} = \begin{cases} \mu_{-1} & \text{if } y_i = -1 \\ \mu_{1} & \text{if } y_i = 1 \end{cases}$                                                                                   (12)

During your calculations you may find the following relation for general vectors $a$, $b$ and $c$ useful:

$a \cdot \left(b^T \cdot c\right) = \left(a \cdot c^T\right) \cdot b$                                                                        (13)
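For instance (a sketch under the equations reconstructed above), taking $a = x_i$, $b = w$, $c = x_i$ in (13) turns a vector times a scalar into a matrix times $w$:

```latex
% How relation (13) is typically used in step 1.2: pull w out to the right,
% so that every term of the gradient condition (6) becomes "matrix times w".
\[
\sum_{i=1}^{N} x_i \left(w^T x_i\right)
  \;\overset{(13)}{=}\;
\sum_{i=1}^{N} \left(x_i\, x_i^T\right) w
  \;=\;
\left(\sum_{i=1}^{N} x_i\, x_i^T\right) w ,
\]
% after which the outer products can be regrouped into S_W and S_B, eq. (7).
```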

1.3           
Finally, transform equation (7) into

$\hat{w} \;=\; \tau\, S_W^{-1} \left(\mu_1 - \mu_{-1}\right),$                                                                                   (14)

where $\tau$ is an arbitrary positive constant, expressing the fact that $\operatorname{sign}\!\left(\tau\,(\hat{w}^T x + \hat{b})\right)$ is the same decision rule as $\operatorname{sign}\!\left(\hat{w}^T x + \hat{b}\right)$.
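For reference, a minimal NumPy sketch of the resulting closed-form classifier, assuming the reconstructed equations (8)–(14), labels coded as ±1, and $\tau = 1$; the function and variable names are illustrative, not prescribed by the exercise:

```python
import numpy as np

def fit_lda_least_squares(X, y):
    """Closed-form two-class LDA fit via equations (8)-(14).

    X: (N, D) array of features, y: (N,) array of labels in {-1, +1}.
    Returns the weight vector w_hat and offset b_hat.
    """
    mu_pos = X[y == 1].mean(axis=0)        # class mean mu_1, eq. (8)
    mu_neg = X[y == -1].mean(axis=0)       # class mean mu_-1, eq. (9)

    # within-class scatter S_W, eq. (11)
    centered = X - np.where((y == 1)[:, None], mu_pos, mu_neg)
    S_W = centered.T @ centered

    # eq. (14) with tau = 1; any positive tau yields the same decisions
    w_hat = np.linalg.solve(S_W, mu_pos - mu_neg)
    # offset from step 1.1: b_hat = -w_hat^T (mu_1 + mu_-1) / 2
    b_hat = -0.5 * w_hat @ (mu_pos + mu_neg)
    return w_hat, b_hat

def predict_lda(X, w_hat, b_hat):
    """Decision rule (2): sign(w^T x + b)."""
    return np.sign(X @ w_hat + b_hat)
```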

2           Naive Bayes Classifier 
We will once again work with the digits dataset, this time using the naive Bayes classifier. If we assume that all classes are equally likely, i.e. the priors are $p(y = k) = 1/C$ with $C$ the number of classes, the decision rule is defined by

$\hat{y} = \operatorname*{argmax}_{k} \prod_{j=1}^{D} p_j\left(x_j \mid y = k\right)$                                                                                   (15)

where $p_j(x_j \mid y = k)$ are 1-dimensional histograms (or other suitable 1-dimensional probability estimators) for each feature $j$ and class $k$. Since the product of many small probabilities is a tiny number prone to numerical inaccuracy, it is better to rewrite this as a sum of logarithms:

$\hat{y} = \operatorname*{argmax}_{k} \sum_{j=1}^{D} \log p_j\left(x_j \mid y = k\right)$                                                                                   (16)
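A toy NumPy illustration of why the logarithmic form (16) is numerically preferable (the probability values are made up for illustration):

```python
import numpy as np

# 64 per-feature likelihoods of about 1e-6 each, as might enter eq. (15)
# for the 64-dimensional digits features; the numbers are purely illustrative.
p = np.full(64, 1e-6)

print(np.prod(p))         # 0.0 -- the product (~1e-384) underflows double precision
print(np.sum(np.log(p)))  # about -884.2 -- perfectly representable, eq. (16)
```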

Implement training of the naive Bayes classifier as a function

histograms, binning = fit_naive_bayes(features, labels, bincount)

where histograms is the $C \times D \times L$ array of histograms ($D$ is the number of feature dimensions, $L$ the number of bins), and binning is a $C \times D \times 2$ array describing the bin layout. Specifically, binning[k,j,0] is the lower bound of the first bin for class $k$ and feature $j$, and binning[k,j,1] is the corresponding bin width. The data in binning is needed during prediction to find out which bin $l_{kj}$ a new instance $i$ falls into:

$l_{kj} = \left\lfloor \dfrac{X_{ij} - \text{binning}[k,j,0]}{\text{binning}[k,j,1]} \right\rfloor$                                                                                   (17)

where $\lfloor\cdot\rfloor$ is the floor function, i.e. rounding down. When fit_naive_bayes() is called with bincount=0, the number of bins shall be determined automatically according to the Freedman-Diaconis rule. However, this rule recommends a bin width $\Delta x_{kj}$ for each feature $j$ in class $k$, and the corresponding bin count must be computed by the equation

$\#\text{bins}_{kj} = \left\lceil \dfrac{\max_i X_{ij} - \min_i X_{ij}}{\Delta x_{kj}} \right\rceil$                                                                                 (18)

where $\lceil\cdot\rceil$ is the ceiling function, i.e. rounding up. Thus, the rule may suggest a different number of bins for every feature, whereas we want a fixed bin count $L$ for simplicity. You should therefore choose $L$ as a suitable compromise between the different recommendations of the Freedman-Diaconis rule. For features that need fewer bins than your choice of $L$, you may just fill the first $L' < L$ bins and leave the rest empty. For features where the rule recommends more bins, you must increase $\Delta x_{kj}$ so that the data fits in $L$ bins.

Now implement prediction as a function

predicted_labels = predict_naive_bayes(test_features, histograms, binning)
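The following is one possible sketch of the two functions, under the reconstructed equations (15)–(18). The choice of a single $L$ when bincount=0 (here: the median of the Freedman-Diaconis recommendations), the uniform bin width per class and feature, and the handling of constant features are assumptions of this sketch, not requirements of the exercise text:

```python
import numpy as np

def fit_naive_bayes(features, labels, bincount):
    """Train the naive Bayes classifier of equations (15)-(18).

    features: (N, D) array, labels: (N,) array of class labels.
    Returns histograms (C x D x L) and binning (C x D x 2), where
    binning[k, j, 0] is the lower bound of the first bin and
    binning[k, j, 1] the bin width for class k and feature j.
    """
    classes = np.unique(labels)
    C, D = len(classes), features.shape[1]

    if bincount == 0:
        # Freedman-Diaconis recommendation per class k and feature j:
        # bin width 2 * IQR / N_k^(1/3), converted to a bin count via eq. (18);
        # a single L is then chosen as a compromise (here: the median recommendation).
        recommendations = []
        for k in classes:
            Xk = features[labels == k]
            q75, q25 = np.percentile(Xk, [75, 25], axis=0)
            width = 2.0 * (q75 - q25) / len(Xk) ** (1.0 / 3.0)
            span = Xk.max(axis=0) - Xk.min(axis=0)
            ok = width > 0                               # skip constant features (IQR = 0)
            recommendations.extend(np.ceil(span[ok] / width[ok]).astype(int))
        L = int(np.median(recommendations))
    else:
        L = bincount

    histograms = np.zeros((C, D, L))
    binning = np.zeros((C, D, 2))
    for ki, k in enumerate(classes):
        Xk = features[labels == k]
        lo, hi = Xk.min(axis=0), Xk.max(axis=0)
        # Simplification: use (max - min) / L as the bin width for every feature,
        # which guarantees that all training values of class k fit into L bins.
        width = np.maximum((hi - lo) / L, 1e-12)
        binning[ki, :, 0] = lo
        binning[ki, :, 1] = width
        bins = np.floor((Xk - lo) / width).astype(int)   # eq. (17)
        bins = np.clip(bins, 0, L - 1)                   # the maximum lands in the last bin
        for j in range(D):
            counts = np.bincount(bins[:, j], minlength=L)
            histograms[ki, j] = counts / counts.sum()    # estimate of p_j(x_j | y = k)
    return histograms, binning


def predict_naive_bayes(test_features, histograms, binning):
    """Predict class indices by maximizing the sum of log-probabilities, eq. (16).

    The returned labels are indices into the sorted class order that
    fit_naive_bayes obtained from np.unique(labels).
    """
    C, D, L = histograms.shape
    N = test_features.shape[0]
    log_post = np.zeros((N, C))
    for k in range(C):
        bins = np.floor((test_features - binning[k, :, 0]) / binning[k, :, 1]).astype(int)
        bins = np.clip(bins, 0, L - 1)                   # clamp values outside the training range
        p = histograms[k, np.arange(D), bins]            # p_j(x_j | y = k) per instance and feature
        log_post[:, k] = np.log(p + 1e-12).sum(axis=1)   # small epsilon avoids log(0) for empty bins
    return np.argmax(log_post, axis=1)
```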

Filter the training set to use only digits “3” and “9”, train the classifier with

•      the 2-dimensional feature space you constructed in exercise 2 (with bincount=0), and

•      the full 64-dimensional feature space (one histogram per pixel with bincount=8),

and determine the two confusion matrices on the test set. Visualize the decision boundary for the 2-D case, overlaid with a scatterplot of the test data. A possible workflow for the 64-dimensional case is sketched below.
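A possible end-to-end sketch for the 64-dimensional case, using the functions sketched above; the 2-D case follows the same pattern with the features from exercise 2, and the decision-boundary plot is omitted here. The use of sklearn.datasets.load_digits, sklearn.metrics.confusion_matrix and the train/test split shown are illustrative assumptions, not prescribed by the exercise:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Load the digits data and keep only the classes "3" and "9".
digits = load_digits()
mask = np.isin(digits.target, [3, 9])
X, y = digits.data[mask], digits.target[mask]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Full 64-dimensional feature space: one histogram per pixel, 8 bins each.
histograms, binning = fit_naive_bayes(X_train, y_train, bincount=8)
pred_idx = predict_naive_bayes(X_test, histograms, binning)
pred_labels = np.unique(y_train)[pred_idx]   # map class indices back to digit labels

print(confusion_matrix(y_test, pred_labels))
```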
