Objective
To classify the Iris data set using linear discriminant functions.
Background
Linear discriminant functions are functions that are either linear in the components of the feature vector x or linear in some given set of functions of x. Linear classifiers can easily be built from discriminant functions and do not require knowledge of the underlying probability densities of the given data. Linear classifiers are attractive because of their simplicity and are ideal as initial, trial classifiers. The classifier parameters can be computed from a set of training samples by minimizing a criterion function.
A discriminant function g(x) that is linear in the components of the feature vector x can be written as:
g(x) = w^T x + w0 (1)
where w is the weight vector and w0 is the bias or threshold. In general, for a two-category problem we observe the feature vector x and decide ω1 (class 1) if g(x) > 0 and ω2 (class 2) if g(x) < 0.
A generalized form of linear discriminant functions can be given as:
g(x) = a^T y (2)
where the augmented feature vector y and augmented weight vector a are given by:
y = [1, x1, x2, ..., xd]^T,  a = [w0, w1, w2, ..., wd]^T (3)
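For example, in two dimensions (d = 2) the augmented vectors are y = [1, x1, x2]^T and a = [w0, w1, w2]^T, so that g(x) = a^T y = w0 + w1 x1 + w2 x2, which is exactly the form of equation (1); the bias w0 is absorbed into the weight vector through the constant first component of y.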
The goal is to compute the weight vector from the training data samples. Gradient descent procedures are commonly used to achieve this by iteratively updating the weight vector while minimizing some criterion function. The following pseudocode outlines the basic gradient descent algorithm:
1. begin initialize a, stopping threshold θ, learning rate η(·), k ← 0
2. do k ← k + 1
3. a ← a − η(k)∇J(a)
4. until |η(k)∇J(a)| < θ
5. return a
6. end
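As a concrete illustration, the following Python/NumPy sketch implements the pseudocode above for a generic criterion function. The function name, the max_iter safeguard, and the use of a vector norm for the stopping test are assumptions not specified in the pseudocode.

```python
import numpy as np

def basic_gradient_descent(a0, grad_J, eta, theta, max_iter=1000):
    """Basic gradient descent (steps 1-6 above).

    a0       : initial augmented weight vector a
    grad_J   : callable returning the gradient of the criterion, grad_J(a)
    eta      : callable eta(k) giving the learning rate at iteration k
    theta    : stopping threshold on |eta(k) * grad_J(a)|
    max_iter : safeguard in case the threshold is never reached
    """
    a = np.asarray(a0, dtype=float)
    for k in range(1, max_iter + 1):
        step = eta(k) * grad_J(a)      # eta(k) * grad J(a)
        a = a - step                   # a <- a - eta(k) * grad J(a)
        if np.linalg.norm(step) < theta:
            break                      # until |eta(k) * grad J(a)| < theta
    return a
```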
Laboratory Exercises
Let’s first consider the following data sets drawn from the Iris dataset investigated in Lab 1:
• data set A: a 50×2 matrix (50 rows, 2 columns) of the samples from the Iris Setosa class, with columns as the features x2 and x3 respectively.
• data set B: a 50×2 matrix (50 rows, 2 columns) of the samples from the Iris Versicolour class, with columns as the features x2 and x3 respectively.
• data set C: a 50×2 matrix (50 rows, 2 columns) of the samples from the Iris Virginica class, with columns as the features x2 and x3 respectively.
Using the gradient descent approach and the perceptron criterion, compute the weight vectors and classify the given data under the conditions listed below. The perceptron criterion function Jp(a) and its gradient ∇Jp are given by:
Jp(a) = Σ_{y∈Y} (−a^T y) (4)
∇Jp = Σ_{y∈Y} (−y) (5)
where Y(a) is the set of samples misclassified by a.
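As a sketch of how equations (4) and (5) can be evaluated in practice, the snippet below assumes the augmented samples are stacked as rows of a matrix and that samples from ω2 have already been negated (the usual sign-normalization, so that a correctly classified sample satisfies a^T y > 0); these storage and normalization conventions are assumptions, not part of the equations themselves.

```python
import numpy as np

def perceptron_criterion(a, Y):
    """Return Jp(a) and its gradient over the misclassified samples.

    a : augmented weight vector, shape (d + 1,)
    Y : augmented samples as rows, shape (n, d + 1); samples from class
        omega_2 are assumed to be negated, so a sample counts as
        misclassified when a^T y <= 0.
    """
    a = np.asarray(a, dtype=float)
    mis = Y[Y @ a <= 0]                 # misclassified set Y(a)
    if len(mis) == 0:
        return 0.0, np.zeros(Y.shape[1])
    Jp = np.sum(-(mis @ a))             # Eq. (4)
    grad = -np.sum(mis, axis=0)         # Eq. (5)
    return Jp, grad
```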
1. Construct a training set by setting aside 30% of the samples from data sets A and B, and a test set from the remaining 70% of the data. Use the training set to compute the weight vector for a constant learning rate η(k) = 0.01, θ = 0, and an initial value of a(0) = [0, 0, 1]^T. Limit the maximum number of iterations to 300 (i.e. stop after 300 iterations even if the stopping condition for θ = 0 cannot be achieved). A minimal sketch of steps 1 and 2 is given after this list.
2. Use the 70% set (test samples) and the weight vector computed in the previous step to calculate the classification accuracy of the classifier.
3. Repeat the above steps with the data split changed to 70% (training) and 30% (testing).
4. Repeat all of the above for training and test sets constructed from data sets B and C (i.e. Iris Versicolour vs. Iris Virginica).
5. Compute the weight vectors and study the behaviour of the gradient descent algorithm for two different values of η(k) and two different initial values of a.
6. In each of the above steps, plot the training data in the feature space, the decision boundary obtained from the training set, and the perceptron criterion function over the iterations, and report the number of iterations required for convergence.
7. Discuss in your report the following: (i) the effect of using different sizes for the training and testing data, (ii) the effect of different learning rates, thresholds, and initial weight values, (iii) the behaviour of the criterion function over the iterations, and (iv) the achieved classification accuracy for the given data.
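As referenced in step 1, the sketch below illustrates one possible way to carry out steps 1 and 2: it builds augmented, sign-normalized training and test sets from two 50×2 class matrices, trains on the perceptron criterion with gradient descent, and reports the test accuracy. The variable names A and B, the random split, and the omission of plotting are assumptions about how the Lab 1 data is stored, not requirements of the exercise.

```python
import numpy as np

def make_augmented(A, B):
    """Augment with a leading 1 and negate class-B samples, so that a
    correctly classified sample satisfies a^T y > 0."""
    YA = np.hstack([np.ones((len(A), 1)), A])
    YB = -np.hstack([np.ones((len(B), 1)), B])
    return np.vstack([YA, YB])

def train_perceptron(Y, a0, eta=0.01, theta=0.0, max_iter=300):
    """Gradient descent on the perceptron criterion, Eqs. (4)-(5)."""
    a = np.asarray(a0, dtype=float)
    Jp_history = []
    for k in range(1, max_iter + 1):
        mis = Y[Y @ a <= 0]                      # misclassified set Y(a)
        Jp_history.append(np.sum(-(mis @ a)))    # Jp(a), Eq. (4)
        if len(mis) == 0:
            break                                # every sample classified correctly
        step = eta * -np.sum(mis, axis=0)        # eta(k) * grad Jp(a), Eq. (5)
        a = a - step
        if np.linalg.norm(step) < theta:
            break
    return a, Jp_history

def accuracy(a, Y_test):
    """Fraction of (sign-normalized) test samples with a^T y > 0."""
    return float(np.mean(Y_test @ a > 0))

# Example usage, assuming A and B are the 50x2 matrices described above:
# rng = np.random.default_rng(0)
# idx = rng.permutation(50)
# tr, te = idx[:15], idx[15:]                    # 30% training / 70% testing split
# a, Jp_history = train_perceptron(make_augmented(A[tr], B[tr]), a0=[0, 0, 1])
# print("test accuracy:", accuracy(a, make_augmented(A[te], B[te])))
```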