
ME5404 Project 1: SVM for Classification of Spam Email Messages

I.     DATA
The data used in this project is the Spam Data Set[1], which contains a total of 4,601 examples. Each example has a feature vector with 57 attributes that represent the selected key features of an email message, and a label indicating whether the associated email message is spam or not. (Detailed description of these attributes can be found on the source webpage of the dataset.) One feature vector is shown below for illustration:

0.00000 0.01043 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

0.00000 0.00000 0.01043 0.01043 0.02105 0.00000 0.00000 0.00000

0.00000 0.03166 0.06332 0.00000 0.02105 0.00000 0.00000 0.00000

0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000

0.00000 0.00000 0.00000 0.00196 0.00000 0.00000 0.02601 0.12811 0.98827

For this project, three sub-datasets, namely, the training set (with the file name train.mat), the test set (test.mat), and the evaluation set (eval.mat), have been created from the Spam dataset. The training set and the test set contain 2,000 and 1,536 randomly selected examples, respectively. They are in the MATLAB MAT-file format and are included in the zip file that also contains this document; the eval.mat file is not included, since this dataset will be used in the assessment as described in Section V below.

In these two MAT-files, the feature vectors are held in a variable with the name <file_name_without_extension>_data, while the labels (either “+1” for spam or “−1” for non-spam) associated with the individual feature vectors are held in a variable with the name <file_name_without_extension>_label. Thus in train.mat, the two variables are train_data and train_label. Similarly, in test.mat the variables are test_data and test_label.
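The naming convention above can be exercised with a short loading sketch. It assumes train.mat and test.mat sit in the current folder; the helper name varNames is hypothetical. The handout does not state whether examples are stored as rows or as columns, so the sketch prints the sizes rather than assuming an orientation.

```matlab
% Variable names follow <file_name_without_extension>_data and ..._label.
varNames = @(base) {[base '_data'], [base '_label']};   % hypothetical helper

sets = {'train', 'test'};
for k = 1:numel(sets)
    base = sets{k};
    file = [base '.mat'];
    if exist(file, 'file') ~= 2
        fprintf('%s not found in the current folder; skipping.\n', file);
        continue;
    end
    S     = load(file);        % load into a struct to avoid name clashes
    names = varNames(base);
    X     = S.(names{1});      % feature vectors
    d     = S.(names{2});      % labels: +1 (spam) or -1 (non-spam)
    fprintf('%s: data %d x %d, labels %d x %d, spam fraction %.3f\n', ...
            base, size(X, 1), size(X, 2), size(d, 1), size(d, 2), ...
            mean(d(:) == 1));
end
```

If one dimension of the data matrix is 57, that dimension indexes the attributes.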

The third sub-dataset, namely, eval.mat, is formed using a subset of the remaining examples (after train.mat and test.mat have been chosen) in the Spam dataset. This third dataset will be used for the assessment of the program that you will submit, as described in Section V.

II.     REQUIREMENT
A. What to be done

The main tasks involved in this project are:

Task 1: Write a MATLAB (M-file) program to compute the discriminant function g(·), if one exists, for the following SVMs, using the training set provided:

(i)      A hard-margin[2] SVM with the linear kernel

$K(\mathbf{x}_1, \mathbf{x}_2) = \mathbf{x}_1^T \mathbf{x}_2$                            (1)

(ii)     A hard-margin SVM with a polynomial kernel

$K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1^T \mathbf{x}_2 + 1)^p$                            (2)

where the values of p are listed in Table 1.

(iii)    A soft-margin SVM with a polynomial kernel as given in Equation (2) above, and with the values for p and C as listed in Table 1.
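For reference, the three cases above lead to the same dual optimization problem and differ only in the kernel and in the bounds on the multipliers. The statement below is the standard textbook form, not taken from this handout; the symbols $d_i$ (label of example $i$), $\alpha_i$ (Lagrange multipliers), $N$ (number of training examples) and $b$ (bias) are notational assumptions.

```latex
\begin{aligned}
\max_{\alpha}\quad & Q(\alpha) = \sum_{i=1}^{N} \alpha_i
   - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
     \alpha_i \alpha_j d_i d_j K(\mathbf{x}_i, \mathbf{x}_j) \\
\text{subject to}\quad & \sum_{i=1}^{N} \alpha_i d_i = 0, \\
& \alpha_i \ge 0 \ \text{(hard margin)}
  \quad \text{or} \quad 0 \le \alpha_i \le C \ \text{(soft margin)}, \\[4pt]
g(\mathbf{x}) \;=\; & \sum_{i=1}^{N} \alpha_i d_i K(\mathbf{x}_i, \mathbf{x}) + b .
\end{aligned}
```

An example is classified as spam when $g(\mathbf{x}) > 0$ and as non-spam when $g(\mathbf{x}) < 0$.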

Note that the MATLAB function quadprog (available in the Optimization Toolbox) can be used to solve constrained optimization problems.
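A minimal sketch of how the dual problem might be passed to quadprog is given below. It is an illustration under stated assumptions, not the required solution: the toy data stands in for train_data and train_label, examples are assumed to be stored as columns, the polynomial kernel is assumed to have the form (x1'*x2 + 1)^p, the threshold tol is an arbitrary choice, and a large C is used to approximate the hard margin.

```matlab
% Toy data (hypothetical values): 2 features x 6 examples, labels in {-1,+1}.
X = [0 0 1 1 2 2;
     0 1 0 2 1 2];
d = [-1; -1; -1; 1; 1; 1];

p = 2;       % polynomial degree
C = 1e6;     % very large C approximates a hard margin; e.g. 0.1 for soft margin

% Kernels evaluated for all pairs of columns of A and B.
linearKernel = @(A, B) A' * B;               % use in place of polyKernel for case (i)
polyKernel   = @(A, B, p) (A' * B + 1).^p;   % assumed polynomial form

K = polyKernel(X, X, p);                     % N x N Gram matrix
N = numel(d);

% quadprog minimizes 0.5*a'*H*a + f'*a, so the dual objective is negated.
H   = (d * d') .* K;
f   = -ones(N, 1);
Aeq = d';   beq = 0;                         % sum_i alpha_i * d_i = 0
lb  = zeros(N, 1);
ub  = C * ones(N, 1);

opts = optimoptions('quadprog', 'Display', 'off');
[alpha, ~, exitflag] = quadprog(H, f, [], [], Aeq, beq, lb, ub, [], opts);
% exitflag == 1 indicates that quadprog converged to a solution.

% Support vectors: multipliers above a small numerical threshold (assumed).
tol      = 1e-6 * max(alpha);
sv       = alpha > tol;
onMargin = sv & (alpha < C - tol);           % support vectors with 0 < alpha < C

% Bias averaged over the margin support vectors.
b = mean(d(onMargin) - K(onMargin, :) * (alpha .* d));

% Discriminant function for new examples stored as columns of Xnew.
g = @(Xnew) polyKernel(X, Xnew, p)' * (alpha .* d) + b;
```

On real data the exit flag is worth checking before alpha is used, since a solution need not exist when the data cannot be separated with the chosen kernel.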

Task 2: Write a MATLAB (M-file) program to implement the SVMs with the discriminant functions obtained in Task 1. Apply these SVMs to classify the given training set and test set, and report the classification results in Table 1 by filling the entries indicated by “?”. Discuss the results and their implications, including issues related to the admissibility of the kernels and the existence of optimal hyperplanes for the three types of SVMs listed in Task 1 above.
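The accuracy entries and the admissibility discussion in Task 2 can be supported by two small computations, sketched below on made-up numbers. The helper names predict and accuracy, the tie rule for g = 0, and the eigenvalue tolerance are all assumptions. Note that a positive semidefinite Gram matrix on one sample is only a necessary condition for a kernel to satisfy Mercer's condition, so a passed check does not prove admissibility, while a clearly negative eigenvalue does indicate a problem.

```matlab
% --- Classification accuracy from discriminant values ----------------------
gVals  = [ 1.7; -0.2;  0.4; -2.1];   % hypothetical values of g(x)
labels = [ 1;   -1;   -1;   -1  ];   % corresponding true labels

predict  = @(g) sign(g) + (g == 0);               % g = 0 mapped to +1 (arbitrary)
accuracy = @(g, d) 100 * mean(predict(g) == d);   % in percent

acc = accuracy(gVals, labels);       % 3 of 4 correct, i.e. 75

% --- Gram-matrix check related to kernel admissibility ---------------------
Xs = [0 1 2;
      1 0 2];                        % toy sample, examples as columns
Ks = (Xs' * Xs + 1).^2;              % polynomial kernel with p = 2 (assumed form)

eigK = eig((Ks + Ks') / 2);          % symmetrize to suppress round-off
isPsdOnSample = min(eigK) > -1e-8 * max(abs(eigK));
```

The same accuracy helper can be applied to both the training set and the test set to fill in one row of the table.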

PETER C. Y. CHEN, 2021

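Type of SVM                        | Training accuracy             | Test accuracy
-----------------------------------+-------------------------------+------------------------------
Hard margin with linear kernel     | ?                             | ?
-----------------------------------+-------------------------------+------------------------------
Hard margin with polynomial kernel | p=2    p=3    p=4    p=5      | p=2    p=3    p=4    p=5
                                   | ?      ?      ?      ?        | ?      ?      ?      ?
-----------------------------------+-------------------------------+------------------------------
Soft margin with polynomial kernel | C=0.1  C=0.6  C=1.1  C=2.1    | C=0.1  C=0.6  C=1.1  C=2.1
    p=1                            | ?      ?      ?      ?        | ?      ?      ?      ?
    p=2                            | ?      ?      ?      ?        | ?      ?      ?      ?
    p=3                            | ?      ?      ?      ?        | ?      ?      ?      ?
    p=4                            | ?      ?      ?      ?        | ?      ?      ?      ?
    p=5                            | ?      ?      ?      ?        | ?      ?      ?      ?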
TABLE I: Results of SVM classification.
Task 3: Design an SVM of your own. This SVM can be one of the three types specified in Task 1 above (i.e., hard-margin with linear kernel, hard-margin with polynomial kernel, and soft-margin with polynomial kernel), or one with your own choice of kernel. Using the given training set, compute the discriminant function g(·) of the SVM. Implement the resulting SVM in a MATLAB M-file program. This program will be used to classify the evaluation set as part of the assessment discussed in Section V.
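As one possible "own choice of kernel", a Gaussian (RBF) kernel could be tried. The handout does not prescribe it, so the sketch below is only an illustration: the names sqDist and rbfKernel and the width parameter sigma are hypothetical, and examples are again assumed to be stored as columns.

```matlab
% Squared Euclidean distances between columns of A and columns of B.
sqDist = @(A, B) bsxfun(@plus, sum(A.^2, 1)', sum(B.^2, 1)) - 2 * (A' * B);

% Gaussian kernel; max(.,0) guards against tiny negative round-off values.
rbfKernel = @(A, B, sigma) exp(-max(sqDist(A, B), 0) / (2 * sigma^2));

% Toy usage: Gram matrix of three 2-D examples with sigma = 1.
Xs = [0 1 2;
      1 0 2];
Kr = rbfKernel(Xs, Xs, 1);
```

The resulting Gram matrix can be used in the same dual problem as the polynomial kernel, with sigma and C chosen, for example, by comparing accuracy on the test set.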


 
[2] A hard-margin SVM can be approximated by a soft-margin SVM with a very large C value.
