COMP5318 - Machine Learning and Data Mining - Assignment 1

1          Summary
The goal of this assignment is to build a classifier that assigns grayscale images of size 28x28 to a set of categories. The dimensionality of the original data is large, so you need to be smart about which method you use, and you may want to perform a pre-processing step to reduce the amount of computation. Part of your mark will depend on the performance of your classifier on the test set.

2          Dataset 
The dataset can be downloaded from Canvas. It consists of a training set of 30,000 examples and a test set of 5,000 examples, belonging to 10 different categories. A validation set is not provided, but you can randomly pick a subset of the training set for validation. The labels of the first 2,000 test examples are given; you will analyse the performance of your proposed method using these 2,000 examples. It is NOT allowed to use any examples from the test set for training; doing so will be considered cheating. The remaining 3,000 test labels are reserved for marking purposes. Here are examples illustrating samples of the dataset (each class occupies one row):
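Since no validation set is supplied, one option is to hold out part of the training set. A minimal sketch using scikit-learn, with random placeholder arrays standing in for the real 30,000 examples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the training set; substitute the
# features and labels loaded from train.csv.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(300, 784))  # pixel intensities 0-255
y = rng.integers(0, 10, size=300)          # 10 class labels

# Hold out 20% of the training data as a validation set, stratified so
# that each of the 10 classes keeps roughly the same proportion.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_val.shape)  # (240, 784) (60, 784)
```

The split fraction is a common default, not a requirement of the assignment.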

 

There are 10 classes in total:

•     0: T-shirt/Top

•     1: Trouser

•     2: Pullover

•     3: Dress

•     4: Coat

•     5: Sandal

•     6: Shirt

•     7: Sneaker

•     8: Bag

•     9: Ankle boot

3          How to load the data and make output predictions
There is an Input folder (which can be downloaded from Canvas) containing 2 files:

1.      train.csv (30000 image samples for training including features and label)

2.      test_input.csv (5000 images for prediction)

3.1        How to load the data
Read the csv files and load the data into a dataframe using pandas.

The training data file is in ./Input/train and the testing data file is in ./Input/test. Use the following code:

[1]: import pandas as pd

[2]: # train.csv contains the features and the label used for training the model.
data_train_df = pd.read_csv('./Input/train/train.csv')

[3]: data_train_df.head()

[3]:    id  v1  v2  v3  v4  ...  v781  v782  v783  v784  label
     0   0   0   0   0   0  ...     0     0     0     0      2
     1   1   0   0   0   0  ...     0     0     0     0      1
     2   2   0   0   0   0  ...     0     0     0     0      1
     3   3   0   0   0   1  ...     0     0     0     0      4
     4   4   0   0   0   0  ...     0     0     0     0      8

     [5 rows x 786 columns]

The resulting dataframe contains 30,000 samples, each with 784 features (v1 to v784) and its label.

 

Showing a sample image. The first example belongs to class 2: Pullover.
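The 784 features of a row are the pixels of a 28x28 grayscale image in row-major order, so a sample can be reshaped for display. A small sketch with a synthetic row; in practice take a row of feature columns from data_train_df:

```python
import numpy as np

# One flattened sample (v1..v784), synthetic here; in practice use
# data_train_df.iloc[0, 1:785].to_numpy() for the first training image.
row = np.arange(784)

# Reshape the 784 pixel features back into a 28x28 image.
img = row.reshape(28, 28)
print(img.shape)  # (28, 28)

# To display it:
#   import matplotlib.pyplot as plt
#   plt.imshow(img, cmap='gray')
```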

 

3.2        How to load the test data and output the prediction
[6]: # test_input.csv includes 5000 samples used for label prediction.
# Test samples do not have labels.
data_test_df = pd.read_csv('./Input/test/test_input.csv')

[7]: data_test_df.head()

[7]:    id  v1  v2  v3  v4  ...  v780  v781  v782  v783  v784
     0   0   0   0   0   0  ...     0     0     0     0     0
     1   1   0   0   0   0  ...     0     0     0     0     0
     2   2   0   0   0   0  ...     0     0     0     0     0
     3   3   0   0   0   0  ...     0     0     0     0     0
     4   4   0   0   0   0  ...     0     0     0     0     0

     [5 rows x 785 columns]

After making a prediction on the test data, all predicted labels should be saved in "test_output.csv". You may use the following code to generate an output file that meets the requirement.
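A minimal sketch of writing such an output file with pandas. The 'id' and 'label' column names mirror the input format but are assumptions, and the zero predictions are placeholders for your classifier's output:

```python
import numpy as np
import pandas as pd

# Placeholder predictions for the 5,000 test images; replace with the
# output of your trained classifier, e.g. model.predict(X_test).
predictions = np.zeros(5000, dtype=int)

# Mirror the input format: an id column plus the predicted label.
output_df = pd.DataFrame({'id': np.arange(len(predictions)),
                          'label': predictions})
output_df.to_csv('test_output.csv', index=False)

# Sanity check: reload with the same pandas call used for the inputs.
reloaded = pd.read_csv('test_output.csv')
print(reloaded.shape)  # (5000, 2)
```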

 

We will load the output file using the code for loading data above. It is your responsibility to make sure the output file can be correctly loaded using this code. The performance of your classifier will be evaluated in terms of the top-1 accuracy metric, i.e.

Accuracy = (Number of correct classifications / Total number of test examples used) * 100%
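As a quick illustration of the metric with made-up labels (the arrays are arbitrary examples, not assignment data):

```python
import numpy as np

# True labels and predicted labels; synthetic values for illustration.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 2, 1])

# Top-1 accuracy: fraction of exact label matches, as a percentage.
accuracy = (y_true == y_pred).mean() * 100
print(accuracy)  # 80.0
```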

4          Task 
Your task is to build a classifier for the given dataset to classify images into categories, and to write a report. The score allocation is as follows:

1.    Code: max 65 points

2.    Report: max 35 points


4.1        Code
        4.1.1       The code must clearly show

1.    Pre-process data

2.    Details of your implementation for each algorithm

3.    Hyper-parameter fine-tuning for each algorithm, and its running time

4.    A comparison of 4 different algorithms: 3 single methods and one ensemble method

5.    Hardware and software specifications of the computer that you used for performance evaluation

        4.1.2      Data pre-processing

You will need to apply at least one pre-processing technique before you can apply the classification algorithms. Pre-processing techniques include normalisation, PCA, etc.

        4.1.3       Classification algorithms

You will now apply multiple classifiers to the pre-processed dataset. You have to implement at least 3 of the following classifiers:

1.    Nearest Neighbor

2.    Logistic Regression

3.    Naïve Bayes

4.    Decision Tree

5.    SVM

and one ensemble method:

1.    Bagging

2.    Boosting

3.    Random forest

Binary classifiers can be applied to data with more than 2 labels using the one-vs-rest method. The implementations can use sklearn, or can be written from scratch.
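A sketch of the one-vs-rest approach using sklearn's OneVsRestClassifier around logistic regression, on synthetic data (any of the binary classifiers above could take its place):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the pre-processed training data.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 10, size=200)

# One-vs-rest wraps a binary classifier: one model per class, each
# trained to separate that class from all the others.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(len(clf.estimators_))  # 10 binary classifiers, one per class
```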

        4.1.4      Parameter Tuning

For each classifier, find the best parameters using grid search with k-fold (k >= 5) cross-validation.
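A minimal grid-search sketch on synthetic data; nearest neighbour is shown and the parameter grid is only an example, so substitute each of your classifiers and its relevant hyper-parameters:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pre-processed training data.
rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = rng.integers(0, 10, size=500)

# Grid search over the number of neighbours, with 5-fold cross
# validation (k >= 5 as the assignment requires).
param_grid = {'n_neighbors': [1, 3, 5]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```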

        4.1.5       Classifier comparisons

After finding the best parameters for each algorithm, make comparisons between all classifiers, each using its own best hyper-parameters.
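One way to organise the comparison, sketched on synthetic data with untuned models: fit each classifier on the same split and record accuracy and training time. In your submission, substitute the best hyper-parameters found by grid search.

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pre-processed data.
rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = rng.integers(0, 10, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each entry would be a classifier configured with its best
# hyper-parameters from the grid search.
models = {'logreg': LogisticRegression(max_iter=1000),
          'nb': GaussianNB(),
          'tree': DecisionTreeClassifier(random_state=0)}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    # Record held-out accuracy and training time for the comparison table.
    results[name] = (model.score(X_te, y_te), elapsed)

print(sorted(results))  # ['logreg', 'nb', 'tree']
```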
