This assignment gives you hands-on experience with dimension reduction and with comparing different classification models. It consists of a programming assignment (with optional extensions for bonus points) and a report. This project is individual work; please do not share code, but you may post bug questions to Piazza for help.
Topic
Compare, analyze, and select a classification model for identifying letters in various fonts.
Programming work
A) Data preprocessing
This dataset contains 26 classes to separate, but for this assignment, we’ll simplify to three binary classification problems.
Pair 1: H and K
Pair 2: M and Y
Pair 3: Your choice
For each pair, set aside 10% of the relevant samples to use as a final validation set.
B) Model fitting
For this project, you must consider the following classification models:
k-nearest neighbors
SVM
Decision tree
Artificial Neural Network
Random Forest
For each model, choose a hyperparameter to tune using 5-fold cross-validation. You must test at least 3 values for a categorical hyperparameter, and at least 5 values for a numerical one. Hyperparameter tuning should be done separately for each classification problem; you might end up with different values for classifying H from K than for classifying M from Y.
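As one way to set this up (a sketch using scikit-learn, which the project allows; the data here is synthetic stand-in data, not the letter dataset), 5-fold cross-validation over at least 5 values of a numerical hyperparameter such as k for kNN might look like:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))      # synthetic 16-feature data
y = rng.integers(0, 2, size=200)    # binary labels standing in for H vs. K

# 5-fold CV over 5 candidate k values, as the assignment requires
# for a numerical hyperparameter.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)                     # tuned k for this problem
print(grid.cv_results_["mean_test_score"])   # per-k CV accuracy, one per value
```

You would repeat this tuning separately for each binary pair, since the best k may differ between, say, H vs. K and M vs. Y.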
Optional extension 1 – Tune more hyperparameters
For bonus points, tune more than just one hyperparameter per model.
2 bonus points for each additional hyperparameter, up to 10 bonus points total.
Optional extension 2 – Consider more classification models
For bonus points, suggest additional classification models to me.
If I give the go-ahead, include one for 5 bonus points or two for 10 bonus points.
C) Dimension reduction
For each of the models, implement a method of dimension reduction from the following:
Simple Quality Filtering
Filter Methods
Wrapper Feature Selection
Embedded Methods
Feature Extraction
Please refer to the lecture slides for more details on the methods. Implement a total of at least 3 different methods to reduce the number of features from 16 to 4. Retrain your models using reduced datasets, including hyperparameter tuning.
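For instance, a filter method could rank features by a univariate statistic and keep the top 4. A minimal sketch with scikit-learn's SelectKBest (synthetic data standing in for the letter features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))      # synthetic stand-in for the 16 features
y = rng.integers(0, 2, size=200)    # binary labels for one letter pair

# Filter method: score each feature with the ANOVA F-statistic,
# keep the 4 highest-scoring features (16 -> 4).
selector = SelectKBest(score_func=f_classif, k=4)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                      # (200, 4)
print(selector.get_support(indices=True))   # indices of the 4 kept features
```

After reducing, you would rerun the full hyperparameter tuning on the 4-feature dataset; note that the selector must be fit on training data only, never on the held-out validation samples.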
Optional extension 3 – Implement more dimension reduction methods
For bonus points, implement additional unique methods of dimension reduction.
1 bonus point for each additional method, up to 5 bonus points total.
IMPORTANT: You may use any packages/libraries/code-bases you like for the project; however, you will need control over certain aspects of the model that may be black-boxed by default. For example, a package that trains a kNN classifier and internally optimizes the k value is not suitable if you need the cross-validation results from testing different k values.
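One way to keep that control is to loop over the candidate values yourself, so every per-value cross-validation score stays visible instead of being hidden inside an internal optimizer. A sketch (scikit-learn, synthetic data):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 16))      # synthetic 16-feature data
y = rng.integers(0, 2, size=150)

# Looping over k explicitly exposes the CV result for every value tested,
# which the report will need; a black-boxed optimizer would hide these.
scores = {}
for k in [1, 3, 5, 7, 9]:
    cv_acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    scores[k] = cv_acc.mean()
print(scores)   # one mean 5-fold accuracy per candidate k
```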
Data to be used
We will use the Letter Recognition dataset in the UCI repository at
UCI Machine Learning Repository: Letter Recognition Data Set
(https://archive.ics.uci.edu/ml/datasets/letter+recognition)
Note that the first column of the dataset is the response variable (i.e., y).
There are 20,000 instances in this dataset.
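Loading the file might look like the sketch below (pandas is one option; the inline rows are synthetic stand-ins with the dataset's 17-column layout, and in practice you would point read_csv at the downloaded letter-recognition.data file):

```python
from io import StringIO
import pandas as pd

# Synthetic stand-in rows; the real file has 20,000 rows in this layout,
# with the letter label in the first column followed by 16 integer features.
sample = StringIO(
    "H,4,7,5,5,4,7,7,6,5,7,6,8,3,8,3,8\n"
    "K,5,9,5,7,6,8,7,6,2,8,6,7,5,9,2,7\n"
    "M,6,9,8,6,9,7,8,6,5,7,8,8,9,6,2,7\n"
)
cols = ["letter"] + [f"f{i}" for i in range(1, 17)]  # first column is y
df = pd.read_csv(sample, header=None, names=cols)

X = df.drop(columns="letter")   # 16 feature columns
y = df["letter"]                # response variable
print(X.shape, y.tolist())
```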
For each binary classification problem, first find all the relevant samples (e.g., all the H and K samples for the first problem). Then set aside 10% of those samples for final validation of the models. This means you cannot use these samples to train your model parameters, to tune your hyperparameters, or to fit your feature selection methods.
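The filter-then-split step could be sketched like this (scikit-learn's train_test_split is one option; the DataFrame here is synthetic stand-in data with the same layout):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 "H" and 100 "K" rows with 16 integer features.
df = pd.DataFrame(rng.integers(0, 16, size=(200, 16)))
df["letter"] = ["H"] * 100 + ["K"] * 100

pair = df[df["letter"].isin(["H", "K"])]            # keep only the relevant pair
X, y = pair.drop(columns="letter"), pair["letter"]

# Hold out 10% for final validation; stratifying keeps the H/K ratio
# the same in both partitions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)
print(len(X_train), len(X_val))
```

Everything downstream (hyperparameter tuning, feature selection, model fitting) should touch only X_train/y_train; X_val/y_val are used once, at the end.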