COMP551 MiniProject 1


Background
In this miniproject you will implement two linear classification techniques—logistic regression and linear discriminant analysis (LDA)—and run these two algorithms on two distinct datasets. These two algorithms are discussed in Lectures 4 and 5, respectively. The goal is to gain experience implementing these algorithms from scratch and to get hands-on experience comparing their performance.

Task 1: Acquire, preprocess, and analyze the data
Your first task is to acquire the data, analyze it, and clean it (if necessary). We will use two datasets in this project, outlined below.

•    Dataset 1 (Wine Quality):

–   Link: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

–   This is a dataset where the goal is to predict the quality of wine based on its chemical properties.

–   Note: We will only be using the red wine subset of the data.

–   Note: This data contains quality ratings from 0-10. We will convert this to a binary classification task by defining the ratings {6,7,8,9,10} as positive (i.e., 1) and all other ratings as negative (i.e., 0).

•    Dataset 2 (Breast Cancer Diagnosis):

–   Link: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

–   This is a dataset where the goal is to predict whether a tumour is malignant or benign based on various properties.

–   Note: We will use the data file titled breast-cancer-wisconsin.data for this project.

The essential subtasks for this part of the project are:

1.   Download the datasets (noting the correct subsets to use, as discussed above).

2.   Load the datasets into numpy objects (i.e., arrays or matrices) in Python. Remember to convert the wine dataset to a binary task, as discussed above (a loading sketch is given after this list).

3.   Clean the data. Are there any missing or malformed features? Are there any other data oddities that need to be dealt with? You should remove any examples with missing or malformed features and note this in your report.

4.   Compute some statistics on the data. E.g., what are the distributions of the positive vs. negative classes, and what are the distributions of some of the numerical features?
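
A minimal loading sketch covering subtasks 2-4 is given below. The file names, delimiters, and column layout are assumptions based on the UCI data descriptions (winequality-red.csv is semicolon-separated with a header row; breast-cancer-wisconsin.data is comma-separated with an ID column first, the class label last, and '?' marking missing values), so verify them against the files you actually download.

```python
import numpy as np

# Wine quality (red wine subset): semicolon-separated, header row, quality in the last column.
wine = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
X_wine = wine[:, :-1]
y_wine = (wine[:, -1] >= 6).astype(int)   # ratings {6,...,10} -> 1, all others -> 0

# Breast cancer diagnosis: comma-separated, no header.
# Column 0 is an ID, columns 1-9 are features, column 10 is the class (2 = benign, 4 = malignant).
# '?' marks missing values; genfromtxt turns entries it cannot parse into NaN.
cancer = np.genfromtxt("breast-cancer-wisconsin.data", delimiter=",", missing_values="?")
cancer = cancer[~np.isnan(cancer).any(axis=1)]   # drop examples with missing features
X_cancer = cancer[:, 1:-1]
y_cancer = (cancer[:, -1] == 4).astype(int)      # malignant -> 1, benign -> 0

# Basic statistics: class balance and per-feature summaries.
for name, X, y in [("wine", X_wine, y_wine), ("cancer", X_cancer, y_cancer)]:
    print(name, "fraction positive:", y.mean())
    print(name, "feature means:", X.mean(axis=0))
```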

Task 2: Implement the models
Now you will implement the logistic regression and LDA models. You are free to implement these models as you see fit, but you should follow the equations that are presented in the lecture slides, and you must implement the models from scratch (i.e., you cannot use SciKit Learn or any other pre-existing implementations of these methods).

In particular, your two main tasks in this part are to:

1.   Implement logistic regression using gradient descent, as discussed in Lecture 4.

2.   Implement linear discriminant analysis using the equations from Lecture 5.
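
As a rough starting point (not the required implementation, which must follow the equations in the lecture slides), the core computations for the two models might look like the sketch below. The learning rate, the full-batch gradient, and the shared-covariance LDA formulation are assumptions based on the standard presentations of these methods.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression: one full-batch gradient descent step on the weights w.
# The gradient of the cross-entropy loss is X^T (sigmoid(Xw) - y).
def gradient_step(w, X, y, learning_rate=0.01):
    grad = X.T @ (sigmoid(X @ w) - y)
    return w - learning_rate * grad

# LDA: closed-form estimates assuming a shared covariance matrix for both classes.
def lda_parameters(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled covariance estimate over both classes.
    cov = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(y) - 2)
    cov_inv = np.linalg.inv(cov)
    w = cov_inv @ (mu1 - mu0)
    b = (np.log(len(X1) / len(X0))
         - 0.5 * mu1 @ cov_inv @ mu1
         + 0.5 * mu0 @ cov_inv @ mu0)
    return w, b   # predict class 1 when x @ w + b > 0
```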

You are free to implement these models in any way you want, but you must use Python and you must implement the models from scratch (i.e., you cannot use SciKit Learn or similar libraries). Using the numpy package, however, is allowed and encouraged. Regarding the implementation, we recommend the following approach (but again, you are free to do what you want):

•    Implement both the logistic regression and LDA models as Python classes. You should use the constructor for the class to initialize the model parameters as attributes, as well as to define other important properties of the model.

•    Each of your model classes should have (at least) two functions:

–   Define a fit function, which takes the training data (i.e., X and y)—as well as other hyperparameters (e.g., the learning rate and/or number of gradient descent iterations)—as input. This function should train your model by modifying the model parameters.

–   Define a predict function, which takes a set of input points (i.e., X) as input and outputs predictions (i.e., ŷ) for these points. Note that you need to convert probabilities to binary 0-1 predictions by thresholding the output at 0.5!

•    In addition to the model classes, you should also define a function evaluate_acc to evaluate the model accuracy. This function should take the true labels (i.e., y) and the predicted labels (i.e., ŷ) as input, and it should output the accuracy score.

•    Lastly, you should implement a script to run k-fold cross validation.
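
Below is a minimal sketch of the accuracy function and the cross-validation loop, assuming your model classes expose fit(X, y) and predict(X) as described above (and that fit re-initializes the parameters each time it is called); the shuffling and the exact interface are assumptions.

```python
import numpy as np

def evaluate_acc(y_true, y_pred):
    # Accuracy: proportion of predictions that match the true labels.
    return np.mean(y_true == y_pred)

def kfold_cross_validation(model, X, y, k=5, seed=0):
    # Shuffle the indices once, split them into k roughly equal folds,
    # and use each fold in turn as the validation set.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        scores.append(evaluate_acc(y[val_idx], model.predict(X[val_idx])))
    return np.mean(scores)
```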

Task 3: Run experiments
The goal of this project is to have you explore linear classification and compare different features and models. You will use 5-fold cross validation to estimate performance in all of the experiments, and you should evaluate performance using accuracy (i.e., proportion of predictions that are correct). You are welcome to perform any experiments and analyses you see fit (e.g., to compare different features), but at a minimum you must complete the following experiments in the order stated below:

1.   Test different learning rates for logistic regression.

2.   Compare the runtime and accuracy of LDA and logistic regression on both datasets.

3.   For the wine dataset, find a new subset of features and/or additional features that improve the accuracy. We recommend that you use logistic regression for this exploration. Hint: Try exploring different interaction terms (a small sketch follows this list).
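
As a small illustration of an interaction term, the snippet below appends the product of two existing feature columns to the design matrix; which columns to combine (and whether to standardize them first) is up to you, and the column indices here are placeholders.

```python
import numpy as np

def add_interaction(X, i, j):
    # Append the elementwise product of columns i and j as a new feature.
    return np.column_stack([X, X[:, i] * X[:, j]])

# Example: add an interaction between the first two wine features.
# X_wine_plus = add_interaction(X_wine, 0, 1)
```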
