COMP551 MiniProject 1

Background
In this miniproject you will implement two classification techniques, K-Nearest Neighbour and Decision Trees, and compare these two algorithms on two distinct health datasets. The goal is to get started with programming for machine learning: how to properly load and store data, run experiments, and compare different methods. You will also gain experience implementing these algorithms from scratch and hands-on experience comparing the performance of different models.

Task 1: Acquire, preprocess, and analyze the data
Your first task is to acquire the data, analyze it, and clean it (if necessary). We will use two fixed datasets in this project, outlined below.

•  Dataset 1: breast cancer wisconsin.csv (Breast Cancer dataset): https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

•  Dataset 2: hepatitis.csv (Hepatitis dataset): http://archive.ics.uci.edu/ml/datasets/Hepatitis

The essential subtasks for this part of the project are:

1.   Load the datasets into NumPy or Pandas objects in Python.

2.   Clean the data. Are there any missing or malformed features? Are there other data oddities that need to be dealt with? You should remove any examples with missing or malformed features and note this in your report.

If you choose to work with Pandas dataframes, a handy line of code is df[~df.eq('?').any(axis=1)], where df is the dataframe and '?' represents a missing value in the datasets. This is a straightforward way to handle the issue by simply eliminating rows with missing values. You are welcome to explore other approaches; a fuller sketch of loading and cleaning the data appears after this list.

3.   Compute basic statistics on the data to understand it better. E.g., what are the distributions of the positive vs. negative classes, and what are the distributions of some of the numerical features?
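As a rough illustration of these three subtasks, the sketch below loads one dataset with Pandas, drops rows containing '?', and prints a few summary statistics. It assumes the file uses '?' for missing values and that the class label is the last column; check the UCI documentation for the actual column layout before relying on this.

```python
import pandas as pd

# Load one of the provided CSV files (column layout is an assumption;
# verify attribute names/order against the UCI description).
df = pd.read_csv("hepatitis.csv", header=None)

# Treat '?' as missing and drop any row that contains it.
df = df.replace("?", pd.NA).dropna()

# Convert everything back to numeric after removing the '?' strings.
df = df.apply(pd.to_numeric)

# Basic statistics: class balance and per-feature distributions.
label_col = df.columns[-1]           # assumed: label is the last column
print(df[label_col].value_counts())  # positive vs. negative class counts
print(df.describe())                 # mean/std/quartiles of numeric features
```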

Task 2: Implement the models
You are free to implement these models as you see fit, but you should follow the equations presented in the lecture slides, and you must implement the models from scratch (i.e., you CANNOT use SciKit Learn or any other pre-existing implementations of these methods). However, you are free to use relevant code provided on the course website.

In particular, your two main tasks in this part are to:

1.   Implement K-Nearest Neighbour.

2.   Implement Decision Trees with an appropriate cost function (two common choices are sketched after this list).
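For the Decision Tree cost function, two common choices are the entropy and the Gini index of the labels at a node. The NumPy sketch below shows both as generic impurity measures; it is illustrative, not a prescribed implementation.

```python
import numpy as np

def entropy(y):
    """Entropy of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    """Gini index of a 1-D array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```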

You are free to implement these models in any way you want, but you must use Python and you must implement the models from scratch (i.e., you cannot use SciKit Learn or similar libraries). Using the NumPy or Pandas package, however, is allowed and encouraged. Regarding the implementation, we recommend the following approach (but again, you are free to do what you want):

•  Implement both models as Python classes. You should use the constructor for the class to initialize the model parameters as attributes, as well as to define other important properties of the model.

•  Each of your model classes should have (at least) two functions:

–   Define a fit function, which takes the training data (i.e., X and Y)—as well as other hyperparameters (e.g., K value in KNN and maximum tree depth in Decision Tree)—as input. This function should train your model by modifying the model parameters.

–   Define a predict function, which takes a set of input points (i.e., X) as input and outputs predictions (i.e., ŷ) for these points.

•  In addition to the model classes, you should also define a function evaluate_acc to evaluate the model accuracy. This function should take the true labels (i.e., y) and the predicted labels (i.e., ŷ) as input, and it should output the accuracy score. A minimal skeleton of this structure is sketched below.
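One way the recommended structure might look is sketched below. The KNN class and evaluate_acc are illustrative only; the distance metric, tie-breaking rule, and the Decision Tree class are left to you.

```python
import numpy as np

class KNN:
    def __init__(self, k=5):
        self.k = k  # hyperparameter: number of neighbours

    def fit(self, X, y):
        # KNN is a lazy learner: "training" just stores the data.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        y_hat = []
        for x in X:
            # Euclidean distance from x to every training point.
            dists = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
            nearest = self.y_train[np.argsort(dists)[: self.k]]
            # Majority vote among the k nearest labels.
            labels, counts = np.unique(nearest, return_counts=True)
            y_hat.append(labels[np.argmax(counts)])
        return np.array(y_hat)

def evaluate_acc(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```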

Task 3: Run experiments
The goal of this project is to have you compare different features and models.

Split each dataset into training and test sets. Use the test set to estimate performance in all of the experiments, after training the model on the training set. Evaluate performance using accuracy. You are welcome to perform any additional experiments and analyses you see fit (e.g., to compare different features), but at a minimum you must complete the following experiments in the order stated below (a minimal sketch of this experimental setup is given after the list):

1.   Compare the accuracy of the KNN and Decision Tree algorithms on the two datasets.

2.   Test different K values and see how they affect the training accuracy and the test accuracy.

3.   Similarly, check how the maximum tree depth affects the performance of the Decision Tree on the provided datasets. Describe your findings.

4.   Try out different distance/cost functions for both models. Describe your findings.

5.   Present a plot of the decision boundary for each model. Briefly describe its key features.
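As one possible way to organise the first two experiments, the sketch below splits a dataset into training and test sets and sweeps over K. It assumes X and y are NumPy arrays produced in Task 1 and reuses the illustrative KNN class and evaluate_acc function from the Task 2 sketch; the split ratio and the list of K values are arbitrary choices.

```python
import numpy as np

def train_test_split(X, y, test_frac=0.3, seed=0):
    """Random split into training and test sets (ratio is a design choice)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]

X_tr, y_tr, X_te, y_te = train_test_split(X, y)

# Experiment 2: how does K affect training vs. test accuracy?
for k in [1, 3, 5, 11, 21]:
    model = KNN(k=k).fit(X_tr, y_tr)
    train_acc = evaluate_acc(y_tr, model.predict(X_tr))
    test_acc = evaluate_acc(y_te, model.predict(X_te))
    print(f"K={k}: train acc={train_acc:.3f}, test acc={test_acc:.3f}")
```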

Note: The above experiments are the minimum requirements that you must complete; however, this project is open-ended. For example, you might investigate different stopping criteria for Decision Tree or different features that you select for the training process. We would also love to see possible ways to improve model performance. You do not need to do all of these things, but you should demonstrate creativity, rigour, and an understanding of the course material in how you run your chosen experiments and how you report on them in your write-up.
