COMP551 - MiniProject 1 - Machine Learning

In this miniproject, you will explore two datasets. The goal is to gain experience in applying basic supervised machine learning techniques to a real-world data science problem. In particular, the project encourages you to explore data preprocessing, the effect of hyperparameters and dataset size, and model selection. You are encouraged to use techniques you have learned in class to visualize the data and then form hypotheses about possible patterns in it.

Preprocessing
Your first task is to acquire the data, analyze it, and clean it (if necessary). You will use two datasets in this project, outlined below.

•   Dataset 1 (Adult dataset): This dataset presents several attributes of different individuals, and the prediction task is to determine whether someone earns more than $50K a year. Download the dataset and read its documentation from the UCI Machine Learning Repository.

•   Dataset 2 (Your choice!): Select any dataset from UCI or related to your own research. We suggest selecting a dataset of appropriate size (not too small or too large) such that the experiments can be conducted effectively and efficiently.

The essential subtasks for this part of the project include:

1.    Download the datasets. Hint: in the Adult dataset, adult.data contains the training/validation data and adult.test contains the test data.

2.    Load the datasets into Pandas dataframes or NumPy objects (i.e., arrays or matrices) in Python.

3.    Clean the data. You should remove instances that have too many missing or invalid data entries.

4.    Convert discrete variables into multiple variables using one-hot encoding. For an example of how to do this, see “Encoding categorical features” in the scikit-learn documentation.
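
As a rough illustration of subtasks 2 to 4, the following sketch loads a few rows laid out like the Adult file, drops incomplete rows, and one-hot encodes the categorical columns with pandas. The inline sample, the reduced column list, and the `max_missing` threshold are assumptions made for illustration only: the real adult.data file has more columns, and what counts as "too many" missing entries is your call. The sketch also assumes that missing entries are marked with `?`.

```python
import io
import pandas as pd

# Hypothetical five-row sample in the comma-plus-space layout of adult.data.
# Only a few columns are kept here so the sketch stays short.
SAMPLE = """39, State-gov, Bachelors, 40, <=50K
50, ?, Bachelors, 13, <=50K
38, Private, HS-grad, 40, >50K
53, Private, ?, 40, <=50K
28, Private, Bachelors, 40, >50K
"""
COLUMNS = ["age", "workclass", "education", "hours-per-week", "income"]

# skipinitialspace strips the blank after each comma, so "?" is matched
# by na_values and read as a missing value.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS,
                 na_values="?", skipinitialspace=True)

# Assumed cleaning rule: keep rows with at most `max_missing` missing entries.
max_missing = 0
clean = df.dropna(thresh=len(COLUMNS) - max_missing).reset_index(drop=True)

# Binary target: 1 if income is above 50K, 0 otherwise.
y = (clean["income"] == ">50K").astype(int)

# One-hot encode the discrete columns; numeric columns pass through unchanged.
X = pd.get_dummies(clean.drop(columns="income"),
                   columns=["workclass", "education"])

print(X.columns.tolist())
print(y.tolist())
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by the path to adult.data and `COLUMNS` by the full attribute list from the dataset documentation.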

Experiments
In this part, you will compare two supervised learning frameworks, namely K-nearest neighbours (KNN) and decision trees, to predict whether the income of an adult exceeds $50K/yr. A similar analysis should be performed for the second dataset. The specific subtasks for this part include:

1.    Implement and perform 5-fold cross-validation on the training/validation data (for the Adult dataset, this data is contained in the adult.data file) to optimize the hyperparameters of both models. Your cross-validation implementation should be written from scratch; do not use existing packages for cross-validation. Report the mean of the training and validation metrics for the given hyperparameters.

2.    Sample growing subsets of the training/validation data and repeat step 1. We want to understand how the size of a dataset impacts both the training and validation error.

3.    Take the best-performing model (the one with the best performance in 5-fold cross-validation) and apply it to the test set (in the Adult dataset, this is the adult.test file). This gives an unbiased estimate of how your model would perform on new/unseen data.

4.    [Optional] Go above and beyond! Examples: try different normalization techniques or other ways of handling missing data (search for “data imputation” techniques). Employ more sophisticated techniques for hyperparameter search. Engineer new features out of existing ones to get better performance. Investigate which features are the most useful (e.g., by correlating them with your predictions or removing them from your data).

5.    Analyze your findings: how did the choice of the various hyperparameters impact generalization? How about the size of the training data? If any of these findings do not agree with your expectations, form hypotheses and investigate them further.
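
Subtasks 1 and 2 can be sketched roughly as follows, using only NumPy. This is one possible layout under stated assumptions, not a reference solution: the function names (`k_fold_indices`, `knn_predict`, `cross_validate`), the synthetic two-blob data, the candidate values for the number of neighbours, and the subset sizes are all hypothetical. A small hand-written KNN stands in for the model so the example is self-contained; a decision tree, or a library KNN, could be passed in through the same `fit_predict` argument.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices once and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def knn_predict(X_train, y_train, X_query, n_neighbors):
    """Euclidean KNN with majority vote for 0/1 labels (ties predict 0).

    Builds the full distance matrix, so it is only meant for small data.
    """
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def cross_validate(X, y, fit_predict, k=5, seed=0):
    """Return (mean training accuracy, mean validation accuracy) over k folds.

    fit_predict(X_train, y_train, X_query) must return predicted labels.
    """
    folds = k_fold_indices(len(X), k, seed)
    train_acc, val_acc = [], []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        X_tr, y_tr = X[train_idx], y[train_idx]
        train_acc.append((fit_predict(X_tr, y_tr, X_tr) == y_tr).mean())
        val_acc.append((fit_predict(X_tr, y_tr, X[val_idx]) == y[val_idx]).mean())
    return float(np.mean(train_acc)), float(np.mean(val_acc))

# Synthetic stand-in data: two Gaussian blobs, 100 points per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Subtask 1: compare hyperparameter values by mean validation accuracy.
results = {}
for nn in (1, 5, 15):
    results[nn] = cross_validate(
        X, y, lambda a, b, q, nn=nn: knn_predict(a, b, q, nn))
    print(f"n_neighbors={nn}: train={results[nn][0]:.3f}, val={results[nn][1]:.3f}")
best_nn = max(results, key=lambda nn: results[nn][1])

# Subtask 2: repeat on growing random subsets of the data.
curve = {}
for n in (50, 100, 200):
    sub = rng.choice(len(X), size=n, replace=False)
    curve[n] = cross_validate(
        X[sub], y[sub], lambda a, b, q: knn_predict(a, b, q, best_nn))
    print(f"subset size={n}: train={curve[n][0]:.3f}, val={curve[n][1]:.3f}")
```

Passing the model as a function keeps the fold logic independent of the learner, so the same loop can be reused for both KNN and decision trees. The test set is deliberately absent here: it should only be used once, for the final model chosen in subtask 3.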

