In this assignment, you will examine a data file and carry out the steps of the data science process, including cleaning, exploring, and modelling. You will need to develop and implement appropriate steps, in IPython, to load a data file into memory and to clean, process, and analyse it. This assignment is intended to give you practical experience with the typical first steps of the data science process.
The “Practical Data Science” Canvas contains further announcements and a discussion board for this assignment. Please be sure to check these regularly; it is your responsibility to stay informed of any announcements or changes.
Where to Develop Your Code
You are encouraged to develop and test your code in Jupyter Notebook (or JupyterLab), either on the Lab PCs or on your own laptop.
Task 1.1: Data Preparation (4%)
First, register for a Kaggle account using your student email address. You can then use this account to participate in the course Kaggle competition here:
https://www.kaggle.com/t/cb6996c0cfb34595bae674f4e138e5e8
After accepting the invitation to participate in the competition, you can download the datasets to work offline. The first task in Assignment 2 is data cleaning, similar to the first task in Assignment 1.
Being a careful data scientist, you know that it is vital to check any available data thoroughly before starting to analyse it. Your task is to prepare the provided data for analysis. Start by loading the CSV data from the file (using appropriate pandas functions) and checking that the loaded data is equivalent to the data in the source CSV file. Then, clean the data using the techniques taught in the lectures. You need to deal with all potential issues/errors in the data appropriately, such as typos, extra whitespace, impossible values (sanity checks), and missing values.
Note: These steps must be performed consistently for train/val/test sets.
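The exact cleaning steps depend on the data you download, but a minimal sketch of the workflow in pandas might look like the following. The file name train.csv and the columns city and age are assumptions for illustration only; substitute the actual competition data.

```python
import numpy as np
import pandas as pd

# Load the CSV file ("train.csv" is a hypothetical file name).
train = pd.read_csv("train.csv")

# Verify the loaded data matches the source: row/column counts,
# dtypes, and a visual spot check of the first rows.
print(train.shape)
print(train.dtypes)
print(train.head())

# Remove extra whitespace and normalise case in a categorical column,
# then fix a known typo ("city" is a hypothetical column).
train["city"] = train["city"].str.strip().str.lower()
train["city"] = train["city"].replace({"sydny": "sydney"})

# Sanity check: flag impossible values as missing ("age" is hypothetical).
train.loc[~train["age"].between(0, 120), "age"] = np.nan

# Handle missing values, e.g. impute a numeric column with its median.
train["age"] = train["age"].fillna(train["age"].median())

# Repeat the same steps, with the same parameters, on the val and test files.
```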
Task 1.2: Data Exploration (9%)
Explore at least 3 columns or column pairs using appropriate descriptive statistics and graphs (where appropriate), e.g. the distribution of a numerical attribute, or the proportion of each value of a categorical attribute. For each explored column/pair, think carefully and report in your notebook:
1) the method you used to explore the column (e.g. the type of graph); 2) what you can observe from that exploration.
Please format each graph carefully, and use it in your final report. You need to include appropriate labels on the x-axis and y-axis, a title, and a legend. The fonts should be sized for good readability. Components of the graphs should be coloured appropriately, if applicable.
Note: These steps are for the training dataset only.
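As an illustration, a histogram of a numeric column and the value proportions of a categorical column could be produced as follows. This sketch assumes the cleaned train DataFrame from Task 1.1 and the same hypothetical age and city columns.

```python
import matplotlib.pyplot as plt

# Distribution of a numeric attribute ("age" is a hypothetical column).
fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(train["age"].dropna(), bins=30, color="steelblue",
        edgecolor="black", label="Training set")
ax.set_xlabel("Age (years)", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title("Distribution of Age in the Training Set", fontsize=14)
ax.legend(fontsize=11)
fig.tight_layout()
plt.show()

# Proportion of each value of a categorical attribute.
print(train["city"].value_counts(normalize=True))
```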
Task 2: Feature Engineering (10%)
Use suitable Python approaches to extract potential features for model input. Conduct appropriate analysis to evaluate feature importance (e.g. correlation analysis), then use suitable method(s) to select the final features for the model. Your feature choices must be justified by this analysis.
Note: These steps must be performed consistently for train/val/test sets.
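One possible approach is a correlation analysis of numeric candidates against the target, followed by encoding of categorical columns. The sketch below assumes the target column is named label, reuses the hypothetical city column, and assumes a cleaned val DataFrame exists from Task 1.1.

```python
import pandas as pd

# Correlation of each numeric candidate feature with the target.
# ("label" is a placeholder for the actual target column name.)
corr = train.select_dtypes(include="number").corr()["label"]
print(corr.abs().sort_values(ascending=False))

# Example engineered feature: one-hot encode a categorical column.
train_fe = pd.get_dummies(train, columns=["city"])

# Apply the identical encoding to the validation (and test) sets, aligning
# columns with the training set so all sets share the same features.
val_fe = pd.get_dummies(val, columns=["city"]).reindex(
    columns=train_fe.columns, fill_value=0)
```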
Task 3: Modelling (10%)
You must train 3 different models to predict the values of the label column on the validation set, and report 3 evaluation metrics (RMSE, MAE, and R2) for each model.
A result table should look like this:
Model     RMSE   MAE    R2
Model 1   0.43   0.54   0.87
Model 2   0.23   0.56   0.86
Model 3   0.45   0.53   0.89
You must briefly describe your model structure/configuration.
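The three regressors below are illustrative rather than prescribed; any models appropriate to the data are acceptable. The sketch assumes X_train/y_train and X_val/y_val have already been built from the cleaned, feature-engineered train and validation sets.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Three example models; replace with your own choices and configurations.
models = {
    "Model 1 (Linear Regression)": LinearRegression(),
    "Model 2 (Decision Tree)": DecisionTreeRegressor(random_state=42),
    "Model 3 (Random Forest)": RandomForestRegressor(random_state=42),
}

# Fit each model on the training set and evaluate on the validation set.
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    rmse = mean_squared_error(y_val, pred) ** 0.5
    mae = mean_absolute_error(y_val, pred)
    r2 = r2_score(y_val, pred)
    print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}, R2={r2:.2f}")
```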