Assignment-3 (View only Draft)
Introduction
In this assignment, you will use the provided loan dataset and the machine learning algorithms you have learned in this course to predict:
1. Whether a loan applicant will be able to repay the loan
- This will help the bank decide whether it is risky to approve the loan application
2. The client's income, based on the information provided in the application
- This can help the bank investigate further whether the payslip documents provided are suspicious
NOTE: This is a very challenging problem, and we do not expect very high accuracy in your predictions. However, you must apply all your analytic skills to build decent ML models.
Datasets
In this assignment, you will be given two datasets, training.csv and test.csv (https://webcms3.cse.unsw.edu.au/COMP9321/22T1/resources/74268). Here is the description of the columns in these datasets:
Column Description
SK_ID_CURR ID of loan in our sample
TARGET Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)
NAME_CONTRACT_TYPE Identification if loan is cash or revolving
CODE_GENDER Gender of the client
FLAG_OWN_CAR Flag if the client owns a car
FLAG_OWN_REALTY Flag if client owns a house or flat
CNT_CHILDREN Number of children the client has
AMT_INCOME_TOTAL Income of the client
AMT_CREDIT Credit amount of the loan
AMT_ANNUITY Loan annuity
AMT_GOODS_PRICE For consumer loans it is the price of the goods for which the loan is given
NAME_TYPE_SUITE Who was accompanying client when he was applying for the loan
NAME_INCOME_TYPE Clients income type (businessman, working, maternity leave,…)
NAME_EDUCATION_TYPE Level of highest education the client achieved
NAME_FAMILY_STATUS Family status of the client
NAME_HOUSING_TYPE What is the housing situation of the client (renting, living with parents, ...)
REGION_POPULATION_RELATIVE Normalized population of region where client lives (higher number means the client lives in more populated region)
DAYS_BIRTH Client's age in days at the time of application
DAYS_EMPLOYED How many days before the application the person started current employment
DAYS_REGISTRATION How many days before the application did client change his registration
DAYS_ID_PUBLISH How many days before the application did client change the identity document with which he applied for the loan
OWN_CAR_AGE Age of client's car
FLAG_MOBIL Did client provide mobile phone (1=YES, 0=NO)
FLAG_EMP_PHONE Did client provide work phone (1=YES, 0=NO)
FLAG_WORK_PHONE Did client provide home phone (1=YES, 0=NO)
FLAG_CONT_MOBILE Was mobile phone reachable (1=YES, 0=NO)
FLAG_PHONE Did client provide home phone (1=YES, 0=NO)
FLAG_EMAIL Did client provide email (1=YES, 0=NO)
OCCUPATION_TYPE What kind of occupation does the client have
CNT_FAM_MEMBERS How many family members does client have
REGION_RATING_CLIENT Our rating of the region where client lives (1,2,3)
REGION_RATING_CLIENT_W_CITY Our rating of the region where client lives with taking city into account (1,2,3)
WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply for the loan
HOUR_APPR_PROCESS_START Approximately at what hour did the client apply for the loan
REG_REGION_NOT_LIVE_REGION Flag if client's permanent address does not match contact address (1=different, 0=same, at region level)
REG_REGION_NOT_WORK_REGION Flag if client's permanent address does not match work address (1=different, 0=same, at region level)
LIVE_REGION_NOT_WORK_REGION Flag if client's contact address does not match work address (1=different, 0=same, at region level)
REG_CITY_NOT_LIVE_CITY Flag if client's permanent address does not match contact address (1=different, 0=same, at city level)
REG_CITY_NOT_WORK_CITY Flag if client's permanent address does not match work address (1=different, 0=same, at city level)
LIVE_CITY_NOT_WORK_CITY Flag if client's contact address does not match work address (1=different, 0=same, at city level)
ORGANIZATION_TYPE Type of organization where client works
EXT_SOURCE_1 Normalized score from external data source
EXT_SOURCE_2 Normalized score from external data source
EXT_SOURCE_3 Normalized score from external data source
The building columns listed below all share one description: Normalized information about the building where the client lives: the average (_AVG suffix), mode (_MODE suffix), or median (_MEDI suffix) of apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floors.
_AVG columns: APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG, YEARS_BUILD_AVG, COMMONAREA_AVG, ELEVATORS_AVG, ENTRANCES_AVG, FLOORSMAX_AVG, FLOORSMIN_AVG, LANDAREA_AVG, LIVINGAPARTMENTS_AVG, LIVINGAREA_AVG, NONLIVINGAPARTMENTS_AVG, NONLIVINGAREA_AVG
_MODE columns: APARTMENTS_MODE, BASEMENTAREA_MODE, YEARS_BEGINEXPLUATATION_MODE, YEARS_BUILD_MODE, COMMONAREA_MODE, ELEVATORS_MODE, ENTRANCES_MODE, FLOORSMAX_MODE, FLOORSMIN_MODE, LANDAREA_MODE, LIVINGAPARTMENTS_MODE, LIVINGAREA_MODE, NONLIVINGAPARTMENTS_MODE, NONLIVINGAREA_MODE
_MEDI columns: APARTMENTS_MEDI, BASEMENTAREA_MEDI, YEARS_BEGINEXPLUATATION_MEDI, YEARS_BUILD_MEDI, COMMONAREA_MEDI, ELEVATORS_MEDI, ENTRANCES_MEDI, FLOORSMAX_MEDI, FLOORSMIN_MEDI, LANDAREA_MEDI, LIVINGAPARTMENTS_MEDI, LIVINGAREA_MEDI, NONLIVINGAPARTMENTS_MEDI, NONLIVINGAREA_MEDI
Other building columns: FONDKAPREMONT_MODE, HOUSETYPE_MODE, TOTALAREA_MODE, WALLSMATERIAL_MODE, EMERGENCYSTATE_MODE
OBS_30_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings with observable 30 DPD (days past due) default
DEF_30_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings defaulted on 30 DPD (days past due)
OBS_60_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings with observable 60 DPD (days past due) default
DEF_60_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings defaulted on 60 DPD (days past due)
DAYS_LAST_PHONE_CHANGE How many days before application did client change phone
FLAG_DOCUMENT_2 Did client provide document 2
FLAG_DOCUMENT_3 Did client provide document 3
FLAG_DOCUMENT_4 Did client provide document 4
FLAG_DOCUMENT_5 Did client provide document 5
FLAG_DOCUMENT_6 Did client provide document 6
FLAG_DOCUMENT_7 Did client provide document 7
FLAG_DOCUMENT_8 Did client provide document 8
FLAG_DOCUMENT_9 Did client provide document 9
FLAG_DOCUMENT_10 Did client provide document 10
FLAG_DOCUMENT_11 Did client provide document 11
FLAG_DOCUMENT_12 Did client provide document 12
FLAG_DOCUMENT_13 Did client provide document 13
FLAG_DOCUMENT_14 Did client provide document 14
FLAG_DOCUMENT_15 Did client provide document 15
FLAG_DOCUMENT_16 Did client provide document 16
FLAG_DOCUMENT_17 Did client provide document 17
FLAG_DOCUMENT_18 Did client provide document 18
FLAG_DOCUMENT_19 Did client provide document 19
FLAG_DOCUMENT_20 Did client provide document 20
FLAG_DOCUMENT_21 Did client provide document 21
AMT_REQ_CREDIT_BUREAU_HOUR Number of enquiries to Credit Bureau about the client one hour before application
AMT_REQ_CREDIT_BUREAU_DAY Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application)
AMT_REQ_CREDIT_BUREAU_WEEK Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application)
AMT_REQ_CREDIT_BUREAU_MON Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application)
AMT_REQ_CREDIT_BUREAU_QRT Number of enquiries to Credit Bureau about the client 3 months before application (excluding one month before application)
AMT_REQ_CREDIT_BUREAU_YEAR Number of enquiries to Credit Bureau about the client one year before application (excluding last 3 months before application)
You can use the training dataset (but not validation) for training machine learning models, and you can use the test dataset to evaluate your solutions and avoid over-fitting.
Please Note:
Part-I: Regression (10 Marks)
In the first part of the assignment, you are asked to predict the client's "income" based on the information provided in their loan application. More specifically, you need to predict a client's income based on columns (or any subsets) provided in the dataset except for AMT_INCOME_TOTAL, which you are predicting.
Part-II: Classification (10 Marks)
Using the same datasets, you must predict if a loan application should be approved or not. For this part, you can use all columns (or any subset) of the dataset except "TARGET", the column that you are going to predict.
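As a rough illustration of the column rules for both parts, the sketch below separates the column being predicted from the remaining feature columns. The helper name split_features_target and the tiny in-memory sample are hypothetical, and dropping SK_ID_CURR from the features is an assumption (it is an identifier), not a requirement of the specification.

```python
import csv
import io

ID_COLUMN = "SK_ID_CURR"  # assumption: the loan ID is not used as a feature


def split_features_target(rows, target):
    """Return (features, targets), excluding the target column from the features."""
    features, targets = [], []
    for row in rows:
        targets.append(row[target])
        features.append(
            {name: value for name, value in row.items()
             if name not in (target, ID_COLUMN)}
        )
    return features, targets


# Hypothetical stand-in for a few columns of training.csv
SAMPLE = """SK_ID_CURR,TARGET,AMT_INCOME_TOTAL,AMT_CREDIT
1,0,180000,500000
2,1,90000,250000
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Part-I predicts AMT_INCOME_TOTAL; Part-II predicts TARGET
X_reg, y_reg = split_features_target(rows, "AMT_INCOME_TOTAL")
X_clf, y_clf = split_features_target(rows, "TARGET")
```

Note that csv.DictReader yields strings, so numeric columns would still need converting before being passed to a model.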
Submission
You must submit two files:
A python script z{id}.py
A report named z{id}.pdf
Python Script and Expected Output files
Your code must run on CSE machines using the following command with two arguments:
$ python3 z{id}.py path1 path2
path1 : indicates the path for the dataset which should be used for training the model (e.g., ~/training.csv)
path2 : indicates the path for the dataset which should be used for testing the model (e.g., ~/test.csv)
For example, the following command will train your models for the first part of the assignment and use the test dataset to report the performance:
$ python3 YOUR_ZID.py training.csv test.csv
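One possible way to read the two dataset paths from the command line, rather than hard-coding file names, is sketched below. The function name parse_paths and the zID z1234567 are hypothetical; in a real script the list passed in would be sys.argv.

```python
import sys


def parse_paths(argv):
    """Return (train_path, test_path) from an argv-style list.

    argv[0] is the script name, so exactly three entries are expected.
    """
    if len(argv) != 3:
        raise SystemExit(f"usage: python3 {argv[0]} path1 path2")
    return argv[1], argv[2]


# In the submitted script this would be: parse_paths(sys.argv)
train_path, test_path = parse_paths(["z1234567.py", "training.csv", "test.csv"])
```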
Your program should create 4 files in the same directory as the script:
z{id}.PART1.summary.csv
z{id}.PART1.output.csv
z{id}.PART2.summary.csv
z{id}.PART2.output.csv
For the first part of the assignment:
"z{id}.PART1.summary.csv" contains the evaluation metrics (MSE, correlation) for the model trained in the first part of the assignment. Use the given validation dataset to compute the metrics. The file should be formatted exactly as follows:
zid,MSE,correlation
z123456,6.13,0.53
MSE : the mean_squared_error in the regression problem
correlation : the Pearson correlation coefficient in the regression problem (a floating-point number between -1 and 1)
"z{id}.PART1.output.csv" stores the predicted incomes for all of the clients in the evaluation dataset (not the training dataset), and the file should be formatted exactly as:
SK_ID_CURR,predicted_income
1,178000
2,256000
...
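The two Part-I files could be produced along the following lines. This is a minimal sketch using only the standard library so that the metric definitions are visible; the function names, the zID z1234567, and the toy values are hypothetical, and rounding predicted incomes to whole numbers is an assumption based on the example rows above. Metrics are written to 2 decimal places, as stated in the FAQ.

```python
import csv
import math
import os
import tempfile


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def write_part1_files(zid, ids, y_true, y_pred, directory):
    """Write the PART1 summary and output CSV files and return their paths."""
    summary_path = os.path.join(directory, f"{zid}.PART1.summary.csv")
    output_path = os.path.join(directory, f"{zid}.PART1.output.csv")
    mse = mean_squared_error(y_true, y_pred)
    corr = pearson_correlation(y_true, y_pred)
    with open(summary_path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["zid", "MSE", "correlation"])
        writer.writerow([zid, f"{mse:.2f}", f"{corr:.2f}"])
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["SK_ID_CURR", "predicted_income"])
        for row_id, pred in zip(ids, y_pred):
            # Assumption: incomes are reported as whole numbers
            writer.writerow([row_id, round(pred)])
    return summary_path, output_path


# Toy values; real ones come from the second dataset and the trained model
with tempfile.TemporaryDirectory() as tmp:
    s_path, o_path = write_part1_files(
        "z1234567", [1, 2, 3],
        [100.0, 200.0, 300.0], [110.0, 190.0, 310.0], tmp)
    with open(s_path) as f:
        summary_lines = f.read().splitlines()
    with open(o_path) as f:
        output_lines = f.read().splitlines()
```

In the submitted script the directory argument would be the directory of the script itself instead of a temporary directory.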
For the second part of the assignment:
"z{id}.PART2.summary.csv" contains the evaluation metrics (average_precision, average_recall, accuracy; precision and recall use the unweighted mean) for the model trained in the second part of the assignment. Use the given validation dataset to compute the metrics. The file should be formatted exactly as:
zid,average_precision,average_recall,accuracy
z123456,0.69,0.71,0.89
average_precision : the average precision for all classes in the classification problem (a number between 0 and 1)
average_recall : the average recall for all classes in the classification problem (a number between 0 and 1)
"z{id}.PART2.output.csv" stores the predicted targets for all of the clients in the test dataset (not the training dataset), and it should be formatted exactly as follows:
SK_ID_CURR,predicted_target
1,1
2,0
...
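To make the unweighted ("macro") averaging concrete, the sketch below computes per-class precision and recall by hand and then averages them with equal weight per class. The function names, the zID, and the toy labels are hypothetical. In practice, scikit-learn's precision_score and recall_score with average='macro' are the usual way to obtain these numbers.

```python
def macro_precision_recall(y_true, y_pred):
    """Unweighted mean of per-class precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(1 for t, p in zip(y_true, y_pred)
                       if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        # A class that is never predicted (or never occurs) scores 0
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def part2_summary_rows(zid, y_true, y_pred):
    """Header and data row for the PART2 summary file, to 2 decimal places."""
    precision, recall = macro_precision_recall(y_true, y_pred)
    acc = accuracy(y_true, y_pred)
    return [
        ["zid", "average_precision", "average_recall", "accuracy"],
        [zid, f"{precision:.2f}", f"{recall:.2f}", f"{acc:.2f}"],
    ]


# Toy labels: three clients without payment difficulties, one with
rows = part2_summary_rows("z1234567", [0, 0, 0, 1], [0, 0, 1, 1])
```

Because TARGET is likely to be imbalanced, the macro mean can differ noticeably from accuracy, which is why both are reported.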
Marking Criteria
You will be marked based on:
(4 marks) Your code must run and perform the designated tasks on CSE machines without problems and create the expected files. Your submission will be penalized up to 50% if it is not able to create the output files.
(8 marks) How well your model (trained on the training dataset) performs in the test dataset (a different dataset not available for students - will be used for fair marking)
(5 marks) A report
You should provide a report containing your analysis of the dataset, which helps you in the feature engineering of your machine learning models. For this, you must use Jupyter Notebook and export it as a PDF file (https://towardsdatascience.com/jupyter-notebook-to-pdf-in-a-few-lines-3c48d68a7a63). Add comments in your notebook describing what you conclude from each of your analyses. Use charts and any skills you have learnt in the course to support your decisions about the features used in your ML models.
The late penalty is 5% per day, and submissions after day 5 will not be marked.
You will be penalized (1 mark per minute) if your models take more than 3 minutes to train and generate output files.
Your assignment will not be marked (zero marks) if any of the following occur:
If it generates hard-coded predictions
If it also uses the second dataset (test/validation) to train the model
If it does not run on CSE machines with the given command (e.g., python3 zid.py training_dataset.csv test_dataset.csv). Do NOT hard-code the dataset names.
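Since run time above 3 minutes is penalized, it may help to measure how long each stage takes while developing. The following is only a sketch: run_timed and toy_training_step are hypothetical names, and the toy step stands in for real model training.

```python
import time

TIME_LIMIT_SECONDS = 3 * 60  # the 3-minute budget from the marking criteria


def run_timed(step, *args):
    """Run step(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = step(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


def toy_training_step(n):
    """Placeholder workload standing in for fitting a model."""
    return sum(i * i for i in range(n))


result, elapsed = run_timed(toy_training_step, 1000)
within_budget = elapsed < TIME_LIMIT_SECONDS
```

Timings on a personal machine may differ from those on CSE machines, so leaving a margin below the limit is a reasonable precaution.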
FAQ
Can we define our own feature set?
Yes, you can define any features; make sure your features do not rely on the test datasets.
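One way to keep engineered features independent of the test dataset is to learn any statistics (here, fill-in values for missing entries) from the training rows only and then reuse them unchanged on the test rows. The sketch below assumes missing values are represented as None; the helper names and toy rows are hypothetical.

```python
from statistics import median


def fit_fill_values(train_rows, columns):
    """Learn one median per column from the TRAINING rows only."""
    fill = {}
    for col in columns:
        observed = [row[col] for row in train_rows if row[col] is not None]
        fill[col] = median(observed)
    return fill


def apply_fill_values(rows, fill):
    """Return copies of rows with missing values replaced by the learned medians."""
    filled = []
    for row in rows:
        new_row = dict(row)
        for col, value in fill.items():
            if new_row[col] is None:
                new_row[col] = value
        filled.append(new_row)
    return filled


train = [{"OWN_CAR_AGE": 4.0}, {"OWN_CAR_AGE": None}, {"OWN_CAR_AGE": 10.0}]
test = [{"OWN_CAR_AGE": None}, {"OWN_CAR_AGE": 100.0}]

fill = fit_fill_values(train, ["OWN_CAR_AGE"])   # uses training data only
train_filled = apply_fill_values(train, fill)
test_filled = apply_fill_values(test, fill)      # test values never affect fill
```

The same fit-on-training, apply-to-both pattern carries over to scaling and categorical encoding.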
For the average precision/recall functions, should we use the unweighted ('macro') mean or the weighted mean?
Use the unweighted ('macro') mean
Should we calculate metrics to 1 Decimal Place?
2 Decimal Places
Can we use any machine learning algorithm?
Yes, as long as it is provided in sklearn.
What python modules can we use for developing our solutions?
How should we calculate the Pearson correlation coefficient?
It is calculated between your predictions and the real values for the test dataset.
Will I get penalized for "Warnings" thrown by my code?
No, you will not get penalized.
Comments
2) The assignment specification refers to a validation set. Should we be splitting off some of our training data for validation purposes? Or should we just assume by "validation" you mean "test"?