ST404 - Applied Statistical Modelling  Assignment 3 - Credit Card Data  - Solved

PROBLEM OUTLINE 
You are acting for a consultancy firm and have been asked by a Taiwanese credit card company to help it predict which customers are likely to default on their credit card payments.

  Data Availability 
The data are provided on Moodle as two R data sets:

Training data:    CardT.rds (5000 observations)
Validation data:  CardV.rds (2067 observations)

Both data sets contain the same variables, which are listed in Table 3.1 below. To read them in, use the readRDS command. If you copy the files to your R working directory, the commands are:

 

CardT <- readRDS("CardT.rds") 

CardV <- readRDS("CardV.rds")
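After loading, it can help to confirm that each data set has the expected number of rows and contains the response variable. The following is a minimal sketch only: `check_card_data` is a hypothetical helper (not part of the assignment), and the small `toy` data frame is an invented stand-in so the code runs without the Moodle files.

```r
# Hypothetical helper: basic sanity checks on a loaded data set.
check_card_data <- function(df, expected_rows) {
  stopifnot(is.data.frame(df))
  list(
    rows_ok     = nrow(df) == expected_rows,   # row count matches the brief
    n_missing   = sum(is.na(df)),              # total number of missing cells
    has_default = "default" %in% names(df)     # response variable is present
  )
}

# Toy stand-in (invented values) written to a temporary .rds file,
# then read back with readRDS exactly as the real files would be.
toy <- data.frame(LIMIT_BAL = c(250000, 300000, 500000),
                  default   = c(0, 1, 0))
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)
toy_back <- readRDS(tmp)

res <- check_card_data(toy_back, expected_rows = 3)
print(res)

# With the real files in the working directory one might call, e.g.:
# check_card_data(readRDS("CardT.rds"), expected_rows = 5000)
# check_card_data(readRDS("CardV.rds"), expected_rows = 2067)
```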

Variable    Type        Details: description / factor level = label (meaning)
LIMIT_BAL   Continuous  Credit limit (NT$): includes both the individual consumer credit and his/her family (supplementary) credit
SEX         Factor      Gender: 1 = male; 2 = female
EDUCATION   Factor      Education level: 1 = GradSch (graduate school); 2 = Uni (university); 3 = HighSch (high school); 4 = Other (others)
MARRIAGE    Factor      Marital status: 1 = married; 2 = single; 3 = others
AGE         Continuous  Age in years
PAY_1       Factor      Repayment status in September 2005: -2 = no consumption; -1 = pay duly; 0 = the use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above§
PAY_2       Factor      Repayment status in August 2005 (levels as for PAY_1)
PAY_3       Factor      Repayment status in July 2005 (levels as for PAY_1)
PAY_4       Factor      Repayment status in June 2005 (levels as for PAY_1)
PAY_5       Factor      Repayment status in May 2005 (levels as for PAY_1)
PAY_6       Factor      Repayment status in April 2005 (levels as for PAY_1)
BILL_AMT1   Continuous  Amount of bill statement in September 2005, in Taiwanese dollars (NT$)
BILL_AMT2   Continuous  Amount of bill statement in August 2005 (NT$)
BILL_AMT3   Continuous  Amount of bill statement in July 2005 (NT$)
BILL_AMT4   Continuous  Amount of bill statement in June 2005 (NT$)
BILL_AMT5   Continuous  Amount of bill statement in May 2005 (NT$)
BILL_AMT6   Continuous  Amount of bill statement in April 2005 (NT$)
PAY_AMT1    Continuous  Amount paid in September 2005
PAY_AMT2    Continuous  Amount paid in August 2005
PAY_AMT3    Continuous  Amount paid in July 2005
PAY_AMT4    Continuous  Amount paid in June 2005
PAY_AMT5    Continuous  Amount paid in May 2005
PAY_AMT6    Continuous  Amount paid in April 2005
default     Binary      Default payment (dependent variable): 0 = No; 1 = Yes

Table 3.1: Details of variables in the provided data sets
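The level codes in Table 3.1 can be attached to readable labels with `factor()`. The sketch below assumes the factors are stored as numeric codes; the provided .rds files may already carry labels, in which case this step is unnecessary. The `toy` data frame is an invented stand-in, not the assignment data.

```r
# Toy stand-in with numeric codes (invented values, for illustration only).
toy <- data.frame(
  SEX       = c(1, 2, 2, 1),
  EDUCATION = c(1, 2, 3, 4),
  MARRIAGE  = c(1, 2, 3, 2)
)

# Map each code to the label given in Table 3.1.
toy$SEX <- factor(toy$SEX, levels = 1:2,
                  labels = c("male", "female"))
toy$EDUCATION <- factor(toy$EDUCATION, levels = 1:4,
                        labels = c("GradSch", "Uni", "HighSch", "Other"))
toy$MARRIAGE <- factor(toy$MARRIAGE, levels = 1:3,
                       labels = c("married", "single", "others"))

# Inspect the result: each column is now a labelled factor.
str(toy)
```

Any code that does not appear in `levels` becomes NA under this approach, which makes unexpected codes easy to spot with `summary()`.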
             

  

§ http://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/ClassificationProcessCreditCardDefault.html

 

  Data Analyses Required
 

  You should begin your analysis with a suitable EDA of the data. This should cover univariate considerations and the relationships between variables, especially their relationship to the dependent variable. As a result, you may need to manipulate your data before fitting a suitable model.
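A minimal EDA sketch along these lines is shown here. It runs on simulated toy data (all values and the relationship between PAY_1 and default are invented for illustration), and the choice of summaries is only one possibility, not a prescribed method.

```r
set.seed(404)
n <- 200

# Simulated toy data using three of the variable names from Table 3.1.
toy <- data.frame(
  LIMIT_BAL = round(runif(n, 250000, 800000), -4),
  AGE       = sample(21:70, n, replace = TRUE),
  PAY_1     = factor(sample(c(-2, -1, 0, 1, 2), n, replace = TRUE))
)
# Invented relationship: later repayment status raises the default probability.
p <- plogis(-2 + 0.8 * as.numeric(as.character(toy$PAY_1)))
toy$default <- rbinom(n, 1, p)

# Univariate: distribution of a continuous variable and factor level counts.
print(summary(toy$LIMIT_BAL))
level_counts <- table(toy$PAY_1)
print(level_counts)   # sparse levels may need merging before modelling

# Relationship to the response: default rate within each PAY_1 level.
default_rate <- tapply(toy$default, toy$PAY_1, mean)
print(round(default_rate, 2))

# Relationship to the response: median credit limit by default status.
print(aggregate(LIMIT_BAL ~ default, data = toy, FUN = median))
```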

  Using the training data, you should then build a suitable logistic regression model in order to predict which customers are likely to default.    

a)     You should take into consideration any information you found from your initial EDA when building your model. For example, are there any steps that need to be taken before model fitting?

b)     You should aim to use at least two selection methods.   
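As an illustration of fitting a logistic regression and applying two selection methods, the sketch below uses `glm()` with backward elimination by AIC and forward selection by BIC via `step()`. The data are simulated, the `noise` column is a hypothetical irrelevant predictor, and these two methods are just one possible pairing.

```r
set.seed(404)
n <- 500

# Simulated toy data; `noise` is a made-up predictor unrelated to default.
toy <- data.frame(
  LIMIT_BAL = runif(n, 250000, 800000),
  AGE       = sample(21:70, n, replace = TRUE),
  PAY_AMT1  = rexp(n, rate = 1 / 5000),
  noise     = rnorm(n)
)
toy$default <- rbinom(n, 1, plogis(0.5 - 4e-6 * toy$LIMIT_BAL + 0.01 * toy$AGE))

# Full and intercept-only logistic regression models.
full <- glm(default ~ LIMIT_BAL + AGE + PAY_AMT1 + noise,
            data = toy, family = binomial)
null <- glm(default ~ 1, data = toy, family = binomial)

# Method 1: backward elimination using AIC (the default penalty, k = 2).
back_aic <- step(full, direction = "backward", trace = 0)

# Method 2: forward selection using BIC (penalty k = log(n)).
fwd_bic <- step(null,
                scope = list(lower = ~ 1,
                             upper = ~ LIMIT_BAL + AGE + PAY_AMT1 + noise),
                direction = "forward", k = log(n), trace = 0)

# Compare the selected model formulas.
print(formula(back_aic))
print(formula(fwd_bic))
```

Where the two methods disagree, the difference is itself worth discussing, since BIC penalises extra terms more heavily than AIC for a sample of this size.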

  You should validate your model using suitable evaluation tools (e.g., diagnostic plots).
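A few base R diagnostics for a fitted logistic regression might look as follows. This is a sketch on simulated data; the 4/n threshold for Cook's distance is a common rule of thumb rather than a requirement of the assignment, and the decile-based comparison of fitted and observed rates is one informal calibration check among several.

```r
set.seed(404)
n <- 400

# Simulated toy data (invented coefficients, for illustration only).
toy <- data.frame(AGE      = sample(21:70, n, replace = TRUE),
                  PAY_AMT1 = rexp(n, rate = 1 / 5000))
toy$default <- rbinom(n, 1, plogis(-1.5 + 0.02 * toy$AGE - 1e-4 * toy$PAY_AMT1))

fit <- glm(default ~ AGE + PAY_AMT1, data = toy, family = binomial)

# Deviance residuals and Cook's distances for each observation.
dev_res <- residuals(fit, type = "deviance")
cooks_d <- cooks.distance(fit)

# Flag potentially influential observations (rule of thumb: D > 4/n).
influential <- which(cooks_d > 4 / n)
cat("Observations flagged as influential:", length(influential), "\n")

# Informal calibration check: within deciles of fitted probability,
# compare the mean fitted probability with the observed default rate.
p_hat <- fitted(fit)
bins  <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)),
             include.lowest = TRUE)
calib <- data.frame(fitted   = tapply(p_hat, bins, mean),
                    observed = tapply(toy$default, bins, mean))
print(round(calib, 3))
```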

 As the dependent variable is binary, you should evaluate your model's predictive performance. Choose a suitable cut-off point for the predicted probability of default to make a binary prediction, "likely to default" or "unlikely to default", then calculate the false positive and false negative rates using the validation data. You may also produce a ROC chart. You should discuss the interpretation, success and limitations of your final model.
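The cut-off and error-rate calculation can be sketched in base R as below. Everything here is an assumption made for illustration: the data are simulated, `make_toy` and `error_rates` are hypothetical helpers, and choosing the cut-off that minimises the sum of the two error rates on the training data is only one possible criterion.

```r
set.seed(404)

# Hypothetical generator for simulated training and validation data.
make_toy <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, default = rbinom(n, 1, plogis(-1 + 1.5 * x)))
}
train <- make_toy(500)
valid <- make_toy(200)

fit     <- glm(default ~ x, data = train, family = binomial)
p_train <- fitted(fit)
p_valid <- predict(fit, newdata = valid, type = "response")

# False positive and false negative rates at a given cut-off.
error_rates <- function(actual, prob, cutoff) {
  pred <- as.integer(prob >= cutoff)   # 1 = "likely to default"
  c(fpr = sum(pred == 1 & actual == 0) / sum(actual == 0),
    fnr = sum(pred == 0 & actual == 1) / sum(actual == 1))
}

# Choose the cut-off on the training data (one possible criterion:
# minimise false positive rate + false negative rate).
cutoffs     <- seq(0, 1, by = 0.01)
train_rates <- t(sapply(cutoffs, function(k) error_rates(train$default, p_train, k)))
cutoff      <- cutoffs[which.min(rowSums(train_rates))]

# Evaluate the chosen cut-off on the validation data.
valid_rates <- error_rates(valid$default, p_valid, cutoff)
print(c(cutoff = cutoff, valid_rates))

# ROC points on the validation data, ordered from cut-off 1 down to 0,
# with the area under the curve approximated by the trapezoid rule.
roc <- t(sapply(cutoffs, function(k) error_rates(valid$default, p_valid, k)))
fpr <- rev(roc[, "fpr"])
tpr <- rev(1 - roc[, "fnr"])
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
cat("Approximate AUC:", round(auc, 3), "\n")
# plot(fpr, tpr, type = "l") would draw the ROC chart.
```

In a credit setting the two error types rarely cost the same, so a cut-off that weights false negatives (missed defaulters) more heavily may be easier to justify than the symmetric criterion used here.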
