STAT542 - Project 3

 
Lending Club Loan Status
You are provided with historical loan data issued by Lending Club. The goal is to build a model to predict the chance of default for a loan. 

Source

There are two sets of Lending Club data on Kaggle:

·         https://www.kaggle.com/wendykan/lending-club-loan-data: data 2007-15. 

·         https://www.kaggle.com/wordsforthewise/lending-club: all data till 2018Q2. 

We will use data from the 2nd site: accepted_2007_to_2018Q2.csv. 

The dataset has over 100 features, but some of them have too many NA values, and some are not supposed to be available at the beginning of the loan. For example, it is not meaningful to predict the status of a loan if we already know the date or amount of its last payment. So we focus on the following features (5 features in each row, 30 features in total including the response 'loan_status'):

'addr_state', 'annual_inc', 'application_type', 'dti', 'earliest_cr_line', 
  'emp_length', 'emp_title', 'fico_range_high', 'fico_range_low', 'grade', 
  'home_ownership', 'initial_list_status', 'installment', 'int_rate', 'id',
  'loan_amnt', 'loan_status', 'mort_acc', 'open_acc', 'pub_rec', 
  'pub_rec_bankruptcies', 'purpose', 'revol_bal', 'revol_util', 'sub_grade',
  'term', 'title', 'total_acc', 'verification_status', 'zip_code'
Students do not need to download data from Kaggle. A copy of cleaned data (with 30 features) is available on Piazza: loan_stat542.csv 
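For readers who want to see how the 30-column subset relates to the raw Kaggle file, the following is a minimal Python sketch, not the course's actual cleaning script. The function name `select_columns` and the constant `KEEP` are hypothetical, and the sketch assumes the raw CSV uses exactly the column names listed above.

```python
import csv

# The 30 column names listed in the assignment text.
KEEP = [
    "addr_state", "annual_inc", "application_type", "dti", "earliest_cr_line",
    "emp_length", "emp_title", "fico_range_high", "fico_range_low", "grade",
    "home_ownership", "initial_list_status", "installment", "int_rate", "id",
    "loan_amnt", "loan_status", "mort_acc", "open_acc", "pub_rec",
    "pub_rec_bankruptcies", "purpose", "revol_bal", "revol_util", "sub_grade",
    "term", "title", "total_acc", "verification_status", "zip_code",
]


def select_columns(src, dst, keep=KEEP):
    """Copy only the `keep` columns from CSV stream `src` to CSV stream `dst`.

    Columns not in `keep` are dropped; a column in `keep` that is missing
    from the input is written as an empty string.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=keep, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
```

To use it on files, open both with `newline=""` as the `csv` module recommends, for example `select_columns(open("accepted_2007_to_2018Q2.csv", newline=""), open("subset.csv", "w", newline=""))`, where the output name `subset.csv` is only a placeholder.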

[What do the different Note statuses mean?] After a loan is issued by Lending Club, the loan becomes "Current". 

·         The ideal scenario: lending club keeps receiving the monthly installment payment from the borrower, and eventually the loan is paid off and the loan status becomes "Fully Paid." 

·         Signs of trouble: the loan is past due by less than 15 days (In Grace Period), late by 15-30 days, or late by 31-120 days. 

·         Once a loan is past due for more than 120 days, its status will be "Default" or "Charged-off". Lending Club explains the difference between these two [Here]. For this project, we will treat them the same. 

We focus on closed loans, i.e., loan status being one of the following: 

·         Class 1 (bad loans): 'Default' or 'Charged Off'; 

·         Class 0 (good loans): 'Fully Paid'. 
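The class definition above can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names `encode_status` and `closed_loans` are hypothetical, and the status strings are taken verbatim from the two classes listed above.

```python
# Status strings for the two classes, copied from the project description.
BAD = {"Default", "Charged Off"}   # Class 1 (bad loans)
GOOD = {"Fully Paid"}              # Class 0 (good loans)


def encode_status(status):
    """Return 1 for a bad loan, 0 for a good loan, None for any other status."""
    status = status.strip()
    if status in BAD:
        return 1
    if status in GOOD:
        return 0
    return None  # e.g. "Current" or a late status: not a closed loan


def closed_loans(rows):
    """Yield (row, label) pairs, skipping loans that are not closed.

    `rows` is any iterable of dicts with a 'loan_status' key,
    such as the rows produced by csv.DictReader.
    """
    for row in rows:
        label = encode_status(row["loan_status"])
        if label is not None:
            yield row, label
```

Returning `None` for open loans, rather than raising an error, keeps the filter usable on the raw Kaggle data, where statuses such as "Current" are expected.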

What do you need to submit? 

Before the deadline (Thursday, November 29, 11:30PM, Pacific Time), please submit the following two items to the corresponding assignment box on Compass/Coursera: 

·         R/Python code (.R or .py or zip) that takes training data and test data as input and outputs up to three submission files, in the format described below, named "mysubmission1.txt", "mysubmission2.txt", and "mysubmission3.txt". You may output just one or two files if you only use one or two prediction models. 

·         A report (3 pages maximum, pdf only) that provides the details of your code, e.g., pre-processing, some technical details or implementation details (if not trivial) of the models you use, etc. 

Report accuracy (see evaluation metric given below), running time of your code and the computer system you use (e.g., Macbook Pro, 2.53 GHz, 4GB memory, or AWS t2.large). 

How do we evaluate your code? 

Name your main file mymain.R. If you have multiple R files, upload a zip file. After unzipping your file, we will run the command "source('mymain.R')" in a directory that contains two csv files: 

·         train.csv, 

·         test.csv. 

The two csv files are in the same format as loan_stat542.csv on Piazza, except that the column of "loan_status" is missing in test.csv. 

We construct the training and test data based on Project3_test_id.csv on Piazza. 

Build up to three prediction models. Record the prediction from a model in a txt file, which should contain a header and have the following format: 

 
 
id, prob
1077501, 0.73
1077430, 0.02
etc.
 
where "id" is the same as the "id" column from test.csv, and "prob" contains the default probability returned from your model for that loan. 
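As an illustration of this file format, here is a minimal Python sketch of a writer. The function name `write_submission` is hypothetical, and the choice of six decimal places is an assumption; the assignment does not specify a precision.

```python
def write_submission(path, ids, probs):
    """Write a submission file with the header 'id, prob' and one loan per line."""
    if len(ids) != len(probs):
        raise ValueError("ids and probs must have the same length")
    # Validate everything first so a bad value never leaves a partial file behind.
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    with open(path, "w") as f:
        f.write("id, prob\n")
        for loan_id, p in zip(ids, probs):
            # Six decimals is an assumed precision, not a course requirement.
            f.write(f"{loan_id}, {p:.6f}\n")
```

For example, `write_submission("mysubmission1.txt", [1077501, 1077430], [0.73, 0.02])` produces a three-line file whose first line is the header.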

After running your code, we should see up to three files in the same directory named "mysubmission1.txt", "mysubmission2.txt", and "mysubmission3.txt". Each submission file corresponds to a prediction on the test data. Then we'll evaluate the prediction accuracy using Log-loss. 

Our evaluation code looks like the following: 

 
 
#########################################################################
# log-loss function
logLoss = function(y, p){
    if (length(p) != length(y)){
        stop('Lengths of prediction and labels do not match.')
    }
    
    if (any(p < 0)){
        stop('Negative probability provided.')
    }
    
    p = pmax(pmin(p, 1 - 10^(-15)), 10^(-15))
    mean(ifelse(y == 1, -log(p), -log(1 - p)))
}
 
#########################################################################
# Test code begins
start.time = Sys.time()
source('mymain.R')
end.time = Sys.time()
run.time = as.numeric(difftime(end.time, start.time, units = 'min'))
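
For students submitting Python, the following is a sketch of a Python counterpart to the R `logLoss` function above, useful for checking predictions locally. It mirrors the R logic (length check, negative-probability check, clipping to [1e-15, 1 - 1e-15]); the name `log_loss` and the `eps` parameter are choices made for this sketch, and it is not the graders' code.

```python
import math


def log_loss(y, p, eps=1e-15):
    """Mean log-loss of predicted default probabilities `p` against 0/1 labels `y`."""
    if len(y) != len(p):
        raise ValueError("Lengths of prediction and labels do not match.")
    if any(pi < 0 for pi in p):
        raise ValueError("Negative probability provided.")
    total = 0.0
    for yi, pi in zip(y, p):
        # Clip so that log(0) never occurs, as in the R version.
        pi = max(min(pi, 1 - eps), eps)
        total += -math.log(pi) if yi == 1 else -math.log(1 - pi)
    return total / len(y)
```

A constant prediction of 0.5 for every loan gives a log-loss of log(2), about 0.693, whatever the labels are, which makes a convenient sanity baseline when comparing models.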
 
