Machine Learning Homework 2 - Solved

The Commercial Banking Corporation (hereafter the “Bank”), acting by and through its department of
Customer Services and New Products, is seeking proposals for banking services. The Bank ultimately
wants to predict which customers will buy a variable rate annuity product. Previously, the Bank sought
consulting work on the same project, with an added focus on understanding the factors involved. Here,
the focus is more on predictive power.
A variable annuity is a contract between you and an insurance company / bank, under which the insurer 
agrees to make periodic payments to you, beginning either immediately or at some future date. You 
purchase a variable annuity contract by making either a single purchase payment or a series of purchase 
payments. 
A variable annuity offers a range of investment options. The value of your investment as a variable 
annuity owner will vary depending on the performance of the investment options you choose. The 
investment options for a variable annuity are typically mutual funds that invest in stocks, bonds, money 
market instruments, or some combination of the three. If you are interested in more information, see: 
http://www.sec.gov/investor/pubs/varannty.htm 
The project will be broken down into 3 phases: 
• Phase 1 – MARS and GAMs 
• Phase 2 – Tree-Based Models 
• Phase 3 – Model Interpretation

Objective – Phase 2
The scope of services in this phase includes the following: 
• For this phase use only the insurance_t data set. 
• Previous analysis has identified potential predictor variables related to the purchase of the 
insurance product so no initial variable selection before model building is necessary. 
• The data has missing values that need to be imputed. 
o Typically, the Bank has used median and mode imputation for continuous and
categorical variables, respectively, but is open to other techniques if they are justified in the report.
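
The median and mode imputation mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the Bank's actual procedure: the helper name impute_column, the use of None to mark a missing value, and the toy values are all assumptions made for the example.

```python
from statistics import median, mode

def impute_column(values, kind):
    """Fill None entries with the median (continuous) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]

# Toy columns; the values are made up for illustration only.
acctage = [4.0, None, 10.0, 6.0]    # treated as continuous
branch = ["B1", "B2", None, "B1"]   # treated as categorical

acctage_filled = impute_column(acctage, "continuous")
branch_filled = impute_column(branch, "categorical")
print(acctage_filled)
print(branch_filled)
```

Any other imputation technique would replace only the line that computes fill, which keeps the justification asked for in the report tied to a single, visible choice.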

• The Bank is interested in the value of random forest models. 
o Build a random forest model. 
§ (HINT: You CANNOT just copy and paste the code from class. In class we built a 
model to predict a continuous variable. Make sure your target variable is a 
factor for the random forest.) 
o Tune the model parameters and recommend a final random forest model. 
§ You are welcome to consider variable selection as well for building your final 
model. Describe your process for arriving at your final model. 
o Report the variable importance for each of the variables in the model. 
§ Pick one metric to rank things by – no need to report multiple metrics for each 
variable. 
o Report the area under the ROC curve as well as a plot of the ROC curve. 
§ (HINT: Use the same approaches you used back in the logistic regression class.) 
• The Bank is also interested in the value of an XGBoost model. 
o Build an XGBoost model. 
§ (HINT: You CANNOT just copy and paste the code from class. In class we built a 
model to predict a continuous variable. You will need to look up the 
documentation for the ‘objective = "binary:logistic"’ option.)
§ Use the area under the ROC curve (AUC) as your evaluation metric instead of 
the default in XGBoost. 
o Tune the model parameters and recommend a final XGBoost model. 
§ You are welcome to consider variable selection as well for building your final 
model. Describe your process for arriving at your final model. 
o Report the variable importance for each of the variables in the model. 
o Report the area under the ROC curve as well as a plot of the ROC curve. 
§ (HINT: Use the same approaches you used back in the logistic regression class.)

Data Provided
The following two sets of data are provided for the proposal: 
• The training data set insurance_t contains 8,495 observations and selected variables. 
o All of these customers have been offered the product. The outcome is recorded in the variable
INS, which takes a value of 1 if they bought and 0 if they did not buy.
o There are selected variables describing the customer’s attributes before they were 
offered the new insurance product. 
• The validation data set insurance_v contains 2,124 observations and selected variables. 
• The table below describes the Roles and Description of the variables found in both data sets. 
o Except for Branch of Bank, consider anything with more than 10 distinct values as 
continuous.

Name       Model Role  Description
ACCTAGE    Input       Age of oldest account
DDA        Input       Indicator for checking account
DDABAL     Input       Checking account balance
DEP        Input       Checking deposits
DEPAMT     Input       Total amount deposited
CHECKS     Input       Number of checks written
DIRDEP     Input       Indicator for direct deposit
NSF        Input       Number of insufficient fund issues
NSFAMT     Input       Amount of NSF
PHONE      Input       Number of telephone banking interactions
TELLER     Input       Number of teller visit interactions
SAV        Input       Indicator for savings account
SAVBAL     Input       Savings account balance
ATM        Input       Indicator for ATM interaction
ATMAMT     Input       Total ATM withdrawal amount
POS        Input       Number of point of sale interactions
POSAMT     Input       Total amount for point of sale interactions
CD         Input       Indicator for certificate of deposit account
CDBAL      Input       CD balance
IRA        Input       Indicator for retirement account
IRABAL     Input       IRA balance
INV        Input       Indicator for investment account
INVBAL     Input       INV balance
MM         Input       Indicator for money market account
MMBAL      Input       MM balance
MMCRED     Input       Number of money market credits
CC         Input       Indicator for credit card
CCBAL      Input       CC balance
CCPURC     Input       Number of credit card purchases
SDB        Input       Indicator for safety deposit box
INCOME     Input       Income
LORES      Input       Length of residence in years
HMVAL      Input       Value of home
AGE        Input       Age
CRSCORE    Input       Credit score
INAREA     Input       Indicator for local address
INS        Target      Indicator for purchase of insurance product
BRANCH     Input       Branch of bank
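
The typing rule stated before the table (anything with more than 10 distinct values is continuous, except Branch of Bank) can be sketched as below. This is only an illustration under stated assumptions: the helper name variable_kind is hypothetical, missing values are marked with None and left out of the distinct count, and the sample values, including the branch labels, are made up.

```python
def variable_kind(name, values, threshold=10):
    """Classify a column by the stated rule; BRANCH is always categorical."""
    if name == "BRANCH":
        return "categorical"
    # Assumption: missing values (None) do not count as a distinct value.
    distinct = {v for v in values if v is not None}
    return "continuous" if len(distinct) > threshold else "categorical"

# Toy columns; the values are invented for illustration only.
columns = {
    "DDA": [0, 1, 1, 0, None],                    # 2 distinct values
    "ACCTAGE": [0.5 * i for i in range(25)],      # 25 distinct values
    "BRANCH": ["B%d" % i for i in range(1, 20)],  # 19 distinct values
}
kinds = {name: variable_kind(name, vals) for name, vals in columns.items()}
print(kinds)
```

The resulting labels decide which imputation applies to each variable: median for continuous columns and mode for categorical ones.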
