Business Analytics-Practicum I Solved

Case Study 1 (15%)
Buy-books-on-line.com is an online store that sells books about science and information technology. The store is well known in the academic community, so many of its customers are university professors and university librarians buying on behalf of their institutions. One very popular category is “Business Analytics”, which lists 56 books such as “Credit Risk Analytics”, “Marketing Analytics” and “Analytics at Work”. In the past year, 1,896 customers bought at least one of these 56 books.

The sales department of the store wants to exploit cross-selling opportunities in order to sell as many books as possible. The best way to achieve this is to make well-chosen next-best-offer propositions to customers by applying association rules. The analytics department of the store has collected a data set of 19,805 past sales transactions related to the “Business Analytics” book category. The data set is called “On_Line_Book_Store”.

You are hired as an analyst by the online store to help the analytics department with this market basket analysis initiative. After the data analysis, write a report to the company’s analytics team (technical people) explaining what you did, which method you used, how it works and what your results were. The report should also contain an executive summary in a business format. Use a minimum support level of 0.05 and a minimum confidence level of 0.1. Also set the maximum number of items in a rule to 3 (three); in the interface this option is erroneously labelled as the minimum number of items in a rule. Save the rules table in the CASUSER library under the name MBA_Results. In the main body of the report, answer the following questions:

1) Write the Executive Summary. This part accounts for 20% of the case study’s mark.

2) What are the sales (in units) of each book? Provide a relevant bar chart using the SAS Visual Analytics software. Enrich the chart with data labels, a chart title and titles on both axes. This question accounts for 20% of the case study’s mark.

3) Which two books should the store advertise to customers who bought, or are looking to buy, only one of the following:

•     Managerial Analytics

•     Implementing Analytics

•     Customer Analytics for Dummies

•     Enterprise Analytics

In other words, create an Amazon-style “Customers who bought this item also bought” list of books. What is the largest lift among the rules with three (3) items in which each of the four books above is on the left side of the rule? How is it interpreted? This part accounts for 30% of the case study’s mark.

4) If you set the maximum items in a rule to 3, which are the 3 books most bought together by customers? How many occurrences of this set of 3 books are found together? What does this number mean? What is the support metric of this set of 3 books and how is it calculated? This part accounts for 30% of the case study’s mark.  
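The support, confidence and lift metrics that questions 3 and 4 refer to can be illustrated with a short Python sketch. It is only an illustration of the standard definitions, not the SAS implementation: the function names and the toy baskets are hypothetical and are not taken from the On_Line_Book_Store data set.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def rule_metrics(transactions, lhs, rhs):
    """Support, confidence and lift of the rule lhs => rhs."""
    supp_rule = support(transactions, set(lhs) | set(rhs))
    supp_lhs = support(transactions, lhs)
    supp_rhs = support(transactions, rhs)
    confidence = supp_rule / supp_lhs   # P(rhs | lhs)
    lift = confidence / supp_rhs        # > 1 means lhs raises the chance of rhs
    return {"support": supp_rule, "confidence": confidence, "lift": lift}


# Toy baskets (hypothetical, for illustration only).
baskets = [
    {"Managerial Analytics", "Enterprise Analytics"},
    {"Managerial Analytics", "Enterprise Analytics", "Analytics at Work"},
    {"Managerial Analytics"},
    {"Analytics at Work"},
]
metrics = rule_metrics(baskets, {"Managerial Analytics"}, {"Enterprise Analytics"})
print(metrics)
```

In this toy example the rule covers 2 of 4 baskets (support 0.5), holds in 2 of the 3 baskets containing the left-hand side (confidence about 0.67), and has a lift of about 1.33 because the right-hand book appears in only half of all baskets.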

 

Case Study 2 (25%)
Sports-OnLine.com is an online retailer that sells sports clothes and shoes and has been operating in the market since October 2001. In January 2007, after six years of operation, the management team of the store wants to exploit the electronic data captured in previous years to better understand the market. After a meeting with the marketing department, it was decided that a customer segmentation analysis should be performed and that, based on the available data, a Recency Frequency Monetary (RFM) analysis would be the most suitable technique for this objective.

 
 

During the period Oct 2001 – Dec 2006, 995 customers made 4,906 sales transactions, which were recorded by the online store and stored in the following data set (4,906 transactions):

Customer_ID    Date_OF_Transaction    Amount_Of_Transaction
Cust345        05/03/2005             123
Cust120        10/01/2004             34
Cust657        23/02/2006             53
…….            …….                    …….
Cust219        03/03/2003             12
Cust086        29/07/2002             65

 

The IT department in cooperation with the Business Analytics department have transformed the above data set into RFM format, and have produced the SAS data set named RFM_Final_Practice.sas7bdat that is presented below. Since the 4,906 transactions of the previous data set have been produced by 995 customers the RFM data set has 995 rows, each one corresponding to a single customer.

 

 
 

Customer_ID    R      F      M
Cust001        4      5      485
Cust002        14     4      350
Cust003        13     2      233
…….            …….    …….    …….
Cust994        24     1      185
Cust995        6      2      187

(995 customers)
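The transformation from transaction rows into one R/F/M row per customer can be sketched in Python. This is a minimal illustration under assumptions the source does not state: R is measured here in days before a reference date of 1 January 2007, F is the number of transactions and M is the total amount spent. The function name `to_rfm` and the sample rows are hypothetical.

```python
from datetime import date, datetime


def to_rfm(transactions, reference_date):
    """Collapse (customer_id, 'DD/MM/YYYY', amount) rows into one R/F/M row per customer."""
    last_purchase, frequency, monetary = {}, {}, {}
    for customer_id, date_text, amount in transactions:
        day = datetime.strptime(date_text, "%d/%m/%Y").date()
        if customer_id not in last_purchase or day > last_purchase[customer_id]:
            last_purchase[customer_id] = day
        frequency[customer_id] = frequency.get(customer_id, 0) + 1
        monetary[customer_id] = monetary.get(customer_id, 0) + amount
    return {
        cid: {
            "R": (reference_date - last_purchase[cid]).days,  # assumed unit: days
            "F": frequency[cid],                              # number of transactions
            "M": monetary[cid],                               # assumed: total spend
        }
        for cid in last_purchase
    }


# Hypothetical rows in the layout of the transaction data set.
rows = [
    ("Cust345", "05/03/2005", 123),
    ("Cust345", "20/12/2006", 77),
    ("Cust120", "10/01/2004", 34),
]
rfm = to_rfm(rows, reference_date=date(2007, 1, 1))
print(rfm["Cust345"])
```

If R in the supplied data set is expressed in another unit (for example months), only the line that computes `"R"` would change.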

 

You are hired as the Marketing Analytics consultant to perform the RFM segmentation with the machine learning software SAS Visual Data Mining and Machine Learning in SAS Viya. Cluster the customers and profile the resulting segments. Name the segments (e.g. churners, good customers, bad customers, first-time customers) and briefly describe which marketing actions are appropriate for each segment (e.g. customer reactivation program, contacting customers for feedback, cross-selling activities, special promotions) and why. The breakdown of marks for this part of the case study is 60% business (20% Executive Summary, 40% Report) and 40% technical (e.g. methodology, graphs, tables).
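To make the clustering step concrete, the following Python sketch standardizes the R, F and M columns and runs plain k-means iterations. It assumes a k-means style algorithm and does not reproduce the settings of the SAS clustering node; the function names, the six sample customers and the choice of two clusters are all hypothetical.

```python
import math


def standardize(rows):
    """Z-score each column so R, F and M contribute on a comparable scale."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def k_means(points, initial_centroids, iterations=20):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = list(initial_centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(len(centroids)),
                      key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical (R, F, M) rows: three recent, frequent spenders and three lapsed customers.
rfm_rows = [(4, 5, 485), (6, 6, 510), (5, 4, 450), (24, 1, 185), (22, 1, 150), (26, 2, 187)]
scaled = standardize(rfm_rows)
labels, centroids = k_means(scaled, initial_centroids=[scaled[0], scaled[3]])
print(labels)
```

Profiling then amounts to comparing the average R, F and M of each cluster: a cluster with low R, high F and high M suggests loyal customers, while high R with low F and M suggests likely churners.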

 

Case Study 3 (60%)
This case study refers to a fictitious insurance organization - XYZ – and more specifically to its marketing department that is currently organizing a targeted campaign to identify segments of customers who are likely to purchase a variable annuity (an insurance product). A variable annuity is a contract with an insurance company that includes investments you choose and a fixed insurance component. It allows you to receive monthly income payments that are guaranteed for the rest of your life, which makes them a popular choice for people afraid of running out of money in retirement (so it is designed to provide retirement income). Penalties can be incurred for early withdrawals.  

The organization cannot contact every customer in its database to promote the new product, because each solicitation has a cost and contacting everyone would not be profitable. It must therefore decide whom to contact next year, based on how likely each customer is to respond with a purchase. To optimize the ROI of the campaign, management has decided to invest in analytics-based decision making and, more specifically, to develop a machine-learning customer response model. To gather data for the model, this year the organization approached a sample of 30,129 customers and promoted the insurance product to each of them through telephone solicitations, personal contacts, advertising material and gifts, recording whether each solicited customer purchased the product. For each customer, 47 characteristics representing other product usage and demographics were available.

The information that will be used to learn the behavior of the customers on whether they will purchase or not the new product is whether they responded positively to this year’s campaign given the 47 characteristics mentioned previously. Next year the developed model will be applied to the rest of the customers in the database to predict whether they will purchase the new product. If the above methodology predicts that a person in the database is likely to respond to the solicitation with a purchase next year, the marketing department will make the solicitation whereas if the methodology predicts the opposite then the marketing department will ignore the person so as not to lose money from the marketing effort.  

The business analytics department of the organization, in cooperation with the IT department, has extracted the sample customer database into a SAS data set named “insurance_campaign_history”, which contains data about the customers in the solicited sample. The data set contains the 47 customer characteristics used as input variables (other product usage and demographics). The relevant data dictionary is at the end of this document. The data set also includes a target variable, coded 1/0, that indicates whether a solicited customer bought the insurance product (1=bought, 0=not bought).

You are hired as a machine learning engineer to help the insurance organization develop a model that predicts whether a customer will purchase the insurance product if solicited. The model should be built from the data set described above. Once developed on the historical data (this year’s data), it can be applied next year to the rest of the customer database to predict whether those customers will buy the product. The IDs of the customers who are most likely to purchase will be passed to the marketing department for solicitation. A sample of the remaining customers to be scored is stored in the “insurance_campaign_score” SAS data set.

 

Please follow the steps below and answer the relevant questions:

Open SAS VDMML on SAS Viya (You will also need to open SAS Visual Analytics on SAS Viya to explore the data).

Create a new project.

Create a new data source (“insurance_campaign_history”) by consulting the relevant data dictionary at the end of this document.

 

1)              Write the Executive Summary. This part accounts for 10% of the case study’s mark.

The management team of the marketing department has come up with the following profit matrix to be used for the evaluation of the models to be created. The numbers represent monetary units e.g. dollars, euro, pounds etc.

                             Prediction
                             Contact – Solicit    Ignore
Actual    Responder          800                  0
          Non-Responder      -300                 0
 

2)              Using any assumptions you like, give an interpretation of the profit matrix presented above. This part accounts for 7.5% of the case study’s mark.

3)              Based on the above profit matrix, what minimum probability (cut-off point) should a customer have in order to be considered a buyer and hence a candidate for solicitation? Provide the mathematical calculations. This part accounts for 5.5% of the case study’s mark.

4)              Use the project settings for this question (the gear in the upper right corner of the screen). Partition the historical data set into training and validation sets using the 70% - 30% rule of thumb. Why must this be done? The sampling in the data partition is stratified. What does this mean? Also use the Misclassification Rate (Event) as the performance criterion and enter the previously calculated cut-off point into the software. This question accounts for 2.5% of the case study’s mark.
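One way to approach the cut-off in question 3 is a break-even calculation on expected profit. The sketch below assumes that p denotes a customer's predicted probability of purchasing and that the four entries of the profit matrix are the only payoffs involved; the symbol p is introduced here for illustration.

```latex
\begin{aligned}
E[\text{profit} \mid \text{solicit}] &= 800\,p - 300\,(1 - p) \\
E[\text{profit} \mid \text{ignore}]  &= 0 \\
800\,p - 300\,(1 - p) > 0 \;&\Longleftrightarrow\; 1100\,p > 300
  \;\Longleftrightarrow\; p > \tfrac{300}{1100} \approx 0.273
\end{aligned}
```

Under these assumptions, soliciting a customer is worthwhile only when the predicted probability exceeds roughly 0.27.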

5)              Are there any missing values in the variables of the data set? Provide a screenshot of the SAS Visual Analytics software to prove this. What is the proportion of buyers and non-buyers in the data set? Provide a screenshot from the SAS Visual Analytics software to prove it (pie chart). This part accounts for 2.5% of the case study’s mark.

6)              The proportion of responders to non – responders in the historical data set is 30% - 70%. What would you do if this proportion was 10% - 90%? This part accounts for 5% of the case study’s mark.

7)              Provide a graph (pie chart) using SAS Visual Analytics on SAS Viya that shows the proportion of buyers and non-buyers among those customers who have made more than 5 credit card purchases. What do you observe? This question accounts for 2.5% of the case study’s mark.

8)              Use SAS Visual Analytics on SAS Viya to show the average deposit amount (DepAmt) for a) buyers and b) non-buyers. What does this mean with respect to the target variable? This question accounts for 2.5% of the case study’s mark.

Add a decision tree node to the workspace and connect it to the data source node.  

9)              What is the variable used for the first split? Explain briefly why this variable is selected (hint: logworth). Which cases are directed to the left node and which to the right node? Where are the missing values directed to? This part accounts for 2.5% of the case study’s mark.
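The logworth mentioned in the hint is commonly defined as the negative base-10 logarithm of the p-value of a chi-square test on the candidate split. The Python sketch below illustrates that definition for a two-branch split of a binary target. It leaves out any multiple-comparison adjustments the software may apply, and the function names and counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]
    (rows: left/right branch, columns: buyer/non-buyer)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def logworth(a, b, c, d):
    """-log10 of the chi-square p-value (1 degree of freedom)."""
    statistic = chi_square_2x2(a, b, c, d)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return -math.log10(p_value)


# Hypothetical candidate splits on the same 200 customers.
strong_split = logworth(70, 30, 30, 70)   # branches differ sharply in buyer rate
weak_split = logworth(52, 48, 48, 52)     # branches look almost alike
print(strong_split, weak_split)
```

The split whose branches differ most in buyer rate gets the higher logworth, which is why a tree algorithm that ranks candidates this way would pick it first.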

 

Add a second decision tree node to the workspace. Name the new decision tree “Maximal tree”. In the properties window of the tree, change the selection method to Largest (maximal), i.e. Pruning options --> Selection Method --> Largest. Run the tree node.

10)          How many terminal leaves does the tree have? What is this tree called? Check the performance on the training and validation data sets when the Misclassification Rate is used as the assessment criterion. Provide the relevant graph (subtree assessment plot) in your report. What is the phenomenon shown by the line for the training data set (blue line) called? Explain it briefly in a couple of sentences and describe the solution to it. Provide a screenshot of the largest tree in your report. This part accounts for 7.5% of the case study’s mark.

11)          Run the first decision tree node. How many terminal leaves does the optimal tree have? Provide a screenshot of the optimal tree. Provide a screenshot of the subtree assessment plot when Misclassification Rate is selected as the performance criterion and comment on it (in a couple of sentences). This part accounts for 7.5% of the case study’s mark.

12)          Beware that the decision tree and the decision tree model are two different concepts. In the previous part you provided a screenshot of the decision tree. In this part, provide a description of the decision tree model. This part accounts for 7.5% of the case study’s mark. (Please interpret the tree by using only 5 of the terminal leaves).

13)          Write a paragraph to interpret the decision tree as you would explain it to the management team of the insurance organization i.e. to non - technical people. What are the most important variables that separate buyers from non – buyers? This part accounts for 7.5% of the case study’s mark. (Please interpret the tree by using only 5 of the terminal leaves).

Add a logistic regression node to the pipeline. Accept the default settings and run the regression node.  

Add a neural network node to the pipeline. Accept the default settings and run the neural network node.

 

14)          Go to the results window of the model comparison node and focus on the score rankings overlay plots. Check the cumulative % response chart for the validation data set. Explain what this graph shows by using the 20% and 100% points in the x axis (the 20% and 100% most highly ranked customers). This part accounts for 5.5% of the case study’s mark.

15)          Check the % response chart for the validation data set. How is this graph constructed and what do the values of the x axis represent? Explain what this graph shows by using the 25% point in the x axis. This part accounts for 5.5% of the case study’s mark.

16)          Check the cumulative lift chart for the validation data set. Explain what this graph shows by using the 20% point in the x axis. This part accounts for 5.5% of the case study’s mark.

17)          Check the cumulative % captured response graph for the validation data set. Explain what this graph shows by using the 40% point in the x axis. This part accounts for 5.5% of the case study’s mark.
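The quantities plotted in the cumulative % response, cumulative lift and cumulative % captured response charts (questions 14, 16 and 17) can be sketched in Python. This is an illustration of the usual definitions, not output from the SAS model comparison node; the function name `gains_table`, the five depth bins and the ten scored customers are hypothetical.

```python
def gains_table(scores, actuals, bins=5):
    """Rank customers by score (highest first) and summarise each cumulative depth."""
    ranked = [y for _, y in sorted(zip(scores, actuals), key=lambda pair: -pair[0])]
    total = len(ranked)
    total_responders = sum(ranked)
    baseline_rate = total_responders / total   # response rate with no model
    table = []
    for b in range(1, bins + 1):
        top = ranked[: round(total * b / bins)]   # the top b/bins of the ranking
        cum_response = sum(top) / len(top)
        table.append({
            "depth_pct": 100 * b / bins,
            "cum_pct_response": 100 * cum_response,
            "cum_pct_captured": 100 * sum(top) / total_responders,
            "cum_lift": cum_response / baseline_rate,
        })
    return table


# Hypothetical validation scores and outcomes (1 = buyer) for ten customers.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
table = gains_table(scores, actuals)
for row in table:
    print(row)
```

At 100% depth every customer is included, so the cumulative response equals the overall buyer rate, the lift is 1 and all responders are captured. The model's value shows in how far the earlier depths exceed those baselines.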

 

By now you must have selected the optimal model, so it is time to put it into production and score the data set named “insurance_campaign_score” i.e. to predict which customers are more likely to be buyers in the next campaign. Insert the necessary node (Score Data) to do that, run it and provide a screenshot with the completed process flow (In the score data node attach the insurance_campaign_score data set and for the output library select the CASUSER). You should also notice that because this data set needs to be scored it does not contain a target variable. Name the new scored table as Scored_Insurance.

In order to answer the final three questions, do the following: Select the Score data node and go to the results. Select the Output data tab and View Output. Press the Explore and Visualize button, select the CASUSER library and name the table as Scored_Insurance_Visualize. You will be transferred to the SAS Visual Analytics environment.  

18)          How many customers are there in the “insurance_campaign_score” data set? How many of them are predicted as buyers and how many as non-buyers? Provide a relevant bar chart using SAS Visual Analytics. This part accounts for 2.5% of the case study’s mark.

19)          What is the biggest probability of being a buyer assigned to a customer? What is the smallest one? This part accounts for 2.5% of the case study’s mark.

20)          Check the customers with Cust_ID = 07636 and Cust_ID = 29773. Based on which column of the scored data set does the software assign 1 / 0 to these two customers, i.e. predict that they will be buyers / non-buyers, and why? This part accounts for 2.5% of the case study’s mark.

 


Data Dictionary for Insurance_Campaign_History Data Set 

Variable             Description                                       Role       Level
ATM                  Used ATM service (1=yes, 0=no)                    Input      Binary
ATMAmt               ATM withdrawal amount                             Input      Interval
AcctAge              Age of oldest account in years                    Input      Interval
Age                  Age of customer in years                          Input      Interval
Branch               Branch of Bank (B1 – B19)                         Rejected   Nominal
CC                   Has credit card account (1=yes, 0=no)             Input      Binary
CCBal                Credit card balance                               Input      Interval
CCPurc               Number of credit card purchases                   Input      Interval
CD                   Has certificate of deposit (1=yes, 0=no)          Input      Binary
CDBal                Certificate of deposit balance                    Input      Interval
CRScore              Credit score                                      Input      Interval
CashBk               Number of times customer received cash back       Input      Interval
Checks               Number of checks                                  Input      Interval
DDA                  Checking account (1=yes, 0=no)                    Input      Binary
DDABal               Checking account balance                          Input      Interval
Dep                  Number of checking deposits                       Input      Interval
DepAmt               Amount deposited                                  Input      Interval
DirDep               Direct deposit (1=yes, 0=no)                      Input      Binary
HMOwn                Owns home (1=yes, 0=no)                           Input      Binary
HMVal                Home value in thousands of dollars                Input      Interval
ILS                  Has installment loan (1=yes, 0=no)                Input      Binary
ILSBal               Installment loan balance                          Input      Interval
IRA                  Has retirement account (1=yes, 0=no)              Input      Binary
IRABal               Retirement account balance                        Input      Interval
InArea               Local address (1=yes, 0=no)                       Input      Binary
Income               Income in thousands of dollars                    Input      Interval
Ins                  Purchase variable annuity account (1=yes, 0=no)   Target     Binary
Inv                  Has investment account (1=yes, 0=no)              Input      Binary
InvBal               Investment account balance                        Input      Interval
LOC                  Has line of credit (1=yes, 0=no)                  Input      Binary
LOCBal               Line of credit balance                            Input      Interval
LORes                Length of residence in years                      Input      Interval
MM                   Has money market account (1=yes, 0=no)            Input      Binary
MMBal                Money market balance                              Input      Interval
MMCred               Number of money market credits                    Input      Interval
MTG                  Has mortgage account (1=yes, 0=no)                Input      Binary
MTGBal               Mortgage balance                                  Input      Interval
Moved                Recent address change (1=yes, 0=no)               Input      Binary
NSF                  Occurrence of insufficient funds (1=yes, 0=no)    Input      Binary
NSFAmt               Amount of insufficient funds                      Input      Interval
POS                  Number of point of sale transactions              Input      Interval
POSAmt               Amount in point of sale transactions              Input      Interval
Phone                Number of times customer used telephone banking   Input      Interval
Res                  Area classification (R=rural, S=suburb, U=urban)  Rejected   Nominal
SDB                  Has a safety deposit box (1=yes, 0=no)            Input      Binary
Sav                  Saving account (1=yes, 0=no)                      Input      Binary
SavBal               Saving balance                                    Input      Interval
Teller               Number of teller visits                           Input      Interval
Cust_ID              Customer Identification Number                    ID         Nominal
Partition_Indicator  Partition Indicator                               Partition  Binary
 

 
