STA4102 Final Exam Solution

1. This is the ’take-home and in-class component’ of the exam. Work is to be done using software such as R, SAS, Python, MATLAB, Octave, or C/C++.
3. Work is to be done individually.
4. Partial credit will be given for incorrect answers that include an exposition of the thought processes that went into your results, and for code that contains an error (’bug’) but is along the correct approach. It is expected that explanations will follow the answers.
5. P-values are considered significant at the 0.05 level.
6. Present results and code together. If exporting the code or results is a problem, screenshots within a Word document or PDF are acceptable.
7. There is an ’in class’ component listed at the end.

1 A dataset from US store sales is taken from Kaggle ( and is in a file called ’transfusion.csv’.
a. Load the dataset. Separate 70% of the rows to be training data and the rest to be testing data.
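The 70/30 split in part a can be sketched as follows, assuming Python with only the standard library. The helper names `load_rows` and `train_test_split_rows` and the fixed seed are illustrative choices, not part of the exam, and the demonstration uses ten placeholder rows rather than the real file.

```python
import csv
import random


def load_rows(path):
    """Read a CSV file into a list of dicts, one per row (hypothetical helper)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def train_test_split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    # A seeded generator makes the random selection reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for the rows of the CSV file.
demo_rows = list(range(10))
train, test = train_test_split_rows(demo_rows)
print(len(train), len(test))
```

With the real file, `load_rows("transfusion.csv")` would supply the rows in place of `demo_rows`.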
(Total: 8 points)
2 A dataset from credit card customers is taken from Kaggle ( and is in a file called ’LifeExpectancy.csv’.
a. Load the dataset. Produce a training and testing dataset on a 70-30 split (randomized selections). (1 point)
b.
• Produce a decision tree model to predict the life expectancy, excluding the first 4 columns (Country/Year/Continent/Least Developed), using the training data, and find the MSE on the testing data.
• Produce a random forest model to predict the life expectancy excluding the first 4 columns (Country/Year/Continent/Least Developed), using the training data, and find the MSE on the testing data.
• Produce an XGBoost model to predict the life expectancy excluding the first 4 columns (Country/Year/Continent/Least Developed), using the training data, and find the MSE on the testing data.
• Discuss the merits of each model based upon the MSE. Use your models to then discuss the most important variables for the predictions.
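One possible shape for the model comparison in part b is sketched below, assuming Python with NumPy and scikit-learn installed. The function name `compare_models` is hypothetical, the data are synthetic stand-ins (the real features would be the columns after the first four), and the XGBoost model is only fitted if the `xgboost` package happens to be available.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

try:
    from xgboost import XGBRegressor  # optional dependency
except ImportError:
    XGBRegressor = None


def compare_models(X, y, seed=0):
    """Fit each model on a 70/30 split; return {name: (test MSE, importances)}."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    models = {
        "decision_tree": DecisionTreeRegressor(random_state=seed),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    if XGBRegressor is not None:
        models["xgboost"] = XGBRegressor(n_estimators=100, random_state=seed)

    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        mse = mean_squared_error(y_test, model.predict(X_test))
        results[name] = (mse, model.feature_importances_)
    return results


# Synthetic data (an assumption): feature 0 drives most of the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

results = compare_models(X, y)
for name, (mse, importances) in results.items():
    print(name, "test MSE:", round(mse, 3), "top feature:", int(np.argmax(importances)))
```

A lower test MSE indicates better out-of-sample predictions, and the `feature_importances_` array gives one way to rank the variables for the discussion.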
(4 points)
c. Place the data (LifeExpectancy.csv) into an SQL database. (1 point)
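Placing CSV data into an SQL database could look like the sketch below, assuming Python's built-in `sqlite3` module and an in-memory database. The table name `life_expectancy`, the three column names, and the sample rows are all assumptions made for illustration; the real file has more columns and its headers may differ.

```python
import csv
import io
import sqlite3

# Made-up sample standing in for the contents of the CSV file.
SAMPLE_CSV = """Country,Year,LifeExpectancy
A,2005,70.1
A,2010,72.3
B,2005,60.4
B,2010,63.0
"""


def load_csv_into_sqlite(conn, csv_file, table="life_expectancy"):
    """Create the table and insert every CSV row; return the row count."""
    reader = csv.DictReader(csv_file)
    conn.execute(
        f"CREATE TABLE {table} (Country TEXT, Year INTEGER, LifeExpectancy REAL)"
    )
    rows = [
        (r["Country"], int(r["Year"]), float(r["LifeExpectancy"])) for r in reader
    ]
    # Parameter placeholders keep the values out of the SQL text itself.
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
n_inserted = load_csv_into_sqlite(conn, io.StringIO(SAMPLE_CSV))
n_stored = conn.execute("SELECT COUNT(*) FROM life_expectancy").fetchone()[0]
print(n_inserted, n_stored)
```

With the real file, an open file handle (for example `open("LifeExpectancy.csv", newline="")`) would replace the `io.StringIO` object, and a file path in `sqlite3.connect` would make the database persistent.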
d. Extract all the life expectancy values from the DB where the year is greater than 2008, and then where it is less than 2008, into 2 separate variables. Then produce a histogram to show the distribution of the values from the 2 variables. Afterwards, conduct a t-test to assess whether they have a significant mean difference or not. (2 points)
e. Using SQL, find the average life expectancy for each country (use GROUP BY). (1 point)
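The queries in parts d and e can be sketched as follows, again assuming `sqlite3` with a hypothetical `life_expectancy` table and made-up rows. The t-test here is Welch's unequal-variance version computed by hand with the standard library, which is one reasonable choice rather than the required one; it returns the statistic and degrees of freedom but not the p-value.

```python
import math
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE life_expectancy (Country TEXT, Year INTEGER, LifeExpectancy REAL)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO life_expectancy VALUES (?, ?, ?)",
    [
        ("A", 2005, 70.0), ("A", 2006, 71.0), ("A", 2010, 73.0), ("A", 2011, 74.0),
        ("B", 2005, 60.0), ("B", 2006, 61.0), ("B", 2010, 63.0), ("B", 2011, 64.0),
    ],
)


def values_where(conn, condition):
    """Return life expectancy values matching a fixed, script-defined condition."""
    sql = f"SELECT LifeExpectancy FROM life_expectancy WHERE {condition}"
    return [row[0] for row in conn.execute(sql)]


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Part d: two separate variables, split around the year 2008.
after_2008 = values_where(conn, "Year > 2008")
before_2008 = values_where(conn, "Year < 2008")
t_stat, dof = welch_t(after_2008, before_2008)
print("t =", round(t_stat, 4), "df =", round(dof, 2))

# Part e: average life expectancy per country using GROUP BY.
averages = conn.execute(
    "SELECT Country, AVG(LifeExpectancy) FROM life_expectancy "
    "GROUP BY Country ORDER BY Country"
).fetchall()
print(averages)
```

If SciPy and Matplotlib are available, `scipy.stats.ttest_ind(after_2008, before_2008, equal_var=False)` returns the p-value directly, and `matplotlib.pyplot.hist` can draw the two histograms.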
(Total: 9 points)
At the designated time of the exam, as stated on the official schedule, you will present the following presentation to me in person.
a. Produce a slide presentation of 4 slides with this content.
• Slide 1 is your name and student ID
• Slide 2 is the main results obtained for Q1
• Slide 3 is the main results obtained for Q2
• Slide 4 is the software issues you ran into during your assignment
(3 points)
(Total: 3 points)
