DATA 303/473 Assignment 2


Q1. In a 2015 article comparing the technological advancement of hybrid electric vehicles (HEVs) 
in different market segments, Lim et al. collected data on prices and other features for 154 HEV 
models (Lim et al. 2015. Technological Forecasting and Social Change, vol. 97, pages 140-153). We will use 
regression analysis to explore the factors that influence price. The dataset is in the file hybrid_reg.csv and 
contains the following variables: 
• carid: Vehicle ID 
• vehicle: Make of vehicle 
• year: Model year 
• msrp: Manufacturer’s suggested retail price in 2013 (US dollars). 
• accelrate: Acceleration rate in km/hour/second 
• mpg: Fuel economy in miles/gallon 
• mpgmpge: Max of mpg and mpge (mpge is miles per gallon equivalent for plug-in HEVs to take into 
account the all-electric range, with $\text{mpge} = \dfrac{33.7 \times \text{driverange}}{\text{batterycapacity}}$) 
• carclass: Model class. C = Compact, M = Midsize, TS = 2 Seater, L = Large, PT = Pickup Truck, 
MV = Minivan, SUV = Sport Utility Vehicle 
• carclass_id: Index representing model class 
The variables carid and vehicle are vehicle identifiers and will not be used in the analysis. Likewise, 
carclass_id will not be used as it is a numerical form of the variable carclass and does not provide any 
additional information. 
a. Read the dataset into R. Prepare the data for analysis by adding the new variables below 
to the dataset. Give the number of observations in each year group of the new variable yr_group: 
• yr_group: group year as follows “1997-2004”, “2005-2008”, “2009-2011”, “2012-2013”. 
• msrp.1000: convert msrp from US$ to US$1000 by dividing msrp by 1000. 
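One possible way to build the two derived variables in part (a) is sketched below. The data frame `hybrid` and its eight rows are made-up stand-ins so that the snippet runs on its own; with the real data the frame would instead come from something like `read.csv("hybrid_reg.csv")`. The break points passed to `cut()` are an assumption chosen to match the four year ranges listed above.

```r
# Made-up stand-in rows (NOT the real data); replace with read.csv("hybrid_reg.csv")
hybrid <- data.frame(
  year = c(1997, 2004, 2005, 2008, 2009, 2011, 2012, 2013),
  msrp = c(25000, 20000, 60000, 70000, 26000, 120000, 24000, 32000)
)

# cut() intervals are right-closed by default, so (1996, 2004] covers 1997-2004, etc.
hybrid$yr_group <- cut(
  hybrid$year,
  breaks = c(1996, 2004, 2008, 2011, 2013),
  labels = c("1997-2004", "2005-2008", "2009-2011", "2012-2013")
)

# Convert price from US$ to US$1000
hybrid$msrp.1000 <- hybrid$msrp / 1000

# Number of observations in each year group
table(hybrid$yr_group)
```

Using `cut()` keeps yr_group as a factor with the levels in chronological order, which is convenient for plotting and modelling later on.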
b. Use the ggplot2 package to plot msrp.1000 against each of the predictor variables, yr_group, 
accelrate, mpg, mpgmpge and carclass. Are there strong indications of non-linear relationships with 
any of the numerical predictors? If so, which ones? 
c. Create pairwise scatterplots of the numerical predictors. Is there any indication of potential 
multicollinearity among these predictors? 
d. Fit a linear model with all predictors (yr_group, accelrate, mpg, mpgmpge and carclass) 
included in the model. Calculate the VIF statistic for the predictors. To check for evidence of 
multicollinearity we will use a different threshold defined by 

$$\mathrm{VIF}_{\text{model}} = \frac{1}{1 - R^2_{\text{model}}}$$

where $R^2_{\text{model}}$ is the $R^2$ value for the model that includes all predictors. Using this threshold identifies 
predictors that have stronger relationships with other predictors than the response variable has. It 
is a more stringent way of identifying multicollinearity. If $\mathrm{GVIF}^{1/(2 \times \mathrm{Df})} > \mathrm{VIF}_{\text{model}}$, then this is 
evidence of severe multicollinearity. Calculate $\mathrm{VIF}_{\text{model}}$ for your fitted model. Is there evidence of 
severe multicollinearity? Are you surprised by the result? 
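The threshold comparison in part (d) can be sketched on simulated data as follows. Everything here is illustrative: the variables x1, x2, x3 and y are invented, and the VIFs are computed by hand in base R so the snippet has no package dependencies. In practice the `vif()` function from the car package is the usual tool, and for models with factor terms it reports a GVIF^(1/(2*Df)) column directly. For a predictor with Df = 1, that quantity is simply the square root of the ordinary VIF, which is what the sketch uses.

```r
set.seed(1)  # simulated data, purely illustrative
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # deliberately almost collinear with x1
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

# Threshold from the text: VIF_model = 1 / (1 - R^2_model)
r2_model  <- summary(fit)$r.squared
vif_model <- 1 / (1 - r2_model)

# Ordinary VIF for predictor j: 1 / (1 - R^2_j),
# where R^2_j comes from regressing x_j on the remaining predictors
X <- data.frame(x1, x2, x3)
vif <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})

# With Df = 1, GVIF^(1/(2*Df)) reduces to sqrt(VIF)
data.frame(vif = vif, scaled = sqrt(vif), exceeds = sqrt(vif) > vif_model)
```

The `exceeds` column flags predictors whose scaled VIF is above the model-based threshold, mirroring the rule stated in part (d).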
e. Fit a generalised additive model to the data including all predictors, using a smooth spline 
for each numerical predictor. Present the RSE, R2 and adjusted R2 values in a table. 
f. Print the results for the significance of smooth terms in a table. Which of the numerical 
predictors have a significant non-linear effect on msrp.1000? Justify your answer briefly. 
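A minimal sketch of the workflow in parts (e) and (f), assuming the mgcv package (a recommended package shipped with standard R installations) is used to fit the generalised additive model. The data are simulated and the variable names x1, x2 and grp are hypothetical; they only mimic the mix of numerical and categorical predictors in the assignment.

```r
library(mgcv)

set.seed(42)  # simulated data, not the hybrid vehicle data
n <- 300
d <- data.frame(
  x1  = runif(n),
  x2  = runif(n),
  grp = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
d$y <- sin(2 * pi * d$x1) + 0.5 * d$x2 + as.numeric(d$grp) + rnorm(n, sd = 0.3)

# Smooth terms for the numerical predictors; the factor enters as an ordinary term
fit <- gam(y ~ s(x1) + s(x2) + grp, data = d)
sm  <- summary(fit)

# For a Gaussian model, the deviance explained plays the role of R^2
data.frame(RSE = sqrt(sm$scale), R2 = sm$dev.expl, adj_R2 = sm$r.sq)

# Approximate significance of the smooth terms (edf, Ref.df, F, p-value)
sm$s.table
```

In the smooth-term table, an effective degrees of freedom (edf) value close to 1 suggests the fitted effect is roughly linear, while larger values point to curvature. The p-value tests whether the smooth term as a whole is needed, so both columns are worth reading together.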
g. Perform a diagnostic check of regression assumptions and adequacy of basis functions for 
the model you fitted in part (e). What conclusions do you draw from your results? (Note: ensure your 
diagnostic plots fit on a single page). 
h. Calculate and print a table of AIC values for the model in part (e) (Model 1) and each of 
the following models: 
• Model 2: excludes mpg only from Model 1 
• Model 3: excludes mpgmpge only from Model 1 
• Model 4: excludes mpg and mpgmpge from Model 1 
i. What do your results in part (h) indicate about whether both mpg and mpgmpge should be 
included in the model? Explain your answer briefly. What regression pitfall does this point to? 
j. [2 marks] Are you surprised by your conclusions in part (i) given your findings in part (d)? Explain 
your answer briefly. 
k. Calculate and print a table of BIC values for Models 1 to 4. Based on these results, which 
model would you choose as your preferred model? Explain your answer briefly. 
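The model comparison in parts (h) and (k) can be sketched as below, again on simulated data and assuming mgcv is used for fitting. The predictors a and b are hypothetical stand-ins for a pair of closely related variables (b is constructed as a near-duplicate of a), and z is an unrelated predictor; none of this reproduces the assignment's results.

```r
library(mgcv)

set.seed(7)  # simulated data, purely illustrative
n <- 300
a <- runif(n)
b <- a + rnorm(n, sd = 0.05)   # near-duplicate of a
z <- runif(n)
y <- sin(2 * pi * a) + z + rnorm(n, sd = 0.3)
d <- data.frame(y, a, b, z)

# Four models following the same drop-one / drop-both pattern as Models 1 to 4
models <- list(
  "Model 1" = gam(y ~ s(a) + s(b) + s(z), data = d),
  "Model 2" = gam(y ~ s(b) + s(z), data = d),
  "Model 3" = gam(y ~ s(a) + s(z), data = d),
  "Model 4" = gam(y ~ s(z), data = d)
)

# Lower values indicate a better trade-off between fit and complexity
ic <- data.frame(AIC = sapply(models, AIC), BIC = sapply(models, BIC))
ic
```

BIC penalises model complexity more heavily than AIC for all but very small samples, so the two criteria need not pick the same model.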
Q2. (5 marks) Suppose we have a data set with five predictors: 
• X1 = GPA 
• X2 = IQ 
• X3 = Gender (0 = female, 1 = male) 
• X4 = Interaction between GPA and IQ 
• X5 = Interaction between GPA and Gender 
The response variable, Y, is starting salary after graduation (in thousands of dollars). Suppose we get the 
following regression coefficient estimates: 
$\hat{\beta}_0 = 5$, $\hat{\beta}_1 = 8$, $\hat{\beta}_2 = 0.2$, $\hat{\beta}_3 = 10$, $\hat{\beta}_4 = 0.05$, $\hat{\beta}_5 = 2$ 
a. Write down the estimated model equation in terms of $\hat{Y}$, X1, X2 and X3. 
b. Which one of the following statements is correct and why? Show any working you do. 
i. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA 
is high enough. 
ii. For a fixed value of IQ and GPA, females earn more on average than males. 
iii. The difference in expected salary between males and females increases as GPA increases. 
iv. An increase in IQ by one point is associated with a reduction in expected salary, provided GPA is 
high enough. 
c. True or False: Since the coefficient for the GPA:IQ interaction term is very small, there is 
very little evidence of an interaction effect. Justify your answer. 
Assignment total: 40 marks 