Starting from:

$30

MATH4330 Assignment 3 -Solved

Question 1: Hourly carbon monoxide (CO) averages were recorded on summer week
days at a measurement station in Los Angeles. The station was established by the 
Environmental Protection Agency as part of a larger study to assess the e↵ectiveness of 
the catalytic converter. It was located about 25 feet from the San Diego Freeway, which 
in this particular area is located at 145 degrees north. It was located such that winds 
from 145 to 325 degress (which in the summer are the prevalent wind directions during 
the daylight hours) transport the CO emissions from the highway toward the measure
ment station. Aggregate measurements were recorded for each hour of the day 1 to 24 
and the dataset is available in the fifile CO2.txt. Note: you can load this fifile in R using 
read.table() and setting the argument header=TRUE. 
Hour - hour of the day, from midnight to midnight 
CO - average summer weekday CO concentration (parts per million) 
TD - average weekday traffiffiffic density (traffiffiffic count/traffiffiffic speed) 
WS - average perpendicular wind-speed component 
(wind speed ⇥ cos(wind direction - 235 degrees)) 
(a)] Run a linear regression model to examine the e↵ect of weekday traffiffiffic 
density and wind-speed component on CO concentration. Report the estimated 
slope parameters and their confifidence intervals. 
(b)  Examine residual plots in the model from part (a). Do you think any 
of the linear regression assumptions have been violated? Explain. 
(c) Run a weighted least-squares model using the same outcome and pre
dictors from part (a). Have the estimates and confifidence intervals changed much? 
Explain. 
Question 2: Recall the chest pain dataset from Assignment 2. This dataset is from the 
Duke University Cardiovascular Disease Databank and consists of 2258 patients and 6 
variables. The patients were referred to Duke University Medical Center for chest pain. 
The variables included in the dataset acath2.csv are the following: 
- sex: sex of the patient (0=male, 1=female) 
1- age: age of the patient 
- cad.dur: duration of symptoms of coronary artery disease 
- cholest: cholesterol (in mg) 
- sigdz: signifificant coronary disease by cardiac catheterization (defifined as ≥ 75% 
diameter narrowing in at least one important coronary artery - 1=yes, 0=no) 
- tvdlm: severe coronary disease (defifined as three vessel or left main disease by 
cardiac catheterization - 1=yes, 0=no)) 
(a)  Run a logistic regression model to see the e↵ect of cholesterol (contin
uous measure) on signifificant coronary disease (sigdz). Report the odds ratio and 
interpret. Calculate and interpret a 95% confifidence interval for the odds ratio. 
(b) Calculate the predicted probability of signifificant coronary disease for an 
individual with cholesterol equal to 400. 
(c) Do you think the expression in (b) can be used to accurately predict 
signifificant coronary disease in the general population? Explain. 
(d)  Run another logistic regression model to see the e↵ect of cholesterol on 
sigdz, but this time adjust for age and sex. Report the odds ratio. From this 
new model fifit do you think there is evidence that age and sex are confounders in 
the cholesterol/coronary disease relationship? Explain. (Hint: it has nothing to do 
with signifificance of the predictor variables). 
(e) [Create an ROC curve for each of the models in part (a) and (d). Report 
the AUC for each one. Which model has better predictive accuracy? Explain. 
24330 Assignment 3 
Ravish Kamath: 213893664 
07 November, 2022 
## Type 'citation("pROC")' for a citation. 
## 
## Attaching package: 'pROC' 
## The following objects are masked from 'package:stats': 
## 
## cov, smooth, var 
## Loading required package: carData 
Question 1 
Hourly carbon monoxide (CO) averages were recorded on summer week- days at a measurement station in 
Los Angles. The station was established by the Environmental Protection Agency as part of a larger study to 
assess the effffectiveness of the catalytic converter. It was located about 25 feet from the San Diego Freeway, 
which in this particular area is located at 145 degrees north. It was located such that winds from 145 to 325 
degrees (which in the summer are the prevalent wind directions during the daylight hours) transport the CO 
emissions from the highway toward the measurement station. Aggregate measurements were recorded for 
each hour of the day 1 to 24 and the data set is available in the fifile CO2.txt. Note: you can load this fifile in 
R using read.table() and setting the argument header = TRUE. 
Hour - hour of the day, from midnight to midnight 
CO - average summer weekday CO concentration (parts per million) 
TD - average weekday traffiffiffic density (traffiffiffic count/traffiffiffic speed) 
WS - average perpendicular wind-speed component (wind speed x cos(wind direction - 235 degrees)) 
(a)  Run a linear regression model to examine the effffect of weekday traffiffiffic density and wind-speed 
component on CO concentration. Report the estimated slope parameters and their confifidence intervals. 
(b) Examine residual plots in the model from part (a). Do you think any of the linear regression 
assumptions have been violated? Explain. 
(c) Run a weighted least-sqaures model using the same outcome and predictors from part(a). 
Have the estimates and confifidence intervals changed much? Explain. 
1Solution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
Solution 
Part A 
fit = lm(CO ~ Traffic + Wind, data = CO2df) 
summary(fit) 
## 
## Call: 
## lm(formula = CO ~ Traffic + Wind, data = CO2df) 
## 
## Residuals: 
## Min 1Q Median 3Q Max 
## -0.72858 -0.31710 -0.09629 0.22409 1.26554 
## 
## Coefficients: 
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) 1.274461 0.198137 6.432 2.25e-06 *** 
## Traffic 0.018290 0.001343 13.616 6.85e-12 *** 
## Wind 0.174747 0.056765 3.078 0.0057 ** 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.4987 on 21 degrees of freedom 
## Multiple R-squared: 0.9495, Adjusted R-squared: 0.9447 
## F-statistic: 197.5 on 2 and 21 DF, p-value: 2.419e-14 
confint(fit) 
## 2.5 % 97.5 % 
## (Intercept) 0.86241251 1.68650968 
## Traffic 0.01549692 0.02108392 
## Wind 0.05669662 0.29279680 
—0 = 1.274461 and C.I is (0.8624, 1.6865). 
—1 = 0.018290 and C.I. is (0.0155, 0.0211) 
—2 = 0.174747 and C.I. is (0.0567, 0.2928) 
Part B 
Residual vs. Fitted 
plot(fit$fitted.values, fit$residuals, pch=19, col="darkgreen", cex=0.5, xlab = 'Fitted Values', 
ylab = 'Residuals', main = 'Residual vs. Fitted') 
abline(h=0, lty=2, lwd=2, col="red") 
2Solution 
4630 Assignment 1 R Code Ravish Kamath 213893664 







Residual vs. Fitted 
Fitted Values 
For the Residual vs. Fitted plot, I would say that there is no discernible change in the variation nor any 
pattern. Hence it has not violated the error variance is constant. 

−0.5 
0.0 
0.5 
1.0 
ResidualsSolution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
Residual vs. Predictors 
plot(CO2df$Traffic, fit$residuals, pch=19, col="darkgreen", cex=0.5, xlab = 'Traffic', 
ylab = 'Residuals', main = 'Residual vs. Traffic') 
abline(h=0, lty=2, lwd=2, col="red") 

50 
100 
150 
200 
250 
300 
Residual vs. Traffic 
Traffic 

−0.5 
0.0 
0.5 
1.0 
ResidualsSolution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
plot(CO2df$Wind, fit$residuals, pch=19, col="darkgreen", cex=0.5, xlab = 'Wind', 
ylab = 'Residuals', main = 'Residual vs. Wind') 
abline(h=0, lty=2, lwd=2, col="red") 







Residual vs. Wind 
Wind 
I would say for the fifirst plot shows no clear pattern, hence there is no violation, however the second plot 
(Residual vs. Wind) does tend to show a parabola pattern. This may require squaring the wind variable. 
#Normality Plot 
par(mfrow = c(1,2)) 
hist(fit$residuals, xlab = 'Residuals') 
qqPlot(scale(fit$residuals), main ='QQ-Plot', ylab = 'Residuals'); abline(0,1) 
## [1] 11 12 

−0.5 
0.0 
0.5 
1.0 
ResidualsSolution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
Histogram of fit$residuals 
Residuals 
−1.0 
0.0 0.5 1.0 1.5 
−2 
−1 



QQ−Plot 
norm quantiles 
11 
12 
Based of the histogram and qqplot, it does seem violate the assumption that it comes from a normal 
distribution. 
Part C 
res.fit <- lm(abs(fit$residuals) ~ fit$fitted.values) 
w <- 1/(res.fit$fitted.valuesˆ2) 
wls.fit <- lm(CO ~ Traffic + Wind, data=CO2df, weights=w) 
summary(wls.fit) 
## 
## Call: 
## lm(formula = CO ~ Traffic + Wind, data = CO2df, weights = w) 
## 
## Weighted Residuals: 
## Min 1Q Median 3Q Max 
## -2.1957 -0.9313 -0.2387 0.6485 3.0661 
## 
## Coefficients: 
## Estimate Std. Error t value Pr(>|t|) 
## (Intercept) 1.153799 0.111087 10.386 9.93e-10 *** 
## Traffic 0.019011 0.001232 15.429 6.24e-13 *** 
## Wind 0.184738 0.062544 2.954 0.00758 ** 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.309 on 21 degrees of freedom 
## Multiple R-squared: 0.9672, Adjusted R-squared: 0.9641 
## F-statistic: 309.6 on 2 and 21 DF, p-value: 2.609e-16 

Frequency 





10 
12 
14 
−1 



ResidualsSolution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
confint(wls.fit) 
## 2.5 % 97.5 % 
## (Intercept) 0.92278197 1.38481646 
## Traffic 0.01644902 0.02157387 
## Wind 0.05467042 0.31480494 
No the estimates and confifidence intervals have not changed much, which means that the residuals do follow a 
constant variance. 
74630 Assignment 1 R Code Ravish Kamath 213893664 
Question 2 
Recall the chest pain data set from Assignment 2. This data set is from the Duke University Cardiovascular 
Disease Databank and consists of 2258 patients and 6 variables. The patients were referred to Duke University 
Medical Center for chest pain. The variables included in the data set acath2.csv are the following: 
- sex: sex of the patient (0 = male, 1 = female) 
- age: age of the patient 
- cad.dur: duration of symptoms of coronary artery disease 
- cholest: cholesterol (in mg) 
- sigdz: signifificant coronary disease by cardiac catheterization (defifined as Ø 75% | diameter narrowing in at 
least one important coronary artery - 1 = yes, 0 = no) 
- tvdlm: severe coronary disease (defifined as three vessel or left main disease by cardiac | catheterization - 1 
= yes, 0 = no) 
(a)  Run a logistic regression model to see the effffect of cholesterol (continuous measure) on 
signifificant coronary disease (sigdz). Report the odds ratio and interpret. Calculate and interpret a 95% 
confifidence interval for the odds ratio. 
(b)Calculate the predicted probability of signifificant coronary disease for an individual with 
cholesterol equal to 400. 
(c) Do you think the expression in (b) can be used to accurately predict signifificant coronary 
disease in the general population? Explain. 
(d)  Run another logistic regression model to see the effffect of cholesterol on sigdz, but this time 
adjust for age and sex. Report the odds ratio. From this new model fifit do you think there is evidence 
that age and sex are confounders in the cholesterol/coronary disease relationship? Explain. (Hint: it 
has nothing to do with signifificance of the predictor variables). 
(e) Create an ROC curve for each of the models in part (a) and (d). Report the AUC for each 
one. Which model has better predictive accuracy? Explain. 
Solution 
Part A 
fit = glm(sigdz ~ cholest, data=df, family=binomial) 
summary(fit) 
## 
## Call: 
## glm(formula = sigdz ~ cholest, family = binomial, data = df) 
## 
## Deviance Residuals: 
## Min 1Q Median 3Q Max 
## -2.2140 -1.3669 0.8247 0.9406 1.4279 
## 
## Coefficients: 
## Estimate Std. Error z value Pr(>|z|) 
## (Intercept) -0.7525280 0.2186516 -3.442 0.000578 *** 
## cholest 0.0062268 0.0009525 6.538 6.25e-11 *** 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1) 
## 
## Null deviance: 2895.3 on 2257 degrees of freedom 
## Residual deviance: 2849.7 on 2256 degrees of freedom 
8Solution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
## AIC: 2853.7 
## 
## Number of Fisher Scoring iterations: 4 
beta1_OR = exp(fit$coefficients[2]) 
Since our odds ratio is 1.006246, the odds of having signifificant coronary disease by cardiac catheterization 
increases by a factor of 1.006246 as your cholesterol increases by 1 unit. 
CI = confint(fit, level = 0.95) 
## Waiting for profiling to be done... 
new_CI = c(exp(CI[2,1]), exp(CI[2,2])) 
new_CI 
## [1] 1.004389 1.008147 
Since the confifidence interval does not overlap 1, then we can say there is an association between 
signifificant coronary disease and cholesterol. 
Part B 
n.dat <- data.frame(cholest = 400) 
predict(fit, newdata = n.dat, type="response") 
## 1 
## 0.8504561 
Part C 
I would say that the expression in (b) cannot be used to accurately predict signifificant coronary disease for 
a general population because the data set used, were for patients that already had some prior chest pain. 
These are two difffferent populations of interest, hence, it is not a good predictive model. 
Part D 
fit2 = glm(sigdz ~ cholest + age + sex, data = df, family = binomial) 
summary(fit2) 
## 
## Call: 
## glm(formula = sigdz ~ cholest + age + sex, family = binomial, 
## data = df) 
## 
## Deviance Residuals: 
## Min 1Q Median 3Q Max 
## -2.4867 -0.8699 0.5259 0.7691 2.4029 
## 
## Coefficients: 
## Estimate Std. Error z value Pr(>|z|) 
## (Intercept) -4.177898 0.380091 -10.992 <2e-16 *** 
## cholest 0.009006 0.001076 8.367 <2e-16 *** 
## age 0.069987 0.005832 12.001 <2e-16 *** 
## sex -2.094385 0.113306 -18.484 <2e-16 *** 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
9Solution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
## 
## (Dispersion parameter for binomial family taken to be 1) 
## 
## Null deviance: 2895.3 on 2257 degrees of freedom 
## Residual deviance: 2347.4 on 2254 degrees of freedom 
## AIC: 2355.4 
## 
## Number of Fisher Scoring iterations: 4 
betas = as.vector(fit2$coefficients) 
exp(betas) 
## [1] 0.0153307 1.0090463 1.0724942 0.1231459 
—1 OR would be 1.0090463. 
—2 OR would be 1.0724942. 
—3 OR would be 0.1231459. 
To identify whether Age and Sex are confounders for signifificant coronary diseases and cholesterol level, we 
should see a difffference in our estimated coeffiffifficient for cholesterol. When comparing the two models, fifit and 
fifit2, we can see that the —1 coeffiffifficient for cholesterol has changed, though not by much, shows that Age and 
Sex are a potential confounder to the model. 
Part E 
pred.val1 = predict(fit) 
pred.val2 = predict(fit2) 
par(mfrow = c(1,2)) 
auc(df$sigdz, pred.val1, plot=TRUE, auc.polygon=TRUE, 
auc.polygon.col="lightblue", asp=FALSE, main = 'Model with Cholesterol') 
## Setting levels: control = 0, case = 1 
## Setting direction: controls < cases 
## Area under the curve: 0.5887 
auc(df$sigdz, pred.val2, plot=TRUE, auc.polygon=TRUE, 
auc.polygon.col="lightblue", asp=FALSE, main = 'Model with Cholest, Age & Sex') 
## Setting levels: control = 0, case = 1 
## Setting direction: controls < cases 
10Solution 
4630 Assignment 1 R Code Ravish Kamath 213893664 
Model with Cholesterol 
Specificity 
1.0 0.8 0.6 0.4 0.2 0.0 
Model with Cholest, Age & Sex 
Specificity 
1.0 0.8 0.6 0.4 0.2 0.0 
## Area under the curve: 0.7887 
I would say that the model with the added predictors, age and sex, would be the better model. If we see 
based of the ROC curves, the 2nd model to be closer to the top left corner of the plot. This signififies that we 
are getting higher value for sensitivity as well as specifificity. Furthermore, the AUC for the second model is 
0.7887 which is better than the previous model with only 0.5887. 
11 
Sensitivity 
0.0 
0.2 
0.4 
0.6 
0.8 
1.0 
Sensitivity 
0.0 
0.2 
0.4 
0.6 
0.8 
1.0

More products