$30
Question 1: Hourly carbon monoxide (CO) averages were recorded on summer week
days at a measurement station in Los Angeles. The station was established by the
Environmental Protection Agency as part of a larger study to assess the e↵ectiveness of
the catalytic converter. It was located about 25 feet from the San Diego Freeway, which
in this particular area is located at 145 degrees north. It was located such that winds
from 145 to 325 degress (which in the summer are the prevalent wind directions during
the daylight hours) transport the CO emissions from the highway toward the measure
ment station. Aggregate measurements were recorded for each hour of the day 1 to 24
and the dataset is available in the fifile CO2.txt. Note: you can load this fifile in R using
read.table() and setting the argument header=TRUE.
Hour - hour of the day, from midnight to midnight
CO - average summer weekday CO concentration (parts per million)
TD - average weekday traffiffiffic density (traffiffiffic count/traffiffiffic speed)
WS - average perpendicular wind-speed component
(wind speed ⇥ cos(wind direction - 235 degrees))
(a)] Run a linear regression model to examine the e↵ect of weekday traffiffiffic
density and wind-speed component on CO concentration. Report the estimated
slope parameters and their confifidence intervals.
(b) Examine residual plots in the model from part (a). Do you think any
of the linear regression assumptions have been violated? Explain.
(c) Run a weighted least-squares model using the same outcome and pre
dictors from part (a). Have the estimates and confifidence intervals changed much?
Explain.
Question 2: Recall the chest pain dataset from Assignment 2. This dataset is from the
Duke University Cardiovascular Disease Databank and consists of 2258 patients and 6
variables. The patients were referred to Duke University Medical Center for chest pain.
The variables included in the dataset acath2.csv are the following:
- sex: sex of the patient (0=male, 1=female)
1- age: age of the patient
- cad.dur: duration of symptoms of coronary artery disease
- cholest: cholesterol (in mg)
- sigdz: signifificant coronary disease by cardiac catheterization (defifined as ≥ 75%
diameter narrowing in at least one important coronary artery - 1=yes, 0=no)
- tvdlm: severe coronary disease (defifined as three vessel or left main disease by
cardiac catheterization - 1=yes, 0=no))
(a) Run a logistic regression model to see the e↵ect of cholesterol (contin
uous measure) on signifificant coronary disease (sigdz). Report the odds ratio and
interpret. Calculate and interpret a 95% confifidence interval for the odds ratio.
(b) Calculate the predicted probability of signifificant coronary disease for an
individual with cholesterol equal to 400.
(c) Do you think the expression in (b) can be used to accurately predict
signifificant coronary disease in the general population? Explain.
(d) Run another logistic regression model to see the e↵ect of cholesterol on
sigdz, but this time adjust for age and sex. Report the odds ratio. From this
new model fifit do you think there is evidence that age and sex are confounders in
the cholesterol/coronary disease relationship? Explain. (Hint: it has nothing to do
with signifificance of the predictor variables).
(e) [Create an ROC curve for each of the models in part (a) and (d). Report
the AUC for each one. Which model has better predictive accuracy? Explain.
24330 Assignment 3
Ravish Kamath: 213893664
07 November, 2022
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Loading required package: carData
Question 1
Hourly carbon monoxide (CO) averages were recorded on summer week- days at a measurement station in
Los Angles. The station was established by the Environmental Protection Agency as part of a larger study to
assess the effffectiveness of the catalytic converter. It was located about 25 feet from the San Diego Freeway,
which in this particular area is located at 145 degrees north. It was located such that winds from 145 to 325
degrees (which in the summer are the prevalent wind directions during the daylight hours) transport the CO
emissions from the highway toward the measurement station. Aggregate measurements were recorded for
each hour of the day 1 to 24 and the data set is available in the fifile CO2.txt. Note: you can load this fifile in
R using read.table() and setting the argument header = TRUE.
Hour - hour of the day, from midnight to midnight
CO - average summer weekday CO concentration (parts per million)
TD - average weekday traffiffiffic density (traffiffiffic count/traffiffiffic speed)
WS - average perpendicular wind-speed component (wind speed x cos(wind direction - 235 degrees))
(a) Run a linear regression model to examine the effffect of weekday traffiffiffic density and wind-speed
component on CO concentration. Report the estimated slope parameters and their confifidence intervals.
(b) Examine residual plots in the model from part (a). Do you think any of the linear regression
assumptions have been violated? Explain.
(c) Run a weighted least-sqaures model using the same outcome and predictors from part(a).
Have the estimates and confifidence intervals changed much? Explain.
1Solution
4630 Assignment 1 R Code Ravish Kamath 213893664
Solution
Part A
fit = lm(CO ~ Traffic + Wind, data = CO2df)
summary(fit)
##
## Call:
## lm(formula = CO ~ Traffic + Wind, data = CO2df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72858 -0.31710 -0.09629 0.22409 1.26554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.274461 0.198137 6.432 2.25e-06 ***
## Traffic 0.018290 0.001343 13.616 6.85e-12 ***
## Wind 0.174747 0.056765 3.078 0.0057 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4987 on 21 degrees of freedom
## Multiple R-squared: 0.9495, Adjusted R-squared: 0.9447
## F-statistic: 197.5 on 2 and 21 DF, p-value: 2.419e-14
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 0.86241251 1.68650968
## Traffic 0.01549692 0.02108392
## Wind 0.05669662 0.29279680
—0 = 1.274461 and C.I is (0.8624, 1.6865).
—1 = 0.018290 and C.I. is (0.0155, 0.0211)
—2 = 0.174747 and C.I. is (0.0567, 0.2928)
Part B
Residual vs. Fitted
plot(fit$fitted.values, fit$residuals, pch=19, col="darkgreen", cex=0.5, xlab = 'Fitted Values',
ylab = 'Residuals', main = 'Residual vs. Fitted')
abline(h=0, lty=2, lwd=2, col="red")
2Solution
4630 Assignment 1 R Code Ravish Kamath 213893664
2
3
4
5
6
7
8
Residual vs. Fitted
Fitted Values
For the Residual vs. Fitted plot, I would say that there is no discernible change in the variation nor any
pattern. Hence it has not violated the error variance is constant.
3
−0.5
0.0
0.5
1.0
ResidualsSolution
4630 Assignment 1 R Code Ravish Kamath 213893664
Residual vs. Predictors
plot(CO2df$Traffic, fit$residuals, pch=19, col="darkgreen", cex=0.5, xlab = 'Traffic',
ylab = 'Residuals', main = 'Residual vs. Traffic')
abline(h=0, lty=2, lwd=2, col="red")
0
50
100
150
200
250
300
Residual vs. Traffic
Traffic
4
−0.5
0.0
0.5
1.0
ResidualsSolution
4630 Assignment 1 R Code Ravish Kamath 213893664
plot(CO2df$Wind, fit$residuals, pch=19, col="darkgreen", cex=0.5, xlab = 'Wind',
ylab = 'Residuals', main = 'Residual vs. Wind')
abline(h=0, lty=2, lwd=2, col="red")
0
1
2
3
4
5
6
Residual vs. Wind
Wind
I would say for the fifirst plot shows no clear pattern, hence there is no violation, however the second plot
(Residual vs. Wind) does tend to show a parabola pattern. This may require squaring the wind variable.
#Normality Plot
par(mfrow = c(1,2))
hist(fit$residuals, xlab = 'Residuals')
qqPlot(scale(fit$residuals), main ='QQ-Plot', ylab = 'Residuals'); abline(0,1)
## [1] 11 12
5
−0.5
0.0
0.5
1.0
ResidualsSolution
4630 Assignment 1 R Code Ravish Kamath 213893664
Histogram of fit$residuals
Residuals
−1.0
0.0 0.5 1.0 1.5
−2
−1
0
1
2
QQ−Plot
norm quantiles
11
12
Based of the histogram and qqplot, it does seem violate the assumption that it comes from a normal
distribution.
Part C
res.fit <- lm(abs(fit$residuals) ~ fit$fitted.values)
w <- 1/(res.fit$fitted.valuesˆ2)
wls.fit <- lm(CO ~ Traffic + Wind, data=CO2df, weights=w)
summary(wls.fit)
##
## Call:
## lm(formula = CO ~ Traffic + Wind, data = CO2df, weights = w)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.1957 -0.9313 -0.2387 0.6485 3.0661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.153799 0.111087 10.386 9.93e-10 ***
## Traffic 0.019011 0.001232 15.429 6.24e-13 ***
## Wind 0.184738 0.062544 2.954 0.00758 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.309 on 21 degrees of freedom
## Multiple R-squared: 0.9672, Adjusted R-squared: 0.9641
## F-statistic: 309.6 on 2 and 21 DF, p-value: 2.609e-16
6
Frequency
0
2
4
6
8
10
12
14
−1
0
1
2
ResidualsSolution
4630 Assignment 1 R Code Ravish Kamath 213893664
confint(wls.fit)
## 2.5 % 97.5 %
## (Intercept) 0.92278197 1.38481646
## Traffic 0.01644902 0.02157387
## Wind 0.05467042 0.31480494
No the estimates and confifidence intervals have not changed much, which means that the residuals do follow a
constant variance.
74630 Assignment 1 R Code Ravish Kamath 213893664
Question 2
Recall the chest pain data set from Assignment 2. This data set is from the Duke University Cardiovascular
Disease Databank and consists of 2258 patients and 6 variables. The patients were referred to Duke University
Medical Center for chest pain. The variables included in the data set acath2.csv are the following:
- sex: sex of the patient (0 = male, 1 = female)
- age: age of the patient
- cad.dur: duration of symptoms of coronary artery disease
- cholest: cholesterol (in mg)
- sigdz: signifificant coronary disease by cardiac catheterization (defifined as Ø 75% | diameter narrowing in at
least one important coronary artery - 1 = yes, 0 = no)
- tvdlm: severe coronary disease (defifined as three vessel or left main disease by cardiac | catheterization - 1
= yes, 0 = no)
(a) Run a logistic regression model to see the effffect of cholesterol (continuous measure) on
signifificant coronary disease (sigdz). Report the odds ratio and interpret. Calculate and interpret a 95%
confifidence interval for the odds ratio.
(b)Calculate the predicted probability of signifificant coronary disease for an individual with
cholesterol equal to 400.
(c) Do you think the expression in (b) can be used to accurately predict signifificant coronary
disease in the general population? Explain.
(d) Run another logistic regression model to see the effffect of cholesterol on sigdz, but this time
adjust for age and sex. Report the odds ratio. From this new model fifit do you think there is evidence
that age and sex are confounders in the cholesterol/coronary disease relationship? Explain. (Hint: it
has nothing to do with signifificance of the predictor variables).
(e) Create an ROC curve for each of the models in part (a) and (d). Report the AUC for each
one. Which model has better predictive accuracy? Explain.
Solution
Part A
fit = glm(sigdz ~ cholest, data=df, family=binomial)
summary(fit)
##
## Call:
## glm(formula = sigdz ~ cholest, family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2140 -1.3669 0.8247 0.9406 1.4279
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7525280 0.2186516 -3.442 0.000578 ***
## cholest 0.0062268 0.0009525 6.538 6.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2895.3 on 2257 degrees of freedom
## Residual deviance: 2849.7 on 2256 degrees of freedom
8Solution
4630 Assignment 1 R Code Ravish Kamath 213893664
## AIC: 2853.7
##
## Number of Fisher Scoring iterations: 4
beta1_OR = exp(fit$coefficients[2])
Since our odds ratio is 1.006246, the odds of having signifificant coronary disease by cardiac catheterization
increases by a factor of 1.006246 as your cholesterol increases by 1 unit.
CI = confint(fit, level = 0.95)
## Waiting for profiling to be done...
new_CI = c(exp(CI[2,1]), exp(CI[2,2]))
new_CI
## [1] 1.004389 1.008147
Since the confifidence interval does not overlap 1, then we can say there is an association between
signifificant coronary disease and cholesterol.
Part B
n.dat <- data.frame(cholest = 400)
predict(fit, newdata = n.dat, type="response")
## 1
## 0.8504561
Part C
I would say that the expression in (b) cannot be used to accurately predict signifificant coronary disease for
a general population because the data set used, were for patients that already had some prior chest pain.
These are two difffferent populations of interest, hence, it is not a good predictive model.
Part D
fit2 = glm(sigdz ~ cholest + age + sex, data = df, family = binomial)
summary(fit2)
##
## Call:
## glm(formula = sigdz ~ cholest + age + sex, family = binomial,
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4867 -0.8699 0.5259 0.7691 2.4029
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.177898 0.380091 -10.992 <2e-16 ***
## cholest 0.009006 0.001076 8.367 <2e-16 ***
## age 0.069987 0.005832 12.001 <2e-16 ***
## sex -2.094385 0.113306 -18.484 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9Solution
4630 Assignment 1 R Code Ravish Kamath 213893664
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2895.3 on 2257 degrees of freedom
## Residual deviance: 2347.4 on 2254 degrees of freedom
## AIC: 2355.4
##
## Number of Fisher Scoring iterations: 4
betas = as.vector(fit2$coefficients)
exp(betas)
## [1] 0.0153307 1.0090463 1.0724942 0.1231459
—1 OR would be 1.0090463.
—2 OR would be 1.0724942.
—3 OR would be 0.1231459.
To identify whether Age and Sex are confounders for signifificant coronary diseases and cholesterol level, we
should see a difffference in our estimated coeffiffifficient for cholesterol. When comparing the two models, fifit and
fifit2, we can see that the —1 coeffiffifficient for cholesterol has changed, though not by much, shows that Age and
Sex are a potential confounder to the model.
Part E
pred.val1 = predict(fit)
pred.val2 = predict(fit2)
par(mfrow = c(1,2))
auc(df$sigdz, pred.val1, plot=TRUE, auc.polygon=TRUE,
auc.polygon.col="lightblue", asp=FALSE, main = 'Model with Cholesterol')
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.5887
auc(df$sigdz, pred.val2, plot=TRUE, auc.polygon=TRUE,
auc.polygon.col="lightblue", asp=FALSE, main = 'Model with Cholest, Age & Sex')
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
10Solution
4630 Assignment 1 R Code Ravish Kamath 213893664
Model with Cholesterol
Specificity
1.0 0.8 0.6 0.4 0.2 0.0
Model with Cholest, Age & Sex
Specificity
1.0 0.8 0.6 0.4 0.2 0.0
## Area under the curve: 0.7887
I would say that the model with the added predictors, age and sex, would be the better model. If we see
based of the ROC curves, the 2nd model to be closer to the top left corner of the plot. This signififies that we
are getting higher value for sensitivity as well as specifificity. Furthermore, the AUC for the second model is
0.7887 which is better than the previous model with only 0.5887.
11
Sensitivity
0.0
0.2
0.4
0.6
0.8
1.0
Sensitivity
0.0
0.2
0.4
0.6
0.8
1.0