Question 1: Please download the bike sharing data set from the following website:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. (I also include the data set
on Crowdmark.) This data set contains the hourly and daily counts of rental bikes between
the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and
seasonal information. There are two data sets and we will use day.csv. Import the data
into R and perform the following exploratory analysis.
a) The variable "registered" records the number of registered users who used the bike sharing
service on a particular day. Please provide the mean value of the variable "registered" for
each day of the week.
b) Plot the conditional density plot of the variable "registered" conditional on each month
of the year.
c) Produce a two-dimensional levelplot of the variable "registered" against the combination
of temperature (variable "temp") and humidity (variable "hum").
Question 2: Perform a linear regression analysis on the bike sharing data set from Question 1.
a. Provide the summary result of the regression model with "registered" as the response
variable and "temp", "hum" as the predictors.
b. What other predictors do you think might be important for modelling the variable
"registered"? Please construct another linear model including more predictors and provide
the summary result of the second model.
c. Perform 5-fold cross validation 100 times; each time, randomly partition the dataset
into five equal parts. You will use 80% of the data as the training data and 20% as the
validation data. For the models in a) and b), calculate the total sum of squared prediction
error divided by the size of the validation data and by the number of cross-validations.
Which model has better predictive power?
Question 3: In the following marketing data set, we have 9 years of data, with sales in units
of 10 million euro and advertising expenditure in million euro.
Year Sales Advertisement
1 34 23
2 56 36
3 65 45
4 86 52
5 109 53
6 109 58
7 122 63
8 124 68
9 131 70
a) Formulate the response vector Y, which has nine entries.
b) Formulate the data matrix X; the first column should be all ones, corresponding to
the intercept, and the second column should be the predictors. The dimension of X should
be 9 × 2.
c) Write R code to compute X^t X.
d) Write R code to compute θ = (X^t X)^(-1) X^t Y. This is the estimated linear regression
coefficient of the linear model with Y as the response and X as the data matrix.
e) Run the linear regression using Y as the response and X as the predictor using the lm
command in R and compare the output with your own calculation.
f) Now two additional data points have arrived. They are Year 10, Sales 96, Advertisement
53; and Year 11, Sales 107, Advertisement 63. Please use the online algorithm to update
the linear model. Use the two new observations together to perform the sequential learning
and update the model using the stochastic gradient descent algorithm with learning rate
λ = 0.01. Note that in the updating scheme θ̂_new = θ̂_old − λ∇E_n, the term E_n will be
the sum of the squared prediction errors over the two new observations:
E_n = (y_10 − ŷ_10)^2 + (y_11 − ŷ_11)^2.
In this example, we implement the online algorithm when the new data come in batches, not
one at a time.
Question 4: Consider a data set with the response vector Y = (y_1, ..., y_n)^t and the data
matrix
X = [ 1  x_11  x_12
      ⋮    ⋮     ⋮
      1  x_n1  x_n2 ].
We model the relationship between X and Y using the linear regression model: y_i =
θ_0 + θ_1 x_i1 + θ_2 x_i2 + ε_i, i = 1, ..., n, where ε_i ~ N(0, σ^2). Let the parameter vector be
denoted as θ = (θ_0, θ_1, θ_2)^t. We wish to minimize the sum of squared residuals:
SSE = Σ_{i=1}^n (y_i − (θ_0 + θ_1 x_i1 + θ_2 x_i2))^2. Let the fitted value be denoted as
ŷ_i = θ_0 + θ_1 x_i1 + θ_2 x_i2, and let the fitted value vector be denoted as Ŷ.
a) Show that SSE = Σ_{i=1}^n (y_i − ŷ_i)^2.
b) Show that SSE = (Y − Ŷ)^t (Y − Ŷ).
c) Show that Ŷ = Xθ.
d) Simplify the derivative equation ∂SSE/∂θ = 0.
e) Find the solution of θ which solves the equation in part d.
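As a numerical illustration of the closed-form solution asked for in part e) (the simulated data and object names below are our own, not part of the assignment), the normal-equation estimate matches the coefficients returned by lm():
# Check that theta = (X^t X)^(-1) X^t Y reproduces the lm() coefficients
set.seed(1)
n <- 50
x1 <- runif(n); x2 <- runif(n)
Ysim <- 2 + 3*x1 - 1.5*x2 + rnorm(n)
Xsim <- cbind(1, x1, x2)
theta_hat <- solve(t(Xsim) %*% Xsim) %*% (t(Xsim) %*% Ysim)
theta_hat
coef(lm(Ysim ~ x1 + x2))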
Question 5: Analyze the QSAR fish toxicity data set from the site:
https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity. Perform variable selection on
the six predictors using the lasso (via the lars package).
a) Based on the output of "lars", please provide the sequence of candidate models. For
example, the first model is {X5}, the second model is {X5, X3} and the third model is
{X5, X3, X10}, etc.
b) Using the cross validation method, select the best value for the fraction s based on the
plot of the cross validation error against the fraction s. The fraction s measures the ratio of
the L1 norm of the penalized estimate to the L1 norm of the regular unpenalized estimate.
c) Using the optimum s you selected, perform the penalized regression and output the
optimum model and the estimated coefficients.
Question 6: Simulate a data set with 100 observations y_i = 30 + 5x_i + 2x_i^2 + 3x_i^3 + ε_i,
where ε_i follows an independent normal distribution N(0, 1).
a) Perform polynomial regression on your simulated data set using x, I(x^2), I(x^3) as the
predictors. Compare the estimated coefficients with the true model and report the R-squared.
b) Formulate the design matrix of this regression and write down the first two rows of the
design matrix based on your data set.
c) Perform polynomial regression on your simulated data set using x, I(x^2), I(x^3), I(x^4)
as the predictors. Compare the estimated coefficients with the true model and report the
R-squared. Is the R-squared increased or decreased compared to the model in part (a)?
Explain why.
Ravish Kamath: 213893664
02/04/2022
day = read.csv('/Users/ravishkamath/Desktop/University/2. York Math/1 MATH/Statistics /MATH 3333/3. Assessments/Assignments/Assignment 1/Bike-Sharing-Dataset/day.csv')
fishToxicity = read.csv('/Users/ravishkamath/Desktop/University/2. York Math/1 MATH/Statistics /MATH 3333/3. Assessments/Assignments/Assignment 1/QSAR Fish Toxicity Dataset/qsar_fish_toxicity.csv')
library(lars)
## Loaded lars 1.2
library(lattice)
library(locfit)
## locfit 1.5-9.4 2020-03-24
library(knitr)
library(ggplot2)
Question 1
Part A
registeredMeans = aggregate(day$registered, by = list(day$weekday),
FUN = function(x) {round(mean(x), digits = 0)})
days = matrix(c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday'), byrow = T)
means = cbind(days, registeredMeans)
names(means) = c('Days', 'Days ID', 'Registered Mean')
means
## Days Days ID Registered Mean
## 1 Sunday 0 2891
## 2 Monday 1 3664
## 3 Tuesday 2 3954
## 4 Wednesday 3 3997
## 5 Thursday 4 4076
## 6 Friday 5 3938
## 7 Saturday 6 3085
Part B
library(lattice)
densityplot(~registered|mnth, data = day, plot.points = FALSE, col = "black")
[Figure: conditional density plots of "registered", one panel per month (mnth); x-axis: registered, y-axis: density.]
Part C
df_brakes = tapply(day$registered,
INDEX = list(cut(day$temp, breaks = 10),
cut(day$hum, breaks = 10)), FUN = mean, na.rm = TRUE)
levelplot(df_brakes, scales = list(x = list(rot = 90)), main = '2D Levelplot of Registered',
xlab = 'temperature', ylab = 'humidity')
[Figure: 2D levelplot of "registered" over binned temperature (x-axis) and humidity (y-axis); mean registered counts range from about 1000 to 6000.]
Question 2
Part A
mod = lm(registered ~ temp + hum, data = day)
summary(mod)
##
## Call:
## lm(formula = registered ~ temp + hum, data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3687.4 -994.5 -177.7 968.0 3170.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2405.1 239.4 10.046 < 2e-16 ***
## temp 4778.5 263.1 18.162 < 2e-16 ***
## hum -1777.6 338.1 -5.257 1.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1291 on 728 degrees of freedom
## Multiple R-squared: 0.3175, Adjusted R-squared: 0.3156
## F-statistic: 169.3 on 2 and 728 DF, p-value: < 2.2e-16
Part B
mod1 = lm(registered ~ temp + hum + workingday + weathersit, data = day)
summary(mod1)
##
## Call:
## lm(formula = registered ~ temp + hum + workingday + weathersit,
## data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2907.45 -960.55 -74.98 901.86 2799.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1948.26 229.30 8.497 < 2e-16 ***
## temp 4340.16 251.88 17.231 < 2e-16 ***
## hum -584.22 397.70 -1.469 0.142
## workingday 971.65 95.51 10.173 < 2e-16 ***
## weathersit -530.27 104.10 -5.094 4.47e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1196 on 726 degrees of freedom
## Multiple R-squared: 0.4161, Adjusted R-squared: 0.4129
## F-statistic: 129.3 on 4 and 726 DF, p-value: < 2.2e-16
Part C
### CV for part a)
totalCVerror <- 0
for (j in 1:100){
  training <- sample(1:731, 584)
  trainingset <- day[training, ]
  testingset <- day[-training, ]
  # fit the model on the training set, then accumulate the squared
  # prediction errors on the held-out validation set
  model <- lm(registered ~ temp + hum, data = trainingset)
  prediction <- predict(model, newdata = testingset)
  errors <- sum((testingset$registered - prediction)^2)
  totalCVerror <- totalCVerror + errors
}
averageCVerrors_modA <- totalCVerror/100/147
averageCVerrors_modA
## [1] 16434.51
### CV for part b)
totalCVerror <- 0
for (j in 1:100){
  training <- sample(1:731, 584)
  trainingset <- day[training, ]
  testingset <- day[-training, ]
  # fit the larger model on the training set, then accumulate the squared
  # prediction errors on the held-out validation set
  model <- lm(registered ~ temp + hum + workingday + weathersit, data = trainingset)
  prediction <- predict(model, newdata = testingset)
  errors <- sum((testingset$registered - prediction)^2)
  totalCVerror <- totalCVerror + errors
}
averageCVerrors_modB <- totalCVerror/100/147
averageCVerrors_modB
## [1] 14343.96
#In this case, the model used in part B has better predictive power,
#since it has a smaller average prediction error.
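For completeness, a minimal sketch of one pass of an explicit 5-fold partition as described in the question (the objects folds, fit, and cvError are our own); repeating this 100 times and averaging gives the quantity requested.
# One explicit 5-fold pass for the model in part a):
# each observation gets a random fold label, and each fold
# in turn serves as the 20% validation set.
folds <- sample(rep(1:5, length.out = nrow(day)))
cvError <- 0
for (k in 1:5) {
  trainingset <- day[folds != k, ]
  testingset  <- day[folds == k, ]
  fit  <- lm(registered ~ temp + hum, data = trainingset)
  pred <- predict(fit, newdata = testingset)
  cvError <- cvError + sum((testingset$registered - pred)^2)
}
cvError / nrow(day)   # average squared prediction error across the 5 folds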
Question 3
Part A
Y = c(34, 56, 65, 86, 109, 109, 122, 124, 131)
matrix(Y, ncol = 1, byrow = T)
## [,1]
## [1,] 34
## [2,] 56
## [3,] 65
## [4,] 86
## [5,] 109
## [6,] 109
## [7,] 122
## [8,] 124
## [9,] 131
Part B
advertisement = c(23, 36, 45, 52, 53, 58, 63, 68, 70)
ones = c(rep(1,9))
X = cbind(ones, advertisement)
X
## ones advertisement
## [1,] 1 23
## [2,] 1 36
## [3,] 1 45
## [4,] 1 52
## [5,] 1 53
## [6,] 1 58
## [7,] 1 63
## [8,] 1 68
## [9,] 1 70
Part C
XtransX = t(X)%*%X
XtransX
## ones advertisement
## ones 9 468
## advertisement 468 26220
Part D
XtransY = t(X)%*%Y
theta = (solve(XtransX))%*%XtransY
theta
## [,1]
## ones -20.550602
## advertisement 2.181529
Part E
mod = summary(lm(Y~X))
mod
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.618 -3.793 -1.156 4.375 13.930
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.5506 10.2415 -2.007 0.0848 .
## Xones NA NA NA NA
## Xadvertisement 2.1815 0.1897 11.497 8.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.236 on 7 degrees of freedom
## Multiple R-squared: 0.9497, Adjusted R-squared: 0.9425
## F-statistic: 132.2 on 1 and 7 DF, p-value: 8.47e-06
#As you can see, using the lm function vs. manually solving the
#least squares equations gives the same intercept and slope estimates;
#the extra column of ones is reported as NA because it duplicates the intercept.
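A minimal alternative sketch (not required by the assignment): fit directly on the advertisement vector from Part B and let lm() add its own intercept, which avoids the redundant column of ones and the NA row.
# Equivalent fit without the manual intercept column;
# lm() supplies the intercept, so there is no singularity.
summary(lm(Y ~ advertisement))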
Part F
#Please check the handwritten notes
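For reference, a minimal R sketch of the batch stochastic gradient descent update described in the question, using λ = 0.01 and starting from the Part D estimate theta; the object names Xnew, Ynew, and theta_new are our own.
# Two new observations: Year 10 (Sales 96, Advertisement 53)
# and Year 11 (Sales 107, Advertisement 63).
Xnew <- cbind(1, c(53, 63))   # intercept column and advertisement
Ynew <- c(96, 107)            # sales
lambda <- 0.01
# Gradient of E_n = (y10 - yhat10)^2 + (y11 - yhat11)^2 with respect to theta:
# grad E_n = -2 * t(Xnew) %*% (Ynew - Xnew %*% theta)
grad <- -2 * t(Xnew) %*% (Ynew - Xnew %*% theta)
theta_new <- theta - lambda * grad
theta_new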
Question 5
Part A
lasso <-lars(x=as.matrix(fishToxicity[,1:6]),y=as.numeric(unlist(fishToxicity[,7])),trace=TRUE)
## LASSO sequence
## Computing X'X .....
## LARS Step 1 : Variable 6 added
## LARS Step 2 : Variable 2 added
## LARS Step 3 : Variable 3 added
## LARS Step 4 : Variable 4 added
## LARS Step 5 : Variable 5 added
## LARS Step 6 : Variable 1 added
## Computing residuals, RSS etc .....
#Based on the Output the following models will be listed based on the LASSO Method
# 1st Model: X_6 (MLOGP)
# 2nd Model: X_6 (MLOGP), X_2 (SM1_Dx(Z))
# 3rd Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i)
# 4th Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i), X_4(NdsCH)
# 5th Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i), X_4(NdsCH), X_5 (NdssC)
# 6th Model: X_6 (MLOGP), X_2 (SM1_Dx(Z)), X_3(GATS1i), X_4(NdsCH), X_5 (NdssC), X_1 (CIC0)
Part B
cv.lars(x=as.matrix(fishToxicity[,1:6]),y=as.numeric(unlist(fishToxicity[,7])),K=10)
[Figure: cross-validated MSE plotted against the fraction of the final L1 norm.]
# Based on looking at the plot, I would choose 0.8 as the best fraction s value.
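As a quick sketch for checking this choice programmatically (the object cvres is our own name), one can take the fraction with the smallest cross-validated error; the exact minimizer varies a little between runs because the folds are random.
# Re-run the cross validation without the plot and locate the
# fraction s that minimizes the cross-validated MSE.
cvres <- cv.lars(x = as.matrix(fishToxicity[, 1:6]),
                 y = as.numeric(unlist(fishToxicity[, 7])),
                 K = 10, plot.it = FALSE)
cvres$index[which.min(cvres$cv)]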
Part C
lasso <-lars(x=as.matrix(fishToxicity[,1:6]),y=as.numeric(unlist(fishToxicity[,7])),trace=TRUE)
## LASSO sequence
## Computing X'X .....
## LARS Step 1 : Variable 6 added
## LARS Step 2 : Variable 2 added
## LARS Step 3 : Variable 3 added
## LARS Step 4 : Variable 4 added
## LARS Step 5 : Variable 5 added
## LARS Step 6 : Variable 1 added
## Computing residuals, RSS etc .....
coef(lasso,s=0.8,mode="fraction")
## CIC0 SM1_Dz.Z. GATS1i NdsCH NdssC MLOGP
## 0.2077567 0.9921246 -0.4651972 0.2890276 0.0354972 0.4324455
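As a small follow-up sketch (opt_coef is our own name), the optimum model at s = 0.8 can be read off as the predictors with nonzero penalized coefficients; from the output above, all six predictors remain in the model at this fraction.
# Predictors retained (nonzero coefficients) at the chosen s = 0.8
opt_coef <- coef(lasso, s = 0.8, mode = "fraction")
names(opt_coef)[opt_coef != 0]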
Question 6
Part A
q = seq(0, .99, by = 0.01)
length(q)
## [1] 100
y = 30 + 5*q + 2*q^2 + 3*q^3
noise = rnorm(length(q), mean = 0, sd = 1)
noisy.y = y + noise
plot(q, noisy.y, col = 'deepskyblue4', xlab = 'q', main = 'Observed Data')
lines(q, y, col = 'firebrick1', lwd = 3)
[Figure: "Observed Data" scatter plot of the simulated noisy.y against q, with the true curve overlaid.]
model <- lm(noisy.y ~ q + I(q^2) + I(q^3))
summary(model)
##
## Call:
## lm(formula = noisy.y ~ q + I(q^2) + I(q^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84753 -0.71749 0.02206 0.71125 2.15023
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.2033 0.4075 74.115 <2e-16 ***
## q 0.2893 3.5829 0.081 0.9358
## I(q^2) 16.0228 8.4337 1.900 0.0605 .
## I(q^3) -6.9555 5.5983 -1.242 0.2171
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.057 on 96 degrees of freedom
## Multiple R-squared: 0.8882, Adjusted R-squared: 0.8847
## F-statistic: 254.2 on 3 and 96 DF, p-value: < 2.2e-16
# We get the fitted model to be 30.2033 + 0.2893x + 16.0228x^2 - 6.9555x^3,
# which is quite different from the true model.
# We get the R-squared for the estimated model to be 0.8882.
Part B
matdata = c(1, q[1], (q[1]^2), (q[1]^3), 1, q[2], (q[2]^2), (q[2]^3))
xMat = matrix(matdata, nrow = 2, ncol = 4, byrow = TRUE)
colnames(xMat) = c('1', 'x', 'x^2', 'x^3')
rownames(xMat) = c('row 1', 'row 2')
xMat
## 1 x x^2 x^3
## row 1 1 0.00 0e+00 0e+00
## row 2 1 0.01 1e-04 1e-06
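As a quick cross-check sketch (not part of the original question), the same two rows can be extracted directly from the design matrix that lm() built for the Part A fit:
# First two rows of the design matrix used by the Part A model
model.matrix(model)[1:2, ]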
Part C
model2 = lm(noisy.y ~ q + I(q^2) + I(q^3) + I(q^4))
summary(model2)
##
## Call:
## lm(formula = noisy.y ~ q + I(q^2) + I(q^3) + I(q^4))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.81946 -0.69717 0.00276 0.71388 2.15258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.0869 0.5006 60.100 <2e-16 ***
## q 2.7527 7.0816 0.389 0.698
## I(q^2) 4.6941 29.3004 0.160 0.873
## I(q^3) 10.9040 44.5754 0.245 0.807
## I(q^4) -9.0200 22.3330 -0.404 0.687
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.062 on 95 degrees of freedom
## Multiple R-squared: 0.8884, Adjusted R-squared: 0.8837
## F-statistic: 189 on 4 and 95 DF, p-value: < 2.2e-16
# We get the fitted model to be 30.0869 + 2.7527x + 4.6941x^2 + 10.9040x^3 - 9.0200x^4.
# Once again, this is quite different from the true model in terms of the coefficients.
# We get the R-squared for the estimated model to be 0.8884.
# It has increased slightly because we added one more regressor variable:
# R-squared can only increase as more regressors are added, even when
# the extra terms are not part of the true model.