1. Consider the dataset from the Week 4 lab. As before, do the following using both matrix calculations and R's lm commands.
(a) Calculate 95% confidence intervals for the model parameters.
(b) Calculate a 95% confidence interval for the average income of a person who has had 18 years of formal education.
(c) Calculate a 95% prediction interval for the income of a single person who has had 18 years of formal education.
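A minimal sketch of the lm route for this question (the matrix calculations are left to you), assuming the Week 4 data are in a data frame dat with columns educ and income; these object and column names are assumptions, not the actual names from the lab.

fit <- lm(income ~ educ, data = dat)

# (a) 95% confidence intervals for the model parameters
confint(fit, level = 0.95)

# (b) 95% confidence interval for the mean income at 18 years of education
predict(fit, newdata = data.frame(educ = 18), interval = "confidence", level = 0.95)

# (c) 95% prediction interval for a single person's income at 18 years of education
predict(fit, newdata = data.frame(educ = 18), interval = "prediction", level = 0.95)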
2. For simple linear regression, y = β0 + β1x + ε, show that a 100(1 − α)% confidence interval for the mean response when x = x∗ can be written as

β̂0 + β̂1x∗ ± t_{n−2}(α/2) σ̂ √( 1/n + (x∗ − x̄)²/Sxx ),

where x̄ = (1/n)Σᵢ xᵢ, Sxx = Σᵢ (xᵢ − x̄)², σ̂² is the usual unbiased estimate of the error variance, and t_{n−2}(α/2) is the upper α/2 point of the t distribution with n − 2 degrees of freedom. Similarly, show that a 100(1 − α)% prediction interval for a new response when x = x∗ can be written as

β̂0 + β̂1x∗ ± t_{n−2}(α/2) σ̂ √( 1 + 1/n + (x∗ − x̄)²/Sxx ).
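One possible starting point for the confidence interval (a hint only, not the full argument): write the fitted mean response as ŷ∗ = β̂0 + β̂1x∗ = ȳ + β̂1(x∗ − x̄); since ȳ and β̂1 are uncorrelated, Var(ŷ∗) = σ²/n + (x∗ − x̄)²σ²/Sxx, and the stated interval follows on replacing σ² by σ̂² and using the t distribution with n − 2 degrees of freedom.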
3. We can generate some data for a simple linear regression as follows:
> n <- 30
> x <- 1:n
> y <- x + rnorm(n)
Construct 95% CI’s for Ey and 95% PI’s for y, when x = 1,2,...,n. Join them up and plot them on a graph of y against x. Your plot should look something like this:
[Figure: y plotted against x, with the joined-up 95% confidence bands (inner lines) and 95% prediction bands (outer lines).]
What proportion of the y’s should you expect to lie beyond the outer lines?
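One possible way to produce such a plot, continuing from the code above; the use of predict() with interval = "confidence" and interval = "prediction", and the line types, are choices of this sketch rather than requirements.

fit <- lm(y ~ x)
newdat <- data.frame(x = 1:n)
conf <- predict(fit, newdata = newdat, interval = "confidence", level = 0.95)
pred <- predict(fit, newdata = newdat, interval = "prediction", level = 0.95)
plot(x, y)
lines(x, conf[, "lwr"], lty = 2)  # inner lines: 95% CI for Ey
lines(x, conf[, "upr"], lty = 2)
lines(x, pred[, "lwr"], lty = 3)  # outer lines: 95% PI for y
lines(x, pred[, "upr"], lty = 3)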
4. In generalised least squares we have y ∼ N(Xβ, V).
Let R be the principal square root of V⁻¹ (why does R exist?). Put v = Ry, and show that v ∼ MVN(Zβ, I) for some Z. From the usual least squares estimate of β using v, obtain the generalised least squares estimate of β using y.
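A small sketch of this construction in R, assuming y, X and a positive definite V have already been defined (none of these objects are supplied by the question):

e <- eigen(solve(V), symmetric = TRUE)
R <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)  # principal square root of V^{-1}
v <- R %*% y
Z <- R %*% X
beta_gls <- solve(t(Z) %*% Z, t(Z) %*% v)  # ordinary least squares applied to v and Z
# This should agree with solve(t(X) %*% solve(V, X), t(X) %*% solve(V, y)).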
5. In this exercise, we look at the dangers of overfitting. Generate some observations from a simple linear regression:
> set.seed(3)
> X <- cbind(rep(1,100), 1:100)
> beta <- c(0, 1)
> y <- X %*% beta + rnorm(100)
Put aside some of the data for testing and some for fitting:
> Xfit <- X[1:50,]
> yfit <- y[1:50]
> Xtest <- X[51:100,]
> ytest <- y[51:100]
(a) Using only the fitting data, estimate β and hence the residual sum of squares. Also calculate the residual sum of squares for the test data, that is, (ytest − Xtestβ̂)ᵀ(ytest − Xtestβ̂), where β̂ is the estimate obtained from the fitting data. (A sketch of these calculations is given after this question.)
Now add 10 extra predictor variables which we know have nothing to do with the response:
> X <- cbind(X, matrix(runif(1000), 100, 10))
> Xtest <- X[51:100,]
> Xfit <- X[1:50,]
Again using only the fitting data, fit the linear model y = Xβ + ε, and show that the residual sum of squares has reduced (this has to happen). Then show that the residual sum of squares for the test data has gone up (this happens most of the time).
Explain what is going on.
(b) Repeat the above, but this time add x², x³ and x⁴ terms:
> X <- cbind(X[, 1:2], (1:100)^2, (1:100)^3, (1:100)^4)
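A sketch of the calculations used in parts (a) and (b), via the matrix approach; rerun these three lines after each change to Xfit and Xtest:

beta_hat <- solve(t(Xfit) %*% Xfit, t(Xfit) %*% yfit)  # least squares estimate from the fitting data
rss_fit  <- sum((yfit  - Xfit  %*% beta_hat)^2)        # residual sum of squares, fitting data
rss_test <- sum((ytest - Xtest %*% beta_hat)^2)        # residual sum of squares, test data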
Programming questions
1. What will be the output of the following code? Try to answer this without typing it up.
fb <- function(n) {
  if (n == 1 || n == 2) {
    return(1)
  } else {
    return(fb(n - 1) + fb(n - 2))
  }
}
fb(8)
2. Suppose you needed all 2ⁿ binary sequences of length n. One way of generating them is with nested for loops. For example, the following code generates a matrix binseq, where each row is a different binary sequence of length three.
> binseq <- matrix(nrow = 8, ncol = 3)
> r <- 0 # current row of binseq
> for (i in 0:1) {
+ for (j in 0:1) {
+ for (k in 0:1) {
+ r <- r + 1
+ binseq[r,] <- c(i, j, k)
+ }
+ }
+ }
> binseq
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 1
[3,] 0 1 0
[4,] 0 1 1
[5,] 1 0 0
[6,] 1 0 1
[7,] 1 1 0
[8,] 1 1 1
Clearly this approach will get a little tedious for large n. An alternative is to use recursion. Suppose that A is a matrix of size 2ⁿ × n, where each row is a different binary sequence of length n. Then the following matrix contains all binary sequences of length n + 1:

[ 0  A ]
[ 1  A ].
Here 0 is a vector of zeros and 1 is a vector of ones.
Use this idea to write a recursive function binseq, which takes as input an integer n and returns a matrix containing all binary sequences of length n, as rows of the matrix. You should find the functions cbind and rbind particularly useful.
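A sketch of one possible implementation of this idea (not the only way to write it):

binseq <- function(n) {
  # base case: all binary sequences of length 1
  if (n == 1) {
    return(matrix(c(0, 1), nrow = 2, ncol = 1))
  }
  A <- binseq(n - 1)
  # prepend a column of zeros to A, a column of ones to A, and stack the results
  rbind(cbind(0, A), cbind(1, A))
}
binseq(3)  # should reproduce the 8 x 3 matrix shown earlier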
3. Let A = (a_{i,j})_{i,j=1}^{n} be a square matrix, and denote by A(−i,−j) the matrix with row i and column j removed. If A is a 1 × 1 matrix then det(A), the determinant of A, is just a_{1,1}. For n × n matrices we have, for any i,

det(A) = Σ_{j=1}^{n} (−1)^{i+j} a_{i,j} det(A(−i,−j)).
Use this to write a recursive function to calculate det(A).
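A sketch of one way to write such a function, expanding along the first row (the formula holds for any i, so i = 1 is an arbitrary but convenient choice):

det_rec <- function(A) {
  n <- nrow(A)
  # base case: 1 x 1 matrix
  if (n == 1) {
    return(A[1, 1])
  }
  total <- 0
  for (j in 1:n) {
    # A[-1, -j, drop = FALSE] is A with row 1 and column j removed
    total <- total + (-1)^(1 + j) * A[1, j] * det_rec(A[-1, -j, drop = FALSE])
  }
  total
}
det_rec(matrix(c(2, 1, 1, 3), 2, 2))  # should agree with det()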