Geometry of Least Squares
1. Suppose we have a dataset represented by the design matrix $X$ and response vector $Y$. We fit a linear regression and obtain the optimal weights $\hat{\theta}$. Draw the geometric interpretation of the column space of the design matrix $\mathrm{span}(X)$, the response vector $Y$, the residuals $Y - X\hat{\theta}$, and the predictions $X\hat{\theta}$ (using the optimal parameters) and $X\alpha$ (using an arbitrary vector $\alpha$).
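The question asks for a drawing, but it can help to see the same objects numerically first. Below is a minimal numpy sketch (the data and variable names are made up for illustration, not part of the problem) that computes each quantity named above: the optimal weights $\hat{\theta}$, the predictions $X\hat{\theta}$, the residuals $Y - X\hat{\theta}$, and an alternative prediction $X\alpha$.

```python
import numpy as np

# Toy data (hypothetical): n = 5 observations, p = 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
Y = rng.normal(size=5)

# Optimal weights from the normal equations: theta_hat = (X^T X)^{-1} X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = X @ theta_hat          # predictions using the optimal parameters
residuals = Y - Y_hat          # residual vector Y - X theta_hat

alpha = np.array([1.0, -2.0])  # an arbitrary parameter vector
Y_alpha = X @ alpha            # some other point in span(X)

# Geometric picture: X theta_hat is the point in span(X) closest to Y,
# so the residual vector is perpendicular to every column of X.
print(X.T @ residuals)                                           # approx. [0, 0]
print(np.linalg.norm(Y - Y_hat) <= np.linalg.norm(Y - Y_alpha))  # True
```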
(a) What is always true about the residuals in least squares regression? Select all that apply.
□ A. They are orthogonal to the column space of the design matrix.
□ B. They represent the errors of the predictions.
□ C. Their sum is equal to the mean squared error.
□ D. Their sum is equal to zero.
□ E. None of the above.
(b) Which are true about the predictions made by OLS? Select all that apply.
□ A. They are projections of the observations onto the column space of the design matrix.
□ B. They are linear combinations of the features.
□ C. They are orthogonal to the residuals.
□ D. They are orthogonal to the column space of the features.
□ E. None of the above.
(c) We fit a simple linear regression to our data $(x_i, y_i)$, $i = 1, 2, 3$, where $x_i$ is the independent variable and $y_i$ is the dependent variable. Our regression line is of the form $\hat{y} = \hat{\theta}_0 + \hat{\theta}_1 x$. Suppose we plot the residuals of the model against the fitted values $\hat{y}$ and find that the residuals follow a curved pattern. What does this tell us about our model? (A short plotting sketch of such a pattern follows the options below.)
□ A. The relationship between our dependent and independent variables is well represented by a line.
□ B. The accuracy of the regression line varies with the size of the dependent variable.
□ C. The variables need to be transformed, or additional independent variables are needed.
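To make the scenario in (c) concrete, here is a small illustrative sketch (simulated data, not from the problem) in which the true relationship is quadratic but a line is fit; the residuals-versus-fitted plot then shows a clear curve.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data where the true relationship is quadratic,
# but we fit a line y_hat = theta0 + theta1 * x.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.5, size=x.size)

theta1, theta0 = np.polyfit(x, y, deg=1)   # slope, intercept of the fitted line
y_hat = theta0 + theta1 * x
residuals = y - y_hat

# Residuals vs. fitted values: the systematic U-shape ("a curve") signals
# that a straight line does not capture the relationship.
plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values y_hat")
plt.ylabel("residuals y - y_hat")
plt.show()
```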
Understanding Dimensions
2. In this exercise, we will examine many of the terms that we have been working with in regression (e.g., $\hat{\theta}$) and connect them to their dimensions and to the concepts that they represent.
First, we define some notation. The $n \times p$ design matrix $X$ has $n$ observations on $p$ features. (In lecture, we stated that we sometimes say $X$ corresponds to $p+1$ features, where the additional feature is a column of all 1s for the intercept term, but strictly speaking that column doesn't need to exist. In this problem, one of the $p$ columns may be a column of all 1s.) $Y$ is the response variable. It is a vector containing the true response for all observations. We assume in this problem that we use $X$ and $Y$ to compute optimal parameters $\hat{\theta}$ for a linear model, and that this linear model generates predictions using $\hat{Y} = X\hat{\theta}$, as we saw in lecture and in Question 1 of this discussion. Each of the $n$ rows of our design matrix $X$ contains all features for a single observation. Each of the $p$ columns of our design matrix $X$ contains a single feature for all observations. We denote the rows and columns of $X$ as follows:
$X_{:,j}$: the $j$th column vector in $X$, for $j = 1, \ldots, p$
$X_{i,:}$: the $i$th row vector in $X$, for $i = 1, \ldots, n$
Below, on the left, we have several expressions, labelled a through h, and on the right we have several terms, labelled 1 through 10. For each expression, determine its shape (e.g., $n \times p$), and match it to one of the given terms. Terms may be used more than once or not at all. If a specific expression is nonsensical because the dimensions don't line up for a matrix multiplication, write "N/A" for both.
Expressions:
(a) $X$
(b) $\hat{\theta}$
(c) $X_{:,j}$
(d) $X_{1,:} \cdot \hat{\theta}$
(e) $X_{:,1} \cdot \hat{\theta}$
(f) $X\hat{\theta}$
(g) $(X^T X)^{-1} X^T Y$
(h) $(I - X(X^T X)^{-1} X^T)\,Y$

Terms:
1. the residuals
2. 0
3. 1st response, $y_1$
4. 1st predicted value, $\hat{y}_1$
5. 1st residual, $e_1$
6. the estimated coefficients
7. the predicted values
8. the features for a single observation
9. the value of a specific feature for all observations
10. the design matrix
As an example, for 2a, you would write: "2a. Dimension: $n \times p$, Term: 10".
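If it helps to check your reasoning about shapes, here is a small numpy sketch with made-up sizes ($n = 6$, $p = 3$, purely illustrative). Note that it represents vectors as 1-D arrays, so an $n \times 1$ column vector appears as shape (n,).

```python
import numpy as np

n, p = 6, 3                      # hypothetical sizes for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(n, p))      # design matrix, shape (n, p)
Y = rng.normal(size=n)           # response vector, shape (n,)

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y, shape (p,)

print(X.shape)                      # (n, p)          -> expression (a)
print(theta_hat.shape)              # (p,)            -> expressions (b) and (g)
print(X[:, 0].shape)                # (n,)            -> expression (c) with j = 1
print((X[0, :] @ theta_hat).shape)  # () , a scalar   -> expression (d)
print((X @ theta_hat).shape)        # (n,)            -> expression (f)

# Expression (e): X_{:,1} . theta_hat multiplies an n-vector by a p-vector;
# unless n happens to equal p, the shapes do not line up.
try:
    X[:, 0] @ theta_hat
except ValueError as err:
    print("shape mismatch:", err)

# Expression (h): shape (n,)
h = (np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T) @ Y
print(h.shape)                      # (n,)
```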