1 Bias and variance of ridge regression
Ridge regression solves the regularized least squares problem

β̂_τ = argmin_β (y − Xβ)^⊤(y − Xβ) + τ β^⊤β

with regularization parameter τ ≥ 0. Regularization introduces some bias into the solution in order to achieve a potentially large gain in variance. Assume that the true model is y = Xβ^* + ε with zero-mean Gaussian noise ε and centered features (1/N) Σ_i X_i = 0 (note that these assumptions imply that y is also centered in expectation). Prove that the expectation and covariance matrix of the regularized solution (both taken over all possible training sets of size N) are then given by
E[β̂_τ] = S_τ^{−1} S β^*        Cov[β̂_τ] = σ² S_τ^{−1} S S_τ^{−1}
where S and S_τ are the ordinary and regularized scatter matrices:
S = X^⊤X        S_τ = X^⊤X + τ I_D
Notice that expectation and covariance reduce to the corresponding expressions of ordinary least squares (as derived in the lecture) when τ = 0:
E[β̂] = β^*        Cov[β̂] = σ² S^{−1}
Since S_τ is greater than S (in any norm), regularization has a shrinking effect on both expectation and covariance.
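The following is not part of the requested proof, only a small simulation sketch (all sizes and names are illustrative) that can be used to sanity-check the two formulas by averaging ridge solutions over many noisy training sets drawn for a fixed design matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, tau, sigma = 200, 3, 5.0, 0.5          # illustrative sizes
    X = rng.normal(size=(N, D))
    X -= X.mean(axis=0)                          # centered features
    beta_true = np.array([1.0, -2.0, 0.5])       # beta^* of the true model

    S = X.T @ X
    S_tau = S + tau * np.eye(D)
    S_tau_inv = np.linalg.inv(S_tau)

    # draw many training sets (same X, fresh noise) and collect the ridge solutions
    estimates = np.array([S_tau_inv @ X.T @ (X @ beta_true + sigma * rng.normal(size=N))
                          for _ in range(20000)])

    print("max deviation from S_tau^-1 S beta*          :",
          np.abs(estimates.mean(axis=0) - S_tau_inv @ S @ beta_true).max())
    print("max deviation from sigma^2 S_tau^-1 S S_tau^-1:",
          np.abs(np.cov(estimates.T) - sigma**2 * S_tau_inv @ S @ S_tau_inv).max())

Both printed deviations should shrink towards zero as the number of simulated training sets grows.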
2 Denoising of a CT image (11 points)
In the last task of exercise 5, we diminished the radiation dose for our patient by using fewer projection angles. However, this introduced noise in the resulting tomogram. Ridge regression offers a simple possibility to reduce the noise level[1].
To apply regularization to tomographic reconstruction, the D × D scaled identity matrix √τ I_D must be appended at the bottom of the weight matrix X from exercise 5, i.e. X′ = [X; √τ I_D]. Correspondingly, the vector y must be extended with D zeros, y′ = [y; 0_D]. Then (y′ − X′β)^⊤(y′ − X′β) = (y − Xβ)^⊤(y − Xβ) + τ β^⊤β, so ordinary least squares on the augmented system yields the ridge solution.
Add a new parameter to the function construct_X() to activate regularization:

X = construct_X(M, alphas, Np=None, tau=0)

For tau=0, the function shall behave exactly like the original version from exercise 5. If tau>0, it shall return the augmented matrix X′. Reconstruct the tomogram for 64 angles with τ = 0, 1, 10, 100, 1000, 10000 and display the resulting CT images. Find the value of τ with the best compromise between image sharpness and noise.
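A minimal sketch of the augmentation step, assuming the unregularized builder from exercise 5 is available (here under the hypothetical name construct_X_base()) and returns a scipy.sparse matrix with D = M·M columns:

    import numpy as np
    import scipy.sparse as sp

    def construct_X(M, alphas, Np=None, tau=0):
        X = construct_X_base(M, alphas, Np)        # hypothetical: exercise-5 matrix
        if tau == 0:
            return X                               # behave exactly like before
        D = M * M
        reg = np.sqrt(tau) * sp.identity(D, format="csr")
        return sp.vstack([X, reg], format="csr")   # append sqrt(tau)*I_D below X

    # the response vector is padded with D zeros accordingly:
    # y_aug = np.concatenate([y, np.zeros(M * M)])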
Compare the ridge regression results with Gaussian filtering, another popular noise-reduction technique, which replaces each pixel with a weighted average of its neighbors (see https://en.wikipedia.org/wiki/Gaussian_filter for more details). A suitable implementation is provided by the function gaussian_filter in the module scipy.ndimage.filters. Apply gaussian_filter with filter sizes sigma = 1, 2, 3, 5, 7 to the CT image with τ = 0, display the results, and comment on the similarities and differences between ridge regression and Gaussian filtering.
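A minimal usage sketch for this comparison (the variable tomogram is assumed to hold the 2-D reconstruction obtained with τ = 0):

    import matplotlib.pyplot as plt
    from scipy.ndimage import gaussian_filter

    for sigma in (1, 2, 3, 5, 7):
        plt.figure()
        plt.title(f"gaussian_filter, sigma = {sigma}")
        plt.imshow(gaussian_filter(tomogram, sigma=sigma), cmap="gray")
    plt.show()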
3 Automatic feature selection for regression
Sparse regression regularizes the solution such that unimportant elements of β̂ are set to zero. The corresponding columns of X could be dropped from the matrix without any effect on the solution; the remaining columns are obviously more useful as an explanation for the response y. Therefore, sparse regression can be interpreted as a method for relevant feature identification.
In exercise 2, you implemented dimension reduction from a full image of a digit to just two features by selecting important pixels manually[2]. In this exercise, we will automatically find relevant pixels by means of sparse regression.
3.1 Implement Orthogonal Matching Pursuit (5 points)
Orthogonal Matching Pursuit[3] is a simple greedy sparse regression algorithm. It approximates the exact algorithms for least squares with L0 or L1 regularization (cf. the lecture for details) and is defined as:
Initialization:
Inputs: the feature matrix X, the response vector y, and T (the desired number of non-zero elements in the final solution β^(T)).
Define the initial sets of active resp. inactive columns as
A^(0) = ∅ and B^(0) = {j : j = 1...D}.
Define the initial residual r^(0) = y.
Iteration: for t = 1...T do
1. Find the inactive column that has maximal correlation with the current residual:
j^(t) = argmax_{j ∈ B^(t−1)} |X_j^⊤ r^(t−1)|
2. Move j^(t) from the inactive set B to the active set A.
3. Form the active matrix X^(t) consisting of all currently active columns of X.
4. Solve the least squares problem β̂^(t) = argmin_β (y − X^(t)β)^⊤(y − X^(t)β).
5. Update the residual r^(t) = y − X^(t)β̂^(t).
Implement this algorithm as a function

solutions = omp_regression(X, y, T)

which returns β̂^(t) for t = 1...T as a D × T matrix, i.e. the inactive elements in each solution are not dropped, but explicitly set to zero.
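A minimal NumPy sketch of the algorithm above; masking the already-active columns and using np.linalg.lstsq for step 4 are implementation choices, not part of the specification:

    import numpy as np

    def omp_regression(X, y, T):
        """Orthogonal Matching Pursuit; column t-1 of the returned D x T matrix
        holds the sparse solution with t active features (all others are zero)."""
        N, D = X.shape
        active = []                                # A^(t): indices of active columns
        residual = y.astype(float)                 # r^(0) = y
        solutions = np.zeros((D, T))
        for t in range(T):
            corr = np.abs(X.T @ residual)          # correlation with current residual
            corr[active] = -np.inf                 # never re-select an active column
            j = int(np.argmax(corr))
            active.append(j)
            beta_active, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
            residual = y - X[:, active] @ beta_active
            solutions[active, t] = beta_active     # inactive entries stay zero
        return solutions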
3.2 Classification with sparse LDA (8 points)
We again use the digits dataset of scikit-learn and classify ‘1’ vs. ‘7’. For balanced training sets, LDA can be formulated as a least squares problem, where the rows X_i are the flattened images of each digit, and

y_i = { +1 if instance i is a ‘1’
      { −1 if instance i is a ‘7’
are the desired responses. Now execute omp_regression() with sufficiently big T to get the sequence of sparse LDA solutions for t = 1...T.
Report the error rate on the test set for t = 1...T. How many pixels should be used for acceptable error rates? Is it necessary/beneficial to standardize the data before training and testing?
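A hypothetical driver for this subsection, assuming omp_regression() from 3.1; the train/test split parameters and T are illustrative, and standardization is deliberately left out so that its effect can be studied separately:

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    digits = load_digits()
    mask = np.isin(digits.target, [1, 7])
    X = digits.data[mask]                                  # flattened 8x8 images
    y = np.where(digits.target[mask] == 1, 1.0, -1.0)      # +1 for '1', -1 for '7'
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=0)

    T = 20
    solutions = omp_regression(X_train, y_train, T)        # D x T
    for t in range(T):
        pred = np.sign(X_test @ solutions[:, t])
        print(f"t = {t + 1:2d}   test error = {np.mean(pred != y_test):.3f}")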
Visualize in which order the pixels are switched to active as t increases, and show whether a pixel votes in favour of or against class ‘1’. What is a good criterion for this distinction? Compare these results with your hand-crafted feature selection in exercise 2: did you select the same pixels?
3.3 One-against-the-rest classification (8 points)
A one-against-the-rest classifier is actually a collection of C classifiers, one for each class k. To construct it, one uses the given training data to create C auxiliary training sets with target responses

y_i^(k) = { +1 if instance i has class k
          { −1 if instance i has class ≠ k
The positive subset contains all instances of class k, and the negative subset an equally sized random selection from all other classes. Use these auxiliary training sets to train a sparse LDA for the digits k = 0...9 with suitable T (all classifiers should use the same T).
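One possible way to build a single auxiliary training set with a balanced random negative subset (a sketch; X_train and labels_train are assumed to hold the flattened training images and their digit labels):

    import numpy as np

    def auxiliary_set(X_train, labels_train, k, rng):
        pos = np.flatnonzero(labels_train == k)                       # all instances of class k
        neg_pool = np.flatnonzero(labels_train != k)
        neg = rng.choice(neg_pool, size=len(pos), replace=False)      # equally sized negative subset
        idx = np.concatenate([pos, neg])
        return X_train[idx], np.where(labels_train[idx] == k, 1.0, -1.0)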
At testing, a new instance is rated by all C classifiers and assigned to the class whose classifier returns the largest response, or to the additional class unknown if all classifiers return a negative response. Report the confusion matrix of your predictor on the test set.
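A sketch of this decision rule; betas is assumed to be a D × 10 array whose column k holds the sparse LDA weight vector trained for digit k, and −1 is used here to encode the class unknown:

    import numpy as np

    def predict_one_vs_rest(X_test, betas):
        responses = X_test @ betas                     # one response per class
        labels = np.argmax(responses, axis=1)          # class with the largest response
        labels[np.all(responses < 0, axis=1)] = -1     # all responses negative -> "unknown"
        return labels

The confusion matrix can then be computed, e.g., with sklearn.metrics.confusion_matrix applied to the true and predicted labels.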
Why is it useful to introduce the class unknown? Parameterize your classifier such that a few test instances are actually assigned to this class and visualize these instances. What do you observe?
[1] You will notice that the improvements don’t look too impressive. Humans tend to prefer the noisy, but unbiased version of the image, because the eye’s built-in noise suppression is way more powerful than ridge regression. In other words, the eye and the algorithm have different bias-variance trade-offs.
[2] or computing your own features
[3] not to be confused with its simpler cousin Matching Pursuit