1. Bias and Variance Decomposition for the $\ell_2$-regularized Mean Estimator – 25pts. This exercise helps you become more comfortable with bias and variance calculations and with the bias-variance decomposition. It focuses on the simplified setting of the mean estimator.
Consider a random variable $Y$ with an unknown distribution $p$. This random variable has an (unknown) mean $\mu = \mathbb{E}[Y]$ and variance $\sigma^2 = \mathrm{Var}[Y]$. Consider a dataset $D = \{Y_1, \dots, Y_n\}$ with independently sampled $Y_i \sim p$.
(a) [3pt] Show that the sample average estimator $h_{\text{avg}}$ is the solution of the following optimization problem:
$$h_{\text{avg}}(D) \leftarrow \operatorname*{argmin}_{m \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} |Y_i - m|^2.$$
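Recall that we have the following bias-variance decomposition for the mean estimator $h(D)$:
$$\mathbb{E}_D\big[|h(D) - \mu|^2\big] = \underbrace{\big|\mathbb{E}_D[h(D)] - \mu\big|^2}_{\text{bias}} + \underbrace{\mathbb{E}_D\big[|h(D) - \mathbb{E}_D[h(D)]|^2\big]}_{\text{variance}}.$$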
(b) [2pt] Compute the bias and variance of $h_{\text{avg}}(D)$.
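(c) [5pt] Consider the $\ell_2$-regularized mean estimator $h_\lambda(D)$, defined for any $\lambda \ge 0$ by
$$h_\lambda(D) \leftarrow \operatorname*{argmin}_{m \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} |Y_i - m|^2 + \lambda |m|^2.$$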
Notice that this is similar to ridge regression, but only for a single random variable. Provide an explicit formula for the estimator $h_\lambda(D)$ in terms of the sample average and $\lambda$.
(d) [6pt] Compute the bias and variance of this regularized estimator.
(e) [5pt] Visualize $\mathbb{E}_D\big[|h_\lambda(D) - \mu|^2\big]$, the bias, and the variance terms for a range of $\lambda$. As a starting point, you can choose $\mu = 1$, $\sigma^2 = 9$, and $n = 10$.
(f) [4pt] Explain the visualization from the previous part, in terms of the effect of bias and variance on the expected squared error.
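The decomposition used in this question can be checked numerically before plotting anything. The following is a minimal sketch, not a reference solution: the function name `bias_variance_mc` is hypothetical, and sampling the data from a Gaussian is an assumption made only so that datasets can be simulated (the question leaves $p$ unspecified). It estimates the expected squared error, the squared bias, and the variance of an arbitrary estimator over many simulated datasets.

```python
import numpy as np

def bias_variance_mc(estimator, mu=1.0, sigma2=9.0, n=10, trials=20000, seed=0):
    """Monte Carlo estimates of squared error, squared bias and variance.

    `estimator` maps one dataset (a 1-D array of length n) to a scalar.
    Gaussian data is an assumption of this sketch, not of the exercise.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated dataset D = {Y_1, ..., Y_n}.
    data = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
    estimates = np.array([estimator(d) for d in data])
    mse = np.mean((estimates - mu) ** 2)
    bias_sq = (np.mean(estimates) - mu) ** 2
    variance = np.var(estimates)  # population variance (ddof=0)
    return mse, bias_sq, variance

mse, bias_sq, variance = bias_variance_mc(np.mean)
print(f"squared error={mse:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Any other estimator, such as a regularized one, can be passed in place of `np.mean` to compare its three terms across settings.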
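2. Poisson regression: Recall that the squared error has a probabilistic justification. If we assume that the target values are contaminated with zero-mean Gaussian noise, the maximum likelihood estimate coincides with the minimizer of the least squares loss. For count-valued targets, a more natural model is the Poisson distribution with rate $\lambda > 0$, whose probability mass function is
$$p(Z = k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}. \tag{2.1}$$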
This is the probability that the number of events Z is k ∈ {0,1,...}. It can be shown that the expected value and the variance of this random variable are both λ, that is, E[Z] = λ and Var[Z] = λ.
The Poisson distribution can be used to model the number of events in a given amount of time or space. Some examples are the number of calls to a call centre in an hour, network packets arriving at a router in a minute, meteorites hitting Earth each year, goals or scores in a soccer or hockey match, bikes used at a bicycle sharing facility per hour, and “soldiers killed by horse-kicks each year in the Prussian cavalry”.
(a) [5pt] Suppose that we are given a dataset $\{Z_1, \dots, Z_N\}$, all independently drawn from a Poisson distribution (2.1) with unknown parameter $\lambda$. Derive the maximum likelihood estimate of $\lambda$. Show the individual steps and note where you use the data assumptions (independently and identically distributed).
Hint: Use the standard approach to derive the MLE via $\operatorname*{argmax}_{\lambda} \log \prod_{i=1}^{N} p(Z_i; \lambda)$.
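A numerical sanity check can be useful once the derivation is done. The sketch below evaluates the Poisson log-likelihood on a grid of candidate rates for synthetic data and picks the maximizer; a derived closed-form estimate should land close to it. The true rate of 4.0, the sample size, and the grid are hypothetical choices, and a grid search is only a stand-in for the analytic argument the question asks for.

```python
import numpy as np
from math import lgamma

def poisson_log_likelihood(lam, z, log_factorial_sum):
    """Log-likelihood of i.i.d. Poisson samples z at rate lam.

    Independence turns the log of the product into a sum over samples.
    """
    return float(np.sum(z) * np.log(lam) - len(z) * lam - log_factorial_sum)

rng = np.random.default_rng(0)
true_rate = 4.0  # hypothetical value used to generate synthetic data
z = rng.poisson(true_rate, size=2000)

# log(k!) does not depend on lam, so it is computed once.
log_factorial_sum = sum(lgamma(k + 1.0) for k in z)

grid = np.linspace(0.5, 10.0, 951)  # candidate rates, step 0.01
values = np.array([poisson_log_likelihood(lam, z, log_factorial_sum) for lam in grid])
lam_grid = grid[np.argmax(values)]
print("grid maximizer:", lam_grid)
```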
(b) [5pt] The Poisson regression model is based on considering the λ parameter of the Poisson distribution a function of x and a weight w, which should be determined. The relation is specified by
$$\log \lambda(x; w) = w^\top x.$$
Here the logarithm of the rate has a linear model, as opposed to the rate itself. This ensures that the rate is always non-negative.
This rate function leads to the following statistical model for target y ∈ {0,1,...} conditioned on the input x:
$$p(y \mid x; w) = \frac{\exp(y\, w^\top x)\, \exp\!\big(-e^{w^\top x}\big)}{y!}.$$
Derive the maximum likelihood estimator of $w$ for this model given a dataset consisting of i.i.d. input and response variables $\{(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})\}$. Note that the resulting estimator does not have a closed-form solution, so you only have to simplify the resulting loss function as far as possible.
Locally Weighted Regression is a non-parametric algorithm, that is, the model does not learn a fixed set of parameters as is done in regression with a linear model. Rather, parameters are computed individually for each data point x. The next two questions help you derive the locally weighted regression.
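(c) [5pt] The weighted least squares cost uses positive weights $a^{(1)}, \dots, a^{(N)}$, one per datapoint, to construct a parameter estimate of the form:
$$w^* \leftarrow \operatorname*{argmin}_{w} \frac{1}{2} \sum_{i=1}^{N} a^{(i)} \big(y^{(i)} - w^\top x^{(i)}\big)^2.$$
Show that the solution to this optimization problem is given by the formula
$$w^* = \big(X^\top A X\big)^{-1} X^\top A y,$$
where $X$ is the design matrix (defined in class) and $A$ is a diagonal matrix with $A_{ii} = a^{(i)}$.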
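A quick numerical check can complement the proof for part (c). The sketch below assumes the closed form is the usual weighted normal-equations solution $(X^\top A X)^{-1} X^\top A y$ and compares it with an ordinary least squares solve on rows rescaled by $\sqrt{a^{(i)}}$, which minimizes the same weighted cost. The data, weights, and function name are hypothetical.

```python
import numpy as np

def weighted_least_squares(X, y, a):
    """Solve the weighted normal equations (X^T A X) w = X^T A y, A = diag(a)."""
    XtA = X.T * a  # same as X.T @ np.diag(a), without forming the N x N matrix
    return np.linalg.solve(XtA @ X, XtA @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # synthetic design matrix
y = rng.normal(size=50)               # synthetic targets
a = rng.uniform(0.1, 2.0, size=50)    # positive per-datapoint weights

w_closed = weighted_least_squares(X, y, a)

# Rescaling each row by sqrt(a_i) turns the weighted cost into a plain one.
w_rescaled, *_ = np.linalg.lstsq(np.sqrt(a)[:, None] * X, np.sqrt(a) * y, rcond=None)
print(np.allclose(w_closed, w_rescaled))
```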
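(d) [5pt] Locally weighted least squares combines ideas from k-NN and linear regression. For each query $x$, we first compute distance-based weights for each training example
$$a^{(i)}(x) = \frac{\exp\!\big(-\|x - x^{(i)}\|^2 / (2\tau^2)\big)}{\sum_{j=1}^{N} \exp\!\big(-\|x - x^{(j)}\|^2 / (2\tau^2)\big)}$$
for some temperature parameter $\tau > 0$. We then construct a local solution
$$w^*(x) \leftarrow \operatorname*{argmin}_{w} \frac{1}{2} \sum_{i=1}^{N} a^{(i)}(x) \big(y^{(i)} - w^\top x^{(i)}\big)^2.$$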
The prediction then takes the form $\hat{y} = w^*(x)^\top x$. How does this algorithm behave as the temperature $\tau \to 0$? What happens as $\tau \to \infty$? What is the disadvantage of this approach compared to least squares regression in terms of computational complexity?
3. Implementing Regression Methods in Python – 55 pts. This question will take you step-by-step through implementing and empirically studying several regression methods on the Capital Bikesharing dataset. We are interested in predicting the number of used bikes on a per-hour basis based on calendar features (e.g., time, weekday, holiday) and environmental measurements (e.g., humidity, temperature).
• sklearn
• matplotlib
• numpy
• pandas
• autograd
3.1. Initial data analysis – 15 pts.
(a) [0pt] Load the Capital Bikesharing Dataset from the downloaded hour.csv file as a pandas dataframe.
(b) [2pt] Describe and summarize the data in terms of number of data points, dimensions, and used data types.
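Loading and summarizing the data might look like the following minimal sketch. It assumes `hour.csv` sits in the working directory; the helper names `load_hourly_data` and `summarize` are hypothetical. The functions are only defined here, so nothing is read from disk until they are called.

```python
import pandas as pd

def load_hourly_data(path="hour.csv"):
    """Read the hourly bike-sharing file into a DataFrame (path is an assumption)."""
    return pd.read_csv(path)

def summarize(df):
    """Return the number of rows, number of columns, and a count of column dtypes."""
    n_rows, n_cols = df.shape
    dtype_counts = df.dtypes.astype(str).value_counts().to_dict()
    return {"n_rows": n_rows, "n_cols": n_cols, "dtype_counts": dtype_counts}

# Typical use:
#   df = load_hourly_data()
#   print(summarize(df))
#   print(df.dtypes)
```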
(c) [5pt] Present a single grid containing plots for each feature against the target. Choose the appropriate axis for dependent vs. independent variables.
Hint: use the pyplot.tight_layout function to make your grid readable.
(d) [5pt] Perform a correlation analysis on the data and plot the correlation matrix as a colored image. State which feature is the most positively, most negatively, and least correlated with the target column cnt.
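One way to approach the correlation analysis is sketched below. It uses Pearson correlation (the `DataFrame.corr` default) and assumes a pandas version that supports the `numeric_only` argument; the helper names are hypothetical. "Least correlated" is read here as smallest absolute correlation, which is an interpretation rather than something the question states.

```python
import pandas as pd

def target_correlations(df, target="cnt"):
    """Correlation of every numeric feature with the target, sorted ascending."""
    corr = df.corr(numeric_only=True)
    return corr[target].drop(target).sort_values()

def plot_correlation_matrix(df):
    """Draw the correlation matrix as a coloured image and return the figure."""
    import matplotlib.pyplot as plt  # imported lazily so the ranking works without it

    corr = df.corr(numeric_only=True)
    fig, ax = plt.subplots(figsize=(8, 7))
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    return fig

# Typical use, given a DataFrame df with a cnt column:
#   ranked = target_correlations(df)
#   most_negative, most_positive = ranked.index[0], ranked.index[-1]
#   least_correlated = ranked.abs().idxmin()
```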
(e) [1pt] Drop the following columns from the dataframe: instant, atemp, registered, casual, dteday.
(f) [2pt] Shuffle the dataframe’s rows using sklearn.utils.shuffle with random_state=0. Split the data into a training set and a test set at index 10000.
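Parts (e) and (f) can be combined into one preprocessing step, sketched below. The helper name `prepare_split` is hypothetical, and the sketch assumes that the first 10000 shuffled rows form the training set and the remainder the test set.

```python
import pandas as pd
from sklearn.utils import shuffle

DROPPED_COLUMNS = ["instant", "atemp", "registered", "casual", "dteday"]

def prepare_split(df, split_index=10000):
    """Drop the listed columns, shuffle rows reproducibly, and split by position."""
    df = df.drop(columns=DROPPED_COLUMNS)
    df = shuffle(df, random_state=0)
    train = df.iloc[:split_index]
    test = df.iloc[split_index:]
    return train, test

# Typical use, given the full DataFrame df:
#   train_df, test_df = prepare_split(df)
#   X_train, y_train = train_df.drop(columns="cnt"), train_df["cnt"]
```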
3.2. Regression implementations – 40 pts.
(a) [5pt] Implement the ordinary least squares regression algorithm as discussed in class. You are free to implement the closed form solution.
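A minimal sketch of ordinary least squares follows. It prepends a constant column for the intercept, which is an assumption about how the bias term is handled, and it solves the least squares problem with `numpy.linalg.lstsq` rather than forming the matrix inverse of the closed form explicitly; both give the same solution when $X^\top X$ is invertible. The function names are hypothetical.

```python
import numpy as np

def add_bias(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_ols(X, y):
    """Return weights minimizing ||add_bias(X) w - y||^2."""
    w, *_ = np.linalg.lstsq(add_bias(X), np.asarray(y, dtype=float), rcond=None)
    return w

def predict_ols(X, w):
    """Predict targets for the rows of X using fitted weights w."""
    return add_bias(X) @ w

# Small noiseless example: the fit should recover the generating weights.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]
w_demo = fit_ols(X_demo, y_demo)
print(w_demo)
```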
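(b) [3pt] Fit the ordinary least squares regression model to the pre-processed data. Report the coefficient of determination, also known as the $R^2$ score, of your model. The $R^2$ score between the correct target labels $y$ and the predicted target labels $\hat{y}$ can be calculated as:
$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{N} \big(y^{(i)} - \hat{y}^{(i)}\big)^2}{\sum_{i=1}^{N} \big(y^{(i)} - \bar{y}\big)^2}, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y^{(i)}.$$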
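The $R^2$ score can be computed in a few lines. The sketch below uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true targets; the function name is hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot

print(r2_score([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2_score([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

A perfect prediction scores 1, and always predicting the mean of the targets scores 0.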
Hint: use pandas.get_dummies.
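For reference, `pandas.get_dummies` replaces a categorical column with one indicator column per category. The toy frame below is hypothetical and only illustrates the call; which columns of the bike-sharing data should be treated as categorical is left to the reader.

```python
import pandas as pd

# Hypothetical toy data: "season" is treated as categorical, "temp" as numeric.
toy = pd.DataFrame({"season": [1, 2, 2, 4], "temp": [0.2, 0.5, 0.4, 0.1]})

# Only the listed columns are expanded; the others are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["season"])
print(list(encoded.columns))
```

Each new column is named after the original column and the category value, which makes it possible to report the one-hot index of a feature later on.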
(d) [2pt] Re-fit the model with the new data and report the updated R2 score.
(e) [5pt] Implement the locally weighted regression algorithm as described in Question 2.
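A minimal sketch of a locally weighted prediction for a single query is given below. It assumes Gaussian-style weights proportional to $\exp(-\|x - x^{(i)}\|^2 / (2\tau^2))$, normalized to sum to one, and a weighted normal-equations solve. The tiny `ridge` term is an addition of this sketch for numerical stability, not part of the stated objective, and the function name is hypothetical.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=1.0, ridge=1e-8):
    """Predict y at x_query with locally weighted least squares."""
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    logits = -sq_dists / (2.0 * tau ** 2)
    logits -= logits.max()          # subtract the max so exp cannot overflow
    a = np.exp(logits)
    a /= a.sum()                    # normalized weights, one per training point
    XtA = X.T * a                   # X^T A without building the diagonal matrix
    w = np.linalg.solve(XtA @ X + ridge * np.eye(X.shape[1]), XtA @ y)
    return float(x_query @ w)

# On exactly linear data every local fit recovers the same line.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
w_true = np.array([1.5, -0.5])
y_demo = X_demo @ w_true
x_new = np.array([0.3, -0.2])
print(lwr_predict(x_new, X_demo, y_demo, tau=1.0))
```

Because the weights depend on the query, a new linear system is solved for every prediction.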
(f) [5pt] Fit the locally weighted regression model to the data with τ = 1. Report the R2 score of your model. Verify whether and describe how the expected behaviour for τ → 0 and τ → ∞ as described in your answer to Question 2 holds.
(g) [2pt] Plot a histogram of the target variable. What distribution does the target follow?
(h) [5pt] Implement the Poisson regression algorithm as described in Question 2.
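Since the Poisson regression estimator has no closed form, it is fitted iteratively. The sketch below assumes the log-linear rate $\lambda(x; w) = \exp(w^\top x)$ and minimizes the average negative log-likelihood with the $\log y!$ constant dropped. It uses plain gradient descent with a hand-written gradient; the step size, iteration count, and synthetic data are hypothetical choices, and an automatic differentiation package could supply the gradient instead.

```python
import numpy as np

def poisson_nll(w, X, y):
    """Average negative log-likelihood, omitting the constant log(y!) term."""
    eta = X @ w
    return float(np.mean(np.exp(eta) - y * eta))

def poisson_nll_grad(w, X, y):
    """Gradient of poisson_nll with respect to w."""
    return X.T @ (np.exp(X @ w) - y) / X.shape[0]

def fit_poisson(X, y, lr=0.1, steps=2000):
    """Plain gradient descent from w = 0; lr and steps are illustrative."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * poisson_nll_grad(w, X, y)
    return w

# Synthetic check: counts generated from a known weight vector.
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(500), rng.normal(size=500)])  # bias + one feature
w_true = np.array([0.5, 0.3])
y_demo = rng.poisson(np.exp(X_demo @ w_true))
w_hat = fit_poisson(X_demo, y_demo)
print(w_hat)
```

On real data with larger feature scales or counts, a fixed step size may diverge, so features are usually standardized or the step size reduced.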
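(i) [3pt] Fit the Poisson model to the data. Report the fraction of explained Tweedie deviance, also known as the D score, of your model. The D score between the correct target labels $y$ and the predicted target labels $\hat{y}$ can be calculated as:
$$D(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{N} \Big[ y^{(i)} \log\!\big(y^{(i)} / \hat{y}^{(i)}\big) - \big(y^{(i)} - \hat{y}^{(i)}\big) \Big]}{\sum_{i=1}^{N} y^{(i)} \log\!\big(y^{(i)} / \bar{y}\big)}, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y^{(i)}.$$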
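The D score can be sketched as one minus the ratio of the model's deviance to the deviance of a constant predictor at the mean of the targets. The code below assumes the Poisson case of the Tweedie deviance and the convention $0 \log 0 = 0$; the function names are hypothetical.

```python
import numpy as np

def poisson_deviance(y_true, y_pred):
    """Poisson deviance 2 * sum(y log(y / y_hat) - y + y_hat), with 0 log 0 = 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    safe_y = np.where(y_true > 0, y_true, 1.0)  # avoids log(0) for zero counts
    log_term = np.where(y_true > 0, y_true * np.log(safe_y / y_pred), 0.0)
    return 2.0 * float(np.sum(log_term - y_true + y_pred))

def d_score(y_true, y_pred):
    """Fraction of deviance explained relative to predicting the mean count."""
    y_true = np.asarray(y_true, dtype=float)
    null_pred = np.full_like(y_true, y_true.mean())
    return 1.0 - poisson_deviance(y_true, y_pred) / poisson_deviance(y_true, null_pred)

print(d_score([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(d_score([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

As with $R^2$, a perfect prediction scores 1 and the constant mean predictor scores 0. Predictions must be strictly positive for the logarithm to be defined.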
(j) [5pt] For Linear Regression and Poisson Regression, report the final weights. Which are the most and least significant features in each model? Justify your answer. If the feature is categorical, report both the feature name and the one-hot index. Briefly explain why visualizing weight contributions is not meaningful for locally weighted linear regression.