Short Answer and “True or False” Conceptual Questions
1. These questions should be answerable without referring to external materials. Please include a brief explanation for each T/F question as well.
a. [2 points] In your own words, describe what bias and variance are. What is the bias-variance tradeoff?
b. [2 points] What happens to bias and variance when model complexity increases/decreases?
c. [1 point] True or False: The bias of a model increases as the amount of available training data increases.
d. [1 point] True or False: The variance of a model decreases as the amount of available training data increases.
e. [1 point] True or False: A learning algorithm will always generalize better if we use fewer features to represent our data.
f. [2 points] To obtain superior performance on new unseen data, should we use the training set or the test set to tune our hyperparameters?
g. [1 point] True or False: The training error of a function on the training set provides an overestimate of the true error of that function.
h. [1 point] True or False: Using L2 regularization when training a linear regression model encourages it to use fewer input features when making a prediction.
Maximum Likelihood Estimation (MLE)
2. Consider a model consisting of random variables X, Y, and Z: Y = Xw + Z, where Z ∼ U[−0.5, 0.5]. Assume that Z is noise here (i.e., it is independent of X) and w ∈ R is a fixed parameter (i.e., it is not random).
a. [5 points] Derive the probability density function (pdf) of Y conditioning on {X = x}.
b. [5 points] Assume that you have n points of training data {(X1,Y1),(X2,Y2),··· ,(Xn,Yn)} generated i.i.d. in the above setting. Derive a maximum likelihood estimator of w for the conditional distribution of Y given {X = x}. Assume that Xi > 0 for all i = 1,...,n, and note that the MLE in this case may not be unique; you are only required to report one particular estimate.
3. You’re a Reign FC fan, and the team is five games into its 2018 season. The number of goals scored by the team in each game so far is given below:
[2,0,1,1,2].
Let’s call these scores x1,...,x5. Based on your (assumed iid) data, you’d like to build a model to understand how many goals the Reign are likely to score in their next game. You decide to model the number of goals scored per game using a Poisson distribution. The Poisson distribution with parameter λ assigns every non-negative integer x = 0,1,2,... a probability given by
Poi(x | λ) = e^(−λ) λ^x / x! .
So, for example, if λ = 1.5, then the probability that the Reign score 2 goals in their next game is e^(−1.5) · 1.5^2 / 2! ≈ 0.25. To check your understanding of the Poisson, make sure you have a sense of whether raising λ will mean more goals in general, or fewer.
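As a quick numerical sanity check of that value (not part of the required derivation), a few lines of Python reproduce it directly from the pmf:

import math

lam, x = 1.5, 2
# Poi(x | lambda) = e^(-lambda) * lambda^x / x!
p = math.exp(-lam) * lam**x / math.factorial(x)
print(p)  # approximately 0.2510, i.e. roughly 0.25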
a. [5 points] Derive an expression for the maximum-likelihood estimate of the parameter λ governing the Poisson distribution, in terms of your goal counts x1,...,x5. (Hint: remember that the log of the likelihood has the same maximum as the likelihood function itself.)
b. [5 points] Suppose the team scores 4 goals in its sixth game. Derive the same expression for the estimate of the parameter λ as in the prior example, now using the 6 games x1,...,x5,x6 = 4.
c. [5 points] Given the goal counts, please give numerical estimates of λ after 5 and 6 games.
4. [10 points] In World War 2, the Allies attempted to estimate the total number of tanks the Germans had manufactured by looking at the serial numbers of the German tanks they had destroyed. The idea was that if there were n total tanks with serial numbers {1,...,n} then it is reasonable to expect the observed serial numbers of the destroyed tanks constituted a uniform random sample (without replacement) from this set. The exact maximum likelihood estimator for this so-called German tank problem is non-trivial and quite challenging to work out (try it!). For our homework, we will consider a much easier problem with a similar flavor. Let x1,...,xn be independent, uniformly distributed on the continuous domain [0,θ] for some θ. What is the maximum likelihood estimate of θ?
Overfitting
5. Suppose we have N labeled samples drawn i.i.d. from an underlying distribution D.
Suppose we decide to break this set into a set Strain of size Ntrain and a set Stest of size Ntest samples for our training and test set, so N = Ntrain +Ntest, and S = Strain ∪Stest. Recall the definition of the true least squares error of f:
ε_true(f) = E_(x,y)∼D[(f(x) − y)^2],
where the subscript (x,y) ∼ D makes clear that our input-output pairs are sampled according to D. Our training and test losses are defined as:
ε̂_train(f) = (1/N_train) Σ_{(x,y)∈S_train} (f(x) − y)^2,
ε̂_test(f) = (1/N_test) Σ_{(x,y)∈S_test} (f(x) − y)^2.
We then train our algorithm (for example, using linear least squares regression) using the training set to obtain f̂.
a. [6 points] (bias: the test error) Define E_train as the expectation over all training sets S_train and E_test as the expectation over all test sets S_test. For all fixed f (before we’ve seen any data) show that
E_train[ε̂_train(f)] = E_test[ε̂_test(f)] = ε_true(f).
Use a similar line of reasoning to show that the test error is an unbiased estimate of our true error for f̂. Specifically, show that:
E_test[ε̂_test(f̂)] = ε_true(f̂).
b. [5 points] (bias: the train/dev error) Is the above equation true (in general) with regard to the training loss? Specifically, does E_train[ε̂_train(f̂)] equal E_train[ε_true(f̂)]? If so, why? If not, give a clear argument as to where your previous argument breaks down.
c. [8 points] Let F = (f_1, f_2, ...) be a collection of functions and let f̂_train minimize the training error, so that ε̂_train(f̂_train) ≤ ε̂_train(f) for all f ∈ F. Show that
E_train,test[ε̂_train(f̂_train)] ≤ E_train,test[ε̂_test(f̂_train)].
(Hint: note that
E_train,test[ε̂_test(f̂_train)] = Σ_{f∈F} E_train,test[1{f̂_train = f} ε̂_test(f)]
= Σ_{f∈F} E_train[1{f̂_train = f}] E_test[ε̂_test(f)]
= Σ_{f∈F} P_train(f̂_train = f) E_test[ε̂_test(f)],
where the second equality follows from the independence between the train and test set.)
Polynomial Regression
Relevant Files[1]
• polyreg.py
• linreg_closedform.py
• test_polyreg_univariate.py
• test_polyreg_learningCurve.py
• data/polydata.dat
6. [15 points] Recall that polynomial regression learns a function h_θ(x) = θ_0 + θ_1 x + θ_2 x^2 + ... + θ_d x^d. In this case, d represents the polynomial’s degree. We can equivalently write this in the form of a linear model
h_θ(x) = θ_0 φ_0(x) + θ_1 φ_1(x) + θ_2 φ_2(x) + ... + θ_d φ_d(x) , (1)
using the basis expansion φ_j(x) = x^j. Notice that, with this basis expansion, we obtain a linear model where the features are various powers of the single univariate x. We’re still solving a linear regression problem, but are fitting a polynomial function of the input.
Implement regularized polynomial regression in polyreg.py. You may implement it however you like, using gradient descent or a closed-form solution. However, I would recommend the closed-form solution since the data sets are small; for this reason, we’ve included an example closed-form implementation of linear regression in linreg_closedform.py (you are welcome to build upon this implementation, but make CERTAIN you understand it, since you’ll need to change several lines of it). You are also welcome to build upon your implementation from the previous assignment, but you must follow the API below. Note that all matrices are actually 2D numpy arrays in the implementation.
• __init__(degree=1, regLambda=1E-8): constructor with arguments of d and λ
• fit(X,Y): method to train the polynomial regression model
• predict(X): method to use the trained polynomial regression model for prediction
• polyfeatures(X, degree): expands the given n×1 matrix X into an n×d matrix of polynomial features of degree d. Note that the returned matrix will not include the zero-th power.
Note that the polyfeatures(X, degree) function maps the original univariate data into its higher-order powers. Specifically, X will be an n × 1 matrix (X ∈ R^(n×1)) and this function will return the polynomial expansion of this data, an n × d matrix. Note that this function will not add in the zero-th order feature (i.e., x^0 = 1). You should add the x^0 feature separately, outside of this function, before training the model. Not including the x^0 column in polyfeatures() keeps the function more general, so it could also be applied to multivariate data. (If it did add the x^0 feature, we’d end up with multiple columns of 1’s for multivariate data.)
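For reference, the sketch below shows one possible way polyfeatures could be written under these assumptions (X an n × 1 numpy array); it is illustrative only, and your implementation may differ:

import numpy as np

def polyfeatures(X, degree):
    # X is an n x 1 array; return an n x degree array whose j-th column is X**(j+1).
    # The zero-th power (column of ones) is intentionally NOT included here.
    X = np.asarray(X).reshape(-1, 1)
    return np.hstack([X ** d for d in range(1, degree + 1)])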
Also, notice that the resulting features will be badly scaled if we use them in raw form. For example, with a polynomial of degree d = 8 and x = 20, the basis expansion yields x^1 = 20 while x^8 = 2.56 × 10^10 – an absolutely huge difference in range. Consequently, we will need to standardize the data before solving linear regression. Standardize the data in fit() after you perform the polynomial feature expansion. You’ll need to apply the same standardization transformation in predict() before you apply it to new data.
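As a rough sketch of this fit/predict flow (the class name and attribute names such as self.mean, self.std, and self.theta are illustrative, not required by the API; this reuses the polyfeatures sketch above, and whether to penalize the intercept term is left to you):

import numpy as np

class PolynomialRegression:
    def __init__(self, degree=1, regLambda=1E-8):
        self.degree = degree
        self.regLambda = regLambda

    def fit(self, X, y):
        Phi = polyfeatures(X, self.degree)              # polynomial expansion (no x^0 column)
        self.mean, self.std = Phi.mean(axis=0), Phi.std(axis=0)
        Phi = (Phi - self.mean) / self.std              # standardize using training statistics
        Phi = np.c_[np.ones(len(Phi)), Phi]             # add the x^0 column of ones
        # regularized closed-form solution
        self.theta = np.linalg.solve(Phi.T @ Phi + self.regLambda * np.eye(Phi.shape[1]),
                                     Phi.T @ y)

    def predict(self, X):
        Phi = polyfeatures(X, self.degree)              # same expansion as in fit()
        Phi = (Phi - self.mean) / self.std              # same standardization as in fit()
        Phi = np.c_[np.ones(len(Phi)), Phi]
        return Phi @ self.theta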
Figure 1: Fit of polynomial regression with λ = 0 and d = 8
Run test_polyreg_univariate.py to test your implementation, which will plot the learned function. In this case, the script fits a polynomial of degree d = 8 with no regularization (λ = 0). From the plot, we see that the function fits the data well, but will not generalize well to new data points. Try increasing the amount of regularization, and examine the resulting effect on the function.
7. [15 points] In this problem we will examine the bias-variance tradeoff through learning curves, which provide a valuable mechanism for evaluating it. Implement the learningCurve() function in polyreg.py to compute the learning curves for a given training/test set. The learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda) function should take in the training data (Xtrain, ytrain), the testing data (Xtest, ytest), and values for the polynomial degree d and regularization parameter λ.
The function should return two arrays, errorTrain (the array of training errors) and errorTest (the array of testing errors). The ith index (starting from 0) of each array should contain the training error (or testing error) for learning with i + 1 training instances. Note that the 0th index actually won’t matter, since we typically start displaying the learning curves with two or more instances.
When computing the learning curves, you should learn on Xtrain[0:i] for i = 1,...,numInstances(Xtrain)+1, each time computing the testing error over the entire test set. There is no need to shuffle the training data, or to average the error over multiple trials – just produce the learning curves for the given training/testing sets with the instances in their given order. Recall that the error for regression problems is given by
error = (1/n) Σ_{i=1}^n (h_θ(x_i) − y_i)^2 . (2)
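One possible shape for this loop, as a sketch only (it assumes the PolynomialRegression class sketched in the previous problem; your variable and class names may differ):

import numpy as np

def learningCurve(Xtrain, ytrain, Xtest, ytest, degree, regLambda):
    n = len(Xtrain)
    errorTrain, errorTest = np.zeros(n), np.zeros(n)
    for i in range(1, n):                                    # index 0 is unused, as noted above
        model = PolynomialRegression(degree=degree, regLambda=regLambda)
        model.fit(Xtrain[0:(i + 1)], ytrain[0:(i + 1)])      # learn on the first i + 1 instances
        errorTrain[i] = np.mean((model.predict(Xtrain[0:(i + 1)]) - ytrain[0:(i + 1)]) ** 2)
        errorTest[i] = np.mean((model.predict(Xtest) - ytest) ** 2)   # error over the entire test set
    return errorTrain, errorTest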
Run test_polyreg_learningCurve.py to plot the learning curves for various values of d and λ, and notice the following:
• The y-axis is using a log-scale and the ranges of the y-scale are all different for the plots. The dashed black line indicates the y = 1 line as a point of reference between the plots.
• The plot of the unregularized model with d = 1 shows poor training error, indicating a high bias (i.e., it is a standard univariate linear regression fit).
• The plot of the unregularized model (λ = 0) with d = 8 shows that the training error is low, but that the testing error is high. There is a huge gap between the training and testing errors caused by the model overfitting the training data, indicating a high variance problem.
• As the regularization parameter increases (e.g., λ = 1) with d = 8, we see that the gap between the training and testing error narrows, with both the training and testing errors converging to a low value. We can see that the model fits the data well and generalizes well, and therefore does not have either a high bias or a high variance problem. Effectively, it has a good tradeoff between bias and variance.
• Once the regularization parameter is too high (λ = 100), we see that the training and testing errors are once again high, indicating a poor fit. Effectively, there is too much regularization, resulting in high bias.
Please include both your code and the generated plots in your homework. Make absolutely certain that you understand these observations, and how they relate to the learning curve plots. In practice, we can choose the value for λ via cross-validation to achieve the best bias-variance tradeoff.
Ridge Regression on MNIST
8. In this problem we will implement a regularized least squares classifier for the MNIST data set. The task is to classify handwritten images of numbers between 0 and 9.
You are NOT allowed to use any of the prebuilt classifiers in sklearn. Feel free to use any method from numpy or scipy. Remember: if you are inverting a matrix in your code, you are probably doing something wrong (Hint: look at scipy.linalg.solve).
Get the data from https://pypi.python.org/pypi/python-mnist. Load the data as follows:

import numpy as np
from mnist import MNIST

def load_dataset():
    mndata = MNIST('./data/')
    X_train, labels_train = map(np.array, mndata.load_training())
    X_test, labels_test = map(np.array, mndata.load_testing())
    X_train = X_train / 255.0
    X_test = X_test / 255.0
    return X_train, labels_train, X_test, labels_test
Each example has features x_i ∈ R^d (with d = 28 × 28 = 784) and label z_i ∈ {0,...,9}. You can visualize a single example x_i with imshow after reshaping it to its original 28 × 28 image shape (and noting that the corresponding label z_i matches the image). We wish to learn a predictor f̂ that takes as input a vector in R^d and outputs an index in {0,...,9}.
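For example, assuming matplotlib is available, something along these lines will display one training example and its label:

import matplotlib.pyplot as plt

X_train, labels_train, X_test, labels_test = load_dataset()
i = 0                                                  # any example index
plt.imshow(X_train[i].reshape(28, 28), cmap='gray')    # back to its 28 x 28 image shape
plt.title('label: %d' % labels_train[i])
plt.show()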
We define our training and testing classification error on a predictor f as
ε̂_train(f) = (1/N_train) Σ_{(x,z)∈Training Set} 1{f(x) ≠ z},
ε̂_test(f) = (1/N_test) Σ_{(x,z)∈Test Set} 1{f(x) ≠ z}.
We will use a one-hot encoding of the labels: for a pair (x, z), the original label z ∈ {0,...,9} is mapped to the standard basis vector e_z, where e_z is a vector of all zeros except for a 1 in the zth position. We adopt the notation where we have n data points in our training objective with features x_i ∈ R^d and labels one-hot encoded as y_i ∈ {0,1}^k, where in this case k = 10 since there are 10 digits.
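One possible way to build the one-hot label matrix (the variable names here are illustrative only):

import numpy as np

k = 10
n = labels_train.shape[0]
Y_train = np.zeros((n, k))
Y_train[np.arange(n), labels_train] = 1.0    # place a 1 in the z-th position of each row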
a. [10 points] In this problem we will choose a linear classifier to minimize the regularized least squares objective:
Ŵ = argmin_{W ∈ R^(d×k)} Σ_{i=1}^n ||W^T x_i − y_i||_2^2 + λ||W||_F^2 .
Note that ||W||_F corresponds to the Frobenius norm of W, i.e. ||W||_F^2 = Σ_{i=1}^d Σ_{j=1}^k W_{i,j}^2. To classify a point x_i we will use the rule argmax_{j=0,...,9} e_j^T Ŵ^T x_i. Note that if W = [w_1 ... w_k], then
Σ_{i=1}^n ||W^T x_i − y_i||_2^2 + λ||W||_F^2 = Σ_{j=1}^k ( ||X w_j − Y e_j||_2^2 + λ||w_j||_2^2 ),
where X = [x_1 ... x_n]^T ∈ R^(n×d) and Y = [y_1 ... y_n]^T ∈ {0,1}^(n×k). Show that
Ŵ = (X^T X + λI)^(−1) X^T Y .
b. [10 points]
• Code up a function train that takes as input X ∈ R^(n×d), Y ∈ {0,1}^(n×k), λ > 0 and returns Ŵ ∈ R^(d×k).
• Code up a function predict that takes as input W ∈ R^(d×k), X′ ∈ R^(m×d) and returns an m-length vector with the ith entry equal to argmax_{j∈{0,...,9}} e_j^T W^T x′_i, where x′_i ∈ R^d is a column vector representing the ith example from X′. (A minimal sketch of both functions appears after this list.)
• Train Ŵ on the MNIST training data with λ = 10^(−4) and make label predictions on the test data. What are the training and testing errors? Note that they should both be about 15%.
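As a minimal sketch of these two functions, using scipy.linalg.solve as suggested by the hint above (argument handling and variable names are illustrative and may differ in your code):

import numpy as np
from scipy import linalg

def train(X, Y, reg_lambda):
    # Solve (X^T X + lambda * I) W = X^T Y rather than forming an explicit inverse.
    d = X.shape[1]
    return linalg.solve(X.T @ X + reg_lambda * np.eye(d), X.T @ Y)

def predict(W, Xprime):
    # Row-wise argmax of X' W picks the predicted digit for each example.
    return np.argmax(Xprime @ W, axis=1)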
[1] Bold text indicates files or functions that you will need to complete; you should not need to modify any of the other files.