$30
1. Linear Classifiers (logistic regression and GDA)
In this problem, we cover two probabilistic linear classifiers we have covered in class so far. First, a discriminative linear classifier: logistic regression. Second, a generative linear classifier: Gaussian discriminant analysis (GDA). Both the algorithms find a linear decision boundary that separates the data into two classes, but make different assumptions. Our goal in this problem is to get a deeper understanding of the similarities and differences (and, strengths and weaknesses) of these two algorithms.
For this problem, we will consider two datasets, provided in the following files:
i. data/ds1_{train,valid}.csv ii. data/ds2_{train,valid}.csv
Each file contains m examples, one example (x(i),y(i)) per row. In particular, the i-th row contains columns x(0i) ∈ R, x(1i) ∈ R, and y(i) ∈ {0,1}. In the subproblems that follow, we will investigate using logistic regression and Gaussian discriminant analysis (GDA) to perform binary classification on these two datasets.
(a) In lecture we saw the average empirical loss for logistic regression:
,
where y(i) ∈ {0,1}, hθ(x) = g(θTx) and g(z) = 1/(1 + e−z).
Find the Hessian H of this function, and show that for any vector z, it holds true that
zTHz ≥ 0.
Hint:′ You may want to start by showing that Pi Pj zixixjzj = (xTz)2 ≥ 0. Recall also that g (z) = g(z)(1 − g(z)).
Remark: This is one of the standard ways of showing that the matrix H is positive semidefinite, written “ 0.” This implies that J is convex, and has no local minima other than the global one. If you have some other way of showing 0, you’re also welcome to use your method instead of the one above. Answer:
(b) Coding problem. Follow the instructions in src/p01blogreg.py to train a logistic regression classifier using Newton’s Method. Starting with θ = ~0, run Newton’s Method until the updates to θ are small: Specifically, train until the first iteration k such that kθk − θk−1k1 < ǫ, where ǫ = 1 ×10−5. Make sure to write your model’s predictions to the file specified in the code. Answer:
(c) Recall that in GDA we model the joint distribution of (x,y) by the following equations:
,
where φ, µ0, µ1, and Σ are the parameters of our model.
Suppose we have already fit φ, µ0, µ1, and Σ, and now want to predict y given a new point x. To show that GDA results in a classifier that has a linear decision boundary, show the posterior distribution can be written as
,
where θ ∈ Rn and θ0 ∈ R are appropriate functions of φ, Σ, µ0, and µ1. Answer:
(d) For this part of the problem only, you may assume n (the dimension of x) is 1, so that Σ = [σ2] is just a real number, and likewise the determinant of Σ is given by |Σ| = σ2. Given the dataset, we claim that the maximum likelihood estimates of the parameters are given by
1 m { (i) = 1} m X
φ =1 y
i=1
1 y
mi=1 { (i) = 0}x(i)
µ0 =mi=1 {y(i) = 0}
Pm 1{y(i) = 1}x(i) 1
µ1 =i=1mi=1 1{y(i) = 1}
m
Σ = 1 X (i) − µy(i))(x(i) − µy(i))T
(x m
i=1
The log-likelihood of the data is
.
By maximizing ℓ with respect to the four parameters, prove that the maximum likelihood estimates of φ, µ0,µ1, and Σ are indeed as given in the formulas above. (You may assume that there is at least one positive and one negative example, so that the denominators in the definitions of µ0 and µ1 above are non-zero.) Answer:
(e) Coding problem. In src/p01egda.py, fill in the code to calculate φ, µ0, µ1, and Σ, use these parameters to derive θ, and use the resulting GDA model to make predictions on the validation set. Answer:
(f) For Dataset 1, create a plot of the validation set with x1 on the horizontal axis, and x2 on the vertical axis. To visualize the two classes, use a different symbol for examples x(i) with y(i) = 0 than for those with y(i) = 1. On the same figure, plot the decision boundary found by logistic regression in part (b). Make an identical plot with the decision boundary found by GDA in part (e). Answer:
(g) Repeat the steps in part (f) for Dataset 2. On which dataset does GDA seem toperform worse than logistic regression? Why might this be the case? Answer:
(h) For the dataset where GDA performed worse in parts (f) and (g), can you find a transformation of the x(i)’s such that GDA performs significantly better?
What is this transformation? Answer:
2. Incomplete, Positive-Only Labels
In this problem we will consider training binary classifiers in situations where we do not have full access to the labels. In particular, we consider a scenario, which is not too infrequent in real life, where we have labels only for a subset of the positive examples. All the negative examples and the rest of the positive examples are unlabelled.
That is, we assume a dataset , where t(i) ∈ {0,1} is the “true” label, and where
1 x(i) is labeled 0 otherwise.
y(i) = (
All labeled examples are positive, which is to say p(t(i) = 1 | y(i) = 1) = 1, but unlabeled examples may be positive or negative. Our goal in the problem is to construct a binary classifier h of the true label t, with only access to the partial labels y. In other words, we want to construct h such that h(x(i)) ≈ p(t(i) = 1 | x(i)) as closely as possible, using only x and y.
Real world example: Suppose we maintain a database of proteins which are involved in transmitting signals across membranes. Every example added to the database is involved in a signaling process, but there are many proteins involved in cross-membrane signaling which are missing from the database. It would be useful to train a classifier to identify proteins that should be added to the database. In our notation, each example x(i) corresponds to a protein, y(i) = 1 if the protein is in the database and 0 otherwise, and t(i) = 1 if the protein is involved in a cross-membrane signaling process and thus should be added to the database, and 0 otherwise.
(a) Suppose that each y(i) and x(i) are conditionally independent given t(i):
p(y(i) = 1 | t(i) = 1,x(i)) = p(y(i) = 1 | t(i) = 1).
Note this is equivalent to saying that labeled examples were selected uniformly at random from the set of positive examples. Prove that the probability of an example being labeled differs by a constant factor from the probability of an example being positive. That is, show that p(t(i) = 1 | x(i)) = p(y(i) = 1 | x(i))/α for some α ∈ R. Answer:
(b) Suppose we want to estimate α using a trained classifier h and a held-out validation set V . Let V+ be the set of labeled (and hence positive) examples in V , given by V+ = {x(i) ∈ V | y(i) = 1}. Assuming that h(x(i)) ≈ p(y(i) = 1 | x(i)) for all examples x(i), show that h(x(i)) ≈ α for all x(i) ∈ V+.
You may assume that p(t(i) = 1 | x(i)) ≈ 1 when x(i) ∈ V+. Answer:
(c) Coding problem. The following three problems will deal with a dataset which we have provided in the following files:
data/ds3_{train,valid,test}.csv
Each file contains the following columns: x1, x2, y, and t. As in Problem 1, there is one example per row.
First we will consider the ideal case, where we have access to the true t-labels for training. In src/p02cdeposonly, write a logistic regression classifier that uses x1 and x2 as input features, and train it using the t-labels (you can ignore the y-labels for this part). Output the trained model’s predictions on the test set to the file specified in the code. Answer:
(d) Coding problem. We now consider the case where the t-labels are unavailable, so you only have access to the y-labels at training time. Add to your code in p02cdeposonly.py to re-train the classifier (still using x1 and x2 as input features), but using the y-labels only. Answer:
(e) Coding problem. Using the validation set, estimate the constant α by averaging your classifier’s predictions over all labeled examples in the validation set:
.
Add code in src/p02cdeposonly.py to rescale your classifier’s predictions from part (d) using the estimated value for α.
Finally, using a threshold of p(t(i) = 1 | x(i)) = 0.5, make three separate plots with the decision boundaries from parts (c) - (e) plotted on top of the test set. Plot x1 on the horizontal axis and x2 on the vertical axis, and use two different symbols for the positive (t(i) = 1) and negative (t(i) = 0) examples. In each plot, indicate the separating hyperplane with a red line. Answer:
Remark: We saw that the true probability p(t | x) was only a constant factor away from p(y | x). This means, if our task is to only rank examples (i.e. sort them) in a particular order (e.g, sort the proteins in order of being most likely to be involved in transmitting signals across membranes), then in fact we do not even need to estimate α. The rank based on p(y | x) will agree with the rank based on p(t | x).
3. Poisson Regression
(a) Consider the Poisson distribution parameterized by λ:
.
Show that the Poisson distribution is in the exponential family, and clearly state the values for b(y), η, T(y), and a(η). Answer:
(b) Consider performing regression using a GLM model with a Poisson response variable. What is the canonical response function for the family? (You may use the fact that a Poisson random variable with parameter λ has mean λ.) Answer:
(c) For a training set {(x(i),y(i)); i = 1,...,m}, let the log-likelihood of an example be logp(y(i)|x(i);θ). By taking the derivative of the log-likelihood with respect to θj, derive the stochastic gradient ascent update rule for learning using a GLM model with Poisson responses y and the canonical response function. Answer:
(d) Coding problem. Consider a website that wants to predict its daily traffic. The website owners have collected a dataset of past traffic to their website, along with some features which they think are useful in predicting the number of visitors per day. The dataset is split into train/valid/test sets and follows the same format as Datasets 1-3:
data/ds4_{train,valid}.csv
We will apply Poisson regression to model the number of visitors per day. Note that applying Poisson regression in particular assumes that the data follows a Poisson distribution whose natural parameter is a linear combination of the input features (i.e., η = θTx). In src/p03dpoisson.py, implement Poisson regression for this dataset and use gradient ascent to maximize the log-likelihood of θ. Answer:
4. Convexity of Generalized Linear Models
In this question we will explore and show some nice properties of Generalized Linear Models, specifically those related to its use of Exponential Family distributions to model the output.
Most commonly, GLMs are trained by using the negative log-likelihood (NLL) as the loss function. This is mathematically equivalent to Maximum Likelihood Estimation (i.e., maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood). In this problem, our goal is to show that the NLL loss of a GLM is a convex function w.r.t the model parameters. As a reminder, this is convenient because a convex function is one for which any local minimum is also a global minimum.
To recap, an exponential family distribution is one whose probability density can be represented
p(y;η) = b(y)exp(ηTT(y) − a(η)),
where η is the natural parameter of the distribution. Moreover, in a Generalized Linear Model, η is modeled as θTx, where x ∈ Rn are the input features of the example, and θ ∈ Rn are learnable parameters. In order to show that the NLL loss is convex for GLMs, we break down the process into sub-parts, and approach them one at a time. Our approach is to show that the second derivative (i.e., Hessian) of the loss w.r.t the model parameters is Positive Semi-Definite (PSD) at all values of the model parameters. We will also show some nice properties of Exponential Family distributions as intermediate steps.
For the sake of convenience we restrict ourselves to the case where η is a scalar. Assume p(Y |X;θ) ∼ ExponentialFamily(η), where η ∈ R is a scalar, and T(y) = y. This makes the exponential family representation take the form
p(y;η) = b(y)exp(ηy − a(η)).
(a) [5 points] Derive an expression for the mean of the distribution. Show that
(note that E[Y ;η] = E[Y | X;η] since η = θTx). In other words, show that the mean of an exponential family distribution is the first derivative of the log-partition function with respect to the natural parameter.
Hint: Start with observing that .
Answer:
(b) Next, derive an expression for the variance of the distribution. In particular,show that Var( ) (again, note that Var(Y ;η) = Var(Y | X;θ)). In other words, show that the variance of an exponential family distribution is the second derivative of the log-partition function w.r.t. the natural parameter.
Hint: Building upon the result in the previous sub-problem can simplify the derivation. Answer:
(c) Finally, write out the loss function ℓ(θ), the NLL of the distribution, as a function of θ. Then, calculate the Hessian of the loss w.r.t θ, and show that it is always PSD. This concludes the proof that NLL loss of GLM is convex.
Hint 1: Use the chain rule of calculus along with the results of the previous parts to simplify your derivations.
Hint 2: Recall that variance of any probability distribution is non-negative. Answer:
Remark: The main takeaways from this problem are: • Any GLM model is convex in its model parameters.
• The exponential family of probability distributions are mathematically nice. Whereas calculating mean and variance of distributions in general involves integrals (hard), surprisingly we can calculate them using derivatives (easy) for exponential family.
5. Locally weighted linear regression
(a) Consider a linear regression problem in which we want to “weight” different training examples differently. Specifically, suppose we want to minimize
.
In class, we worked out what happens for the case where all the weights (the w(i)’s) are the same. In this problem, we will generalize some of those ideas to the weighted setting.
i. Show that J(θ) can also be written
J(θ) = (Xθ − y)TW(Xθ − y)
for an appropriate matrix W, and where X and y are as defined in class. Clearly specify the value of each element of the matrix W. ii. [4 points] If all the w(i)’s equal 1, then we saw in class that the normal equation is
XTXθ = XTy,
and that the value of θ that minimizes J(θ) is given by (XTX)−1XTy. By finding the derivative ∇θJ(θ) and setting that to zero, generalize the normal equation to this weighted setting, and give the new value of θ that minimizes J(θ) in closed form as a function of X, W and y.
iii. Suppose we have a dataset {(x(i),y(i)); i = 1...,m} of m independent examples, but we model the y(i)’s as drawn from conditional distributions with different levels of variance (σ(i))2. Specifically, assume the model
That is, each y(i) is drawn from a Gaussian distribution with mean θTx(i) and variance (σ(i))2 (where the σ(i)’s are fixed, known, constants). Show that finding the maximum likelihood estimate of θ reduces to solving a weighted linear regression problem. State clearly what the w(i)’s are in terms of the σ(i)’s.
Answer:
(b) Coding problem. We will now consider the following dataset (the formatting matches that of Datasets 1-4, except x(i) is 1-dimensional):
data/ds5_{train,valid,test}.csv
In src/p05blwr.py, implement locally weighted linear regression using the normal equations you derived in Part (a) and using
.
Train your model on the train split using τ = 0.5, then run your model on the valid split and report the mean squared error (MSE). Finally plot your model’s predictions on the validation set (plot the training set with blue ‘x’ markers and the validation set with a red ‘o’ markers). Does the model seem to be under- or overfitting? Answer:
(c) Coding problem. We will now tune the hyperparameter τ. In src/p05ctau.py, find the MSE value of your model on the validation set for each of the values of τ specified in the code. For each τ, plot your model’s predictions on the validation set in the format described in part (b). Report the value of τ which achieves the lowest MSE on the valid split, and finally report the MSE on the test split using this τ-value. Answer: