$25
This homework contains two questions. The first question is about parameter estimation. In the second question, you will implement logistic regression using stochastic gradient descent (SGD). The maximum number of points is 100 points. For this homework, you should review some material on parameter estimation, logistic regression, stochastic gradient decent, feature normalization, and performance measurement.
1 Question 1 – Parameter estimation (45 points)
This question uses a discrete probability distribution known as the Poisson distribution. A discrete random variable X follows a Poisson distribution with parameter λ if
The Poisson distribution is a useful discrete distribution which can be used to model the number of occurrences of something per unit time. For example, it can be used to model the number of customers arriving at a bank per hour, or the number of calls handled by a telephone operator per minute.
1.1 Question 1.1 – MLE (15 points)
Assume the wait time for calling an Uber car is Poisson distributed (i.i.d) with parameter λ. You used Uber seven times and record the wait times in each trip.
Trip 1 2 3 4 5 6 7
Wait time 4 12 3 5 6 9 17
Let X = (X1,...,Xn) be a random vector where Xi is the number of delay minutes for your ith trip:
1. (5 points) Give the log-likelihood function of X given λ.
2. (5 points) Compute the MLE for λ in the general case.
3. (5 point) Compute the MLE for λ using the observed X.
1.2 Question 1.2 – MAP (15 points)
Now let’s be Bayesian and put a prior over the parameter λ. You talk to your party-goer friends, who tell you about the expected delays that they experience. You plot the distribution of the expected delays and your extensive experience in statistics tells you that the plot resembles a Gamma distribution pdf. So you believe a good prior distribution for λ may be a Gamma distribution. The Gamma distribution has pdf:
. (1)
Γ(α) is the Gamma function, which is the normalization factor that is introduced to have a proper probability density function (i.e., sum to 1). Do not worry if you don’t know the explicit format of Γ(α); it is not important for this question. Also, if λ ∼ Γ(α,β), then it has mean α/β and the mode is (α − 1)/β for α > 1. Assume the prior distribution of λ is Γ(λ|α,β).
1. (8 points) Compute the posterior distribution over λ. Hint:
looks like a Gamma distribution! Is the rest of the expression constant with respect to λ? Working out a messy integral can lead to the answer but it is unnecessary.
2. (7 points) Derive an analytic expression for the maximum a posterior (MAP) estimate of λ.
1.3 Question 1.3 – Estimator Bias (15 points)
In class, we saw that the maximum likelihood estimator for the variance of a Gaussian distribution:
is biased, and that an unbiased estimator of variance is:
For the Gaussian distribution, these estimators give similar results for large enough N, and it is unclear whether one estimator is preferable to the other. In this problem, we will explore an example in which the maximum likelihood estimate is dramatically superior to any unbiased estimator.
Suppose we are not quite interested in estimating λ. We are more interested in estimating a quantity that is a nonlinear function of λ, namely η = e−2λ. Suppose we want to estimate η from a single observation X ∼ Poisson(λ).
1. (5pts) Let ηˆ = e−2X. Show that ηˆ is the maximum likelihood estimate of η.
2. (5pts) Show that the bias of ηˆ is e−(1−1/e2)λ − e−2λ. The following Taylor expansion may be useful:
3. (5pts) It turns out that (−1)X is the only unbiased estimate of η. Prove that it is indeed unbiased and briefly explain why this is a bad estimator to use.
2 Question 2 – Logistic Regression (55 points)
In this question, you will implement Logistic Regression using Stochastic Gradient Descent (SGD). Suppose the training data is {(X1,Y 1),··· ,(Xn,Y n)}, where Xi is a column vector of d dimensions and Y i is the target label. For a column vector X, let X¯ denotes [X;1], the vector obtained by appending 1 to the end of X. θ is the set of parameters θ1,θ2,...,θk−1. Logistic regression for k classes assumes the following probability function:
for i = 1,··· ,k − 1, (2)
. (3)
Logistic regression minimizes the negative average conditional log likelihood:
. (4)
Here, log(·) is the natural logarithm. To minimize this loss function, we can use gradient descent:
, for c = 1,...,k − 1 (5)
where (6)
This gradient is computed by enumerating over all training data. This gradient can be approximated using a batch of training data. Suppose B is a subset of {1,2,··· ,n}
(7)
This leads to the following stochastic gradient descent algorithm:
Algorithm 1 Stochastic gradient descent for Logistic Regression
1: Inputs: (for data), m (for batch size), ηstart (initial learning rate), ηend (final learning rate), max epoch (maximum number of epoches), (learning rate reduction criterion)
2: (i1,...,in) = permute(1,...,n)
3: Divide (i1,...,in) into batches of size m or m + 1 . n might not be divisible by m,
4: . so some batches might have size m + 1
5: Randomly initialize θ
6: η ← ηstart
7: fordo
8:
9: for each batch B do
10: Update θ using Eqs. (5) and (7)
11: end for
12: ifthen // not much progress, try smaller learning rate
13:
14: end if
15: if η < ηend then Break
16: end if
17: end for
18: Outputs: θ.
2.1 Question 2.1 – Derivation (10 points)
Prove that:
. (8)
where δ(c = Y i) is the indicator function which takes a value of 1 if the class c equals the ground truth label Y i, and 0 otherwise. Use Equation (8) to derive the gradient of the loss with respect to the parameters θ1,θ2,...,θk−1.
2.2 Question 2.2 – Implementation (30 points)
Your task is to implement Logistic Regression with k classes using SGD.
2.2.1 (10 points)
Write a Python function with the signature:
[W,b] = logreg fit(X,y,m,eta start,eta end,epsilon,max epoch = 1000)
where
Inputs:
• X: a two dimensional Numpy array of size n × d, where n is the number of data points, and d the dimension of the feature vectors.
• y: a Numpy vector of length n. y[i] is a categorical label corresponding to the data point X[i,:], y[i] ∈ {0,1,...,k − 1}. You can assume that the number of classes k is the maximum entry of y plus 1.
• m,eta start,eta end,epsilon,max epoch are scalar parameters for the batch size, the starting learning rate, the ending learning rate, the learning rate reduction criterion, and the maximum number of epochs as described in the algorithm above.
Outputs:
• W: a (k − 1)×d Numpy matrix, W[i,:] is the learned weight vector θi+1.
• b: a Numpy vector of length k − 1 for the biases.
2.2.2 (10 points)
Write a Python function with the following signature.
yˆ = logreg predict class(W,b,X)
This function takes the learned parameters W and b of a logistic regression classifier and test data X and output the predicted categorical label yˆ for the test data X. See Question 2.2.1 for the expected formats for W,b,X. The output yˆ should be a Numpy vector of length n, where n is the number of rows of data matrix X. yˆ[i] is the predicted label for data point X[i,:].
2.2.3 (10 points)
Write a Python function with the following signature.
P = logreg predict prob(W,b,X)
This function takes the learned parameters W and b of a logistic regression classifier and test data X and output the predicted probabilities. P is a Numpy matrix of size n × k, where P[i,j] is the probability that X[i,:] belongs to class j.
2.2.4 Hints
Hint 1: You might want to implement Question 2.2.3 first. Questions 2.2.1 and 2.2.2 should make use of
2.2.3. You have to implement these functions yourself. You cannot use existing implementation in Python, e.g., sklearn.linear model.LogisticRegression
Hint 2: To compute the probability values, you should use Eqs (2) and (3). To compute a probability value of the form , you can also use the equivalent formula: , where
a = max{a0,...,ak−1}. In theory, the two formulations are the same, but the latter will help you avoid numerical problems in the calculation. Note also that, exp(0) = 1.
Hint 3: Useful numpy functions that you can use for your implementation: np.dot, np.exp, np.log, np.sum, np.mean.
2.3 Question 2.3 – Running and reporting results (15 points)
In this question, you will train a Logistic Regression model using the functions you implemented in Question 2, in order to predict whether a patient will recover from COVID-19. The COVID-19 dataset was compiled using the data provided by the Stony Brook University Hospital. This dataset contains real data from patients that were diagnosed with COVID-19. For each patient, it includes a number of features, including age, gender, blood pressure and other lab results collected at the admission time. All these are the features can be used in order to forecast if a patient is likely to recover or not. This is a two-class classification problem, and we will refer to the Death Outcome as the positive class.
Step 1
The first step is to load your training and test data that are provided in the files train.csv and test.csv correspondingly. (Tip: To load the data from a csv file, you can use numpy.genfromtxt or csv.reader).
Inspect your data carefully: The first row of each csv file is the csv header. Each following row corresponds to a patient. The columns correspond to the different measurements per patient. The last column indicates if a patient died or not.
Load all the columns of train.csv except the last one to your feature matrix Xtrain and the last column to your labels vector ytrain. Both Xtrain and ytrain should be numpy arrays. Convert Xtrain to float using Xtrain = Xtrain.astype(np.float64) and ytrain to int using ytrain = ytrain.astype(int). Do the same for your test data.
Step 2
In general, features can have different types of values and ranges, which can significantly affect the performance of the classifier. As a result, the first step after loading the data is to pre-process them and scale the features in a similar range, using normalization techniques.
To normalize the features, you will use Z-score normalization (or Standardization). This method scales the features by removing the mean µ and dividing by their standard deviation σ:
You can use sklearn.preprocessing.StandardScaler for this step. In that case, you should use fit transform() for Xtrain to fit and scale your training data, and then transform() for Xtest to scale the test data based on the training mean and standard deviation. This is important, as the test data are seen as unknown during training, so you should not normalize them using knowledge from them.
Step 3
Train a Logistic Regression classifier using your normalized training data Xtrain and ytrain, and the function logreg fit() that you implemented in Question 2.2.1. Use m = 256, eta start = 0.01, eta end = 0.00001, epsilon = 0.0001 and max epoch = 1000.
2.3.1 Question
(a) Plot the loss function, as given in Eq. (3), against the number of epochs during training. Comment on what you see.
(b) Use the learned parameters (weights W and biases b) to predict the labels yˆtrain for the training set Xtrain and yˆtest for the test set Xtest. For each set, calculate and report the following performance metrics:
• The accuracy, i.e. the number of correctly predicted labels over the total number of samples. (You can use sklearn.metrics.accuracy score).
• The confusion matrix (You can use sklearn.metrics.confusion matrix).
• The accuracy computed based on the diagonal of the confusion matrix.
(c) Use the learned parameters (weights W and biases b) to estimate the probabilities of the positive class (y = 1) for the test set. Calculate and report the following performance metrics for the test set:
• The average precision (you can use sklearn.metrics.average precision score).
• Plot the precision recall curve (you can use sklearn.metrics.precision recall curve).
2.3.2 Question
(a) Remove the feature normalization and train the classifier again. Plot the loss function against the number of epochs during training as in Question 2.3.1(a). Do you see any difference with scaling and not scaling the data (e.g. in terms of the speed of convergence, the number of epochs, the final performance)?
(b) With feature normalization, try different values for the hyperparameters eta start, eta end, max epoch and m. Comment on what you see when increasing or decreasing these parameters. For example, what impact has a higher learning rate on the training, the number of epochs or the final performance?
(c) Choose your own values for eta start, eta end, max epoch and m and train the classifier again. Plot the loss function against the number of epochs during training as in Question 2.3.1(a). Why did you choose these values?