All parts of this exercise must be done within a notebook, with text answers (and other discussion) provided as Markdown/LaTeX cells. Please make sure that the version of the notebook you submit has fully executed cells.
Restrictions: You may only use the numpy and pandas packages within your code; additionally, use matplotlib.pyplot for plotting.
Notebook Preamble: You can import the required libraries as follows (but you are allowed to use any other names of your liking):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
1. Logistic regression.
Let m = 1000 and n = 4.
(a) Generate the data matrix $X \in \mathbb{R}^{m \times n}$ with entries drawn independently such that each is distributed as $\mathcal{N}(0, 1)$.
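As a minimal sketch of this step (the fixed seed and the use of numpy's default_rng are my own choices, made only for reproducibility):

    rng = np.random.default_rng(0)   # fixed seed: an arbitrary reproducibility choice
    m, n = 1000, 4
    X = rng.standard_normal((m, n))  # entries iid N(0, 1); row i is sample x^(i)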
(b) Let $\theta = [1, -1, 2, -5]^\top$. Generate labels $y \in \mathbb{R}^m$ such that
$$y_i = \operatorname{sign}\big(\theta^\top x^{(i)} + 0.5\big),$$
where $\operatorname{sign}(x)$ is equal to 1 if $x \ge 0$, and $-1$ otherwise.
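Continuing the sketch, the labels of part (b) can be generated by vectorizing the sign rule with np.where:

    theta = np.array([1.0, -1.0, 2.0, -5.0])
    # sign(t) = 1 if t >= 0 and -1 otherwise, exactly as defined above
    y = np.where(X @ theta + 0.5 >= 0, 1, -1)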
(c) Use gradient descent (as derived in class) to learn the logistic regression coefficients for estimating $P(y = 1 \mid x)$. Use step size $\alpha = 0.1/m$ and report the coefficients after 1000 iterations. Compare your result with the parameters used to generate the data in part (b).
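Since the exact in-class derivation is not reproduced here, the following is one standard sketch of batch gradient descent on the logistic loss; the zero initialization and the absence of an intercept term are my assumptions, so adapt it to whatever was derived in class:

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    y01 = (y + 1) // 2       # map {-1, +1} labels to {0, 1} so h(x) targets P(y = 1 | x)
    alpha = 0.1 / m
    theta_hat = np.zeros(n)  # assumption: zero initialization, no intercept term
    for _ in range(1000):
        h = sigmoid(X @ theta_hat)
        # descent step on the negative log-likelihood
        theta_hat = theta_hat + alpha * (X.T @ (y01 - h))

    print(theta_hat)         # compare against theta = [1, -1, 2, -5]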
(d) Redo the previous part, this time with noisy data, where now
$$y_i = \operatorname{sign}\big(\theta^\top x^{(i)} + Z_i + 0.5\big),$$
with $Z_i \sim \mathcal{N}(0, 2)$. The noise terms $Z_i$ for different samples are drawn independently.
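Reading $\mathcal{N}(0, 2)$ as mean 0 and variance 2 (so standard deviation $\sqrt{2}$), the noisy labels can be sketched as:

    Z = rng.normal(0.0, np.sqrt(2.0), size=m)  # std sqrt(2), i.e., variance 2
    y_noisy = np.where(X @ theta + Z + 0.5 >= 0, 1, -1)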
(e) Recall that in logistic regression, our function $h(x)$ is supposed to estimate $P(Y = 1 \mid x)$; therefore, the generated output is a soft decision. Use the following rule to map the output of your logistic regression to $+1$ and $-1$ values: if $h(x) \ge 0.5$, let $\hat{y} = 1$; otherwise, let $\hat{y} = -1$. Compute the error probability of your classifier on the training dataset of part (c). Next, compute the error probability for the dataset you generated in part (d). (Note that for each dataset you need to use a different set of learned coefficients.) Compare the two error probabilities and report your observations.
Remark: Given labels $y^{(i)}$ and predicted labels $\hat{y}^{(i)}$, the error probability on training data is computed as
$$P_e = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\left\{ \hat{y}^{(i)} \neq y^{(i)} \right\}.$$
In other words, the error probability shows the fraction of training samples that are wrongly labeled by our algorithm.
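A small numpy helper along these lines computes this fraction (the function name and signature are my own):

    def error_probability(theta_hat, X, y):
        # h(x) = sigmoid(theta^T x) >= 0.5 exactly when theta^T x >= 0,
        # so the hard decision can be read directly off the linear score.
        y_hat = np.where(X @ theta_hat >= 0, 1, -1)
        return np.mean(y_hat != y)

Evaluating it once with the coefficients learned in part (c) on the clean labels, and once with the coefficients learned in part (d) on the noisy labels, gives the two error probabilities to compare.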
2. Logistic regression using Python. In this problem we are going to use the same synthetic datasets we used in the previous problem. Let $X \in \mathbb{R}^{m \times n}$ denote the input matrix, and let $y$ and $y_{\text{noisy}}$ denote the noise-free and noisy output labels, respectively. You need to import LogisticRegression as

    from sklearn.linear_model import LogisticRegression
(a) Define two logistic regression models corresponding to the two datasets. For data $X$ and labels $y$, the model can be defined as

    model = LogisticRegression().fit(X, y)

(Since $X$ from Problem 1 stores one sample per row, it is passed directly; sklearn's fit expects an array of shape (n_samples, n_features).)
(b) Apply model.predict(X) and model.score(X, y) on both models and compare the scores for the noisy and noise-free labels. (Note that predict takes only the input matrix, while score takes both the inputs and the labels.)
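Putting parts (a) and (b) together, a sketch might look like the following (the variable names are my own; X, y, and y_noisy are the arrays from Problem 1):

    from sklearn.linear_model import LogisticRegression

    model_clean = LogisticRegression().fit(X, y)
    model_noisy = LogisticRegression().fit(X, y_noisy)

    print(model_clean.predict(X))          # hard +/-1 predictions
    print(model_clean.score(X, y))         # mean accuracy on the noise-free labels
    print(model_noisy.score(X, y_noisy))   # mean accuracy on the noisy labels

One would expect a (near-)perfect score on the noise-free labels, while the label noise places a ceiling on the achievable accuracy.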
3. Non-centered Data and Principal Component Analysis (PCA)
(a) Create a matrix $A \in \mathbb{R}^{3 \times 2}$ whose individual entries are drawn from a Gaussian distribution with mean 0 and variance 1 in an independent and identically distributed (iid) fashion. Also, create a vector $c \in \mathbb{R}^3$ whose individual entries are iid and drawn from a Gaussian distribution with mean 0 and variance 3. Once generated, both $A$ and $c$ should not be changed for the rest of the problems in this section. (A code sketch of this step appears after part (b) below.)
(b) Generate a synthetic dataset with 250 data samples as follows. Each data sample $x^{(i)} \in \mathbb{R}^3$ in the dataset is generated as $x^{(i)} = A b_i + c$, where $b_i \in \mathbb{R}^2$ is a random vector whose entries are iid Gaussian with mean 0 and variance 1. Note that we will have a different $b_i$ for each data sample $x^{(i)}$ (i.e., unlike $A$ and $c$, it is not fixed across data samples). Store the data samples into a data matrix $X \in \mathbb{R}^{3 \times 250}$, such that each data sample is a column of this matrix.
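A sketch of parts (a) and (b), with a fixed seed as an arbitrary reproducibility choice and $\mathcal{N}(0, 3)$ read as mean 0 and variance 3:

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 2))                 # entries iid N(0, 1)
    c = rng.normal(0.0, np.sqrt(3.0), size=(3, 1))  # entries iid N(0, 3); std sqrt(3)

    B = rng.standard_normal((2, 250))  # column i holds b_i
    X = A @ B + c                      # column i is x^(i) = A b_i + c (c broadcasts)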
(c) What is the rank of the data matrix $X$? Verify your answer by printing the rank of $X$.
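Generically, $c$ does not lie in the two-dimensional column space of $A$, so the offset raises the rank from 2 to 3; numpy can confirm this:

    print(np.linalg.matrix_rank(X))  # expected output: 3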
(d) Verify the importance of centering the data as an essential preprocessing step for PCA by carrying out the following steps (a code sketch covering both steps follows):
i. Compute the top two principal component directions $U = [u_1, u_2]$ of the dataset without centering the data. Compute the corresponding reconstruction error as
$$\text{error} = \sum_{i=1}^{250} \left\| x^{(i)} - U U^\top x^{(i)} \right\|_2^2.$$
ii. Center the data, repeat the previous part, and compare the results.
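One numpy-only way to carry out both steps (the helper name and the use of the SVD to obtain the principal directions are my choices):

    def top2_reconstruction_error(X, center):
        # Optionally subtract the mean, take the top-2 left singular vectors,
        # project onto them, and return the total squared reconstruction error.
        mu = X.mean(axis=1, keepdims=True) if center else np.zeros((X.shape[0], 1))
        Xc = X - mu
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        U2 = U[:, :2]                    # top two principal component directions
        X_hat = U2 @ (U2.T @ Xc) + mu    # rank-2 reconstruction (mean added back)
        return np.sum((X - X_hat) ** 2)

    print(top2_reconstruction_error(X, center=False))
    print(top2_reconstruction_error(X, center=True))

Since the centered data lie exactly in the two-dimensional column space of $A$, the centered error should be numerically zero, whereas the uncentered directions must also account for the offset $c$ and generally leave a nonzero residual.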