E6690: Statistical Learning for Bio & Info Systems
P1. Suppose we collect data for a group of students in a statistics class with variables X1 (hours studied), X2 (undergrad GPA), and Y (receive an A). We fit a logistic regression and obtain the estimated coefficients β̂0 = −6, β̂1 = 0.05, β̂2 = 1.
(a) (5pt) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.
(b) (5pt) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?
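Both parts of P1 reduce to evaluating or inverting the logistic function p = 1 / (1 + exp(−(β̂0 + β̂1 X1 + β̂2 X2))). The following is a minimal Python sketch of that arithmetic, not part of the assignment; the function names prob_a and hours_for_prob are hypothetical helpers introduced only for illustration.

```python
import math

# Estimated coefficients as stated in P1.
b0, b1, b2 = -6.0, 0.05, 1.0

def prob_a(hours, gpa):
    """Fitted probability of an A under the logistic model."""
    z = b0 + b1 * hours + b2 * gpa
    return 1.0 / (1.0 + math.exp(-z))

def hours_for_prob(p, gpa):
    """Hours at which the fitted probability equals p.

    Solves log(p / (1 - p)) = b0 + b1 * hours + b2 * gpa for hours.
    """
    logit = math.log(p / (1.0 - p))
    return (logit - b0 - b2 * gpa) / b1

p_a = prob_a(40, 3.5)              # part (a)
h_half = hours_for_prob(0.5, 3.5)  # part (b)
print(p_a, h_half)
```

Because logit(0.5) = 0, part (b) amounts to setting the linear predictor to zero and solving for the hours.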
P2. (10pt) Suppose that we wish to predict whether a given stock will issue a dividend this year ("Yes" or "No") based on X, last year's percent profit. We examine a large number of companies and discover that the mean value of X for companies that issued a dividend was X̄ = 10, while the mean for those that didn't was X̄ = 0. In addition, the variance of X for these two sets of companies was σ̂² = 36. Finally, 80% of companies issued dividends. Assuming that X follows a normal distribution, predict the probability that a company will issue a dividend this year given that its percentage profit was X = 4 last year.
Hint: Use Bayes’ theorem.
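The hint can be made concrete as follows: with class priors π_yes = 0.8 and π_no = 0.2 and normal class densities sharing variance 36, Bayes' theorem gives P(Yes | X = x) = π_yes f_yes(x) / (π_yes f_yes(x) + π_no f_no(x)). The sketch below, in Python, assumes exactly this setup; normal_pdf and posterior_yes are hypothetical helper names, not part of the assignment.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_yes(x, prior_yes=0.8, mean_yes=10.0, mean_no=0.0, var=36.0):
    """P(dividend = Yes | X = x) via Bayes' theorem with a shared variance."""
    num = prior_yes * normal_pdf(x, mean_yes, var)
    den = num + (1.0 - prior_yes) * normal_pdf(x, mean_no, var)
    return num / den

p_dividend = posterior_yes(4.0)
print(p_dividend)
```

Since both classes share the same variance, the normalizing constant of the density cancels in the ratio; it is kept here only so that normal_pdf is a proper density.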
P3. (20pt) Consider X = [0.0 0.2 0.4 0.6 0.8 1.0]ᵀ as the independent variable and y = [false false false true false true]ᵀ as the response. Write down the log-likelihood function, l(β0, β1), for the logistic regression problem and the first-order optimality conditions. Note that the optimal (β0, β1) can only be found numerically. Use the Newton-Raphson algorithm from Section 4.4.1, pp. 120-121, of the [ESL] book and perform 10 iterations. Hint: Use library(matlib) for calculating matrix inverses.
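For reference, the Newton-Raphson update for logistic regression can be written as β ← β + (XᵀWX)⁻¹ Xᵀ(y − p), where p holds the fitted probabilities and W = diag(p_i(1 − p_i)). The following is a minimal sketch in plain Python rather than the R tooling the hint mentions; it assumes a starting value of β = (0, 0), codes false/true as 0/1, and inverts the 2 × 2 matrix by hand. The names score and newton_raphson are hypothetical.

```python
import math

x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [0, 0, 0, 1, 0, 1]  # false -> 0, true -> 1

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(b0, b1):
    """Gradient of the log-likelihood: sum(y - p) and sum(x * (y - p))."""
    p = [sigmoid(b0 + b1 * xi) for xi in x]
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
    return g0, g1

def newton_raphson(iters=10):
    b0, b1 = 0.0, 0.0  # assumed starting point
    for _ in range(iters):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        g0, g1 = score(b0, b1)
        # Entries of X^T W X, where X has a column of ones and a column of x.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: (X^T W X)^{-1} times the score, using the 2x2 inverse formula.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = newton_raphson(10)
print(b0_hat, b1_hat, score(b0_hat, b1_hat))
```

Printing the score at the final iterate is a quick check of the first-order conditions: both components should be close to zero if the iterations have converged.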
P4. (20pt) Let X ∼ N(0,Σ). Find the distribution of Y = AX. For the bivariate case with
,
find a 2 × 2 matrix A such that cov(Y) is the identity matrix.
Hint: Consider eigen-decomposition.
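One way to see why the hint helps, sketched for a generic symmetric positive-definite Σ (an assumption made here for illustration) with eigen-decomposition Σ = UΛUᵀ:

```latex
\begin{aligned}
Y = AX &\sim N\!\left(0,\; A \Sigma A^{\top}\right), \\
\Sigma &= U \Lambda U^{\top}, \qquad U^{\top} U = I, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \lambda_2), \quad \lambda_i > 0, \\
A = \Lambda^{-1/2} U^{\top}
&\;\Longrightarrow\;
A \Sigma A^{\top}
= \Lambda^{-1/2} U^{\top} U \Lambda U^{\top} U \Lambda^{-1/2}
= I .
\end{aligned}
```

This choice of A is not unique: for any orthogonal Q, the matrix QA also gives an identity covariance.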
and ,
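$n_k$ is the sample size of population $k$ and the pair $(x_i, y_i)$ corresponds to the $i$th observation; $y_i \in \{1, \dots, K\}$. For $\alpha_1, \dots, \alpha_K$ such that $\sum_{k=1}^{K} \alpha_k = 1$, define an unbiased pooled variance estimator:
$$\hat{\sigma}^2 = \sum_{k=1}^{K} \alpha_k \hat{\sigma}_k^2.$$
If $n = \sum_{k=1}^{K} n_k$, show that $\alpha_k = (n_k - 1)/(n - K)$ minimizes the variance of $\hat{\sigma}^2$ under the Gaussian assumption.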
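A possible outline of the argument is sketched below. It assumes that the K samples are independent, that all populations share a common variance σ², that the weights satisfy Σ α_k = 1, and that n = Σ n_k; these conditions are assumptions made for the sketch.

```latex
\begin{aligned}
\frac{(n_k - 1)\,\hat{\sigma}_k^2}{\sigma^2} \sim \chi^2_{n_k - 1}
&\;\Longrightarrow\;
\operatorname{Var}\!\left(\hat{\sigma}_k^2\right) = \frac{2\sigma^4}{n_k - 1}, \\
\operatorname{Var}\!\left(\hat{\sigma}^2\right)
&= \sum_{k=1}^{K} \alpha_k^2 \operatorname{Var}\!\left(\hat{\sigma}_k^2\right)
= 2\sigma^4 \sum_{k=1}^{K} \frac{\alpha_k^2}{n_k - 1}, \\
\mathcal{L}(\alpha, \lambda)
&= \sum_{k=1}^{K} \frac{\alpha_k^2}{n_k - 1}
 - \lambda \left( \sum_{k=1}^{K} \alpha_k - 1 \right), \\
\frac{\partial \mathcal{L}}{\partial \alpha_k} = 0
&\;\Longrightarrow\;
\alpha_k = \frac{\lambda}{2}\,(n_k - 1), \\
\sum_{k=1}^{K} \alpha_k = 1
&\;\Longrightarrow\;
\frac{\lambda}{2} = \frac{1}{\sum_{j=1}^{K} (n_j - 1)} = \frac{1}{n - K},
\qquad
\alpha_k = \frac{n_k - 1}{n - K}.
\end{aligned}
```

The objective is strictly convex in α, so the stationary point found by the Lagrangian is the minimizer.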
P6. (10pt) Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P[Class is Red|X]:
0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.
There are two common ways to combine these results into a single class prediction. One is the majority vote approach. The second approach is to classify based on the average probability. In this problem, what is the final classification under each of these two approaches?
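The two combination rules can be compared with a few lines of arithmetic. The Python sketch below assumes a 0.5 cutoff for calling an observation Red, both for each individual tree and for the averaged probability; the cutoff and the variable names are assumptions made for illustration.

```python
# Ten bootstrap estimates of P[Class is Red | X] from P6.
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Majority vote: each tree votes Red if its estimate exceeds the assumed 0.5 cutoff.
red_votes = sum(1 for p in probs if p > 0.5)
majority_class = "Red" if red_votes > len(probs) / 2 else "Green"

# Average probability: classify once, using the mean of the ten estimates.
mean_prob = sum(probs) / len(probs)
average_class = "Red" if mean_prob > 0.5 else "Green"

print(red_votes, majority_class, mean_prob, average_class)
```

The two rules can disagree because the vote only counts which side of the cutoff each estimate falls on, while the average also reflects how far each estimate is from it.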
P7. This problem involves the OJ data set, which is part of the ISLR package.
(a) (2pt) First run set.seed(1000), and then create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
(b) (2pt) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
(c) (3pt) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
(d) (2pt) Create a plot of the tree, and interpret the results.
(e) (3pt) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
(f) (2pt) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
(g) (3pt) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
(h) (1pt) Which tree size corresponds to the lowest cross-validated classification error rate?
(i) (3pt) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
(j) (2pt) Compare the training error rate between the pruned and unpruned tree. Which is higher?
(k) (2pt) Compare the test error rates between the pruned and unpruned trees. Which is higher?