STA4364 Homework 4 Solution

Submission Format: Please submit your homework as 1) an HTML or PDF document and 2) the source file in either R Markdown or Jupyter notebook format (at most one of each type of file).
Problems can be done in Python or R. ISL = Introduction to Statistical Learning textbook.
Problem 1. In this problem, you will predict tumor type from gene expression data. Since there are many more gene features than observations of patients, we will use ridge and LASSO regularization for logistic regression to reduce overfitting and help select the most relevant features out of a large group of features. This dataset has a multi-class outcome variable. The possible tumor types are BRCA, COAD, KIRC, LUAD, or PRAD. You will analyze this dataset by building a multinomial regression model with ℓ1 and ℓ2 regularization. The recommended approach is the glmnet package in R, which is covered in the code in class. You can check the “Multinomial Regression” section found at this link for specific information about multinomial regression in glmnet.
Note: This is quite a large dataset so models will take a minute or two to fit.
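For reference, the regularized objective can be written out explicitly. The following is a sketch, not a statement of glmnet's exact scaling: N (number of observations), the intercepts β_{0k}, and the mixing weight α are symbols assumed here for illustration, while p and C follow the notation of part (d).

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\log \Pr(y_i \mid x_i)
+ \lambda\left[\frac{1-\alpha}{2}\sum_{j=1}^{p}\sum_{k=1}^{C}\beta_{jk}^{2}
+ \alpha\sum_{j=1}^{p}\sum_{k=1}^{C}\lvert\beta_{jk}\rvert\right],
\qquad
\Pr(y_i = k \mid x_i) =
\frac{\exp\!\left(\beta_{0k} + x_i^{\top}\beta_{k}\right)}
     {\sum_{m=1}^{C}\exp\!\left(\beta_{0m} + x_i^{\top}\beta_{m}\right)}
```

Under this form, α = 0 gives the ridge (ℓ2) penalty, α = 1 gives the LASSO (ℓ1) penalty, and values in between mix the two.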
(a) Load the labels and data with read.csv. Remove any columns with missing entries. Remove any columns with variance less than 0.001. Standardize each gene predictor column to have mean 0 and standard deviation 1 (this is important when doing regularized regression). Split the dataset randomly into a training and validation set.
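The preprocessing steps in (a) can be sketched in Python with NumPy. This is a minimal illustration on toy data, not the course solution: the function names, the 30% validation fraction, and the random seeds are assumptions, and the real data would first be loaded from the provided CSV files (for example with pandas `read_csv`).

```python
import numpy as np

def preprocess(X, var_threshold=0.001):
    """Drop columns with missing entries or low variance, then standardize."""
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=0)                 # columns with no missing entries
    X = X[:, complete]
    variable = X.var(axis=0, ddof=1) >= var_threshold   # keep variance >= threshold
    X = X[:, variable]
    kept = np.flatnonzero(complete)[variable]           # original indices of kept columns
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # mean 0, standard deviation 1
    return X, kept

def train_val_split(n, val_fraction=0.3, seed=0):
    """Return (train_idx, val_idx) as a random partition of range(n)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_val = int(round(val_fraction * n))
    return perm[n_val:], perm[:n_val]

# Toy stand-in for the gene expression matrix (rows = patients, columns = genes).
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(50, 6))
X_raw[3, 1] = np.nan      # column 1 has a missing entry
X_raw[:, 4] = 2.0         # column 4 is constant, so its variance is 0

X, kept = preprocess(X_raw)
train_idx, val_idx = train_val_split(X.shape[0])
X_train, X_val = X[train_idx], X[val_idx]
```

Following the order given in the problem, the columns are standardized before the split; standardizing with training-set statistics only is a common alternative.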
(b) Use ridge logistic regression with 10-fold cross validation to model the response given the gene expression predictors. What is your optimal value of the regularization parameter λ? Apply your model to give predictions using the optimal value of λ. Make confusion matrices showing the accuracy of your model on the training and validation sets.
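To make the mechanics of (b) concrete, here is a minimal NumPy sketch of choosing λ by 10-fold cross validation for ridge-penalized multinomial logistic regression, followed by a confusion matrix. It is not glmnet: the fit uses plain gradient descent, the λ grid and toy data are assumptions, and λ is chosen by misclassification rate (glmnet's cross validation may use a different criterion by default). All function names are hypothetical.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_ridge_softmax(X, y, n_classes, lam, n_iter=500):
    """Gradient descent on mean cross-entropy + (lam / 2) * ||W||^2, intercept unpenalized."""
    n, p = X.shape
    A = np.hstack([np.ones((n, 1)), X])      # prepend an intercept column
    Y = np.eye(n_classes)[y]                 # one-hot labels
    W = np.zeros((p + 1, n_classes))
    mask = np.ones((p + 1, 1))
    mask[0] = 0.0                            # do not penalize the intercept row
    # Step size 1/L, where L bounds the curvature of the objective.
    step = 1.0 / (0.5 * np.linalg.norm(A, 2) ** 2 / n + lam)
    for _ in range(n_iter):
        grad = A.T @ (softmax(A @ W) - Y) / n + lam * mask * W
        W -= step * grad
    return W

def predict(W, X):
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    return (A @ W).argmax(axis=1)

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(M, (y_true, y_pred), 1)
    return M

def cv_select_lambda(X, y, n_classes, lambdas, n_folds=10, seed=0):
    """Return the lambda with the lowest mean held-out misclassification rate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_error = []
    for lam in lambdas:
        errors = []
        for k in range(n_folds):
            held_out = folds[k]
            fit_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            W = fit_ridge_softmax(X[fit_idx], y[fit_idx], n_classes, lam)
            errors.append(np.mean(predict(W, X[held_out]) != y[held_out]))
        cv_error.append(np.mean(errors))
    cv_error = np.array(cv_error)
    return lambdas[int(cv_error.argmin())], cv_error

# Toy data: 3 well-separated classes (an assumption standing in for the tumor data).
rng = np.random.default_rng(0)
n_classes, n, p = 3, 150, 10
y = rng.integers(0, n_classes, size=n)
centers = rng.normal(scale=2.0, size=(n_classes, p))
X = centers[y] + rng.normal(size=(n, p))

lambdas = np.array([1.0, 0.1, 0.01, 0.001])
best_lam, cv_error = cv_select_lambda(X, y, n_classes, lambdas)
W = fit_ridge_softmax(X, y, n_classes, best_lam)
cm = confusion_matrix(y, predict(W, X), n_classes)
```

The diagonal of `cm` counts correct predictions, so `np.trace(cm) / cm.sum()` is the accuracy; the same function applies unchanged to held-out data.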
(c) Use LASSO logistic regression with 10-fold cross validation to model the response given the gene expression predictors. What is your optimal value of the regularization parameter λ? Apply your model to give predictions using the optimal value of λ. Make confusion matrices showing the accuracy of your model on the training and validation sets.
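The reason LASSO performs feature selection is that the ℓ1 penalty drives coefficients exactly to zero. The sketch below illustrates this with proximal gradient descent (soft-thresholding) on toy data. This is a stand-in technique chosen for brevity, not the coordinate descent that glmnet uses, and the fixed λ = 0.1, the toy data, and the function names are assumptions; in practice λ would be chosen by cross validation as in (b).

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def soft_threshold(W, t):
    """Proximal operator of t * ||W||_1: shrink each entry toward zero by t."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def fit_lasso_softmax(X, y, n_classes, lam, n_iter=2000):
    """Proximal gradient on mean cross-entropy + lam * ||W||_1, intercept unpenalized."""
    n, p = X.shape
    Y = np.eye(n_classes)[y]                 # one-hot labels
    W = np.zeros((p, n_classes))
    b = np.zeros(n_classes)
    # Step size 1/L, where L bounds the curvature of the smooth part.
    step = 1.0 / (0.5 * (np.linalg.norm(X, 2) ** 2 / n + 1.0))
    for _ in range(n_iter):
        R = softmax(X @ W + b) - Y           # residuals: predicted minus observed
        W = soft_threshold(W - step * (X.T @ R) / n, step * lam)
        b = b - step * R.mean(axis=0)
    return W, b

# Toy data: 30 features, of which only the first 3 carry class information.
rng = np.random.default_rng(0)
n, p, n_classes = 200, 30, 3
y = rng.integers(0, n_classes, size=n)
X = rng.normal(size=(n, p))
X[:, :3] += 3.0 * np.eye(n_classes)[y]       # feature k is shifted upward for class k
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

W, b = fit_lasso_softmax(X, y, n_classes, lam=0.1)
selected = np.flatnonzero(np.abs(W).sum(axis=1) > 0)   # features with any non-zero coefficient
```

Here `selected` lists the features that survive the penalty; a ridge fit on the same data would shrink the uninformative coefficients but leave them non-zero.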
(d) Give a list of the top 20 most relevant genes that are selected by your LASSO model at the optimal value of λ. The coefficients for a multinomial regression model will be a p×C matrix where C is the number of classes and p is the number of feature columns. What relation do your selected genes have to tumor expression? You can determine this by looking at which of the C coefficients associated with a certain gene are non-zero. Positive values in a certain index correspond to a higher probability of the tumor associated with that index, while negative values correspond to a lower probability.
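One way to read the p×C coefficient matrix is sketched below. Ranking genes by their largest absolute coefficient across classes is an assumption about what "most relevant" means, and the gene names and coefficient values are made up for illustration.

```python
import numpy as np

def top_genes(coef, gene_names, class_names, k=20):
    """Rank genes by largest |coefficient| across classes; report the sign per class."""
    coef = np.asarray(coef, dtype=float)               # shape (p, C)
    strength = np.abs(coef).max(axis=1)
    order = np.argsort(-strength, kind="stable")[:k]
    report = []
    for j in order:
        if strength[j] == 0:
            break                                      # remaining genes were dropped by LASSO
        signs = {class_names[c]: ("+" if coef[j, c] > 0 else "-")
                 for c in np.flatnonzero(coef[j])}     # only the non-zero classes
        report.append((gene_names[j], signs))
    return report

classes = ["BRCA", "COAD", "KIRC", "LUAD", "PRAD"]
genes = [f"gene_{j}" for j in range(6)]                # hypothetical gene names
coef = np.zeros((6, 5))
coef[1, 0], coef[1, 3] = 0.8, -0.2                     # raises BRCA, lowers LUAD
coef[4, 2] = -1.5                                      # lowers KIRC
coef[5, 4] = 0.1                                       # raises PRAD

report = top_genes(coef, genes, classes)
for name, signs in report:
    print(name, signs)
```

A "+" for a class means higher expression of that gene raises the predicted probability of that tumor type, and a "-" means it lowers it.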
Note: If columns are highly correlated, LASSO will often arbitrarily select a single column, so a full report of relevant genes would involve predictors selected by LASSO and genes that are highly correlated with them. Other techniques like group LASSO can select subsets of related genes, but these will not be covered in this class. You could also try to combine the ℓ1 and ℓ2 penalties to get representation of meaningful predictors that are also correlated (see the reference material above).
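The note above suggests augmenting the LASSO selection with genes that are highly correlated with the selected ones. A minimal sketch of that lookup follows; the 0.9 correlation threshold, the function name, and the toy data are assumptions. Only correlations between selected columns and all columns are computed, which avoids forming a full p×p matrix when p is large.

```python
import numpy as np

def correlated_partners(X, selected, threshold=0.9):
    """Map each selected column to unselected columns with |correlation| >= threshold."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    selected = [int(j) for j in selected]
    corr = Z[:, selected].T @ Z / (X.shape[0] - 1)     # shape (n_selected, p)
    chosen = set(selected)
    partners = {}
    for r, j in enumerate(selected):
        close = np.flatnonzero(np.abs(corr[r]) >= threshold)
        partners[j] = [int(m) for m in close if int(m) not in chosen]
    return partners

# Toy data: column 3 is a noisy copy of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=100)

partners = correlated_partners(X, selected=[0, 2])
```

In this toy case column 3 is reported as a partner of selected column 0, while column 2 has no partners.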