Problem 01
This problem is focused on setting up the log-posterior function for a linear model. You will program the function using matrix math, such that you can easily scale your code from a linear relationship with a single input up to complex linear basis function models. You will assume independent Gaussian priors on all β-parameters with a shared prior mean, µβ, and a shared prior standard deviation, τβ. An Exponential prior with rate parameter λ will be assumed for the likelihood noise, σ. The complete probability model for the response, yn, is shown below using the linear basis notation. The n-th row of the basis design matrix, Φ, is denoted as φn,:. It is assumed that the basis is of order J.
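yn | φn,:, β, σ ~ normal(µn, σ)

µn = φn,: β

βj | µβ, τβ ~ normal(µβ, τβ), for j = 0, 1, ..., J

σ | λ ~ Exp(λ)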
1a)
The code chunk below reads in a data set consisting of two variables, an input x and a response y.
PROBLEM Create a scatter plot between the response and the input using ggplot(). In addition to using geom_point(), add a geom_smooth() layer to your graph. Set the method argument to 'lm' in the call to geom_smooth(). Based on the figure, what type of relationship do you think exists between the response and the input?
train_01 %>% ggplot(mapping = aes(x = x, y = y)) +
  geom_point() + geom_smooth(method = 'lm')
1b)
In your response to Problem 1a), you should see a “best fit line” and its associated confidence interval displayed with the scatter plot. Behind the scenes, ggplot2 fits a linear model between the response and the input with Maximum Likelihood Estimation and plots the result on the figure. You will now work through a full Bayesian linear model. Before coding the log-posterior function, you will start out by creating the list of required information, info_01, which defines the data and hyperparameters that you will ultimately pass into the log-posterior function.
You will need to create a design matrix assuming a linear relationship between the input and the response. The mean trend function is written for you below:
µn = β0 + β1xn
PROBLEM Create the design matrix assuming a linear relationship between the input and the response, and assign the object to the Xmat_01 variable. Complete the info_01 list by assigning the response to yobs and the design matrix to design_matrix. Specify the shared prior mean, mu_beta, to be 0, the shared prior standard deviation, tau_beta, as 5, and the rate parameter on the noise, sigma_rate, to be 1.
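A minimal sketch of this setup, assuming the data are stored in train_01:

Xmat_01 <- model.matrix( y ~ x, data = train_01)

info_01 <- list(
  yobs = train_01$y,
  design_matrix = Xmat_01,
  mu_beta = 0,
  tau_beta = 5,
  sigma_rate = 1
)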
1c)
You will now define the log-posterior function lm_logpost(). You will continue to use the log-transformation on σ, and so you will actually define the log-posterior in terms of the mean trend β-parameters and the unbounded noise parameter, ϕ = log[σ].
The comments in the code chunk below tell you what you need to fill in. The unknown parameters to learn are contained within the first input argument, unknowns. You will assume that the unknown β-parameters are listed before the unknown ϕ parameter in the unknowns vector. You will assume that all variables contained in the my_info list (the second argument to lm_logpost()) are the same fields in the info_01 list you defined in Problem 1b).
PROBLEM Define the log-posterior function by completing the code chunk below. You must calculate the mean trend, mu, using matrix math between the design matrix and the unknown β column vector. After you complete the function, test it out by evaluating the log-posterior at two different sets of parameter values. Try out values of -1 for all parameters, and then try out values of 1 for all parameters.
HINT: If you have successfully completed the log-posterior function, you should get a value of -296.5826 for the -1 guess values, and a value of -123.9015 for the +1 guess values.
HINT: Don’t forget about useful data type conversion functions such as as.matrix() and as.vector().
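A sketch of one possible completion is shown below. It assumes the log-posterior is defined with respect to ϕ, which requires the change-of-variables adjustment log|dσ/dϕ| = ϕ:

lm_logpost <- function(unknowns, my_info)
{
  # the number of beta parameters is set by the design matrix
  length_beta <- ncol(my_info$design_matrix)
  # separate the betas from the unbounded noise parameter
  beta_v <- unknowns[1:length_beta]
  lik_varphi <- unknowns[length_beta + 1]
  # back-transform to sigma
  lik_sigma <- exp(lik_varphi)
  # mean trend via matrix math
  X <- my_info$design_matrix
  mu <- as.vector(X %*% as.matrix(beta_v))
  # evaluate the log-likelihood
  log_lik <- sum(dnorm(x = my_info$yobs, mean = mu, sd = lik_sigma, log = TRUE))
  # evaluate the log-priors
  log_prior_beta <- sum(dnorm(x = beta_v, mean = my_info$mu_beta, sd = my_info$tau_beta, log = TRUE))
  log_prior_sigma <- dexp(lik_sigma, rate = my_info$sigma_rate, log = TRUE)
  # account for the log-transformation of sigma
  log_derive_adjust <- lik_varphi
  log_lik + log_prior_beta + log_prior_sigma + log_derive_adjust
}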
1d)
The my_laplace() function is started for you in the code chunk below. You will complete the missing parts and then you will fit the Bayesian linear model from a starting guess of zero for all parameters.
PROBLEM Complete the my_laplace() function below and then fit the Bayesian linear model using a starting guess of zero for all parameters. Print the posterior mode and posterior standard deviations to the screen. Should you be concerned about the initial guess impacting the posterior results?
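A sketch following the laplace() function from the LearnBayes package, which the completed function is assumed to mirror (the mode, var_matrix, log_evidence, and converge fields are referenced later in this assignment):

my_laplace <- function(start_guess, logpost_func, ...)
{
  # find the posterior mode (fnscale = -1 turns optim() into a maximizer)
  fit <- optim(start_guess,
               logpost_func,
               gr = NULL,
               ...,
               method = "BFGS",
               hessian = TRUE,
               control = list(fnscale = -1, maxit = 1001))
  mode <- fit$par
  # posterior covariance is the negative inverse Hessian at the mode
  post_var_matrix <- -solve(fit$hessian)
  p <- length(mode)
  # Laplace approximation to the log-evidence
  int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
  list(mode = mode,
       var_matrix = post_var_matrix,
       log_evidence = int,
       converge = ifelse(fit$convergence == 0, "YES", "NO"))
}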
1e)
The generate_lm_post_samples() function is started for you in the code chunk below. The first argument, mvn_result, is the Laplace Approximation result object returned from the my_laplace() function. The second argument, length_beta, specifies the number of mean trend β-parameters in the model. The last argument, num_samples, specifies the total number of posterior samples to generate. The function is nearly complete, except you must back-transform from the unbounded ϕ parameter to the noise σ. After completing the function, generate 2500 posterior samples from your model stored in the laplace_01 object. Be careful to specify the number of β-parameters correctly!
After generating the posterior samples you will study the posterior distribution on the slope, β1.
PROBLEM Complete the generate_lm_post_samples() function by back-transforming from varphi to sigma. After completing the function, generate 2500 posterior samples from your laplace_01 model. Create a histogram with 55 bins using ggplot2 for the slope, beta_01, and then calculate the probability that the slope is positive.
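A sketch of the completed function, together with a sampling call; the object name post_samples_01 is an assumption carried through the later sketches:

generate_lm_post_samples <- function(mvn_result, length_beta, num_samples)
{
  MASS::mvrnorm(n = num_samples,
                mu = mvn_result$mode,
                Sigma = mvn_result$var_matrix) %>%
    as.data.frame() %>%
    tbl_df() %>%
    purrr::set_names(c(sprintf("beta_%02d", (1:length_beta) - 1), "varphi")) %>%
    # back-transform from the unbounded varphi to the noise sigma
    mutate(sigma = exp(varphi))
}

post_samples_01 <- generate_lm_post_samples(laplace_01, length_beta = 2, num_samples = 2500)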
Problem 02
Now that you can fit a Bayesian linear model, it’s time to work with making posterior predictions from the model. You will use those predictions to calculate and summarize the errors of the model relative to observations. Since RMSE and R-squared have been discussed throughout lecture, you will work with the Mean Absolute Error (MAE) metric.
2a)
The code chunk below starts the post_lm_pred_samples() function. This function generates posterior mean trend predictions and posterior predictions of the response. The first argument, Xnew, is a test design matrix. The second argument, Bmat, is a matrix of posterior samples of the β-parameters, and the third argument, sigma_vector, is a vector of posterior samples of the likelihood noise. The Xnew matrix has rows equal to the number of prediction points, M, and the Bmat matrix has rows equal to the number of posterior samples, S.
You must complete the function by performing the necessary matrix math to calculate the matrix of posterior mean trend predictions, Umat, and the matrix of posterior response predictions, Ymat. You must also complete missing arguments to the definition of the Rmat and Zmat matrices. The Rmat matrix replicates the posterior likelihood noise samples the correct number of times. The Zmat matrix is the matrix of randomly generated standard normal values. You must correctly specify the required number of rows to the Rmat and Zmat matrices.
The post_lm_pred_samples() function returns the Umat and Ymat matrices contained within a list.
PROBLEM Perform the necessary matrix math to calculate the matrix of posterior predicted mean trends Umat and posterior predicted responses Ymat. Specify the number of required rows to create the Rmat and Zmat matrices.
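One way to complete the function, assuming rows index the M prediction points and columns index the S posterior samples:

post_lm_pred_samples <- function(Xnew, Bmat, sigma_vector)
{
  # number of prediction points and posterior samples
  M <- nrow(Xnew)
  S <- nrow(Bmat)
  # posterior mean trend predictions, M x S
  Umat <- Xnew %*% t(Bmat)
  # replicate the noise samples for every prediction point, M x S
  Rmat <- matrix(rep(sigma_vector, M), nrow = M, byrow = TRUE)
  # standard normal draws, M x S
  Zmat <- matrix(rnorm(M * S), nrow = M, ncol = S)
  # posterior response predictions
  Ymat <- Umat + Rmat * Zmat
  list(Umat = Umat, Ymat = Ymat)
}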
2b)
The code chunk below is completed for you. The function make_post_lm_pred() is a wrapper which calls the post_lm_pred_samples() function. It contains two arguments. The first, Xnew, is a test design matrix.
The second, post, is a data.frame of posterior samples. The function extracts the β-parameter posterior samples and converts the object to a matrix. It also extracts the posterior samples on σ and converts to a vector.
make_post_lm_pred <- function(Xnew, post)
{
  Bmat <- post %>% select(starts_with("beta_")) %>% as.matrix()
  sigma_vector <- post %>% pull(sigma)
  post_lm_pred_samples(Xnew, Bmat, sigma_vector)
}
You now have enough pieces in place to generate posterior predictions from your model.
PROBLEM Make posterior predictions on the training set. What are the dimensions of the returned Umat and Ymat matrices? Do the columns correspond to the number of prediction points?
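A sketch of the prediction call, reusing the post_samples_01 object assumed earlier:

post_pred_samples_01 <- make_post_lm_pred(Xmat_01, post_samples_01)

dim(post_pred_samples_01$Umat)
dim(post_pred_samples_01$Ymat)

Under the conventions in the sketches above, both matrices have rows equal to the number of prediction points and columns equal to the number of posterior samples.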
2c)
You will now use the model predictions to calculate the error between the model and the training set observations. Since you generated 2500 posterior samples, you have 2500 different sets of predictions! So, to get started you will focus on the first 3 posterior samples.
PROBLEM Calculate the error between the predicted mean trend and the training set observations for each of the first 3 posterior predicted samples. Assign the errors to separate vectors, as indicated in the code chunk below.
Why are you considering the mean trend when calculating the error with the response, and not the predicted response values?
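Under the convention that the columns of Umat are posterior samples, one way to compute the three error vectors (the names beyond error_01_post_01 are assumptions):

error_01_post_01 <- post_pred_samples_01$Umat[, 1] - train_01$y
error_01_post_02 <- post_pred_samples_01$Umat[, 2] - train_01$y
error_01_post_03 <- post_pred_samples_01$Umat[, 3] - train_01$y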
2d)
You will now calculate the Mean Absolute Error (MAE) associated with each of the three error samples calculated in Problem 2c). However, before calculating the MAE, first consider the dimensionality of the error_01_post_01. What is the length of the error_01_post_01 vector? When you take the absolute value and then average across all elements in that vector, what are you averaging over?
PROBLEM What is the length of the error_01_post_01 vector? Calculate the MAE associated with each of the 3 error vectors you calculated in Problem 2c. What are you averaging over when you calculate the mean absolute error? Are the three MAE values the same? If not, why would they be different?
HINT: The absolute value can be calculated with the abs() function.
length(error_01_post_01)
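The three MAE values can then be computed as, for example:

mean(abs(error_01_post_01))
mean(abs(error_01_post_02))
mean(abs(error_01_post_03))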
2e)
In Problem 2d) you calculated the MAE associated with the first 3 posterior samples. However, you can calculate the MAE associated with every posterior sample. Although it might seem like you need to use a for-loop to do so, R will simplify the operation for you. If you perform an addition or subtraction between a matrix and a vector, R will find the dimension that matches between the two and then repeat the action over the other dimension. Consider the code below, which has a vector, a_numeric, subtracted from a matrix, a_matrix:

a_matrix - a_numeric
Assuming that a_matrix has 10 rows and 25 columns and a_numeric is length 10, R will subtract a_numeric from each column in a_matrix. The result will be another matrix with the same dimensionality as a_matrix.
To confirm this is the case, consider the example below where a vector of length 2 is subtracted from a matrix of 2 rows and 4 columns. The resulting dimensionality is 2 rows by 4 columns.
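A minimal version of that check (the object names are illustrative):

a_small_mat <- matrix(1:8, nrow = 2, ncol = 4)
a_small_vec <- c(1, 2)

dim(a_small_mat - a_small_vec)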
PROBLEM Calculate the absolute value of the error between the mean trend matrix and the training set response. Print the dimensionality of the absE01mat matrix to the screen.

absE01mat <- abs(post_pred_samples_01$Umat - train_01$y)
dim(absE01mat)
2f)
You must now summarize the absolute value errors by averaging them appropriately. Should you average across the rows or down the columns? In R, the colMeans() function calculates the average value associated with each column in a matrix and returns a vector. Likewise, the rowMeans() function calculates the average value along each row and returns a vector. Which function should you use to calculate the MAE associated with each posterior sample?
PROBLEM Calculate the MAE associated with each posterior sample and assign the result to the MAE_01 object. Print the data type (the class) of the MAE_01 to the screen and display its length. Check your result is consistent with the MAEs you previously calculated in Problem 2d).
MAE_01 <- colMeans(absE01mat)

class(MAE_01)
length(MAE_01)
2g)
You have calculated the MAE associated with each posterior sample, and thus represented the uncertainty in the MAE! Why is the MAE uncertain?
PROBLEM Use the quantile() function to print out summary statistics associated with the MAE. You can use the default arguments, and thus pass in MAE_01 into quantile() without setting any other argument. Why is the MAE uncertain? Or put another way, what causes the MAE to be uncertain?
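With its default arguments, quantile() reports the 0%, 25%, 50%, 75%, and 100% quantiles:

quantile(MAE_01)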
Problem 03
You will now make use of the model fitting and prediction functions you created in the previous problems to study the behavior of a more complicated non-linear modeling task. The code chunk below reads in two data sets. Both consist of two continuous variables, an input x and a response y. The first, train_02, will serve as the training set, and the second, test_02, will serve as a hold-out test set. You will only fit models based on the training set.
3a)
PROBLEM Create a scatter plot between the response and the input with ggplot2. Include both the training and test sets together in one graph. Use the marker color to distinguish between the two.
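A sketch of one way to build the figure, stacking the two data sets with an identifying variable:

train_02 %>% mutate(type = "training set") %>%
  bind_rows(test_02 %>% mutate(type = "hold-out test set")) %>%
  ggplot(mapping = aes(x = x, y = y)) +
  geom_point(mapping = aes(color = type))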
3b)
You will fit 25 different models to the training set. You will consider a first-order spline up to a 25th order spline. Your goal will be to find which spline is the “best” in terms of generalizing from the training set to the test set. To do so, you will calculate the MAE on the training set and on the test set for each model. It will be rather tedious to set up all of the necessary information by hand, manually train each model, generate posterior samples, and make predictions from each model. Therefore, you will work through completing functions that will enable you to programmatically loop over each candidate model.
You will start out by completing a function to create the training set and test set for a desired spline basis. The function make_spline_basis_mats() is started for you in the first code chunk below. The first argument is the desired spline basis order, J. The second argument is the training set, train_data, and the third argument is the hold-out test set, test_data.
The second through fifth code chunks below are provided to check that you completed the function correctly. The second code chunk uses purrr::map_dfr() to create the training and test matrices for all 25 models. A glimpse of the resulting object, spline_matrices, is displayed to the screen for you in the third code chunk.
It is printed to the screen in the fourth code chunk. You should see a tibble consisting of two variables, design_matrix and test_matrix. Both variables are lists containing matrices. The matrices contained in the spline_matrices$design_matrix variable are the different training design matrices, while the matrices contained in spline_matrices$test_matrix are the associated hold-out test basis matrices.
The fifth code chunk below prints the dimensionality of the 1st and 2nd order spline basis matrices to the screen. It shows that to access a specific matrix, you need to use the [[]] notation.
PROBLEM Complete the code chunk below. You must specify the splines::ns() function call correctly such that the degrees-of-freedom, df argument equals the desired spline basis order and that the basis is applied to the x variable within the user supplied train_data argument. The knots are extracted for you and saved to the knots_use_basis object. Create the training design matrix by calling the model.matrix() function with the splines::ns() function to create the basis for the x variable with knots equal to knots_use_basis. Make sure you assign the data sets correctly to the data argument of model.matrix().
How many rows are in the training matrices and how many rows are in the test matrices?
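A sketch of the completed function, assuming the tibble row structure described above:

make_spline_basis_mats <- function(J, train_data, test_data)
{
  # build the training basis to determine the knot locations
  knots_use_basis <- as.vector(attributes(splines::ns(train_data$x, df = J))$knots)
  # training design matrix
  design_matrix <- model.matrix( ~ splines::ns(x, knots = knots_use_basis), data = train_data)
  # the hold-out test basis matrix must use the SAME knots
  test_matrix <- model.matrix( ~ splines::ns(x, knots = knots_use_basis), data = test_data)
  tibble::tibble(design_matrix = list(design_matrix),
                 test_matrix = list(test_matrix))
}

The training matrices have as many rows as the training set, while the test matrices have as many rows as the hold-out test set.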
3c)
Each element in the spline_matrices$design_matrix object is a separate design matrix. You will use this structure to programmatically train each model. The first code chunk creates a list of information which stores the training set responses and defines the prior hyperparameters. The second code chunk below defines the manage_spline_fit() function. The first argument is a design matrix Xtrain, the second argument is the log-posterior function, logpost_func, and the third argument is my_settings. manage_spline_fit() sets the initial starting values to the β parameters by generating random values from a standard normal. The initial value for the unbounded ϕ parameter is set by log-transforming a random draw from the prior on σ. After creating the initial guess values, the my_laplace() function is called to fit the model.
You will complete both code chunks in order to programmatically train all 25 spline models. After completing the first two code chunks, the third code chunk performs the training for you. The fourth code chunk below shows how to access training results associated with the second-order spline by using the [[]] operator. The fifth code chunk checks that each model converged.
PROBLEM Complete the first two code chunks below. In the first code chunk, assign the training responses to the yobs variable within the info_02_train list. Specify the prior mean and prior standard deviation on the β-parameters to be 0 and 20, respectively. Specify the rate parameter on the unknown σ to be 1.
Complete the second code chunk by generating a random starting guess for all β-parameters from a standard normal. Create the random initial guess for ϕ by generating a random number from the Exponential prior distribution on σ and log-transforming the variable. Complete the call to the my_laplace() function by passing in the initial values as a vector of the correct size.
HINT: How can you determine the number of unknown β-parameters if you know the training design matrix?
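A sketch of the two completions; attaching the design matrix to the settings list inside manage_spline_fit() is an assumption about how the pieces connect:

info_02_train <- list(
  yobs = train_02$y,
  mu_beta = 0,
  tau_beta = 20,
  sigma_rate = 1
)

manage_spline_fit <- function(Xtrain, logpost_func, my_settings)
{
  # attach the design matrix for this particular spline model
  my_settings$design_matrix <- Xtrain
  # random starting guesses: standard normal draws for the betas,
  # the log of an Exponential draw from the sigma prior for varphi
  init_beta <- rnorm(ncol(Xtrain))
  init_varphi <- log(rexp(1, rate = my_settings$sigma_rate))
  my_laplace(c(init_beta, init_varphi), logpost_func, my_settings)
}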
3d)
With all 25 spline models fit, it is time to assess which model is the best. Several different approaches have been discussed in lecture for how to identify the “best” model. You will start out by calculating the MAE on the training set and the test set. You went through the steps to generate posterior samples, make posterior predictions and to calculate the posterior MAE distribution in Problem 2. You will now define a function which performs all of those steps together.
The function calc_mae_from_laplace() is started for you in the first code chunk below. The first argument, mvn_result, is the result of the my_laplace() function for a particular model. The second and third arguments, Xtrain and Xtest, are the training and test basis matrices associated with the model, respectively. The fourth and fifth arguments, y_train and y_test, are the observations on the training set and test set, respectively. The last argument, num_samples, is the number of posterior samples to generate.
You will complete the necessary steps to generate posterior samples from the model, predict the training set, and predict the test set. Then you will calculate the training set MAE and test set MAE associated with each posterior sample. The last portion of the calc_mae_from_laplace() function is mostly completed for you.
After you complete calc_mae_from_laplace(), the second code chunk applies the function to all 25 spline models. It is nearly complete. You must specify the arguments which define the training set observations, y_train, the test set observations, y_test, and the number of posterior samples, num_samples.
PROBLEM Complete all steps to calculate the MAE on the training set and test sets in the first code chunk below. Complete the lines of code in order to: generate posterior samples from the supplied mvn_result object, make posterior predictions on the training set, make posterior predictions on the test set, and then calculate the MAE associated with each posterior sample on the training and test sets. In the bookkeeping portion of the function, you must specify the order of the spline model.
You must specify the training set and test set observed responses correctly in the second code chunk. You must specify the number of posterior samples to be 2500.
HINT: Remember that the result of the make_post_lm_pred() function is a list!
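A sketch of the completed function; the post_id column and the dataset labels are assumptions about the bookkeeping portion:

calc_mae_from_laplace <- function(mvn_result, Xtrain, Xtest, y_train, y_test, num_samples)
{
  # generate posterior samples; the number of betas is set by the design matrix
  post <- generate_lm_post_samples(mvn_result, ncol(Xtrain), num_samples)
  # posterior predictions on each split; remember the result is a list
  pred_train <- make_post_lm_pred(Xtrain, post)
  pred_test <- make_post_lm_pred(Xtest, post)
  # MAE per posterior sample: average the absolute errors down each column
  mae_train <- colMeans(abs(pred_train$Umat - y_train))
  mae_test <- colMeans(abs(pred_test$Umat - y_test))
  # bookkeeping: the spline order J is the number of basis columns minus the intercept
  tibble::tibble(J = ncol(Xtrain) - 1,
                 mae = c(mae_train, mae_test),
                 dataset = rep(c("training set", "hold-out test set"), each = num_samples)) %>%
    tibble::rowid_to_column("post_id")
}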
3e)
If you completed the calc_mae_from_laplace() function correctly, you should have an object with 125000 rows and just 4 columns. The object was structured in a “tall” or “long-format” in order to allow summarizing all of the posterior MAE samples with ggplot2. You will summarize the MAE posterior distributions with boxplots and by focusing on the median MAE. To focus on the median MAE, you will use the stat_summary() function. This is a flexible function capable of creating many different geometric objects. It consists of arguments such as geom to specify the type of geometric object to display and fun.y to specify what function to apply to the y aesthetic.
PROBLEM Complete the two code chunks below. In the first code chunk, pipe the all_spline_mae_results object into ggplot(). Set the x aesthetic to be as.factor(J) and the y aesthetic to be mae. In the geom_boxplot() call, map the fill aesthetic to the dataset variable. Use the scale scale_fill_brewer() with the palette argument set equal to "Set1".
In the second code chunk, pipe the all_spline_mae_results object into a filter() call. Keep only the models with the spline order greater than 3. Pipe the result into a ggplot() call where you set the x aesthetic to J and the y aesthetic to mae. Use the stat_summary() call to calculate the median MAE for each spline order. You must set the geom argument within stat_summary() to be 'line' and the fun.y argument to be 'median'. Rather than coloring by dataset, use facet_wrap() to specify separate facets (subplots) for each dataset.
Based on these visualizations which models appear to overfit the training data?
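Sketches of the two figures, assuming the column names from the function above:

all_spline_mae_results %>%
  ggplot(mapping = aes(x = as.factor(J), y = mae)) +
  geom_boxplot(mapping = aes(fill = dataset)) +
  scale_fill_brewer(palette = "Set1")

all_spline_mae_results %>%
  filter(J > 3) %>%
  ggplot(mapping = aes(x = J, y = mae)) +
  stat_summary(geom = 'line', fun.y = 'median') +
  facet_wrap(~ dataset)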
3f)
By comparing the posterior MAE values on the training and test splits, you are trying to assess how well the models generalize to new data. As discussed in lecture, other metrics exist for trying to assess how well a model generalizes based just on the training set performance. The Evidence, or marginal likelihood, attempts to evaluate generalization by integrating the likelihood over all a-priori allowed parameter combinations. If the Evidence can be calculated, it can be used to weight all models relative to each other, thereby allowing you to assess which model appears to be “most probable”.
You will now calculate the posterior model weights associated with each model. The first code chunk below is completed for you: it extracts the log-evidence associated with each model into the numeric vector spline_evidence. You will use the log-evidence to calculate the posterior model weights and visualize the results with a bar graph.
PROBLEM Calculate the posterior model weights associated with each spline model and assign the weights to the spline_weights variable. The spline_weights vector is assigned to the w variable in a tibble and the result is piped into tibble::rowid_to_column() where the row identification label is named J to correspond to the spline basis order. Pipe the result into a ggplot() call where you set the x aesthetic to as.factor(J) and the y aesthetic to w. Include a geom_bar() geometric object where the stat argument is set to "identity".
Based on your visualization, which model is considered the best?
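A sketch of the weight calculation; subtracting the maximum log-evidence before exponentiating guards against numerical underflow:

spline_weights <- exp(spline_evidence - max(spline_evidence)) /
  sum(exp(spline_evidence - max(spline_evidence)))

tibble::tibble(w = spline_weights) %>%
  tibble::rowid_to_column("J") %>%
  ggplot(mapping = aes(x = as.factor(J), y = w)) +
  geom_bar(stat = "identity")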
[Figure: bar chart of the posterior model weights, w (y-axis, 0.00 to 0.75), versus the spline basis order, as.factor(J) (x-axis, 1 through 25).]
3g)
You have compared the models in several different ways; are your conclusions the same?
PROBLEM How well do the assessments from the data split comparison compare to the Evidence-based assessment? If the conclusions are different, why would they be different?
Problem 04
In lecture we discussed how basis functions allow extending the linear model to handle non-linear relationships. It was also discussed how to generalize the linear modeling approach to binary outcomes with logistic regression. In this problem you will define the log-posterior function for logistic regression. By doing so, you will be able to directly contrast what you did to define the log-posterior function for the linear model in previous problems within this assignment.
4a)
The complete probability model for logistic regression consists of the likelihood of the response yn given the event probability µn, the inverse link function between the probability of the event, µn, and the linear predictor, ηn, and the prior on all linear predictor model coefficients β.
As in lecture, you will assume that the β-parameters are a-priori independent Gaussians with a shared prior mean µβ and a shared prior standard deviation τβ.
PROBLEM Write out the complete probability model for logistic regression. You must write out the n-th observation’s linear predictor using the inner product of the n-th row of a design matrix, xn,:, and the unknown β-parameter column vector. You can assume that the number of unknown coefficients is equal to D + 1.
You are allowed to separate each equation into its own equation block.
HINT: The “given” sign, the vertical line, |, is created by typing \mid in a latex math expression. The product symbol (the giant PI) is created with \prod_{}^{}.
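One way to write out the model, consistent with the notation above:

p(y | µ) = ∏_{n=1}^{N} (µn)^(yn) × (1 − µn)^(1−yn)

µn = logit⁻¹(ηn) = exp(ηn) / (1 + exp(ηn))

ηn = xn,: β

βd | µβ, τβ ~ normal(µβ, τβ), for d = 0, 1, ..., D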
4b)
The code chunk below loads in a data set consisting of two variables. A continuous input x and a binary outcome y. The binary outcome is encoded as 0 if the event does not occur and 1 if the event does occur. The count() function is used to count the number of observations associated with each binary class in the second code chunk. As shown below, the classes are roughly balanced.
train_03 <- readr::read_csv("https://raw.githubusercontent.com/jyurko/CS_1675_Spring_2020/master/hw_data
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_double()
## )
train_03 %>% glimpse()
## Observations: 100
## Variables: 2
## $ x <dbl> -0.99886904, -1.35011298, 0.95776645, -0.60482885, 1.90268647, -0.1...
## $ y <dbl> 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1...
train_03 %>% count(y)
## # A tibble: 2 x 2
##       y     n
##   <dbl> <int>
## 1     0    54
## 2     1    46
You will fit two logistic regression models to the dataset. The first will be a linear relationship for the linear predictor and the second will be a cubic relationship. The linear predictor expressions are written for you in the equation blocks below.
ηn = β0 + β1xn
ηn = β0 + β1xn + β2xn² + β3xn³
Before creating the log-posterior function for logistic regression, you will create lists of necessary information in the same style that you did in the previous regression problems. You must create the design matrices associated with the linear and cubic relationships, Xmat_03_line and Xmat_03_cube. You must then complete the lists info_03_line and info_03_cube by setting the observed responses to the yobs variable, the design matrices to the design_matrix variable, and also set the β prior hyperparameters.
PROBLEM Create the design matrices for the linear and cubic relationships. Specify the lists of required information for both models. Specify the prior mean to be 0 and the prior standard deviation to be 5.
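A minimal sketch of the required objects:

Xmat_03_line <- model.matrix( y ~ x, data = train_03)
Xmat_03_cube <- model.matrix( y ~ x + I(x^2) + I(x^3), data = train_03)

info_03_line <- list(
  yobs = train_03$y,
  design_matrix = Xmat_03_line,
  mu_beta = 0,
  tau_beta = 5
)

info_03_cube <- list(
  yobs = train_03$y,
  design_matrix = Xmat_03_cube,
  mu_beta = 0,
  tau_beta = 5
)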
4c)
You will now define the log-posterior function for logistic regression, glm_logpost(). The first argument to glm_logpost() is the vector of unknowns and the second argument is the list of required information. You will assume that the variables within the my_info list are those contained in the info_03_line list you defined previously.
PROBLEM Complete the code chunk to define the glm_logpost() function. The comments describe what you need to fill in. Do you need to separate out the β-parameters from the vector of unknowns?
After you complete glm_logpost(), test it by setting the unknowns vector to be a vector of -1’s for the linear relationship case. If you have successfully programmed the function you should get a result equal to -118.3933.
glm_logpost <- function(unknowns, my_info)
{
  # extract the design matrix and assign to X
  X <- my_info$design_matrix
  # calculate the linear predictor
  eta <- as.vector(as.matrix(X) %*% as.matrix(unknowns))
  # calculate the event probability
  mu <- exp(eta) / (1 + exp(eta))
  # evaluate the log-likelihood
  log_lik <- sum(my_info$yobs * log(mu) + (1 - my_info$yobs) * log(1 - mu))
  # evaluate the log-prior
  log_prior <- sum(dnorm(unknowns, my_info$mu_beta, my_info$tau_beta, log = TRUE))
  # sum together
  log_lik + log_prior
}
4d)
Execute the Laplace Approximation for the linear and cubic models. Use initial guess values of zero for both models. After fitting both models, calculate the middle 95% uncertainty intervals for the cubic model parameters.
PROBLEM Perform the Laplace Approximation for the linear and cubic models. Set the initial guesses to vectors of zero for both models. Should you be concerned that the initial guess will impact the results?
After fitting, calculate the middle 95% uncertainty interval on the β parameters for both models. Which parameters contain zero in the middle 95% uncertainty interval?
laplace_03_line <- my_laplace(rep(0, ncol(Xmat_03_line)), glm_logpost, info_03_line)
laplace_03_cube <- my_laplace(rep(0, ncol(Xmat_03_cube)), glm_logpost, info_03_cube)
SOLUTION Calculate the posterior middle 95% uncertainty interval on the parameters associated with each model.
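Under the Gaussian (Laplace) approximation, the middle 95% interval is the posterior mode plus or minus 1.96 posterior standard deviations:

post_sd_line <- sqrt(diag(laplace_03_line$var_matrix))
cbind(lwr = laplace_03_line$mode - 1.96 * post_sd_line,
      upr = laplace_03_line$mode + 1.96 * post_sd_line)

post_sd_cube <- sqrt(diag(laplace_03_cube$var_matrix))
cbind(lwr = laplace_03_cube$mode - 1.96 * post_sd_cube,
      upr = laplace_03_cube$mode + 1.96 * post_sd_cube)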
4e)
Let’s compare the performance of the models using the Evidence-based assessment. Since we only have two models, calculate the Bayes Factor between the linear model and the cubic model. Based on the Bayes Factor, which model do you think is better?
PROBLEM Calculate the Bayes Factor with the linear model in the numerator and the cubic model in the denominator. Which model is supported more by the data?

exp(laplace_03_line$log_evidence) / exp(laplace_03_cube$log_evidence)
4f)
You will now spend a little more time with the cubic relationship results. Calculate the posterior correlation matrix between the unknown β parameters. Are any of the parameters highly correlated or anti-correlated?
PROBLEM Calculate the posterior correlation matrix between the parameters in the cubic model.
cov2cor(laplace_03_cube$var_matrix)
Problem 05
In Problem 04, you compared the linear and cubic relationships based on the Evidence. Your assessment considered how well the model “fit” the data via the likelihood, based on the constraints imposed by the prior. The likelihood examines how likely the binary outcome is given the event probability. Thus, the Evidence is considering if the observations are consistent with the modeled event probability. In this problem, you will consider point-wise error metrics by calculating the confusion matrix associated with the training set. Confusion matrices are useful because the accuracy and errors are in the same “units” of the data.
However, the logistic regression model predicts the event probability via the log-odds ratio. In order to move from the probability to the binary outcome a decision must be made. As discussed during the Applied Machine Learning portion of the course, the decision consists of comparing the predicted probability to a threshold value. If the predicted probability is greater than the threshold, classify the outcome as the event. Otherwise, classify the outcome as the non-event.
In order to classify the training points, you must make posterior predictions with the logistic regression models you fit in Problem 04.
5a)
Although you were able to apply the my_laplace() function to both the regression and logistic regression settings, you cannot directly apply the generate_lm_post_samples() function. You will therefore adapt generate_lm_post_samples() and define generate_glm_post_samples(). The code chunk below starts the function for you and uses just two input arguments, mvn_result and num_samples. You must complete the function.
PROBLEM Why can you not directly use the generate_lm_post_samples() function? Since the length_beta argument is NOT provided to generate_glm_post_samples(), how can you determine the number of β-parameters? Complete the code chunk below by first assigning the number of β-parameters to the length_beta variable. Then generate the random samples from the MVN distribution. You do not have to name the variables, you only need to call the correct random number generator.
generate_glm_post_samples <- function(mvn_result, num_samples)
{
  # specify the number of unknown beta parameters
  length_beta <- ncol(mvn_result$var_matrix)
  # generate the random samples
  beta_samples <- MASS::mvrnorm(n = num_samples,
                                mu = mvn_result$mode,
                                Sigma = mvn_result$var_matrix)
  # change the data type and name
  beta_samples %>%
    as.data.frame() %>%
    tbl_df() %>%
    purrr::set_names(sprintf("beta_%02d", (1:length_beta) - 1))
}
5b)
You will now define a function which calculates the posterior prediction samples on the linear predictor and the event probability. The function, post_glm_pred_samples() is started for you in the code chunk below. It consists of two input arguments Xnew and Bmat. Xnew is a test design matrix where rows correspond to prediction points. The matrix Bmat stores the posterior samples on the β-parameters, where each row is a posterior sample.
PROBLEM Complete the code chunk below by using matrix math to calculate the linear predictor at every posterior sample. Then, calculate the event probability for every posterior sample.
The eta_mat and mu_mat matrices are returned within a list, similar to how the Umat and Ymat matrices were returned for the regression problems.
HINT: The boot::inv.logit() can take a matrix as an input. When it does, it returns a matrix as a result.
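A sketch of the completed function:

post_glm_pred_samples <- function(Xnew, Bmat)
{
  # linear predictor at every posterior sample, M x S
  eta_mat <- Xnew %*% t(Bmat)
  # event probability via the inverse logit, applied elementwise
  mu_mat <- boot::inv.logit(eta_mat)
  list(eta_mat = eta_mat, mu_mat = mu_mat)
}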
5c)
The code chunk below defines a function summarize_glm_pred_from_laplace() which manages the actions necessary to summarize posterior predictions of the event probability. The first argument, mvn_result, is the Laplace Approximation object. The second argument is the test design matrix, Xtest, and the third argument, num_samples, is the number of posterior samples to make. You must follow the comments within the function in order to generate posterior prediction samples of the linear predictor and the event probability, and then to summarize the posterior predictions of the event probability.
The result from summarize_glm_pred_from_laplace() summarizes the posterior predicted event probability with the posterior mean, as well as the 5th and 95th quantiles. If you have completed the post_glm_pred_samples() function correctly, the dimensions of the mu_mat matrix should be consistent with those from the Umat matrix from the regression problems. The posterior summary statistics summarize over all posterior samples. You must therefore choose between colMeans() and rowMeans() as to how to calculate the posterior mean event probability for each prediction point. The posterior quantiles are calculated for you.
PROBLEM Follow the comments in the code chunk below to complete the definition of the summarize_glm_pred_from_laplace() function. You must generate posterior samples, make posterior predictions, and then summarize the posterior predictions of the event probability.
HINT: The result from post_glm_pred_samples() is a list.
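A sketch of the completed function; the summary column names match those used in the plotting code later in this problem:

summarize_glm_pred_from_laplace <- function(mvn_result, Xtest, num_samples)
{
  # generate posterior samples of the betas
  post <- generate_glm_post_samples(mvn_result, num_samples)
  # make posterior predictions; the result is a list
  pred_test <- post_glm_pred_samples(Xtest, as.matrix(post))
  # summarize over the posterior samples at each prediction point:
  # rows are prediction points, so use rowMeans(), not colMeans()
  mu_avg <- rowMeans(pred_test$mu_mat)
  mu_q05 <- apply(pred_test$mu_mat, 1, stats::quantile, probs = 0.05)
  mu_q95 <- apply(pred_test$mu_mat, 1, stats::quantile, probs = 0.95)
  tibble::tibble(mu_avg = mu_avg, mu_q05 = mu_q05, mu_q95 = mu_q95) %>%
    tibble::rowid_to_column("pred_id")
}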
5d)
Summarize the posterior predicted event probability associated with the training set for both the linear and cubic relationships. After making the predictions, a code chunk is provided for you which generates a figure showing how the posterior predicted probability summaries compare with the observed binary outcomes. Which of the two models appears to capture the trends in the binary outcomes better?
PROBLEM Call summarize_glm_pred_from_laplace() for the linear and cubic relationships on the training set. Specify the number of posterior samples to be 2500. Print the dimensions of the resulting objects to the screen. How many rows are in each data set?
The third code chunk below uses the prediction summaries to visualize the posterior predicted event probability on the training set. Which relationship seems more in line with the observations?
There are 100 rows per summary.
The figure below is created for you. The posterior predicted mean event probabilities are displayed by the navyblue curves. The posterior predicted middle 90% uncertainty intervals on the event probabilities are shown by the light blue ribbon.
post_pred_summary_03_line %>% mutate(type = "linear relationship") %>%
  bind_rows(post_pred_summary_03_cube %>% mutate(type = "cubic relationship")) %>%
  left_join(train_03 %>% tibble::rowid_to_column("pred_id"), by = "pred_id") %>%
  ggplot(mapping = aes(x = x)) +
  geom_ribbon(mapping = aes(ymin = mu_q05, ymax = mu_q95, group = type),
fill = "steelblue", alpha = 0.5) +
geom_line(mapping = aes(y = mu_avg, group = type),
color = "navyblue", size = 1.15) +
geom_point(mapping = aes(y = y), size = 2.5, alpha = 0.25) +
facet_grid( . ~ type) +
labs(y = "y or probability") +
theme_bw()
5e)
You will now consider classifying the predictions based upon a threshold value of 0.5. You will compare that threshold value to the posterior predicted event probabilities associated with the training set. Although the Bayesian model provides a full posterior predictive distribution, you will work just with the posterior mean value. Thus, you will create a single confusion matrix, rather than considering the uncertainty in the confusion matrix.
Creating the confusion matrix is rather simple compared to some of the previous tasks in this assignment. The first step is to classify each prediction as event or non-event, which can be accomplished with an if-statement. The ifelse() function provides an “Excel-like” conditional statement and is a simple way to perform the classification task. The syntax for ifelse() consists of three arguments, shown below:
ifelse(<conditional test>, <return if condition is TRUE>, <return if condition is FALSE>)
The first argument is the conditional test you wish to apply. The second argument is what will be returned if the condition is true, and the third argument is what will be returned if the condition is false.
You will use the ifelse() function to compare the posterior predicted mean event probability to the assumed threshold value of 0.5.
PROBLEM Pipe the post_pred_summary_03_line object into a mutate() call and create a new variable pred_y which is the result of an ifelse() operation. For the conditional test, return a value of 1 if the posterior predicted mean event probability is greater than 0.5, and return 0 otherwise. Repeat the process for the post_pred_summary_03_cube object.
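A sketch of the two classification steps; the _b suffix matches the object names used in Problem 5f):

post_pred_summary_03_line_b <- post_pred_summary_03_line %>%
  mutate(pred_y = ifelse(mu_avg > 0.5, 1, 0))

post_pred_summary_03_cube_b <- post_pred_summary_03_cube %>%
  mutate(pred_y = ifelse(mu_avg > 0.5, 1, 0))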
5f)
The code chunk below uses the left_join() function to merge the training data set, train_03 with each of the posterior prediction summary objects. The results, post_pred_summary_03_line_c and post_pred_summary_03_cube_c both now have predicted classifications, pred_y, and observed outcomes y.
post_pred_summary_03_line_c <- post_pred_summary_03_line_b %>%
  left_join(train_03 %>% tibble::rowid_to_column("pred_id"),
            by = "pred_id")

post_pred_summary_03_cube_c <- post_pred_summary_03_cube_b %>%
  left_join(train_03 %>% tibble::rowid_to_column("pred_id"),
            by = "pred_id")
You now have everything you need to calculate the confusion matrix for the linear and cubic relationship models. A simple way to do this is with the count() function from dplyr, which counts the unique combinations of the variables you provide to it. count() returns a data.frame with the combination of the columns used to perform the grouping and counting operation, as well as a new column n which stores the number of rows associated with each combination.
PROBLEM Use the count() function to determine the confusion matrix associated with each relationship. How many true-positives, true-negatives, false-positives, and false-negatives does each relationship have?
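A sketch of the counting step; each combination of pred_y and y maps to one cell of the confusion matrix (pred_y = 1 with y = 1 are true-positives, pred_y = 1 with y = 0 are false-positives, and so on):

post_pred_summary_03_line_c %>% count(pred_y, y)

post_pred_summary_03_cube_c %>% count(pred_y, y)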