Problem 1
You are tasked by a manufacturing company to study the variability of a component that they produce. This particular component has an important figure of merit that is measured on every single manufactured piece. The engineering teams feel that the measurement is quite noisy. Your goal will be to learn the unknown mean and unknown noise of this figure of merit based on the noisy observations.
The data set that you will work with is loaded in the code chunk below and assigned to the df_01 object. As the glimpse() function shows, df_01 consists of two variables, obs_id and x. There are 405 rows or observations. obs_id is the “observation index” and x is the measured value.
(glimpse() output: the first few values of x are 4.0119121, -0.5633241, -8.2168261, -1.3070010, 1.44...)
1a)
Let’s start out by visualizing and summarizing the available measurements.
PROBLEM Create a “run chart” of the measurements by piping the df_01 data set into ggplot(). Map the x aesthetic to obs_id and the y aesthetic to x. Use geom_line() and geom_point() geometric objects to display the measured value with respect to the observation index.
Visually, do any values appear to stand out from the “bulk” of the measurements?
df_01 %>% ggplot(mapping = aes(x = obs_id, y = x)) +
  geom_line() + geom_point()
1b)
PROBLEM Create a histogram with geom_histogram(), by mapping the x aesthetic to the variable x. Set the number of bins to be 25.
Based on the histogram and the run chart, where are most of the measurement values concentrated? What shape does the distribution look like?
df_01 %>% ggplot(mapping = aes(x = x)) +
  geom_histogram(bins = 25)
1c)
PROBLEM Use the summary() function to summarize the df_01 data set. Then calculate the 5th and 95th quantile estimates of the variable x. Lastly, calculate the standard deviation of the x variable. Are the summary statistics in line with your conclusions based on the run chart and histogram?
HINT: The quantiles can be estimated via the quantile() function. Type ?quantile into the R Console for help with the arguments to the quantile function.
### apply the summary function
df_01 %>% summary()
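The quantile and standard deviation calculations could then be carried out along the following lines (a sketch; the probs values follow the problem statement):

quantile(df_01$x, probs = c(0.05, 0.95))

sd(df_01$x)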
DISCUSSION: The 5th percentile value of around -7.8 looks to be around the correct value based on the run chart and histogram. So too does the 95th percentile value of around 9.2.
1d)
The summary statistics in Problem 1c) were estimated by using all of the 405 observations. The manufacturing company produces several other components which could also be measured. It will take a considerable effort to measure the same number of components as are available in the current data set. Therefore, it is important to understand the behavior of the sample average with respect to the number of observations.
In dplyr we can calculate “rolling” sample averages with the cummean() function. This function “accumulates” all of the observations up to the current index and calculates the sample average. We can call the cummean() function directly within a mutate() call, which allows considerable flexibility, since we do not have to use for-loops for this operation. Let’s visualize how the “rolling sample average” changes over time.
PROBLEM Pipe the df_01 data set into a mutate() call. Create a new column (variable) in the data set, rolling_avg, which is equal to the rolling average of x. Pipe the result into ggplot() and map the x aesthetic to obs_id and the y aesthetic to rolling_avg. Display the rolling average with respect to the observation index with the geom_line() geometric object.
Which portion of the rolling average appears the most variable? Does the rolling average steady-out at all?
df_01 %>% mutate(rolling_avg = cummean(x)) %>%
  ggplot(mapping = aes(x = obs_id, y = rolling_avg)) + geom_line()
1e)
In this problem, you will focus on two specific sample sizes. You will calculate the sample average after 5 observations and 45 observations.
PROBLEM What does the sample average equal after 5 observations? What does it equal after 45 observations?
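One way to answer this (a sketch, using base R subsetting rather than the rolling average column):

mean(df_01$x[1:5])   # sample average after the first 5 observations
mean(df_01$x[1:45])  # sample average after the first 45 observations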
Problem 2
With a simple Exploratory Data Analysis (EDA) completed, it’s time to begin modeling. You are interested in learning the unknown mean of the figure of merit and the unknown noise in the measurement process. Because the manufacturing company is interested in understanding the behavior of the estimates “over time”, it will be critical to assess uncertainty in the unknowns at low sample sizes. Therefore, you use a Bayesian approach to estimating the unknown mean, µ, and unknown noise, σ.
The figure of merit, x, is a continuous variable, and after discussions with the engineering teams, it seems that a Gaussian likelihood is a reasonable assumption to make. You will assume that each observation is conditionally independent of the others given µ and σ. You will also assume that the two unknowns are a priori independent. The joint posterior distribution on µ and σ conditioned on the N observations x is therefore proportional to:
p(µ,σ | x) ∝ { ∏_{n=1}^{N} normal(x_n | µ,σ) } × p(µ) × p(σ)
2a)
You will be applying the Laplace Approximation to learn the unknown µ and σ. In lecture, and in homework 02, it was discussed that it is a good idea to first transform σ to a new variable ϕ, before executing the Laplace Approximation.
PROBLEM Why is it a good idea to transform σ before performing the Laplace Approximation?
The Laplace Approximation fits a multivariate normal (MVN) distribution, which is defined over the entire real line, while σ is constrained to be positive. Transforming σ to an unbounded parameter ϕ removes that constraint and tends to make the log-posterior closer to quadratic, so the MVN approximation places no probability on invalid values of σ; transforming therefore kills two birds with one stone.
2b)
Consider a general transformation, or link function, applied to the unknown σ:
ϕ = g(σ)
The inverse link function which allows “back-transforming” from ϕ to σ is:
σ = g⁻¹(ϕ)
PROBLEM Write out the joint-posterior between the unknown µ and unknown transformed parameter ϕ by accounting for the probability change-of-variables formula. You must keep the prior on µ and σ general for now, and make use of the general inverse link function notation. Denote the likelihood as a Gaussian likelihood, as shown in the Problem 2 description.
2c)
After discussions with the engineering teams at the manufacturing company, it was decided that a normal prior on µ and an Exponential prior on σ are appropriate. The normal prior is specified in terms of a prior mean, µ0, and prior standard deviation, τ0. The Exponential prior is specified in terms of a rate parameter, λ. The prior distributions are written out for you below.
p(µ) = normal(µ | µ0,τ0)
p(σ) = Exp(σ | λ)
Because you will be applying the Laplace approximation, you decide to use a log-transformation as the link function:
ϕ = log(σ)
The engineering team feels fairly confident that the measurement process is rather noisy. They feel the noise is obscuring a mean figure of merit value that is actually negative. For that reason, it is decided to set the hyperparameters to be µ0 = −1/2, τ0 = 1, and λ = 1/3. You are first interested in how their prior beliefs change due to the first 5 observations.
You will define a function to calculate the un-normalized log-posterior for the unknown µ and ϕ. You will follow the format discussed in lecture, and so the hyperparameters and measurements will be supplied within a list. The list of the required information is defined for you in the code chunk below. The hyperparameter values are specified, and the first 5 observations are stored in the variable xobs. The list is assigned to the object info_inform_N05 to denote that the prior on the unknown mean is informative and the first 5 observations are used.
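The defining code chunk is not reproduced here; based on the analogous lists shown in Problem 2g), it presumably looks like:

info_inform_N05 <- list(
  xobs = df_01$x[1:5],
  mu_0 = -0.5, tau_0 = 1, sigma_rate = 1/3
)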
PROBLEM You must complete the code chunk below in order to define the my_logpost() function. This function is in terms of the unknown mean µ and unknown transformed parameter ϕ. Therefore you MUST account for the change-of-variables transformation. The first argument to my_logpost() is a vector, unknowns, which contains all of the unknown parameters to the model. The second argument, my_info, is the list of required information. The unknown mean and unknown transformed parameter are extracted from unknowns, and are assigned to the lik_mu and lik_varphi variables, respectively. You must complete the rest of the code chunk. Comments provide hints for what calculations you must perform.
You ARE allowed to use built-in R functions to evaluate densities.
HINT: the info_inform_N05 list of information provides to you the names of the hyperparameter and observation variables to use in the my_logpost() function. This way all students will use the same nomenclature for evaluating the log-posterior.
HINT: The second code chunk below provides three test cases for you. If you correctly coded the my_logpost() function, the first test case which sets unknowns = c(0, 0) will yield a value of -76.75337. The second test case which sets unknowns = c(-1, -1) yields a value of -545.4938. The third test case which sets unknowns = c(1, 2) yields a value of -19.49936. If you do not get three values very close to those printed here, there is a typo in your function.
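One possible way to complete my_logpost() is sketched below, assuming the log link so that σ = exp(ϕ) and the hyperparameter names from the info_inform_N05 list; the last term is the log-derivative (Jacobian) adjustment for the change-of-variables. Check any implementation against the three test values in the hint.

my_logpost <- function(unknowns, my_info) {
  # extract the unknowns
  lik_mu <- unknowns[1]
  lik_varphi <- unknowns[2]
  # back-transform from varphi to sigma
  lik_sigma <- exp(lik_varphi)
  # log-likelihood of the observations
  log_lik <- sum(dnorm(x = my_info$xobs, mean = lik_mu, sd = lik_sigma, log = TRUE))
  # log-priors on mu and sigma
  log_prior_mu <- dnorm(x = lik_mu, mean = my_info$mu_0, sd = my_info$tau_0, log = TRUE)
  log_prior_sigma <- dexp(x = lik_sigma, rate = my_info$sigma_rate, log = TRUE)
  # log-derivative adjustment: log|d sigma / d varphi| = varphi
  log_derive_adjust <- lik_varphi
  # un-normalized log-posterior in the (mu, varphi) space
  log_lik + log_prior_mu + log_prior_sigma + log_derive_adjust
}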
2d)
Because this problem consists of just two unknowns, we can graphically explore the true log-posterior surface. We did this in lecture by studying in both the original (µ,σ) space, as well as the transformed and unbounded (µ,ϕ) space. However, in this homework assignment, you will continue to focus on the transformed parameters µ and ϕ.
In order to visualize the log-posterior surface, you must define a grid of parameter values that will be applied to the my_logpost() function. A simple way to create a full-factorial grid of combinations is with the expand.grid() function. The basic syntax of the expand.grid() function is shown in the code chunk below for two variables, x1 and x2. The x1 variable is a vector of just two values c(1, 2), and the variable x2 is a vector of 3 values 1:3. As shown in the code chunk output printed to the screen the expand.grid() function produces all 6 combinations of these two variables. The variables are stored as columns, while their combinations correspond to a row within the generated object. The expand.grid() function takes care of the “book keeping” for us, to allow varying x2 for all values of x1.
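The referenced code chunk is not reproduced above; it was presumably along these lines, with the printed output shown below the call:

expand.grid(x1 = c(1, 2), x2 = 1:3)

##   x1 x2
## 1  1  1
## 2  2  1
## 3  1  2
## 4  2  2
## 5  1  3
## 6  2  3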
PROBLEM Complete the code chunk below. The variables mu_lwr and mu_upr correspond to the lower and upper bounds in the grid for the unknown µ parameter. Specify the lower bound to be the 0.01 quantile (1st percentile) based on the prior on µ. Specify the upper bound to be the 0.99 quantile (99th percentile) based on the prior on µ. The variables varphi_lwr and varphi_upr are the lower and upper bounds in the grid for the unknown ϕ parameter. Specify the lower bound to be the log of the 0.01 quantile (1st percentile) based on the prior on σ. Specify the upper bound to be the log of the 0.99 quantile (99th percentile) based on the prior on σ.
Create the grid of candidate values, param_grid, by setting the input vectors to the expand.grid() function to be 201 evenly spaced points between the defined lower and upper bounds on the parameters.
HINT: remember that probability density functions in R each have their own specific quantile functions...
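A sketch of how the bounds and grid could be set up, assuming the grid columns are named mu and varphi (the actual chunk dictates the names to use):

mu_lwr <- qnorm(0.01, mean = info_inform_N05$mu_0, sd = info_inform_N05$tau_0)
mu_upr <- qnorm(0.99, mean = info_inform_N05$mu_0, sd = info_inform_N05$tau_0)
varphi_lwr <- log(qexp(0.01, rate = info_inform_N05$sigma_rate))
varphi_upr <- log(qexp(0.99, rate = info_inform_N05$sigma_rate))

param_grid <- expand.grid(mu = seq(mu_lwr, mu_upr, length.out = 201),
                          varphi = seq(varphi_lwr, varphi_upr, length.out = 201),
                          KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)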
2e)
The my_logpost() function accepts a vector as the first input argument, unknowns. Thus, you cannot simply pass in the separate columns of the param_grid data set. To overcome this, you will define a “wrapper” function, which manages the call to the log-posterior function. The wrapper, eval_logpost() is started for you in the code chunk below. The first argument to eval_logpost() is a value for the unknown mean, the second argument is a value to the unknown ϕ parameter, the third argument is the desired log-posterior function, and the fourth argument is the list of required information.
This problem tests that you understand how to call a function, and how the input arguments to the log-posterior function are structured. You will need to understand that structure in order to perform the Laplace Approximation later on.
PROBLEM Complete the code chunk below such that the user supplied my_func function has the mu_val and varphi_val variables combined into a vector with the correct order.
Once you complete the eval_logpost() function, the second code chunk below uses the purrr package to calculate the log-posterior over all combinations in the grid. The result is stored in a vector log_post_inform_N05.
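A sketch of the completed wrapper and the purrr call, assuming the param_grid columns are named mu and varphi:

eval_logpost <- function(mu_val, varphi_val, my_func, my_info) {
  # combine the two unknowns into a vector in the order my_func expects
  my_func(c(mu_val, varphi_val), my_info)
}

log_post_inform_N05 <- purrr::map2_dbl(param_grid$mu, param_grid$varphi,
                                       eval_logpost,
                                       my_func = my_logpost,
                                       my_info = info_inform_N05)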
2f)
The code chunk below defines the viz_logpost_surface() function for you. It generates the log-posterior surface contour plots in the style presented in lecture. You are required to interpret the log-posterior surface and describe the most probable values and uncertainty in the parameters.
PROBLEM Call the viz_logpost_surface() function by setting the log_post_result argument equal to the log_post_inform_N05 vector, and the grid_values argument equal to the param_grid object.
Which values of the transformed unbounded ϕ parameter are completely ruled out, even after 5 observations? Which values for the unknown mean, µ, appear to be the most probable? Is the posterior uncertainty on µ small compared to the prior uncertainty?
viz_logpost_surface(log_post_inform_N05, param_grid)
2g)
The log-posterior visualization in Problem 2f) is based just on the first 5 observations. You must now repeat the analysis with 45 and all 405 observations. The code chunk below defines the two lists of required information, assuming the informative prior on µ. You will use these two lists instead of the info_inform_N05 to recreate the log-posterior surface contour visualization.
info_inform_N45 <- list(
  xobs = df_01$x[1:45],
  mu_0 = -0.5, tau_0 = 1, sigma_rate = 1/3
)

info_inform_N405 <- list(
  xobs = df_01$x,
  mu_0 = -0.5, tau_0 = 1, sigma_rate = 1/3
)
PROBLEM Complete the two code chunks below. The first code chunk requires you to complete the purrr::map2_dbl() call, which evaluates the log-posterior for all combinations of the parameters. Be careful that you use the correct list of required information. The log_post_inform_N45 object corresponds to the 45 sample case, while the log_post_inform_N405 object corresponds to using all of the samples.
The second and third code chunks require that you call the viz_logpost_surface() function for the posterior based on 45 samples, and the posterior based on all 405 samples, respectively.
How does the log-posterior surface change as the sample size increases? Does the most probable values of the two parameters change as the sample size increases? Does the uncertainty in the parameters decrease?
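A sketch of the evaluations and visualizations, continuing the naming assumptions used above:

log_post_inform_N45 <- purrr::map2_dbl(param_grid$mu, param_grid$varphi,
                                       eval_logpost,
                                       my_func = my_logpost,
                                       my_info = info_inform_N45)

log_post_inform_N405 <- purrr::map2_dbl(param_grid$mu, param_grid$varphi,
                                        eval_logpost,
                                        my_func = my_logpost,
                                        my_info = info_inform_N405)

viz_logpost_surface(log_post_inform_N45, param_grid)

viz_logpost_surface(log_post_inform_N405, param_grid)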
Problem 3
It’s now time to work with the Laplace Approximation. The first step is to find the posterior mode, or the maximum a posteriori (MAP) estimate. An iterative optimization scheme is required to do so for most of the posteriors you will use in this course. As described in lecture, you will use the optim() function to perform the optimization.
3a)
The code chunk below defines two different initial guesses for the unknowns. You will try out both initial guesses and compare the optimization results. You will focus on the 5 observation case.
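The chunk defining the initial guesses is not reproduced here. A sketch of the optimization for one guess, with a hypothetical starting value and object name (repeat for the second guess), using optim() to maximize the log-posterior and return the Hessian:

init_guess_01 <- c(0, 0)    # hypothetical starting guess

map_res_01 <- optim(init_guess_01, my_logpost,
                    gr = NULL, my_info = info_inform_N05,
                    method = "BFGS", hessian = TRUE,
                    control = list(fnscale = -1, maxit = 1001))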
3b)
You tried two different starting guesses...will you get the same optimized results?
PROBLEM Compare the two optimization results from Problem 3a). Are the identified optimal parameter values the same? Are the Hessian matrices the same? Was anything different?
What about the log-posterior surface gave you a hint about how the two results would compare?
3c)
Finding the mode is the first step in the Laplace Approximation. The second step uses the negative inverse of the Hessian matrix as the approximate posterior covariance matrix. You will use a function, my_laplace(), to perform the complete Laplace Approximation. This way, this one function is all that’s needed in order to perform all steps of the Laplace Approximation.
PROBLEM Complete the code chunk below. The my_laplace() function is adapted from the laplace() function from the LearnBayes package. Fill in the missing pieces to double check that you understand which portions of the optimization result correspond to the mode and which are used to approximate the posterior covariance matrix.
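A sketch of the completed function, adapted from LearnBayes::laplace() in the style presented in lecture; the returned list element names are assumptions:

my_laplace <- function(start_guess, logpost_func, ...) {
  # find the posterior mode (MAP) by maximizing the log-posterior
  fit <- optim(start_guess, logpost_func,
               gr = NULL, ...,
               method = "BFGS", hessian = TRUE,
               control = list(fnscale = -1, maxit = 1001))
  mode <- fit$par
  # negative inverse Hessian approximates the posterior covariance matrix
  post_var_matrix <- -solve(fit$hessian)
  p <- length(mode)
  # Laplace approximation to the log marginal likelihood (evidence)
  int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
  list(mode = mode,
       var_matrix = post_var_matrix,
       log_evidence = int,
       converge = ifelse(fit$convergence == 0, "YES", "NO"),
       iter_counts = as.numeric(fit$counts[1]))
}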
3d)
You now have all of the pieces in place to perform the Laplace Approximation. Execute the Laplace Approximation for the 5 sample, 45 sample, and 405 sample cases.
PROBLEM Call the my_laplace() function in order to perform the Laplace Approximation based on the 3 sets of observations you studied with the log-posterior surface visualizations. Check that each converged.
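A sketch of the three calls, with hypothetical object names and a hypothetical starting guess:

laplace_inform_N05 <- my_laplace(c(0, 0), my_logpost, info_inform_N05)
laplace_inform_N45 <- my_laplace(c(0, 0), my_logpost, info_inform_N45)
laplace_inform_N405 <- my_laplace(c(0, 0), my_logpost, info_inform_N405)

### check that each optimization converged
laplace_inform_N05$converge
laplace_inform_N45$converge
laplace_inform_N405$converge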
3e)
The MVN approximate posteriors that you solved for in Problem 3d) are in the (µ,ϕ) space. In order to help the manufacturing company, you will need to back-transform from ϕ to σ, while accounting for any potential posterior correlation with µ. A simple way to do so is through random sampling. The MVN approximate posteriors are a known type of distribution, a Multi-Variate Normal (MVN). You are therefore able to call a random number generator which uses the specified mean vector and specified covariance matrix to generate random samples from a MVN. Back-transforming from ϕ to σ is accomplished by simply using the inverse link function.
The code chunk below defines the generate_post_samples() function for you. The user provides the
Laplace Approximation result as the first argument, and the number of samples to make as the second argument. The MASS::mvrnorm() function is used to generate the posterior samples. Almost all pieces of the function are provided to you, except you must complete the back-transformation from ϕ to σ by using the correct inverse link function.
PROBLEM Complete the code chunk below by using the correct inverse link function to back-transform from ϕ to σ.
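The provided function likely has roughly the structure sketched below; the only piece you must supply is the exp() inverse link (the column and argument names here are assumptions):

generate_post_samples <- function(mvn_result, num_samples) {
  MASS::mvrnorm(n = num_samples,
                mu = mvn_result$mode,
                Sigma = mvn_result$var_matrix) %>%
    as.data.frame() %>% tibble::as_tibble() %>%
    purrr::set_names(c("mu", "varphi")) %>%
    mutate(sigma = exp(varphi))   # inverse link: sigma = exp(varphi)
}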
3f)
PROBLEM Use the summary() function to summarize the posterior samples based on each of the three observation sample sizes. How do the posterior mean values on µ and σ change as the number of observations increases? Visualize the posterior histograms on µ and σ based on each of the three observation sample sizes.
post_samples_inform_N05 %>% summary()
3g)
Another benefit of working with posterior samples rather than the posterior density is that it is relatively straightforward to answer potentially complex questions. At the start of this project, the engineering teams felt that the mean value of the figure of merit was negative. You can now tell them the probability of that occurring. Additionally, the engineering teams felt that the measurement process was rather noisy. You can provide uncertainty intervals on the unknown σ. Depending on the amount of uncertainty about the noise, that might help them decide if the company should invest in more precise measurement equipment.
PROBLEM What is the posterior probability that µ is positive for each of the three observation sample sizes? What are the 0.05 and 0.95 quantiles on σ for each of the three observation sample sizes?
SOLUTION The posterior probability that µ is positive for 5 observations is: 0.3995590. The posterior probability that µ is positive for 45 observations is: 0.8904088. The posterior probability that µ is positive for 405 observations is: 0.9999444.
The 0.05 quantile on σ for 5 observations is: -2.7593428. The 0.95 quantile on σ for 5 observations is: 2.0199059. The 0.05 quantile on σ for 45 observations is: -0.2170140. The 0.95 quantile on σ for 45 observations is: 1.4985096. The 0.05 quantile on σ for 405 observations is 0.4045902. The 0.95 quantile on σ for 405 observations is: 1.0041649.
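A sketch of how the quantities above could be computed from the posterior samples, assuming sample data frames named post_samples_inform_N05, post_samples_inform_N45, and post_samples_inform_N405 with columns mu and sigma:

### posterior probability that mu is positive
mean(post_samples_inform_N05$mu > 0)
mean(post_samples_inform_N45$mu > 0)
mean(post_samples_inform_N405$mu > 0)

### 0.05 and 0.95 quantiles on sigma
quantile(post_samples_inform_N05$sigma, probs = c(0.05, 0.95))
quantile(post_samples_inform_N45$sigma, probs = c(0.05, 0.95))
quantile(post_samples_inform_N405$sigma, probs = c(0.05, 0.95))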
Problem 4
In this problem you get first-hand experience working with the lm() function in R. You will make use of that function to fit simple to complex models on low and high noise data sets. You will then use cross-validation to assess if the models are overfitting to a training set.
Two data sets are loaded for you in the code chunks below. Both consist of an input x and a continuous response y. The two data sets are synthetic. Both come from the same underlying true functional form. The true functional form is not given to you. You are given random observations generated around that true functional form. The df_02_low data set corresponds to random observations with low noise, while the df_02_high data set corresponds to random observations with high noise.
4a)
Create scatter plots in ggplot2 between the response y and the input x.
PROBLEM Use ggplot() with geom_point() to visualize scatter plots between the response and the input for both the low and high noise data sets. Since you know that df_02_low corresponds to low noise, can you make a guess about the true functional form that generated the data?
df_02_low %>% ggplot(mapping = aes(x = x)) +
  geom_point(mapping = aes(y = y), size = 1)
4b)
The lm() function in R is quite flexible. You can use a formula interface to specify models of various functional relationships between the response and the inputs. To get practice working with the formula interface you will create three models for the low noise case and three models for the high noise case. You will specify a linear relationship between the response and the input, a quadratic relationship, and an 8th order polynomial.
There are many ways the polynomials can be created. For this assignment though, you will use the I() function to specify the polynomial terms. This approach will seem quite tedious, but the point is to get practice working with the formula interface. We will worry about efficiency later in the semester. The formula interface for stating that the response, y, has a cubic relationship with the input x is just: y ~ x + I(x^2) + I(x^3)
The ~ operator reads as “is a function of”. Thus, the term to the left of ~ is viewed as a response, and the expression to the right of ~ is considered to be the features or predictors. With the formula specified the only other required argument to lm() is the data object. Thus, when using the formula interface, the syntax to creating a model with lm() is:
<model object> <- lm(<output> ~ <input expression>, data = <data object>)
PROBLEM Create a linear relationship, quadratic relationship, and 8th order polynomial model for the low and high noise cases. Use the formula interface to define the relationships.
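A sketch of the six fits; mod_low_linear is the name used later in Problem 4c, the remaining object names are assumptions:

mod_low_linear <- lm(y ~ x, data = df_02_low)
mod_low_quad <- lm(y ~ x + I(x^2), data = df_02_low)
mod_low_8th <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8),
                  data = df_02_low)

mod_high_linear <- lm(y ~ x, data = df_02_high)
mod_high_quad <- lm(y ~ x + I(x^2), data = df_02_high)
mod_high_8th <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8),
                   data = df_02_high)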
4c)
There are many different approaches to inspect the results of lm(). A straightforward way is to use the summary() function to display a print out of each model’s fit to the screen. Here, you will use summary() to read the R-squared performance metric associated with each model.
PROBLEM Use the summary() function to print out the summary of each model’s fit. Which of the three relationships had the highest R-squared for the low noise case? Which of the three relationships had the highest R-squared for the high noise case?
mod_low_linear %>% summary()
4d)
As mentioned previously, there are many methods in R to extract the performance of a lm() model. The modelr package, which is within the tidyverse, includes some useful functions for calculating the Root Mean Squared Error (RMSE) and R-squared values. The syntax for calculating the RMSE with the modelr::rmse() function is:
modelr::rmse(<model object>, <data set>)
The data supplied to modelr::rmse() can be a new data set, as long as the response is included in the data set. modelr::rmse() manages the book keeping for making predictions, comparing those predictions to the observed responses, calculating the errors, and summarizing. We will be discussing all of these steps later in the semester. For now, it’s practical to know a function to quickly compute an important quantity such as RMSE.
We will use the modelr package for the remainder of Problem 4, so the code chunk below loads it into the current session.
library(modelr)
PROBLEM Use the modelr::rmse() function to calculate the RMSE for each of the models on their corresponding training sets. Thus, calculate the RMSE of the low noise case models with respect to the low noise data set. Calculate the RMSE of the high noise case models with respect to the high noise data set.
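A sketch of the training-set RMSE calculations, continuing the model object names assumed in Problem 4b:

modelr::rmse(mod_low_linear, df_02_low)
modelr::rmse(mod_low_quad, df_02_low)
modelr::rmse(mod_low_8th, df_02_low)

modelr::rmse(mod_high_linear, df_02_high)
modelr::rmse(mod_high_quad, df_02_high)
modelr::rmse(mod_high_8th, df_02_high)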
4e)
The previous performance metrics were calculated based on the training set alone. We know that only considering the training set performance metrics such as RMSE and R-squared will prefer overfit models, which simply chase the noise in the training data. To assess if a model is overfit, we can breakup a data set into multiple training and test splits via cross-validation. For the remainder of this problem, you will work with 5-fold cross-validation to get practice working with multiple data splits.
You will not have to create the data splits on your own. In fact, the data splits are created for you in the code chunk below, for the low noise case. The modelr::crossv_kfold() function has two main input arguments, data and k. The data argument is the data set we wish to perform k-fold cross-validation on. The k argument is the number of folds to create. There is a third argument, id, which allows the user to name the fold “ID” labels, which by default are named ".id". As previously stated, the code chunk below creates the 5-fold data splits for the low noise case for you.
set.seed(23413)
cv_info_k05_low <- modelr::crossv_kfold(df_02_low, k = 5)
The contents of the cv_info_k05_low object are printed for you below. As you can see, cv_info_k05_low contains 3 columns, train, test, and .id. Although the object appears to be a data frame or tibble, the contents are not like most data frames. The train and test columns are actually lists which contain complex data objects. These complex objects are pointer-like in that they store how to access the training and test data sets from the original data set. In this way, the resampled object can be more memory efficient than just storing the actual data splits themselves. The .id column is just an ID, and so each row in cv_info_k05_low is a particular fold.
cv_info_k05_low
## # A tibble: 5 x 3
##   train        test         .id
##   <named list> <named list> <chr>
## 1 <resample>   <resample>   1
## 2 <resample>   <resample>   2
## 3 <resample>   <resample>   3
## 4 <resample>   <resample>   4
## 5 <resample>   <resample>   5
There are several ways to access the data sets directly. You can convert the resampled objects into integers, which provide the row indices associated with the train or test splits for each fold. To access the indices you need to use the [[]] format since you are selecting an element in a list. For example, to access the row indices for all rows selected in the first fold’s training set:
as.integer(cv_info_k05_low$train[[1]])
## [1] 2 3 4 5 6 8 9 10 11 13 14 15 16 17 18 21 22 23 24 25 26 27 29 30
Likewise to access the row indices for the third fold’s test set:
as.integer(cv_info_k05_low$test[[3]])
## [1] 4 11 17 21 26 30
By storing pointers, the resample objects are rather memory efficient. We can make use of functional programming techniques to quickly and efficiently train and test models across all folds. In this assignment, though, you will turn the resampled object into separate data sets. Although not memory efficient, doing so allows you to work with each fold directly.
PROBLEM Convert each training and test split within each fold to a separate data set. To do so, use the as.data.frame() function instead of the as.integer() function. The object names in the code chunks below denote training or test and the particular fold to assign. You only have to work with the resampled object based on the low noise data set.
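A sketch of the conversions for the first two folds, with hypothetical object names (the assignment's code chunks dictate the exact names to assign); the remaining folds follow the same pattern:

cv_low_train_01 <- as.data.frame(cv_info_k05_low$train[[1]])
cv_low_test_01 <- as.data.frame(cv_info_k05_low$test[[1]])

cv_low_train_02 <- as.data.frame(cv_info_k05_low$train[[2]])
cv_low_test_02 <- as.data.frame(cv_info_k05_low$test[[2]])

### ...and likewise for folds 3 through 5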
4f)
With the training and test splits available, now it’s time to train the models on each training fold split. You can ignore the linear relationship model for the remainder of the assignment, and focus just on the quadratic relationship and 8th order polynomial.
PROBLEM Fit or train the quadratic relationship and 8th order polynomial using each fold’s training split. Use the formula interface to define the relationship in each lm() call.
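A sketch for the first fold, continuing the hypothetical fold data set names from Problem 4e; repeat for folds 2 through 5:

fit_low_quad_fold_01 <- lm(y ~ x + I(x^2), data = cv_low_train_01)
fit_low_8th_fold_01 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8),
                          data = cv_low_train_01)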
4g)
Let’s compare the RMSE of the quadratic relationship to the 8th order polynomial, within each training fold. In this way, we can check that even after splitting the data, comparing models based on their training data still favors more complex models.
PROBLEM Calculate the RMSE for the quadratic relationship, using each fold’s model fit, relative to the training splits. Thus, you should calculate 5 quantities, and store the 5 quantities in a vector named cv_low_quad_fold_train_rmse. Perform the analogous operation for the 8th order polynomial fits, and store in the vector named cv_low_8th_fold_train_rmse.
Calculate the average RMSE across the 5-folds for both relationships. Which relationship has the lowest average RMSE?
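A sketch of the fold-wise training RMSE calculations and their averages, using the vector names required by the problem and the hypothetical fold object names from above:

cv_low_quad_fold_train_rmse <- c(modelr::rmse(fit_low_quad_fold_01, cv_low_train_01),
                                 modelr::rmse(fit_low_quad_fold_02, cv_low_train_02),
                                 modelr::rmse(fit_low_quad_fold_03, cv_low_train_03),
                                 modelr::rmse(fit_low_quad_fold_04, cv_low_train_04),
                                 modelr::rmse(fit_low_quad_fold_05, cv_low_train_05))

cv_low_8th_fold_train_rmse <- c(modelr::rmse(fit_low_8th_fold_01, cv_low_train_01),
                                modelr::rmse(fit_low_8th_fold_02, cv_low_train_02),
                                modelr::rmse(fit_low_8th_fold_03, cv_low_train_03),
                                modelr::rmse(fit_low_8th_fold_04, cv_low_train_04),
                                modelr::rmse(fit_low_8th_fold_05, cv_low_train_05))

### average RMSE across the 5 folds for each relationship
mean(cv_low_quad_fold_train_rmse)
mean(cv_low_8th_fold_train_rmse)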