Homework 5: Conditional Probability Models
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g. LaTeX, LyX, or MathJax via IPython), though if you need to you may scan handwritten work. You may find the minted package convenient for including source code in your LaTeX document. If you are using LyX, then the listings package tends to work better.
1 Introduction
In this homework we’ll be investigating conditional probability models, with a focus on various interpretations of logistic regression, with and without regularization. Along the way we’ll discuss the calibration of probability predictions, both in the limit of infinite training data and in a more bare-hands way. On the Bayesian side, we’ll recreate from scratch the Bayesian linear Gaussian regression example we discussed in lecture. We’ll also have several optional problems that work through many basic concepts in Bayesian statistics via one of the simplest problems there is: estimating the probability of heads in a coin flip. Later we’ll extend this to estimating click-through rates in mobile advertising. Along the way we’ll encounter empirical Bayes and hierarchical models.
2 From Scores to Conditional Probabilities[1]
Let’s consider the classification setting, in which (x_1, y_1), ..., (x_n, y_n) ∈ X × {−1, 1} are sampled i.i.d. from some unknown distribution. For a prediction function f : X → R, we define the margin on an example (x, y) to be m = yf(x). Since our class predictions are given by sign(f(x)), we see that a prediction is correct iff m > 0. It’s tempting to interpret the magnitude of the score |f(x)| as a measure of confidence. However, it’s hard to interpret the magnitudes beyond saying one prediction score is more or less confident than another, and without any scale to this “confidence score”, it’s hard to know what to do with it. In this problem, we investigate how we can translate the score into a probability, which is much easier to interpret. In other words, we are looking for a way to convert a score function f(x) ∈ R into a conditional probability distribution x ↦ p(y = 1 | x).
In this problem we will consider margin-based losses, which are loss functions of the form (y, f(x)) ↦ ℓ(yf(x)), where m = yf(x) is called the margin. We are interested in how we can go from an empirical risk minimizer for a margin-based loss,

f̂ = argmin_{f ∈ F} (1/n) Σ_{i=1}^n ℓ(y_i f(x_i)),

to a conditional probability estimator π̂(x) ≈ p(y = 1 | x). Our approach will be to try to find a way to use the Bayes[2] prediction function[3] f* = argmin_f E_{x,y}[ℓ(yf(x))] to get the true conditional probability π(x) = p(y = 1 | x), and then apply the same mapping to the empirical risk minimizer. While there is plenty that can go wrong with this “plug-in” approach (primarily, the empirical risk minimizer from a [limited] hypothesis space F may be a poor estimate for the Bayes prediction function), it is at least well-motivated, and it can work well in practice. And please note that we can do better than just hoping for success: if you have enough validation data, you can directly assess how well “calibrated” the predicted probabilities are. This blog post has some discussion of calibration plots: https://jmetzen.github.io/2015-04-14/calibration.html.
It turns out it is straightforward to find the Bayes prediction function f* for margin losses, at least in terms of the data-generating distribution: for any given x ∈ X, we’ll find the best possible prediction ŷ. This will be the ŷ that minimizes

E_y[ℓ(yŷ) | x].

If we can calculate this ŷ for all x ∈ X, then we will have determined f*(x). We will simply take

f*(x) = argmin_{ŷ} E_y[ℓ(yŷ) | x].

Below we’ll calculate f* for several loss functions. It will be convenient to let π(x) = p(y = 1 | x) in the work below.
1. Write E_y[ℓ(yf(x)) | x] in terms of π(x), ℓ(−f(x)), and ℓ(f(x)). [Hint: Use the fact that y ∈ {−1, 1}.]
2. Show that the Bayes prediction function f*(x) for the exponential loss function ℓ(y, f(x)) = e^{−yf(x)} is given by

f*(x) = (1/2) log( π(x) / (1 − π(x)) ),

where we’ve assumed π(x) ∈ (0, 1). Also, show that given the Bayes prediction function f*, we can recover the conditional probabilities by

π(x) = 1 / (1 + e^{−2f*(x)}).

[Hint: Differentiate the expression in the previous problem with respect to f(x). To make things a little less confusing, and also to write less, you may find it useful to change variables a bit: fix an x ∈ X. Then write p = π(x) and ŷ = f(x). After substituting these into the expression you had for the previous problem, you’ll want to find the ŷ that minimizes the expression. Use differential calculus. Once you’ve done it for a single x, it’s easy to write the solution as a function of x.]
3. Show that the Bayes prediction function f*(x) for the logistic loss function ℓ(y, f(x)) = log(1 + e^{−yf(x)}) is given by

f*(x) = log( π(x) / (1 − π(x)) )

and the conditional probabilities are given by

π(x) = 1 / (1 + e^{−f*(x)}).

Again, we may assume that π(x) ∈ (0, 1).
4. [Optional] Show that the Bayes prediction function f*(x) for the hinge loss function ℓ(y, f(x)) = max(0, 1 − yf(x)) is given by

f*(x) = sign(2π(x) − 1).
Note that it is impossible to recover π(x) from f∗(x) in this scenario. However, in practice we work with an empirical risk minimizer, from which we may still be able to recover a reasonable estimate for π(x). An early approach to this problem is known as “Platt scaling”: https://en.wikipedia.org/wiki/Platt_scaling.
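For concreteness, here is a minimal sketch of Platt scaling (assuming scikit-learn is available; the scores and labels arrays are hypothetical placeholders). It fits p(y = 1 | x) ≈ σ(A·f(x) + B), with σ the logistic sigmoid, by running a one-dimensional logistic regression on the scores:

```python
# Minimal Platt-scaling sketch: fit a sigmoid on held-out scores.
# `scores` and `labels` are hypothetical placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression

scores = np.array([-2.1, -0.3, 0.4, 1.7, 2.5])  # f(x) on validation data
labels = np.array([-1, -1, 1, 1, 1])            # true labels in {-1, 1}

# Logistic regression on the 1-D score feature learns the sigmoid's
# slope and offset. C is set large to approximate an unregularized fit;
# Platt's original method also smooths the targets, which we omit here.
platt = LogisticRegression(C=1e6)
platt.fit(scores.reshape(-1, 1), labels)
prob_pos = platt.predict_proba(scores.reshape(-1, 1))[:, 1]  # ~ p(y = 1 | x)
```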
3 Logistic Regression
3.1 Equivalence of ERM and probabilistic approaches
In lecture we discussed two different ways to end up with logistic regression.
ERM approach: Consider the classification setting with input space X = R^d, outcome space Y_± = {−1, 1}, and action space A_R = R, with the hypothesis space of linear score functions F_score = {x ↦ w^T x : w ∈ R^d}. Consider the margin-based loss function ℓ_logistic(m) = log(1 + e^{−m}) and the training data D = ((x_1, y_1), ..., (x_n, y_n)). Then the empirical risk objective function for hypothesis space F_score and the logistic loss over D is given by

R̂_n(w) = (1/n) Σ_{i=1}^n ℓ_logistic(y_i w^T x_i).
Bernoulli regression with logistic transfer function: Consider the conditional probability modeling setting with input space X = R^d, outcome space Y_{0/1} = {0, 1}, and action space A_{[0,1]} = [0, 1], where an action corresponds to the predicted probability that an outcome is 1.
Define the standard logistic function as φ(η) = 1/(1 + e^{−η}) and the hypothesis space F_prob = {x ↦ φ(w^T x) : w ∈ R^d}. Suppose for every y_i in the dataset D above we define y_i′ = (y_i + 1)/2 (so that y_i′ ∈ {0, 1}), and let D′ be the resulting collection of (x_i, y_i′) pairs. Then the negative log-likelihood (NLL) objective function for F_prob and D′ is given by

NLL(w) = −Σ_{i=1}^n [ y_i′ log φ(w^T x_i) + (1 − y_i′) log(1 − φ(w^T x_i)) ].

If ŵ_prob minimizes NLL(w), then x ↦ φ(x^T ŵ_prob) is a maximum likelihood prediction function over the hypothesis space F_prob for the dataset D′.
Show that n R̂_n(w) = NLL(w) for all w ∈ R^d, and thus that the two approaches are equivalent, in that they produce the same prediction functions.
3.2 Numerical Overflow and the log-sum-exp trick
Suppose we want to calculate log(exp(η)) for η = 1000.42. If we compute this literally in Python, we will get an overflow (try it!), since numpy gets infinity for e^{1000.42}, and log of infinity is still infinity. On the other hand, we can help out with some math: obviously log(exp(η)) = η, and there’s no issue.
It turns out that log(exp(η)), and the problem with its calculation, is a special case of the LogSumExp function that shows up frequently in machine learning. We define

LogSumExp(x_1, ..., x_n) = log(e^{x_1} + ··· + e^{x_n}).

Note that this will overflow if any of the x_i’s are large (more than 709). To compute this on a computer, we can use the “log-sum-exp trick”: we let x* = max(x_1, ..., x_n) and compute LogSumExp as

LogSumExp(x_1, ..., x_n) = x* + log( e^{x_1 − x*} + ··· + e^{x_n − x*} ).
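Here is a minimal numpy sketch of the trick (for real use, scipy.special.logsumexp implements this):

```python
# A minimal sketch of the log-sum-exp trick. The naive computation
# overflows for large inputs; shifting by the max does not.
import numpy as np

def logsumexp(xs):
    x_star = np.max(xs)  # shift by the largest element
    return x_star + np.log(np.sum(np.exp(xs - x_star)))

xs = np.array([1000.42, 999.0, 2.0])
# np.log(np.sum(np.exp(xs))) gives inf here, since e^1000.42 overflows
print(logsumexp(xs))  # ~1000.6366, computed without overflow
```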
1. Show that the new expression for LogSumExp is valid.
2. Show that exp(x_i − x*) ∈ (0, 1] for any i, and thus the exp calculations will not overflow.
3. Above we’ve only spoken about the exp overflowing. However, the log part can also have problems by becoming negative infinity for arguments very close to 0. Explain why the log term in our expression will never be “-inf”.
4. In the objective functions for logistic regression, there are expressions of the form log(1 + e^{−s}) for some s. Note that a naive implementation gives 0 for s > 36 and inf for s < −709. Show how to use the numpy function logaddexp to correctly compute log(1 + e^{−s}).
3.3 Regularized Logistic Regression
For a dataset D = ((x_1, y_1), ..., (x_n, y_n)) drawn from R^d × {−1, 1}, the regularized logistic regression objective function can be defined as

J_logistic(w) = R̂_n(w) + λ‖w‖² = (1/n) Σ_{i=1}^n log(1 + exp(−y_i w^T x_i)) + λ‖w‖².
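For reference, here is a minimal sketch of evaluating this objective stably with np.logaddexp (X, y, and l2_param are hypothetical names; this is not necessarily the signature the skeleton code expects):

```python
# Stable evaluation of J_logistic(w) using np.logaddexp.
# X: (n, d) design matrix; y: labels in {-1, 1}; l2_param: lambda.
import numpy as np

def logistic_objective(w, X, y, l2_param):
    margins = y * (X @ w)                  # y_i * w^T x_i
    # log(1 + e^{-m}) == logaddexp(0, -m), computed without overflow
    loss = np.mean(np.logaddexp(0.0, -margins))
    return loss + l2_param * np.dot(w, w)
```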
1. Prove that the objective function J_logistic(w) is convex. You may use any facts mentioned in the convex optimization notes.
2. Complete the f_objective function in the skeleton code, which computes the objective function J_logistic(w). Make sure to use the log-sum-exp trick to get accurate calculations and to prevent overflow.
3. Complete the fit_logistic_regression_function in the skeleton code using the minimize function from scipy.optimize. ridge_regression.py from Homework 2 gives an example of how to use the minimize function. Use this function to train a model on the provided data. Make sure to take the appropriate preprocessing steps, such as standardizing the data and adding a column for the bias term.
4. Find the ℓ2 regularization parameter that maximizes the log-likelihood (equivalently, minimizes the negative log-likelihood) on the validation set. Plot the log-likelihood for different values of the regularization parameter.
5. Based on the Bernoulli regression development of logistic regression, it seems reasonable to interpret the prediction as the probability that y = 1, for a randomly drawn pair (x, y). Since we only have a finite sample (and we are regularizing, which will bias things a bit), there is a question of how well “calibrated” our predicted probabilities are. Roughly speaking, we say f(x) is well calibrated if, when we look at all examples (x, y) for which f(x) ≈ 0.7, we find that close to 70% of those examples have y = 1, as predicted... and the same holds for all predicted probabilities in (0, 1). To see how well calibrated our predicted probabilities are, break the predictions on the validation set into groups based on the predicted probability (you can play with the size of the groups to get a result you think is informative). For each group, examine the percentage of positive labels. You can make a table or graph; a sketch of one way to do the grouping appears after this list. Summarize the results. You may get some ideas and references from scikit-learn’s discussion.
6. [Optional] If you can, create a dataset for which the log-sum-exp trick is actually necessary for your implementation of regularized logistic regression. If you don’t think such a dataset exists, explain why. If you like, you may consider the case of SGD optimization. [This problem is intentionally open-ended. You’re meant to think, explore, and experiment. Points assigned for interesting insights.]
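Here is a minimal sketch of the grouping mentioned in question 5 (y_val in {0, 1} and p_hat are hypothetical names for the validation labels and predicted probabilities); scikit-learn’s sklearn.calibration.calibration_curve does something similar:

```python
# Bin validation predictions and compare the mean predicted probability
# with the empirical fraction of positives in each bin.
import numpy as np

def calibration_table(y_val, p_hat, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_hat >= lo) & (p_hat < hi)
        if mask.sum() == 0:
            continue                          # skip empty bins
        rows.append((p_hat[mask].mean(),      # mean predicted probability
                     y_val[mask].mean(),      # empirical fraction of y = 1
                     int(mask.sum())))        # number of examples in bin
    return rows
```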
4 Bayesian Logistic Regression with Gaussian Priors
Let’s return to the setup described in Section 3.1 and, in particular, to the Bernoulli regression setting with logistic transfer function. We had the following hypothesis space of conditional probability functions:

F_prob = {x ↦ φ(w^T x) : w ∈ R^d}, where φ(η) = 1/(1 + e^{−η}).
Now let’s consider the Bayesian setting, where we induce a prior on F_prob by taking a prior p(w) on the parameter w ∈ R^d.
1. For the dataset D′ described in Section 3.1, give an expression for the posterior density p(w | D′) in terms of the negative log-likelihood function

NLL(w) = −Σ_{i=1}^n [ y_i′ log φ(w^T x_i) + (1 − y_i′) log(1 − φ(w^T x_i)) ]

and a prior density p(w) (up to a proportionality constant is fine).
2. Suppose we take a prior on w of the form w ∼ N(0, Σ). Find a covariance matrix Σ such that the MAP estimate for w after observing data D′ is the same as the minimizer of the regularized logistic regression objective defined in Section 3.3 (and prove it). [Hint: Consider minimizing the negative log posterior of w. Remember you can drop any terms from the objective function that don’t depend on w, and you may freely use results of previous problems.]
3. In the Bayesian approach, the prior should reflect your beliefs about the parameters before seeing the data and, in particular, should be independent of the eventual size of your dataset. Following this principle, you choose a prior distribution w ∼ N(0, I). For a dataset D of size n, how should you choose λ in our regularized logistic regression objective function so that the minimizer is equal to the mode of the posterior distribution of w (i.e. equal to the MAP estimate)?
5 Bayesian Linear Regression - Implementation
In this problem, we will implement Bayesian Gaussian linear regression, essentially reproducing the example from lecture, which in turn is based on the example in Figure 3.7 of Bishop’s Pattern Recognition and Machine Learning (page 155). We’ve provided plotting functionality in "support_code.py". Your task is to complete "problem.py". The implementation uses np.matrix objects, and you are welcome to use[4] the np.matrix.getI method.
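Relatedly (see footnote [4]), when you need X^{−1}b it is usually better to solve the linear system than to form an explicit inverse. A minimal sketch with made-up values:

```python
# Prefer solving X y = b over computing an explicit inverse.
import numpy as np

X = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

y_inv = np.linalg.inv(X) @ b       # explicit inverse: slower, less accurate
y_solve = np.linalg.solve(X, b)    # direct solve: preferred
assert np.allclose(y_inv, y_solve)
```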
1. Implement likelihood_func.
2. Implement get_posterior_params.
3. Implement get_predictive_params.
4. Run “python problem.py” from inside the Bayesian Regression directory to do the regression and generate the plots. This runs through the regression with three different settings for the prior covariance. You may want to change the default behavior in support_code.make_plots from plt.show, to saving the plots for inclusion in your homework submission.
5. Comment on your results. In particular, discuss how each of the following change with sample size and with the strength of the prior: (i) the likelihood function, (ii) the posterior distribution, and (iii) the posterior predictive distribution.
6. Our work above was very much “full Bayes”, in that rather than coming up with a single prediction function, we have a whole distribution over posterior prediction functions. However, sometimes we want a single prediction function, and a common approach is to use the MAP estimate, that is, to choose the prediction function that has the highest posterior density. As we discussed in class, for this setting we can get the MAP estimate using ridge regression. Use ridge regression to get the MAP prediction function corresponding to the first prior covariance (per the support code). What value did you use for the regularization coefficient? Why?
6 [Optional] Coin Flipping: Maximum Likelihood
1. [Optional] Suppose we flip a coin and get the following sequence of heads and tails:
D = (H,H,T)
Give an expression for the probability of observing D given that the probability of heads is θ. That is, give an expression for p(D | θ). This is called the likelihood of θ for the data D.
2. [Optional] How many different sequences of 3 coin tosses have 2 heads and 1 tail? If we toss the coin 3 times, what is the probability of 2 heads and 1 tail? (Answer should be in terms of θ.)
3. [Optional] More generally, give an expression for the likelihood p(D | θ) for a particular sequence of flips D that has n_h heads and n_t tails. Make sure you have expressions that make sense even for θ = 0 and n_h = 0, and other boundary cases. You may use the convention that 0^0 = 1, or you can break your expression into cases if needed.
4. [Optional] Show that the maximum likelihood estimate of θ, given we observed a sequence with n_h heads and n_t tails, is

θ̂_MLE = n_h / (n_h + n_t).

You may assume that n_h + n_t ≥ 1. (Hint: Maximizing the log-likelihood is equivalent and is often easier.)
7 [Optional] Coin Flipping: Bayesian Approach with Beta Prior
We’ll now take a Bayesian approach to the coin flipping problem, in which we treat θ as a random variable sampled from some prior distribution p(θ). We’ll represent the ith coin flip by a random variable X_i ∈ {0, 1}, where X_i = 1 if the ith flip is heads. We assume that the X_i’s are conditionally independent given θ. This means that the joint distribution of the coin flips and θ factorizes as follows:

p(x_1, ..., x_n, θ) = p(θ) p(x_1, ..., x_n | θ)    (always true)
                    = p(θ) ∏_{i=1}^n p(x_i | θ)    (by conditional independence).
1. [Optional] Suppose that our prior distribution on θ is Beta(h, t), for some h, t > 0. That is, p(θ) ∝ θ^{h−1} (1 − θ)^{t−1}. Suppose that our sequence of flips D has n_h heads and n_t tails. Show that the posterior distribution for θ is Beta(h + n_h, t + n_t). That is, show that

p(θ | D) ∝ θ^{h−1+n_h} (1 − θ)^{t−1+n_t}.
We say that the Beta distribution is conjugate to the Bernoulli distribution since the prior and the posterior are both in the same family of distributions (i.e. both Beta distributions).
2. [Optional] Give expressions for the MLE, the MAP, and the posterior mean estimates of θ. [Hint: You may use the fact that a Beta(h, t) distribution has mean h/(h + t) and mode (h − 1)/(h + t − 2) for h, t > 1. For the Bayesian solutions, you should note that as h + t gets very large, with the ratio h/(h + t) fixed, the posterior mean and MAP approach the prior mean h/(h + t), while for fixed h and t, the posterior mean approaches the MLE as the sample size n = n_h + n_t → ∞.]
3. [Optional] What happens to θ̂_MLE, θ̂_MAP, and θ̂_POSTERIOR MEAN as the number of coin flips n = n_h + n_t approaches infinity?
4. [Optional] The MAP and posterior mean estimators of θ were derived from a Bayesian perspective. Let’s now evaluate them from a frequentist perspective. Suppose θ is fixed and unknown. Which of the MLE, MAP, and posterior mean estimators give unbiased estimates of θ, if any? [Hint: The answer may depend on the parameters h and t of the prior. Also, let’s consider the total number of flips n = n_h + n_t to be given (not random), while n_h and n_t are random, with n_h = n − n_t.]
5. [Optional] Suppose somebody gives you a coin and asks you to give an estimate of the probability of heads, but you can only toss the coin 3 times. You have no particular reason to believe this is an unfair coin. Would you prefer the MLE or the posterior mean as a point estimate of θ? If the posterior mean, what would you use for your prior?
8 [Optional] Hierarchical Bayes for Click-Through Rate Estimation
In mobile advertising, ads are often displayed inside apps on a phone or tablet device. When an ad is displayed, this is called an “impression.” If the user clicks on the ad, that is called a “click.” The probability that an impression leads to a click is called the “click-through rate” (CTR).
Suppose we have d = 1000 apps. For various reasons[5], each app tends to have a different overall CTR. For the purposes of designing an ad campaign, we want estimates of all the app-level CTRs, which we’ll denote by θ_1, ..., θ_1000. Of course, the particular user seeing the impression and the particular ad that is shown have an effect on the CTR, but we’ll ignore these issues for now. [Because so many clicks on mobile ads are accidental, it turns out that the overall app-level CTR often dominates the effect of the particular user and the specific ad.]
If we have enough impressions for a particular app, then the empirical fraction of clicks will give a good estimate for the actual CTR. However, if we have relatively few impressions, we’ll have some problems using the empirical fraction. Typical CTRs are less than 1%, so it takes a fairly large number of observations to get a good estimate of CTR. For example, even with 100 impressions, the only possible CTR estimates given by the MLE would be 0%, 1%, 2%, ..., 100%. The 0% estimate is almost certainly much too low, and anything 2% or higher is almost certainly much too high. Our goal is to come up with reasonable point estimates for θ_1, ..., θ_1000, even when we have very few observations for some apps.
If we wanted to apply the Bayesian approach worked out in the previous problem, we could come up with a prior that seemed reasonable. For example, we could use a Beta(3, 400) distribution as the prior on each θ_i.
In this basic Bayesian approach, the parameters 3 and 400 would be chosen by the data scientist based on prior experience, or “best guess”, but without looking at the new data. Another approach would be to use the data to help you choose the parameters a and b in Beta(a,b). This would not be a Bayesian approach, though it is frequently used in practice. One method in this direction is called empirical Bayes. Empirical Bayes can be considered a frequentist approach, in which we estimate a and b from the data D using some estimation technique, such as maximum likelihood. The proper Bayesian approach to this type of thing is called hierarchical Bayes, in which we put another prior distribution on a and b. We’ll investigate each of these approaches below.
Mathematical Description
We’ll now give a mathematical description of our model, assuming the prior parameters a and b are directly chosen by the data scientist. Let n_1, ..., n_d be the number of impressions we observe for each of the d apps. In this problem, we will not consider these to be random numbers. For the ith app, let c_1^i, ..., c_{n_i}^i be indicator variables determining whether or not each impression was clicked. That is, c_j^i = 1(jth impression on ith app was clicked). We can summarize the data D_i on the ith app by x_i = Σ_{j=1}^{n_i} c_j^i, the total number of impressions that were clicked for app i. Let θ = (θ_1, ..., θ_d), where θ_i is the CTR for app i.
In our Bayesian approach, we act as though the data were generated as follows:
1. Sample θ_1, ..., θ_d i.i.d. from Beta(a, b).
2. For each app i, sample c_1^i, ..., c_{n_i}^i i.i.d. from Bernoulli(θ_i).
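To make this generative story concrete, here is a minimal simulation sketch (the values of a, b, and the impression counts are made up):

```python
# Simulate the assumed data-generating process for d apps.
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 400.0                                # made-up prior parameters
n = np.array([10000, 20000, 60000, 100, 5, 2])   # impressions per app

theta = rng.beta(a, b, size=len(n))  # step 1: draw each app's CTR
# step 2: each x_i is the sum of n_i i.i.d. Bernoulli(theta_i) clicks,
# i.e. a Binomial(n_i, theta_i) draw
x = rng.binomial(n, theta)
```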
8.1 [Optional] Empirical Bayes for a single app
We start by working out some details for Bayesian inference for a single app. That is, suppose we only have the data D_i from app i, and nothing else. Mathematically, this is exactly the same setting as the coin tossing setting above, but here we push it further.
1. Give an expression for p(D_i | θ_i), the likelihood of D_i given the probability of click θ_i, in terms of θ_i, x_i, and n_i.
2. We will take our prior distribution on θ_i to be Beta(a, b). The corresponding probability density function is given by

p(θ_i) = Beta(θ_i; a, b) = θ_i^{a−1} (1 − θ_i)^{b−1} / B(a, b),

where B(a, b) is called the Beta function. Explain (without calculation) why we must have

B(a, b) = ∫_0^1 θ^{a−1} (1 − θ)^{b−1} dθ.
3. Give an expression for the posterior distribution p(θ_i | D_i). In this case, include the constant of proportionality. In other words, do not use the “is proportional to” sign ∝ in your final expression. You may reference the Beta function defined above. [Hint: This problem is essentially a repetition of an earlier problem.]
4. Give a closed form expression for p(D_i), the marginal likelihood of D_i, in terms of a, b, x_i, and n_i. You may use the normalization function B(·,·) for convenience, but you should not have any integrals in your solution. (Hint: p(D_i) = ∫ p(D_i | θ_i) p(θ_i) dθ_i, and the answer will be a ratio of two Beta function evaluations.)
5. The maximum likelihood estimate for θ_i is x_i/n_i. Let p_MLE(D_i) be the marginal likelihood of D_i when we use a prior on θ_i that puts all of its probability mass at x_i/n_i. Note that

p_MLE(D_i) = p(D_i | θ_i = x_i/n_i) = max_{θ_i} p(D_i | θ_i).

[Figure 1: A plot of p(D_i | a, b) as a function of a and b.]
Explain why (or prove that) p_MLE(D_i) is larger than p(D_i) for any other prior we might put on θ_i. If it’s too hard to reason about all possible priors, it’s fine to just consider all Beta priors. [Hint: This does not require much, or any, calculation. It may help to think about the integral p(D_i) = ∫ p(D_i | θ_i) p(θ_i) dθ_i as a weighted average of p(D_i | θ_i) for different values of θ_i, where the weights are p(θ_i).]
6. One approach to getting an empirical Bayes estimate of the parameters a and b is to use maximum likelihood. Such an empirical Bayes estimate is often called an ML-2 estimate, since it’s maximum likelihood, but at a higher level in the Bayesian hierarchy. To emphasize the dependence of the likelihood of D_i on the parameters a and b, we’ll now write it as p(D_i | a, b)[6]. The empirical Bayes estimates for a and b are given by

(â, b̂) = argmax_{(a,b) ∈ (0,∞)×(0,∞)} p(D_i | a, b).
To make things concrete, suppose we observed x_i = 3 clicks out of n_i = 500 impressions. A plot of p(D_i | a, b) as a function of a and b is given in Figure 1. It appears from this plot that the likelihood will keep increasing as a and b increase, at least if a and b maintain a particular ratio. Indeed, this likelihood function never attains its maximum, so we cannot use ML-2 here. Explain what’s happening to the prior as we continue to increase the likelihood. [Hint: It is a property of the Beta distribution (not difficult to see) that for any θ ∈ (0, 1) and any ε > 0, there is a Beta distribution with expected value θ and variance less than ε. What’s going on here is similar to what happens when you attempt to fit a Gaussian distribution N(µ, σ²) to a single data point using maximum likelihood.]
8.2 [Optional] Empirical Bayes Using All App Data
In the previous section, we considered working with data from a single app. With a fixed prior, such as Beta(3, 400), our Bayesian estimates for θ_i seem more reasonable to me[7] than the MLE when our sample size n_i is small. The fact that these estimates seem reasonable is an immediate consequence of the fact that I chose the prior to give high probability to estimates that seem reasonable to me, before ever seeing the data. Our earlier attempt to use empirical Bayes (ML-2) to choose the prior in a data-driven way was not successful. With only a single app, we were essentially overfitting the prior to the data we have. In this section, we’ll consider using the data from all the apps, in which case empirical Bayes makes more sense.
1. Let D = (D_1, ..., D_d) be the data from all the apps. Give an expression for p(D | a, b), the marginal likelihood of D. Your expression should be in terms of a, b, x_i, n_i for i = 1, ..., d. Assume data from different apps are independent. (Hint: This problem should be easy, based on a problem from the previous section.)
2. Explain why p(θ_i | D) = p(θ_i | D_i), according to our model. In other words, once we choose values for the parameters a and b, information about one app does not give any information about other apps.
3. Suppose we have data from 6 apps. 3 of the apps have a fair number of impressions, and 3 have relatively few. Suppose we observe the following:

            Num Clicks   Num Impressions
    App 1           50             10000
    App 2          160             20000
    App 3          180             60000
    App 4            0               100
    App 5            0                 5
    App 6            1                 2
Compute the empirical Bayes estimates for a and b. (Recall, this amounts to computing

(â, b̂) = argmax_{(a,b) ∈ (0,∞)×(0,∞)} p(D | a, b).)

This will require solving an optimization problem, for which you are free to use any optimization software you like (perhaps scipy.optimize would be useful; a sketch of one way to set this up appears after this list). The empirical Bayes prior is then Beta(â, b̂), where â and b̂ are our ML-2 estimates. Give the corresponding prior mean and standard deviation for this prior.
4. Complete the following table:

            NumClicks   NumImpressions    MLE   MAP   PosteriorMean   PosteriorSD
    App 1          50            10000   0.5%
    App 2         160            20000   0.8%
    App 3         180            60000   0.3%
    App 4           0              100     0%
    App 5           0                5     0%
    App 6           1                2    50%
Make sure to take a look at the PosteriorSD values and note which are big and which are small.
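Below is a minimal sketch of one way to set up the ML-2 optimization referenced in question 3, assuming the marginal likelihood factorizes over apps into the ratio of Beta functions derived in Section 8.1; we optimize over log a and log b so the positivity constraints hold automatically:

```python
# ML-2 (empirical Bayes) estimation of the Beta prior parameters.
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln  # log of the Beta function B(a, b)

x = np.array([50, 160, 180, 0, 0, 1])            # clicks per app
n = np.array([10000, 20000, 60000, 100, 5, 2])   # impressions per app

def neg_log_marginal(log_ab):
    a, b = np.exp(log_ab)  # reparameterize to keep a, b > 0
    # log p(D | a, b) = sum_i [log B(a + x_i, b + n_i - x_i) - log B(a, b)]
    return -np.sum(betaln(a + x, b + n - x) - betaln(a, b))

res = minimize(neg_log_marginal, x0=np.log([1.0, 100.0]))
a_hat, b_hat = np.exp(res.x)
```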
8.3 [Optional] Hierarchical Bayes
In Section 8.2 we managed to get empirical Bayes (ML-2) estimates for a and b by assuming we had data from multiple apps. However, we didn’t really address the issue that ML-2, as a maximum likelihood method, is prone to overfitting if we don’t have enough data (in this case, enough apps). Moreover, a true Bayesian would reject this approach, since we’re using our data to determine our prior. If we don’t have enough confidence to choose values for a and b without looking at the data, then the only proper Bayesian approach is to put another prior on the parameters a and b. If you are very uncertain about values for a and b, you could put priors on them that have high variance.
1. [Optional] Suppose P is the Beta(a,b) distribution. Conceptually, rather than putting priors on a and b, it’s easier to reason about priors on the mean m and the variance v of P. If we parameterize P by its mean m and the variance v, give an expression for the density function Beta(θ;m,v). You are free to use the internet to get this expression – just be confident it’s correct. [Hint: To derive this, you may find it convenient to write some expression in terms of η = a + b.]
2. [Optional] Suggest a prior distribution to put on m and v. [Hint: You might want to use one of the distribution families given in this lecture.]
3. [Optional] Once we have our prior on m and v, we can go “full Bayesian” and compute posterior distributions on θ_1, ..., θ_d. However, these no longer have closed forms. We would have to use approximation techniques, typically either a Monte Carlo sampling approach or a variational method, which are beyond the scope of this course[8]. After observing the data D, m and v will have some posterior distribution p(m, v | D). We can approximate that distribution by a point mass at the mode of that distribution, (m_MAP, v_MAP) = argmax_{m,v} p(m, v | D). Give expressions for the posterior distribution p(θ_1, ..., θ_d | D), with and without this approximation. You do not need to give any explicit expressions here. It’s fine to have expressions like p(θ_1, ..., θ_d | m, v) in your solution. Without the approximation, you will probably need some integrals. It’s these integrals that we need sampling or variational approaches to approximate. While one can see this approach as a way to approximate the proper Bayesian approach, one could also be skeptical and say this is just another way to determine your prior from the data. The estimators (m_MAP, v_MAP) are often called MAP-II estimators, since they are MAP estimators at a higher level of the Bayesian hierarchy.
[1] This problem is based on Section 7.5.3 of Schapire and Freund’s book Boosting: Foundations and Algorithms.
[2] Don’t be confused – it’s Bayes as in “Bayes optimal”, as we discussed at the beginning of the course, not Bayesian as we’ve discussed more recently.
[3] In this context, the Bayes prediction function is often referred to as the “population minimizer.” In our case, “population” refers to the fact that we are minimizing with respect to the true distribution, rather than a sample. The term “population” arises from the context where we are using a sample to approximate some statistic of an entire population (e.g. a population of people or trees).
[4] However, in practice we are usually interested in computing the product of a matrix inverse and a vector, i.e. X^{−1}b. In this case, it’s usually faster and more accurate to use a library’s algorithms for solving a system of linear equations. Note that y = X^{−1}b is just the solution to the linear system Xy = b. See for example John Cook’s blog post for discussion.
[5] The primary reason is that different apps place the ads differently, making it more or less difficult to avoid clicking the ad.
[6] Note that this is a slight (though common) abuse of notation, because a and b are not random variables in this setting. It might be more appropriate to write this as p(D_i; a, b) or p_{a,b}(D_i). But this isn’t very common.
[7] I say “to me”, since I am the one who chose the prior. You may have an entirely different prior, and think that my estimates are terrible.
[8] If you’re very ambitious, you could try out a package like PyStan to see what happens.