Starting from:

$25

 B555- Bayesian GLM Solved


 

You can write your code in any programming language so long as we are able to test it on SICE servers. We plan to run some or all or submitted code for further testing and validation.

Overview: Bayesian GLM
In this programming project we will be working with Generalized Linear Models as covered in class, including Logistic regression, Poisson regression and Ordinal regression. Your goal is to use one generic implementation for the main algorithm that works for multiple observation likelihoods.

Data
Data for this assignment is provided in a zip file pp3data.zip on Canvas. Each dataset is given in two files with the data in one and the labels in the other file. We will use the datasets A and usps for classification with logistic regression. We will use the datasets AP for count prediction with Poisson regression. We will use the datasets AO for ordinal prediction with ordinal regression.

The datasets A, AP, AO were artificially generated with labels which are not perfectly matched to any linear predictor yet they are generated to be somewhat predictable. The examples in usps represent 16×16 bitmaps of the characters 3 and 5 and are taken from the well known usps dataset (representing data originally used for zip code classification).

Implementing our variant of GLM
In this assignment we will use a Bayesian (or regularized) version of the algorithm with w ∼ ), where α = 10, to calculate the MAP solution wMAP .

•    As discussed in class logistic regression (and by extension GLM) relies on a free parameter (w0) to capture an appropriate separating hyperplane. Therefore, you will need to add a feature fixed at one (also known as an intercept) to all datasets in the assignment. To match the test case below please add this as the first column in the data matrix.

•    The vector of first derivatives of the log posterior is g = ∂LogL∂w = Pi diφ(xi)−αw = ΦT d−αw where d is a vector whose elements are di.

•    The matrix of second derivatives of the log posterior is H = ∂w∂w∂LogLT = −Pi riφ(xi)φ(xi)T − αI = −ΦT RΦ − αI where R is a matrix with elements ri on the diagonal.

•    The GLM algorithm initializes the weight vector as w = 0 and then repeatedly applies an update with Netwon’s method w ← w − H−1g until w converges.

•    For this assignment we consider that w has converged if . If w has not converged in 100 iterations we stop and output the last w as our solution.

•    The final vector, when the algorithm stops is wMAP . In this assignment we will use wMAP for prediction (i.e. we will not calculate a predictive distribution).

Likelihood models
•    In order to apply the algorithm for any likelihood model and to evaluate its predictions we need to specify 4 items: (1) di, (2) ri, (3) how to compute our prediction tˆ for test example z, and (4) how to calculate the error when we predict tˆand the true label is t.

•    For the logistic likelihood we have: yi = σ(wT φ(xi)) and the first derivative term is di = ti−yi or d = t − y. The second derivative term is ri = yi(1 − yi). For test example z we predict t = 1 iff p(t = 1) = σ(wMAPT φ(z)) ≥ 0.5. The error is 1 if tˆ6= t.

•    Note that for the logistic model the update formula as developed in class is wn+1 ← wn − (−αI − ΦT RΦ)−1[ΦT (t − y) − αwn]. You might want to start developing your code and testing it with this special case and then generalize it to handle all likelihoods. To help you test your implementation of this algorithm we provide an additional dataset, irlstest, and solution weight vector in irlsw (for α = 0.1). The first entry in irlsw corresponds to w0.

•    For the Poisson likelihood we have: yi = e(wTφ(xi)) and the first derivative term is di = ti−yi or d = t−y. The second derivative term is ri = yi. For test example z we have p(t) = Poisson(λ) where a = wMAPT φ(z) and λ = ea. We predict the mode t = bλc. For this assignment we will use the absolute error: err = |tˆ− t|.

•    For the ordinal model with K levels we have parameters s and φ0 = −∞ < φ1 < ... < φK−1 < φK = ∞ where for this assignment we will use K = 5, s = 1 and φ0 = −∞ < φ1 = −2 < φ2 = −1 < φ3 = 0 < φ4 = 1 < φ5 = ∞. The model is somewhat sensitive to the setting of hyperparameters so it is important to use these settings.

Here ai = wT φ(xi) and for potential label j ∈ {1,...K} we have yi,j = σ(s∗(φj −ai)). Using this notation, for example i with label ti we have di = yi,ti + yi,(ti−1) − 1. For the second

derivative we have ri = s2[yi,ti(1 − yi,ti) + yi,(ti−1)(1 − yi,(ti−1))].

To predict for test example z we first calculate the y values: a = wMAPT φ(z) and for potential label j ∈ {1,...K} we have yj = σ(s∗(φj −a)). We then calculate pj = yj −yj−1 and select tˆ= argmaxj pj. For this assignment we will use the absolute error, or the number of levels we are off in prediction, that is, err = |tˆ− t|.

While you could implement these as three separate algorithms, you are expected to provide one implementation of the main optimization which is given access to procedures calculating the 4 items above to make a concrete instance of GLM.

Evaluating the implementation
Your task is to implement the GLM algorithm and generate learning curves with error bars (i.e., ±1σ) as follows.

Repeat 30 times

Step 1) Set aside 1/3 of the total data (randomly selected) to use as a test set.

Step 2) Permute the remaining data and record the test set error rate as a function of increasing training set portion (0.1,0.2, ...,1 of the total size).

Calculate the mean and standard deviation for each size and plot the result. In addition record the number of iterations and the run time untill convergence in each run and report their averages.

In your submission provide 4 plots, one for each dataset, and corresponding runtime/iterations statistics, and provide a short discussion of the results. Are the learning curves as expected? how does learning time vary across datasets for classification? and across the likelihood models? what are the main costs affecting these (time per iteration, number of iterations)?

Extra Credit
Explore some approach for model selection for α in all models and/or for s and φ in the ordinal model and report your results. You may want to generate your own data with known parameters in order to test the success of algorithms in identifying good parameters.

More products