We consider a class of optimization problems of the type:

min_θ J(θ)   s.t.   g_i(θ) = 0  ∀ i = 1, …, m   and   h_i(θ) ≤ 0  ∀ i = 1, …, l
For this class of problems, we can build the Lagrangian:
L(θ, β, λ) = J(θ) + Σ_{i=1}^{m} β_i g_i(θ) + Σ_{i=1}^{l} λ_i h_i(θ),
where (β_i)_i and (λ_i)_i are the dual variables. According to the Karush-Kuhn-Tucker (KKT) conditions, the following conditions must hold at a solution of this optimization problem (in addition to its original constraints):
∇_θ L(θ, β, λ) = 0   (stationarity)
∀ i = 1, …, l :  λ_i ≥ 0   (dual feasibility)
∀ i = 1, …, l :  λ_i h_i(θ) = 0   (complementary slackness)
We will make use of these conditions to derive the dual form of the kernel ridge regression problem.
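For intuition, here is a small worked example of these conditions (not part of the exercise): minimize θ² subject to the single inequality constraint 1 − θ ≤ 0. A minimal sketch in LaTeX:

\[
L(\theta,\lambda) = \theta^2 + \lambda\,(1-\theta)
\]
\[
\frac{\partial L}{\partial \theta} = 2\theta - \lambda = 0 \quad \text{(stationarity)}, \qquad
\lambda \ge 0 \quad \text{(dual feasibility)}, \qquad
\lambda\,(1-\theta) = 0 \quad \text{(complementary slackness)}
\]

Setting λ = 0 would force θ = 0, which violates the constraint; hence the constraint is active, θ = 1 and λ = 2.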
Exercise 1: Kernel Ridge Regression with Lagrange Multipliers (10+20+10+10 P)
Let x_1, …, x_N ∈ R^d be a dataset with labels y_1, …, y_N ∈ R. Consider the regression model f(x) = w⊤φ(x) where φ: R^d → R^h is a feature map and w is obtained by solving the constrained optimization problem

min_{w, ξ}  ½ Σ_{i=1}^{N} ξ_i²   s.t.   ξ_i = w⊤φ(x_i) − y_i  ∀ i = 1, …, N   and   ½ ‖w‖² ≤ C
where the equality constraints define the errors of the model, the objective function penalizes these errors, and the inequality constraint imposes a regularization on the parameters of the model.
(a) Construct the Lagrangian and state the KKT conditions for this problem. (Hint: rewrite the equality constraint as ξ_i − w⊤φ(x_i) + y_i = 0.)
(b) Show that the solution of the kernel regression problem above, expressed in terms of the dual variables (β_i)_i and λ, is given by:

β = (K + λI)⁻¹ λ y
where K is the kernel Gram matrix with entries K_ij = φ(x_i)⊤φ(x_j) = k(x_i, x_j).
(c) Express the prediction f(x) = w⊤φ(x) in terms of the parameters of the dual.
(d) Explain how the new parameter λ can be related to the parameter C of the original formulation.
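After working through (a)-(d), a quick numerical check can confirm the result of (b). The sketch below is illustrative only: it assumes a linear feature map φ(x) = x (so that K = XX⊤), uses the stationarity relation w = (1/λ) Σ_i β_i φ(x_i) that follows from the Lagrangian of the problem as stated above, and compares against an ordinary ridge regression solution; all variable names are hypothetical.

import numpy as np

# Numerical sanity check for the dual solution beta = (K + lambda I)^{-1} lambda y,
# assuming a linear feature map phi(x) = x so that K = X X^T.
rng = np.random.default_rng(0)
N, d, lam = 20, 3, 0.5

X = rng.normal(size=(N, d))               # training inputs
y = rng.normal(size=N)                    # training labels
K = X @ X.T                               # kernel Gram matrix for the linear kernel

# Dual solution from part (b): beta = (K + lambda I)^{-1} lambda y
beta = np.linalg.solve(K + lam * np.eye(N), lam * y)

# Corresponding primal weights from stationarity: w = (1/lambda) X^T beta
w_dual = X.T @ beta / lam

# Direct primal ridge regression weights: w = (X^T X + lambda I)^{-1} X^T y
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_dual, w_primal))      # should print True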
Exercise 2: Programming (50 P)
Download the programming files on ISIS and follow the instructions.
Gaussian Processes
In this exercise, you will implement Gaussian process regression and apply it to a toy and a real dataset. We use the notation used in the paper "Rasmussen (2005). Gaussian Processes in Machine Learning" linked on ISIS.
Let us first draw a training set X = (x_1, …, x_n) and a test set X⋆ = (x_1⋆, …, x_m⋆) from a d-dimensional input distribution. The Gaussian Process is a model under which the real-valued outputs f = (f_1, …, f_n) and f⋆ = (f_1⋆, …, f_m⋆) associated to X and X⋆ follow the joint Gaussian distribution:

(f, f⋆)  ∼  N( 0 , [[Σ, Σ⋆], [Σ⋆⊤, Σ⋆⋆]] )

where

Σ = k(X, X) + σ²I
Σ⋆ = k(X, X⋆)
Σ⋆⋆ = k(X⋆, X⋆) + σ²I
and where k( ⋅ , ⋅ ) is the Gaussian kernel function. (The kernel function is implemented in utils.py.) Predicting the output for new data points X⋆ is achieved by conditioning the joint probability distribution on the training set. Such a conditional distribution, called the posterior distribution, can be written as:

f⋆ | f  ∼  N(μ⋆, C⋆)   with   μ⋆ = Σ⋆⊤ Σ⁻¹ f   and   C⋆ = Σ⋆⋆ − Σ⋆⊤ Σ⁻¹ Σ⋆
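For concreteness, these posterior formulas can be computed in a few lines of numpy. The sketch below is only an illustration: gaussian_kernel is a stand-in for the kernel function provided in utils.py (whose exact definition is not reproduced on this sheet), and the squared-exponential form with a width parameter is an assumption.

import numpy as np

def gaussian_kernel(A, B, width):
    # Stand-in for the kernel in utils.py: k(a, b) = exp(-||a - b||^2 / (2 width^2)).
    # A and B are two-dimensional arrays with one data point per row.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def gp_posterior(Xtrain, ftrain, Xtest, width, noise):
    # Posterior parameters of f* | f following the formulas above:
    #   mu* = Sigma*^T Sigma^{-1} f,   C* = Sigma** - Sigma*^T Sigma^{-1} Sigma*
    n, m = len(Xtrain), len(Xtest)
    Sigma    = gaussian_kernel(Xtrain, Xtrain, width) + noise ** 2 * np.eye(n)
    Sigma_s  = gaussian_kernel(Xtrain, Xtest, width)            # Sigma*  (n x m)
    Sigma_ss = gaussian_kernel(Xtest, Xtest, width) + noise ** 2 * np.eye(m)
    Sigma_inv = np.linalg.inv(Sigma)
    mu = Sigma_s.T @ Sigma_inv @ ftrain
    C  = Sigma_ss - Sigma_s.T @ Sigma_inv @ Sigma_s
    return mu, C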
Having inferred the posterior distribution, the log-likelihood of observing the outputs y⋆ for the inputs X⋆ is given by evaluating the distribution f⋆ | f at y⋆:

log p(y⋆ | f) = −½ (y⋆ − μ⋆)⊤ C⋆⁻¹ (y⋆ − μ⋆) − ½ log |C⋆| − (m/2) log(2π)
where | ⋅ | is the determinant. Note that the likelihood of the data given this posterior distribution can be measured both for the training data and the test data.
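Evaluating this quantity numerically amounts to computing a multivariate Gaussian log-density. A minimal sketch (reusing the posterior parameters μ⋆ and C⋆ from above and a numerically stable log-determinant) could look as follows:

import numpy as np

def gaussian_loglikelihood(y, mu, C):
    # log N(y; mu, C) = -1/2 (y-mu)^T C^{-1} (y-mu) - 1/2 log|C| - m/2 log(2 pi)
    m = len(y)
    diff = y - mu
    sign, logdet = np.linalg.slogdet(C)          # stable computation of log|C|
    return -0.5 * diff @ np.linalg.solve(C, diff) - 0.5 * logdet - 0.5 * m * np.log(2 * np.pi)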
Part 1: Implementing a Gaussian Process (30 P)
Tasks:
Create a class GP_Regressor that implements a Gaussian process regressor and has the following three methods:
def __init__(self,Xtrain,Ytrain,width,noise): Initialize a Gaussian process with noise parameter σ and width parameter w. The variable Xtrain is a two-dimensional array where each row is one data point from the training set. The variable Ytrain is a vector containing the associated targets. The function must also precompute the matrix Σ⁻¹ for subsequent use by the methods predict() and loglikelihood().
def predict(self,Xtest): For the test set X⋆ of m points received as parameter, return the mean vector of size m and the covariance matrix of size m × m of the corresponding output, that is, return the parameters (μ⋆, C⋆) of the Gaussian distribution f⋆ | f.
def loglikelihood(self,Xtest,Ytest): For a data set X⋆ of m test points received as first parameter, return the log-likelihood of observing the outputs y⋆ received as second parameter.
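For orientation, one possible way to organize such a class is sketched below. This is an illustration under assumptions, not a reference solution: in particular, the kernel is re-implemented here as a squared-exponential stand-in for the one provided in utils.py.

import numpy as np

class GP_Regressor:
    # Sketch of a possible class layout for Part 1 (illustrative only).

    def __init__(self, Xtrain, Ytrain, width, noise):
        self.Xtrain, self.Ytrain = np.asarray(Xtrain), np.asarray(Ytrain)
        self.width, self.noise = width, noise
        # Precompute Sigma^{-1} = (k(X, X) + sigma^2 I)^{-1} for predict() and loglikelihood().
        n = len(self.Xtrain)
        Sigma = self._kernel(self.Xtrain, self.Xtrain) + noise ** 2 * np.eye(n)
        self.Sigma_inv = np.linalg.inv(Sigma)

    def _kernel(self, A, B):
        # Stand-in for the kernel provided in utils.py (squared-exponential assumption).
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def predict(self, Xtest):
        # Return (mu*, C*) of the posterior f* | f.
        Xtest = np.asarray(Xtest)
        m = len(Xtest)
        Sigma_s  = self._kernel(self.Xtrain, Xtest)
        Sigma_ss = self._kernel(Xtest, Xtest) + self.noise ** 2 * np.eye(m)
        mu = Sigma_s.T @ self.Sigma_inv @ self.Ytrain
        C  = Sigma_ss - Sigma_s.T @ self.Sigma_inv @ Sigma_s
        return mu, C

    def loglikelihood(self, Xtest, Ytest):
        # Evaluate log N(Ytest; mu*, C*).
        mu, C = self.predict(Xtest)
        diff = np.asarray(Ytest) - mu
        _, logdet = np.linalg.slogdet(C)
        return -0.5 * diff @ np.linalg.solve(C, diff) - 0.5 * logdet - 0.5 * len(diff) * np.log(2 * np.pi)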
Part 2: Application to the Yacht Hydrodynamics Data Set (20 P)
In the second part, we would like to apply the Gaussian process regressor that you have implemented to a real dataset: the Yacht Hydrodynamics Data Set available on the UCI repository at http://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics. As stated on the web page, the input variables for this regression problem are:
1. Longitudinal position of the center of buoyancy
2. Prismatic coefficient
3. Length-displacement ratio
4. Beam-draught ratio
5. Length-beam ratio
6. Froude number
and we would like to predict from these variables the residuary resistance per unit weight of displacement (last column in the file yacht_hydrodynamics.data).
Tasks:
Load the data using datasets.yacht() and partition it into a training and a test set using the function utils.split().
Standardize the data (center and rescale) so that each dimension of the training data and the labels have mean 0 and standard deviation 1 over the training set.
Train several Gaussian processes on the regression task using various combinations of width and noise parameters.
Draw two contour plots where the training and test log-likelihood are plotted as a function of the noise and width parameters. Choose suitable ranges of parameters so that the best parameter combination for the test set is in the plot. Use the same ranges and contour levels for the training and test plots. The contour levels can be chosen linearly spaced between e.g. 50 and the maximum log-likelihood value. (A possible pipeline is sketched at the end of this section.)
Width params: 0.050 0.135 0.220 0.304 0.389 0.474 0.559 0.643 0.728 0.813 0.898 0.983 1.067 1.152 1.237 1.322 1.407 1.491 1.576 1.661 1.746 1.830 1.915 2.000
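Finally, a minimal sketch of the Part 2 pipeline is given below. It is an illustration under assumptions: the calls to datasets.yacht() and utils.split() are only shown in comments because their exact signatures are not reproduced on this sheet, placeholder arrays keep the sketch self-contained, and GP_Regressor refers to the class from Part 1 (e.g. the skeleton sketched above).

import numpy as np
import matplotlib.pyplot as plt

# Assumed course helpers (signatures are guesses; adapt to the actual files):
#   X, Y = datasets.yacht()                          # inputs and residuary resistance
#   Xtrain, Xtest, Ytrain, Ytest = utils.split(X, Y)
# Placeholder arrays so that the sketch runs on its own:
Xtrain, Ytrain = np.random.rand(100, 6), np.random.rand(100)
Xtest,  Ytest  = np.random.rand(50, 6),  np.random.rand(50)

# Standardize inputs and labels using statistics of the training set only.
mx, sx = Xtrain.mean(axis=0), Xtrain.std(axis=0)
my, sy = Ytrain.mean(), Ytrain.std()
Xtrain, Xtest = (Xtrain - mx) / sx, (Xtest - mx) / sx
Ytrain, Ytest = (Ytrain - my) / sy, (Ytest - my) / sy

# Grid of width and noise parameters; evaluate train and test log-likelihood.
widths = np.linspace(0.05, 2.0, 24)
noises = np.linspace(0.05, 2.0, 24)
ll_train = np.zeros((len(noises), len(widths)))
ll_test  = np.zeros((len(noises), len(widths)))
for i, noise in enumerate(noises):
    for j, width in enumerate(widths):
        gp = GP_Regressor(Xtrain, Ytrain, width, noise)
        ll_train[i, j] = gp.loglikelihood(Xtrain, Ytrain)
        ll_test[i, j]  = gp.loglikelihood(Xtest, Ytest)

# Contour plots with shared levels for the training and test log-likelihood.
vmax = max(ll_train.max(), ll_test.max())
levels = np.linspace(vmax - 100.0, vmax, 20)   # adjust the lower bound as described in the task
for name, ll in [("train", ll_train), ("test", ll_test)]:
    plt.figure()
    plt.contour(widths, noises, ll, levels=levels)
    plt.xlabel("width"); plt.ylabel("noise"); plt.title(name + " log-likelihood")
plt.show()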