Exercise 1: Dual formulation of the Soft-Margin SVM (5+20+10+5 P)
The primal program for the linear soft-margin SVM is
\begin{align*}
\min_{w,\, b,\, \xi} \quad & \frac12 \|w\|^2 + C \sum_{i=1}^N \xi_i\\
\text{subject to} \quad & y_i \, (w^\top \phi(x_i) + b) \geq 1 - \xi_i\\
\text{and} \quad & \xi_i \geq 0 \qquad \text{for } i = 1, \dots, N,
\end{align*}
where ‖·‖ denotes the Euclidean norm, φ is a feature map, w ∈ Rd, b ∈ R are the parameters to optimize, and xi ∈ Rd, yi ∈ {−1,1} are the labeled data points regarded as fixed constants. Once the soft-margin SVM has been learned, the prediction for any data point x ∈ Rd is given by the function
f(x) = \text{sign}(w^\top \phi(x) + b).
(a) State the conditions on the data under which a solution to this program can be found from the Lagrange dual formulation (Hint: verify Slater's conditions).
(b) Derive the Lagrange dual and show that it reduces to a constrained quadratic optimization problem. State both the objective function and the constraints of this optimization problem.
(c) Describe how the solution (w,b) of the primal program can be obtained from a solution of the dual program.
(d) Write a kernelized version of the dual program and of the learned decision function.
Exercise 2: SVMs and Quadratic Programming (10 P)
We consider the CVXOPT Python software for convex optimization. The method cvxopt.solvers.qp solves quadratic optimization problems given in the following matrix form:
\begin{align*}
\min_{x} \quad & \frac12 x^\top P x + q^\top x\\
\text{subject to} \quad & G x \preceq h\\
\text{and} \quad & A x = b.
\end{align*}
Here, \preceq denotes the element-wise inequality: Gx \preceq h means (Gx)_i \leq h_i for all i. Note that the meaning of the variables x and b is different from that of the same variables in the previous exercise.
(a) Express the matrices and vectors P,q,G,h,A,b in terms of the variables of Exercise 1, such that this quadratic minimization problem corresponds to the kernel dual SVM derived above.
Exercise 3: Programming (50 P)
Download the programming files on ISIS and follow the instructions.
Kernel Support Vector Machines
In this exercise sheet, we will implement a kernel SVM. Our implementation will be based on a generic quadratic programming optimizer provided in CVXOPT (python-cvxopt package, or directly from the website www.cvxopt.org). The SVM will then be tested on the UCI breast cancer dataset, a simple binary classification dataset accessible via the scikit-learn library.
1. Building the Gaussian Kernel (5 P)
As a starting point, we would like to implement the Gaussian kernel, which we will make use of in our kernel SVM implementation. It is defined as:
k(x, x') = \exp\Big(-\frac{\|x - x'\|^2}{2\sigma^2}\Big)
Implement a function getGaussianKernel that returns, for a Gaussian kernel of scale σ, the Gram matrix between the two data sets given as arguments.
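A minimal sketch of what such a function could look like, assuming the two data sets are passed as NumPy arrays of shape (n, d) and (m, d), and that the argument scale plays the role of σ:

import numpy as np

def getGaussianKernel(X1, X2, scale):
    # Pairwise squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq1 = np.sum(X1 ** 2, axis=1)[:, None]   # shape (n, 1)
    sq2 = np.sum(X2 ** 2, axis=1)[None, :]   # shape (1, m)
    sqdist = np.maximum(sq1 + sq2 - 2.0 * X1 @ X2.T, 0.0)  # clip tiny negative values
    # Element-wise Gaussian kernel of scale sigma
    return np.exp(-sqdist / (2.0 * scale ** 2))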
2. Building the Matrices for the CVXOPT Quadratic Solver (20 P)
We would like to learn a nonlinear SVM by optimizing its dual. An advantage of the dual SVM compared to the primal SVM is that it allows the use of nonlinear kernels such as the Gaussian kernel. The dual SVM consists of solving the following quadratic program:
\begin{align*}
\max_{\alpha} \quad & \sum_{i=1}^N \alpha_i - \frac12 \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j k(x_i, x_j)\\
\text{subject to} \quad & 0 \leq \alpha_i \leq C \qquad \text{for } i = 1, \dots, N\\
\text{and} \quad & \sum_{i=1}^N \alpha_i y_i = 0.
\end{align*}
We would like to rely on a CVXOPT solver to obtain a solution to our SVM dual. The function cvxopt.solvers.qp solves an optimization problem of the type:
\begin{align*}
\min_{\boldsymbol{x}} \quad & \frac12 \boldsymbol{x}^\top P \boldsymbol{x} + \boldsymbol{q}^\top \boldsymbol{x}\\
\text{subject to} \quad & G \boldsymbol{x} \preceq \boldsymbol{h}\\
\text{and} \quad & A \boldsymbol{x} = \boldsymbol{b}.
\end{align*}
which is of a similar form to our dual SVM (note that \boldsymbol{x} will correspond to the parameters (\alpha_i)_i of the SVM). We need to build the data structures (vectors and matrices) that make solving this quadratic problem equivalent to solving our dual SVM.
Implement a function getQPMatrices that builds the matrices P, q, G, h, A, b (of type cvxopt.matrix) that need to be passed as arguments to the optimizer cvxopt.solvers.qp.
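A possible sketch, assuming the function receives the training Gram matrix K, the label vector y with entries in {-1, +1}, and the soft-margin parameter C (the actual argument list in the notebook may differ). The dual maximization is rewritten as the minimization of \frac12 \alpha^\top P \alpha + q^\top \alpha with P_{ij} = y_i y_j K_{ij} and q = -\mathbf{1}; the box constraints 0 \leq \alpha_i \leq C are stacked into G, h, and the equality constraint \sum_i \alpha_i y_i = 0 becomes A\alpha = b:

import numpy as np
import cvxopt

def getQPMatrices(K, y, C):
    N = len(y)
    y = y.astype(float)

    # Objective: 1/2 a^T P a + q^T a with P = (y y^T) * K and q = -1
    P = cvxopt.matrix(np.outer(y, y) * K)
    q = cvxopt.matrix(-np.ones(N))

    # Box constraints 0 <= a_i <= C written as G a <= h
    G = cvxopt.matrix(np.vstack([-np.eye(N), np.eye(N)]))
    h = cvxopt.matrix(np.hstack([np.zeros(N), C * np.ones(N)]))

    # Equality constraint sum_i a_i y_i = 0 written as A a = b
    A = cvxopt.matrix(y.reshape(1, N))
    b = cvxopt.matrix(0.0)

    return P, q, G, h, A, b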
3. Computing the Bias Parameters (10 P)
Given the parameters (\alpha_i)_i the optimization procedure has found, the prediction of the SVM is given by: f(x) = \text{sign}\Big(\sum_{i=1}^N \alpha_i y_i k(x,x_i) + \theta\Big)
Note that the parameter \theta has not been computed yet. It can be obtained from any support vector that lies exactly on the margin, or equivalently, whose associated parameter \alpha is not equal to 0 or C. Calling one such vector "x_M", the parameter \theta can be computed as:
\theta = y_M - \sum_{j=1}^N \alpha_j y_j k(x_M,x_j)
Implement a function getTheta that takes as input the Gram matrix used for training, the label vector, the solution of our quadratic program, and the hyperparameter C. The function should return the parameter \theta.
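A possible sketch under the same assumptions on the argument types; the tolerance used to decide that \alpha_i is strictly between 0 and C is an arbitrary choice:

import numpy as np

def getTheta(K, y, alpha, C):
    # Indices of support vectors lying exactly on the margin (0 < alpha_i < C)
    tol = 1e-6
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]
    m = on_margin[0]  # any such vector works; averaging over all of them is a more robust variant
    # theta = y_M - sum_j alpha_j y_j k(x_M, x_j)
    return y[m] - np.sum(alpha * y * K[m, :])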
4. Implementing a class GaussianSVM (15 P)
All functions that are needed to learn the SVM have now been built. We would like to implement an SVM class that connects them and makes the SVM easy to use. The class structure is given below and contains two functions, one for training the model and one for applying it to test data.
Implement the function fit that makes use of the functions getGaussianKernel, getQPMatrices, getTheta you have already implemented. The function should learn the SVM model and store the support vectors, their labels, (\alpha_i)_i and \theta into the object (self).
Implement the function predict that makes use of the stored information to compute the SVM output for any new collection of data points.
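Since the actual class skeleton is provided in the notebook, the following is only a sketch of how fit and predict could be wired together on top of the functions above; the constructor signature and attribute names (C, scale, X_sv, ...) are assumptions:

import numpy as np
import cvxopt

class GaussianSVM:

    def __init__(self, C=1.0, scale=1.0):
        self.C = C
        self.scale = scale

    def fit(self, X, y):
        # Gram matrix of the training data and the corresponding quadratic program
        K = getGaussianKernel(X, X, self.scale)
        P, q, G, h, A, b = getQPMatrices(K, y, self.C)
        cvxopt.solvers.options['show_progress'] = False
        sol = cvxopt.solvers.qp(P, q, G, h, A, b)
        alpha = np.array(sol['x']).ravel()

        # Bias from a support vector strictly inside the box constraints
        self.theta = getTheta(K, y, alpha, self.C)

        # Keep only the support vectors (alpha_i > 0) for prediction
        sv = alpha > 1e-6
        self.X_sv, self.y_sv, self.alpha_sv = X[sv], y[sv], alpha[sv]
        return self

    def predict(self, X):
        # f(x) = sign( sum_i alpha_i y_i k(x, x_i) + theta )
        K = getGaussianKernel(X, self.X_sv, self.scale)
        return np.sign(K @ (self.alpha_sv * self.y_sv) + self.theta)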
5. Analysis
The following code tests the SVM on the breast cancer binary classification dataset for a range of scale and soft-margin parameters. For each combination of parameters, we output the number of support vectors as well as the train and test accuracy averaged over a number of random train/test splits. Running the code below should take approximately 1-2 minutes.
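For reference, here is a sketch of the kind of evaluation loop described above; the parameter grid, the number of random splits, the split ratio and the output formatting are illustrative assumptions, not the actual notebook code:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
y = 2 * y - 1  # map the {0, 1} labels to {-1, +1}

for scale in [30.0, 100.0, 300.0, 1000.0, 3000.0]:
    for C in [10.0, 100.0, 1000.0, 10000.0]:
        nsv, tr, te = [], [], []
        for seed in range(10):  # number of random splits is an assumption
            Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=seed)
            svm = GaussianSVM(C=C, scale=scale).fit(Xtr, ytr)
            nsv.append(len(svm.alpha_sv))
            tr.append(np.mean(svm.predict(Xtr) == ytr))
            te.append(np.mean(svm.predict(Xte) == yte))
        print('scale=%7.1f  C=%8.1f  nSV: %3d  train: %.3f  test: %.3f'
              % (scale, C, int(np.mean(nsv)), np.mean(tr), np.mean(te)))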
scale=   30.0  C=  1000.0  nSV: 184  train: 1.000  test: 0.918
scale=   30.0  C= 10000.0  nSV: 182  train: 1.000  test: 0.918
scale=  100.0  C=    10.0  nSV: 117  train: 0.965  test: 0.935
scale=  100.0  C=   100.0  nSV:  97  train: 0.987  test: 0.940
scale=  100.0  C=  1000.0  nSV:  85  train: 0.998  test: 0.932
scale=  100.0  C= 10000.0  nSV:  71  train: 1.000  test: 0.926
scale=  300.0  C=    10.0  nSV:  88  train: 0.939  test: 0.924
scale=  300.0  C=   100.0  nSV:  48  train: 0.963  test: 0.943
scale=  300.0  C=  1000.0  nSV:  36  train: 0.978  test: 0.946
scale=  300.0  C= 10000.0  nSV:  32  train: 0.991  test: 0.941
scale= 1000.0  C=    10.0  nSV:  66  train: 0.926  test: 0.916
scale= 1000.0  C=   100.0  nSV:  55  train: 0.935  test: 0.929
scale= 1000.0  C=  1000.0  nSV:  49  train: 0.956  test: 0.946
scale= 1000.0  C= 10000.0  nSV:  38  train: 0.971  test: 0.951
scale= 3000.0  C=    10.0  nSV:  87  train: 0.912  test: 0.903
scale= 3000.0  C=   100.0  nSV:  68  train: 0.926  test: 0.919
scale= 3000.0  C=  1000.0  nSV:  58  train: 0.934  test: 0.929
scale= 3000.0  C= 10000.0  nSV:  49  train: 0.953  test: 0.943
We observe that the highest accuracy is obtained with a scale parameter that is neither too small nor too large. The best parameters are also often associated with a low number of support vectors.