Exercise 1: One-Class SVM (5+5+20+10+10 P)
The one-class SVM is given by the minimization problem:
$$\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^2 + \frac{1}{\nu N}\sum_{i=1}^{N}\xi_i - \rho$$
$$\text{s.t.}\quad \langle \phi(x_i), w\rangle \ge \rho - \xi_i \quad\text{and}\quad \xi_i \ge 0$$
where $x_1, \dots, x_N$ are the training data and $\phi(x_i) \in \mathbb{R}^d$ is a feature space representation.
(a) Show that strong duality holds (i.e. verify Slater's condition).
(b) Write the Lagrange function associated with this optimization problem.
(c) Show that the dual program of the one-class SVM is given by:
$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j\, k(x_i, x_j) \quad \text{s.t.}\quad \sum_{i=1}^{N}\alpha_i = 1 \quad\text{and}\quad 0 \le \alpha_i \le \frac{1}{\nu N}$$
(d) Show that the problem can be equivalently rewritten in canonical matrix form as:
$$\min_{\alpha}\ \frac{1}{2}\alpha^{\top} K \alpha \quad \text{s.t.}\quad \mathbf{1}^{\top}\alpha = 1 \quad\text{and}\quad 0 \le \alpha \le \frac{1}{\nu N}\mathbf{1}$$
where $K$ is the Gram matrix whose elements are defined as $K_{ij} = k(x_i, x_j)$.
(e) The decision rule in the primal for classifying a point as an outlier is given by:
$$\langle \phi(x), w\rangle < \rho$$
Also, one can verify that for any data point $x_i$ whose associated dual variable satisfies the strict inequalities $0 < \alpha_i < \frac{1}{\nu N}$, calling one such point a support vector $x_{\mathrm{SV}}$, the following equality holds:
$$\langle \phi(x_{\mathrm{SV}}), w\rangle = \rho$$
Show that the outlier detection rule can be expressed as:
$$\sum_{i=1}^{N} \alpha_i k(x, x_i) < \sum_{i=1}^{N} \alpha_i k(x_{\mathrm{SV}}, x_i)$$
Exercise 2: Programming (50 P)
Download the programming files on ISIS and follow the instructions.
Exercise sheet 8 (programming)
Implementing Anomaly Detection Models
In this exercise sheet, several kernel-based anomaly detection models will be implemented and their behavior compared on a simple two-dimensional dataset. The following code builds a dataset generated as a mixture of several Gaussian blobs.
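A minimal sketch of how such a dataset could be generated is shown below; the specific centers, spreads, and sample counts are illustrative assumptions and not necessarily the values used in the provided notebook.

```python
import numpy as np
import sklearn.datasets

# Hypothetical two-dimensional dataset: a mixture of Gaussian blobs.
# The centers, cluster standard deviation and number of samples below
# are illustrative assumptions, not the notebook's actual values.
X, _ = sklearn.datasets.make_blobs(
    n_samples=300,
    centers=[[0, 0], [3, 3], [-3, 2]],
    cluster_std=0.8,
    random_state=0,
)
```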
Kernel Density Estimation
The first anomaly detection model is based on kernel density estimation (KDE). KDE builds the function
$$f(x) = \frac{1}{N}\sum_{n=1}^{N} k(x, x_n)$$
where the output forms an unnormalized probability density function. Note that if we are only interested in producing an ordering of points from least to most anomalous, we do not need to normalize f(x). However, because f(x) is a measure of inlierness rather than of outlierness, we define the outlier score o(x) as a decreasing function of f(x), and we also make sure that the score goes to infinity for very remote data points. This can be achieved with the scoring function o(x) = −log(f(x)).
We would now like to implement KDE using an interface similar to the one scikit-learn provides for its ML algorithms, in particular by defining a class that implements a fit function for training on some training data X and a predict function for computing predictions on a new set of points X. The KDE class is initialized with a kernel function (typically a Gaussian kernel). Its functions for training and predicting are incomplete.
Task:
Implement the functions fit(self,X) and predict(self,X) of the class KDE.
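A possible implementation is sketched below. This is only a sketch under the assumption that the kernel argument is a function kernel(A, B) returning the matrix of pairwise kernel values; the actual class skeleton provided in the notebook may differ in its details.

```python
import numpy as np

class KDE:
    def __init__(self, kernel):
        # kernel(A, B) is assumed to return the matrix of pairwise kernel values
        self.kernel = kernel

    def fit(self, X):
        # Training simply memorizes the training data
        self.Xtrain = X
        return self

    def predict(self, X):
        # Unnormalized density: f(x) = (1/N) * sum_n k(x, x_n)
        f = self.kernel(X, self.Xtrain).mean(axis=1)
        # Outlier score: o(x) = -log(f(x))
        return -np.log(f)
```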
The KDE model can now be tested on our two-dimensional data. The code below passes to the KDE model a Gaussian kernel of scale γ = 0.25 (i.e. the bandwidth is slightly larger than for the default Gaussian kernel), trains the model on the Gaussian blobs data, and applies it to a grid dataset for the purpose of building a contour plot.
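Assuming the plotting helpers utils.plot and utils.Xgrid used later in the notebook, such a call could look roughly as follows (a sketch; gamma=0.25 is taken from the text above):

```python
import sklearn.metrics

# Gaussian kernel of scale gamma = 0.25 (assumed to map to rbf_kernel's gamma parameter).
kernel = lambda x, y: sklearn.metrics.pairwise.rbf_kernel(x, y, gamma=0.25)

# Train on the blobs data and evaluate the outlier score on a grid for contour plotting.
# 'utils' refers to the helper module provided with the exercise files.
utils.plot(X, KDE(kernel).fit(X).predict(utils.Xgrid))
```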
We observe that the model behaves as expected, i.e. the regions outside the data are highlighted in red, which corresponds to high outlier scores.
Uncentered Kernel PCA Anomaly Detection
Another model for anomaly detection is based on kernel PCA. Here, we consider an uncentered version of kernel PCA where we do not subtract the mean of the data in feature space. Because it is not possible to compute the feature-space eigenvectors exactly from finite data, we resort to an empirical approximation based on the Gram matrix:
$$[K]_{nn'} = k(x_n, x_{n'})$$
and diagonalizing it to get empirical eigenvectors and eigenvalues:
$$K = U \Lambda U^{\top}$$
The matrix $\Lambda$ is diagonal and contains the eigenvalues $\lambda_1, \dots, \lambda_N$ sorted in descending order. The columns of the matrix $U$ are the corresponding eigenvectors. For the training data, the projection of the $n$th data point on the $i$th principal component is readily given by $\mathrm{proj}_i(x_n) = U_{n,i} \cdot \lambda_i^{0.5}$.
For new data points $x \in \mathbb{R}^d$, such a projection is not readily available, and we can resort instead to the following interpolation scheme:
$$\mathrm{proj}_i(x) = k(x, X)\, U_{:,i}\, \lambda_i^{-0.5}$$
The latter produces equivalent results for the points $(x_n)_n$ in the dataset, since $k(x_n, X)\, U_{:,i} = \lambda_i U_{n,i}$, but it generalizes the projection to any other point $x \in \mathbb{R}^d$.
Once the data has been projected on the principal components, the outlier score can be computed as:
$$o(x) = k(x,x) - \sum_{i=1}^{a} \big(\mathrm{proj}_i(x)\big)^2$$
where $a$ denotes the number of principal components retained in the model.
An incomplete version of uncentered kernel PCA anomaly detection is given below. Like for KDE, it receives a kernel as input, but one also needs to specify the number of dimensions used in the kernel PCA model.
Task:
Implement the functions fit(self,X) and predict(self,X) of the class UKPCA.
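A possible sketch, under the same assumption on the kernel argument and assuming the constructor takes the number of retained components as second argument, is:

```python
import numpy as np

class UKPCA:
    def __init__(self, kernel, n_components):
        self.kernel = kernel
        self.n_components = n_components

    def fit(self, X):
        self.Xtrain = X
        # Eigendecomposition of the uncentered Gram matrix K = U Lambda U^T
        K = self.kernel(X, X)
        lam, U = np.linalg.eigh(K)
        # Keep the leading components (eigh returns eigenvalues in ascending order)
        order = np.argsort(lam)[::-1][:self.n_components]
        self.lam = lam[order]
        self.U = U[:, order]
        return self

    def predict(self, X):
        # Interpolated projections: proj_i(x) = k(x, X) U_{:,i} lambda_i^{-0.5}
        proj = self.kernel(X, self.Xtrain).dot(self.U) / np.sqrt(self.lam)
        # Outlier score: o(x) = k(x, x) - sum_i proj_i(x)^2
        kxx = np.diag(self.kernel(X, X))
        return kxx - (proj ** 2).sum(axis=1)
```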
The kernel PCA approach can now be tested. We first consider a kernel PCA model with a linear kernel where we retain only the first principal component, as in the sketch below.
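Such a call could look roughly as follows (a sketch; sklearn's linear_kernel and the notebook's utils helpers are assumed):

```python
import sklearn.metrics

# Linear kernel, retaining a single principal component
kernel = sklearn.metrics.pairwise.linear_kernel
utils.plot(X, UKPCA(kernel, 1).fit(X).predict(utils.Xgrid))
```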
The outlier score grows along the second principal component (the one with the least variance). We now consider instead a Gaussian kernel (of slightly larger bandwidth than the one used for KDE) and build the outlier function from a kernel PCA model containing 25 principal components.
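A corresponding sketch is shown below; the value gamma=0.1 is an assumption consistent with "slightly larger bandwidth than for KDE" and may differ from the notebook's actual value.

```python
import sklearn.metrics

# Gaussian kernel with slightly larger bandwidth than for KDE (gamma=0.1 is assumed),
# keeping 25 principal components
kernel = lambda x, y: sklearn.metrics.pairwise.rbf_kernel(x, y, gamma=0.1)
utils.plot(X, UKPCA(kernel, 25).fit(X).predict(utils.Xgrid))
```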
Here, we observe that the outlier model much more closely follows the shape of the data distribution. However, we also observe that it saturates away from the data, which does not reflect the true degree of outlierness.
One-Class SVM
The one-class SVM is another approach to anomaly detection that aims to build an envelope containing the inlier data and separating it from outlier data. In its dual form, it consists of solving the constrained optimization problem:
$$\min_{\alpha}\ \frac{1}{2}\alpha^{\top} K \alpha \quad \text{subject to}\quad \mathbf{1}^{\top}\alpha = 1 \quad\text{and}\quad 0 \le \alpha \le \frac{1}{\nu N}\mathbf{1}$$
To solve this optimization problem, we can use the quadratic solver provided as part of cvxopt, whose interface is shown below:
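The solver cvxopt.solvers.qp(P, q, G, h, A, b) minimizes ½ x⊤Px + q⊤x subject to Gx ≤ h and Ax = b. Below is a minimal usage sketch on a toy problem; the toy matrices are illustrative only and not the one-class SVM problem itself.

```python
import numpy as np
import cvxopt

# Toy problem: minimize 0.5 * x' P x + q' x  subject to  G x <= h  and  A x = b
P = cvxopt.matrix(np.eye(2))           # quadratic term
q = cvxopt.matrix(np.zeros(2))         # linear term
G = cvxopt.matrix(-np.eye(2))          # -x <= 0, i.e. x >= 0
h = cvxopt.matrix(np.zeros(2))
A = cvxopt.matrix(np.ones((1, 2)))     # entries of x sum to one
b = cvxopt.matrix(np.ones(1))

solution = cvxopt.solvers.qp(P, q, G, h, A, b)
x = np.array(solution['x']).flatten()  # optimal point as a numpy array
```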
Once the solution has been found, the output score can be computed as $f(x) = \sum_i \alpha_i k(x, x_i)$. Similarly to the outlier scores we have computed for KDE, we can build the transformation:
$$o(x) = -\log\!\left(\frac{\sum_i \alpha_i k(x, x_i)}{\sum_i \alpha_i k(x_{\mathrm{SSV}}, x_i)}\right)$$
where $x_{\mathrm{SSV}}$ is any 'strict' support vector (these can be identified as the points whose dual variables satisfy the box constraints above with strict inequalities).
With this transformation the equation o(x) = 0 also gives the OC-SVM decision boundary.
Task:
Implement the functions fit(self,X) and predict(self,X) of the class OCSVM.
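A possible sketch is given below; the numerical tolerance used to identify strict support vectors and the conversion details are assumptions.

```python
import numpy as np
import cvxopt

class OCSVM:
    def __init__(self, kernel, nu):
        self.kernel = kernel
        self.nu = nu

    def fit(self, X):
        self.Xtrain = X
        N = len(X)
        K = self.kernel(X, X)
        C = 1.0 / (self.nu * N)
        # Dual problem: min 0.5 a'Ka  s.t.  1'a = 1  and  0 <= a_i <= 1/(nu N)
        P = cvxopt.matrix(K)
        q = cvxopt.matrix(np.zeros(N))
        G = cvxopt.matrix(np.vstack([-np.eye(N), np.eye(N)]))
        h = cvxopt.matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
        A = cvxopt.matrix(np.ones((1, N)))
        b = cvxopt.matrix(np.ones(1))
        self.alpha = np.array(cvxopt.solvers.qp(P, q, G, h, A, b)['x']).flatten()
        # Pick a 'strict' support vector: 0 < alpha_i < 1/(nu N), with an assumed numerical slack
        strict = np.flatnonzero((self.alpha > 1e-6 * C) & (self.alpha < (1 - 1e-6) * C))
        sv = strict[0]
        # Denominator of the score: f evaluated at the strict support vector
        self.fsv = self.kernel(X[sv:sv + 1], X).dot(self.alpha)[0]
        return self

    def predict(self, X):
        # f(x) = sum_i alpha_i k(x, x_i); outlier score o(x) = -log(f(x) / f(x_SSV))
        f = self.kernel(X, self.Xtrain).dot(self.alpha)
        return -np.log(f / self.fsv)
```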
The OC-SVM can now be tested on the 2d dataset. Here, we first consider the case where ν = 0.0001, which corresponds to implementing a hard envelope (with no points outside of it).
In [8]: kernel = lambda x,y: sklearn.metrics.pairwise.rbf_kernel(x,y,gamma=0.1)
        utils.plot(X,OCSVM(kernel,0.0001).fit(X).predict(utils.Xgrid),boundary=True)
     pcost       dcost       gap    pres   dres
 0:  5.9756e-02 -1.0097e+04  1e+04  1e-13  2e-13
1: 5.9751e-02 -1.0853e+02 1e+02 3e-15 4e-13
2: 5.9521e-02 -4.9829e+00 5e+00 2e-16 2e-14
3: 7.0927e-02 -4.2701e+00 4e+00 4e-16 1e-14
4: 7.4758e-02 -3.5860e+00 4e+00 4e-16 2e-14
5: 6.8449e-02 -1.8468e-01 3e-01 2e-16 2e-15
6: 6.2819e-02 -1.0865e-01 2e-01 2e-16 9e-16
7: 5.9641e-02 1.9806e-02 4e-02 3e-16 8e-16
8: 5.5603e-02 3.9579e-02 2e-02 7e-16 7e-16
9: 5.4303e-02 4.7196e-02 7e-03 6e-16 7e-16
10: 5.3536e-02 5.1293e-02 2e-03 2e-16 6e-16
11: 5.3182e-02 5.2474e-02 7e-04 4e-16 7e-16
12: 5.3067e-02 5.2747e-02 3e-04 4e-16 7e-16
13: 5.2983e-02 5.2925e-02 6e-05 4e-16 6e-16
14: 5.2968e-02 5.2951e-02 2e-05 4e-16 6e-16
15: 5.2963e-02 5.2960e-02 3e-06 2e-16 7e-16
16: 5.2962e-02 5.2961e-02 3e-07 2e-16 7e-16
17: 5.2961e-02 5.2961e-02 7e-09 2e-16 7e-16
We observe that all points are indeed either contained in the envelope or at its border. We can now test the OC-SVM with a larger parameter ν, here ν = 0.5, and run the code again:
In [9]: kernel = lambda x,y: sklearn.metrics.pairwise.rbf_kernel(x,y,gamma=0.1)
        utils.plot(X,OCSVM(kernel,0.5).fit(X).predict(utils.Xgrid),boundary=True)
     pcost       dcost       gap    pres   dres
 0:  5.9756e-02 -1.9792e+00  4e+02  2e+01  2e-15
1: 6.6548e-02 -1.9549e+00 6e+00 2e-01 4e-15
2: 7.2113e-02 -8.7545e-01 9e-01 4e-16 3e-15
3: 7.0349e-02 1.9189e-02 5e-02 3e-17 1e-15
4: 6.5181e-02 5.4344e-02 1e-02 6e-16 9e-16
5: 6.3039e-02 5.9564e-02 3e-03 9e-17 7e-16
6: 6.2249e-02 6.0824e-02 1e-03 2e-17 6e-16
7: 6.1873e-02 6.1403e-02 5e-04 1e-16 6e-16
8: 6.1728e-02 6.1573e-02 2e-04 3e-16 7e-16
9: 6.1671e-02 6.1646e-02 3e-05 1e-16 8e-16
10: 6.1661e-02 6.1659e-02 2e-06 5e-16 7e-16
11:  6.1660e-02  6.1660e-02  4e-08  4e-16  7e-16
Optimal solution found.
This time, not all data points are contained in the envelope, and some of them are therefore classified by the model as outliers.