Machine Learning 2 - Exercise Sheet 5 (Solved)

Exercise 1: Convolution Kernel 
Let x, x' be two univariate real-valued discrete signals, which we consider in the following to be infinite-dimensional. We consider a discrete convolution over these two signals

$$[x \ast x']_t = \sum_{\tau=-\infty}^{\infty} x(\tau) \cdot x'(t-\tau)$$

which also produces an infinite-dimensional output signal. We then define the 'convolution kernel' as:

$$k(x,x') = \|x \ast x'\|^2 = \sum_{t=-\infty}^{\infty} \big([x \ast x']_t\big)^2.$$
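For finite-length signals, the definition can be checked numerically by letting zero-padding play the role of the infinite sums; a minimal sketch (conv_kernel is a hypothetical helper name, not part of the exercise):

import numpy

def conv_kernel(x, xp):
    # squared Euclidean norm of the full discrete convolution;
    # numpy.convolve zero-pads, which replaces the infinite sums
    return numpy.sum(numpy.convolve(x, xp)**2)

print(conv_kernel(numpy.array([1.0, 2.0, -1.0]), numpy.array([0.5, 0.0, 1.0])))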

(a) Show that the convolution kernel is positive semi-definite, that is, show that

$$\sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \, k(x_i, x_j) \geq 0$$

for all inputs x1, ..., xN and choice of real numbers c1, ..., cN.

(b)    Give an explicit feature map for this kernel.

Exercise 2: Weighted Degree Kernels 
The weighted degree kernel has been proposed to represent DNA sequences (A = {G,A,T,C}), and is defined for pairs of sequences of length L as:

$$k(x,z) = \sum_{m=1}^{M} \beta_m \sum_{l=1}^{L+1-m} I\big(u_{l,m}(x) = u_{l,m}(z)\big)$$

where β1, ..., βM ≥ 0 are weighting coefficients, and where u_{l,m}(x) is the substring of x that starts at position l and has length m. The function I(·) is an indicator function which returns 1 if the input argument is true and 0 otherwise.
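For concreteness, the definition translates directly into code; a naive sketch (wdk_naive is a hypothetical name, sequences are taken to be Python strings, and beta is the list [β1, ..., βM]):

def wdk_naive(x, z, beta):
    # direct transcription of the weighted degree kernel definition
    L, M = len(x), len(beta)
    score = 0.0
    for m in range(1, M + 1):            # substring lengths
        for l in range(L + 1 - m):       # all start positions (0-based)
            # u_{l,m} is the substring of length m starting at position l
            score += beta[m - 1] * (x[l:l+m] == z[l:l+m])
    return score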

 

(a) Show that k is a positive semi-definite kernel. That is, show that

$$\sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \, k(x_i, x_j) \geq 0$$

for all inputs x1, ..., xN and choice of real numbers c1, ..., cN.

(b)    Give a feature map associated to this kernel for the special case M = 1.

(c)    Give a feature map associated to this kernel for the special case M = 2 with β1 = 0 and β2 = 1.

Exercise 3: Fisher Kernel 
The Fisher kernel is a structured kernel induced by a probability model pθ(x). While it is mainly used to extract a feature map of fixed dimensions from structured data on which a structured probability model readily exists (e.g. a hidden Markov model), the Fisher kernel can in principle also be derived for simpler distributions such as the multivariate Gaussian distribution.
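As a reminder (the standard definition, restated here for convenience), the Fisher kernel induced by pθ is built from the Fisher score and the Fisher information matrix:

$$g_\theta(x) = \nabla_\theta \log p_\theta(x), \qquad \mathcal{I}_\theta = \mathbb{E}_{x \sim p_\theta}\!\left[g_\theta(x)\, g_\theta(x)^\top\right], \qquad k(x,x') = g_\theta(x)^\top \mathcal{I}_\theta^{-1}\, g_\theta(x').$$

For part (a) below, it therefore suffices to compute the gradient of log pθ(x) with respect to µ and the corresponding Fisher information matrix.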

The probability density function of the Gaussian distribution in R^d of mean µ and covariance Σ is given by:

$$p_\theta(x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\!\Big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\Big).$$

In this exercise, we consider the covariance matrix Σ to be fixed, and therefore, the only effective parameter of the model (on which the Fisher kernel is based) is the mean µ.

(a)     Show that the Fisher kernel associated to this probability model is given by:

$$k(x,x') = (x - \mu)^\top \Sigma^{-1} (x' - \mu)$$

(b)    Give a feature map associated to this kernel.

Exercise 4: Programming 
Download the programming files on ISIS and follow the instructions.

Exercise sheet 5 (programming)                                                                                                                          

 

Part A: Kernels for DNA Sequences 
In this first part, the weighted degree kernel (WDK) will be implemented for the purpose of classifying DNA sequences. We will use Scikit-Learn (http://scikit-learn.org/) for training SVM classifiers. The focus of this exercise is therefore on the computation of the kernels. The training and test data is available in the folder splice-data . The following code reads the DNA sequence data and stores it in numpy arrays of characters.

In [1]: import numpy

        Xtrain = numpy.array([numpy.array(list(l.rstrip('\r\n'))) for l in open('splice-data/splice-train-data.txt','r')])
        Xtest  = numpy.array([numpy.array(list(l.rstrip('\r\n'))) for l in open('splice-data/splice-test-data.txt','r')])
        Ttrain = numpy.array([int(l) for l in open('splice-data/splice-train-label.txt','r')])
        Ttest  = numpy.array([int(l) for l in open('splice-data/splice-test-label.txt','r')])

We consider the weighted degree kernel described in the lecture. It applies to two gene sequences x, z ∈ {A,T,G,C}^L and is defined as:

$$k(x,z) = \sum_{m=1}^{M} \beta_m \sum_{l=1}^{L-m+1} I\big(u_{l,m}(x) = u_{l,m}(z)\big)$$

where l iterates over the whole gene sequence, u_{l,m}(x) is the subsequence of x starting at position l and of length m, and I(·) is an indicator function that returns 1 if the argument is true and 0 otherwise. We would like to implement a function that is capable of efficiently computing this weighted degree kernel for any degree M. For this, we will make use of the block method presented in the lecture.

As a first step, we would like to implement a function size2contrib , which builds a mapping from a given block size to the kernel contribution, i.e. the sum of beta values associated to all substrings contained in this block. The relation between block size and contribution to the kernel score is as follows:

Block size 1: contribution = β1
Block size 2: contribution = 2β1 + β2
Block size 3: contribution = 3β1 + 2β2 + β3
etc.

The function should return an array of size 101 containing the contribution of blocks of size zero, one, two, up to 100.
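In general, a block of size s contains s+1−m substrings of length m, so its contribution is the sum of (s+1−m)βm over all m ≤ min(s, M). One possible implementation (a minimal sketch using a direct double loop; assuming beta is a numpy array of length M):

import numpy

def size2contrib(beta):
    # contribs[s] = sum over m <= min(s, M) of (s + 1 - m) * beta_m
    M = len(beta)
    contribs = numpy.zeros(101)
    for s in range(1, 101):
        for m in range(1, min(s, M) + 1):
            contribs[s] += (s + 1 - m) * beta[m - 1]
    return contribs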

 

The function can be tested on a simple weighted degree kernel of degree 3 where beta coefficients are given by β1 = 1,β2 = 3,β3 = 9.

In [3]: size2contrib(numpy.array([1.0,3.0,9.0]))[:20]

Out[3]: array([  0.,   1.,   5.,  18.,  31.,  44.,  57.,  70.,  83.,  96., 109.,
               122., 135., 148., 161., 174., 187., 200., 213., 226.])

Having implemented this lookup table, we now focus on implementing the weighted degree kernel. The function wdk we would like to implement receives two data arrays X and Z , and some parameter vector β. The function should return the kernel Gram matrix associated to X and Z , and run efficiently, i.e. performing operations over several data points simultaneously by means of array computations as much as possible. (Hint: An array of block sizes can be transformed into an array of kernel contributions by using the indexing operation blockcontribs = s2ctable[blocksizes] .)
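One possible vectorized sketch of the block method (assuming the size2contrib function from above and sequences of length at most 100; finding block boundaries by differencing a padded match matrix is one of several reasonable choices):

import numpy

def wdk(X, Z, beta):
    # kernel value = sum, over maximal runs ("blocks") of consecutive
    # matching characters, of the contribution of each block size
    s2ctable = size2contrib(beta)
    K = numpy.zeros((len(X), len(Z)))
    for i in range(len(X)):
        M = (Z == X[i])                               # (len(Z), L) boolean matches
        P = numpy.zeros((len(Z), M.shape[1] + 2), dtype=int)
        P[:, 1:-1] = M                                # pad so every block is delimited
        D = numpy.diff(P, axis=1)                     # +1 at block starts, -1 at block ends
        rows, starts = numpy.where(D == 1)
        _, ends = numpy.where(D == -1)
        numpy.add.at(K[i], rows, s2ctable[ends - starts])
    return K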

 

Once the weighted degree kernel has been implemented, the code below trains SVMs on the classification task of interest for different choices of parameters β.
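A plausible version of this training code (a sketch only, assuming the wdk function from above and scikit-learn's svm module):

import numpy
from sklearn import svm

for beta in [numpy.array([1]), numpy.array([1,3,9]), numpy.array([0,0,0,1,3,9])]:
    Ktrain = wdk(Xtrain, Xtrain, beta)    # (N_train, N_train) Gram matrix
    Ktest  = wdk(Xtest,  Xtrain, beta)    # (N_test, N_train) test kernel
    mysvm = svm.SVC(kernel='precomputed').fit(Ktrain, Ttrain)
    print('beta = %-14s training: %.3f  test: %.3f' % (
        beta, mysvm.score(Ktrain, Ttrain), mysvm.score(Ktest, Ttest)))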

beta = [1]            training: 0.994  test: 0.916
beta = [1 3 9]        training: 1.000  test: 0.963
beta = [0 0 0 1 3 9]  training: 1.000  test: 0.933

We observe that it is necessary to include non-unigram terms in the kernel computation to achieve good prediction performance. If, however, we rely mainly on long substrings, e.g. 4-, 5-, and 6-grams, there are not sufficiently many matches in the data to obtain reliable similarity scores, and the prediction performance decreases as a result.

Part B: Kernels for Text 
Structured kernels can also be used for classifying text data. In this exercise, we consider the classification of a subset of the 20newsgroups data composed only of texts of classes comp.graphics and sci.med . The first class is assigned label -1 and the second class is assigned label +1 . Furthermore, the beginning and the end of the newsgroup messages are removed as they typically contain information that makes the classification problem trivial. As for the gene sequences dataset, data files are composed of multiple rows, where each row corresponds to one example. The code below extracts the fifth message of the training set and displays its first 500 characters.

In [6]: import textwrap

        text = list(open('newsgroup-data/newsgroup-train-data.txt','r'))[4]
        print(textwrap.fill(text[:500]+' [...]'))

hat is, >>center and radius, exactly fitting those points?  I know how to do it >>for a circle (from 3 points), but do not immediately see a >>straightforward way to do it in 3-D.  I have checked some >>geometry books, Graphics Gems, and Farin, but am still at a loss? >>Please have mercy on me and provide the solution?   > >Wouldn't this require a hyper-sphere.  In 3-space, 4 points over specifies >a sphere as far as
I can see.  Unless that is you can prove that a point >exists in
3-space that  [...]

Converting Texts to a Set-Of-Words
A convenient way of representing text data is as a "set-of-words": a set composed of all the words occurring in the document. For the purpose of this exercise, we formally define a word as an isolated sequence of at least three consecutive alphabetical characters. Furthermore, a set of stopwords containing mostly uninformative words such as prepositions or conjunctions that should be excluded from the set-of-words representation is provided in the file stopwords.txt . Create a function text2sow(text) that converts a text into a set of words following the specifications just described.
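A minimal sketch of such a function (the regular expression and the handling of stopwords.txt are implementation assumptions; which words end up filtered out depends on the contents of that file):

import re

def text2sow(text):
    # words = maximal runs of at least three alphabetical characters,
    # lowercased, with the provided stopwords removed
    stopwords = set(w.strip() for w in open('stopwords.txt', 'r'))
    words = re.findall('[A-Za-z]{3,}', text)
    return set(w.lower() for w in words) - stopwords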

 

The set-of-words implementation is then tested for the same text shown above:

In [8]: print(textwrap.fill(str(text2sow(text))))

{'sphere', 'call', 'meet', 'infinity', 'failure', 'hat', 'exactly',
'gems', 'still', 'could', 'collinear', 'there', 'choose', 'and',
'must', 'solution', 'fact', 'line', 'not', 'you', 'close', 'geometry',
'through', 'otherwise', 'algorithm', 'radius', 'this', 'hyper',
'four', 'right', 'please', 'happen', 'unless', 'need', 'two',
'graphics', 'coplaner', 'possibly', 'immediately', 'define',
'relative', 'quite', 'from', 'loss', 'prove', 'those', 'correct',
'provide', 'may', 'they', 'find', 'circle', 'sorry', 'normal',
'either', 'best', 'well', 'cannot', 'such', 'intersection',
'bisector', 'center', 'normally', 'point', 'say', 'numerically',
'let', 'its', 'the', 'consider', 'containing', 'will', 'take',
'centre', 'points', 'any', 'angles', 'see', 'diameter', 'are', 'yes',
'necessarily', 'their', 'plane', 'farin', 'them', 'desired', 'some',
'one', 'checked', 'space', 'way', 'wrong', 'lie', 'wouldn', 'all',
'error', 'non', 'does', 'for', 'bisectors', 'coincident', 'can',
'how', 'which', 'mercy', 'abc', 'distant', 'specifies', 'pictures',
'passing', 'far', 'over', 'equidistant', 'circumference', 'check',
'steve', 'require', 'three', 'have', 'then', 'since',
'straightforward', 'fitting', 'books', 'surface', 'very',
'perpendicular', 'exists', 'defined', 'but', 'lies', 'equi', 'least',
'subject', 'know', 'that'}

Implementing the Set-Of-Words Kernel
The set-of-words kernel between two documents x and z is defined as

$$k(x,z) = \sum_{w \in \mathcal{L}} I(w \in x \wedge w \in z)$$

where $\mathcal{L}$ denotes the set of all possible words, and I(w ∈ x ∧ w ∈ z) is an indicator function testing membership of a word in both sets of words. As for the DNA classification exercise, it is important to implement the kernel in an efficient manner.

The function benchmark(text2sow,kernel) in utils.py computes the worst-case performance (i.e. when applied to the two longest texts in the dataset) of a specific kernel implementation. Here, the function is tested on a naive implementation of the set-of-words kernel available in utils.py .

utils.benchmark(text2sow,utils.naivekernel)

kernel score: 827.000 , computation time: 0.728

The goal of this exercise is to accelerate the procedure by sorting the words in the set-of-words in alphabetic order, and making use of the new sorted structure in the kernel implementation. In the code below, the sorted list associated to sow1 is called ssow1 . Implement a function sortedkernel(ssow1,ssow2) that takes as input two sets of words (sorted in alphabetic order) and computes the kernel score in an efficient manner, taking advantage of the sorting structure.
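A possible sketch is a single simultaneous pass over the two sorted lists, similar to the merge step of merge sort, which brings the cost down to O(|ssow1| + |ssow2|) comparisons:

def sortedkernel(ssow1, ssow2):
    # count words common to both alphabetically sorted lists in one pass
    i, j, score = 0, 0, 0
    while i < len(ssow1) and j < len(ssow2):
        if ssow1[i] == ssow2[j]:
            score += 1
            i += 1
            j += 1
        elif ssow1[i] < ssow2[j]:
            i += 1      # advance the list holding the smaller word
        else:
            j += 1
    return score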

 

This efficient implementation of the set-of-words kernel can be tested for worst case performance by running the code below. Here, we define an additional method text2ssow(text) for computing the sorted set-of-words.
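A minimal sketch of this test, assuming the text2sow and sortedkernel functions from above (text2ssow simply sorts the set of words into a list):

def text2ssow(text):
    # sorted set-of-words: a list, so that iteration follows alphabetic order
    return sorted(text2sow(text))

utils.benchmark(text2ssow, sortedkernel)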

 

kernel score: 827.000 , computation time: 0.001

The kernel score remains the same, showing that our new sorted implementation still computes the same function; the computation time, however, has dropped drastically.

Classifying Documents with a Kernel SVM
The set-of-words kernel implemented above can be used to build an SVM-based text classifier. Here, we would like to separate our two classes comp.graphics and sci.med . The code below reads the whole dataset and stores it in a sorted set-of-words format.

In [12]: import numpy

         Xtrain = list(map(text2ssow,open('newsgroup-data/newsgroup-train-data.txt','r')))
         Xtest  = list(map(text2ssow,open('newsgroup-data/newsgroup-test-data.txt','r')))
         Ttrain = numpy.array(list(map(int,open('newsgroup-data/newsgroup-train-label.txt','r'))))
         Ttest  = numpy.array(list(map(int,open('newsgroup-data/newsgroup-test-label.txt','r'))))

Kernel matrices are then produced using our efficient kernel implementation with pre-sorting, and a SVM can be trained to predict the document class.

In [13]: from sklearn import svm

         Ktrain = numpy.array([[sortedkernel(ssow1,ssow2) for ssow2 in Xtrain] for ssow1 in Xtrain])
         Ktest  = numpy.array([[sortedkernel(ssow1,ssow2) for ssow2 in Xtrain] for ssow1 in Xtest])

         mysvm = svm.SVC(kernel='precomputed').fit(Ktrain,Ttrain)
         print('training: %.3f   test: %.3f' % (mysvm.score(Ktrain,Ttrain),mysvm.score(Ktest,Ttest)))

training: 1.000   test: 0.962
