Machine Learning 2 - Exercise Sheet 7 (Solved)

Exercise 1: Structured Prediction for Classification 
While structured output learning is typically used for predicting complex output signals such as sequences or trees, the same framework can also be used to address the more standard problem of classification. Let $x, x' \in \mathbb{R}^d$ be two data points and $y, y' \in \{1, \dots, C\}$ their respective classes. Consider the structured output kernel

$$k_{\mathrm{struct}}((x,y),(x',y')) = k(x,x') \cdot I(y = y'),$$

where $k(x,x')$ is a positive semi-definite kernel with associated feature map $\phi(x)$, and where $I(\cdot)$ is an indicator function that is 1 when the argument is true and 0 otherwise.

(a) Show that the kernel $k_{\mathrm{struct}}((x,y),(x',y'))$ is positive semi-definite, that is, show that

$$\sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j \, k_{\mathrm{struct}}((x_i,y_i),(x_j,y_j)) \geq 0$$

for all input/output pairs $(x_1,y_1),\dots,(x_N,y_N)$ and choices of real numbers $c_1,\dots,c_N$.
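As a quick numerical sanity check of this property, one can build the Gram matrix of $k_{\mathrm{struct}}$ on random data and verify that its smallest eigenvalue is non-negative. The sketch below assumes a Gaussian base kernel; the helper names (k_struct, K) are illustrative and not part of the exercise.

import numpy as np

def k_struct(x1, y1, x2, y2, gamma=1.0):
    # Gaussian base kernel k(x,x') times the label indicator I(y = y')
    return np.exp(-gamma * np.sum((x1 - x2) ** 2)) * (y1 == y2)

rng = np.random.default_rng(0)
N, d, C = 50, 3, 4
X = rng.normal(size=(N, d))
Y = rng.integers(1, C + 1, size=N)

# Gram matrix of the structured kernel on the random sample
K = np.array([[k_struct(X[i], Y[i], X[j], Y[j]) for j in range(N)]
              for i in range(N)])

# All eigenvalues should be >= 0 up to numerical precision
print(np.linalg.eigvalsh(K).min())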

(b) Find a feature map $\phi_{\mathrm{struct}}(x,y)$ associated with this kernel, i.e. satisfying

$$\langle \phi_{\mathrm{struct}}(x,y), \phi_{\mathrm{struct}}(x',y') \rangle = k_{\mathrm{struct}}((x,y),(x',y'))$$

for all pairs $(x,y)$ and $(x',y')$.

Exercise 2: Dual Formulation of Structured Output Learning 
In structured output learning, we look for a linear model in the joint feature space that produces a large margin between the correct prediction and all other possible predictions. The primal formulation of this problem can be expressed as:

$$\min_{w} \ \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n$$

subject to the constraints:

$$\forall\, n, \ \forall\, y \neq y_n: \quad \langle w, \Psi_{n,y} \rangle \geq 1 - \xi_n \qquad \text{and} \qquad \forall\, n: \quad \xi_n \geq 0,$$

where we have used the shortcut notation $\Psi_{n,y} = \phi(x_n, y_n) - \phi(x_n, y)$.

(a) Show that the associated dual optimization problem is given by:

$$\max_{\alpha} \ \sum_{n=1}^{N} \sum_{y \neq y_n} \alpha_{n,y} \ - \ \frac{1}{2} \sum_{n=1}^{N} \sum_{y \neq y_n} \sum_{n'=1}^{N} \sum_{y' \neq y_{n'}} \alpha_{n,y} \, \alpha_{n',y'} \, \langle \Psi_{n,y}, \Psi_{n',y'} \rangle$$

subject to the constraints:

$$\forall\, n, \ \forall\, y \neq y_n: \quad \alpha_{n,y} \geq 0 \qquad \text{and} \qquad \forall\, n: \quad \sum_{y \neq y_n} \alpha_{n,y} \leq C.$$

(b) Assuming $k((x,y),(x',y'))$ is the kernel that induces the feature map $\phi(x,y)$, express the dot product $\langle \Psi_{n,y}, \Psi_{n',y'} \rangle$ in terms of this kernel.
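Because the dot product is bilinear, the expansion of $\langle \Psi_{n,y}, \Psi_{n',y'} \rangle$ can be sanity-checked numerically with any explicit feature map. The sketch below uses, purely as an assumed example, a one-hot label-block construction in the spirit of Exercise 1(b) with a linear base kernel; the names phi_struct and k_struct are our own.

import numpy as np

def phi_struct(x, y, C):
    # Copy x into the block corresponding to class y (linear base kernel)
    out = np.zeros(C * len(x))
    out[(y - 1) * len(x) : y * len(x)] = x
    return out

def k_struct(x1, y1, x2, y2):
    # Induced kernel: <phi_struct(x1,y1), phi_struct(x2,y2)>
    return np.dot(x1, x2) * (y1 == y2)

rng = np.random.default_rng(1)
d, C = 3, 4
xn, xm = rng.normal(size=d), rng.normal(size=d)
yn, y, ym, yp = 1, 2, 1, 3    # labels entering Psi_{n,y} and Psi_{n',y'}

Psi1 = phi_struct(xn, yn, C) - phi_struct(xn, y, C)
Psi2 = phi_struct(xm, ym, C) - phi_struct(xm, yp, C)

# Bilinearity expands <Psi1, Psi2> into four kernel evaluations
lhs = np.dot(Psi1, Psi2)
rhs = (k_struct(xn, yn, xm, ym) - k_struct(xn, yn, xm, yp)
       - k_struct(xn, y, xm, ym) + k_struct(xn, y, xm, yp))
print(np.isclose(lhs, rhs))   # True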

Exercise 3: Prediction of Output Sequences 
Consider output sequences to predict to be of the type $y \in \{-1,1\}^L$, and the feature map:

$$\phi(x,y) = (x \odot y, \ y_1 y_2, \ y_2 y_3, \ \dots, \ y_{L-1} y_L)$$

where $\odot$ denotes the element-wise product between two vectors. The structured output model looks for the output $y$ that maximizes the matching function, i.e.

$$\max_{y} \ w^\top \phi(x,y)$$

with $w \in \mathbb{R}^{2L-1}$. In the following, we assume that $L = 3$, that the current parameter is $w = (1,1,1,1,1)$ and that we receive the input $x = (1,-1,1)$.

(a) Show that the problem of maximizing the matching function simplifies to:

$$\max_{y \in \{-1,1\}^3} \ y_1 - y_2 + y_3 + y_1 y_2 + y_2 y_3$$

(b) Using the Viterbi procedure, find the best output $(y_1,y_2,y_3)$: solve $\max_{y_3}\{\cdot\}$ for every value of $y_2$, then solve $\max_{y_2}\{\cdot\}$ for every value of $y_1$, and finally solve $\max_{y_1}\{\cdot\}$. While doing so, keep track of the values in the sequence that have produced the respective maxima so that the optimal sequence can be reconstructed. (A computational sketch of this procedure is given below.)
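The following sketch, assuming the simplified objective from part (a), runs this max-product recursion backwards over the chain and verifies the result by brute force over all $2^3$ sequences; variable names such as m and back are our own.

import itertools

# Objective from part (a): y1 - y2 + y3 + y1*y2 + y2*y3
x = [1, -1, 1]
L = 3

def score(y):
    return (sum(x[i] * y[i] for i in range(L))
            + sum(y[i] * y[i + 1] for i in range(L - 1)))

# Backward Viterbi pass: m[i][yi] is the best score of the chain from
# position i onwards given y_i; back[i][yi] stores the maximizing y_{i+1}.
m = [dict() for _ in range(L)]
back = [dict() for _ in range(L)]
for yi in (-1, 1):
    m[L - 1][yi] = x[L - 1] * yi
for i in range(L - 2, -1, -1):
    for yi in (-1, 1):
        nxt = max((-1, 1), key=lambda yn: yi * yn + m[i + 1][yn])
        back[i][yi] = nxt
        m[i][yi] = x[i] * yi + yi * nxt + m[i + 1][nxt]

# Reconstruct the optimal sequence from the back-pointers
y_opt = [max((-1, 1), key=lambda y1: m[0][y1])]
for i in range(L - 1):
    y_opt.append(back[i][y_opt[-1]])

print(y_opt, score(y_opt))                                   # [1, 1, 1] 3
print(max(itertools.product((-1, 1), repeat=L), key=score))  # (1, 1, 1)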

Exercise 4: Programming 
Download the programming files on ISIS and follow the instructions.

Exercise sheet 7 (programming)                                                                                   

 

Structured Output Learning
In this exercise, we consider a data completion task to be solved with structured output learning. The dataset is based on the dataset of the previous programming sheet on splice site classification. We would like to be able to reconstruct a nucleotide sequence when one of its nucleotides is missing. One such incomplete sequence of nucleotides is

TGTCGGTTA?CAATGATTTT

where the question mark indicates the missing nucleotide. We would like to make use of the degree kernel that was used in the previous programming sheet; it was shown to represent gene data efficiently near the splice sites. For our completion task, we adopt a structured output learning approach, where the candidate value for replacing the missing nucleotide is also part of the kernel feature map. Interestingly, with this approach, the kernel still applies to the same type of input data (i.e. continuous gene sequences) as in the standard splice classification setting.

The structured output problem is defined as solving the soft-margin SVM optimization problem:

$$\min_{w,b} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

where for all input pairs $(x_i, y_i)_{i=1}^{N}$ representing the gene sequences and the true value of the missing nucleotide, the following constraints hold:

$$w^\top \phi(x_i, y_i) + b \geq 1 - \xi_i$$
$$\forall\, z_i \in \{A,T,C,G\} \setminus \{y_i\}: \quad w^\top \phi(x_i, z_i) + b \leq -1 + \xi_i$$
$$\xi_i \geq 0$$

Once the SVM is optimized, the missing nucleotide $y$ for a sequence $x$ can be predicted as:

$$y(x) = \arg\max_{z \in \{A,T,C,G\}} \ w^\top \phi(x, z).$$

The feature map $\phi(x,z)$ is implicitly defined by the degree kernel between gene sequences $r$ and $r'$, given as

$$k_d(r, r') = \sum_{i=1}^{L-d+1} I\big(r[i \dots i+d] = r'[i \dots i+d]\big),$$

where $r$ is built as the incomplete gene sequence $x$ with the missing nucleotide "?" set to $z$, and where $r[i \dots i+d]$ is the subsequence of $r$ starting at position $i$ and of length $d$.
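For concreteness, a direct (unoptimized) implementation of this kernel might look as follows; the course provides its own version via utils.getdegreekernels, so the functions below are only an illustrative sketch.

import numpy as np

def degree_kernel(r1, r2, d):
    # Count the positions at which the length-d substrings of r1 and r2 agree
    L = len(r1)
    return sum(r1[i:i + d] == r2[i:i + d] for i in range(L - d + 1))

def gram_matrix(R1, R2, d):
    # Kernel matrix between two collections of equal-length sequences (strings)
    return np.array([[degree_kernel(a, b, d) for b in R2] for a in R1])

print(degree_kernel("CAACGATCCA", "CAACGTTCCA", 3))   # 5 matching substrings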

Loading the Data
The following code calls a function from the file utils.py that loads the data into the IPython notebook. Note that only the 20 nucleotides nearest to the splice site are returned. The code then prints the first four gene sequences from the dataset, where the character "?" denotes the missing nucleotide. The label associated with each incomplete gene sequence (i.e. the value of the missing nucleotide "?") is shown on the right.

In [1]: import utils

        Xtrain, Xtest, Ytrain, Ytest = utils.loaddata()

        print("".join(Xtrain[0]) + "  ?=" + Ytrain[0])
        print("".join(Xtrain[1]) + "  ?=" + Ytrain[1])
        print("".join(Xtrain[2]) + "  ?=" + Ytrain[2])
        print("".join(Xtrain[3]) + "  ?=" + Ytrain[3])

CAACGATCCAT?CATCCACA  ?=C

CAGGACGGTCA?GAAGATCC  ?=G

AAAAAGATGA?GTGGTCAAC  ?=A

TGTCGGTTA?CAATGATTTT  ?=C

It can be observed from the output that the missing nucleotide is not always at the same position. This further confirms that the problem cannot be treated directly as a standard multiclass classification problem. Note that in this data, we have artificially removed nucleotides in the training and test set so that we have labels y available for training and evaluation.

Generating SVM Data 
In the SVM structured output formulation, the data points $(x_i, y_i)$ denote the true gene sequences and are the SVM positive examples. To be able to train the SVM, we need to generate all possible examples $((x_i, z_i))_{z_i \in \{A,T,C,G\}}$.

Your first task is to implement a function builddata(X,Y) that receives as input the dataset of incomplete gene sequences X of size (N x L), where N is the number of gene sequences and L is the sequence length, and the vector Y of size N containing the values of the missing nucleotides.

Your implementation should produce as output an extended dataset XZ of size (4N x L) containing the four possible completions of each gene sequence. The function should also return a vector of labels T of size 4N that is +1 for positive SVM examples and -1 for negative SVM examples. For repeatability, ensure that all modifications of the same gene sequence occur in consecutive order in the outputs XZ and T. (One possible implementation is sketched below.)
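A minimal sketch of such a function, assuming X is an array of single characters as printed above, could be the following (one possible implementation, not the official solution):

import numpy as np

def builddata(X, Y):
    # Substitute "?" with each of the four nucleotides; label the substitution
    # +1 if it matches the true missing nucleotide and -1 otherwise
    XZ, T = [], []
    for x, y in zip(X, Y):
        pos = list(x).index('?')           # position of the missing nucleotide
        for z in ['A', 'T', 'C', 'G']:
            xz = np.array(x)
            xz[pos] = z
            XZ.append(xz)
            T.append(+1 if z == y else -1)
    return np.array(XZ), np.array(T)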

 

Your implementation can be tested by running the following code. It applies the function to the training and test sets and prints the first 12 examples in the training set (i.e. all four possible completions of the first three gene sequences).

 


CAACGATCCATACATCCACA -1

CAACGATCCATTCATCCACA -1

CAACGATCCATCCATCCACA +1

CAACGATCCATGCATCCACA -1

CAGGACGGTCAAGAAGATCC -1

CAGGACGGTCATGAAGATCC -1

CAGGACGGTCACGAAGATCC -1

CAGGACGGTCAGGAAGATCC +1

AAAAAGATGAAGTGGTCAAC +1

AAAAAGATGATGTGGTCAAC -1

AAAAAGATGACGTGGTCAAC -1

AAAAAGATGAGGTGGTCAAC -1

SVM Optimization and Sequence Completion
In this section, we would like to create a function that predicts the missing nucleotides in the gene sequences. The function should be structured as follows: first, build the kernel training and test matrices using the function utils.getdegreekernels with the specified degree parameter. Then, use scikit-learn's SVM implementation ( sklearn.svm.SVC ) to train the SVM associated with the computed kernel matrices and the target vector Ttrain. Use the default SVM hyperparameter C=1 for training.

After training the SVM, we would like to compute the predictions for the original structured output problem, that is, for each original gene sequence in the training and test set, the choice of missing nucleotide value for which the SVM prediction value is highest. The outputs Ptrain and Ptest denote such predictions and should be arrays of characters A,T,C,G of the same size as the vectors of true nucleotide values Ytrain and Ytest.

(Hint: You should consider that in some cases there might not be exactly one missing nucleotide value that produces a positive SVM classification. In such cases, we would still like to find the unique best nucleotide value based on the value of the discriminant function for this particular data point. A special function of scikit-learn's SVC class exists for that purpose.) A possible structure for the whole prediction function is sketched below.
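The sketch below illustrates one way to organize this function. It assumes that utils.getdegreekernels(XZtrain, XZtest, degree) returns the train/train and test/train kernel matrices over the expanded data (the actual signature in utils.py may differ), and it uses SVC's decision_function, which returns the value of the discriminant for each example.

import numpy as np
import utils
from sklearn.svm import SVC

def predict(XZtrain, Ttrain, XZtest, degree):
    # Assumed helper signature: precomputed degree-kernel matrices
    # (train x train) and (test x train); may differ in the actual utils.py
    Ktrain, Ktest = utils.getdegreekernels(XZtrain, XZtest, degree)

    svm = SVC(kernel='precomputed', C=1.0)
    svm.fit(Ktrain, Ttrain)

    # Each original sequence contributes 4 consecutive rows (A, T, C, G);
    # pick for each sequence the completion with the highest discriminant value
    nucleotides = np.array(['A', 'T', 'C', 'G'])
    Ptrain = nucleotides[svm.decision_function(Ktrain).reshape(-1, 4).argmax(axis=1)]
    Ptest = nucleotides[svm.decision_function(Ktest).reshape(-1, 4).argmax(axis=1)]
    return Ptrain, Ptest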

 

The code below tests the prediction function above with different choices of the degree parameter for the kernel. Note that running the code can take a while (up to 3 minutes) due to the relatively large size of the kernel matrices. If the computation time becomes problematic, consider using a subset of the dataset for development, and only use the full dataset once you are ready to produce the final version of your notebook.

 

degree: 1  train accuracy: 0.295  test accuracy: 0.281
degree: 2  train accuracy: 0.517  test accuracy: 0.530
degree: 3  train accuracy: 0.564  test accuracy: 0.516
degree: 4  train accuracy: 0.804  test accuracy: 0.499
degree: 5  train accuracy: 0.965  test accuracy: 0.492
degree: 6  train accuracy: 0.998  test accuracy: 0.487
