Machine Learning 2: Exercise Sheet 11 (Solved)

Exercise 1: Activation Maximization 
Consider the linear model f(x) = w⊤x + b mapping some input x to an output f(x). We would like to interpret the function f by building a prototype x⋆ in the input domain which produces a large value of f. Activation maximization produces such an interpretation by optimizing

x⋆ = argmax_x [ f(x) − Ω(x) ].

(a)    Find the prototype x⋆ obtained by activation maximization subject to the penalty Ω(x) = λ‖x‖².

(b)    Find the prototype x⋆ obtained by activation maximization subject to the penalty Ω(x) = −log p(x) with x ∼ N(µ,Σ), where µ and Σ are the mean and covariance.

(c)    Find the prototype x⋆ obtained when the data is generated as (i) z ∼ N(0,I) and (ii) x = Az + c, with A and c the parameters of the generator. Here, we optimize f w.r.t. the code z subject to the penalty Ω(z) = λ‖z‖².
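Not part of the original sheet, but as a quick reference, the following sketch outlines the closed-form prototypes obtained by setting the gradient of each regularized objective to zero (a derivation outline, not the official solution):

```latex
% Derivation sketch (not the official solution): set the gradient of each
% regularized objective to zero.
\begin{align}
\text{(a)}\quad & \nabla_x\big(w^\top x + b - \lambda\|x\|^2\big) = w - 2\lambda x = 0
    &&\Rightarrow\quad x^\star = \tfrac{1}{2\lambda}\, w \\
\text{(b)}\quad & \nabla_x\big(w^\top x + b + \log p(x)\big) = w - \Sigma^{-1}(x-\mu) = 0
    &&\Rightarrow\quad x^\star = \mu + \Sigma w \\
\text{(c)}\quad & \nabla_z\big(w^\top(Az+c) + b - \lambda\|z\|^2\big) = A^\top w - 2\lambda z = 0
    &&\Rightarrow\quad z^\star = \tfrac{1}{2\lambda}\, A^\top w,
    \quad x^\star = A z^\star + c
\end{align}
```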

Exercise 2: Layer-Wise Relevance Propagation 

We would like to test the dependence of layer-wise relevance propagation (LRP) on the structure of the neural network. For this, we consider the function y = min(a1,a2), where a1,a2 ∈ R+ are the input activations. This function can be implemented as a ReLU network in multiple ways. Two examples are given below.

[Figure: two example ReLU network implementations of the 'min' function (not reproduced here).]
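Since the figure is not reproduced here, one illustrative ReLU construction of the 'min' function (an illustrative identity, not necessarily one of the two networks on the original sheet) is:

```latex
% Illustrative identity (not necessarily one of the two networks from the sheet):
\min(a_1, a_2) \;=\; a_1 - \mathrm{ReLU}(a_1 - a_2),
\qquad \text{valid for all } a_1, a_2 \in \mathbb{R}.
```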

(a)    Show that these two networks implement the ‘min’ function on the relevant domain.

(b)    We consider the LRP-γ propagation rule:

R_j = ∑_k [ a_j (w_jk + γ w_jk⁺) / ∑_{0,j} a_j (w_jk + γ w_jk⁺) ] R_k

where (·)⁺ denotes the positive part. For each network, give an analytic expression for the score R1 obtained by applying this propagation rule at each layer in the case a1 = a2. More specifically, express R1 as a function of the input activations.

Exercise 3: Neuralization 

Consider the one-class SVM that predicts for every new data point x the ‘inlierness’ score:

f(x) = ∑_{i=1}^M α_i k(x, u_i)

where (u_i)_{i=1}^M is the collection of support vectors, and α_i > 0 are their weightings. We use the Gaussian kernel k(x, x′) = exp(−γ‖x − x′‖²).

Because we are typically interested in the degree of anomaly of a particular data point, we can also define the outlier score o(x) = −log f(x), which grows with the degree of anomaly of the data point.

(a)    Show that the outlier score o(x) can be rewritten as a two-layer neural network:

h_i = γ‖x − u_i‖² − log α_i                                    (layer 1)

o(x) = −log ∑_{i=1}^M exp(−h_i)                                 (layer 2)

(b)    Show that layer 2 converges to a min-pooling (i.e. o(x) ≈ min_i h_i) in the limit of γ → ∞.
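As a brief check (not part of the original sheet), the sketch below verifies that the two-layer rewrite is consistent with the definition o(x) = −log f(x) given above, and indicates why the log-sum-exp pooling approaches a min-pooling:

```latex
% Consistency check (sketch; assumes the outlier score o(x) = -log f(x) as above).
\begin{align}
o(x) &= -\log \sum_{i=1}^{M} \alpha_i \exp\!\big(-\gamma\|x-u_i\|^2\big)
      = -\log \sum_{i=1}^{M} \exp(-h_i),
      \qquad h_i = \gamma\|x-u_i\|^2 - \log\alpha_i, \\
% log-sum-exp is a soft minimum: with h_min = min_i h_i,
% exp(-h_min) <= sum_i exp(-h_i) <= M exp(-h_min), hence
h_{\min} - \log M \;&\le\; o(x) \;\le\; h_{\min}, \qquad h_{\min} = \min_i h_i.
\end{align}
% As gamma grows, the gaps h_i - h_min grow for every non-closest support vector,
% so the sum is dominated by its smallest term and o(x) -> min_i h_i.
```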

Exercise 4: Programming 

Download the programming files on ISIS and follow the instructions.

Exercise sheet 11 (programming)                                            
 

Explaining the Predictions of the VGG-16 Network
In this homework, we would like to implement different explanation methods on the VGG-16 network used for image classification. As a test example, we take some image of a castle

[Image: photograph of a castle used as the test example.]

and would like to explain why the VGG-16 output neuron for the class 'castle' activates for this image. The code below loads the image and normalizes it, loads the model, and identifies the output neuron corresponding to the class 'castle'.

 
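The corresponding code cell is not reproduced here. A minimal sketch of such a setup is given below; the image file name, the use of the standard ImageNet normalization, and the class index 483 for 'castle' are assumptions.

```python
import torch
import torchvision
from PIL import Image

# Load the test image and normalize it to the input format expected by VGG-16
# (assumption: file 'castle.jpg', standard ImageNet preprocessing).
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225]),
])
X = transform(Image.open('castle.jpg').convert('RGB')).unsqueeze(0)

# Load the pretrained VGG-16 model in evaluation mode
model = torchvision.models.vgg16(pretrained=True)
model.eval()

# Output neuron corresponding to the class 'castle'
# (assumption: index 483 in the standard ImageNet class list)
castle = 483
```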

Gradient × Input 
A simple method for explanation is Gradient × Input. Denoting by f the function corresponding to the activation of the desired output neuron, and by x the data point for which we would like to explain the prediction, the method assigns feature relevance using the formula

Ri = [∇f(x)]i ⋅ xi
for all i = 1…d. When the neural network is piecewise linear and positively homogeneous, the method delivers the same result as one would get with a Taylor expansion at reference point x̃ = ε ⋅ x with ε almost zero.

Task:

  Implement Gradient × Input, i.e. write a function that produces a tensor of the same dimensions as the input data and that contains the scores (R_i)_i, test it on the given input image, and visualize the result using the function utils.visualize.

 
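The solution cell is not shown; a possible implementation, reusing the variables model, X and castle from the setup sketch above, could look as follows (the exact input format expected by utils.visualize may differ):

```python
import utils  # helper module provided with the exercise

# Gradient x Input: R_i = [grad f(x)]_i * x_i
def gradient_x_input(model, X, target):
    X = X.clone().requires_grad_(True)   # track gradients w.r.t. the input
    out = model(X)[0, target]            # activation f(x) of the target neuron
    out.backward()                       # populate X.grad with the gradient of f
    return (X.grad * X).detach()         # elementwise product gives the scores R_i

R = gradient_x_input(model, X, castle)
utils.visualize(R)                       # heatmap of the relevance scores
```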

We observe that the explanation is noisy and contains a large amount of positive and negative evidence. To produce a more robust explanation, we would like to use LRP instead, and implement it using the trick described in the slides. The trick consists of detaching certain terms from the differentiation graph, so that the explanation can be obtained by performing Gradient × Input with the modified gradient.

LRP rules are typically set differently at each layer. In VGG-16, we distinguish the following three types of parameterized layers:

First convolution: This convolution receives pixels as input, and it requires a special rewrite rule that we give below.

Next convolutions: All remaining convolutions receive activations as input, and we can treat them using LRP-γ.

Dense layers: For these layers, we want to let the standard gradient signal flow, so we can simply leave these layers intact.

LRP in First Convolution
This convolution implements the propagation rule:

R_i = ∑_j [ (x_i w_ij − l_i w_ij⁺ − h_i w_ij⁻) / ∑_{0,i} (x_i w_ij − l_i w_ij⁺ − h_i w_ij⁻) ] R_j

where (⋅)⁺ and (⋅)⁻ are shortcut notations for max(0,⋅) and min(0,⋅), x_i is the value of pixel i, and l_i and h_i denote lower and upper bounds on the pixel values. To implement this rule using the proposed approach, we define the quantities p_j = ∑_{0,i} x_i w_ij − l_i w_ij⁺ − h_i w_ij⁻ and z_j = ∑_{0,i} x_i w_ij, and perform the reassignment z_j ← p_j ⋅ [z_j/p_j]_cst.

 
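The cell defining this class is not reproduced. Below is a sketch of how such a wrapper could look, following the reassignment above; the class name ConvPx matches the construction cell further down, while the internal details (in particular building the bounds l and h from the ImageNet normalization constants and keeping them as differentiable tensors) are assumptions.

```python
import copy
import torch
import torch.nn as nn

class ConvPx(nn.Module):
    """First convolution with the pixel rule implemented via the detach trick (sketch)."""

    def __init__(self, conv):
        super().__init__()
        self.conv = conv                           # original convolution (weights w, bias b)
        self.convp = copy.deepcopy(conv)           # positive weights only, no bias
        self.convp.weight.data.clamp_(min=0)
        self.convp.bias.data.zero_()
        self.convm = copy.deepcopy(conv)           # negative weights only, no bias
        self.convm.weight.data.clamp_(max=0)
        self.convm.bias.data.zero_()

    def forward(self, x):
        # Lower/upper bounds of the pixel domain (normalized 0 and 1), kept as
        # differentiable tensors so that the extended Gradient x Input can also
        # read their gradients afterwards (assumption: ImageNet normalization).
        mean = x.new_tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
        std = x.new_tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
        self.l = ((0 - mean) / std).expand_as(x).clone().requires_grad_(True)
        self.h = ((1 - mean) / std).expand_as(x).clone().requires_grad_(True)

        z = self.conv(x)                                 # z_j = sum_{0,i} x_i w_ij
        p = z - self.convp(self.l) - self.convm(self.h)  # p_j = sum_{0,i} x_i w_ij - l_i w_ij^+ - h_i w_ij^-
        return p * (z / (p + 1e-9)).detach()             # z_j <- p_j * [z_j / p_j]_cst
```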

LRP in Next Convolutions 
In the next convolutions, we would like to apply the LRP-γ rule given by:

R_j = ∑_k [ a_j (w_jk + γ w_jk⁺) / ∑_{0,j} a_j (w_jk + γ w_jk⁺) ] R_k

To implement this rule using the proposed approach, we define the quantities p_k = ∑_{0,j} a_j (w_jk + γ w_jk⁺) and z_k = ∑_{0,j} a_j w_jk, and perform the reassignment z_k ← p_k ⋅ [z_k/p_k]_cst.

Task:

  Inspired by the code for the first convolution, implement a class for the next convolutions that is equipped to perform the LRP-γ propagation when calling the gradient.

 
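Again, the solution cell is not shown; a sketch of how such a class could be written, mirroring the ConvPx sketch above, is given below. The class name and constructor signature Conv(layer, gamma) match the construction cell that follows; the internals are an assumption.

```python
import copy
import torch.nn as nn

class Conv(nn.Module):
    """Convolution with the LRP-gamma rule implemented via the detach trick (sketch)."""

    def __init__(self, conv, gamma):
        super().__init__()
        self.conv = conv                          # original convolution: z_k = sum_{0,j} a_j w_jk
        self.convp = copy.deepcopy(conv)          # modified copy: w + gamma * w^+ (same for the bias)
        self.convp.weight.data += gamma * self.convp.weight.data.clamp(min=0)
        self.convp.bias.data += gamma * self.convp.bias.data.clamp(min=0)

    def forward(self, a):
        z = self.conv(a)                          # z_k
        p = self.convp(a)                         # p_k = sum_{0,j} a_j (w_jk + gamma w_jk^+)
        return p * (z / (p + 1e-9)).detach()      # z_k <- p_k * [z_k / p_k]_cst
```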

Now, we can create the LRP-enabled model by replacing the convolution layers with their modified versions.

In [5]: lrpmodel = torchvision.models.vgg16(pretrained=True); lrpmodel.eval()
        features = lrpmodel._modules['features']
        for i in [0]:          features[i] = ConvPx(features[i])
        for i in [2]:          features[i] = Conv(features[i], 10**(-0.0))
        for i in [5,7]:        features[i] = Conv(features[i], 10**(-0.5))
        for i in [10,12,14]:   features[i] = Conv(features[i], 10**(-1.0))
        for i in [17,19,21]:   features[i] = Conv(features[i], 10**(-1.5))
        for i in [24,26,28]:   features[i] = Conv(features[i], 10**(-2.0))

Note that for layer 0 and for the subsequent convolution layers, we have used two different classes. Also, the parameter γ is set high in the lower layers and is brought increasingly closer to zero as we reach the top convolutions.

We then proceed as for Gradient × Input, except that in the modified first convolution the inputs and gradients comprise not only the pixels x but also the lower and upper bounds l and h. The code below implements this extended Gradient × Input and visualizes the resulting explanation.

 
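The corresponding cell is not reproduced; assuming the ConvPx sketch above, which keeps the bound tensors l and h as attributes of the first layer, the extended Gradient × Input could be written as follows:

```python
# Extended Gradient x Input: combine the gradients w.r.t. x, l and h
Xlrp = X.clone().requires_grad_(True)
lrpmodel(Xlrp)[0, castle].backward()          # modified gradients via the detach trick

first = lrpmodel._modules['features'][0]      # the ConvPx layer holding l and h
R = (Xlrp * Xlrp.grad                         # pixel term       x_i * dy/dx_i
     + first.l * first.l.grad                 # lower-bound term l_i * dy/dl_i
     + first.h * first.h.grad).detach()       # upper-bound term h_i * dy/dh_i

utils.visualize(R)                            # visualize the LRP explanation
```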

We observe that most of the noise has disappeared, and we can clearly see in red which patterns in the data the neural network has used to predict 'castle'. We also see in blue which patterns contribute negatively to that class, e.g. the corner of the roof and the traffic sign.
