Exercise 1: Neural Network Optimization (15+15 P)
Consider the one-layer neural network
$y = w^\top x + b$
applied to data points $x \in \mathbb{R}^d$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are the parameters of the model. We consider the optimization of the mean square error objective:
$$J(w,b) = \mathbb{E}\big[(w^\top x + b - t)^2\big],$$
where the expectation is computed over an empirical approximation $\hat{p}$ of the true joint distribution $p(x,t)$, and $t \in \{-1,1\}$. The input data follows the distribution $x \sim \mathcal{N}(\mu, \sigma^2 I)$, where $\mu$ and $\sigma^2$ are the mean and variance.
(a) Compute the Hessian of the objective function $J$ at the current location $w$ in the parameter space, and express it as a function of the parameters $\mu$ and $\sigma$ of the data.
(b) Show that the condition number of the Hessian is given by: $\kappa = 1 + \|\mu\|^2 / \sigma^2$.
Exercise 2: Neural Network Regularization (10+10+10 P)
For a neural network to generalize from limited data, it is desirable to make it sufficiently invariant to small local variations. This can be done by limiting the gradient norm $\|\partial f/\partial x\|$ for all $x$ in the input domain. As the input domain can be high-dimensional, it is impractical to minimize the gradient norm directly. Instead, we can minimize an upper-bound of it that depends only on the model parameters.
We consider a two-layer neural network with $d$ input neurons, $h$ hidden neurons, and one output neuron. Let $W$ be a weight matrix of size $d \times h$, and $b \in \mathbb{R}^h$ a collection of biases. We denote by $W_{i,:}$ the $i$th row of the weight matrix and by $W_{:,j}$ its $j$th column. The neural network computes:
$$a_j = \max\big(0,\; W_{:,j}^\top x + b_j\big) \qquad \text{(layer 1)}$$
$$f(x) = \textstyle\sum_j s_j a_j \qquad \text{(layer 2)}$$
where sj ∈ {−1,1} are fixed parameters. The first layer detects patterns of the input data, and the second layer computes a fixed linear combination of these detected patterns.
(a) Show that the gradient norm of the network can be upper-bounded as:
$$\Big\|\frac{\partial f}{\partial x}\Big\| \leq \sqrt{h} \cdot \|W\|_F$$
where $\|W\|_F$ denotes the Frobenius norm of $W$.
(b) Let $\|W\|_\text{Mix} = \sqrt{\sum_{i=1}^d \big(\sum_{j=1}^h |w_{ij}|\big)^2}$ be an $\ell_1/\ell_2$ mixed matrix norm. Show that the gradient norm of the network can be upper-bounded by it as:
$$\Big\|\frac{\partial f}{\partial x}\Big\| \leq \|W\|_\text{Mix}$$
(c) Show that the mixed norm provides a bound that is tighter than the one based on the Frobenius norm, i.e. show that:
$$\|W\|_\text{Mix} \leq \sqrt{h} \cdot \|W\|_F.$$
Exercise 3: Programming (40 P)
Download the programming files on ISIS and follow the instructions.
Neural Networks 2
In this homework, we will train neural networks on the Breast Cancer dataset. For this, we will make use of the PyTorch library. We will also make use of scikit-learn for the ML baselines. The first part of the homework analyzes the parameters of the network before and after training. The second part tests some regularization penalties and their effect on the generalization error.
Breast Cancer Dataset
The following code extracts the Breast Cancer dataset, already partitioned into training and test data. The data is normalized such that each dimension has mean 0 and variance 1. To test the robustness of our learning models, we also artificially mislabel 4% of the training data.
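For concreteness, a minimal sketch of how this data extraction could look, assuming a scikit-learn based loader; the function name `get_data`, the split proportions, and the random seed are illustrative assumptions and not the original ISIS code:

```python
# Sketch of a possible data loader (names and split sizes are assumptions).
import numpy
import sklearn.datasets
import sklearn.model_selection
import sklearn.preprocessing

def get_data(noise=0.04, seed=0):
    # Load the Breast Cancer dataset and map the labels to {-1, +1}
    D = sklearn.datasets.load_breast_cancer()
    X, T = D.data, D.target * 2.0 - 1.0

    # Partition into training and test data
    Xtrain, Xtest, Ttrain, Ttest = sklearn.model_selection.train_test_split(
        X, T, test_size=0.5, random_state=seed)

    # Normalize each input dimension to mean 0 and variance 1
    scaler = sklearn.preprocessing.StandardScaler().fit(Xtrain)
    Xtrain, Xtest = scaler.transform(Xtrain), scaler.transform(Xtest)

    # Artificially mislabel ~4% of the training examples
    rng = numpy.random.RandomState(seed)
    flip = rng.rand(len(Ttrain)) < noise
    Ttrain[flip] = -Ttrain[flip]

    return Xtrain, Ttrain, Xtest, Ttest
```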
Neural Network Classifier
In this homework, we consider the same architecture as the one considered in Exercise 2 of the theoretical part. The class NeuralNetworkClassifier implements this network. The function reg is a regularizer, which we initially set to zero (i.e. no regularizer). Because the dataset is small, the network can be optimized in batch mode, using the Adam optimizer.
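A minimal sketch of what such a class could look like, assuming the Exercise 2 architecture (ReLU detection layer, fixed ±1 second-layer weights) and full-batch Adam training of the MSE objective; the attribute names (e.g. `layer1`), hidden size, learning rate, and number of iterations are assumptions, not the actual ISIS implementation:

```python
import torch

class NeuralNetworkClassifier:

    def __init__(self, d, h=100, lr=0.01, nbit=1000):
        self.layer1 = torch.nn.Linear(d, h)              # W (stored transposed) and b
        self.s = (torch.arange(h) % 2 * 2 - 1).float()   # fixed weights s_j in {-1,+1} (alternating, as an assumption)
        self.lr, self.nbit = lr, nbit

    def reg(self):
        # Regularizer added to the MSE objective (initially: no regularization)
        return torch.tensor(0.0)

    def forward(self, X):
        # f(x) = sum_j s_j * max(0, w_j^T x + b_j)
        return torch.relu(self.layer1(X)) @ self.s

    def fit(self, X, T):
        X = torch.as_tensor(X, dtype=torch.float32)
        T = torch.as_tensor(T, dtype=torch.float32)
        optimizer = torch.optim.Adam(self.layer1.parameters(), lr=self.lr)
        for _ in range(self.nbit):                       # batch mode: use all data at each step
            optimizer.zero_grad()
            loss = ((self.forward(X) - T) ** 2).mean() + self.reg()
            loss.backward()
            optimizer.step()

    def predict(self, X):
        X = torch.as_tensor(X, dtype=torch.float32)
        return torch.sign(self.forward(X)).detach().numpy()
```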
Neural Network Performance vs. Baselines
We compare the performance of the neural network on the Breast cancer data to two other classifiers: a random forest and a support vector classification model with RBF kernel. We use the scikit-learn implementation of these models, with their default parameters.
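A possible shape of this comparison, reusing the hypothetical `get_data` loader and `NeuralNetworkClassifier` sketch from above; the accuracy helper and printing format are likewise illustrative:

```python
import numpy
import sklearn.ensemble
import sklearn.svm

def accuracy(pred, T):
    return numpy.mean(pred == T)

Xtrain, Ttrain, Xtest, Ttest = get_data()   # hypothetical loader from the sketch above

nn = NeuralNetworkClassifier(Xtrain.shape[1]); nn.fit(Xtrain, Ttrain)
rf = sklearn.ensemble.RandomForestClassifier().fit(Xtrain, Ttrain)
svc = sklearn.svm.SVC().fit(Xtrain, Ttrain)  # default kernel is RBF

print('> NN      | Train Acc: %.3f | Test Acc: %.3f' % (
    accuracy(nn.predict(Xtrain), Ttrain), accuracy(nn.predict(Xtest), Ttest)))
for name, model in [('RForest', rf), ('SVC', svc)]:
    print('> %-7s | Train Acc: %.3f | Test Acc: %.3f' % (
        name, model.score(Xtrain, Ttrain), model.score(Xtest, Ttest)))
```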
> NN | Train Acc: 1.000 | Test Acc: 0.884
The neural network does not perform as well as the baselines. Most likely, the neural network has overfitted its decision boundary, in particular on the mislabeled training examples.
Gradient and Parameter Norms (25 P)
For the model to generalize better, we assume that the gradient of the decision function should be prevented from becoming too large. Because the gradient can only be evaluated on the current data distribution (and may not generalize outside the data), we resort to the following inequality, which we have proven in the theoretical part for this class of neural network models:
$$ \text{Grad} \leq \|W\|_\text{Mix} \leq \sqrt{h}\|W\|_\text{Frob} $$
where
$\|W\|_\text{Frob} = \sqrt{\sum_{i=1}^d \sum_{j=1}^h w_{ij}^2}$
$\|W\|_\text{Mix} = \sqrt{\sum_{i=1}^d \big(\sum_{j=1}^h | w_{ij}|\big)^2}$
$\text{Grad} = \textstyle \frac1N \sum_{n=1}^N\|\nabla_{\boldsymbol{x}}f (\boldsymbol{x_n})\|$
and where $d$ is the number of input features, $h$ is the number of neurons in the hidden layer, and $W$ is the matrix of weights in the first layer (Note that in PyTorch, the matrix of weights is given in transposed form).
As a first step, we would like to keep track of these quantities during training. The function Frob(nn) that computes $\|W\|_\text{Frob}$ is already implemented for you.
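Under the attribute-naming assumption of the sketch above (first-layer weights stored, transposed, in `nn.layer1.weight`), the provided `Frob(nn)` function might look roughly as follows:

```python
def Frob(nn):
    # ||W||_Frob: square root of the sum of squared first-layer weights
    W = nn.layer1.weight   # shape (h, d): transposed view of W as defined above
    return (W ** 2).sum() ** 0.5
```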
Tasks:
Implement the function Mix(nn) that receives the neural network as input and returns $\|W\|_\text{Mix}$.
Implement the function Grad(nn,X) that receives the neural network and some dataset as input, and computes the averaged gradient norm ($\text{Grad}$).
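A minimal sketch of how these two functions could be implemented, again assuming the weights are accessible as `nn.layer1.weight` and the network can be evaluated via `nn.forward`; the gradient is obtained with PyTorch autograd:

```python
import torch

def Mix(nn):
    # ||W||_Mix: l2 norm over input dimensions of the l1 norms over hidden units.
    # nn.layer1.weight has shape (h, d), i.e. W transposed, so we sum |w_ij| over
    # dim=0 (hidden units) before taking the l2 norm over input dimensions.
    W = nn.layer1.weight
    return (W.abs().sum(dim=0) ** 2).sum() ** 0.5

def Grad(nn, X):
    # Average over the dataset of the norm of the input gradient of f
    X = torch.as_tensor(X, dtype=torch.float32).requires_grad_(True)
    Y = nn.forward(X).sum()            # sum over data points; gradients remain per-point
    G, = torch.autograd.grad(Y, X)
    return G.norm(dim=1).mean()
```

Note that, because Mix only uses differentiable tensor operations, the same function can be reused directly as a penalty in the next part.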
> After | Train Acc: 1.000 | Test Acc: 0.884 | Grad: 7.297 | WMix: 40.103 | sqrt(h)*WFrob: 56.739
We observe that the inequality $\text{Grad} \leq \|W\|_\text{Mix} \leq \sqrt{h} \|W\|_\text{Frob}$ we have proven also holds empirically. We also observe that these quantities tend to increase as training proceeds. This is typical behavior: the network starts rather simple and becomes more complex as more and more variations in the training data are captured.
Norm Penalties (15 P)
We consider the new objective $J^\text{Frob}(\theta) = \text{MSE}(\theta) + \lambda \cdot (\sqrt{h} \|W\|_\text{Frob})^2$, where the first term is the original mean square error objective and the second term is the added penalty. We hardcode the penalty coefficient to $\lambda = 0.005$. In principle, for maximum performance and a fair comparison between the methods, several values of this coefficient should be tried (also for the other models) and selected on some validation set. Here, for simplicity, we omit this step.
A downside of the Frobenius norm is that it is not a very tight upper bound of the gradient: penalizing it does not specifically penalize large gradients, and other useful properties of the model could be negatively affected by it. Therefore, we also experiment with the mixed-norm regularizer $\textstyle \lambda \cdot \|W\|_\text{Mix}^2$, which is a tighter bound of the gradient, and for which we hardcode the penalty coefficient to $\lambda = 0.025$.
Task:
Create two new classifiers by reimplementing the regularization function with the Frobenius norm regularizer and the mixed norm regularizer, respectively. For this task, you may call the norm functions implemented in the question above, but this time you also need to ensure that these functions can be differentiated w.r.t. the weight parameters.
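Under the same assumptions as the sketches above (and given that `Frob` and `Mix` are built from differentiable tensor operations), the two new classifiers could be obtained by overriding the reg function; the class names below are illustrative:

```python
class FrobClassifier(NeuralNetworkClassifier):
    def reg(self):
        # lambda * (sqrt(h) * ||W||_Frob)^2, with lambda = 0.005
        h = self.layer1.weight.shape[0]
        return 0.005 * h * Frob(self) ** 2

class MixClassifier(NeuralNetworkClassifier):
    def reg(self):
        # lambda * ||W||_Mix^2, with lambda = 0.025
        return 0.025 * Mix(self) ** 2
```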
The code below implements and trains neural networks with the new regularizers, and compares their performance with the previous models.
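A possible shape of that training and comparison code, reusing the hypothetical helpers from the earlier sketches:

```python
import math

models = [('NN', NeuralNetworkClassifier(Xtrain.shape[1])),
          ('NN+Frob', FrobClassifier(Xtrain.shape[1])),
          ('NN+Mix', MixClassifier(Xtrain.shape[1]))]

for name, model in models:
    model.fit(Xtrain, Ttrain)
    h = model.layer1.weight.shape[0]
    print('> %-8s | Train Acc: %.3f | Test Acc: %.3f | Grad: %.3f | WMix: %.3f | sqrt(h)*WFrob: %.3f' % (
        name,
        accuracy(model.predict(Xtrain), Ttrain),
        accuracy(model.predict(Xtest), Ttest),
        Grad(model, Xtrain),
        Mix(model),
        math.sqrt(h) * Frob(model)))
```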
> NN | Train Acc: 1.000 | Test Acc: 0.884 | Grad: 7.297 | WMix: 40.103 | sqrt(h)*WFrob: 56.739
> NN+Frob | Train Acc: 0.961 | Test Acc: 0.954 | Grad: 0.735 | WMix: 1.767 | sqrt(h)*WFrob: 2.689
> NN+Mix | Train Acc: 0.951 | Test Acc: 0.961 | Grad: 0.745 | WMix: 1.637 | sqrt(h)*WFrob: 4.138
We observe that the regularized neural networks now perform on par with the baselines. It is interesting to observe that the mixed-norm penalty reduces the gradient more selectively, letting the Frobenius norm take higher values.