I Datasets
First, construct your own toy datasets, so you can visualize the result easily and understand the algorithm. Take input instances xn ∈ R and labels yn ∈ R (regression) or yn ∈ {0,1} (classification).
Then, try the MNIST dataset of handwritten digits, with instances x ∈ RD (where D = 784). For classification with logistic regression, use the labels yn ∈ {0,...,9} (for binary classification, pick instances from only two digit classes and use labels yn ∈ {0,1}). For linear regression, create a ground-truth mapping y = f(x) = Ax + b in one of the following ways (a small construction sketch follows this list):
• A random mapping with output dimension D′, e.g. take A of D′×D and b of D′×1 with elements drawn from N(0,1).
• A mapping that rotates, scales, shifts and possibly clips the input image x and adds noise to it. This makes it easy to visualize the result, since the desired output f(x) for an image x should look like x but transformed accordingly.
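As an illustration, here is a minimal Matlab sketch of both constructions. The sizes N and Dp (playing the role of D′), the noise level and the variable names are my own choices, not prescribed by the lab, and Xmnist is assumed to already hold the MNIST images as a D×N matrix:

    % Toy 1D regression dataset: inputs in [-1,1] with a 1 appended for the bias.
    N = 100; wtrue = [2; -1];                  % true slope and bias
    X = [2*rand(1,N)-1; ones(1,N)];            % 2xN inputs, last row = 1
    y = wtrue'*X + 0.1*randn(1,N);             % 1xN noisy targets

    % Random ground-truth linear mapping for MNIST inputs (D = 784 -> Dp outputs).
    D = 784; Dp = 10;
    A = randn(Dp,D); b = randn(Dp,1);          % elements drawn from N(0,1)
    Y = A*Xmnist + b;                          % Xmnist: DxN; Y: DpxN (implicit expansion, R2016b+)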
II (Stochastic) gradient descent for linear regression
Review Consider first the case of a single output dimension, y ∈ R. Given a sample with xn ∈ RD and yn ∈ R, we solve a linear least-squares regression problem by minimizing:
E(w) = ½ Σ_{n=1}^N (yn − wTxn)² = Σ_{n=1}^N e(w; xn, yn)   with   e(w; x, y) = ½ (y − wTx)²,
where we assume the last element of each xn vector is equal to 1, so that wD is the bias parameter (this simplifies the notation). E(w) is a quadratic objective function over w with a unique minimizer that can be found in closed form by solving the normal equations (obtained by computing the gradient of E wrt w, setting it to zero and solving for w):
XXTw∗ = Xy  ⇔  w∗ = (XXT)−1Xy,   where XD×N = (x1,...,xN) and yN×1 = (y1,...,yN)T.
This can be done with a routine for solving linear systems, e.g. in Matlab, use linsolve or the “\” operator. You can also use inv but this is numerically more costly and less stable.
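For example, assuming the inputs are stored as a D×N matrix X (last row equal to 1) and the targets as a 1×N vector y, as in the earlier toy sketch, a minimal Matlab sketch (w_opt is my own variable name):

    w_opt = (X*X') \ (X*y');          % solve (XX')w = Xy with backslash
    % w_opt = linsolve(X*X', X*y');   % equivalent
    % w_opt = inv(X*X')*(X*y');       % works, but costlier and less stable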
The minimizer w∗ of E can also be found iteratively by gradient descent (GD) and stochastic gradient descent (SGD), starting from an initial w. These use the following updates (obtained by computing the gradient ∇E(w) of E wrt w):
GD: ∆w = η Σ_{n=1}^N (yn − wTxn) xn        SGD: ∆w = η Σ_{n∈B} (yn − wTxn) xn
where η is the learning rate (step size) and B is a minibatch (i.e., a subset of 1 ≤ |B| ≤ N points). For GD, at each iteration we set w ← w + ∆w with ∆w = −η ∇E(w). For SGD, after each minibatch we set w ← w + ∆w with ∆w = −η Σ_{n∈B} ∇e(w; xn, yn), and repeat over all minibatches to complete one iteration (one epoch), which passes over the whole data once. If the minibatches contain a single data point (|B| = 1), then there are N minibatches and we do pure online learning. If there is a single minibatch containing all points (|B| = N), we do pure batch learning (identical to GD). The fastest training occurs for an intermediate (usually small) minibatch size. It also helps to reorder at random (shuffle) the data points at the beginning of each epoch (so the minibatches vary, and are scanned in a different order), rather than using the same, original order at every epoch (helpful Matlab function: randperm).
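The following Matlab sketch shows one possible implementation of GD and of minibatch SGD for the single-output case. The values of eta, maxit and the minibatch size Bsz are illustrative, not prescribed by the lab; X is D×N with a last row of ones and y is 1×N, as before:

    eta = 1e-3; maxit = 100; N = size(X,2);

    % GD: one update per pass over the whole dataset.
    w = 0.02*rand(size(X,1),1) - 0.01;          % small random initial weights
    for it = 1:maxit
      g = -X*(y' - X'*w);                       % gradient of E(w)
      w = w - eta*g;                            % i.e. w <- w + eta*sum_n (yn - w'xn) xn
    end

    % SGD: shuffle the points each epoch and update after every minibatch.
    Bsz = 10; w = 0.02*rand(size(X,1),1) - 0.01;
    for epoch = 1:maxit
      perm = randperm(N);
      for i = 1:Bsz:N
        B = perm(i:min(i+Bsz-1,N));             % indices of this minibatch
        g = -X(:,B)*(y(B)' - X(:,B)'*w);        % minibatch gradient
        w = w - eta*g;
      end
    end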
Consider now the general case of several output dimensions and a sample with xn ∈ RD, yn ∈ RD′.
This is equivalent to D′ independent single-output regressions. The corresponding equations are as follows, where XD×N = (x1,...,xN), YD′×N = (y1,...,yN), WD′×D, and wdT is the dth row of W:
Objective function: E(W) = ½ Σ_{n=1}^N ||yn − Wxn||² = Σ_{n=1}^N e(W; xn, yn)   with   e(W; x, y) = ½ ||y − Wx||² = ½ Σ_{d=1}^{D′} (yd − wdTx)²
Gradient: ∇E(W) = −Σ_{n=1}^N (yn − Wxn) xnT        GD: ∆W = η Σ_{n=1}^N (yn − Wxn) xnT        SGD: ∆W = η Σ_{n∈B} (yn − Wxn) xnT
Normal equations: (XXT)WT = XYT  ⇔  WT = (XXT)−1XYT.
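A minimal Matlab sketch of the multi-output case (W_opt and eta are my own names; X is D×N with a last row of ones and Y is D′×N):

    W_opt = ((X*X') \ (X*Y'))';        % normal equations: W' = (XX')^{-1} X Y'

    % One GD iteration, inside a loop over iterations with W initialised to small random values:
    W = W + eta * (Y - W*X) * X';      % delta_W = eta * sum_n (yn - W xn) xn'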
Exploration: toy problem Run each algorithm (GD or SGD) for, say, 100 iterations w(0),w(1),...,w(100) from an initial w = w(0) (equal to small random numbers, e.g. uniform in [−0.01,0.01]). To visualize the results, we plot the following for each algorithm (GD and SGD):
• The dataset (yn vs xn) and the regression line f(x) = wTx.
• The error E(w) over iterations, evaluated on the training set, and also on a validation set.
• A contour plot of the error E(w) and the iterates w(0), w(1), ... (see the sketch after this list).
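A minimal Matlab sketch of the contour plot for a 1D toy dataset, so that w = (slope, bias)T ∈ R². It assumes the iterates were stored column-wise in a 2×(number of iterations + 1) matrix Wpath (my own name), and reuses X and y from the toy sketch above; the grid range is arbitrary:

    [w1g,w2g] = meshgrid(linspace(-3,3,100));   % grid of candidate (w1,w2) values
    Egrid = zeros(size(w1g));
    for i = 1:numel(w1g)
      r = y' - X'*[w1g(i); w2g(i)];             % residuals for this candidate w
      Egrid(i) = 0.5*sum(r.^2);                 % E(w) on the training set
    end
    contour(w1g, w2g, Egrid, 30); hold on;
    plot(Wpath(1,:), Wpath(2,:), 'o-');         % path of iterates w(0), w(1), ...
    xlabel('w_1 (slope)'); ylabel('w_2 (bias)'); hold off;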
Consider the following questions:
• Does the error E(w) approach the optimal error E(w∗) (where w∗ is the solution to the normal equations)? How fast does it approach it? Does E(w) decrease monotonically? Or does it oscillate or even diverge?
• Vary the learning rate η > 0. What happens if it is very small or if it is very big? Try to determine the value of η that gives the fastest convergence for your toy dataset by trial and error. Note: practical values of η are usually (quite) smaller than 1.
• For SGD with fixed η, vary the minibatch size |B| between 1 and N. How does this affect the speed of convergence?
• For both GD and SGD we keep the learning rate η constant. How does this affect the behavior of SGD as we keep training? What should we do to improve that?
• Repeat all of the above in the following situations:
– Use a dataset having 10 times as many points.
– Use a dataset having 3 times larger noise.
– Use a different initial w(0).
Are the best values of η the same as before for GD? How about SGD? What other changes do you observe?
See the end of file lab07_linregr.m for suggestions of things to explore.
Exploration: MNIST See file lab07_linregr_MNIST.m. Select a small enough subset of MNIST as training inputs (using the whole dataset will be slow). Apply the ground-truth linear transformation to the data to generate the outputs yn = f(xn) = Axn + b. We can compute the optimal solution exactly by solving the normal equations, and approximately by running GD and SGD. Then, consider the same questions as with the toy dataset. Note that the appropriate values for η, etc. may now be significantly different. To visualize the results, we plot the following for each algorithm (GD and SGD):
• The training and validation error E(w) over iterations, as with the toy dataset.
• For each of the 4 results (true mapping, optimal mapping from the normal equations, and the mappings learned by GD and SGD), we plot:
– The parameters of the linear mapping using imagesc(A) and imagesc(b). These are signed values, so we use colormap(parula(256)).
– If the linear mapping is the rotation/shift/scale/clip transformation, which produces as output a (possibly smaller) image yn, we plot the following for a few sample images: the input image xn, and the outputs yn under the 4 mappings (see the sketch after this list).
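A minimal plotting sketch for one of the 4 mappings. It assumes A, b define the image-valued mapping, X holds the MNIST inputs as a 784×N matrix (without the appended 1), and ho, wo are the size of the output image; adjust these to your setup:

    imagesc(A); colormap(parula(256)); colorbar; title('parameters A');   % signed values
    figure; imagesc(b); colormap(parula(256)); colorbar; title('parameters b');

    n = 1; ho = 28; wo = 28;                     % sample index and output image size (adjust)
    figure; colormap(gray(256));
    subplot(1,2,1); imagesc(reshape(X(:,n),28,28)); axis image; title('input x_n');
    subplot(1,2,2); imagesc(reshape(A*X(:,n)+b,ho,wo)); axis image; title('output f(x_n)');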
III (Stochastic) gradient descent for logistic regression (= classification)
Review We consider the two-class case, given a sample with xn ∈ RD and yn ∈ {0,1}. Again we assume the last element of each xn vector is equal to 1, so that wD is the bias parameter in the sigmoid θ(x) = σ(wTx) = 1/(1 + exp(−wTx)). The objective functions E(w) below do not admit a closed-form solution, so we need iterative algorithms. We apply GD and SGD, whose updates are obtained by computing the gradient ∇E(w) of E wrt w. We can learn w by maximum likelihood, or by regression.
• By maximum likelihood: we minimize the cross-entropy
E(w) = −Σ_{n=1}^N (yn log θn + (1 − yn) log(1 − θn)) = Σ_{n=1}^N e(w; xn, yn)   where   θn = σ(wTxn).
With GD and SGD we have the following updates:
GD: ∆w = η Σ_{n=1}^N (yn − θn) xn        SGD: ∆w = η Σ_{n∈B} (yn − θn) xn.
• By regression: we minimize the least-squares error
E(w) = ½ Σ_{n=1}^N (yn − θn)² = Σ_{n=1}^N e(w; xn, yn)   where   θn = σ(wTxn).
With GD and SGD we have the following updates:
GD: ∆w = η Σ_{n=1}^N (yn − θn) θn (1 − θn) xn        SGD: ∆w = η Σ_{n∈B} (yn − θn) θn (1 − θn) xn.
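A minimal Matlab sketch of GD and SGD for the maximum-likelihood (cross-entropy) case; the regression case only changes the update by the extra factor θn(1 − θn). The values of eta, maxit and Bsz are illustrative; X is D×N with a last row of ones and y is 1×N with values in {0,1}:

    sigm = @(t) 1./(1+exp(-t));                 % sigmoid
    eta = 0.01; maxit = 100; Bsz = 10; N = size(X,2);

    % GD
    w = 0.02*rand(size(X,1),1) - 0.01;          % small random initial weights
    for it = 1:maxit
      theta = sigm(X'*w);                       % Nx1 predicted probabilities
      w = w + eta * (X*(y' - theta));           % dw = eta * sum_n (yn - theta_n) xn
    end

    % SGD over shuffled minibatches
    w = 0.02*rand(size(X,1),1) - 0.01;
    for epoch = 1:maxit
      perm = randperm(N);
      for i = 1:Bsz:N
        B = perm(i:min(i+Bsz-1,N));             % indices of this minibatch
        theta = sigm(X(:,B)'*w);
        w = w + eta * (X(:,B)*(y(B)' - theta));
      end
    end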
Exploration: toy problem Proceed as in the linear regression case. The differences now are: 1) the problem is classification rather than regression; 2) the objective function E(w) and hence its gradient and GD/SGD updates are different; and 3) the model is σ(wTx) rather than wTx. Some additional questions to consider:
• What do the contours of the objective function (cross-entropy or least-squares error) look like compared with each other, and compared with the contours of the least-squares error for linear regression?
• Try datasets where the two classes are linearly separable, and where they are not linearly separable. Do the contours of the objective function look the same in both cases? Why? How does this affect the optimal parameters? More specifically: in the linearly separable case, what happens to ||w|| as the number of iterations increases? Will the algorithm ever converge?
• Try using as initial weights w ∈ RD random numbers uniformly distributed in [−u,u]. Try a small value of u (say, 0.01) and a large one (say, 100). What happens, and why? Which initialization is better?