IFT6390 Homework 2

•   We ask you to submit a report as a PDF file. You should also submit every source code file you wrote or adapted. The practical part should be coded in Python (with the numpy and matplotlib libraries). You are of course encouraged to draw inspiration from what was done in the lab sessions.

•   You can submit your Python code as a Jupyter notebook (.ipynb). To write math in your report, you may use software such as LaTeX, LyX, or Word, or even write the equations directly in the notebook with MathJax syntax. In any case, you should export your report to a PDF file and submit that file.

•   You should hand in your report via StudiUM. Only one of the teammates should submit the report. If you have many files to submit, you may compress them into a .zip or .tar.gz archive and upload that single file.

1 Linear and non-linear regularized regression (50 pts)
1.1        Linear Regression
Let’s consider a regression problem for which we have a training dataset $D_n$ with $n$ (input, target) pairs:

$$D_n = \{(x^{(1)}, t^{(1)}), \ldots, (x^{(n)}, t^{(n)})\}$$

with $x^{(i)} \in \mathbb{R}^d$ and $t^{(i)} \in \mathbb{R}$.

Linear regression assumes a parametrized form for the function $f$ which predicts the value of the target from a new data point $x$. (More precisely, it seeks to predict the expectation of the target variable conditioned on the input variable: $f(x) \approx \mathbb{E}[t \mid x]$.)

The parametrization is a linear transformation of the input, or more precisely an affine transformation:

$$f(x) = w^T x + b$$
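In numpy, this affine predictor is a one-liner (a minimal sketch; the names predict, X, w and b are illustrative, not part of the assignment):

```python
import numpy as np

def predict(X, w, b):
    """Affine prediction f(x) = w^T x + b, applied to each row of X.

    X: (n, d) array of inputs, w: (d,) weight vector, b: scalar bias.
    Returns the (n,) array of predicted targets.
    """
    return X @ w + b
```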

1.   Specify this model’s set of parameters $\theta$, as well as the nature and dimensionality of each of them.

2.   The loss function typically used for linear regression is the quadratic loss:

$$L((x, t), f) = (f(x) - t)^2$$

We now define the empirical risk $\hat{R}$ on the set $D_n$ as the sum of the losses on this set (instead of the average of the losses, as it is sometimes defined). Give the precise mathematical formula of this risk.

3.   Following the principle of Empirical Risk Minimization (ERM), we will seek the parameters that yield the smallest total quadratic loss, i.e. the smallest empirical risk $\hat{R}$. Write a mathematical formulation of this minimization problem.

4.   A general algorithm for solving this optimization problem is gradient descent. Give a formula for the gradient of the empirical risk with respect to each parameter (one standard form is sketched right after this list).

5.   Define the error of the model on a single point $(x, t)$ as $f(x) - t$. Explain in English the relationship between the gradient of the empirical risk and the errors on the training set.
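For concreteness, with the sum-of-losses convention above, the gradients take the following standard form (shown here only as a reference point for the two questions above):

$$\nabla_{w}\hat{R} = \sum_{i=1}^{n} 2\big(f(x^{(i)}) - t^{(i)}\big)\, x^{(i)}, \qquad \frac{\partial \hat{R}}{\partial b} = \sum_{i=1}^{n} 2\big(f(x^{(i)}) - t^{(i)}\big),$$

i.e. each training example contributes to the gradient in proportion to its error.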

1.2        Ridge Regression
Instead of $\hat{R}$, we will now consider a regularized empirical risk: $\tilde{R} = \hat{R} + \lambda L(\theta)$. Here $L$ takes the parameters $\theta$ and returns a scalar penalty. This penalty is smaller for parameters toward which we have an a priori preference. The scalar $\lambda \geq 0$ is a hyperparameter that controls how much we favor minimizing the empirical risk versus this penalty. Note that we recover the unregularized empirical risk when $\lambda = 0$.

We will consider a regularization called Ridge, or weight decay, which penalizes the squared norm ($\ell_2$ norm) of the weights (but not the bias): $L(\theta) = \|w\|^2 = \sum_{k=1}^{d} w_k^2$. We want to minimize $\tilde{R}$ rather than $\hat{R}$.
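Combining the definitions above, the objective to be minimized is therefore

$$\tilde{R}(w, b) = \sum_{i=1}^{n} \big(w^T x^{(i)} + b - t^{(i)}\big)^2 + \lambda \sum_{k=1}^{d} w_k^2.$$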

1.   Express the gradient of $\tilde{R}$. How does it differ from the gradient of the unregularized empirical risk?

2.   Write down detailed pseudocode for the training algorithm that finds the optimal parameters minimizing $\tilde{R}$ by gradient descent. To keep things simple, use a constant step-size $\eta$.

3.   There happens to be an analytical solution to the minimization problem coming from linear regression (regularized or not). Assuming no bias (meaning $b = 0$), find a matrix formulation for the empirical risk and its gradient, with the matrix

$$X = \begin{pmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(n)} & \cdots & x_d^{(n)} \end{pmatrix} \qquad \text{and the vector} \qquad t = \begin{pmatrix} t^{(1)} \\ \vdots \\ t^{(n)} \end{pmatrix}.$$

4.   Derive a matrix formulation of the analytical solution to the ridge regression minimization problem by expressing that the gradient is null at the optimum. What happens when $n < d$ and $\lambda = 0$?
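Once question 4 is worked out, the resulting closed form is easy to verify numerically. Here is a minimal numpy sketch, assuming X is the $n \times d$ matrix and t the target vector defined above (the helper name ridge_closed_form is illustrative):

```python
import numpy as np

def ridge_closed_form(X, t, lam):
    """Solve (X^T X + lam * I) w = X^T t for the ridge weights (no bias).

    With lam = 0 and n < d, the matrix X^T X is singular and the solve
    fails -- exactly the degenerate case the question asks about.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)
```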

1.3             Regression with a fixed non-linear pre-processing
We can build a non-linear regression algorithm by first passing the data through a fixed non-linear filter: a function $\phi(x)$ that maps $x$ non-linearly to a higher-dimensional $\tilde{x}$.

For instance, if $x \in \mathbb{R}$ is one-dimensional, we can use the polynomial transformation:

$$\tilde{x} = \phi_{\mathrm{poly}^k}(x) = \begin{pmatrix} x \\ x^2 \\ \vdots \\ x^k \end{pmatrix}$$

We can then train a regression, not on the pairs $(x^{(i)}, t^{(i)})$ from the initial training set $D_n$, but on the transformed data $(\phi(x^{(i)}), t^{(i)})$. This training finds the parameters of an affine transformation $f$.

To predict the target for a new point $x$, we won’t use $f(x)$ but rather $\tilde{f}(x) = f(\phi(x))$.
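A minimal numpy sketch of this univariate feature map (the helper name phi_poly is illustrative):

```python
import numpy as np

def phi_poly(x, k):
    """Map a 1-D array of inputs x to polynomial features (x, x^2, ..., x^k).

    Returns an (n, k) matrix whose columns are x**1, ..., x**k, ready to be
    fed to a linear/ridge regression in place of the raw inputs.
    """
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**j for j in range(1, k + 1)])
```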

1.   Write the detailed expression of $\tilde{f}(x)$ when $x$ is one-dimensional (univariate) and we use $\phi = \phi_{\mathrm{poly}^k}$.

2.   Give a detailed explanation of the parameters and their dimensions.

3.   In dimension $d \geq 2$, a polynomial transformation should include not only the individual variable powers $x_i^j$, for exponents $j \leq k$ and variables $i \leq d$, but also all the interaction terms of order $k$ and less between several variables (e.g. terms like $x_i^{j_1} x_l^{j_2}$, for $j_1 + j_2 \leq k$ and variables $i, l \leq d$). For $d = 2$, write down, as a function of the 2 components of $x$, the transformations $\phi_{\mathrm{poly}^1}(x)$, $\phi_{\mathrm{poly}^2}(x)$, and $\phi_{\mathrm{poly}^3}(x)$.

4.   What is the dimensionality of $\phi_{\mathrm{poly}^k}(x)$, as a function of $d$ and $k$?
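As a sanity check for the last two questions: for $d = 2$ and $k = 2$, one valid ordering of the features is

$$\phi_{\mathrm{poly}^2}(x) = \big(x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2\big)^T,$$

which has 5 components (all monomials in $x_1, x_2$ of total degree between 1 and 2); any other ordering of the same monomials is equally acceptable.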

2           Practical Part (50 pts)
You should include all the Python files you used to obtain your results. There should be a main file (which can be a notebook) that produces the required plots, one after another. Your results should be reproducible! Briefly explain in the report how to run your code.

1.   Implement ridge regression with gradient descent in Python. We will call this algorithm regression_gradient (a minimal skeleton is sketched after this list). Note that we now have parameters $w$ and $b$ that we want to learn on the training set, as well as a hyper-parameter to control the capacity of our model: $\lambda$. There are also hyper-parameters for the optimization: the step-size $\eta$ and, potentially, the number of steps.

2.   Consider the function $h(x) = \sin(x) + 0.3x - 1$. Draw a dataset $D_n$ of pairs $(x, h(x))$ with $n = 15$ points, where $x$ is drawn uniformly at random in the interval $[-5, 5]$ (see the experiment sketch after this list). Make sure to use the same set $D_n$ for all the plots below.

3.   With $\lambda = 0$, train your model on $D_n$ with the algorithm regression_gradient. Then plot, on the interval $[-10, 10]$: the points from the training set $D_n$, the curve $h(x)$, and the curve of the function learned by your model using gradient descent. Make a clean legend. Remark: the solution found by gradient descent should converge to the straight line that is closest to the $n$ points (and therefore to the analytical solution). Be ready to adjust your step-size (small enough) and number of iterations (large enough) to reach this result.

4.   On the same graph, add the predictions you get for an intermediate value of $\lambda$ and for a large value of $\lambda$. Your plot should include the values of $\lambda$ in the legend. It should illustrate qualitatively what happens when $\lambda$ increases.

5.   Draw another dataset $D_{\mathrm{test}}$ of 100 points by following the same procedure as for $D_n$. Train your linear model on $D_n$ for $\lambda$ taking values in $\{0.0001, 0.001, 0.01, 0.1, 1, 10, 100\}$. For each value of $\lambda$, measure the average quadratic loss on $D_{\mathrm{test}}$. Report these values on a graph with $\lambda$ on the x-axis and the loss value on the y-axis.

6.   Use the technique studied in problem 1.3 above to learn a non-linear function of $x$. Specifically, use ridge regression with the fixed preprocessing $\phi_{\mathrm{poly}^l}$ described above to get a polynomial regression of order $l$. Apply this technique with $\lambda = 0.01$ and different values of $l$. Plot a graph similar to the one of question 3 with all the prediction functions you obtained. Don’t plot too many functions, to keep the graph readable, and specify the value of $l$ in the legend.

7.   Comment on what happens when $l$ increases. What happens to the empirical risk (the loss on $D_n$), and to the true risk (the loss on $D_{\mathrm{test}}$)?
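For question 1, here is a minimal sketch of what regression_gradient could look like, under the sum-of-losses convention from part 1 (the signature, defaults, and hyper-parameter values are illustrative, not prescribed):

```python
import numpy as np

def regression_gradient(X, t, lam, eta=1e-3, n_steps=5000):
    """Ridge regression fitted by full-batch gradient descent.

    Minimizes sum_i (w^T x_i + b - t_i)^2 + lam * ||w||^2,
    where the bias b is not regularized. Returns the learned (w, b).
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        err = X @ w + b - t                   # per-example errors f(x_i) - t_i
        grad_w = 2 * X.T @ err + 2 * lam * w  # gradient of the regularized risk
        grad_b = 2 * err.sum()                # bias gradient (no penalty term)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b
```

And for questions 2 to 6, the experiments can be driven by a short script along the following lines (a sketch under the assumptions above; the random seed, the chosen values of $\lambda$, and the plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

h = lambda x: np.sin(x) + 0.3 * x - 1

# Question 2: fixed training set of n = 15 points drawn uniformly on [-5, 5].
rng = np.random.default_rng(0)
x_train = rng.uniform(-5, 5, 15)
t_train = h(x_train)

# Questions 3-4: fit for several values of lambda and plot over [-10, 10].
xs = np.linspace(-10, 10, 200)
plt.scatter(x_train, t_train, label="$D_n$")
plt.plot(xs, h(xs), label="$h(x)$")
for lam in [0.0, 1.0, 100.0]:
    w, b = regression_gradient(x_train[:, None], t_train, lam)
    plt.plot(xs, w[0] * xs + b, label=f"$\\lambda = {lam}$")
plt.legend()
plt.show()

# Question 5: average quadratic loss on a 100-point test set.
x_test = rng.uniform(-5, 5, 100)
t_test = h(x_test)
lams = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
losses = []
for lam in lams:
    w, b = regression_gradient(x_train[:, None], t_train, lam)
    losses.append(np.mean((w[0] * x_test + b - t_test) ** 2))
plt.semilogx(lams, losses, marker="o")
plt.xlabel("$\\lambda$")
plt.ylabel("average quadratic loss on $D_{test}$")
plt.show()

# Question 6: same idea, but train on phi_poly(x_train, l) (section 1.3)
# instead of the raw inputs, with lam = 0.01 and a few values of l.
```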
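Note that the two sketches above are starting points, not reference solutions: you are expected to tune $\eta$ and the number of steps yourself, and to adapt the plotting code to produce the clean, legended figures the questions ask for.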
