Notation: For any N ∈ ℕ, we use [N] to denote {1, 2, ..., N}.
1 Introduction
Linear regression relates a multivariate input x ∈ R^d to a scalar output y ∈ R via a linear model, that is, y ≈ θ⊤x for some parameter vector θ ∈ R^d. The goal of linear regression is to learn θ from a finite dataset D = {(x^(i), y^(i))}_{i∈[N]}. In practical applications only part of the output can be explained by the inputs, and therefore we have y^(i) = θ⊤x^(i) + ξ^(i), where the error ξ^(i) is a realization of a noise random variable with zero mean. In this case, the best one can hope for is to find an estimator θ̂ ≈ θ constructed from the data. The estimator θ̂ can be used to predict the output y corresponding to a new unseen input x, which is at the core of many applications in statistics and machine learning. We assume throughout this project that x_1^(i) = 1 for all i ∈ [N], and thus θ_1 can be interpreted as a bias term that captures the mean or the median of the outputs y^(i).
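As a concrete illustration of this setup, the following sketch generates synthetic data consistent with the model (all numbers are our own placeholders, not project data):

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 4
# First feature is the constant 1, so theta_1 acts as the bias term.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
theta_true = np.array([0.5, 1.0, -2.0, 0.3])
xi = rng.normal(scale=0.1, size=N)   # zero-mean noise realizations xi^(i)
y = X @ theta_true + xi              # y^(i) = theta^T x^(i) + xi^(i)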
In this project you will derive and implement several linear programming-based methods to learn the unknown parameter vector θ from data.
Figure 1: MSE vs. MAE estimators for a dataset with 1 outlier (red dot)
2 Mean-Absolute-Error Penalized Regression
There exist different methods to construct the estimator θb. The most widely used approach minimizes the mean-squared-error (MSE) corresponding to the given data to obtain the MSE estimator
θ̂_MSE ∈ argmin_{θ∈R^d}  (1/N) ∑_{i=1}^N ( y^(i) − θ⊤x^(i) )².    (1)
Alternatively, one can minimize the mean-absolute-error (MAE) to obtain the more robust MAE estimator
θ̂_MAE ∈ argmin_{θ∈R^d}  (1/N) ∑_{i=1}^N | y^(i) − θ⊤x^(i) |.    (2)
Question 1 (5 points). Figure 1 visualizes the outputs predicted by the MSE and MAE estimators on a dataset with 1 outlier. Explain the difference between the two estimators intuitively. Hint: Compare how (1) and (2) penalize errors.
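One way to see the difference numerically is the following sketch (illustrative data of our own, reproducing the one-outlier setting of Figure 1; the MAE estimator is computed via an LP reformulation of (2)):

import numpy as np
from scipy.optimize import linprog

def mae_fit(X, y):
    """MAE regression (2) as an LP: min (1/N) sum_i t_i with t_i >= |y_i - theta^T x_i|.
    Variable order: [theta (d), t (N)]."""
    N, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(N) / N])
    A_ub = np.block([[-X, -np.eye(N)],    # y_i - theta^T x_i <= t_i
                     [ X, -np.eye(N)]])   # theta^T x_i - y_i <= t_i
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * d + [(0, None)] * N
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:d]

rng = np.random.default_rng(1)
N = 30
X = np.column_stack([np.ones(N), rng.uniform(0, 1, N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.05, size=N)
y[0] += 10.0                                       # a single outlier, as in Figure 1
theta_mse = np.linalg.lstsq(X, y, rcond=None)[0]   # (1): least squares, pulled by the outlier
theta_mae = mae_fit(X, y)                          # (2): barely moved by the outlier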
In case of high-dimensional inputs (d > N), it makes sense to seek a sparse parameter vector θ. This means that many components of θ should be zero. The non-zero components of θ then correspond to the key features or key inputs that have a significant impact on the outputs. Sparsity can be induced by adding an ℓ1-penalty to the objective function of (1) or (2), which yields the least absolute shrinkage and selection operator (LASSO) method of regression analysis. In case of the MAE objective, the resulting LASSO estimator satisfies
θ̂ ∈ argmin_{θ∈R^d}  (1/N) ∑_{i=1}^N | y^(i) − θ⊤x^(i) | + λ‖θ‖₁,    (3)
where ‖θ‖₁ = ∑_{j=1}^d |θ_j| denotes the ℓ1-norm of θ, and λ ≥ 0 is a hyperparameter that represents the weight of the penalty. In practice, λ is typically chosen by cross-validation. That is, we randomly partition D into a training dataset Dtrain and a validation dataset Dval. We then solve (3) using only Dtrain for different values of λ and select the estimator that has minimal MAE on Dval.
Question 2 (15 points: 5+5+5).
1. Rewrite (3) as a linear program (not necessarily in standard form).
2. Implement this linear program in MATLAB or Python and solve it on the Diabetes Dataset for λ = 0.5. Use the code skeletons provided on Moodle. The diabetes dataset links a feature vector x comprising patient information (age, sex, body mass index) and blood measurements (blood pressure, T-Cells, lipoproteins, thyroid, lamotrigine and blood sugar) to an output y quantifying the progression of the diabetes disease. Compute the MAE of the resulting estimator on the separate test sets provided on Moodle. Compare the test performance against the training performance. Hint: Do not forget to append a constant bias term to the features.
3. Use cross-validation to tune the hyperparameter λ. Randomly select 75% of the data to construct Dtrain and use the rest of the data to construct Dval. Use 50 logarithmically spaced values in the interval [10^−5, 10^−1] as candidates for λ, and select the one performing best on the validation set in terms of MAE. Compare again the test performance against the training performance. (A sketch of such a tuning loop is given below.)
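A minimal sketch of the tuning loop, assuming the data are already loaded as numpy arrays X and y, and that lasso_mae_fit is a hypothetical helper solving (3) (it is not part of the provided skeletons; one way to build it is to extend the MAE LP sketched after Question 1 with variables b_j and constraints −b_j ≤ θ_j ≤ b_j):

import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(len(y))            # random 75/25 split of the data
n_train = int(0.75 * len(y))
train, val = idx[:n_train], idx[n_train:]

best_lam, best_mae = None, np.inf
for lam in np.logspace(-5, -1, 50):      # 50 log-spaced candidates in [1e-5, 1e-1]
    theta = lasso_mae_fit(X[train], y[train], lam)   # hypothetical solver for (3)
    mae = np.mean(np.abs(y[val] - X[val] @ theta))
    if mae < best_mae:
        best_lam, best_mae = lam, mae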
3 Convex Hulls
Standard linear programs involve only real-valued decision variables. However, many applications require binary decision variables that are restricted to {0,1}. Such variables emerge, for example, when deciding whether a project should be implemented or whether an item should be loaded onto a truck. A linear optimization problem with binary decision variables is given below.
s.t.  x_1 + x_2 ≤ 1    (4)
      x_1, x_2 ∈ {0,1}
Any binary linear program (such as problem (4)) is equivalent to a standard linear program, which is obtained by replacing the original feasible set with its convex hull, that is, the smallest convex set containing all feasible points.
Definition 3.1 (Convex hull). The convex hull of an arbitrary (possibly nonconvex) set X ⊂ R^n is defined as

conv(X) = { ∑_{i=1}^m λ_i x_i : m ∈ ℕ, x_i ∈ X, λ_i ≥ 0 ∀i ∈ [m], ∑_{i=1}^m λ_i = 1 }.
Question 3 (5 points). Denote by X the feasible set of problem (4). Describe the convex hull of X and find its vertices. Plot both X and conv(X).
Question 4 (5 points). Replace the feasible set of problem (4) with its convex hull and solve the resulting linear program using MATLAB or Python. Report the optimal decision variables. Explain why they must be binary. Hint: Characterize the BFS of the convex hull of the feasible set.
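As an illustration of the mechanics, one can hand the convex hull directly to an LP solver. Note that the objective of (4) is not reproduced in this document, so the cost vector below is a placeholder of our own choosing:

import numpy as np
from scipy.optimize import linprog

# The feasible set X of (4) is {(0,0), (1,0), (0,1)}; its convex hull is the
# triangle x_1 + x_2 <= 1, x_1, x_2 >= 0. The cost vector is a placeholder.
res = linprog(c=[-1.0, -2.0],                 # e.g. maximize x_1 + 2*x_2
              A_ub=[[1.0, 1.0]], b_ub=[1.0],
              bounds=[(0, None), (0, None)])
print(res.x)   # an optimal BFS, i.e. a vertex of the triangle, hence binary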
4 Zero-Sum Games
In the standard cross-validation scheme described in Section 2, training and validation sets were constructed randomly. In the remainder of the project we will explore an adversarial cross-validation procedure, where the training set is chosen by a fictitious adversary who aims to maximize the prediction error of our estimator. This procedure will give rise to a MinMax problem of the form
min_{x∈X} max_{z∈Z} f(x,z).    (5)
MinMax problems are zero-sum games, where one player’s profit is the other player’s loss. They are more intricate than standard minimization problems. We will see, however, that problem (5) can sometimes be simplified to a standard minimization problem if Z is convex and f(x,z) is concave in z for all x ∈ X.
As an example, suppose you own $10,000, which you want to hide in your apartment. There are I hiding spots with capacities of Ci dollars, i ∈ [I]. In case of a burglary, you want to lose as little money as possible. Assume that the police need T minutes to reach your apartment, and therefore any thief must leave within T minutes to avoid arrest. Assume further that the amount of money found in hiding spot i equals xi·zi·pi, where xi stands for the amount of money hidden, zi denotes the amount of time the thief spends searching spot i and pi represents a constant reflecting the difficulty of searching spot i. If the thief spends an amount of time exceeding 1/pi in location i, then all the money hidden in that spot will be found. It therefore makes sense to restrict zi ≤ 1/pi.
We seek a plan for hiding the money in the best possible way. Specifically, we hope to lose the least amount of money in the worst case, that is, if a very efficient thief shows up. This problem can be modeled as a zero-sum game against the thief (indeed, our loss is the thief's profit).
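To make the game concrete, here is a sketch of the thief's best response to a fixed hiding plan x (all numbers below are placeholders of our own, not the data provided on Moodle):

import numpy as np
from scipy.optimize import linprog

p = np.array([0.5, 1.0, 2.0])              # hypothetical difficulty constants p_i
C = np.array([6000.0, 5000.0, 4000.0])     # hypothetical capacities C_i
T = 2.0                                    # hypothetical time budget (minutes)
x = np.minimum(C, 10000.0 / 3)             # some fixed hiding plan (placeholder)

# Thief: max sum_i x_i*p_i*z_i  s.t.  sum_i z_i <= T,  0 <= z_i <= 1/p_i
res = linprog(c=-(x * p),                  # linprog minimizes, so negate
              A_ub=[np.ones(len(p))], b_ub=[T],
              bounds=[(0.0, 1.0 / pi) for pi in p])
print(-res.fun)                            # money lost against this fixed plan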
Question 5 (20 points: 3+3+3+8+3). This question will guide you through formulating and solving the problem of finding the best hiding strategy.
1. Characterize your decision variables and your feasible set.
2. Characterize the thief’s decision variables and feasible set.
3. Use the solutions of the first two questions to formulate a MinMax problem with objective function
f(x, z) = ∑_{i=1}^I x_i z_i p_i.
4. Dualize the inner maximization problem to reformulate the MinMax problem as a standard minimization problem, which can be addressed with linear programming solvers. Hint: The decision variables of the outer minimization problem represent constants for the inner maximization problem.
5. Implement the resulting linear program in Python or MATLAB and solve it for the provided input data p, C ∈ R^I and T ∈ R. Use the code skeletons available from Moodle. Report the worst-case amount of money you lose, the optimal plan for hiding the money and the thief's optimal search strategy. Hint: Think about the meaning of the problem's dual variables. (A rough sketch of one possible formulation follows this list.)
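For orientation, a minimal sketch of one possible reformulation (ours, not the official solution): under LP strong duality the inner best-response problem above is replaced by its dual, with a scalar α ≥ 0 for the time budget and β_i ≥ 0 for the bounds z_i ≤ 1/p_i, after which everything is a single minimization:

import numpy as np
from scipy.optimize import linprog

def hide_money(p, C, T, budget=10_000.0):
    """Sketch of the dualized game: variables [x (I), alpha (1), beta (I)];
    minimize T*alpha + sum_i beta_i/p_i  s.t.  alpha + beta_i >= p_i*x_i,
    sum_i x_i = budget, 0 <= x_i <= C_i, alpha >= 0, beta >= 0."""
    I = len(p)
    c = np.concatenate([np.zeros(I), [T], 1.0 / p])
    A_ub = np.hstack([np.diag(p), -np.ones((I, 1)), -np.eye(I)])  # p_i*x_i - alpha - beta_i <= 0
    b_ub = np.zeros(I)
    A_eq = np.hstack([np.ones((1, I)), np.zeros((1, I + 1))])     # all money is hidden
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[budget],
                  bounds=[(0.0, Ci) for Ci in C] + [(0.0, None)] * (I + 1))
    # res.x[:I] is the hiding plan and res.fun the worst-case loss; the duals of
    # the A_ub rows recover the thief's search times z_i (cf. the hint above).
    return res.x[:I], res.fun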
5 Adversarial Training
Standard cross validation partitions D randomly into a training set Dtrain and a validation set Dval of prescribed cardinalities. Hence, it could happen that the training set contains no outliers (lucky) or all outliers (unlucky). The resulting estimator θb thus depends on the choice of the training set. We now propose an alternative approach to construct θb in an adversarial (worst-case optimal) manner. Specifically, we will train the regression model on the worst possible training set Dtrain we could have possibly constructed from D.
We now construct the best estimator for the worst-case training set containing k = ⌊0.75·N⌋ samples, where ⌊y⌋ denotes the largest integer smaller than or equal to y. To that end, we formulate a MinMax problem, where the inner maximization problem optimizes over the binary variables z_i, i ∈ [N], corresponding to the N samples. The binary variable z_i is set to 1 if sample (y^(i), x^(i)) is included in the training set and to 0 otherwise. The outer optimization problem over θ aims to minimize the MAE with LASSO penalty in view of the worst-case training set. In summary, we thus find the following MinMax problem.
min_{θ∈R^d} max_z  (1/k) ∑_{i=1}^N z_i |y^(i) − θ⊤x^(i)| + λ‖θ‖₁
s.t.  ∑_{i=1}^N z_i = k    (6)
      z_i ∈ {0,1}    ∀i ∈ [N]
This MinMax problem can be interpreted as a zero-sum game between the statistician and an evil adversary who selects the worst possible training set.
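For intuition (an informal aside; the formal argument is the subject of Questions 6 and 7 below): for fixed θ, the inner maximization in (6) simply selects the k samples with the largest absolute residuals, that is,

max { (1/k) ∑_{i=1}^N z_i |y^(i) − θ⊤x^(i)|  :  ∑_{i=1}^N z_i = k, z_i ∈ {0,1} }  =  (1/k) · (sum of the k largest values |y^(i) − θ⊤x^(i)|).

Replacing the binary feasible set with its convex hull leaves this value unchanged, since a linear objective over a polytope attains its maximum at a vertex; the inner problem then becomes a standard LP that can be dualized as in Section 4.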
Question 6 (15 points). Derive the convex hull of the adversary's feasible set

Z = { z ∈ {0,1}^N : ∑_{i=1}^N z_i = k },    (7)

where k is an integer. Prove that your formulation is indeed the convex hull. Hint: Two sets A and B are equal if A ⊆ B and B ⊆ A.
Question 7 (25 points: 10+10+5).
1. Show that the MinMax problem (6) is equivalent to the following linear program. Hint: Use the insights from Sections 3 and 4.
min_{θ,α,β,b}  kα + ∑_{i=1}^N β_i + λ ∑_{j=1}^d b_j
s.t.  α + β_i ≥ (y^(i) − θ⊤x^(i))/k      ∀i ∈ [N]
      α + β_i ≥ −(y^(i) − θ⊤x^(i))/k     ∀i ∈ [N]    (8)
      β_i ≥ 0    ∀i ∈ [N]
      −b_j ≤ θ_j ≤ b_j    ∀j ∈ [d]
Give an interpretation for β. Explain how one can extract the validation set from a solution of this problem.
2. Implement the linear program (8) in MATLAB or Python using the skeleton code we provide on Moodle and solve it for 50 logarithmically spaced values of λ in the interval [10^−5, 10^−1]. For each of these values compute the MAE of the corresponding estimator on the validation set, and select the best robust estimator. To assess the benefits of robust cross-validation over standard cross-validation, compare the MAE of the robust estimator against the MAE of the estimator found in Question 2.3. In both cases the MAE should be computed on the test set. (A minimal sketch of one possible implementation follows this question.)
3. For the robust estimator found above, compare also the MAE on the test set against the MAE on the training set. Is there a significant difference from what you observed in Question 2.3? Interpret your observations.
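Purely as a starting point, here is a minimal sketch of (8) with scipy (not the official skeleton; variable order [θ (d), α (1), β (N), b (d)], assuming the training data are numpy arrays X of shape (N, d) and y of length N):

import numpy as np
from scipy.optimize import linprog

def robust_lasso_mae(X, y, lam, k):
    """Sketch of LP (8): min k*alpha + sum_i beta_i + lam*sum_j b_j."""
    N, d = X.shape
    c = np.concatenate([np.zeros(d), [k], np.ones(N), lam * np.ones(d)])
    ones = np.ones((N, 1))
    # alpha + beta_i >= +(y_i - theta^T x_i)/k  and  >= -(y_i - theta^T x_i)/k
    A1 = np.hstack([-X / k, -ones, -np.eye(N), np.zeros((N, d))])
    A2 = np.hstack([ X / k, -ones, -np.eye(N), np.zeros((N, d))])
    # -b_j <= theta_j <= b_j
    A3 = np.hstack([ np.eye(d), np.zeros((d, 1)), np.zeros((d, N)), -np.eye(d)])
    A4 = np.hstack([-np.eye(d), np.zeros((d, 1)), np.zeros((d, N)), -np.eye(d)])
    A_ub = np.vstack([A1, A2, A3, A4])
    b_ub = np.concatenate([-y / k, y / k, np.zeros(2 * d)])
    bounds = ([(None, None)] * d            # theta free
              + [(None, None)]              # alpha free (dual of sum z_i = k)
              + [(0, None)] * (N + d))      # beta, b nonnegative
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[:d]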
6 No More Toy-Examples/Optimization Prize
So far, you have applied different regression techniques to the "Diabetes" dataset. In this open question, we ask you to apply one of the regression techniques seen in Sections 2 and 5 to a regression problem of your choice.
Question 8 (10 points). Apply one of the regression techniques seen in Sections 2 and 5 to a dataset of your choice. Hint: Good sources of datasets include Kaggle and the UCI archive. In answering this question, put emphasis on interpreting your results and on justifying the chosen optimization scheme.
Question 9 (Voluntary bonus question, 10 points). If you like, prepare a short presentation of your answer to Question 8 (about 5 minutes) for the exercise session on May 20th. This will allow you to present your dataset and insights to your fellow students. We will give you bonus points for volunteering to do the presentation, but we will not grade its contents. Everyone present in the exercise session will be able to vote on the best presentation. The group attracting the most votes will receive an optimization prize.