COMP6211E: Optimization for Machine Learning (Final Exam)


1.     Consider the following finite sum optimization problem

 ,

where xi ∈ Rd and yi ∈ {±1}.

•    Derive the dual formulation, as in the SDCA procedure.

•    Write down the closed-form solution for SDCA (Option I of Algorithm 11.1).

•    Write down the dual free SDCA update rule for ∆αi in Algorithm 14.3.

•    Write down the SGD update rule.

•    Implement the three methods (SDCA, dual-free SDCA, SGD) in prob1() of prog-template.py, and plot the convergence curves (w.r.t. primal suboptimality) until SDCA converges (error < 10⁻¹⁰).
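The exam's loss function is not reproduced above, so the update rules can only be sketched on a stand-in objective. The sketch below assumes an L2-regularized logistic loss purely for illustration (the step sizes and data are also arbitrary choices, not the exam's), and shows the SGD update and the dual-free SDCA update Δαᵢ = −ηλn(∇φᵢ(w) + αᵢ) with w = (1/(λn))Σᵢαᵢ:

```python
import numpy as np

# Stand-in finite-sum objective (assumed, not from the exam):
#   P(w) = (1/n) sum_i log(1 + exp(-y_i <w, x_i>)) + (lam/2) ||w||^2
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)                  # labels in {+1, -1}

def phi_grad(w, i):
    # gradient of the logistic loss of example i with respect to w
    return -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))

def primal(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w

def sgd(T=5000, eta=0.5):
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        w -= (eta / np.sqrt(t)) * (phi_grad(w, i) + lam * w)   # SGD step
    return w

def dual_free_sdca(T=5000, eta=0.05):
    # pseudo-dual variables alpha_i, kept consistent with
    # w = (1/(lam*n)) * sum_i alpha_i
    alpha = np.zeros((n, d))
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)
        delta = -eta * lam * n * (phi_grad(w, i) + alpha[i])   # dual-free SDCA
        alpha[i] += delta
        w += delta / (lam * n)
    return w

w_sgd, w_sdca = sgd(), dual_free_sdca()
print(primal(np.zeros(d)), primal(w_sgd), primal(w_sdca))
```

For the actual submission, the same update rules would be plugged into prob1() with the exam's loss, and the primal suboptimality tracked per epoch.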

2.     Consider the minimax problem:

 .

•    Write down the optimal solution for x as a function of y, and rewrite the optimization problem in terms of y alone by eliminating x. Explain your derivations.

•    Write down the GDA update rule for this problem with learning rate η.

•    Implement GDA, extra gradient, and optimistic GDA in prob2() of prog-template.py, and plot the convergence curves (w.r.t. the gradient norm in x and y, i.e., the 2-norm of (∇ₓf, ∇ᵧf)) for 100 iterations.
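The exam's minimax objective did not survive extraction, so the three update rules can be illustrated on a stand-in bilinear saddle problem f(x, y) = xy, whose unique saddle point is (0, 0). On bilinear problems, plain GDA spirals outward while extra gradient and optimistic GDA contract, which is exactly the behavior the convergence curves should reveal:

```python
import numpy as np

def grads(x, y):
    # (df/dx, df/dy) for the stand-in f(x, y) = x * y
    return y, x

def gda(x, y, eta, T):
    for _ in range(T):
        gx, gy = grads(x, y)
        x, y = x - eta * gx, y + eta * gy    # descent in x, ascent in y
    return x, y

def extragradient(x, y, eta, T):
    for _ in range(T):
        gx, gy = grads(x, y)
        xh, yh = x - eta * gx, y + eta * gy  # half step (prediction)
        gx, gy = grads(xh, yh)
        x, y = x - eta * gx, y + eta * gy    # full step with predicted gradients
    return x, y

def ogda(x, y, eta, T):
    pgx, pgy = grads(x, y)                   # previous gradients
    for _ in range(T):
        gx, gy = grads(x, y)
        x, y = x - eta * (2 * gx - pgx), y + eta * (2 * gy - pgy)  # optimistic step
        pgx, pgy = gx, gy
    return x, y

for name, method in [("GDA", gda), ("EG", extragradient), ("OGDA", ogda)]:
    x, y = method(1.0, 1.0, 0.1, 100)
    print(name, np.hypot(x, y))   # 2-norm of (x, y) after 100 iterations
```

With η = 0.1 and 100 iterations, EG and OGDA end closer to the saddle than where they started, while GDA ends farther away.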

3.     Consider zero-th order optimization.

•    In Theorem 18.6, if we further assume that f(x) is λ-strongly convex and take η_t = (λt)⁻¹, derive the corresponding convergence result.

•    (2 points) Consider the optimization problem

min_θ E_{x∼π(x|θ)} f(x),

where x ∈ Rd. Assume we want to solve this problem using policy gradient, with θ = (µ,ρ), where µ is a d-dimensional vector and ρ ∈ R; both are part of the model parameters. Consider the distribution π(x|θ) = N(µ, e⁻ρ I). Derive the policy gradient update rule for θ, including both µ and ρ.
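For this Gaussian policy, log π(x|θ) = const + dρ/2 − (e^ρ/2)‖x − µ‖², so the score function is ∇_µ log π = e^ρ(x − µ) and ∂_ρ log π = d/2 − (e^ρ/2)‖x − µ‖². A minimal REINFORCE sketch of the resulting update, with a hypothetical objective f(x) = ‖x‖² standing in for the exam's f (its minimizer is x = 0, so µ should shrink toward 0 and ρ should grow as the sampling variance e⁻ρ collapses):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

def f(x):
    # hypothetical objective, assumed for illustration only
    return float(np.sum(x ** 2))

def policy_gradient(T=5000, eta=0.01):
    mu, rho = np.ones(d), 0.0
    for _ in range(T):
        x = mu + np.exp(-rho / 2.0) * rng.normal(size=d)   # x ~ N(mu, e^{-rho} I)
        g_mu = np.exp(rho) * (x - mu)                      # grad_mu log pi
        g_rho = d / 2.0 - 0.5 * np.exp(rho) * np.sum((x - mu) ** 2)  # d/drho log pi
        mu = mu - eta * f(x) * g_mu       # REINFORCE step on mu
        rho = rho - eta * f(x) * g_rho    # REINFORCE step on rho
    return mu, rho

mu, rho = policy_gradient()
print(mu, rho)
```

The updates follow directly from ∇_θ E[f] = E[f(x) ∇_θ log π(x|θ)]; both score components have zero mean under π, which is a quick sanity check on the derivation.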


•    Consider the zero-th order optimization problem over the discrete set x ∈ {0,1}d. Implement the policy gradients of Example 18.10 and Example 18.11 to solve the objective function

 

in prob3() of prog-template.py, plot the convergence curves (w.r.t. f(x)) and report your x∗ and θ∗ (as in the Examples, p(xi = 1) = θi).
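For the product-Bernoulli parameterization p(xi = 1) = θi, the score is ∂/∂θi log p(x) = (xi − θi)/(θi(1 − θi)). The exam's objective did not survive extraction, so the sketch below assumes a hypothetical f counting mismatches against a fixed target bit-string; the running-average baseline and the clipping of θ to [ε, 1 − ε] are extra stabilizing choices, not requirements from the exam:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
target = rng.integers(0, 2, size=d).astype(float)   # hypothetical optimum x*

def f(x):
    return float(np.sum(x != target))               # 0 iff x equals target

def reinforce(T=4000, eta=0.02, eps=0.05):
    theta = np.full(d, 0.5)
    b = 0.0                                         # running-average baseline
    for t in range(1, T + 1):
        x = (rng.random(d) < theta).astype(float)   # sample x_i ~ Bernoulli(theta_i)
        fx = f(x)
        score = (x - theta) / (theta * (1.0 - theta))
        theta -= eta * (fx - b) * score             # policy-gradient step
        theta = np.clip(theta, eps, 1.0 - eps)      # keep probabilities valid
        b += (fx - b) / t                           # update baseline
    return theta

theta = reinforce()
x_star = (theta > 0.5).astype(float)                # report rounded theta as x*
print(theta.round(2), x_star)
```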

4.     Consider the setting of decentralized computing, where we are given m nodes. A vector x = [x1,...,xm] has m components, and each node contains a local component xi of the vector, with local objective function fi(xi) + gi(xi). At any time step, in addition to local algebraic operations, we can perform the following function calls simultaneously on all nodes:

•   (gradient computation) call grad(x): each node computes the local gradient ∇fi(xi).

•   (proximal mapping) call prox(η,z): each node computes the local proximal mapping

prox(η, zi) = argmin_{ui} [ gi(ui) + (1/(2η)) ‖ui − zi‖² ].

•   (communication) call communicate(z) = [(zi−1+zi+1)/2]i=1,...,m: each node sends its local vector zi over the network, then it receives vectors from the neighboring nodes i − 1 and i + 1 via the network, and computes the average (zi−1 + zi+1)/2 (where z0 = zm and zm+1 = z1).
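The communicate primitive and the induced operator B can be sketched directly with a wrap-around shift (indexing as in the statement, with z0 = zm and zm+1 = z1); note that the entries of Bz always sum to zero, since each zi appears once with weight 1 and twice with weight −1/2:

```python
import numpy as np

def communicate(z):
    # z holds one row per node; np.roll implements the ring wrap-around
    return 0.5 * (np.roll(z, 1, axis=0) + np.roll(z, -1, axis=0))

def B(z):
    # [Bz]_i = z_i - (z_{i-1} + z_{i+1}) / 2, computed via one communicate call
    return z - communicate(z)

z = np.arange(6, dtype=float).reshape(6, 1)   # m = 6 nodes with scalar state
print(communicate(z).ravel())                 # neighbour averages
print(B(z).ravel())                           # consensus residuals (sum to 0)
```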

If we have a variable w = [w1,...,wm], with wi stored on node i, then node j ≠ i cannot access the information wi on node i directly, except through calling communicate(). We want to use the above function calls to jointly optimize the following objective function:

min_{x,z} Σ_{i=1}^m [fi(xi) + gi(zi)]   subject to   x1 = x2 = ··· = xm = z1 = ··· = zm,

by rewriting the above problem as

 ,

with [Bz]i = zi − (zi−1 + zi+1)/2, f(x) = Σ_i fi(xi), g(z) = Σ_i gi(zi).

•    Write down an algorithm for decentralized optimization using linearized ADMM. Try to combine redundant communications so that no more than two communication calls are needed for each gradient computation.

•    Implement it and plot convergence curves (w.r.t. primal suboptimality) with different parameters (η and ρ in ADMM) to solve the objective function in prob4() of prog-template.py.
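Since the reformulated objective did not survive extraction, the following is only a sketch of decentralized linearized ADMM built from the grad / prox / communicate primitives, under the assumption of a consensus problem min_x f(x) + g(x) s.t. Bx = 0 with quadratic fi(u) = (1/2)(u − aᵢ)² and gi(u) = γ|u| as stand-ins; the step sizes η and ρ are illustrative. Each iteration uses one grad, one prox, and two communicate calls, because the Bx computed for the dual step is reused in the next primal step:

```python
import numpy as np

m, gamma = 8, 0.05
rng = np.random.default_rng(0)
a = rng.normal(size=m)            # node i holds a_i

def communicate(z):
    return 0.5 * (np.roll(z, 1) + np.roll(z, -1))

def B(z):                         # [Bz]_i = z_i - (z_{i-1} + z_{i+1}) / 2
    return z - communicate(z)

def grad(x):                      # local gradients of f_i(u) = (1/2)(u - a_i)^2
    return x - a

def prox(eta, z):                 # local prox of g_i = gamma*|.|: soft-thresholding
    return np.sign(z) * np.maximum(np.abs(z) - eta * gamma, 0.0)

def linearized_admm(T=4000, eta=0.1, rho=0.5):
    x, lam = np.zeros(m), np.zeros(m)
    Bx = B(x)                                   # communication call
    for _ in range(T):
        # primal: one proximal-gradient step on the augmented Lagrangian
        # (B is symmetric, so B^T lam is another communicate call)
        x = prox(eta, x - eta * (grad(x) + B(lam + rho * Bx)))
        Bx = B(x)                               # communication, shared with dual step
        lam = lam + rho * Bx                    # dual ascent on the multiplier
    return x

x = linearized_admm()
# consensus optimum of the stand-in problem: soft-threshold of mean(a)
u_star = np.sign(a.mean()) * max(abs(a.mean()) - gamma, 0.0)
print(x, u_star)
```

At convergence x is (numerically) consensual, Bx ≈ 0, and the common value matches the soft-thresholded mean, which is the closed-form optimum of the stand-in problem.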

