Optimization for Machine Learning (Final Exam)
1. Consider the following finite sum optimization problem
[equation omitted]
where xi ∈ Rd and yi ∈ {±1}.
• Derive the dual formulation, as in the SDCA procedure.
• Write down the closed-form solution formula for the SDCA update (Option I of Algorithm 11.1).
• Write down the dual-free SDCA update rule for ∆αi in Algorithm 14.3.
• Write down the SGD update rule.
• Implement the three methods (SDCA, dual-free SDCA, SGD) in prob1() of prog-template.py, and plot the convergence curves (wrt primal suboptimality) until SDCA converges (error < 10^-10).
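Since the exam's objective is omitted above, the updates can only be sketched against an assumed instance. The sketch below uses ℓ2-regularized logistic regression, P(w) = (1/n) Σi φi(w) + (λ/2)‖w‖², with φi(w) = log(1 + exp(−yi xiᵀw)), and implements the SGD rule w ← w − η(∇φi(w) + λw) and the dual-free SDCA rule ∆αi = −ηλn(∇φi(w) + αi), w ← w + ∆αi/(λn); the loss, data, and step sizes are illustrative assumptions, not the exam's actual problem.

```python
import numpy as np

def grad_phi(w, xi, yi):
    # gradient of the assumed logistic loss phi_i(w) = log(1 + exp(-y_i x_i^T w))
    return -yi * xi / (1.0 + np.exp(yi * (xi @ w)))

def primal(w, X, y, lam):
    # P(w) = (1/n) sum_i phi_i(w) + (lam/2) ||w||^2
    return np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * (w @ w)

def sgd(X, y, lam, eta=0.2, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            w -= eta * (grad_phi(w, X[i], y[i]) + lam * w)
    return w

def dual_free_sdca(X, y, lam, eta=0.1, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros((n, d))
    w = alpha.sum(axis=0) / (lam * n)   # invariant: w = (1/(lam*n)) sum_i alpha_i
    for _ in range(epochs):
        for i in rng.permutation(n):
            delta = -eta * lam * n * (grad_phi(w, X[i], y[i]) + alpha[i])
            alpha[i] += delta
            w += delta / (lam * n)
    return w

# synthetic separable-ish data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = np.sign(X @ rng.standard_normal(4))
lam = 0.1
p0 = primal(np.zeros(4), X, y, lam)
w_sgd = sgd(X, y, lam)
w_dfs = dual_free_sdca(X, y, lam)
print(primal(w_sgd, X, y, lam) < p0, primal(w_dfs, X, y, lam) < p0)
```

Both runs should reduce the primal objective below its value at w = 0; for the exam, the same two update rules would be rewritten for the actual loss in the problem statement.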
2. Consider the minimax problem:
[equation omitted].
• Write down the optimal solution x∗(y) as a function of y. Write the optimization problem in terms of y alone by eliminating x. Explain the derivation.
• Write down the GDA update rule for this problem with learning rate η.
• Implement GDA, extra-gradient, and optimistic GDA in prob2() of prog-template.py, and plot the convergence curves (wrt the 2-norm of the joint gradient in (x, y)) for 100 iterations.
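Because the minimax objective is not reproduced above, the contrast between the methods can be illustrated on the assumed toy problem f(x, y) = x·y, whose saddle point is (0, 0): GDA spirals outward on it, while the extra-gradient method converges. This is a stand-in sketch, not the exam's problem.

```python
import numpy as np

# gradients of f(x, y) = x*y: df/dx = y, df/dy = x

def gda_step(x, y, eta):
    # simultaneous gradient descent in x, ascent in y
    return x - eta * y, y + eta * x

def eg_step(x, y, eta):
    # extra gradient: probe a half step ahead, update with the probe's gradient
    xh, yh = x - eta * y, y + eta * x
    return x - eta * yh, y + eta * xh

xe, ye = 1.0, 1.0
for _ in range(200):
    xe, ye = eg_step(xe, ye, 0.3)
print(np.hypot(xe, ye) < 1e-3)   # extra-gradient converges near (0, 0)

xg, yg = 1.0, 1.0
for _ in range(200):
    xg, yg = gda_step(xg, yg, 0.3)
print(np.hypot(xg, yg) > 1.0)    # GDA diverges on this bilinear problem
```

Optimistic GDA (not shown) modifies the GDA step with the previous iterate's gradient and also stabilizes bilinear problems.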
3. Consider zero-th order optimization.
• In Theorem 18.6, if we further assume that f(x) is λ-strongly convex and take ηt = 1/(λt), derive the corresponding convergence result.
• (2 points) Consider the optimization problem
min_θ E_{x∼π(x|θ)} f(x),
where x ∈ Rd. Assume we want to solve this problem using policy gradient, with θ = (µ,ρ), where µ is a d-dimensional vector and ρ ∈ R; both are part of the model parameters. Consider the distribution π(x|θ) = N(µ, e−ρI). Derive the policy gradient update rule for θ, including both µ and ρ.
• Consider the zero-th order optimization problem over the discrete set x ∈ {0,1}d. Implement the policy gradients of Example 18.10 and Example 18.11 to solve the objective function [equation omitted] in prob3() of prog-template.py, plot the convergence curves (wrt f(x)), and report your x∗ and θ∗ (as in the Examples, p(xi = 1) = θi).
4. Consider the setting of decentralized computing, where we are given m nodes. A vector x = [x1,...,xm] has m components, and each node contains a local component xi of the vector, with local objective function fi(xi) + gi(xi). At any time step, in addition to local algebraic operations, we can perform the following function calls simultaneously on all nodes:
• (gradient computation) call grad(x): each node computes the local gradient ∇fi(x).
• (proximal mapping) call prox(η,z): each node computes the local proximal mapping
argmin_{ui} { gi(ui) + (1/(2η)) ‖ui − zi‖² }.
• (communication) call communicate(z) = [(zi−1+zi+1)/2]i=1,...,m: each node sends its local vector zi over the network, then it receives vectors from the neighboring nodes i − 1 and i + 1 via the network, and computes the average (zi−1 + zi+1)/2 (where z0 = zm and zm+1 = z1).
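The three primitives can be simulated centrally; the sketch below does so for scalar local states on a ring of m nodes (names and shapes are illustrative assumptions). It also checks that repeated neighbor averaging drives the nodes to consensus on an odd-length ring.

```python
import numpy as np

def communicate(z):
    # node i receives z_{i-1} and z_{i+1} cyclically and averages them;
    # np.roll implements the wrap-around z_0 = z_m, z_{m+1} = z_1
    return (np.roll(z, 1) + np.roll(z, -1)) / 2.0

def B(z):
    # the operator from the problem statement: [Bz]_i = z_i - (z_{i-1}+z_{i+1})/2
    return z - communicate(z)

z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # m = 5 (odd, so averaging mixes fully)
for _ in range(200):
    z = communicate(z)
print(np.allclose(z, 2.0, atol=1e-3))     # consensus at the mean of the initial values
```

Note that Bz = 0 exactly characterizes consensus vectors on this ring, which is why B appears in the rewritten problem below.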
If we have a variable w = [w1,...,wm], with wi stored on node i, then node j ≠ i cannot access the information wi on node i directly, except through calling communicate(). We want to use the above function calls to jointly optimize the following objective function:
min_{x,z} Σi [fi(xi) + gi(zi)] subject to x1 = x2 = ··· = xm = z1 = ··· = zm,
by rewriting the above problem as
min_{x,z} f(x) + g(z) subject to x = z and Bz = 0,
with [Bz]i = zi − (zi−1 + zi+1)/2, f(x) = Σi fi(xi), g(z) = Σi gi(zi).
• Write down an algorithm for decentralized optimization using linearized ADMM. Try to combine redundant communications so that no more than two communication calls are needed for each gradient computation.
• Implement it and plot the convergence curves (wrt primal suboptimality) with different parameters (η and ρ in ADMM) to solve the objective function in prob4() of prog-template.py.
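As a reference point for the first bullet, one common linearized-ADMM template for a problem of the form min_{x,z} f(x) + g(z) subject to Ax + Bz = c is sketched below, with step size η, penalty ρ, and multiplier λ. The choice of A, B, c here is an assumption; mapping this template onto the constraint structure above so that the z-step reduces to a prox(η, ·) call and the matrix-vector products reduce to communicate() calls is the exercise.

```latex
x^{k+1} = x^k - \eta\left(\nabla f(x^k) + A^\top \lambda^k
          + \rho\, A^\top\!\left(A x^k + B z^k - c\right)\right), \\
z^{k+1} = \mathrm{prox}_{\eta g}\!\left(z^k - \eta\, B^\top\!\left(\lambda^k
          + \rho\,(A x^{k+1} + B z^k - c)\right)\right), \\
\lambda^{k+1} = \lambda^k + \rho\,\left(A x^{k+1} + B z^{k+1} - c\right).
```

Here both the smooth term f and the augmented quadratic are linearized, so each iteration needs only one gradient call, one prox call, and the communications hidden in the products with B and Bᵀ, which is what the "no more than two communication calls" requirement targets.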