
Homework 2

CSE 541: Interactive Learning

Confidence bounds
1.1 (Wald’s identity) Let X1, X2, ... be a sequence of iid random variables. For j ∈ {0,1}, under Hj we have that Xi ∼ pj. Let Pj(·), Ej(·) denote probability and expectation under Hj. Assume that the supports of p0 and p1 are equal and, furthermore, that C := sup_{x∈support(p0)} p1(x)/p0(x) < ∞. Define the likelihood ratio Lt = ∏_{i=1}^t p1(Xi)/p0(Xi). Fix some δ ∈ (0,1) and τ = min{t : Lt > 1/δ}; we showed in class that the false alarm probability satisfies P0(Lτ > 1/δ) ≤ δ. Assume that E1[τ] < ∞. Show that

log(1/δ)/KL(p1, p0) ≤ E1[τ] ≤ (log(1/δ) + log(C))/KL(p1, p0),

where KL(p1, p0) := E1[log(p1(X1)/p0(X1))] is the Kullback–Leibler divergence between p1 and p0.
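A sketch of the Wald's-identity step, assuming the statement takes the two-sided form reconstructed above:

\[
\mathbb{E}_1[\log L_\tau] \;=\; \mathbb{E}_1\!\Big[\sum_{i=1}^{\tau}\log\tfrac{p_1(X_i)}{p_0(X_i)}\Big] \;=\; \mathbb{E}_1[\tau]\,\mathrm{KL}(p_1,p_0),
\]
and since \(L_\tau > 1/\delta\) by definition of \(\tau\) while \(L_\tau \le C/\delta\) (the ratio overshoots the threshold by at most one factor of \(C\)),
\[
\log(1/\delta) \;\le\; \mathbb{E}_1[\tau]\,\mathrm{KL}(p_1,p_0) \;\le\; \log(C/\delta).
\]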

1.2 (Method of Mixtures) Let X1,X2,... be a sequence of iid random variables where X1 ∼ N(µ,1). Under H0 we have µ = 0 and under H1 assume µ = θ.

•   Define Lt(θ) := exp(Stθ − tθ²/2) where St := ∑_{i=1}^t Xi. Show that Lt(θ) is a martingale under H0.

•   Now suppose the sequence X1, X2, ... is still iid and E[X1] = µ in the above notation, but the distribution of X1 is unknown other than the knowledge that E[exp(λ(X1 − µ))] ≤ e^{λ²/2} for all λ ∈ R. Show that Lt(θ) defined above is a super-martingale under H0.

•   Assume the setting of the previous step. Define L̄t = ∫θ Lt(θ)h(θ)dθ where h(θ) = (2πσ²)^{−1/2} exp(−θ²/(2σ²)) is the N(0, σ²) density and σ > 0. Show that L̄t is a super-martingale under H0.

•   Fix δ ∈ (0,1). Show that P0(∃t ∈ N : L¯t > 1/δ) ≤ δ.

•   Conclude that if Z1, Z2, ... are iid random variables with E[exp(λ(Z1 − E[Z1]))] ≤ exp(λ²/2), then for any σ > 0 and δ ∈ (0,1)

P(∃t ∈ N : ∑_{i=1}^t (Zi − E[Z1]) ≥ √((t + 1/σ²) log((1 + tσ²)/δ²))) ≤ δ.                                        (1)
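For reference, carrying out the Gaussian integral with h the N(0, σ²) density gives a closed form for L̄t, which is where the boundary in (1) comes from (here St = ∑_{i=1}^t (Zi − E[Z1]) in the notation of the last bullet):

\[
\bar L_t \;=\; \int_{-\infty}^{\infty} e^{S_t\theta - t\theta^2/2}\,\frac{e^{-\theta^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}\,d\theta
\;=\; \frac{1}{\sqrt{1+t\sigma^2}}\,\exp\!\Big(\frac{\sigma^2 S_t^2}{2(1+t\sigma^2)}\Big),
\]
so that \(\bar L_t > 1/\delta\) is equivalent to \(S_t^2 > \big(t + \tfrac{1}{\sigma^2}\big)\log\frac{1+t\sigma^2}{\delta^2}\).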

1.3 (Best arm identification) Consider an n-armed multi-armed bandit problem where the jth pull of the ith arm yields a random variable Xi,j ∼ N(θi,1). The objective of the player is to strategically pull arms until they can predict the index of the arm with the highest mean, at which time they stop and output this estimated arm. Suppose an oracle told the player that θ, the vector of true means they are playing against, is equal to ∆ej for some j = 1,...,n, where ej is the vector of all zeros except a 1 in the jth location. Note that while ∆ > 0 is known to the player, the true index j is unknown.

•   Due to the symmetry of the known problem setup, it is conceivable that the following algorithm is optimal: play every arm the same number of times (say, τ times), and then declare that the arm with the highest empirical mean is best. Provide a sufficient condition on τ under which this procedure correctly identifies the true index j with probability at least 1 − δ (don’t forget the union bound!). A small simulation sketch of this baseline appears at the end of this problem.

•   Argue that if each arm is pulled the same number of times, then this value of τ is necessary up to a constant factor (ignoring the dependence on δ). Hint¹.

•   The previous two parts suggest that any algorithm that pulls every arm the same number of times and identifies the best arm requires essentially n∆⁻² log(n/δ) total pulls. We will beat this with an adaptive procedure that requires just O(n∆⁻² log(1/δ)) pulls.

Algorithm: Initialize S1 = [n]. At each round t ≥ 1, while |St| > 1, pull every arm in St once and then eliminate arms from St to form St+1 ⊆ St.

 

We showed in class that if Zs ∼ N(0,1) then for any α > 0 and ρ ∈ (0,1) we have

P(∃t ∈ N : ∑_{s=1}^t Zs ≥ αt + log(1/ρ)/(2α)) ≤ ρ.                                        (2)

Conclude that if j is the index of the best-arm (so that Xj,s ∼ N(∆,1)) then with probability at least 1 − δ arm j remains in St for all t ≥ 1.

•   For i ≠ j (so that Xi,s ∼ N(0,1)) define the random variables ρi via the crossing bound (2) applied to the stream Xi,1, Xi,2, .... If Ti = max{t : i ∈ St}, show that Ti ≤ 4∆⁻² log(1/δ) + 8∆⁻² log(1/ρi).

•   Note that by (2) we have P(ρi ≤ ρ) ≤ ρ for any ρ ∈ (0,1). Use this fact to show that E[∑_{i=1}^n Ti] ≤ cn∆⁻² log(1/δ) for some absolute constant c > 0. While not necessary for this problem, it is also possible to show that this bound holds with probability at least 1 − δ (see sub-Gamma random variables).

In general, when the means are unknown, curved boundaries with a UCB-like algorithm [1] can be used, with a similar analysis, to identify the best arm using just O(∑_{i:∆i>0} ∆i⁻² log(log(∆i⁻²)/δ)) total pulls, where ∆i = maxj θj − θi.
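A minimal simulation sketch of the non-adaptive baseline from the first bullet (pull every arm τ times, return the empirical argmax). The function name and the choice τ = ⌈4∆⁻² log(n/δ)⌉ are illustrative assumptions, not values prescribed by the assignment:

import numpy as np

def uniform_best_arm(n, Delta, delta, rng):
    # True means are Delta * e_j for a uniformly random hidden index j.
    j = rng.integers(n)
    theta = np.zeros(n)
    theta[j] = Delta
    # Illustrative sufficient pull count from a union bound over the n-1 suboptimal arms.
    tau = int(np.ceil(4 * Delta**-2 * np.log(n / delta)))
    # Pull every arm tau times and compare empirical means.
    means = (theta[:, None] + rng.standard_normal((n, tau))).mean(axis=1)
    return int(np.argmax(means)) == j

rng = np.random.default_rng(0)
trials = [uniform_best_arm(n=20, Delta=0.5, delta=0.05, rng=rng) for _ in range(200)]
print("empirical success rate:", np.mean(trials))  # should be comfortably above 1 - delta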

Linear regression and experimental design

2.1 Exercise 20.2 of [SzepesvariLattimore]

Non-parametric bandits
4.1 Let FLip be a set of functions defined over [0,1] such that for each f ∈ FLip we have f : [0,1] → [0,1] and for every x, y ∈ [0,1] we have |f(y) − f(x)| ≤ L|y − x| for some known L > 0. At each round t the player chooses an xt ∈ [0,1] and observes a random variable yt ∈ [0,1] such that E[yt] = f∗(xt), where f∗ ∈ FLip. Define the regret of an algorithm after T steps as RT = E[∑_{t=1}^T (f∗(x⋆) − f∗(xt))], where x⋆ = argmax_{x∈[0,1]} f∗(x).

•   Propose an algorithm, that perhaps uses knowledge of the time horizon T, that achieves RT ≤ O(T^{2/3}) regret (okay to ignore constants and log factors); one standard approach is sketched after this list.

•   Argue that this is minimax optimal (i.e., not improvable in general), through the use of an explicit example, with math, but no formal proof necessary.
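One standard route to the T^{2/3} rate (a sketch, not necessarily the intended solution): discretize [0,1] into K ≈ T^{1/3} evenly spaced points and run a finite-armed method (here UCB) on the grid. Below, pull(x) is a hypothetical oracle returning a noisy reward in [0,1] with mean f∗(x):

import numpy as np

def lipschitz_bandit(pull, T, rng=None):
    # Fixed-discretization strategy: K ~ T**(1/3) grid points, then UCB on the grid.
    rng = rng if rng is not None else np.random.default_rng()
    K = max(1, int(np.ceil(T ** (1 / 3))))
    grid = (np.arange(K) + 0.5) / K          # grid midpoints in [0,1]
    counts = np.zeros(K)
    sums = np.zeros(K)
    choices = np.zeros(T)
    for t in range(T):
        if t < K:                            # pull each grid point once to initialize
            k = t
        else:                                # UCB index on the grid
            ucb = sums / counts + np.sqrt(2 * np.log(T) / counts)
            k = int(np.argmax(ucb))
        y = pull(grid[k])
        counts[k] += 1
        sums[k] += y
        choices[t] = grid[k]
    return choices

The Lipschitz condition bounds the per-step discretization error by L/(2K), contributing O(LT/K) to the regret, while UCB on K arms contributes roughly √(KT); balancing the two with K ≈ T^{1/3} gives the O(T^{2/3}) total.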

Experiments
5.1 Suppose we have random variables Z1, Z2, ... that are iid with E[exp(λ(Z1 − E[Z1]))] ≤ exp(λ²/2). Then for any fixed t ∈ N and any δ ∈ (0,1) we have the standard tail bound

P(∑_{i=1}^t (Zi − E[Z1]) ≥ √(2t log(1/δ))) ≤ δ.                                        (3)

The bound in (1) holds for all t ∈ N simultaneously (i.e., not just for a fixed t) at the cost of a slightly inflated bound. Let δ = 0.05. Plot the ratio of the confidence bound of (1) to that of (3) as a function of t for values of σ² ∈ {10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰}. What do you notice about where the ratio is smallest with respect to σ² (if the smallest value is at the edge, your t is not big enough; it should be in the many millions)? Suppose you are observing a stream of iid Gaussian random variables Z1, Z2, ... and you are trying to determine whether their mean is positive or negative. If someone tells you they think the absolute value of the mean is about 0.01, how would you choose σ² and use the above confidence bound to perform this test using as few total observations as possible?
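A minimal plotting sketch, assuming bounds (1) and (3) take the forms reconstructed above; if your versions of the bounds differ, substitute those expressions in bound_anytime and bound_fixed:

import numpy as np
import matplotlib.pyplot as plt

delta = 0.05
t = np.logspace(0, 9, 2000)   # t needs to reach well into the many millions

def bound_anytime(t, sigma2, delta):
    # Right-hand side of the anytime mixture bound (1), as reconstructed above.
    return np.sqrt((t + 1 / sigma2) * np.log((1 + t * sigma2) / delta**2))

def bound_fixed(t, delta):
    # Right-hand side of the fixed-t bound (3).
    return np.sqrt(2 * t * np.log(1 / delta))

for sigma2 in [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0]:
    plt.plot(t, bound_anytime(t, sigma2, delta) / bound_fixed(t, delta), label=f"sigma^2 = {sigma2:g}")
plt.xscale("log")
plt.xlabel("t")
plt.ylabel("ratio of bound (1) to bound (3)")
plt.title("Anytime vs. fixed-t confidence bound, delta = 0.05")
plt.legend()
plt.show()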

5.2 G-optimal design. This problem addresses finding a G-optimal design. Fix (x1,...,xn) ⊂ R^d. Let △n := {λ ∈ R^n : λi ≥ 0, ∑_{i=1}^n λi = 1} and A(λ) := ∑_{i=1}^n λi xi xi^⊤. For some λ ∈ △n let f(λ) = max_{i=1,...,n} ‖xi‖²_{A(λ)⁻¹} and g(λ) = −log(|A(λ)|) (these are the G- and D-optimal design objectives, respectively). We wish to find a discrete allocation of size N for G-optimal design. Below are strategies to output an allocation (I1,...,IN) ∈ [n]^N. Choose one strategy and one h ∈ {f, g} to obtain samples from a G-optimal design.

1.    Greedy: For i = 1,...,2d select Ii uniformly at random with replacement from [n]. For t = 2d + 1,...,N select It = argmin_{j∈[n]} h((1/t)(∑_{s=1}^{t−1} e_{Is} + e_j)). Return (I1,...,IN).

2.    Frank-Wolfe: For i = 1,...,2d select Ii uniformly at random with replacement from [n] and let λ^(2d) = (1/(2d)) ∑_{i=1}^{2d} e_{Ii}. For k = 2d + 1,...,N select Ik = argmin_{j∈[n]} ∂h(λ^(k−1))/∂λj and set λ^(k) = (1 − ηk)λ^(k−1) + ηk e_{Ik}, where ηk = 2/(k + 1). Return (I1,...,IN). You will need to derive the partial gradients ∂h(λ)/∂λj. Hint². A sketch of this strategy appears at the end of this problem.

Let λ̂ = (1/N) ∑_{t=1}^N e_{It}. We are going to evaluate f(λ̂) in a variety of settings. For a ≥ 0 and n ∈ N let xi ∼ N(0, diag(σ²)) with σj² = j^{−a} for j = 1,...,d, with d = 10. On a single plot with n on the x-axis and f(λ̂) on the y-axis, plot your chosen method for a ∈ {0, 0.5, 1, 2} as separate lines, with N = 1000.
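A minimal sketch of strategy 2 with h = g (one possible implementation, not the required one), using the gradient ∂g(λ)/∂λj = −‖xj‖²_{A(λ)⁻¹}, which follows from the hint in footnote 2; the helper name frank_wolfe_design is illustrative:

import numpy as np

def frank_wolfe_design(X, N, rng=None):
    # Strategy 2 with h = g(lambda) = -log det A(lambda).
    rng = rng if rng is not None else np.random.default_rng()
    n, d = X.shape
    idx = list(rng.integers(n, size=2 * d))           # I_1, ..., I_{2d} uniform with replacement
    lam = np.bincount(idx, minlength=n) / len(idx)    # lambda^(2d)
    for k in range(2 * d + 1, N + 1):
        A = X.T @ (lam[:, None] * X)                  # A(lambda) = sum_i lambda_i x_i x_i^T
        A = A + 1e-9 * np.eye(d)                      # tiny ridge in case the random initialization is degenerate
        grad = -np.sum((X @ np.linalg.inv(A)) * X, axis=1)   # dg/dlambda_j = -x_j^T A(lambda)^{-1} x_j
        Ik = int(np.argmin(grad))
        eta = 2.0 / (k + 1)
        lam = (1 - eta) * lam
        lam[Ik] += eta                                # lambda^(k) = (1 - eta) lambda^(k-1) + eta e_{I_k}
        idx.append(Ik)
    return np.array(idx), lam

The empirical allocation λ̂ of the returned indices (or the final weights) can be plugged into f to produce the evaluation plot, and the same routine can back the G_optimal helper used in problems 5.3 and 5.4.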

5.3 Experimental design for function estimation.

Let f : [0,1]^d → R be some unknown function and fix some locations of interest X = {x1,...,xn} ⊂ [0,1]^d. We wish to output an estimate f̂ of f that is uniformly good over X in the sense that max_{x∈X} E[(f̂(x) − f(x))²] is small. To build such an estimate, we can make N = 1000 measurements. If we measure the function at location X ∈ X we assume we observe Y = f(X) + η where η ∼ N(0,1). While you can choose any N measurement locations in X you’d like (with repeats), you have to choose all locations before the observations are revealed. This problem emphasizes that even though we’re studying linear functions and linear bandits in class, they can encode incredibly rich functions using non-linear transformations.

1.    Linear functions. Assume f(x) = ⟨x, θ∗⟩ for some θ∗ ∈ R^d. Propose a strategy to select your N measurements and fit f̂. What guarantees can you make about max_{x∈X} E[(f̂(x) − f(x))²]? What about the random quantity max_{x∈X} (f̂(x) − f(x))²?
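For intuition, one standard guarantee for the first question, assuming the N samples are drawn from a G-optimal design λ over X and f̂ is an ordinary least-squares fit (a sketch, not necessarily the intended answer): by the Kiefer–Wolfowitz theorem, max_{x∈X} ‖x‖²_{A(λ)⁻¹} ≤ d, so

\[
\max_{x\in\mathcal X}\ \mathbb{E}\big[(\hat f(x)-f(x))^2\big]
\;=\; \max_{x\in\mathcal X}\ x^\top\Big(\sum_{t=1}^{N} x_t x_t^\top\Big)^{-1} x
\;\approx\; \frac{1}{N}\,\max_{x\in\mathcal X}\ \|x\|_{A(\lambda)^{-1}}^2 \;\le\; \frac{d}{N},
\]
where the first equality uses that least squares is unbiased with unit-variance noise. The random quantity max_{x∈X} (f̂(x) − f(x))² is larger by roughly a log|X| factor, via a union bound over X.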

2.    Sobolev functions. For simplicity, let d = 1 and assume f(x) is absolutely continuous with ∫₀¹ f′(x)² dx < ∞. This class of functions is known as the first-order Sobolev space³. Remarkably, this set of functions is equivalent to a reproducing kernel Hilbert space (RKHS) with the kernel k(x, x′) = 1 + min{x, x′}. While a discussion of all its properties is beyond the scope of this class (see [2, Ch. 12] for an excellent treatment), it follows that there exists a sequence of orthogonal functions φ(x) := (φ1(x), φ2(x), φ3(x), ...) such that each φk : [0,1] → R and

•    for all i, j ∈ N we have ∫₀¹ φi(x)φj(x) dx = βi 1{i = j} for a non-increasing sequence {βi}i,

•    k(x, x′) = ∑_{i=1}^∞ φi(x)φi(x′) = ⟨φ(x), φ(x′)⟩,

•    and any function f in this class can be written as f(x) = ∑_{i=1}^∞ θi φi(x) = ⟨φ(x), θ⟩ with ∑_{i=1}^∞ θi² < ∞.

Because φ(x) could be infinite dimensional, it’s not immediately clear why this representation is convenient. For finite datasets, though (i.e., n < ∞), there always exists a finite-dimensional φ : X → R^{|X|} with k(xi, xj) = ⟨φ(xi), φ(xj)⟩. Precisely, if one computes the matrix [K]i,j = k(xi, xj) and K = V diag(e) V^⊤ is the resulting eigenvalue decomposition, where {ek}k is a non-increasing sequence of eigenvalues, then for all i ∈ [n] we have [φi]j := [φ(xi)]j := vi,j √(ej), and there exists a θ ∈ R^n such that f(xi) = ⟨φi, θ⟩ for all i ∈ [n].

 

²Hint: I = A(λ)A(λ)⁻¹. Thus, 0 = ∂/∂λj [A(λ)A(λ)⁻¹] = (∂A(λ)/∂λj) A(λ)⁻¹ + A(λ) ∂A(λ)⁻¹/∂λj.

³This class is quite rich, including functions as diverse as polynomials a + bx + cx⁸, trigonometric functions sin(ax) + cos(bx), exponential functions e^{ax} + bx, and even non-smooth functions like max{1 − x, 3x}. However, it does not include functions such as √x, whose derivative blows up at x = 0.
Because the ej are rapidly going to zero, we usually do not include all n coordinates, but cut the expansion off at some d < n so that φi ∈ R^d and f(xi) ≈ ⟨φi, θ⟩. To summarize, for each i ∈ [n] we have defined a map xi → φi with φi ∈ R^d and f(xi) ≈ ⟨φi, θ∗⟩ for some unknown θ∗. Here is a piece of code that generates our set X ⊂ [0,1] and these features Φ = [φ1,...,φn]^⊤ ∈ R^{n×d}:

import numpy as np

n = 300
X = np.concatenate((np.linspace(0, 1, 50), 0.25 + 0.01*np.random.randn(250)), 0)
X = np.sort(X)

K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = 1 + min(X[i], X[j])
e, v = np.linalg.eigh(K)  # eigenvalues are increasing in order
d = 30
Phi = np.real(v @ np.diag(np.sqrt(np.abs(e))))[:, (n-d)::]

Let’s define some arbitrary function and see how well this works!

def f(x):
    return -x**2 + x*np.cos(8*x) + np.sin(15*x)

f_star = f(X)
theta = np.linalg.lstsq(Phi, f_star, rcond=None)[0]
f_hat = Phi @ theta
Observe that there is nearly no error in the reconstruction even though we’re using a linear model in R30. This is because we defined a very good basis Φ ⊂ Rd. We could have achieved the same result by learning in kernel space and adding regularization (this is known as Bayesian experimental design). Instead of choosing d we would have to choose the amount of regularization, so there is still always a hyperparameter to choose.
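For comparison, a minimal sketch of the kernel-space alternative mentioned above, reusing K, n, and f_star from the listings above; the ridge level reg is the hyperparameter that replaces the choice of d (an illustration, not the assignment's prescribed method):

# Kernel ridge regression: f_hat_krr = K @ (K + reg*I)^{-1} f_star.
reg = 1e-4
alpha = np.linalg.solve(K + reg * np.eye(n), f_star)
f_hat_krr = K @ alpha
print("max abs reconstruction error:", np.max(np.abs(f_hat_krr - f_star)))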

Let’s get back to estimating f from noisy samples. Now that we have made this a linear estimation problem, we are exactly in the setting of the first part of this problem. Consider the below listing:

def observe(idx):
    return f(X[idx]) + np.random.randn(len(idx))

def sample_and_estimate(X, lbda, tau):
    n, d = X.shape
    reg = 1e-6  # we can add a bit of regularization to avoid divide by 0
    idx = np.random.choice(np.arange(n), size=tau, p=lbda)
    y = observe(idx)
    XtX = X[idx].T @ X[idx]
    XtY = X[idx].T @ y
    theta = np.linalg.lstsq(XtX + reg*np.eye(d), XtY, rcond=None)[0]
    return Phi @ theta, XtX

T = 1000

lbda = G_optimal(Phi)
f_G_Phi, A = sample_and_estimate(Phi, lbda, T)
conf_G = np.sqrt(np.sum(Phi @ np.linalg.inv(A) * Phi, axis=1))

lbda = np.ones(n)/n
f_unif_Phi, A = sample_and_estimate(Phi, lbda, T)
conf_unif = np.sqrt(np.sum(Phi @ np.linalg.inv(A) * Phi, axis=1))
Use your implementation of G_optimal from the previous problem and use the sample_and_estimate function given above. Your tasks (create a legend for all curves, label all axes, and provide a title for all plots):

(a)    Plot 1: x-axis should be the x locations in [0,1]. The first line is the CDF of the uniform distribution over X. The second line is the CDF of the G-optimal allocation over X. Comment on the relative shapes, and how this relates to the distribution of the x’s (most of the locations in X are concentrated near 0.25).

(b)    Plot 2: x-axis should be the x locations in [0,1]. Plot f_star, f_G_Phi, and f_unif_Phi.

(c)    Plot 3: x-axis should be the x locations in [0,1]. The first line is the absolute value of f_G_Phi minus f_star, the second line is the absolute value of f_unif_Phi minus f_star, the third line is √(d/n), the fourth line is conf_G, and the fifth line is conf_unif. Comment on what these lines have to do with each other.

5.4 Linear bandits. Implement the elimination algorithm (use your implementation of G-optimal design), UCB, and Thompson sampling (see listings below). Use the precise setup of the previous problem and set the arm set to X = {φ1,...,φn}. For T = 40,000, on a single plot with t ∈ [T] on the x-axis, plot the cumulative regret ∑_{s=1}^t (max_{x∈X} f(x) − f(xs)) with a line for each algorithm. Comment on the results: which algorithm would you recommend to minimize regret?

Elimination algorithm
Input: T ∈ N
Initialize τ = 100, δ = 1/T, γ = 1, U = 1 (an upper bound on ‖θ∗‖₂), V0 = γI, S0 = 0, X̂1 = X
for k = 1,...,⌊T/τ⌋:
    Compute a G-optimal design λ(k) over X̂k
    Draw x_{(k−1)τ+1},...,x_{kτ} ∼ λ(k)
    For t = (k − 1)τ + 1,...,kτ, pull arm xt and observe yt = f(xt) + ηt where ηt ∼ N(0,1)
    Vk = Vk−1 + ∑_{t=(k−1)τ+1}^{kτ} xt xt^⊤,  Sk = Sk−1 + ∑_{t=(k−1)τ+1}^{kτ} xt yt,  θk = Vk⁻¹ Sk
    x̂k = argmax_{x′∈X̂k} ⟨x′, θk⟩
    X̂k+1 = X̂k − {x ∈ X̂k : ⟨x̂k − x, θk⟩ ≥ βk ‖x̂k − x‖_{Vk⁻¹}}
UCB algorithm
Input: T ∈ N
Initialize δ = 1/T, γ = 1, U = 1 (an upper bound on ‖θ∗‖₂), V0 = γI, S0 = 0
for t = 0, 1, 2,...,T − 1:
    θt = Vt⁻¹ St
    xt = argmax_{x∈X} ⟨x, θt⟩ + βt ‖x‖_{Vt⁻¹}
    Pull arm xt and observe yt = f(xt) + ηt where ηt ∼ N(0,1)
    Vt+1 = Vt + xt xt^⊤,  St+1 = St + xt yt
Thompson sampling algorithm
Input: T ∈ N
Initialize γ = 1, V0 = γI, S0 = 0
for t = 0, 1, 2,...,T − 1:
    θt = Vt⁻¹ St
    θ̃t ∼ N(θt, Vt⁻¹)
    xt = argmax_{x∈X} ⟨x, θ̃t⟩
    Pull arm xt and observe yt = f(xt) + ηt where ηt ∼ N(0,1)
    Vt+1 = Vt + xt xt^⊤,  St+1 = St + xt yt
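As a reference for the implementation, a minimal sketch of the Thompson sampling loop above for a finite arm set; arms is an (n, d) array of feature vectors and observe_reward(x) is a hypothetical oracle returning f(x) plus N(0,1) noise:

import numpy as np

def thompson_sampling(arms, observe_reward, T, gamma=1.0, rng=None):
    # Sketch of the Thompson sampling pseudocode above for a finite arm set.
    rng = rng if rng is not None else np.random.default_rng()
    n, d = arms.shape
    V = gamma * np.eye(d)                       # V_0 = gamma * I
    S = np.zeros(d)                             # S_0 = 0
    rewards = np.zeros(T)
    for t in range(T):
        Vinv = np.linalg.inv(V)
        theta = Vinv @ S                        # theta_t = V_t^{-1} S_t
        theta_tilde = rng.multivariate_normal(theta, Vinv)   # theta~_t ~ N(theta_t, V_t^{-1})
        i = int(np.argmax(arms @ theta_tilde))  # x_t = argmax_x <x, theta~_t>
        x = arms[i]
        y = observe_reward(x)                   # y_t = f(x_t) + eta_t
        V += np.outer(x, x)                     # V_{t+1} = V_t + x_t x_t^T
        S += x * y                              # S_{t+1} = S_t + x_t y_t
        rewards[t] = y
    return rewards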
References
[1]   Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.

 
¹If Zi ∼ N(0,σ²) for i = 1,...,n, show that P(max_{i=1,...,n} Zi ≥ √(σ² log(n))) > c for some absolute constant c.
