CSE541: Homework 1

Probability
Concentration inequalities are at the heart of most arguments in statistical learning theory and bandits. Refer to [1] for more details.

1.1 (Markov’s Inequality) Let X be a nonnegative random variable. Prove that P(X ≥ t) ≤ E[X]/t for all t > 0.

1.2 (Jensen’s Inequality) Let X be a random vector in R^d and let φ : R^d → R be convex. Then φ(E[X]) ≤ E[φ(X)]. Show this inequality for the special case when X has discrete support. That is, for p_i ≥ 0 with ∑_{i=1}^n p_i = 1, and (x_1,...,x_n) ∈ R^n, show that φ(∑_{i=1}^n p_i x_i) ≤ ∑_{i=1}^n p_i φ(x_i).

1.3 (Sub-additivity of sub-Gaussian) For i = 1,...,n assume X_i is an independent random variable satisfying E[exp(λ(X_i − µ_i))] ≤ exp(λ²σ_i²/2) for all λ ∈ R. Setting Z = ∑_{i=1}^n X_i, find a ∈ R and b ≥ 0 such that E[exp(λ(Z − a))] ≤ exp(λ²b²/2) for all λ ∈ R.

1.4 (Maximal inequality) For i = 1,...,n let each X_i be an independent random variable that satisfies E[exp(λX_i)] ≤ exp(σ_i²λ²/2) for all λ > 0. Show that E[max_{i=1,...,n} X_i] ≤ √(2 max_{i=1,...,n} σ_i² log(n)). Hint[1].

If the X_i are i.i.d. N(0, σ²), how would you expect E[max_{i=1,...,n} X_i] to behave (intuitive justification is enough)?
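As a quick sanity check on the bound in 1.4, the sketch below estimates E[max_{i=1,...,n} X_i] by Monte Carlo for i.i.d. N(0, σ²) draws (which satisfy the stated moment-generating-function condition) and compares it against √(2σ² log n). The instance and sample sizes are illustrative choices, not part of the assignment.

```python
import numpy as np

# Monte Carlo check of the maximal inequality for X_i ~ N(0, sigma^2) i.i.d.
# (Gaussians satisfy E[exp(lambda X_i)] = exp(sigma^2 lambda^2 / 2).)
rng = np.random.default_rng(0)
sigma = 1.0
trials = 10_000

for n in [2, 10, 100, 1000]:
    draws = rng.normal(0.0, sigma, size=(trials, n))
    emp_max = draws.max(axis=1).mean()           # estimate of E[max_i X_i]
    bound = np.sqrt(2 * sigma**2 * np.log(n))    # bound from problem 1.4
    print(f"n = {n:4d}   E[max] ~ {emp_max:.3f}   bound = {bound:.3f}")
```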

The Upper Confidence Bound Algorithm.

Consider the following algorithm for the multi-armed bandit problem.

Algorithm 1: UCB

 
Input: Time horizon T, 1-subGaussian arm distributions P1,··· ,Pn with unknown means µ1,··· ,µn

Initialize: Let T_i(t) denote the number of times arm i has been pulled up to and including time t, and let T_i = T_i(T). Pull each arm once.

for: t = n + 1,··· ,T

Pull arm I_t = argmax_{i=1,...,n} µ̂_{i,T_i(t−1)} + √(2 log(2nT²)/T_i(t−1)) and observe a draw from P_{I_t}

Let µ̂_{i,T_i(t)} be the empirical mean of the first T_i(t) pulls
In the following exercises, we will compute the regret of the UCB algorithm and show it matches the regret bound from lecture. Without loss of generality, assume that the best arm is arm 1, so that µ_1 = max_i µ_i. For any i ∈ [n], define the sub-optimality gap ∆_i = µ_1 − µ_i. Define the regret at time T as R_T = E[∑_{t=1}^T (µ_1 − µ_{I_t})] = ∑_{i=1}^n ∆_i E[T_i].
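Before the analysis, here is a minimal NumPy sketch of Algorithm 1 on simulated Gaussian arms. The function name `run_ucb` and the toy instance are illustrative; only the index µ̂_{i,T_i(t−1)} + √(2 log(2nT²)/T_i(t−1)) comes from the algorithm above.

```python
import numpy as np

def run_ucb(mus, T, rng):
    """Run UCB (Algorithm 1) with confidence radius sqrt(2*log(2*n*T^2) / T_i(t-1))."""
    n = len(mus)
    counts = np.zeros(n)              # T_i(t): number of pulls of arm i so far
    sums = np.zeros(n)                # running reward sums per arm
    pulls = []
    for t in range(T):
        if t < n:
            i = t                     # initialization: pull each arm once
        else:
            index = sums / counts + np.sqrt(2 * np.log(2 * n * T**2) / counts)
            i = int(np.argmax(index))
        x = rng.normal(mus[i], 1.0)   # reward draw (Gaussian, hence 1-sub-Gaussian)
        counts[i] += 1
        sums[i] += x
        pulls.append(i)
    return np.array(pulls)

# example run: pseudo-regret sum_t (mu_1 - mu_{I_t}) on a toy instance
rng = np.random.default_rng(0)
mus = np.array([0.5, 0.3, 0.1])
pulls = run_ucb(mus, T=2000, rng=rng)
print("pseudo-regret:", np.sum(mus.max() - mus[pulls]))
```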

2.1 Consider the event

E = { |µ̂_{i,s} − µ_i| ≤ √(2 log(2nT²)/s) for all i ∈ [n] and s ∈ [T] }.

Show that P(E^c) ≤ 1/T.

2.2 On event E, show that T_i ≤ 8 log(2nT²)/∆_i² + 1 for i ≠ 1.

2.3 Show that R_T ≤ ∑_{i:∆_i>0} (8 log(2nT²)/∆_i + ∆_i) + 2. When n ≤ T, conclude by showing that R_T ≤ c√(nT log(T)) for an absolute constant c > 0.

Thompson Sampling.

We consider the following Bayesian setting. Consider n arms and let p_0 be an n-dimensional prior distribution over [−1,1]^n such that θ∗ ∼ p_0 is drawn before the start of the game (e.g., p_0 is uniform over [−1,1]^n). At any time t, when we pull arm i ∈ [n] we observe a random variable X_{i,t} ∈ [−1,1] with E[X_{i,t} | θ∗] = θ∗_i.

Algorithm 2: Thompson Sampling

 
Input: Time horizon T, arm distributions ν1,··· ,νn

Assume the prior distribution p0 over Rn is known and that θ∗ ∼ p0 (so that θ∗ ∈ Rn). Let pt(·|I1,XI1,1,··· ,It,XIt,t) be the posterior distribution on θ∗ at time t.

for: t = 1,··· ,T

Sample θ(t) ∼ pt−1 (Note: θ(t) ∈ Rn)

Pull arm I_t = argmax_{i=1,...,n} θ_i(t) and observe X_{I_t,t}

Compute exact posterior update pt
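For reference, here is a minimal sketch of Thompson Sampling in the Gaussian setting used in the experiments below (an N(0,1) prior on each θ∗_i and unit-variance Gaussian rewards), where the posterior update is the standard Gaussian conjugate update. The function name and toy instance are illustrative.

```python
import numpy as np

def run_thompson(mus, T, rng):
    """Thompson Sampling with independent N(0,1) priors and N(mu_i, 1) rewards.

    With an N(0,1) prior and unit-variance observations, the posterior on arm i
    after s pulls with reward sum S is N(S / (s + 1), 1 / (s + 1)).
    """
    n = len(mus)
    counts = np.zeros(n)
    sums = np.zeros(n)
    pulls = []
    for t in range(T):
        post_mean = sums / (counts + 1.0)
        post_std = np.sqrt(1.0 / (counts + 1.0))
        theta = rng.normal(post_mean, post_std)   # sample theta(t) ~ p_{t-1}
        i = int(np.argmax(theta))                 # pull I_t = argmax_i theta_i(t)
        x = rng.normal(mus[i], 1.0)               # observe X_{I_t, t}
        counts[i] += 1
        sums[i] += x
        pulls.append(i)
    return np.array(pulls)

rng = np.random.default_rng(0)
mus = np.array([0.5, 0.3, 0.1])
pulls = run_thompson(mus, T=2000, rng=rng)
print("pseudo-regret:", np.sum(mus.max() - mus[pulls]))
```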
Denote the σ-algebra generated by the observations at time t by F_t = σ(I_1,X_{I_1,1},··· ,I_t,X_{I_t,t}) (if you are unfamiliar with σ-algebras, don’t worry too much - conditioning on the σ-algebra just means conditioning on the choices of arms and the rewards observed). The Bayesian Regret of an algorithm is

BR(T) = E[ ∑_{t=1}^T ( max_{i∈[n]} θ∗_i − θ∗_{I_t} ) ].

Assume that expectations, if not explicitly specified, are with respect to all randomness including θ∗ ∼ p0, I1,...,IT, and observations.

3.1 On a given run of the algorithm, let θ̂_{i,s} denote the empirical mean of the first s pulls from arm i; note that, conditioned on θ∗, the observations X_{i,t} ∈ [−1,1] are 1-sub-Gaussian. Let the good event be

E = { |θ̂_{i,s} − θ∗_i| ≤ √(2 log(2/δ)/s) for all i ∈ [n] and s ∈ [T] }.

Show that P(E^c) ≤ nTδ.

3.2 (Key idea.) Argue that P(i∗ = ·|Ft−1) = P(It = ·|Ft−1).

3.3 Define i∗ = argmax_{i∈[n]} θ∗_i and, for each arm i, let U_t(i) = θ̂_{i,T_i(t−1)} + √(2 log(2/δ)/T_i(t−1)) be the upper confidence bound at time t (with T_i(t−1) the number of pulls of arm i before time t, as in Algorithm 1). Show that E[θ∗_{i∗} − θ∗_{I_t} | F_{t−1}] = E[U_t(I_t) − θ∗_{I_t} | F_{t−1}] + E[θ∗_{i∗} − U_t(i∗) | F_{t−1}]. Conclude that BR(T) = E[∑_{t=1}^T E[U_t(I_t) − θ∗_{I_t} | F_{t−1}]] + E[∑_{t=1}^T E[θ∗_{i∗} − U_t(i∗) | F_{t−1}]]. Hint[2].

3.4 Show that BR(T) ≤ c(√(nT log(2/δ)) + nT²δ) for an absolute constant c > 0. Hint[3].

3.5 Make an appropriate choice of δ and state a final regret bound.

In general, giving frequentist bounds on the regret is significantly more difficult. We refer the interested reader to [2] and the tutorial [3] for more details. This exercise is motivated by [4].

Algorithm 3: Explore-then-Commit

 

Input: Time horizon T, m ∈ N, 1-sub-Gaussian arm distributions P1,··· ,Pn with unknown means µ1,··· ,µn

for: t = 1,··· ,T

If t ≤ mn, choose I_t = (t mod n) + 1

Else, choose I_t = argmax_{i=1,...,n} µ̂_{i,m}, where µ̂_{i,m} is the empirical mean of the m exploration pulls of arm i
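A minimal sketch of Explore-then-Commit on simulated Gaussian arms follows; as with the earlier sketches, the function name and instance are illustrative.

```python
import numpy as np

def run_etc(mus, T, m, rng):
    """Explore-then-Commit: pull each arm m times round-robin, then commit to the
    arm with the best empirical mean for the remaining rounds."""
    n = len(mus)
    counts = np.zeros(n)
    sums = np.zeros(n)
    pulls = []
    best = None
    for t in range(1, T + 1):
        if t <= m * n:
            i = t % n                                  # (t mod n) + 1 in the 1-indexed notation above
        else:
            if best is None:
                best = int(np.argmax(sums / counts))   # commit to the empirical best arm
            i = best
        x = rng.normal(mus[i], 1.0)
        counts[i] += 1
        sums[i] += x
        pulls.append(i)
    return np.array(pulls)

rng = np.random.default_rng(0)
mus = np.array([0.5, 0.3, 0.1])
pulls = run_etc(mus, T=2000, m=20, rng=rng)
print("pseudo-regret:", np.sum(mus.max() - mus[pulls]))
```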
Empirical Experiments
Implement UCB, Thompson Sampling (TS), and Explore-then-Commit (ETC). Let Pi = N(µi,1) for i = 1,...,n. For Thompson sampling, define the prior for the ith arm as N(0,1).

4.1 Let n = 10, µ_1 = 0.1, and µ_i = 0 for i > 1. On a single plot, for an appropriately large T to see the expected effects, plot the regret of UCB, TS, and ETC (the latter for several values of m).

4.2 Let n = 40, µ_1 = 1, and µ_i = 1 − 1/√(i − 1) for i > 1. On a single plot, for an appropriately large T to see the expected effects, plot the regret of UCB, TS, and ETC (the latter for several values of m).
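A minimal sketch of a simulation harness for these experiments is below. The `simulate` helper and the uniformly random placeholder policy are illustrative; the intent is that the UCB, TS, and ETC sketches above (or your own implementations) are swapped in for the placeholder and the resulting curves drawn on a single plot.

```python
import numpy as np
import matplotlib.pyplot as plt

def simulate(select_arm, mus, T, reps, seed=0):
    """Average cumulative pseudo-regret of a policy over `reps` independent runs.

    `select_arm(t, counts, sums)` returns the (0-indexed) arm to pull in round t,
    given the pull counts and reward sums observed so far.
    """
    rng = np.random.default_rng(seed)
    inst_regret = np.zeros(T)
    for _ in range(reps):
        counts = np.zeros(len(mus))
        sums = np.zeros(len(mus))
        for t in range(T):
            i = select_arm(t, counts, sums)
            sums[i] += rng.normal(mus[i], 1.0)
            counts[i] += 1
            inst_regret[t] += mus.max() - mus[i]
    return np.cumsum(inst_regret / reps)

policy_rng = np.random.default_rng(1)
def random_policy(t, counts, sums):
    # uniformly random placeholder; replace with UCB, TS, or ETC arm selection
    return int(policy_rng.integers(len(counts)))

mus = np.concatenate([[0.1], np.zeros(9)])   # instance from 4.1: n = 10, mu_1 = 0.1, mu_i = 0
plt.plot(simulate(random_policy, mus, T=5000, reps=20), label="random (placeholder)")
plt.xlabel("t")
plt.ylabel("cumulative regret")
plt.legend()
plt.show()
```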

Lower Bounds on Hypothesis Testing
Consider n samples X_1,··· ,X_n ∼ P where P ∈ {P_0,P_1} (assume for simplicity that these are probability distributions on R). A hypothesis test for H_0 : P = P_0 versus H_1 : P = P_1 is a function φ(x_1,··· ,x_n) : R^n → {0,1} that takes the data as input and returns the null or the alternative. Assume that dP_i = p_i(x)dx so that a probability density function exists (think: Gaussian). If x ∈ R^n is the vector of n observations, define p_i(x) = ∏_{j=1}^n p_i(x_j) for i ∈ {0,1}, the joint density of the sample under P_i. In this problem, we will lower bound the number of samples needed by any hypothesis test on a fixed number of samples. Convince yourself, at least intuitively, that any best-arm identification algorithm for two arms will take at least as many samples as this hypothesis test takes.

5.1 Show that inf_φ ( P_0(φ = 1) + P_1(φ = 0) ) = ∫_{R^n} min(p_0(x), p_1(x)) dx, where the infimum is over all tests φ : R^n → {0,1}. Hint[4].

5.2 Let’s continue on. Show that ∫_{R^n} min(p_0(x), p_1(x)) dx ≥ (1/2) ( ∫_{R^n} √(p_0(x) p_1(x)) dx )². Hint[5][6].

5.3 One more step. Show that ( ∫_{R^n} √(p_0(x) p_1(x)) dx )² ≥ exp( −∫_{R^n} p_0(x) log(p_0(x)/p_1(x)) dx ). Hint[7].

5.4 The final quantity is known as the KL-Divergence between distributions. Now assume that P0 = N(µ01n,In) and P1 = N(µ11n,In) where In is the n × n identity matrix and 1n ∈ Rn is the all ones vector. Show (or look up) KL(P0||P1).
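As a numerical cross-check for 5.4, the sketch below estimates KL(P_0||P_1) = E_{X∼P_0}[log(p_0(X)/p_1(X))] by Monte Carlo for the Gaussians in the problem and compares it to the closed form n(µ_0 − µ_1)²/2 for isotropic Gaussians with identity covariance; the particular values of n, µ_0, µ_1 are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

n, mu0, mu1 = 5, 0.0, 0.3                     # illustrative values
P0 = multivariate_normal(mean=mu0 * np.ones(n), cov=np.eye(n))
P1 = multivariate_normal(mean=mu1 * np.ones(n), cov=np.eye(n))

# Monte Carlo estimate of KL(P0 || P1) = E_{X ~ P0}[log p0(X) - log p1(X)]
X = P0.rvs(size=200_000, random_state=0)
kl_mc = np.mean(P0.logpdf(X) - P1.logpdf(X))

kl_closed = n * (mu0 - mu1) ** 2 / 2          # closed form for N(mu0*1_n, I_n) vs N(mu1*1_n, I_n)
print(f"Monte Carlo: {kl_mc:.4f}   closed form: {kl_closed:.4f}")
```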

5.5 Conclude that to achieve a test that accurately determines whether the sample of size n came from P_0 or P_1 with a probability of error less than δ, we necessarily have n ≥ 2∆^{−2} log(1/(4δ)), where ∆ = µ_1 − µ_0.
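For a sense of scale, a quick evaluation of the sample-size bound in 5.5 for a few illustrative choices of the gap ∆ and error level δ:

```python
import numpy as np

# n >= 2 * Delta^{-2} * log(1 / (4 * delta)), as in problem 5.5
for gap, delta in [(0.5, 0.05), (0.1, 0.05), (0.1, 0.01)]:
    n_min = 2 * gap ** -2 * np.log(1 / (4 * delta))
    print(f"Delta = {gap}, delta = {delta}:  n >= {n_min:.1f}")
```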

Remark: The art of lower bounds is well established and extensive in statistics. See [5] for more details in the hypothesis testing setting. In the bandit setting, see [6].
