Homework 3: Exploration in Reinforcement Learning (Theory)

1           UCB
We find ourselves in the setting of multi-arm bandits.

The question is to prove whether or not $\hat{\mu}_{j,t}$ is an unbiased estimator of $\mu_j$.

At first sight, one could interpret $\hat{\mu}_{j,t}$ as a simple empirical mean of i.i.d. samples and conclude that it is unbiased. However, this reasoning fails in the online, on-policy setting of UCB: the number of pulls $N_{j,t}$ is random and depends on the previously observed samples $X_{i_k,k}$, so the samples entering the average are selected in a data-dependent way. One can therefore expect the estimate to be biased.

To show that $\hat{\mu}_{j,t}$ is not an unbiased estimator in the general case, we consider a simple instance and compute its bias analytically. Consider the Bernoulli bandit setting of Section 3 with $k = 2$ arms of parameters $\mu_1$ and $\mu_2$. At each round one pulls the arm $i_t$ such that

$$i_t \in \arg\max_{j}\ \big\{\hat{\mu}_{j,t} + U(N_{j,t},\delta)\big\}$$

We assume here that ties are broken uniformly at random. The UCB exploration term is infinite for $t \in \{1,2\}$, so each arm is pulled once during the first two rounds. At $t = 3$, both arms have been pulled once and one of them is pulled again. We look at the sample mean estimates $\hat{\mu}_{1,3}$ and $\hat{\mu}_{2,3}$ after the third action.

This leads to the calculation of the expected value of $\hat{\mu}_{1,3}$, obtained by conditioning on the outcomes of the first two pulls and on the tie-breaking.

The bias of arm 1 is then obtained by subtracting $\mu_1$ from this expected value. Since the arms play a symmetrical role in the derivation of the bias, one can derive the bias for arm 2 by exchanging the roles of $\mu_1$ and $\mu_2$.

These biases are strictly negative whenever $0 < \mu_1, \mu_2 < 1$. Therefore, $\hat{\mu}_{j,t}$ is not an unbiased estimator of $\mu_j$ in general.
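As a sanity check (this is our addition, not part of the original derivation): a direct case analysis under the assumptions above, with ties broken uniformly and the estimate read right after the third pull, gives $\mathbb{E}[\hat{\mu}_{1,3}] = \mu_1(3+\mu_1)/4$, i.e. a bias of $-\mu_1(1-\mu_1)/4$. The following Monte Carlo simulation (illustrative code, not the homework's implementation) can be used to confirm the sign of the bias numerically:

import numpy as np

def mean_estimate_arm1_after_3_pulls(mu1, mu2, n_sim=200_000, seed=0):
    """Simulate the first three UCB steps on two Bernoulli arms and return
    the average sample-mean estimate of arm 1 after the third pull.

    Assumptions (matching the text): each arm is pulled once at t = 1, 2,
    and at t = 3 the arm with the larger sample mean is pulled, ties
    broken uniformly at random.
    """
    rng = np.random.default_rng(seed)
    x1 = (rng.random(n_sim) < mu1).astype(float)      # first pull of arm 1
    x2 = (rng.random(n_sim) < mu2).astype(float)      # first pull of arm 2
    tie_break = rng.random(n_sim) < 0.5               # used only when x1 == x2
    pull_arm1 = np.where(x1 == x2, tie_break, x1 > x2)
    x1_new = (rng.random(n_sim) < mu1).astype(float)  # second pull of arm 1, if taken
    est = np.where(pull_arm1, (x1 + x1_new) / 2, x1)
    return est.mean()

if __name__ == "__main__":
    for mu1, mu2 in [(0.3, 0.5), (0.5, 0.5), (0.9, 0.1)]:
        est = mean_estimate_arm1_after_3_pulls(mu1, mu2)
        print(f"mu1={mu1}: E[mu_hat_1] ~ {est:.4f}, bias ~ {est - mu1:+.4f}")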

2           Best Arm Identification
Let us compute a function U(t,δ) that satisfies the any-time confidence bound. For any arm i ∈ [k]

If one chooses $U(t,\delta)$ so that the deviation probabilities are summable over $t$, a union bound combined with Hoeffding's inequality (for rewards in $[0,1]$) gives

$$\mathbb{P}\left(\bigcup_{t=1}^{\infty} \big\{|\hat{\mu}_{i,t} - \mu_i| > U(t,\delta)\big\}\right) \le \sum_{t=1}^{\infty} \mathbb{P}\big(|\hat{\mu}_{i,t} - \mu_i| > U(t,\delta)\big) \le \sum_{t=1}^{\infty} 2\exp\big(-2t\,U(t,\delta)^2\big) \qquad \text{(Hoeffding's inequality)}$$

Let $E = \bigcup_{i=1}^{k}\bigcup_{t=1}^{\infty}\big\{|\hat{\mu}_{i,t} - \mu_i| > U(t,\delta_0)\big\}$ with $\delta_0 = \delta/k$ (so that the union over the $k$ arms gives $\delta$). For this choice, each arm contributes at most $\delta_0$ to the union bound, so $\mathbb{P}(E) \le k\,\delta_0 = \delta$. This is a bad event, since under $E$ the confidence intervals do not all hold.
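For concreteness, one admissible choice of the width (an assumption of ours; any $U$ making the series above sum to at most $\delta$ works) is

$$U(t,\delta) = \sqrt{\frac{\log\big(2t(t+1)/\delta\big)}{2t}}, \qquad\text{so that}\qquad 2\exp\big(-2t\,U(t,\delta)^2\big) = \frac{\delta}{t(t+1)} \quad\text{and}\quad \sum_{t=1}^{\infty}\frac{\delta}{t(t+1)} = \delta.$$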

Let us show that with probability at least $1 - \delta$, the optimal arm $i^\star = \arg\max_i \mu_i$ remains in the active set $S$.

Let us assume ¬E. Under such conditions,

$$\forall t,\ \forall i, \quad |\hat{\mu}_{i,t} - \mu_i| \le U(t,\delta_0)$$

Therefore,

$$\forall t,\ \forall i \neq i^\star, \quad \begin{cases} \hat{\mu}_{i^\star,t} \ge \mu^\star - U(t,\delta_0) \\ \hat{\mu}_{i,t} \le \mu_i + U(t,\delta_0) \end{cases}$$

$$\forall t,\ \forall i \neq i^\star, \quad \hat{\mu}_{i^\star,t} - \hat{\mu}_{i,t} \ge \Delta_i - 2U(t,\delta_0) \qquad (1)$$

where $\Delta_i = \mu^\star - \mu_i$.

Let us now show that this implies that, under these conditions, arm $i^\star$ remains in the active set $S$. We proceed by contradiction: assume arm $i^\star$ is eliminated at time $t_0$. Using $\delta_0$ instead of $\delta$ in the algorithm, this means that there exists an active arm $i_0 \neq i^\star$ such that

$$\hat{\mu}_{i_0,t_0} - \hat{\mu}_{i^\star,t_0} \ge 2U(t_0,\delta_0) \qquad (2)$$

Using equation (1) for $t = t_0$ and $i = i_0$ combined with equation (2), one finds the following inequality:

$$\Delta_{i_0} - 2U(t_0,\delta_0) \le \hat{\mu}_{i^\star,t_0} - \hat{\mu}_{i_0,t_0} \le -2U(t_0,\delta_0)$$

This implies $\Delta_{i_0} \le 0$, which is a contradiction since $\Delta_{i_0} = \mu^\star - \mu_{i_0} > 0$ (we assume for simplicity that there is only one best arm). Therefore, by contradiction, under the $\neg E$ conditions the arm $i^\star$ remains in the active set $S$.

In other words,

$$\neg E \subset \{\text{arm } i^\star \text{ remains in the active set}\},$$

hence

$$\mathbb{P}(\{\text{arm } i^\star \text{ remains in the active set}\}) \ge \mathbb{P}(\neg E) = 1 - \mathbb{P}(E) \ge 1 - \delta.$$

Under event $\neg E$, let us find $C_1$ such that, for an arm $i \neq i^\star$, if $\Delta_i \ge C_1 U(t,\delta_0)$ then arm $i$ will be removed from the active set.

Let $i \neq i^\star$ and apply the $\neg E$ conditions to $i$ and $i^\star$:

$$\begin{cases} |\hat{\mu}_{i^\star,t} - \mu^\star| \le U(t,\delta_0) \\ |\hat{\mu}_{i,t} - \mu_i| \le U(t,\delta_0) \end{cases} \quad\Longrightarrow\quad \begin{cases} \mu^\star - U(t,\delta_0) \le \hat{\mu}_{i^\star,t} \le \mu^\star + U(t,\delta_0) \\ -\mu_i - U(t,\delta_0) \le -\hat{\mu}_{i,t} \le -\mu_i + U(t,\delta_0) \end{cases}$$

According to the algorithm (using $\delta_0$ and not $\delta$ in the pseudo-code), if $\hat{\mu}_{i^\star,t} - \hat{\mu}_{i,t} \ge 2U(t,\delta_0)$, the arm $i$ will be removed from the active set.

Therefore, if $\Delta_i - 2U(t,\delta_0) \ge 2U(t,\delta_0) \iff \Delta_i \ge 4U(t,\delta_0)$, the arm $i$ is guaranteed to be removed from the active set under the $\neg E$ conditions.

Under event $\neg E$, an arm $i \neq i^\star$ will be removed from the active set when $\Delta_i \ge C_1 U(t,\delta_0)$ with $C_1 = 4$.

With our definition of $U(t,\delta_0)$, the condition $\Delta_i \ge 4U(t,\delta_0)$ can be turned into an explicit threshold on $t$. Lower-bounding $\log t$ by $0$ (since $t \ge 1$), for every arm $i \neq i^\star$ there exists a time $t_i$ such that

$$t \ge t_i \;\Longrightarrow\; \Delta_i \ge 4U(t,\delta_0) \;\Longrightarrow\; \text{arm } i \text{ is removed.}$$

Let us now bound the number of samples needed to identify the optimal arm with probability $1 - \delta$. Once every suboptimal arm satisfies $\Delta_i \ge 4U(t,\delta_0)$, only $i^\star$ remains active; the last arm to be eliminated is the one with the smallest gap $\Delta_{i^\star} := \min_{i \neq i^\star} \Delta_i$, so the total number of rounds is driven by $\Delta_{i^\star}$, with probability $1 - \delta$.
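Under the example choice of $U(t,\delta)$ sketched earlier (our assumption, not the original definition), this threshold can be made explicit:

$$\Delta_i \ge 4U(t,\delta_0) \iff t \ge \frac{8\log\big(2t(t+1)/\delta_0\big)}{\Delta_i^2},$$

so each suboptimal arm $i$ is eliminated after at most $t_i = O\!\big(\Delta_i^{-2}\log\frac{k}{\delta\Delta_i}\big)$ rounds, and the optimal arm is the last one left after roughly $\Delta_{i^\star}^{-2}\log\frac{k}{\delta\Delta_{i^\star}}$ rounds. Since every active arm is pulled at each round, the total number of samples is of order at most $\sum_{i\neq i^\star}\Delta_i^{-2}\log\frac{k}{\delta\Delta_i}$, with probability $1-\delta$.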

3           Bernoulli Bandits
UCB and KL-UCB algorithms for Bernoulli bandits have been implemented in Python, using the NumPy and Matplotlib libraries. The expected regret of both algorithms is plotted in Figure 1 for k = 2 Bernoulli arms with means µ1 ∈ {0.1, 0.5, 0.9} and µ2 = 0.5 + ∆, ∆ ∈ [−0.5, 0.5].

First, one observes that both algorithms always have a regret of 0 when µ1 = µ2 (corresponding to ∆ = 0 in (a), ∆ = −0.4 in (b) and ∆ = 0.4 in (c)). This is explained by both arms having the same expected value: they are equally good on average, so whichever arm is chosen, no regret is incurred.

Then, both algorithms perform worst when µ1 ≈ µ2 but µ1 ≠ µ2. This is straightforward to understand: when the expected values are very close, it is harder, or at least it takes longer, to distinguish the arms, so the algorithm picks the wrong arm more often, which increases the regret.

Finally, one can see in Figure 1 that the KL-UCB algorithm performs better than UCB. Although for µ1 = 0.5 KL-UCB and UCB have rather similar performance when µ2 remains close to µ1, KL-UCB performs significantly better when µ1 ∈ {0.1, 0.9}. The Kullback-Leibler divergence is indeed better at distinguishing two Bernoulli distributions whose means are close to 0 or to 1. This is why KL-UCB outperforms UCB in Figure 1 (b) and (c) when µ1 ≈ µ2, whereas it performs similarly to UCB in (a), since the KL divergence brings little advantage when both means are around 0.5.
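For reference, here is a minimal sketch of how the KL-UCB index can be computed for Bernoulli arms (our illustration; the function names and the constant in the exploration term are assumptions, not the exact implementation used for Figure 1):

import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for numerical stability.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, c=0.0, n_iter=30):
    # Largest q >= mu_hat such that n_pulls * KL(mu_hat, q) <= log(t) + c*log(log(t)),
    # found by bisection (the KL is increasing in q on [mu_hat, 1]).
    target = (np.log(max(t, 2)) + c * np.log(np.log(max(t, 3)))) / n_pulls
    lo, hi = mu_hat, 1.0
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        if bernoulli_kl(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo

At each round t, KL-UCB then pulls the arm maximizing kl_ucb_index over the arms, just as UCB pulls the arm maximizing the sample mean plus its exploration bonus.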


[Figure 1: three panels showing expected regret as a function of ∆ ∈ [−0.50, 0.50], for µ1 = 0.5 (a), µ1 = 0.1 (b) and µ1 = 0.9 (c).]

Figure 1: Expected regret after n = 10000 steps for Bernoulli bandits with k = 2 arms and means µ1 = 0.5 in (a), 0.1 in (b) and 0.9 in (c), and µ2 = 0.5+∆ with ∆ ∈ [−0.5,0.5]. The plots were averaged over 50 runs for each ∆.
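A compact sketch of the kind of experiment behind Figure 1 (our illustration, with UCB only, a coarse ∆ grid, and a standard UCB1-style bonus; the actual code also includes KL-UCB):

import numpy as np

def run_ucb(mus, n_steps, rng):
    # One UCB run on Bernoulli arms; returns the expected (pseudo-)regret.
    k = len(mus)
    counts = np.zeros(k)
    sums = np.zeros(k)
    regret = 0.0
    best = max(mus)
    for t in range(1, n_steps + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            means = sums / counts
            bonus = np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        sums[arm] += rng.random() < mus[arm]
        counts[arm] += 1
        regret += best - mus[arm]
    return regret

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for delta in (-0.4, -0.1, 0.0, 0.1, 0.4):
        mus = [0.5, 0.5 + delta]
        avg = np.mean([run_ucb(mus, n_steps=10_000, rng=rng) for _ in range(50)])
        print(f"delta={delta:+.1f}: expected regret ~ {avg:.1f}")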

4           Regret Minimization in RL
We consider a finite-horizon MDP $M^\star = (S, A, p_h, r_h)$ with stage-dependent transitions and rewards.

We define the event $E = \{\forall k,\ M^\star \in \mathcal{M}_k\}$, where $\mathcal{M}_k$ is the set of MDPs $M = (S, A, p_{h,k}, r_{h,k})$ whose rewards $r_{h,k}(s,a)$ and transitions $p_{h,k}(\cdot|s,a)$ lie in the confidence intervals built from the data available at episode $k$. Let us define confidence intervals $\beta^r_{h,k}(s,a)$ and $\beta^p_{h,k}(s,a)$ as a function of $\delta$ such that $\mathbb{P}(\neg E) \le \delta/2$.

Let us choose $\beta^r_{h,k}(s,a)$ and $\beta^p_{h,k}(s,a)$ shrinking with the number of visits $N_{h,k}(s,a)$, calibrated so that concentration inequalities and a union bound over all tuples $(s,a,h,k)$ give the desired probability (one concrete choice is sketched below). Therefore, $\mathbb{P}(\neg E) \le \delta/2$.
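One concrete choice (our assumption; the original expressions are not recoverable from the text) is of Hoeffding / $L_1$-deviation type:

$$\beta^r_{h,k}(s,a) = \sqrt{\frac{\log\big(8SAHK/\delta\big)}{2\max(1, N_{h,k}(s,a))}}, \qquad \beta^p_{h,k}(s,a) = \sqrt{\frac{2S\,\log\big(8SAHK/\delta\big)}{\max(1, N_{h,k}(s,a))}},$$

so that Hoeffding's inequality (for the rewards) and a Weissman-type bound on $\|\hat{p}_{h,k}(\cdot|s,a) - p_h(\cdot|s,a)\|_1$ (for the transitions), combined with a union bound over the $SAHK$ tuples $(s,a,h,k)$, give $\mathbb{P}(\neg E) \le \delta/2$ (assuming $\delta \le 1$).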

Assume the event $E$ holds and let $b_{h,k}(s,a)$ be a bonus to be defined. We set

$$Q_{h,k}(s,a) = \hat{r}_{h,k}(s,a) + b_{h,k}(s,a) + \sum_{s'} \hat{p}_{h,k}(s' \mid s,a)\, V_{h+1,k}(s')$$

Let us prove by backward induction that $\forall h,s,a,k,\ Q_{h,k}(s,a) \ge Q^\star_h(s,a)$.

Induction step

Let $h \in [1, H-1]$ and let us assume the following: $\forall s,a,k,\ Q_{h+1,k}(s,a) \ge Q^\star_{h+1}(s,a)$ (inductive assumption).

Indeed, the induction step works if $b_{h,k}(s,a)$ is chosen large enough to compensate, under the event $E$, the estimation errors of $\hat{r}_{h,k}(s,a)$ and of $\sum_{s'}\hat{p}_{h,k}(s'|s,a)V_{h+1,k}(s')$ with respect to their true counterparts. Let us define $b_{h,k}(s,a)$ accordingly to ensure that $Q_{h,k}$ is optimistic. With this choice of $b_{h,k}(s,a)$ and the inductive assumption, we get $Q_{h,k}(s,a) \ge Q^\star_h(s,a)$.

The induction step is now proved.
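A bonus compatible with this argument (our reconstruction, under the confidence widths assumed above) is $b_{h,k}(s,a) = \beta^r_{h,k}(s,a) + H\,\beta^p_{h,k}(s,a)$. Indeed, under $E$ and the inductive assumption (which implies $V_{h+1,k} \ge V^\star_{h+1}$ because $V^\star_{h+1} \le H$),

$$Q_{h,k}(s,a) - Q^\star_h(s,a) = \underbrace{\big(\hat{r}_{h,k} - r_h\big)(s,a) + \beta^r_{h,k}(s,a)}_{\ge 0 \text{ under } E} + \underbrace{\sum_{s'}\big(\hat{p}_{h,k} - p_h\big)(s'|s,a)\,V^\star_{h+1}(s') + H\,\beta^p_{h,k}(s,a)}_{\ge 0 \text{ under } E,\ \text{since } \|V^\star_{h+1}\|_\infty \le H} + \underbrace{\sum_{s'}\hat{p}_{h,k}(s'|s,a)\big(V_{h+1,k} - V^\star_{h+1}\big)(s')}_{\ge 0 \text{ by the inductive assumption}} \;\ge\; 0.$$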

Base case
Since we are under the event E, we have:

$$\hat{r}_{H,k}(s,a) + b_{H,k}(s,a) \ge \hat{r}_{H,k}(s,a) + \beta^r_{H,k}(s,a) \ge r_H(s,a).$$

Then, $V_{H+1,k}(\cdot) = 0 = V^\star_{H+1}(\cdot)$. Therefore, $Q_{H,k}(s,a) = \hat{r}_{H,k}(s,a) + b_{H,k}(s,a) \ge r_H(s,a) = Q^\star_H(s,a)$. The base case is proven.

Combining the base case and the inductive step gives us:

$$\forall h,s,a,k, \quad Q_{h,k}(s,a) \ge Q^\star_h(s,a)$$
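To make the construction concrete, here is a minimal sketch (ours, not the graded implementation) of the optimistic backward induction for one episode, assuming tabular estimates and a bonus of the shape used above; the bonus_scale constant is an assumption:

import numpy as np

def optimistic_planning(r_hat, p_hat, n_visits, H, S, A, bonus_scale=1.0):
    """Backward induction computing the optimistic Q and the greedy policy.

    r_hat:    (H, S, A) estimated mean rewards
    p_hat:    (H, S, A, S) estimated transition probabilities
    n_visits: (H, S, A) visit counts N_{h,k}(s, a)
    """
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 2, S))            # V[H+1] = 0 by convention
    policy = np.zeros((H + 1, S), dtype=int)
    for h in range(H, 0, -1):
        bonus = bonus_scale * H * np.sqrt(1.0 / np.maximum(1, n_visits[h - 1]))
        Q[h] = r_hat[h - 1] + bonus + p_hat[h - 1] @ V[h + 1]
        V[h] = np.minimum(H, Q[h].max(axis=1))   # clipping at H, as in the text
        policy[h] = Q[h].argmax(axis=1)
    return Q, V, policy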

The aim in this question is to prove the following:

$$\delta_{1,k}(s_{1,k}) \;\le\; \sum_{h=1}^{H} \Big( Q_{h,k}(s_{h,k},a_{h,k}) - r(s_{h,k},a_{h,k}) - \mathbb{E}_{Y\sim p(\cdot|s_{h,k},a_{h,k})}[V_{h+1,k}(Y)] + m_{h,k} \Big) \qquad (3)$$

where $\delta_{h,k}(s) = V_{h,k}(s) - V^{\pi_k}_h(s)$ and $m_{h,k} = \mathbb{E}_{Y\sim p(\cdot|s_{h,k},a_{h,k})}[\delta_{h+1,k}(Y)] - \delta_{h+1,k}(s_{h+1,k})$.

Let us first show how one step of the recursion is obtained. Since $a_{h,k} = \pi_k(s_{h,k})$, the Bellman equation for $\pi_k$ gives $V^{\pi_k}_h(s_{h,k}) = r(s_{h,k},a_{h,k}) + \mathbb{E}_{Y\sim p(\cdot|s_{h,k},a_{h,k})}[V^{\pi_k}_{h+1}(Y)]$. Therefore,

$$\delta_{h,k}(s_{h,k}) = V_{h,k}(s_{h,k}) - r(s_{h,k},a_{h,k}) - \mathbb{E}_{Y}[V_{h+1,k}(Y)] + \mathbb{E}_{Y}[\delta_{h+1,k}(Y)].$$

Let us prove that $V_{h,k}(s_{h,k}) \le Q_{h,k}(s_{h,k},a_{h,k})$:

$$V_{h,k}(s_{h,k}) = \min\Big\{H,\ \max_{a'} Q_{h,k}(s_{h,k},a')\Big\} \le \max_{a'} Q_{h,k}(s_{h,k},a') \le Q_{h,k}(s_{h,k},a_{h,k}),$$

the last step holding because $a_{h,k}$ is the greedy action with respect to $Q_{h,k}(s_{h,k},\cdot)$, so the maximum is attained at $a_{h,k}$. Therefore, $V_{h,k}(s_{h,k}) \le Q_{h,k}(s_{h,k},a_{h,k})$.

Let us now prove equation (3). Combining the two previous points, $V_{h,k}(s_{h,k}) \le Q_{h,k}(s_{h,k},a_{h,k})$ and $\mathbb{E}_{Y}[\delta_{h+1,k}(Y)] = m_{h,k} + \delta_{h+1,k}(s_{h+1,k})$, and unrolling the recursion from $h = 1$ to $h = H$ (with $\delta_{H+1,k} = 0$), we obtain

$$\delta_{1,k}(s_{1,k}) \le \sum_{h=1}^{H} \Big( Q_{h,k}(s_{h,k},a_{h,k}) - r(s_{h,k},a_{h,k}) - \mathbb{E}_{Y\sim p(\cdot|s_{h,k},a_{h,k})}[V_{h+1,k}(Y)] + m_{h,k} \Big),$$

which is exactly equation (3).

Let us show that, with probability $1-\delta$, $R(T) \le \sum_{k,h} b_{h,k}(s_{h,k},a_{h,k}) + 2H\sqrt{KH\log(2/\delta)}$.

$$\begin{aligned}
R(T) &= \sum_{k=1}^{K} V^\star_1(s_{1,k}) - V^{\pi_k}_1(s_{1,k}) \\
&= \sum_{k=1}^{K} \big(V^\star_1(s_{1,k}) - V_{1,k}(s_{1,k})\big) + \sum_{k=1}^{K} \big(V_{1,k}(s_{1,k}) - V^{\pi_k}_1(s_{1,k})\big) \\
&= \sum_{k=1}^{K} \big(-\delta^\star_{1,k}(s_{1,k}) + \delta_{1,k}(s_{1,k})\big) \\
&\le \sum_{k=1}^{K} \delta_{1,k}(s_{1,k}) \qquad \text{(since } V_{1,k} \ge V^\star_1 \ge V^{\pi_k}_1\text{)} \\
&\le \sum_{k=1}^{K}\sum_{h=1}^{H} \Big( Q_{h,k}(s_{h,k},a_{h,k}) - r(s_{h,k},a_{h,k}) - \mathbb{E}_{Y\sim p(\cdot|s_{h,k},a_{h,k})}[V_{h+1,k}(Y)] + m_{h,k} \Big) \\
&= \sum_{k=1}^{K}\sum_{h=1}^{H} m_{h,k} + \sum_{k=1}^{K}\sum_{h=1}^{H} \Big( Q_{h,k}(s_{h,k},a_{h,k}) - r(s_{h,k},a_{h,k}) - \mathbb{E}_{Y}[V_{h+1,k}(Y)] \Big) \\
&\le \sum_{k=1}^{K}\sum_{h=1}^{H} m_{h,k} + \sum_{k=1}^{K}\sum_{h=1}^{H} \Big( Q_{h,k}(s_{h,k},a_{h,k}) - r(s_{h,k},a_{h,k}) - \mathbb{E}_{Y}[V^\star_{h+1}(Y)] \Big) \\
&= \sum_{k=1}^{K}\sum_{h=1}^{H} m_{h,k} + \sum_{k=1}^{K}\sum_{h=1}^{H} \Big( Q_{h,k}(s_{h,k},a_{h,k}) - Q^\star_h(s_{h,k},a_{h,k}) \Big)
\end{aligned}$$

where $\delta^\star_{1,k}(s) = V_{1,k}(s) - V^\star_1(s)$.

The first sum is bounded by Azuma's inequality with probability $1 - \delta/2$, whereas the second one is bounded by the bonuses under the event $E$, which holds with probability $1 - \delta/2$. Therefore, with probability $1 - \delta$, we have:

$$R(T) \le \sum_{k=1}^{K}\sum_{h=1}^{H} b_{h,k}(s_{h,k},a_{h,k}) + 2H\sqrt{KH\log(2/\delta)}.$$
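For completeness, the Azuma step can be spelled out as follows (our reconstruction): conditionally on the past, $m_{h,k} = \mathbb{E}_{Y\sim p(\cdot|s_{h,k},a_{h,k})}[\delta_{h+1,k}(Y)] - \delta_{h+1,k}(s_{h+1,k})$ has zero mean, so $(m_{h,k})_{h,k}$ is a martingale difference sequence of $KH$ terms, each bounded in absolute value by $2H$ because $0 \le V_{h+1,k}, V^{\pi_k}_{h+1} \le H$ implies $|\delta_{h+1,k}| \le H$. The Azuma-Hoeffding inequality then bounds $\sum_{k,h} m_{h,k}$ by a term of order $H\sqrt{KH\log(1/\delta)}$ with probability $1 - \delta/2$, which is the second term in the bound above (up to the exact constant).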

Finally, let us show that $R(T) \lesssim H^2 S\sqrt{AK}$. The remaining work is to bound the sum of the bonuses $\sum_{k,h} b_{h,k}(s_{h,k},a_{h,k})$; this is done with a pigeonhole argument on the visit counts together with Jensen's inequality, as sketched below.
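Under the bonus shape assumed above, $b_{h,k}(s,a) \lesssim H\sqrt{S\log(SAHK/\delta)/\max(1,N_{h,k}(s,a))}$, the sum of the bonuses can be bounded as follows. For each stage $h$, ordering the visits of each pair $(s,a)$,

$$\sum_{k=1}^{K} \frac{1}{\sqrt{\max(1,N_{h,k}(s_{h,k},a_{h,k}))}} \;\lesssim\; \sum_{s,a} \sum_{n=1}^{N_{h,K+1}(s,a)} \frac{1}{\sqrt{n}} \;\le\; \sum_{s,a} 2\sqrt{N_{h,K+1}(s,a)} \;\le\; 2\sqrt{SA\sum_{s,a} N_{h,K+1}(s,a)} \;=\; 2\sqrt{SAK},$$

where the next-to-last step is Jensen's (or Cauchy-Schwarz's) inequality and $\sum_{s,a} N_{h,K+1}(s,a) = K$ for every $h$. Summing over the $H$ stages and multiplying by the prefactor $H\sqrt{S\log(SAHK/\delta)}$ gives

$$\sum_{k,h} b_{h,k}(s_{h,k},a_{h,k}) \;\lesssim\; H\sqrt{S\log(SAHK/\delta)}\cdot H\cdot 2\sqrt{SAK} \;\lesssim\; H^2 S\sqrt{AK}$$

up to logarithmic factors, and this term dominates the Azuma term.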

Let us conclude on the regret.



Collecting the two contributions, the regret is bounded by $c \cdot f(H,S,A,K)$, where $c$ is a constant (depending on $\delta$ only through logarithmic factors) and $f(H,S,A,K) \lesssim H^2 S\sqrt{AK}$. Therefore, we find the regret upper bound:

$$R(T) \;\lesssim\; H^2 S\sqrt{AK}$$
