CS234 Assignment 3

Please review any additional instructions posted on the assignment page. When you are ready to submit, follow the instructions on the course website.
1        Policy Gradient Methods (50 pts coding + 15 pts writeup)
The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments, and to implement a neural network baseline for variance reduction. The framework for the vanilla policy gradient algorithm is set up in the starter code pg.py, and everything that you need to implement is in this file. The file has detailed instructions for each implementation task, but an overview of the key steps in the algorithm is provided here. For this assignment you need to have MuJoCo installed; please follow the installation guide.

REINFORCE
Recall the vanilla policy-gradient theorem,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a)\right]$$

REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns $G_t$ as unbiased estimates of $Q^{\pi_\theta}(s,a)$. The gradient update can then be expressed as maximizing the following objective function:

$$J(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t)\, G_t$$

where $D$ is the set of all trajectories collected by policy $\pi_\theta$, $\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)$ is a trajectory, and $G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the discounted return from time step $t$.
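To make the Monte Carlo return concrete, here is a minimal NumPy sketch of computing $G_t$ for a single trajectory. This is an illustration only, not the starter code's implementation; the function name and the list-of-rewards input format are assumptions.

import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * r_{t+1} + ... + gamma^(T-t) * r_T
    for one trajectory, where `rewards` is [r_0, ..., r_T]."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate from the end of the trajectory backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

The get_returns function in pg.py asks for essentially this computation, applied to every trajectory in the sampled batch.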

Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo estimated return $G_t$ can have high variance. To reduce variance, we subtract a baseline $b_\phi(s)$ from the estimated returns when computing the policy gradient. A good baseline is the state value function parametrized by $\phi$, $b_\phi(s) = V^{\pi_\theta}(s)$, which requires a training update to $\phi$ to minimize the following mean-squared error loss:

$$L_{\text{MSE}}(\phi) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t} \left( b_\phi(s_t) - G_t \right)^2$$



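For intuition about the baseline update, here is a minimal sketch that fits a baseline by least squares against the sampled returns. It uses a linear baseline purely for illustration; the assignment asks for a neural-network baseline trained by gradient steps on the same mean-squared-error objective, and the function names below are assumptions rather than the starter code's API.

import numpy as np

def fit_linear_baseline(observations, returns):
    """Least-squares fit of a linear baseline b_phi(s) = phi^T s to the
    Monte Carlo returns G_t (an illustrative stand-in for the
    neural-network baseline update in pg.py)."""
    X = np.asarray(observations, dtype=np.float64)  # shape (N, obs_dim)
    y = np.asarray(returns, dtype=np.float64)       # shape (N,)
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi

def baseline_values(observations, phi):
    """Evaluate the fitted baseline at a batch of observations."""
    return np.asarray(observations, dtype=np.float64) @ phi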

Advantage Normalization
After subtracting the baseline, we get the following new objective function:

$$J(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t=0}^{T} \log \pi_\theta(a_t|s_t)\, \hat{A}_t$$

where

$$\hat{A}_t = G_t - b_\phi(s_t)$$

A second variance reduction technique is to normalize the computed advantages, Aˆt, so that they have mean 0 and standard deviation 1. From a theoretical perspective, we can consider centering the advantages to be simply adjusting the advantages by a constant baseline, which does not change the policy gradient. Likewise, rescaling the advantages effectively changes the learning rate by a factor of 1/σ, where σ is the standard deviation of the empirical advantages.
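A minimal sketch of this normalization step (the function name and the small epsilon guard are illustrative assumptions, not part of the starter code):

import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Shift and rescale advantage estimates to mean 0 and standard
    deviation 1. The eps term guards against division by zero when all
    advantages in the batch are (nearly) identical."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)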

1.1       Coding Questions (50 pts)
The functions that you need to implement in pg.py are enumerated here. Detailed instructions for each function can be found in the comments in pg.py. We strongly encourage you to look at pg.py and understand the code structure first.

•   build_mlp (see the sketch after this list)

•   add_placeholders_op

•   build_policy_network_op

•   add_loss_op

•   add_optimizer_op

•   add_baseline_op

•   get_returns

•   calculate_advantage

•   update_baseline
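For reference, here is a hedged sketch of what a multi-layer perceptron builder such as build_mlp can look like under the TensorFlow 1.x API that the placeholder- and op-based naming above suggests. The argument names and defaults here are assumptions; follow the signature and detailed instructions documented in pg.py.

import tensorflow as tf  # TensorFlow 1.x style API

def build_mlp(mlp_input, output_size, scope, n_layers=2, size=64,
              output_activation=None):
    """A stack of fully connected ReLU layers followed by an output layer
    of the requested size, built inside the given variable scope."""
    with tf.variable_scope(scope):
        out = mlp_input
        for _ in range(n_layers):
            out = tf.layers.dense(out, size, activation=tf.nn.relu)
        return tf.layers.dense(out, output_size,
                               activation=output_activation)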

1.2       Writeup Questions (15 pts)
(a)  (4 pts) (CartPole-v0) Test your implementation on the CartPole-v0 environment by running



With the given configuration file config.py, the average reward should reach 200 within 100 iterations. NOTE: training may repeatedly converge to 200 and diverge. Your plot does not have to reach 200 and stay there. We only require that you achieve a perfect score of 200 sometime during training.

Include in your writeup the tensorboard plot for the average reward. Start tensorboard with:



and then navigate to the link it gives you. Click on the “SCALARS” tab to view the average reward graph.

Now, test your implementation on the CartPole-v0 environment without baseline by running



Include the tensorboard plot for the average reward. Do you notice any difference? Explain.

(b)  (4 pts) (InvertedPendulum-v1) Test your implementation on the InvertedPendulum-v1 environment by running



With the given configuration file config.py, the average reward should reach 1000 within 100 iterations. NOTE: Again, we only require that you reach 1000 sometime during training.

Include the tensorboard plot for the average reward in your writeup.

Now, test your implementation on the InvertedPendulum-v1 environment without baseline by running



Include the tensorboard plot for the average reward. Do you notice any difference? Explain.

(c)  (7 pts) (HalfCheetah-v1) Test your implementation on the HalfCheetah-v1 environment with γ = 0.9 by running



With the given configuration file config.py, the average reward should reach 200 within 100 iterations. NOTE: There is some variance in training. You can run multiple times and report the best or average results. We have provided our results (average reward), averaged over 6 different random seeds, in Figure 1. Include the tensorboard plot for the average reward in your writeup.

Now, test your implementation on the HalfCheetah-v1 environment without baseline by running



Include the tensorboard plot for the average reward. Do you notice any difference? Explain.



Figure 1: Half Cheetah, averaged over 6 runs

2        Best Arm Identification in a Multi-armed Bandit (35 pts)
In this problem we focus on the Bandit setting with rewards bounded in [0,1]. A Bandit problem instance is defined as an MDP with just one state and action set A. Since there is only one state, a “policy” consists of the choice of a single action: there are exactly A = |A| different deterministic policies. Your goal is to design a simple algorithm to identify a near-optimal arm with high probability.

Imagine we have $n$ samples of a random variable $x$, $\{x_1, \ldots, x_n\}$. We recall Hoeffding's inequality below, where $\bar{x}$ is the expected value of the random variable, $\hat{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean (under the assumption that the random variables are in the interval $[0,1]$), $n$ is the number of samples and $\delta > 0$ is a scalar:

$$\Pr\left( |\hat{x} - \bar{x}| > \sqrt{\frac{\log(2/\delta)}{2n}} \right) \leq \delta$$

Assuming that the rewards are bounded in $[0,1]$, we propose this simple strategy: allocate an identical number of samples $n_1 = n_2 = \cdots = n_A = n_{des}$ to every action, compute the average reward (empirical payout) $\hat{r}_a$ of each arm, and return the action with the highest empirical payout, $\operatorname{argmax}_a \hat{r}_a$. The purpose of this exercise is to study the number of samples required to output an arm that is at least $\epsilon$-optimal with high probability. Intuitively, as $n_{des}$ increases, the empirical payout $\hat{r}_a$ converges to its expected value $r_a$ for every action $a$, and so choosing the arm with the highest empirical payout $\hat{r}_a$ corresponds to approximately choosing the arm with the highest expected payout $r_a$.
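As a concrete illustration of this strategy, here is a small simulation sketch. The Bernoulli arm payouts, function name, and parameters below are hypothetical examples, not part of the assignment.

import numpy as np

def best_arm_uniform(true_means, n_des, delta, rng=None):
    """Pull each arm n_des times, return the index of the arm with the
    highest empirical payout together with the per-arm Hoeffding
    deviation bound sqrt(log(2/delta) / (2 * n_des))."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(true_means, dtype=np.float64)
    # Bernoulli rewards in [0, 1] with the given expected payouts.
    rewards = rng.random((len(means), n_des)) < means[:, None]
    empirical_payouts = rewards.mean(axis=1)
    bound = np.sqrt(np.log(2.0 / delta) / (2.0 * n_des))
    return int(np.argmax(empirical_payouts)), empirical_payouts, bound

# Example: 4 arms, 500 pulls each, per-arm failure probability 0.05.
arm, payouts, bound = best_arm_uniform([0.2, 0.5, 0.55, 0.4], n_des=500, delta=0.05)
print(arm, payouts.round(3), round(bound, 3))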

(a)  (15 pts) We start by defining a good event. Under this good event, the empirical payout of each arm is not too far from its expected value. Starting from Hoeffding's inequality with $n_{des}$ samples allocated to every action, show that:

$$\Pr\left( \exists a \in \mathcal{A} \;:\; |\hat{r}_a - r_a| > \sqrt{\frac{\log(2/\delta)}{2 n_{des}}} \right) \leq A\delta$$

In other words, the bad event is that at least one arm has an empirical mean that differs significantly from its expected value, and this event has probability at most $A\delta$.

(b)  (20 pts) After pulling each arm (action) $n_{des}$ times, our algorithm returns the arm with the highest empirical payout:

$$a^{\dagger} = \operatorname{argmax}_a \hat{r}_a$$

Notice that $a^{\dagger}$ is a random variable. Define $a^{\star}$ as the optimal arm (that yields the highest average reward, $a^{\star} = \operatorname{argmax}_a r_a$). Suppose that we want our algorithm to return at least an $\epsilon$-optimal arm with probability $1 - \delta_0$, as follows:

$$\Pr\left( r_{a^{\dagger}} \geq r_{a^{\star}} - \epsilon \right) \geq 1 - \delta_0$$

How many samples are needed to ensure this? Express your result as a function of the number of actions $A$, the required precision $\epsilon$, and the failure probability $\delta_0$.
