CS234 Assignment #3

1           Policy Gradient Methods (50 pts coding + 15 pts writeup)
The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments and to implement a neural network baseline for variance reduction. The framework for the policy gradient algorithm is set up in main.py, and everything that you need to implement is in the files network_utils.py, policy.py, policy_gradient.py, and baseline_network.py. These files contain detailed instructions for each implementation task, but an overview of the key steps in the algorithm is provided here.

1.1         REINFORCE
Recall the policy gradient theorem,

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right] \]

REINFORCE is a Monte Carlo policy gradient algorithm, so we will use the sampled returns G_t as unbiased estimates of Q^{π_θ}(s,a). The REINFORCE estimator can be expressed as the gradient of the following objective function:

\[ J(\theta) = \frac{1}{|D|} \sum_{\tau^i \in D} \sum_{t} \log \pi_\theta(a^i_t \mid s^i_t)\, G^i_t \]

where D is the set of all trajectories collected by policy π_θ, and τ^i = (s^i_0, a^i_0, r^i_0, s^i_1, \ldots) is trajectory i.
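
As a concrete illustration only (not the interface required by the starter code), the estimator above can be implemented as a surrogate loss over the flattened batch of log-probabilities and returns; the function name and tensor layout below are assumptions.

    import torch

    def reinforce_surrogate_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
        """Negative surrogate objective whose gradient is the REINFORCE estimator.

        log_probs: log pi_theta(a_t | s_t) for every timestep in the flattened batch
        returns:   the corresponding Monte Carlo returns G_t (treated as constants)
        """
        # Minimizing this loss performs gradient ascent on J(theta).
        return -(log_probs * returns.detach()).mean()

Averaging over the flattened batch rescales the objective by a constant relative to the 1/|D| normalization above, which only changes the effective learning rate.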

1.2         Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled returns G_t can have high variance. To reduce variance, we subtract a baseline b_φ(s) from the estimated returns when computing the policy gradient. A good baseline is the state value function, V^{π_θ}(s); approximating it requires training the baseline by updating φ to minimize the following mean-squared error loss:

\[ L_{\mathrm{MSE}}(\phi) = \frac{1}{|D|} \sum_{\tau^i \in D} \sum_{t} \left( b_\phi(s^i_t) - G^i_t \right)^2 \]
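
For concreteness, here is a minimal sketch of such a baseline update, assuming a PyTorch setup; the network architecture, optimizer, and observation size (obs_dim) are illustrative assumptions rather than the assignment's specification.

    import torch
    import torch.nn as nn

    obs_dim = 4  # assumed observation size (e.g. CartPole)

    # A small value network b_phi(s) used as the baseline.
    baseline = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(baseline.parameters(), lr=1e-3)

    def baseline_update(observations: torch.Tensor, returns: torch.Tensor) -> None:
        """Take one gradient step on the MSE between b_phi(s_t) and the returns G_t."""
        predictions = baseline(observations).squeeze(-1)
        loss = nn.functional.mse_loss(predictions, returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()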

1.3         Advantage Normalization
After subtracting the baseline, we get the following new objective function:

\[ J(\theta) = \frac{1}{|D|} \sum_{\tau^i \in D} \sum_{t} \log \pi_\theta(a^i_t \mid s^i_t)\, \hat{A}^i_t \]

where

\[ \hat{A}^i_t = G^i_t - b_\phi(s^i_t). \]

A second variance reduction technique is to normalize the computed advantages \hat{A}^i_t so that they have mean 0 and standard deviation 1. From a theoretical perspective, centering the advantages amounts to shifting them by a constant baseline, which does not change the policy gradient. Likewise, rescaling the advantages effectively rescales the learning rate by a factor of 1/σ, where σ is the standard deviation of the empirical advantages.
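
A minimal sketch of this normalization step (the function name is assumed, not prescribed by the starter code):

    import torch

    def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Shift and rescale the advantages to mean 0 and standard deviation 1."""
        return (advantages - advantages.mean()) / (advantages.std() + eps)

The small eps guards against division by zero when the batch of advantages is nearly constant.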

1.4         Coding Questions (50 pts)
The functions that you need to implement in network_utils.py, policy.py, policy_gradient.py, and baseline_network.py are enumerated here. Detailed instructions for each function can be found in the comments in each of these files. (An illustrative sketch of build_mlp is given after the list below.)

Note: The "batch size" for all of the arguments is Σ_i T_i, since we have already flattened out all the episode observations, actions, and rewards for you.

In network_utils.py,

build_mlp

In policy.py,

act
action_distribution
__init__
std
action_distribution

In policy_gradient.py,

init_policy
get_returns
normalize_advantage
update_policy

In baseline_network.py,

__init__
forward
calculate_advantage
update_baseline
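
As noted above, here is a rough sketch of what build_mlp might look like, assuming a PyTorch starter code; the argument names (input_size, output_size, n_layers, size) and the choice of ReLU activations are assumptions, and the authoritative specification is in network_utils.py.

    import torch.nn as nn

    def build_mlp(input_size: int, output_size: int, n_layers: int, size: int) -> nn.Module:
        """Build an MLP with n_layers hidden layers of width `size`, ReLU activations,
        and a final linear layer mapping to output_size."""
        layers = []
        in_dim = input_size
        for _ in range(n_layers):
            layers.append(nn.Linear(in_dim, size))
            layers.append(nn.ReLU())
            in_dim = size
        layers.append(nn.Linear(in_dim, output_size))
        return nn.Sequential(*layers)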
1.5         Testing
We have provided some basic tests to sanity check your implementation. Please note that the tests are not comprehensive, and passing them does not guarantee a correct implementation. Use the following command to run the tests:

You can also add additional tests of your own design in tests/test_basic.py.

1.6         Writeup Questions (15 pts)
(3 pts) To compute the REINFORCE estimator, you will need to calculate the values \{G_t\}_{t=1}^{T} (we drop the trajectory index i for simplicity), where

\[ G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}. \]

Naively, computing all these values takes O(T^2) time. Describe how to compute them in O(T) time.

(12 pts) The general form for running your policy gradient implementation is as follows:
if not using a baseline, or

if using a baseline. Here ENV should be cartpole, pendulum, or cheetah, and SEED should be a positive integer.

For each of the 3 environments, choose 3 random seeds and run the algorithm both without baseline and with baseline. Then plot the results using

where SEEDS should be a comma-separated list of seeds which you want to plot (e.g. --seeds 1,2,3). Please include the plots (one for each environment) in your writeup, and comment on whether or not you observe improved performance when using a baseline.

To receive full credit, we have the following performance expectations:

cartpole: Should reach the max reward of 200 (although it may not stay there)
pendulum: Should reach the max reward of 1000 (although it may not stay there)
cheetah: Should reach at least 200 (Could be as large as 950)
2           Reducing Variance in Policy Gradient Methods (35 pts)
In class, we explored REINFORCE as a policy gradient method with no bias but high variance. In this problem, we will explore methods to dramatically reduce variance in policy gradient methods, potentially at the cost of increased bias.

Let us consider an infinite-horizon MDP M = ⟨S, A, R, T, γ⟩. Let us define the advantage function

\[ A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t). \]

An approximation to the policy gradient is defined as

\[ g = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}} \left[ \sum_{t=0}^{\infty} A^{\pi}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \tag{1} \]
where the colon notation a : b represents the range [a,a + 1,a + 2,...b] inclusive of both ends.
(5 pts) Let us define the partial sum R_t = \sum_{t'=0}^{t} \gamma^{t'} r_{t'}. Show that it is not necessarily true that Var(R_{t+1}) ≥ Var(R_t). [Hint: Construct a counterexample MDP where this statement does not hold.]
(10 pts) Prove that Var(R_{t+1}) ≥ Var(R_t) does hold if we assume that r_{t+1} is, on average, positively correlated with the previous rewards, i.e. Cov(r_i, r_{t+1}) ≥ 0 for all 0 ≤ i ≤ t.
(5 pts) In practice, we do not have access to the true advantage function A^π(s_t, a_t), so we would like to obtain an estimate instead. We will consider the general form of an estimator \hat{A}_t(s_{0:\infty}, a_{0:\infty}) that can be a function of the entire trajectory.

Let \hat{A}_t(s_{0:\infty}, a_{0:\infty}) = \hat{Q}_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1}), where for all s_t, a_t we assume that \hat{Q}_t is an unbiased estimator of the true Q^π; namely,

\[ \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}} \left[ \hat{Q}_t(s_{t:\infty}, a_{t:\infty}) \right] = Q^{\pi}(s_t, a_t). \]

Note that b_t is an arbitrary function of the states and actions sampled before a_t. Prove that using this estimate \hat{A}_t yields an unbiased estimate of the policy gradient g; in other words, prove that

\[ \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}} \left[ \sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] = g. \]

(5 pts) We will now look at a few different variants of \hat{A}_t. Recall the TD error \delta_t^{\hat{V}}(s_t, a_t) = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t). If \hat{V} = V^{\pi}, prove that \delta_t^{V^{\pi}} is an unbiased estimate of the advantage A^{\pi}(s_t, a_t).
(5 pts) Let us define \hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^{\hat{V}}. Show that

\[ \hat{A}_t^{(k)} = -\hat{V}(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^{k} \hat{V}(s_{t+k}). \]

In general, how do the bias and variance change as k increases? (A few sentences of justification will suffice; no formal proof is necessary.)

(5 pts) Show that
