CS234 Assignment #3 Solution

These questions require thought, but do not require long answers. Please be as concise as possible.
Please review any additional instructions posted on the assignment page at http://cs234.stanford.edu/assignment3. When you are ready to submit, please follow the instructions on the course website.
1 Policy Gradient Methods (50 pts coding + 15 pts writeup)
The goal of this problem is to experiment with policy gradient and its variants, including variance reduction methods. Your goals will be to set up policy gradient for both continuous and discrete environments, and to implement a neural network baseline for variance reduction. The framework for the policy gradient algorithm is set up in main.py, and everything that you need to implement is in the files network_utils.py, policy_network.py and baseline_network.py. These files contain detailed instructions for each implementation task, but an overview of the key steps in the algorithm is provided here.
1.1 REINFORCE
Recall the policy gradient theorem,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right]$$
REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns $G_t$ as unbiased estimates of $Q^{\pi_\theta}(s,a)$. The gradient update can then be expressed as maximizing the following objective function:

$$J(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t} \log \pi_\theta(a_t \mid s_t)\, G_t$$

where $D$ is the set of all trajectories collected by policy $\pi_\theta$, and $\tau = (s_0, a_0, r_0, s_1, \ldots)$ is a trajectory.
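As a concrete illustration of how the sampled returns $G_t$ can be computed from one recorded trajectory, here is a minimal NumPy sketch; the function name and signature are illustrative and may differ from the starter code's get_returns:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute G_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for one trajectory.

    `rewards` is the observed reward sequence (r_0, ..., r_{T-1}) and
    `gamma` is the discount factor.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the last step backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three steps of reward 1 with gamma = 0.9
# discounted_returns([1.0, 1.0, 1.0], 0.9) -> [2.71, 1.9, 1.0]
```

In practice this objective is maximized by minimizing its negative, i.e. a surrogate loss proportional to the negative mean of $\log \pi_\theta(a_t \mid s_t)\, G_t$ over the sampled steps.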
1.2 Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled returns $G_t$ can have high variance. To reduce variance, we subtract a baseline $b_\phi(s)$ from the estimated returns when computing the policy gradient. A good baseline is the state value function, $V^{\pi_\theta}(s)$, which requires a training update to $\phi$ to minimize the following mean-squared error loss:

$$L_{\mathrm{MSE}}(\phi) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t} \left( b_\phi(s_t) - G_t \right)^2$$
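For intuition, here is a minimal sketch of one gradient step on this MSE loss for an illustrative linear baseline $b_\phi(s) = \phi^\top s$; the assignment's baseline_network.py fits a neural network instead, but the loss and the update have the same shape:

```python
import numpy as np

def update_linear_baseline(phi, states, returns, lr=1e-2):
    """One gradient-descent step on (1/N) * sum_t (b_phi(s_t) - G_t)^2 for an
    illustrative linear baseline b_phi(s) = phi^T s (a stand-in for the
    neural-network baseline used in the assignment)."""
    preds = states @ phi                                   # b_phi(s_t) for each visited state
    grad = (2.0 / len(returns)) * states.T @ (preds - returns)
    return phi - lr * grad
```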
1.3 Advantage Normalization
After subtracting the baseline, we get the following new objective function:

$$J(\theta) = \frac{1}{|D|} \sum_{\tau \in D} \sum_{t} \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

where

$$\hat{A}_t = G_t - b_\phi(s_t)$$
A second variance reduction technique is to normalize the computed advantages, $\hat{A}_t$, so that they have mean 0 and standard deviation 1. From a theoretical perspective, we can consider centering the advantages to be simply adjusting the advantages by a constant baseline, which does not change the policy gradient. Likewise, rescaling the advantages effectively changes the learning rate by a factor of $1/\sigma$, where $\sigma$ is the standard deviation of the empirical advantages.
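A minimal sketch of this normalization step follows; the small constant eps is an illustrative safeguard against division by zero and is not part of the assignment specification:

```python
import numpy as np

def normalize_advantage(advantages, eps=1e-8):
    """Shift and rescale the empirical advantages to mean 0 and standard
    deviation 1. eps guards against division by zero when every advantage
    in the batch is identical."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```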
1.4 Coding Questions (50 pts)
The functions that you need to implement in network_utils.py, policy_network.py and baseline_network.py are enumerated here. Detailed instructions for each function can be found in the comments in each of these files; an illustrative sketch of build_mlp appears after the list.
• build_mlp in network_utils.py
• add_placeholders_op in policy_network.py
• build_policy_network_op in policy_network.py
• add_loss_op in policy_network.py
• add_optimizer_op in policy_network.py
• get_returns in policy_network.py
• normalize_advantage in policy_network.py
• add_baseline_op in baseline_network.py
• calculate_advantage in baseline_network.py
• update_baseline in baseline_network.py
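As a rough illustration of the kind of code expected, here is a minimal sketch of a build_mlp-style helper, assuming the TensorFlow 1.x API suggested by the starter code's placeholder/op structure; the exact signature and default arguments in network_utils.py may differ:

```python
import tensorflow as tf

def build_mlp(mlp_input, output_size, scope, n_layers=2, size=64, output_activation=None):
    """Build a feed-forward network: n_layers hidden layers of `size` units
    with ReLU activations, followed by an output layer of `output_size`
    units with `output_activation` (None gives a linear output)."""
    with tf.variable_scope(scope):
        out = mlp_input
        for _ in range(n_layers):
            out = tf.layers.dense(out, size, activation=tf.nn.relu)
        out = tf.layers.dense(out, output_size, activation=output_activation)
    return out
```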
1.5 Writeup Questions (15 pts)
(a)(i) (2 pts) (CartPole-v0) Test your implementation on the CartPole-v0 environment by running the following command, using random seed 15:

Include in your writeup the tensorboard plot for the average reward. Start tensorboard with:

and then navigate to the link it gives you. Click on the “SCALARS” tab to view the average reward graph.
Now, test your implementation on the CartPole-v0 environment without baseline by running

Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(a)(ii) (2 pts) (CartPole-v0) Test your implementation on the CartPole-v0 environment by running the following command, using random seed 12345456:


Include in your writeup the tensorboard plot for the average reward. Start tensorboard as in part (a)(i) and then navigate to the link it gives you. Click on the “SCALARS” tab to view the average reward graph.
Now, test your implementation on the CartPole-v0 environment without baseline by running

Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(b)(i) (2 pts) (InvertedPendulum-v1) Test your implementation on the InvertedPendulum-v1 environment by running

With the given configuration file, the average reward should reach 1000 within 100 iterations. NOTE: Again, we only require that you reach 1000 sometime during training. Include the tensorboard plot for the average reward in your writeup.
Now, test your implementation on the InvertedPendulum-v1 environment without baseline by running the following command with seed 15:

Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(b)(ii) (2 pts) (InvertedPendulum-v1) Test your implementation on the InvertedPendulum-v1 environment by running

With the given configuration file, the average reward should reach 1000 within 100 iterations. NOTE: Again, we only require that you reach 1000 sometime during training. Include the tensorboard plot for the average reward in your writeup.
Now, test your implementation on the InvertedPendulum-v1 environment without baseline by running the following command with seed 8:

Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(c)(i) (2 pts) (HalfCheetah-v1) Test your implementation on the HalfCheetah-v1 environment with γ = 0.9 by running the following command for seed 123

With the given configuration file, the average reward should reach 200 within 100 iterations. NOTE: Again, we only require that you reach 200 sometime during training. There is some variance in training. You can run multiple times and report the best results. Include the tensorboard plot for the average reward in your writeup.
Now, test your implementation on the HalfCheetah-v1 environment without baseline by running the following command with seed 123:

Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(c)(ii) (2 pts) (HalfCheetah-v1) Test your implementation on the HalfCheetah-v1 environment with γ = 0.9 by running the following command for seed 15

With the given configuration file, the average reward should reach 200 within 100 iterations. NOTE: Again, we only require that you reach 200 sometime during training. There is some variance in training. You can run multiple times and report the best results. Include the tensorboard plot for the average reward in your writeup.
Now, test your implementation on the HalfCheetah-v1 environment without baseline by running the following command with seed 15:

Include the tensorboard plot for the average reward. Do you notice any difference? Explain.
(c)(iii) (3 pts) How do the results differ across seeds? Briefly describe and compare the performance you obtain with different seeds for the three environments. Run the following commands to get the average performance across 3 runs (the 2 previous runs plus 1 additional run below).

Please comment on how the averaged performance for the HalfCheetah environment compares between the baseline and no-baseline cases.
1.6 Testing
We have provided some basic tests in the root directory of the starter code to test your implementation. You can run the following command to run the tests and check whether your implementation of the individual functions is correct.

2 Best Arm Identification in Multi-armed Bandit (35 pts)
In this problem we focus on the bandit setting with rewards bounded in [0,1]. A bandit problem can be considered a finite-horizon MDP with just one state (|S| = 1) and horizon 1: each episode consists of taking a single action and observing a reward. In the bandit setting, unlike in standard RL, there are no delayed rewards, and the action taken does not affect the distribution of future states.
Actions are also referred to as “arms” in a bandit setting. Here we consider a multi-armed bandit, meaning that 1 < |A| < ∞. Since there is only one state, a policy is simply a distribution over actions. There are exactly |A| different deterministic policies. Your goal is to design a simple algorithm to identify a near-optimal arm with high probability.
We recall Hoeffding’s inequality: if $X_1,\ldots,X_n$ are i.i.d. random variables satisfying $0 \le X_i \le 1$ with probability 1 for all $i$, $\bar{x} = \mathbb{E}[X_1] = \cdots = \mathbb{E}[X_n]$ is the expected value of the random variables, and $\widehat{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean, then for any $\delta > 0$ we have

$$\Pr\left( \left|\widehat{X} - \bar{x}\right| > \sqrt{\frac{\log(2/\delta)}{2n}} \right) < \delta \tag{1}$$
Assuming that the rewards are bounded in [0,1], we propose this simple strategy: pull each arm $n_e$ times, and return the action with the highest average payout. The purpose of this exercise is to study the number of samples required to output an arm that is at least $\epsilon$-optimal with high probability. Intuitively, as $n_e$ increases the empirical average of the payout converges to its expected value $r_a$ for every action $a$, and so choosing the arm with the highest empirical payout approximately corresponds to choosing the arm with the highest expected payout $r_a$.
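For concreteness, here is a minimal sketch of this uniform-allocation strategy; pull_arm is a hypothetical sampling function, not part of the assignment:

```python
import numpy as np

def best_arm_by_uniform_sampling(pull_arm, n_actions, n_e):
    """Pull every arm n_e times and return the arm with the highest
    empirical mean payout. pull_arm(a) is assumed to return a single
    reward sample in [0, 1] for action a."""
    empirical_means = [np.mean([pull_arm(a) for _ in range(n_e)]) for a in range(n_actions)]
    return int(np.argmax(empirical_means))

# Example with Bernoulli arms of means 0.3, 0.5, 0.7:
# rng = np.random.default_rng(0)
# best_arm_by_uniform_sampling(lambda a: rng.binomial(1, [0.3, 0.5, 0.7][a]), 3, 1000)
# -> 2 with high probability
```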
(a) (15 pts) We start by bounding the probability of the “bad event” in which the empirical mean of some arm differs significantly from its expected return. Starting from Hoeffding’s inequality with $n_e$ samples allocated to every action, show that:

$$\Pr\left( \exists a \in A \ \text{s.t.}\ |\hat{r}_a - r_a| > \sqrt{\frac{\log(2/\delta)}{2 n_e}} \right) \le |A|\,\delta \tag{2}$$

where $\hat{r}_a$ denotes the empirical mean payout of arm $a$.
(b) (20 pts) After pulling each arm (action) $n_e$ times, our algorithm returns the arm with the highest empirical mean:

$$a^\dagger = \arg\max_a \hat{r}_a \tag{3}$$
Notice that $a^\dagger$ is a random variable. Let $a^\star = \arg\max_a r_a$ be the true optimal arm. Suppose that we want our algorithm to return at least an $\epsilon$-optimal arm with probability at least $1 - \delta_0$, as follows:

$$\Pr\left( r_{a^\dagger} \ge r_{a^\star} - \epsilon \right) \ge 1 - \delta_0 \tag{4}$$

How many samples are needed to ensure this? Express your result as a function of the number of actions $|A|$, the required precision $\epsilon$, and the failure probability $\delta_0$.
