Introduction
This assignment covers Goal-Conditioned Reinforcement Learning (GCRL). You will learn how to train a goal-conditioned policy and how to use a pretrained goal-conditioned policy to train a hierarchical RL agent.
Part 1: GCRL
We will build on the code implemented in the first two assignments, using the provided policy gradient algorithm (pg) to train our goal-conditioned policies.
In order to implement the goal-conditioned wrapper, you will be writing new code in the following files:
• hw4/infrastructure/gclrwrapper.py
For this assignment, a new environment based on an extension of the ant environment will be used.
Implementation
When we use goal-conditioned RL, we need to modify the observation from the original environment by appending the goal. If s is the original state and g is the generated goal, your environment wrapper should construct a new state s_mod ← [s, g], which results in the policy π(a | s_mod, θ).
Goal Distributions. For this part of the assignment and the first question, generate goals from a uniform distribution with bounds [-4, 20] for ant and [-0.3, 0.3] for reacher.
class GoalConditionedEnv(object):
    def __init__(self):
        ## You need to update the size of self.observation_space to include the goal
    def reset(self):
        # Add code to generate a goal from a distribution
    def step(self, action):
        ## Add code to compute a new goal-conditioned reward
    def createState(self, obs):
        ## Add the goal to the state
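As a reference point, here is a minimal sketch of what such a wrapper could look like, assuming a gym-style environment whose first coordinates encode the agent position; the goal dimensionality, success threshold, and reward scale are illustrative choices, not the required interface.

import numpy as np
import gym

class GoalConditionedEnv(object):
    """Sketch of a goal-conditioned wrapper: appends a sampled goal to the observation."""

    def __init__(self, env, goal_low=-4.0, goal_high=20.0, goal_dim=2):
        self.env = env
        self.goal_low, self.goal_high, self.goal_dim = goal_low, goal_high, goal_dim
        # Grow the observation space so the policy sees s_mod = [s, g].
        obs_dim = env.observation_space.shape[0] + goal_dim
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)
        self.action_space = env.action_space
        self.goal = None

    def reset(self):
        obs = self.env.reset()
        # Sample one goal per episode from the uniform distribution.
        self.goal = np.random.uniform(self.goal_low, self.goal_high, size=self.goal_dim)
        return self.createState(obs)

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        # Goal-conditioned reward: negative distance between the agent position and the goal.
        dist = np.linalg.norm(obs[:self.goal_dim] - self.goal)
        info["distance_to_goal"] = dist
        info["success"] = float(dist < 0.5)  # illustrative success threshold
        return self.createState(obs), -dist, done, info

    def createState(self, obs):
        # s_mod = [s, g]
        return np.concatenate([obs, self.goal]).astype(np.float32)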
Next, you should run your code on the modified Ant environment to evaluate how well your solution works. After 1 million steps your agent should achieve average reward close to −5.
Evaluation
Once you have a working implementation of GCRL, you should prepare a report. The report should consist of one figure for each question below. You should turn in the report as one PDF and a zip file with your code. If your code requires special instructions or dependencies to run, please include these in a file called README inside the zip file. Also provide the log files of your runs on Gradescope, named reacher1.csv and ant1.csv.
Question 1: basic GCRL performance. [3 pts] Include a learning curve plot showing the performance of your implementation on reacher and Ant. The x-axis should correspond to the number of time steps (consider using scientific notation) and the y-axis should show the average per-epoch reward as well as the best mean reward so far. You should also add logging for the distance between the state and the goal (this is the intrinsic reward you have implemented) and the success rate, and make sure they are logged to the data folder. Include a plot of these metrics using TensorBoard, wandb, etc., as in previous assignments, or in Excel using the CSV file. Be sure to label the y-axis, since we need to verify that your implementation achieves reward similar to ours. You should not need to modify the default hyperparameters to obtain good performance, but if you modify any of the parameters, list them in the caption of the figure. The final results should use the following experiment names:
python ift6131/run_hw4_gcrl.py env_name=reacher exp_name=q1_reacher
python ift6131/run_hw4_gcrl.py env_name=antmaze exp_name=q1_ant
python run_hw4_gcrl.py env_name=reacher exp_name=q1_reacher_normal goal_dist=normal
python run_hw4_gcrl.py env_name=antmaze exp_name=q1_ant_normal goal_dist=normal
Question 3: Relative Goals [2 pts] So far, we have specified goals in a global state space. Using a global state space makes it more difficult for the agent to understand how to reach the goal. To make training easier, we can generate and provide the goal in an agent-relative coordinate frame, g_mod ← g − s. Use this new relative goal location g_mod as the goal you add to the state, s_mod ← [s, g_mod]. In addition, change the distribution the goals are generated from to g ~ N(agent position, 3). Train the policies again using this representation and plot the distance to the generated goals.
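A minimal sketch of these two changes is shown below, assuming the agent position occupies the first goal_dim entries of the observation; goal_dim and the helper name are illustrative, not part of the starter code.

import numpy as np

def make_relative_goal_state(obs, goal_dim=2, goal_std=3.0):
    """Sample a goal around the agent and append its agent-relative coordinates to the state."""
    agent_pos = obs[:goal_dim]                    # assumed: position = first goal_dim entries
    goal = np.random.normal(agent_pos, goal_std)  # g ~ N(agent position, 3)
    goal_rel = goal - agent_pos                   # g_mod = g - s
    return np.concatenate([obs, goal_rel]), goal  # s_mod = [s, g_mod]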
python run_hw4_gcrl.py env_name=reacher exp_name=q3_reacher_normal_relative goal_dist=normal goal_frequency=10 goal_rep=relative
python run_hw4_gcrl.py env_name=antmaze exp_name=q3_ant_normal_relative goal_dist=normal goal_frequency=10 goal_rep=relative
Submit the run logs for all the experiments above. In your report, make a single graph that averages the performance across three runs for both Uniform and Normal goal distributions. See scripts/readresults.py for an example of how to read the evaluation returns from Tensorboard logs.
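If you prefer not to start from scripts/readresults.py, a possible sketch using the tensorboard package is shown below; the scalar tag and the glob pattern are guesses and may differ from your logs.

import glob
import numpy as np
from tensorboard.backend.event_processing import event_accumulator

def read_returns(event_file, tag="Eval_AverageReturn"):
    """Read (step, value) pairs for one scalar tag from a TensorBoard event file."""
    ea = event_accumulator.EventAccumulator(event_file)
    ea.Reload()
    events = ea.Scalars(tag)
    return np.array([e.step for e in events]), np.array([e.value for e in events])

# Average three runs of the same experiment (paths are illustrative).
runs = glob.glob("outputs/*/data/q1_ant*/events.out.tfevents.*")
curves = [read_returns(f)[1] for f in runs]
length = min(len(c) for c in curves)
mean_curve = np.mean([c[:length] for c in curves], axis=0)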
Question 4: Changing goals during the episode. [3pts] Often, we want a GCRL agent that can learn to achieve many goals in one episode. For this, we will create another wrapper that uses a controlled goal update frequency. We will fill in the missing parts of the class GoalConditionedWrapperV2().
class GoalConditionedWrapperV2(object):
    def __init__(self):
        ## You need to update the size of self.observation_space to include the goal
    def sampleGoal(self):
        ## This will sample a goal from the distribution
    def reset(self):
        # Use sampleGoal to get a new goal
    def step(self, action):
        ## Now we need to use the goal for k steps and after these k steps sample a new goal
        ## Add code to compute a new goal-conditioned reward
    def createState(self, obs):
        ## Add the goal to the state
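Here is a sketch of the goal-frequency logic, building on the illustrative GoalConditionedEnv sketch from Part 1; the attribute names are the same assumed ones used there.

import numpy as np

class GoalConditionedWrapperV2(GoalConditionedEnv):
    """Sketch: resamples the goal every goal_frequency environment steps."""

    def __init__(self, env, goal_frequency=10, **kwargs):
        super().__init__(env, **kwargs)
        self.goal_frequency = goal_frequency
        self.steps_since_goal = 0

    def sampleGoal(self):
        # Shared by reset() and step() so both draw from the same distribution.
        return np.random.uniform(self.goal_low, self.goal_high, size=self.goal_dim)

    def reset(self):
        obs = self.env.reset()
        self.goal = self.sampleGoal()
        self.steps_since_goal = 0
        return self.createState(obs)

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        # Same goal-conditioned reward as before.
        dist = np.linalg.norm(obs[:self.goal_dim] - self.goal)
        self.steps_since_goal += 1
        # After k = goal_frequency steps with the current goal, sample a new one.
        if self.steps_since_goal >= self.goal_frequency:
            self.goal = self.sampleGoal()
            self.steps_since_goal = 0
        return self.createState(obs), -dist, done, info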
python run_hw4_gcrl.py env_name=reacher exp_name=q3_reacher_normal goal_dist=normal goal_frequency=10
python run_hw4_gcrl.py env_name=antmaze exp_name=q3_ant_normal goal_dist=normal goal_frequency=10
You should try different values [5, 10] for goal_frequency to see how it affects performance.
Saving your Policy. For the next part of the assignment you will need to load the policies you have trained, so make sure to save them and verify that you can load them back.
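Assuming the policy is a torch.nn.Module (as in the provided pg code), saving and loading can be as simple as the sketch below; the helper names are illustrative.

import torch

def save_policy(policy, path):
    """Save only the network weights; the architecture is rebuilt when loading."""
    torch.save(policy.state_dict(), path)

def load_policy(policy, path):
    """Load saved weights into an already-constructed policy of the same architecture."""
    policy.load_state_dict(torch.load(path))
    policy.eval()  # the lower level stays frozen during HRL training
    return policy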
Part 2: Hierarchical RL
Implement a Hierarchical policy to plan longer paths across the environment.
In order to implement Hierarchical Reinforcement Learning (HRL), you will be writing new code in the following files:
• hw4/infrastructure/hrlwrapper.py
The higher-level HRL policy π(g | s, θ_hi) uses the policy you trained in Q3 as the lower level π(a | s, g, θ_lo). To accomplish this, we will create another environment wrapper that loads the lower-level policy and uses it, together with the goals suggested by the high level, to act in the environment.
1: initialize θ_hi as a random network, load θ_lo, and set D ← {}
2: for t ∈ 0, ..., T do
3:   g ← π(· | s_t, θ_hi)  {take k steps in the environment}
4:   for i ∈ 0, ..., k do
5:     compute a_{t+i} ← π(· | s_{t+i}, g, θ_lo) and receive {s_{t+i}, a_{t+i}, r_{t+i}, s_{t+i+1}}
6:   end for
7:   add {s_t, g, r_{t+k}, s_{t+k}} to D
8:   update the RL algorithm parameters θ_hi
9: end for
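One way to realize this as an environment wrapper is sketched below; the two-dimensional goal space, the low-level get_action interface, and summing the environment reward over the k steps are assumptions, not the required design.

import numpy as np
import gym

class HRLWrapper(object):
    """Sketch: the high level outputs a goal; the wrapper rolls out the frozen
    low-level policy for k steps and returns the accumulated environment reward."""

    def __init__(self, env, low_level_policy, k=10):
        self.env = env
        self.low = low_level_policy          # frozen pi(a | s, g, theta_lo)
        self.k = k
        self.observation_space = env.observation_space
        # The "action" of the high level is a goal, e.g. a 2-D target position.
        self.action_space = gym.spaces.Box(low=-4.0, high=20.0, shape=(2,), dtype=np.float32)
        self._obs = None

    def reset(self):
        self._obs = self.env.reset()
        return self._obs

    def step(self, goal):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.k):
            goal_rel = goal - self._obs[:2]                 # relative goal, as in Q3
            state = np.concatenate([self._obs, goal_rel])   # s_mod = [s, g_mod]
            action = self.low.get_action(state)             # assumed low-level interface
            self._obs, reward, done, info = self.env.step(action)
            total_reward += reward                          # environment reward for the high level
            if done:
                break
        return self._obs, total_reward, done, info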
Question 5: Experiments (HRL) [3pts] For this question, the goal is to implement HRL and tune a few of the hyperparameters to improve performance. Try different goal_frequency values [5, 10] for the environment. You should use the lower-level policies you trained for Q3 for each of these frequencies.
python run_hw4.py exp_name=q4_hrl_gf<b> rl_alg=pg env_name=antmaze
Run this command once for each goal frequency, replacing <b> with the value used (e.g. 5 and 10).
Submit the learning graphs from these experiments along with the write-up for the assignment. You should plot the environment reward.
1 Bonus [3pts]
You can get extra points if you also implement Hindsight Experience Replay (HER).
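The key idea in HER is to relabel stored transitions with goals that were actually reached, so that failed episodes still produce useful learning signal. A sketch of the simple "final" relabelling strategy is shown below; the dict-based transition format and the assumption that the agent position occupies the first goal_dim entries of the raw observation are illustrative, not part of the starter code.

import numpy as np

def her_relabel(episode, goal_dim=2):
    """Relabel an episode with the goal that was actually achieved at its end."""
    achieved = episode[-1]["next_obs"][:goal_dim]
    relabelled = []
    for tr in episode:
        dist = np.linalg.norm(tr["next_obs"][:goal_dim] - achieved)
        relabelled.append({
            "obs": np.concatenate([tr["obs"], achieved - tr["obs"][:goal_dim]]),
            "action": tr["action"],
            "reward": -dist,  # same goal-conditioned reward as in Part 1
            "next_obs": np.concatenate([tr["next_obs"], achieved - tr["next_obs"][:goal_dim]]),
        })
    return relabelled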
2 Bug Bonus [3pts]
Submission
We ask you to submit the following content on the course GradeScope:
Submitting the PDF.
Your report should be a PDF document containing the plots and responses indicated in the questions above.
Submitting log files on the autograder.
Make sure to submit all the log files that are requested by the GradeScope AutoGrader; you can find them in your log directory (/data/expname/ by default).
Submitting code, experiments runs, and videos.
In order to turn in your code and experiment logs, create a folder that contains the following:
• The roble folder with all the .py files, keeping the same names and directory structure as the original homework repository (do not include the outputs/ folder). Additionally, include a README file listing the commands (with clear hyperparameters) and the config file conf/confighw4.yaml that we need in order to run the code and produce the numbers in your figures/tables (e.g. run "python run_hw4.py --ep_len 200"). Finally, submit your plotting script, which should be a Python script (or Jupyter notebook) such that running it generates all the plots in your PDF. This plotting script should extract its values directly from the experiments in your outputs/ folder and should not contain hardcoded reward values.
• You must also provide a video of your final policy for each question above. To enable video logging, set both flags video_log_freq (to a value greater than 0) and render (to true) in conf/confighw4.yaml before running your experiments. Videos can be found as .mp4 files in the folder outputs/.../data/expname/videos/. Note: for this homework, the Atari envs are saved in the folder gym and the others in the folder video.
As an example, the unzipped version of your submission should result in a file structure like the following. Make sure to include the prefixes q1_, q2_, q3_, q4_, and q5_ in the experiment names.
data/
    q1_cheetah_n500_arch1x32/
        events.out.tfevents.1567529456.e3a096ac8ff4
        eval_step_x.mp4
    q1_cheetah_n5_arch2x250/
        events.out.tfevents.1567529456.e3a096ac8ff4
        eval_step_x.mp4
roble/
    dqn_agent.py
    policies/
conf/
    confighw4.yaml
You also need to include a diff of your code compared to the starter homework code.
You can use the command
git diff 8ea2347b8c6d8f2545e06cef2835ebfd67fdd608 >> diff.txt
Turn in your assignment on Gradescope. Upload the zip file with your code and log files to HW4 Code, and upload the PDF of your report to HW4.