Introduction
Part 1: DQN
We will be building on the code implemented in the first two assignments. All files needed to run your code are in the hw3 folder, but there are some blanks you will fill in with your solutions from homeworks 1 and 2. These locations are marked with # TODO: get this from hw1, hw2 and are found in the following files:
• infrastructure/rl_trainer.py
• infrastructure/utils.py
• policies/MLP_policy.py
In order to implement deep Q-learning, you will be writing new code in the following files:
• agents/dqn_agent.py
• critics/dqn_critic.py
• policies/argmax_policy.py (a rough sketch of the greedy policy follows this list)
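For orientation, here is a minimal sketch of what the greedy policy might look like. It assumes the critic exposes a qa_values(obs) method returning the Q-values of every action; that interface is an assumption, not necessarily the starter code's.

import numpy as np

class ArgMaxPolicy:
    """Greedy policy that always picks the action with the highest predicted Q-value."""

    def __init__(self, critic):
        # Assumed interface: critic.qa_values(obs_batch) -> array of shape (batch, num_actions)
        self.critic = critic

    def get_action(self, obs):
        # Add a batch dimension, query Q(s, a) for every action, return the argmax action.
        qa_values = self.critic.qa_values(obs[None])
        return int(np.argmax(qa_values, axis=1)[0])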
There are two new package requirements (opencv-python and gym[atari]) beyond what was used in the first two assignments; make sure to install these with pip install -r requirements.txt if you are running the assignment locally.
Implementation
Evaluation
Once you have a working implementation of Q-learning, you should prepare a report. The report should consist of one figure for each question below. You should turn in the report as one PDF and a zip file with your code. If your code requires special instructions or dependencies to run, please include these in a file called README inside the zip file. Also submit the log file of your run to Gradescope, named pacman1.csv.
Question 1: basic Q-learning performance (DQN). Include a learning curve plot showing the performance of your implementation on Ms. Pac-Man. The x-axis should correspond to number of time steps (consider using scientific notation) and the y-axis should show the average per-epoch reward as well as the best mean reward so far. These quantities are already computed and printed in the starter code. They are also logged to the data folder, and can be visualized using Tensorboard as in previous assignments. Be sure to label the y-axis, since we need to verify that your implementation achieves similar reward as ours. You should not need to modify the default hyperparameters in order to obtain good performance, but if you modify any of the parameters, list them in the caption of the figure. The final results should use the following experiment name:
python ift6131/scripts/run_hw3_dqn.py env_name=MsPacman-v0 exp_name=q1
Question 2: double Q-learning (DDQN). Compare the performance of DQN and double DQN on LunarLander-v3 across three random seeds. First, run vanilla DQN:
python run_hw3.py env_name=LunarLander-v3 exp_name=q2_dqn_1 seed=1
python run_hw3.py env_name=LunarLander-v3 exp_name=q2_dqn_2 seed=2
python run_hw3.py env_name=LunarLander-v3 exp_name=q2_dqn_3 seed=3
Then run double DQN:
python run_hw3.py env_name=LunarLander-v3 exp_name=q2_doubledqn_1 double_q=true seed=1
python run_hw3.py env_name=LunarLander-v3 exp_name=q2_doubledqn_2 double_q=true seed=2
python run_hw3.py env_name=LunarLander-v3 exp_name=q2_doubledqn_3 double_q=true seed=3
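For reference, the only difference between the two sets of runs is how the bootstrap target is computed. Below is a minimal PyTorch-style sketch; the names q_net and q_net_target (each mapping a batch of observations to per-action Q-values) are illustrative assumptions, not the starter-code attributes.

import torch

def td_target(rewards, next_obs, dones, q_net, q_net_target, gamma=0.99, double_q=False):
    # Compute y = r + gamma * (1 - done) * Q_target(s', a') for a batch of transitions.
    with torch.no_grad():
        next_q_target = q_net_target(next_obs)                     # (batch, num_actions)
        if double_q:
            # Double DQN: the online network picks the action, the target network evaluates it.
            next_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
            next_q = next_q_target.gather(1, next_actions).squeeze(1)
        else:
            # Vanilla DQN: the target network both picks and evaluates the action.
            next_q = next_q_target.max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q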
Submit the run logs for all the experiments above. In your report, make a single graph that averages the performance across three runs for both DQN and double DQN. See scripts/read_results.py for an example of how to read the evaluation returns from Tensorboard logs.
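scripts/read_results.py is the reference for parsing the logs; the sketch below shows one possible way to average three seeds with the TensorBoard event API. The log directories and the scalar tag "Train_AverageReturn" are guesses; substitute whatever your logger actually writes.

import numpy as np
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing import event_accumulator

def load_scalar(logdir, tag):
    # Return (steps, values) for one scalar tag from a TensorBoard log directory.
    acc = event_accumulator.EventAccumulator(logdir)
    acc.Reload()
    events = acc.Scalars(tag)
    return np.array([e.step for e in events]), np.array([e.value for e in events])

runs = ["data/q2_dqn_1", "data/q2_dqn_2", "data/q2_dqn_3"]   # hypothetical log directories
curves = [load_scalar(run, "Train_AverageReturn") for run in runs]
n = min(len(vals) for _, vals in curves)                     # truncate to the shortest run
mean_curve = np.mean([vals[:n] for _, vals in curves], axis=0)

plt.plot(curves[0][0][:n], mean_curve, label="DQN (mean of 3 seeds)")
plt.xlabel("environment steps")
plt.ylabel("average return")
plt.legend()
plt.savefig("q2_dqn_3seed_mean.png")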
Question 3: experimenting with hyperparameters. Choose one hyperparameter, try at least three different settings of it, and compare the resulting learning curves:
python run_hw3_dqn.py env_name=LunarLander-v3 exp_name=q3_hparam1
python run_hw3_dqn.py env_name=LunarLander-v3 exp_name=q3_hparam2
python run_hw3_dqn.py env_name=LunarLander-v3 exp_name=q3_hparam3
You can replace LunarLander-v3 with PongNoFrameskip-v4 or MsPacman-v0 if you would like to test on a different environment.
Part 2: DDPG
Implement the DDPG algorithm.
In order to implement Deep Deterministic Policy Gradient (DDPG), you will be writing new code in the following files:
• agents/ddpg_agent.py
• critics/ddpg_critic.py
• policies/MLP_policy.py
DDPG is programmed a little differently than the RL algorithms implemented so far. DDPG does not use n-step returns to estimate the advantage from a large batch of on-policy data. Instead, DDPG is off-policy: it trains a Q-function Q(s_t, a_t, ϕ) to estimate the reward-to-go of the policy for a given state and action. This model can then be used as the objective for optimizing the current policy:
max_θ E_{s∼D} [Q(s, µ(s, θ), ϕ)] (1)
Algorithm 1 DDPG algorithm
1: init ϕ′ ← ϕ and θ′ ← θ to random networks and D ← {}
2: for l ∈ 0,...,L do
3: take some action a_t and receive {s_t, a_t, r_t, s′_t}; add it to D
4: Sample batch of data from D
5: y_i ← r_i + γ Q(s′_i, µ(s′_i, θ′), ϕ′) {Compute target}
6: Update ϕ by minimizing (1/N) Σ_i ||Q(s_i, a_i, ϕ) − y_i||² {Update critic}
7: Update θ by maximizing (1/N) Σ_i Q(s_i, µ(s_i, θ), ϕ) {Update actor}
8: Update θ′ ← ρθ′ + (1 − ρ)θ and ϕ′ ← ρϕ′ + (1 − ρ)ϕ {Using Polyak averaging}
9: end for
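The inner loop of Algorithm 1 translates almost line for line into a few PyTorch operations. The following is only a sketch to make the update order concrete; the batch layout, network call signatures, and the value of ρ are assumptions, not the starter-code interface.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, rho=0.995):
    # batch: tensors sampled from the replay buffer D (shapes assumed to broadcast correctly).
    obs, acs, rews, next_obs, dones = batch

    # Lines 5-6: critic update, regress Q(s, a, phi) toward the bootstrapped target y.
    with torch.no_grad():
        y = rews + gamma * (1.0 - dones) * critic_target(next_obs, actor_target(next_obs))
    critic_loss = F.mse_loss(critic(obs, acs), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Line 7: actor update, ascend Q(s, mu(s, theta), phi) by minimizing its negative.
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Line 8: Polyak averaging of the target networks.
    with torch.no_grad():
        for net, target in ((actor, actor_target), (critic, critic_target)):
            for p, p_targ in zip(net.parameters(), target.parameters()):
                p_targ.mul_(rho).add_((1.0 - rho) * p)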
Question 4: Experiments (DDPG) For this question the goal is to implement DDPG and tune a few of its hyperparameters to improve performance. Try different update frequencies for the Q-function and actor, and try different learning rates for each. First, try different learning rates:
python run_hw3.py exp_name=q4_ddpg_up<b>_lr<r> rl_alg=ddpg env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q4_ddpg_up<b>_lr<r> rl_alg=ddpg env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q4_ddpg_up<b>_lr<r> rl_alg=ddpg env_name=InvertedPendulum-v2 atari=false
Next, try different update frequencies for training the Q-function and actor:
python run_hw3.py exp_name=q4_ddpg_up<b>_lr<r> rl_alg=ddpg env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q4_ddpg_up<b>_lr<r> rl_alg=ddpg env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q4_ddpg_up<b>_lr<r> rl_alg=ddpg env_name=InvertedPendulum-v2 atari=false
Submit the learning graphs from these experiments along with the write-up for the assignment.
Question 5: Best parameters on a more difficult task After you have completed the parameter tuning on the simpler InvertedPendulum-v2 environment, use those parameters to train a model on the more difficult HalfCheetah-v2 environment.
python run_hw3.py exp_name=q5_ddpg_hard_up<b>_lr<r> rl_alg=ddpg env_name=HalfCheetah-v2 atari=false
Include the learning graph from this experiment in the write-up as well. Also submit the log file of your run to Gradescope, named halfcheetah5.csv.
Part 3: TD3
In order to implement Twin Delayed Deep Deterministic Policy Gradient (TD3), you will be writing new code in the following file:
• critics/td3_critic.py
Getting from DDPG to TD3 is a relatively small change: implement the additional target Q-function for TD3.
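A minimal sketch of the extra target computation, assuming two target Q-functions and using the document's ρ for the noise scale and c for the clip bound (the function and argument names are illustrative):

import torch

def td3_target(rews, next_obs, dones, actor_target, q1_target, q2_target,
               gamma=0.99, rho=0.2, c=0.5):
    with torch.no_grad():
        next_acs = actor_target(next_obs)
        # Target policy smoothing: add clipped Gaussian noise to the target action.
        next_acs = next_acs + torch.clamp(torch.randn_like(next_acs) * rho, -c, c)
        # Clipped double Q: bootstrap from the minimum of the two target critics.
        next_q = torch.min(q1_target(next_obs, next_acs), q2_target(next_obs, next_acs))
        return rews + gamma * (1.0 - dones) * next_q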
Question 6: TD3 tuning Again, the hyperparameters for this new algorithm need to be tuned using InvertedPendulum-v2. Try different values for the scale ρ of the noise added to the target policy when computing the target values. Also try different Q-function network structures. Start by trying different values for ρ:
python run_hw3.py exp_name=q6_td3_shape<s>_rho<r> rl_alg=td3 env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q6_td3_shape<s>_rho<r> rl_alg=td3 env_name=InvertedPendulum-v2 atari=false
Algorithm 2 TD3
1: init ϕ′ ← ϕ and θ′ ← θ to a random network and D ← {}
2: for l ∈ 0,...,L do
3: take some action a_t and receive {s_t, a_t, r_t, s_{t+1}}; add it to D
4: Sample batch of data
5: y_i ← r_i + γ Q(s′_i, µ(s′_i, θ′) + clip(N(0, I) · ρ, −c, c), ϕ′) {Compute target}
6: Update ϕ by minimizing (1/N) Σ_i ||Q(s_i, a_i, ϕ) − y_i||² {Update critic}
7: if l mod d == 0 then
8: Update θ by maximizing (1/N) Σ_i Q(s_i, µ(s_i, θ), ϕ) {Update actor}
9: Update θ′ ← ρθ′ + (1 − ρ)θ and ϕ′ ← ρϕ′ + (1 − ρ)ϕ {Using Polyak averaging}
10: end if
11: end for
python run_hw3.py exp_name=q6_td3_shape<s>_rho<r> rl_alg=td3 env_name=InvertedPendulum-v2 atari=false
Next, try different Q-function network structures:
python run_hw3.py exp_name=q6_td3_shape<s>_rho<r> rl_alg=td3 env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q6_td3_shape<s>_rho<r> rl_alg=td3 env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q6_td3_shape<s>_rho<r> rl_alg=td3 env_name=InvertedPendulum-v2 atari=false
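One simple way to expose the Q-function "shape" as a hyperparameter is a list of hidden-layer sizes; below is a sketch under that assumption (the actual config format in conf/config_hw3.yaml may differ):

import torch.nn as nn

def build_q_network(ob_dim, ac_dim, hidden_sizes):
    # Build an MLP critic Q(s, a): the input is the concatenated state and action,
    # and hidden_sizes (e.g. [64, 64] or [400, 300]) controls the network shape <s>.
    layers, in_dim = [], ob_dim + ac_dim
    for size in hidden_sizes:
        layers += [nn.Linear(in_dim, size), nn.ReLU()]
        in_dim = size
    layers.append(nn.Linear(in_dim, 1))   # scalar Q-value
    return nn.Sequential(*layers)

q_net = build_q_network(ob_dim=4, ac_dim=1, hidden_sizes=[256, 256])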
Include the results from these hyperparameter experiments in the assignment write-up. Make sure to comment clearly on which parameters you studied and why some settings worked better than others.
Question 7: Evaluate TD3 compared to DDPG In this question, evaluate TD3 against DDPG. Using the best parameter setting from Q6, train TD3 on the more difficult environment used for Q5.
python run_hw3.py exp_name=q7_td3_shape<s>_rho<r> rl_alg=td3 env_name=HalfCheetah-v2 atari=false
Include the learning graph from this experiment in the write-up for the assignment. Make sure to comment on the difference in performance between DDPG and TD3 and what causes it. Also submit the log file of your run to Gradescope, named halfcheetah7.csv.
Bonus: For finding issues or adding features Keeping up with the latest research often means adding new types of programming questions to the assignments. If you find issues with the code and provide a solution, you can receive bonus points. Adding features that help the class can also result in bonus points for the assignment.
Part 4: SAC
In order to implement Soft Actor-Critic (SAC), you will be writing new code in the following files:
• agents/sac_agent.py
• critics/sac_critic.py
• MLPPolicyStochastic in policies/MLP_policy.py
SAC is similar to TD3 except for its central feature, which is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of the randomness of the policy. The algorithm and some relevant resources are available here. Bonus: add linear annealing to α (the entropy coefficient).
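To make the trade-off concrete, here is a rough sketch of the entropy-regularized actor loss, assuming the stochastic policy can return a reparameterized action sample together with its log-probability (that interface, and the use of two critics, are assumptions):

import torch

def sac_actor_loss(obs, policy, q1, q2, alpha=0.2):
    # Maximize E[ min(Q1, Q2)(s, a) + alpha * entropy ], i.e. minimize
    # E[ alpha * log pi(a|s) - min(Q1, Q2)(s, a) ] with a reparameterized sample a ~ pi(.|s).
    acs, log_prob = policy.rsample_with_log_prob(obs)   # assumed policy API
    q_val = torch.min(q1(obs, acs), q2(obs, acs))
    return (alpha * log_prob - q_val).mean()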
Question 8: SAC entropy tuning Again, the hyperparameters for this new algorithm need to be tuned using InvertedPendulum-v2. Try different values for the entropy coefficient α used in the loss:
python run_hw3.py exp_name=q8_sac_alpha<a> rl_alg=sac env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q8_sac_alpha<a> rl_alg=sac env_name=InvertedPendulum-v2 atari=false
python run_hw3.py exp_name=q8_sac_alpha<a> rl_alg=sac env_name=InvertedPendulum-v2 atari=false
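For the bonus (linear annealing of α), a schedule can be as simple as interpolating between a start and an end value over training; the endpoints below are arbitrary:

def linear_alpha(step, total_steps, alpha_start=0.2, alpha_end=0.01):
    # Linearly anneal the entropy coefficient from alpha_start to alpha_end.
    frac = min(step / total_steps, 1.0)
    return alpha_start + frac * (alpha_end - alpha_start)

print(linear_alpha(step=50_000, total_steps=100_000))   # 0.105, halfway through training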
Question 9: Evaluate SAC compared to DDPG In this last question, evaluate SAC against DDPG. Using the best parameter setting from Q8, train SAC on the more difficult environment used for Q5.
python run_hw3.py exp_name=q9_sac_alpha<a> rl_alg=sac env_name=HalfCheetah-v2 atari=false
Include the learning graph from this experiment in the write-up for the assignment. Make sure to comment on the difference in performance between DDPG and SAC and what causes it. Also submit the log file of your run to Gradescope, named halfcheetah9.csv.
Submission
We ask you to submit the following content on the course Gradescope:
Submitting the PDF.
Your report should be a PDF document containing the plots and responses indicated in the questions above.
Submitting log files on the autograder.
Make sure to submit all the log files that are requested by the Gradescope AutoGrader; you can find them in your log directory /data/exp_name/ by default.
Submitting code, experiment runs, and videos.
In order to turn in your code and experiment logs, create a folder that contains the following:
• The roble folder with all the .py files, with the same names and directory structure as the original homework repository (do not include the outputs/ folder). Additionally, include the commands (with clear hyperparameters) and the config file conf/config_hw3.yaml that we need in order to run the code and produce the numbers in your figures/tables (e.g. run "python run_hw3.py --ep_len 200") in the form of a README file. Finally, submit your plotting script, which should be a Python script (or Jupyter notebook) such that running it generates all the plots in your PDF. This plotting script should extract its values directly from the experiments in your outputs/ folder and should not have hardcoded reward values.
• You must also provide a video of your final policy for each question above. To enable video logging, set the flag video_log_freq to a value greater than 0 and set render to true in conf/config_hw3.yaml before running your experiments. Videos can be found as .mp4 files in the folder outputs/.../data/exp_name/videos/. Note: for this homework, the videos for the Atari envs are saved in the gym folder and the others in the video folder.
As an example, the unzipped version of your submission should result in the following file structure. Make sure to include the prefixes q1_, q2_, q3_, q4_, and q5_.
q1_cheetah_n500_arch1x32
    events.out.tfevents.1567529456.e3a096ac8ff4
    eval_step_x.mp4
q1_cheetah_n5_arch2x250
    events.out.tfevents.1567529456.e3a096ac8ff4
    eval_step_x.mp4
dqn_agent.py
policies
config_hw3.yaml
You also need to include a diff of your code compared to the starter homework code.
You can use the command
git diff 8ea2347b8c6d8f2545e06cef2835ebfd67fdd608 >> diff.txt
Turn in your assignment on Gradescope. Upload the zip file with your code and log files to HW3 Code, and upload the PDF of your report to HW3.