CS3316 Assignment 5 Solution

1 Introduction
2 Actor-Critic Algorithm
In the original policy gradient $\nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$, the return $v_t$ is an unbiased estimate of the expected long-term value $Q^{\pi_\theta}(s, a)$ under the policy $\pi_\theta(s)$ (the Actor). However, the original policy gradient suffers from high variance. The Actor-Critic algorithm instead uses a learned Q-value function $Q_w(s, a)$, called the Critic, to estimate $Q^{\pi_\theta}(s, a)$.
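As a rough illustration, below is a minimal one-step actor-critic update sketch written with PyTorch (PyTorch itself is an assumption; the handout only requires Python 3). The names `actor`, `critic`, the optimizers, and the `actor.log_prob` helper are hypothetical placeholders, not part of the assignment.

import torch

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, done, gamma=0.99):
    # Critic estimates Q_w(s, a); a simple TD(0) target bootstraps from the
    # critic's own estimate at the next state-action pair.
    with torch.no_grad():
        next_action = actor(next_state)
        td_target = reward + gamma * (1.0 - done) * critic(next_state, next_action)

    q_value = critic(state, action)
    critic_loss = torch.nn.functional.mse_loss(q_value, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor follows grad_theta log pi_theta(s, a) * Q_w(s, a), with the
    # critic's estimate replacing the high-variance return v_t.
    log_prob = actor.log_prob(state, action)   # hypothetical helper on the actor
    actor_loss = -(log_prob * critic(state, action).detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()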
DDPG borrows from the success of DQN by introducing experience replay and a target Q-network into the deterministic policy gradient (DPG) method. DPG deterministically maps each state to a specific action instead of outputting a stochastic policy over all actions.
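Under the same assumptions (PyTorch), a single DDPG training step might look like the sketch below. The `replay_buffer` with a `sample` method, the four networks, and the optimizers are hypothetical names, and `tau` is just an example Polyak coefficient.

import torch

def ddpg_step(actor, critic, target_actor, target_critic,
              actor_opt, critic_opt, replay_buffer,
              batch_size=64, gamma=0.99, tau=0.005):
    s, a, r, s2, done = replay_buffer.sample(batch_size)

    # Critic target uses the *target* networks, as in DQN.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s2, target_actor(s2))
    critic_loss = torch.nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Deterministic policy gradient: maximize Q_w(s, mu_theta(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of the target networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)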
In A3C, we maintain several local agents and one global agent. Instead of using experience replay, we execute all the local agents asynchronously in parallel; the parameters of the global agent are updated from the experience of all the local agents. The pseudocode is listed at the end of the article. For more details on these two algorithms, refer to the original papers listed below; a worker-update sketch follows the references.
• Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[J]. arXiv preprint arXiv:1509.02971, 2015.
• Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016: 1928-1937.
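The sketch below illustrates the usual A3C gradient-push pattern with `torch.multiprocessing` (PyTorch is an assumption). `GlobalNet`, `rollout`, `compute_loss`, and `make_env` are hypothetical placeholders; a full implementation would also share the optimizer statistics across workers.

import torch
import torch.multiprocessing as mp

def worker(global_net, global_opt, make_env, n_updates=1000):
    # Each worker keeps a local copy of the network architecture
    # (assumes the module class has a no-argument constructor).
    local_net = type(global_net)()
    env = make_env()
    for _ in range(n_updates):
        local_net.load_state_dict(global_net.state_dict())  # sync with global
        batch = rollout(env, local_net)        # collect a short local trajectory
        loss = compute_loss(local_net, batch)  # actor loss + critic loss
        local_net.zero_grad()
        loss.backward()
        # Push local gradients onto the shared global parameters, then step.
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp._grad = lp.grad
        global_opt.step()

if __name__ == "__main__":
    global_net = GlobalNet()       # hypothetical shared actor-critic module
    global_net.share_memory()      # place parameters in shared memory
    global_opt = torch.optim.Adam(global_net.parameters(), lr=1e-4)
    workers = [mp.Process(target=worker, args=(global_net, global_opt, make_env))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()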
3 Experiment Description
• Programming language: python3
• You are required to implement the A3C and DDPG algorithms.
• You should test your agents in a classical RL control environment, Pendulum. OpenAI Gym provides this environment, implemented in Python (https://www.gymlibrary.dev/environments/classic_control/pendulum/). You can also refer to Gym's source code at https://github.com/openai/gym. A minimal interaction loop is sketched after this list.
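The snippet below is one way to interact with the Pendulum environment. It assumes a recent gym release (>= 0.26, where `reset` returns `(obs, info)` and `step` returns a five-tuple) and the environment id "Pendulum-v1"; older releases use "Pendulum-v0" and a four-tuple `step` return.

import gym

env = gym.make("Pendulum-v1")
obs, info = env.reset(seed=0)
episode_return = 0.0
for _ in range(200):                    # Pendulum episodes last 200 steps
    action = env.action_space.sample()  # replace with your DDPG/A3C actor's output
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward
    if terminated or truncated:
        break
env.close()
print("random-policy return:", episode_return)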
4 Report and Submission
• Your report and source code should be compressed into a single archive named “studentID+name+assignment5”.
