In this homework you will implement the policy gradient algorithm with a neural network for the cart pole task [1] in the OpenAI Gym environment. As in the previous homework, ignore the done variable. Terminate the episode after 500 iterations. You can consider the task solved if you consistently get a reward of +450.
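For reference, the episode loop under these rules could look like the following minimal sketch. It assumes the classic `gym` step API (`obs, reward, done, info = env.step(action)`); the function name `rollout` and the placeholder policy `pi` are illustrative choices, not part of the assignment.

```python
import gym

def rollout(env, pi, max_steps=500):
    """Run one roll-out, ignoring `done` and stopping after max_steps."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(max_steps):
        a = pi(s)                         # pi returns 0 or 1 for cart pole
        s_next, r, done, _ = env.step(a)  # `done` is deliberately ignored
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards

env = gym.make("CartPole-v1")
```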
2 Policy Gradient
As explained in the lecture, your RL agent can be a neural network. Since the environment is not complex, in this homework you will use a single layer with at most 4 neurons. (Our implementation has a single neuron and can solve the task in approximately 50 episodes, with 50 roll-outs in each episode. Given the 4-dimensional state space, the neuron has 4 weights and 1 bias. The activation function is a sigmoid, the discount factor is 0.99, and the learning rate is 0.05. The average reward of the roll-outs is used as the baseline. Remember to check the course website for the explanation of the causality principle.)
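To make this setup concrete, here is a minimal sketch of such a single-neuron policy. The names (`theta`, `b`, `GAMMA`, `LR`, `N_ROLLOUTS`, `action_prob`) are illustrative choices; the numerical values simply restate the ones quoted above.

```python
import numpy as np

GAMMA = 0.99      # discount factor
LR = 0.05         # learning rate
N_ROLLOUTS = 50   # roll-outs per episode

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=4)  # one weight per state dimension
b = 0.0                                # single bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def action_prob(s, theta, b):
    """Single neuron: p = sigmoid(theta . s + b)."""
    return sigmoid(np.dot(theta, s) + b)
```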
In the lecture we used a Gaussian distribution as the probability distribution over actions, but the cart pole task has 2 discrete actions, so a Gaussian distribution is not a reasonable choice. Instead, a Bernoulli distribution will be used: the output of the network is the probability p of pushing the cart in one of the two directions (the other action naturally has probability 1 − p). Remember that:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left(\sum_{k=t}^{T} \gamma^{\,k-t}\, R(s_k, a_k)\right)\right] \tag{1}
\]
Here:
\[
\pi_\theta(a_t \mid s_t) = p^{\,n}\,(1-p)^{\,1-n} \tag{2}
\]
where $n \in \{0, 1\}$ is the sampled action and:
\[
p = P_\theta(a_t) = \operatorname{sigmoid}(\theta \cdot s + b) \tag{3}
\]
So:
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T} \nabla_\theta \log\!\left(p^{\,n}(1-p)^{\,1-n}\right) \left(\sum_{k=t}^{T} \gamma^{\,k-t}\, R(s_k, a_k)\right)\right] \tag{4}
\]
\[
\nabla_\theta \log \pi_\theta(a_t \mid s_t) = n\,\frac{1}{p}\,\nabla_\theta p + (1-n)\,\frac{1}{1-p}\,(-\nabla_\theta p) \tag{5}
\]
Because of the derivative property of the sigmoid:
\[
\nabla_\theta p = \nabla_\theta P_\theta(a_t) = p\,(1-p)\, s \tag{6}
\]
Then:
\[
\nabla_\theta \log \pi_\theta(a_t \mid s_t) = n\,(1-p)\, s + (1-n)\,(-p)\, s \tag{7}
\]
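Equations (6) and (7) together give a per-step gradient that depends only on the sampled action n, the probability p, and the state s; note that (7) simplifies to (n − p)s. A sketch of this computation is shown below (the bias gradient n − p is not written out in the derivation above, but follows from the same chain rule with ∂p/∂b = p(1 − p)):

```python
import numpy as np

def grad_log_pi(s, n, p):
    """Per-step gradient of log pi_theta(a_t | s_t), equation (7).

    s: state (4-vector), n: sampled action (0 or 1), p: sigmoid output for s.
    """
    g_theta = (n - p) * np.asarray(s)  # n*(1-p)*s + (1-n)*(-p)*s == (n - p)*s
    g_b = n - p                        # same derivation, with dp/db = p*(1-p)
    return g_theta, g_b
```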
After you correctly calculate these gradients, you can update your parameters using stochastic gradient descent as in the second homework.
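For orientation, one batch update could be assembled from the sketches above roughly as follows. This is only a sketch, not the required implementation: it assumes the hypothetical helpers `rollout`, `action_prob`, `grad_log_pi`, and `rng` with the hyperparameters defined earlier, reads "average reward of the roll-outs as baseline" as subtracting the mean total roll-out reward, and applies the discount with the causality principle (reward-to-go). Since J(θ) is maximized, the step moves the parameters in the direction of the gradient (equivalently, gradient descent on −J).

```python
import numpy as np

def train_episode(env, theta, b, n_rollouts=N_ROLLOUTS, gamma=GAMMA, lr=LR):
    """Collect n_rollouts roll-outs, then apply one gradient step."""

    def pi(s):
        # Bernoulli policy: action 1 with probability p, else action 0
        return int(rng.random() < action_prob(s, theta, b))

    batch, totals = [], []
    for _ in range(n_rollouts):
        states, actions, rewards = rollout(env, pi)
        batch.append((states, actions, rewards))
        totals.append(sum(rewards))
    baseline = np.mean(totals)  # average roll-out reward as baseline

    g_theta, g_b = np.zeros_like(theta), 0.0
    for states, actions, rewards in batch:
        # discounted reward-to-go: sum_{k>=t} gamma^(k-t) * R(s_k, a_k)
        rtg, rtgs = 0.0, []
        for r in reversed(rewards):
            rtg = r + gamma * rtg
            rtgs.append(rtg)
        rtgs.reverse()
        for s, n, G in zip(states, actions, rtgs):
            p = action_prob(s, theta, b)
            gt, gb = grad_log_pi(s, n, p)
            g_theta += gt * (G - baseline)
            g_b += gb * (G - baseline)

    # average over roll-outs and step in the +gradient direction of J(theta)
    theta = theta + lr * g_theta / n_rollouts
    b = b + lr * g_b / n_rollouts
    return theta, b, baseline
```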