In this homework you will implement Q-learning with function approximation for the cart pole task [3] in the OpenAI Gym environment. As in the previous homework, ignore the done variable and terminate the episode after 500 iterations. You can consider the task solved if you consistently get a reward of +400.
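For reference, here is a minimal sketch of such an episode loop. The environment id CartPole-v1 and the classic Gym API (reset() returning the observation, step() returning a 4-tuple) are assumptions; adapt them to your Gym version, and replace the random action with your own policy.

import gym

# Hypothetical sketch: run one episode, ignore `done`, stop after 500 steps.
env = gym.make("CartPole-v1")            # environment id is an assumption
observation = env.reset()                # classic Gym API assumed

total_reward = 0.0
for t in range(500):                     # terminate the episode after 500 iterations
    action = env.action_space.sample()   # placeholder; replace with your Q-based policy
    observation, reward, done, info = env.step(action)
    total_reward += reward               # `done` is deliberately ignored
print("episode return:", total_reward)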
2 Function Approximation
Instead of using a large table, which is not feasible for continuous-valued variables, we can use a function. As you might have noticed in the first homework, you have to discretize states to keep a table. However, this is cumbersome in general, since you might not know anything about the environment, how to discretize it, and so on. What we do here instead is use a function approximator that directly gives us action values. After all, all we need is to select the best action.
As this task is quite easy, a linear transformation should suffice. You will observe a four-dimensional state. Your parameter set will be a [4, 2] matrix A and a [2] vector b. The computation is:
out = np.matmul(observation, A) + b
which will correspond to Q(s, a). Here, out is two-dimensional, one Q value for each action. To update A and b, you need some sort of direction, i.e., a supervision signal. Remember the update rule for Q-table learning:
$$Q_{\mathrm{new}}(s_t, a_t) = Q(s_t, a_t) + \alpha\Big(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\Big) \tag{1}$$
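As a concrete illustration of the linear Q-function described above, here is a minimal sketch of computing the Q-values and picking an action from them. The ε-greedy exploration scheme and the initialization of A and b are assumptions; the assignment does not prescribe them.

import numpy as np

# A: [4, 2] weight matrix, b: [2] bias vector, as described in the text.
A = np.random.randn(4, 2) * 0.1   # initialization scheme is an assumption
b = np.zeros(2)

def q_values(observation, A, b):
    """Linear Q-function: one value per action (shape [2])."""
    return np.matmul(observation, A) + b

def select_action(observation, A, b, epsilon=0.1):
    """ε-greedy action selection (exploration scheme is an assumption)."""
    if np.random.rand() < epsilon:
        return np.random.randint(2)
    return int(np.argmax(q_values(observation, A, b)))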
Motivated by update rule (1), we will use the following function as our loss function (also known as an objective function or error function) and update our parameters with respect to it:
$$L = \frac{1}{2}\Big(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - \mathrm{out}[a_t]\Big)^2 \tag{2}$$
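A minimal sketch of evaluating this loss for a single transition (s_t, a_t, r_t, s_{t+1}), reusing the linear Q-function above; the function name and the value of gamma are assumptions.

import numpy as np

def td_loss(obs, action, reward, next_obs, A, b, gamma=0.99):
    """Squared TD error of Eq. (2) for one transition (gamma value is a placeholder)."""
    q = np.matmul(obs, A) + b                 # out, shape [2]
    next_q = np.matmul(next_obs, A) + b       # Q(s_{t+1}, .)
    target = reward + gamma * np.max(next_q)  # r_t + gamma * max_a' Q(s_{t+1}, a')
    td_error = target - q[action]             # target - out[a_t]
    return 0.5 * td_error ** 2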
Minimizing this loss is known as temporal difference learning [2]. Since the loss function is differentiable with respect to our parameter set, we can use gradient-based learning. You need to find the gradients ∂L/∂A and ∂L/∂b.
$$\frac{\partial L}{\partial A} = \frac{\partial L}{\partial \mathrm{out}} \frac{\partial \mathrm{out}}{\partial A} \tag{3}$$
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \mathrm{out}} \frac{\partial \mathrm{out}}{\partial b} \tag{4}$$
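For the linear model, these chain-rule terms work out so that only the column of A and the entry of b belonging to the chosen action receive gradient. The sketch below treats the TD target as a constant (the usual semi-gradient convention, assumed here) and uses the same hypothetical variable names as above.

import numpy as np

def td_gradients(obs, action, reward, next_obs, A, b, gamma=0.99):
    """Gradients dL/dA and dL/db of Eq. (2), treating the target as a constant
    (semi-gradient assumption). Only the chosen action's column/entry is nonzero."""
    q = np.matmul(obs, A) + b
    next_q = np.matmul(next_obs, A) + b
    target = reward + gamma * np.max(next_q)
    td_error = target - q[action]            # delta

    dL_dA = np.zeros_like(A)
    dL_db = np.zeros_like(b)
    dL_dA[:, action] = -td_error * obs       # dL/dout * dout/dA
    dL_db[action] = -td_error                # dL/dout * dout/db
    return dL_dA, dL_db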
After you correctly calculate these gradients, you can update your parameters using stochastic gradient descent.
$$A \leftarrow A - \eta \frac{\partial L}{\partial A} \tag{5}$$
$$b \leftarrow b - \eta \frac{\partial L}{\partial b} \tag{6}$$
where η is the learning rate. As in the previous homework, the convergence of the algorithm depends on your hyperparameter settings. One of the most important things is to clip your parameters to a range [-lim, lim] to stabilize learning.
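Putting Eqs. (5) and (6) and the clipping advice together, here is a sketch of one parameter update; the values of eta and lim are placeholders, not prescribed by the assignment.

import numpy as np

def sgd_update(A, b, dL_dA, dL_db, eta=0.01, lim=5.0):
    """One SGD step (Eqs. (5) and (6)), then clip parameters to [-lim, lim].
    eta and lim are assumed example values; tune them as hyperparameters."""
    A = np.clip(A - eta * dL_dA, -lim, lim)
    b = np.clip(b - eta * dL_db, -lim, lim)
    return A, b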