Problem 1: Q-Learning with Continuous State
Consider a system with a single continuous state variable x and actions a1 and a2. An agent can observe the value of the state variable as well as the reward in the observed state. Assume a discount factor γ = 0.9.
(a) Assume that function approximation is used with $Q(x,a_1) = w_{0,1} + w_{1,1}x + w_{2,1}x^2$ and $Q(x,a_2) = w_{0,2} + w_{1,2}x + w_{2,2}x^2$. Give the Q-learning update equations.
$Q(x,a) \leftarrow Q(x,a) + \alpha\left[R(x) + 0.9\,\max_{a'} Q(x',a') - Q(x,a)\right]$,

where $x'$ is the observed next state, $a' \in \{a_1, a_2\}$, $\alpha$ is the learning rate, and $R$ is the reward. Since each $Q(x,a_j)$ is linear in its weights, this induces the per-weight updates worked out in part (b).
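A minimal Python sketch of this update with the quadratic feature vector $\phi(x) = (1, x, x^2)$ may help make it concrete; the names (`phi`, `q_learning_update`) and the NumPy layout are illustrative choices, not part of the problem statement:

```python
import numpy as np

def phi(x):
    """Quadratic features, so Q(x, a_j) = w_j . phi(x) = w0_j + w1_j*x + w2_j*x^2."""
    return np.array([1.0, x, x * x])

def q_learning_update(w, x, a, r, x_next, alpha=0.5, gamma=0.9):
    """One Q-learning step for the linear function approximator.

    w : 3x2 weight matrix; column j holds (w0_j, w1_j, w2_j) for action a_{j+1}
    a : index of the action taken (0 for a1, 1 for a2)
    """
    q_next = phi(x_next) @ w                      # Q(x', a') for both actions
    td_error = r + gamma * q_next.max() - phi(x) @ w[:, a]
    w = w.copy()
    w[:, a] += alpha * td_error * phi(x)          # dQ(x, a)/dw_{i,a} = x^i
    return w
```

Only the weights of the taken action move, because the other action's Q-value does not depend on them.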
(b) Assume that $w_{i,j} = 1$ for all $i, j$. The following transition is observed: $x = 0.5$, observed reward $r = 10$, action $a_1$, next state $x' = 1$. What are the updated values of the parameters assuming a learning rate of 0.5?
$w_{i,1} \leftarrow w_{i,1} + \alpha\left[r + 0.9\,\max_{a'} Q(x',a') - Q(x,a_1)\right]\frac{\partial Q(x,a_1)}{\partial w_{i,1}}$, where $i = 0,1,2$;

$\frac{\partial Q(x,a_1)}{\partial w_{i,1}} = x^i$, where $i = 0,1,2$;

With all weights equal to 1, $Q(0.5,a_1) = 1 + 0.5 + 0.25 = 1.75$ and $\max_{a'} Q(1,a') = 1 + 1 + 1 = 3$, so the scaled TD error is $\alpha\,\delta = 0.5\,(10 + 0.9 \cdot 3 - 1.75) = 0.5 \cdot 10.95 = 5.475$, and

$w_{0,1} \leftarrow 1 + 5.475 \cdot 1 = 6.475$; $w_{1,1} \leftarrow 1 + 5.475 \cdot 0.5 = 3.7375$; $w_{2,1} \leftarrow 1 + 5.475 \cdot 0.25 = 2.36875$.

The weights $w_{i,2}$ of the untaken action $a_2$ are unchanged.
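Plugging the numbers from part (b) into the sketch above reproduces these values:

```python
w = np.ones((3, 2))                               # w_{i,j} = 1 for all i, j
w = q_learning_update(w, x=0.5, a=0, r=10.0, x_next=1.0)
print(w[:, 0])                                    # [6.475   3.7375  2.36875]
print(w[:, 1])                                    # unchanged: [1. 1. 1.]
```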
Problem 2: Policy Gradient with Continuous State
Assume that a Q-function with function approximation is used together with the softmax function to form a policy $\pi_\theta(x,a) = e^{Q_\theta(x,a)} \big/ \sum_{a'} e^{Q_\theta(x,a')}$. Assume that there are two actions with $Q(x,a_1) = w_{0,1} + w_{1,1}x + w_{2,1}x^2$ and $Q(x,a_2) = w_{0,2} + w_{1,2}x + w_{2,2}x^2$ for a real-valued variable $x$.
(a) Give the update equations for the REINFORCE algorithm. Assume that the return at the current step is $G$ and the action taken is $a_1$.
$w_{i,j} \leftarrow w_{i,j} + \alpha\,G\,\nabla_{w_{i,j}} \ln \pi_w(x,a_1)$, where $i = 0,1,2$, $j = 1,2$, and $\alpha$ is the learning rate. For the softmax policy, the gradients are $\nabla_{w_{i,1}} \ln \pi_w(x,a_1) = \left(1 - \pi_w(x,a_1)\right)x^i$ and $\nabla_{w_{i,2}} \ln \pi_w(x,a_1) = -\pi_w(x,a_2)\,x^i$.
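A minimal Python sketch of this update, reusing `phi` from the sketch in Problem 1 (again, the function names are illustrative, not from the problem statement):

```python
def softmax_policy(w, x):
    """pi(a | x) as a softmax over Q(x, a1) and Q(x, a2)."""
    q = phi(x) @ w                       # Q-values for both actions
    e = np.exp(q - q.max())              # shift by max for numerical stability
    return e / e.sum()

def reinforce_update(w, x, a, G, alpha=0.5):
    """One REINFORCE step: w_{i,j} += alpha * G * d ln pi(a|x) / d w_{i,j}.

    For the softmax policy, d ln pi(a|x) / d w_{i,j} = (1[j == a] - pi(a_j|x)) * x^i.
    """
    pi = softmax_policy(w, x)
    w = w.copy()
    for j in range(w.shape[1]):
        indicator = 1.0 if j == a else 0.0
        w[:, j] += alpha * G * (indicator - pi[j]) * phi(x)
    return w
```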
(b) Assume that $w_{i,j} = 1$ for all $i, j$ and return $G = 5$ is received. What are the updated values of the parameters assuming $x = 0.5$ and a learning rate of 0.5?

With all weights equal to 1, $Q(0.5,a_1) = Q(0.5,a_2) = 1.75$, so $\pi_w(x,a_1) = \pi_w(x,a_2) = 0.5$. Applying $w_{i,j} \leftarrow w_{i,j} + \alpha\,G\,\nabla_{w_{i,j}} \ln \pi_w(x,a_1)$ with $\alpha G = 2.5$ and $x^i \in \{1, 0.5, 0.25\}$:

$w_{i,1} \leftarrow 1 + 2.5 \cdot 0.5 \cdot x^i$, giving $w_{0,1} = 2.25$, $w_{1,1} = 1.625$, $w_{2,1} = 1.3125$;

$w_{i,2} \leftarrow 1 - 2.5 \cdot 0.5 \cdot x^i$, giving $w_{0,2} = -0.25$, $w_{1,2} = 0.375$, $w_{2,2} = 0.6875$.
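The sketch above reproduces these values:

```python
w = np.ones((3, 2))                      # w_{i,j} = 1 for all i, j
w = reinforce_update(w, x=0.5, a=0, G=5.0)
print(w[:, 0])                           # [2.25   1.625  1.3125]
print(w[:, 1])                           # [-0.25  0.375  0.6875]
```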