CS5446-AI Planning and Decision Making 3 Solved

Problem 1: Q-Learning with Continuous State

Consider a system with a single continuous state variable x and actions a1 and a2. An agent can observe the value of the state variable as well as the reward in the observed state. Assume a discount factor γ = 0.9.

(a)    Assume that function approximation is used with $Q(x,a_1) = w_{0,1} + w_{1,1}x + w_{2,1}x^2$ and $Q(x,a_2) = w_{0,2} + w_{1,2}x + w_{2,2}x^2$. Give the Q-learning update equations.

$Q(x,a) \leftarrow Q(x,a) + \alpha\left[R(x) + 0.9\,\max_{a'} Q(x',a') - Q(x,a)\right]$,

where $a' \in \{a_1, a_2\}$, $\alpha$ is the learning rate, and $R$ is the reward. With function approximation, the update is applied to the weights through the gradient $\partial Q(x,a_j)/\partial w_{i,j} = x^i$:

$w_{i,j} \leftarrow w_{i,j} + \alpha\left[R(x) + 0.9\,\max_{a'} Q(x',a') - Q(x,a_j)\right]x^i$.
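As a concrete illustration, here is a minimal Python sketch of this update, assuming the two actions are indexed 0 and 1 and the six weights are stored in a 3×2 NumPy array; the function names and layout are illustrative, not part of the original solution.

```python
import numpy as np

def features(x):
    """Quadratic feature vector [1, x, x^2], so Q(x, a) = w0 + w1*x + w2*x^2."""
    return np.array([1.0, x, x * x])

def q_value(w, x, a):
    """Q(x, a) under the linear-in-weights approximation; w has shape (3, 2)."""
    return features(x) @ w[:, a]

def q_learning_update(w, x, a, r, x_next, alpha=0.5, gamma=0.9):
    """One Q-learning step on the weights of the taken action a."""
    target = r + gamma * max(q_value(w, x_next, 0), q_value(w, x_next, 1))
    td_error = target - q_value(w, x, a)
    w[:, a] += alpha * td_error * features(x)  # dQ(x, a)/dw[i, a] = x**i
    return w
```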

(b)   Assume that $w_{i,j} = 1$ for all $i, j$. The following transition is observed: $x = 0.5$, observed reward $r = 10$, action $a_1$, next state $x' = 1$. What are the updated values of the parameters assuming a learning rate of 0.5?

$w_{i,1} \leftarrow w_{i,1} + \alpha\left[r + 0.9\,\max_{a'} Q(x',a') - Q(x,a_1)\right]x^i$, where $i = 0,1,2$;

$w_{i,1} \leftarrow 1 + 0.5 \times [10 + 0.9 \times 3 - 1.75] \times x^i = 1 + 5.475\,x^i$, where $i = 0,1,2$;

$w_{0,1} \leftarrow 1 + 5.475 \times 1 = 6.475$; $w_{1,1} \leftarrow 1 + 5.475 \times 0.5 = 3.7375$; $w_{2,1} \leftarrow 1 + 5.475 \times 0.25 = 2.36875$. The weights for $a_2$ are unchanged, since only the taken action's Q-function is updated.
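Plugging the observed transition into the update reproduces these numbers; a standalone check in Python (the variable names are mine, not the assignment's):

```python
import numpy as np

phi = lambda x: np.array([1.0, x, x * x])  # features [1, x, x^2]
w1 = np.ones(3)  # weights w_{i,1} for a1, all initialised to 1
w2 = np.ones(3)  # weights w_{i,2} for a2

x, r, x_next, alpha, gamma = 0.5, 10.0, 1.0, 0.5, 0.9
target = r + gamma * max(phi(x_next) @ w1, phi(x_next) @ w2)  # 10 + 0.9 * 3 = 12.7
delta = target - phi(x) @ w1                                  # 12.7 - 1.75 = 10.95
w1 += alpha * delta * phi(x)                                  # 1 + 5.475 * x**i
print(w1)  # [6.475, 3.7375, 2.36875]
```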

Problem 2: Policy Gradient with Continuous State

Assume that a Q-function with function approximation is used together with the softmax function to form a policy $\pi_\theta(s,a) = e^{Q_\theta(s,a)} / \sum_{a'} e^{Q_\theta(s,a')}$. Assume that there are two actions with $Q(x,a_1) = w_{0,1} + w_{1,1}x + w_{2,1}x^2$ and $Q(x,a_2) = w_{0,2} + w_{1,2}x + w_{2,2}x^2$ for a real-valued variable $x$.

(a)    Give the update equations for the REINFORCE algorithm. Assume that the return at the current step is G and the action taken is $a_1$.

$w_{i,j} \leftarrow w_{i,j} + \alpha\, G\, \nabla_{w_{i,j}} \ln \pi_\theta(x,a_1)$, where $i = 0,1,2$, $j = 1,2$, and $\alpha$ is the learning rate. For the softmax policy, the log-policy gradients are $\nabla_{w_{i,1}} \ln \pi_\theta(x,a_1) = (1 - \pi_\theta(x,a_1))\,x^i$ and $\nabla_{w_{i,2}} \ln \pi_\theta(x,a_1) = -\pi_\theta(x,a_2)\,x^i$.
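A minimal Python sketch of this update, under the same assumptions as the earlier sketch (actions indexed 0 and 1, weights in a 3×2 array; names are illustrative):

```python
import numpy as np

def features(x):
    """Quadratic feature vector [1, x, x^2]."""
    return np.array([1.0, x, x * x])

def policy(w, x):
    """Softmax over the two approximated Q-values; w has shape (3, 2)."""
    q = features(x) @ w
    e = np.exp(q - q.max())  # subtract max for numerical stability
    return e / e.sum()

def reinforce_update(w, x, a, G, alpha=0.5):
    """REINFORCE step: w[i, j] += alpha * G * d ln pi(a | x) / dw[i, j].

    For a softmax policy, d ln pi(a|x) / dw[i, j] = (1{j == a} - pi(j|x)) * x**i:
    the taken action's weights move toward higher probability, the other
    action's weights move away, both scaled by the return G.
    """
    pi = policy(w, x)
    for j in range(2):
        w[:, j] += alpha * G * ((1.0 if j == a else 0.0) - pi[j]) * features(x)
    return w
```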

(b)   Assume that $w_{i,j} = 1$ for all $i, j$ and return $G = 5$ is received. What are the updated values of the parameters assuming $x = 0.5$ and a learning rate of 0.5?

$w_{i,j} \leftarrow w_{i,j} + \alpha\, G\, \nabla_{w_{i,j}} \ln \pi_\theta(x,a_1)$, where $i = 0,1,2$ and $j = 1,2$. With all weights equal to 1, $Q(x,a_1) = Q(x,a_2) = 1 + 0.5 + 0.25 = 1.75$, so $\pi_\theta(x,a_1) = \pi_\theta(x,a_2) = 0.5$ and $\alpha G = 2.5$. Using the gradients from part (a):

$w_{i,1} \leftarrow 1 + 2.5 \times (1 - 0.5)\,x^i = 1 + 1.25\,x^i$, giving $w_{0,1} = 2.25$, $w_{1,1} = 1.625$, $w_{2,1} = 1.3125$;

$w_{i,2} \leftarrow 1 - 2.5 \times 0.5\,x^i = 1 - 1.25\,x^i$, giving $w_{0,2} = -0.25$, $w_{1,2} = 0.375$, $w_{2,2} = 0.6875$.
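The arithmetic can be checked with a short standalone script (again with illustrative names):

```python
import numpy as np

phi = lambda x: np.array([1.0, x, x * x])  # features [1, x, x^2]
w = np.ones((3, 2))                        # w_{i,j} = 1 for all i, j
x, a, G, alpha = 0.5, 0, 5.0, 0.5          # action index 0 stands for a1

q = phi(x) @ w                     # Q(x, a1) = Q(x, a2) = 1.75
pi = np.exp(q) / np.exp(q).sum()   # softmax -> [0.5, 0.5]
for j in range(2):
    w[:, j] += alpha * G * ((j == a) - pi[j]) * phi(x)

print(w[:, 0])  # [ 2.25,  1.625, 1.3125]  weights for the taken action a1
print(w[:, 1])  # [-0.25,  0.375, 0.6875]  weights for a2
```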
