EE526-Homework 5 Solved

Problem 1.

(a)   Find the shortest path from start to end in the following figure.



(b)   Find the longest path from start to end in the following figure.



Problem 2. Consider an MDP with two states S = {0, 1}, two actions A = {1, 2}, and the following reward function

R(s,a) = 1,  (s,a) = (0,1)
         4,  (s,a) = (0,2)
         3,  (s,a) = (1,1)
         2,  (s,a) = (1,2)                                           (1)

and the transition probabilities as follows:

                                                                     (2)

The other probabilities can be deduced, for example:

                                                                     (3)

The discount factor is

γ = 3/4.                                                             (4)
(a)    For the policy π that chooses action 1 in state 0 and action 2 in state 1, find the state value function vπ(s) by writing out Bellman’s expectation equation and solving it explicitly (a code sketch covering parts (a)-(f) follows part (f)).

(b)   For the same policy π, obtain the state value function using iterative updates based on Bellman’s expectation equation. List the first 5 iteration values of v(s).

(c)    For the policy π, calculate the qπ(s,a) function.

(d)   Based on the value function vπ(s), obtain an improved policy π′ via

π′(s) = argmax_a qπ(s,a).                                            (5)

(e)    Obtain the optimal value function v∗(s) using value iteration based on Bellman’s optimality equation, with all initial values set to 0.

(f)     Obtain the optimal policy.
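
The answers to parts (a)-(f) depend on the transition probabilities in (2), which are not reproduced above. The NumPy sketch below shows how each part can be computed once those probabilities are filled in: part (a) solves the linear system v_π = R_π + γ P_π v_π, part (c) uses q_π(s,a) = R(s,a) + γ Σ_s′ P(s′|s,a) v_π(s′), and parts (e)-(f) run value iteration. The matrix P in the code is only a placeholder and must be replaced with the values from (2).

import numpy as np

# States 0, 1; actions 1, 2 stored at indices 0, 1.
# Reward table R[s, a-1] from equation (1); discount factor from (4).
R = np.array([[1.0, 4.0],
              [3.0, 2.0]])
gamma = 0.75

# PLACEHOLDER transition model: equation (2) is not reproduced above, so
# P[s, a-1, s2] = P(s2 | s, a) below is a stand-in. Replace it with the
# actual probabilities from (2) before trusting any of the printed numbers.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])

# Policy pi: action 1 in state 0, action 2 in state 1 (as action indices).
pi = np.array([0, 1])
R_pi = R[np.arange(2), pi]            # reward earned under pi
P_pi = P[np.arange(2), pi, :]         # state-transition matrix under pi

# (a) Explicit solution of Bellman's expectation equation:
#     v_pi = R_pi + gamma * P_pi v_pi  <=>  (I - gamma * P_pi) v_pi = R_pi.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print("(a) v_pi =", v_pi)

# (b) Iterative policy evaluation: first 5 sweeps starting from v = 0.
v = np.zeros(2)
for k in range(1, 6):
    v = R_pi + gamma * P_pi @ v
    print(f"(b) iteration {k}: v =", v)

# (c) Action values q_pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v_pi(s').
q_pi = R + gamma * np.einsum("sat,t->sa", P, v_pi)
print("(c) q_pi =\n", q_pi)

# (d) Greedy policy improvement: pi'(s) = argmax_a q_pi(s,a).
print("(d) improved policy pi' =", np.argmax(q_pi, axis=1) + 1)

# (e) Value iteration from v = 0, using Bellman's optimality equation.
v_star = np.zeros(2)
for _ in range(1000):
    v_new = np.max(R + gamma * np.einsum("sat,t->sa", P, v_star), axis=1)
    done = np.max(np.abs(v_new - v_star)) < 1e-10
    v_star = v_new
    if done:
        break
print("(e) v_star =", v_star)

# (f) Optimal policy: greedy with respect to v_star.
q_star = R + gamma * np.einsum("sat,t->sa", P, v_star)
print("(f) pi_star =", np.argmax(q_star, axis=1) + 1)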

Problem 3. Consider an MDP with two states S = {0, 1}, two actions A = {1, 2}, and the following reward function

R(s,a) = 1,  (s,a) = (0,1)
         4,  (s,a) = (0,2)
         3,  (s,a) = (1,1)
         2,  (s,a) = (1,2)                                           (6)

and the transition probabilities as follows:

                                                                     (7)

The other probabilities can be deduced, for example:

                                                                     (8)

The discount factor is

γ = 3/4.                                                             (9)
Exercise on model-free prediction:

(a)    For the policy π that chooses action 1 in state 0 and action 2 in state 1, starting from state 0, generate one episode E of 10000 triplets (Ri, Si, Ai), i = 0, 1, ..., 9999, with R0 = 0, S0 = 0 (a code sketch covering parts (a)-(c) follows part (c)).

(b)    Based on the episode E, use Monte Carlo policy evaluation to estimate the value function vπ(s).

(c)    Based on the episode E, use n-step temporal difference policy evaluation to estimate the value function vπ(s).
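
A minimal sketch of parts (a)-(c). The same caveat as in Problem 2 applies: the transition probabilities in (7) are not shown above, so the matrix P below is a placeholder to be replaced with the actual values. The step size α = 0.01 and the horizon n = 4 used for n-step TD are illustrative choices, not values given in the problem.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.75
R = np.array([[1.0, 4.0],          # R[s, a-1] from equation (6)
              [3.0, 2.0]])

# PLACEHOLDER transition model: replace with the probabilities from (7).
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]]])

pi = np.array([0, 1])              # action 1 in state 0, action 2 in state 1

# (a) One episode of T = 10000 triplets (R_i, S_i, A_i) with R_0 = 0, S_0 = 0,
#     where R_{i+1} is the reward earned by taking A_i in S_i.
T = 10000
S, A, Rew = np.zeros(T, dtype=int), np.zeros(T, dtype=int), np.zeros(T)
for i in range(T):
    A[i] = pi[S[i]]                # actions stored as indices 0/1 for labels 1/2
    if i + 1 < T:
        Rew[i + 1] = R[S[i], A[i]]
        S[i + 1] = rng.choice(2, p=P[S[i], A[i]])

# (b) Every-visit Monte Carlo: v(s) is the average of the (truncated)
#     discounted return G_i = R_{i+1} + gamma R_{i+2} + ... after each visit to s.
G = np.zeros(T)
for i in range(T - 2, -1, -1):
    G[i] = Rew[i + 1] + gamma * G[i + 1]
v_mc = np.array([G[S == s].mean() for s in (0, 1)])
print("(b) Monte Carlo estimate:", v_mc)

# (c) n-step TD prediction with a fixed step size.
n, alpha = 4, 0.01
v_td = np.zeros(2)
for i in range(T - n):
    target = sum(gamma**k * Rew[i + 1 + k] for k in range(n)) + gamma**n * v_td[S[i + n]]
    v_td[S[i]] += alpha * (target - v_td[S[i]])
print("(c) n-step TD estimate:", v_td)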

Exercise on model-free control:

(a)    Use the SARSA algorithm to estimate the optimal action-value function q∗(s,a), by running the algorithm in Sutton and Barto’s book (2nd edition, available online).

(b)   Use the Q-learning algorithm to estimate the optimal action-value function q∗(s,a), by running the algorithm in Sutton and Barto’s book (2nd edition, available online).

You only need to simulate one episode. In both cases, you will need to choose an appropriate fixed step size α, exploration probability ε, and number of time steps in the episode; one possible set of choices is shown in the sketch below.
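
A combined sketch of both algorithms, following the tabular SARSA and Q-learning pseudocode in Sutton and Barto (2nd edition). The transition matrix P is again a placeholder for the values in (7), and the settings T = 10000, α = 0.1, ε = 0.1 are illustrative choices rather than prescribed ones.

import numpy as np

def run_control(method="sarsa", T=10000, alpha=0.1, eps=0.1, gamma=0.75, seed=0):
    """Run one long episode of tabular SARSA or Q-learning on the two-state MDP."""
    rng = np.random.default_rng(seed)
    R = np.array([[1.0, 4.0],                  # R[s, a-1] from equation (6)
                  [3.0, 2.0]])
    # PLACEHOLDER transition model: replace with the probabilities from (7).
    P = np.array([[[0.5, 0.5], [0.5, 0.5]],
                  [[0.5, 0.5], [0.5, 0.5]]])

    def eps_greedy(q_row):
        # Behavior policy: random action with probability eps, otherwise greedy.
        return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q_row))

    Q = np.zeros((2, 2))
    s = 0
    a = eps_greedy(Q[s])
    for _ in range(T):
        r = R[s, a]
        s2 = rng.choice(2, p=P[s, a])
        if method == "sarsa":
            a2 = eps_greedy(Q[s2])             # bootstrap on the action actually taken next
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
            s, a = s2, a2
        else:                                  # Q-learning: bootstrap on max_a Q(s2, a)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
            a = eps_greedy(Q[s])
    return Q

print("(a) SARSA estimate of q*:\n", run_control("sarsa"))
print("(b) Q-learning estimate of q*:\n", run_control("qlearning"))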
