RL-TAU-Homework 2 Solved

Theory


Question 1
Consider a Markov chain with states S = {1,2,3,4} and the following transition matrix:

1.    Draw the state transition diagram.

2.    State the communicating classes. Is the chain irreducible?

3.    Calculate di, the period of each state. Is the chain aperiodic?

4.    If possible, find the invariant distribution.

5.    For every state i ∈ S, calculate E[Ti], the expected return time to state i (hint: use the previous clause; a generic numerical sketch follows this list). Which states are recurrent? Classify each recurrent state as null or positive.

6.    Write a new transition matrix P′ so that E[T1] = 3 and d1 = 3. Calculate  as a function of m for that transition matrix, given that 1 is the initial state.
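As referenced in clause 5, for any irreducible, positive-recurrent chain the invariant distribution π solves πP = π with Σ_i π_i = 1, and the expected return times satisfy E[Ti] = 1/π_i. The Python sketch below illustrates that machinery only; the 4x4 matrix in it is a hypothetical placeholder, not the transition matrix of this question.

import numpy as np

# Hypothetical 4-state transition matrix -- NOT the matrix from the question,
# used only to illustrate computing pi and the relation E[T_i] = 1 / pi_i.
P = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])

# Invariant distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

print("invariant distribution pi:", pi)
print("expected return times 1/pi_i:", 1.0 / pi)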


Question 2
An MDP is represented by a bidirectional cycle of size 2k, where each node ∈ {0,...,2k−1} is a state. The MDP starts at state k. There are two available actions, CW (clockwise) and CCW (counter-clockwise), both of which are deterministic. (CW moves state i to (i + 1) mod 2k and CCW moves state i to (i − 1) mod 2k.)

The reward for every state-action pair (s,a) is 0, except at state 0, which yields a (running) reward of 1 (for both CW and CCW).

The return is discounted with parameter γ < 1.
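Before answering the items below, it may help to pin the model down in code. The following is a minimal sketch, assuming a tabular encoding with arrays P[s, a, s′] and R[s, a]; the helper name build_cycle_mdp and this encoding are illustrative choices, not part of the assignment.

import numpy as np

def build_cycle_mdp(k):
    """Deterministic bidirectional cycle with 2k states, actions 0 = CW and
    1 = CCW, running reward 1 for both actions at state 0 and 0 elsewhere."""
    n = 2 * k
    P = np.zeros((n, 2, n))          # P[s, a, s'] = transition probability
    R = np.zeros((n, 2))             # R[s, a]     = running reward
    for s in range(n):
        P[s, 0, (s + 1) % n] = 1.0   # CW:  s -> (s + 1) mod 2k
        P[s, 1, (s - 1) % n] = 1.0   # CCW: s -> (s - 1) mod 2k
    R[0, :] = 1.0                    # only state 0 yields reward
    s0 = k                           # the MDP starts at state k
    return P, R, s0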

 

1.    Write a formal MDP for this problem (i.e., S,A,P,R and s0).

2.    What is the optimal policy π∗?

3.    After one iteration of the Value Iteration algorithm (where all states start with a value of 0), which states change their value? What are the new values?

4.    Which states change their value after two iterations? What are the new values?

5.    For k = 2 (4 states), calculate V ∗(s) for every s ∈ S (as a function of γ).

Question 3
Consider the following digit game. We have N + 1 slots, marked 0 to N. At each round (of N + 1 rounds) a random digit is chosen. (I.e., a random digit from {0,1,...,9}.) We can put the digit in one of the unoccupied slots, and that slot becomes occupied. Our aim is to maximize the expected value of the sequence of digits, when viewed as a number. (I.e., if we have di in slot i, then the final value is the number dN dN−1 ··· d1 d0 = Σ_{i=0}^{N} di · 10^i, so slot 0 holds the least significant digit.)

Example: Suppose N = 2. At the beginning we have three empty slots. Let us denote an empty slot by ×, then we have × × ×. Suppose the first digit is 5, and we decide to put it in the second slot, then we have ×5×. Suppose the second digit is 4 and we decide to put it in the first slot, we have ×54. Finally, the last digit is 4 again, and our final number is 454.
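To make the objective concrete, here is a small sketch of how a finished board is scored and how a (non-optimal) uniformly random placement rule plays one game for N = 2; the function name number_value and the random baseline are illustrative assumptions, and the scoring follows the convention above that slot i contributes di · 10^i.

import random

def number_value(slots):
    # slots[i] is the digit d_i in slot i; slot 0 is the least significant digit.
    return sum(d * 10 ** i for i, d in enumerate(slots))

# Baseline play for N = 2 (three slots): draw a digit each round and put it
# in a uniformly random empty slot.  This is a baseline, not the optimal policy.
N = 2
slots = [None] * (N + 1)
for _ in range(N + 1):
    digit = random.randint(0, 9)
    empty = [i for i, d in enumerate(slots) if d is None]
    slots[random.choice(empty)] = digit
print(slots, "->", number_value(slots))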

1.    Build an MDP for this problem. (Hint: you may incorporate the random digit in the state.)

2.    Show that the decision of the optimal policy depends only on the number of empty slots and the current random digit. For example, in round 2, if the game is in state d0××, ×d0×, or ××d0 and the current digit is d1, then for every d0, d1 ∈ {0,1,...,9} the optimal policy will place d1 to reach d0d1×, d1d0×, or d1×d0 respectively, or to reach d0×d1, ×d0d1, or ×d1d0 respectively.

3.    Compute the optimal policy for N = 2. (I.e., the number has three digits.)

Question 4
Let M be an MDP, defined by (S,A,P,S0,r).

Let T be the following operator for V ∈ R^{|S|}:

(TV)(s) = max_{a∈A} Σ_{s′} P(s,a,s′) [r(s,a,s′) + γ V(s′)]

Show that T is γ-contracting with respect to the max-norm.
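The proof is the exercise, but the claim can be probed numerically: for arbitrary vectors U and V, one application of T should shrink their max-norm distance by at least a factor γ. Below is a minimal sketch under the assumption that T is the Bellman optimality operator written above, applied to a randomly generated MDP (the construction is purely illustrative).

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random MDP: P[s, a, :] is a distribution over next states, R[s, a, s'] a reward.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))

def T(V):
    # (TV)(s) = max_a sum_{s'} P(s,a,s') [R(s,a,s') + gamma V(s')]
    return np.max(np.einsum('san,san->sa', P, R + gamma * V[None, None, :]), axis=1)

U, V = rng.random(nS), rng.random(nS)
lhs = np.max(np.abs(T(U) - T(V)))    # ||TU - TV||_inf
rhs = gamma * np.max(np.abs(U - V))  # gamma * ||U - V||_inf
print(lhs <= rhs, lhs, rhs)          # the contraction bound should hold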

Programming

Intro
The following questions will use the Frozen Lake environment, a simple gridworld MDP taken from gym and slightly modified for this assignment. In this MDP, the agent must navigate from the start state to the goal state on a 4x4 grid, with stochastic transitions. You can read more about the environment here: https://gym.openai.com/envs/FrozenLake-v0/. You are provided with template Python scripts, value_iter.py and policy_iter.py, in which you should implement the algorithms.

An example episode is given below; the agent falls into a hole after two timesteps. Note the stochasticity: on the first step, the DOWN action is selected, but the agent moves to the right.

 
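If you want to interact with the environment before implementing anything, a loop along the following lines works with the classic gym API. Since the assignment uses a slightly modified FrozenLake, the provided templates may construct the environment differently, so treat this as a sketch.

import gym

env = gym.make("FrozenLake-v0")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # a random action, just to explore
    obs, reward, done, info = env.step(action)  # note the stochastic transition
    env.render()
print("episode finished, final reward:", reward)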

Question 1: Value Iteration
In this problem, you will implement value iteration, which has the following pseudocode:

Value Iteration

Initialize V^(0)(s) = 0 for all s ∈ S
for i = 0, 1, 2, ... do
    for all s ∈ S do
        V^(i+1)(s) = max_a Σ_{s′} P(s,a,s′) [R(s,a,s′) + γ V^(i)(s′)]

 

We additionally define the sequence of greedy policies π^(0), π^(1), ..., π^(n−1), where

    π^(i)(s) = argmax_a Σ_{s′} P(s,a,s′) [R(s,a,s′) + γ V^(i)(s′)]                (1)
Your code will return two lists: [V^(0), V^(1), ..., V^(n)] and [π^(0), π^(1), ..., π^(n−1)].
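A direct NumPy transcription of the pseudocode, returning the two lists in the order described above, might look roughly like the following. The array-based MDP representation and the function name are assumptions of this sketch; the actual interface is fixed by the provided value_iter.py template.

import numpy as np

def value_iteration(P, R, gamma, n_iterations):
    # Assumes P and R have shape (nS, nA, nS), with P[s, a, s'] the transition
    # probability and R[s, a, s'] the reward (the template may differ).
    nS = P.shape[0]
    Vs = [np.zeros(nS)]   # V^(0) = 0
    pis = []
    for _ in range(n_iterations):
        V = Vs[-1]
        # Q[s, a] = sum_{s'} P(s,a,s') [R(s,a,s') + gamma V^(i)(s')]
        Q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
        Vs.append(Q.max(axis=1))       # V^(i+1), the update in the pseudocode
        pis.append(Q.argmax(axis=1))   # pi^(i), the greedy policy of equation (1)
    return Vs, pis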
