Course: Reinforcement Learning (CS6700)
Instructions
1. Work on your own. You may discuss the problems with your classmates and use books or the web. However, the solutions you submit must be your own, and you must acknowledge all sources (names of people you worked with, books, webpages, etc., including class notes). Failure to do so will be considered cheating. Identical or similar write-ups will also be considered cheating.
2. In your submission, add the following declaration at the outset:
“I pledge that I have not copied or given any unauthorized assistance on this assignment.”
3. The assignment has two parts. The first part involves theoretical exercises, while the second part requires programming. For the first part, write/typeset the solutions and upload them on moodle. For the second part, you are required to submit your work through a separate interface (check the details in Section II below).
I. Theory exercises
Problem 2.
Consider an n-state discounted problem with bounded single stage cost g(i, a), discount factor α ∈ (0,1), and transition probabilities pij(a). For each j = 1, . . . , n, let
mj = min_{i=1,...,n} min_{a∈A(i)} pij(a).
For all i, j, a, let
p̃ij(a) = (pij(a) − mj) / (1 − ∑_{k=1}^{n} mk),
assuming ∑_{k=1}^{n} mk < 1.
(a) Show that p˜ij are transition probabilities.
(b) Consider a modified discounted cost problem with single stage cost g(i, a), discount factor α̃ = α(1 − ∑_{k=1}^{n} mk), and transition probabilities p̃ij(a). Show that this problem has the same optimal policy as the original, and that its optimal cost J̃ satisfies
J∗ = J̃ + (α ∑_{j=1}^{n} mj J̃(j) / (1 − α)) e,
where J∗ is the optimal cost vector of the original problem, and e is the n-vector of all ones.
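As a quick numerical sanity check (not a substitute for the proof asked for above), the following Python sketch builds a small random MDP, applies the transformation with the modified discount factor α(1 − ∑k mk) stated in part (b), solves both problems by value iteration, and compares J∗ against J̃ + (α ∑j mj J̃(j) / (1 − α)) e. The random instance, array names, and iteration counts are illustrative choices, not part of the problem.

import numpy as np

rng = np.random.default_rng(0)
n, nA, alpha = 4, 3, 0.9

# Bounded single-stage costs g(i, a) and transition probabilities p_ij(a).
g = rng.uniform(0.0, 1.0, size=(n, nA))
P = rng.uniform(0.1, 1.0, size=(n, nA, n))
P /= P.sum(axis=2, keepdims=True)

def value_iteration(P, g, alpha, iters=5000):
    # Value iteration for a discounted cost-minimization problem.
    J = np.zeros(P.shape[0])
    for _ in range(iters):
        J = np.min(g + alpha * (P @ J), axis=1)
    return J

m = P.min(axis=(0, 1))                      # m_j = min_i min_a p_ij(a)
P_tilde = (P - m) / (1.0 - m.sum())         # modified transition probabilities
alpha_tilde = alpha * (1.0 - m.sum())       # modified discount factor

J_star = value_iteration(P, g, alpha)
J_tilde = value_iteration(P_tilde, g, alpha_tilde)

c = alpha * np.dot(m, J_tilde) / (1.0 - alpha)
print(np.max(np.abs(J_star - (J_tilde + c))))   # should be close to 0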
Problem 3.
Consider a variant of the discounted MDP where in state xk at time instant k, a coin with bias β ∈ (0,1) is tossed. If the coin turns up heads, then the MDP will transition to the next state according to the underlying transition probabilities. On the other hand, if the coin turns up tails, then the MDP will transition to a special cost-free and absorbing state, say T.
1. Show that the variant described above can be cast into the discounted MDP framework.
2. What would be the discount factor of the MDP variant? Would the MDP variant continue to be of discounted type, if the discount factor α of the original MDP is 1?
Problem 4.
Consider a finite-horizon MDP, and derive a policy iteration algorithm for this setting.
Problem 5.
In the “We Care” software firm, every six months a manager has to decide on “incentives” for his/her managees. Each managee is either of Type I or Type II. The first type works for the salary, while the second type works out of passion. The manager can choose to give perks to all the managees of a single type, and has to decide which managee class to incentivize. The cost of the incentive is 5000 INR per managee. If a Type-I managee gets an incentive, then he/she contributes 7500 INR to the company's profit, while his/her contribution when not receiving an incentive is 2500 INR. The corresponding numbers for Type-II managees are 20000 INR and 10000 INR, respectively. Note that the type of a managee is not static, and incentives are doled out at the beginning of each six-month period. If a Type-I managee receives an incentive, then with probability (w.p.) 0.75 he/she becomes a Type-II person at the end of the six-month period, while if he/she does not receive an incentive, w.p. 0.75 he/she stays Type-I. A Type-II person stays Type-II w.p. 0.8 on receiving an incentive, and stays Type-II w.p. 0.4 if not given an incentive.
(a) Formulate this as an infinite-horizon discounted MDP, with discount factor 0.9.
(b) Find an optimal policy using policy iteration, starting with a greedy policy that picks an action with the highest single-stage reward.
(c) Run value iteration and Gauss-Seidel value iteration with δ = 0.1.
II. Simulation exercises
Question 1. Consider the problem of a taxi driver who serves three towns A, B and C. The taxi driver can find a new ride by choosing one of the following actions.
1. Cruise the streets looking for a passenger.
2. Go to the nearest taxi stand and wait in line.
3. Wait for a call from the dispatcher (this is not possible in town B because of poor reception).
For a given town and a given action, there is a probability that the next trip will go to each of the towns A, B and C, and a corresponding reward in monetary units associated with each such trip. This reward represents the income from the trip after all necessary expenses have been deducted. Please refer to Table 1 below for the rewards and transition probabilities. In Table 1, pijk is the probability of getting a ride to town j by choosing action k while the driver is in town i, and rijk is the immediate reward of getting a ride to town j by choosing action k while the driver is in town i.
Town i   Action k   Probabilities pijk            Rewards rijk
                    j = A     j = B    j = C      j = A    j = B    j = C
A        1          1/2       1/4      1/4        10       4        8
         2          1/16      3/4      3/16       8        2        4
         3          1/4       1/8      5/8        4        6        4
B        1          1/2       0        1/2        14       0        18
         2          1/16      7/8      1/16       8        16       8
C        1          1/4       1/4      1/2        10       2        8
         2          1/8       3/4      1/8        6        4        2
         3          3/4       1/16     3/16       4        0        8

Table 1: Taxi Problem: Probabilities and Rewards
Suppose 1 − γ is the probability that the taxi will break down before the next trip. The driver's goal is to maximize the total expected reward earned until the taxi breaks down.
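For the simulation, one convenient way to hold the data of Table 1 is a pair of arrays indexed by (town, action, next town). The sketch below is a minimal Python/NumPy encoding under the indexing convention A, B, C → 0, 1, 2 and cruise, stand, dispatcher → 0, 1, 2 (town B simply has no third action); these names and conventions are a choice, not a requirement of the assignment.

import numpy as np

# P[i][k, j]: probability that the next ride goes to town j when the driver
# is in town i and chooses action k; R[i][k, j]: the corresponding reward.
P = {
    0: np.array([[1/2,  1/4,  1/4 ],     # A: cruise
                 [1/16, 3/4,  3/16],     # A: stand
                 [1/4,  1/8,  5/8 ]]),   # A: dispatcher
    1: np.array([[1/2,  0,    1/2 ],     # B: cruise
                 [1/16, 7/8,  1/16]]),   # B: stand (no dispatcher in B)
    2: np.array([[1/4,  1/4,  1/2 ],     # C: cruise
                 [1/8,  3/4,  1/8 ],     # C: stand
                 [3/4,  1/16, 3/16]]),   # C: dispatcher
}
R = {
    0: np.array([[10, 4, 8], [8, 2, 4], [4, 6, 4]]),
    1: np.array([[14, 0, 18], [8, 16, 8]]),
    2: np.array([[10, 2, 8], [6, 4, 2], [4, 0, 8]]),
}

# Expected immediate reward q(i, k) = sum_j p_ijk * r_ijk, used by the
# dynamic-programming algorithms; the survival probability gamma (the taxi
# does not break down before the next trip) plays the role of the discount factor.
q = {i: (P[i] * R[i]).sum(axis=1) for i in P}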
1.1: Find an optimal policy using policy iteration (Algorithm 3), starting with a policy that always cruises, independent of the town, and a zero value vector. Let γ = 0.9.
1.2: Run policy iteration for discount factors γ ranging from 0 to 0.95 with intervals of 0.05 and display the results.
1.3: Find an optimal policy using modified policy iteration (Algorithm 4), starting with a policy that always cruises, independent of the town, and a zero value vector. Let γ = 0.9 and m = 5.
1.4: Find optimal values using value iteration (Algorithm 1), starting with a zero vector. Let γ = 0.9.
1.5: Find optimal values using Gauss-Seidel value iteration (Algorithm 2), starting with a zero vector. Let γ = 0.9.
1.a How do different values of γ affect the policy iteration results from 1.2? Explain your findings.
1.b For modified policy iteration from 1.3, do you find any improvement if you choose m = 10? Explain your findings.
1.c Compare and contrast the behavior of value iteration from 1.4 and Gauss-Seidel value iteration from 1.5.
Question 2. Consider a 10 × 10 gridworld (GridWorld-1) as shown in Figure 1.
Figure 1: GridWorld-1
• State space: The gridworld has 100 distinct states. There is a special absorbing state called “Goal”. There are two wormholes labeled “IN”, one Grey and one Brown; any action taken in those states teleports you to a state labeled “OUT” of the same colour (Grey or Brown, respectively). In the case of the Grey wormhole, you are teleported to any one of the Grey states labeled OUT with equal probability (i.e., 1/4). States labeled OUT are just normal states.
• Actions: In each non-terminal state, you can take 4 actions A = {North, East, South, West}, which moves you one cell in the corresponding direction.
• Transition model: The gridworld is stochastic because the actions can be unreliable. In this model, an action X ∈ {North, East, South, West} moves you one cell in the X direction of your current position with probability 0.8, but with probability 0.2/3 each it transitions you to one of the other neighbouring cells. For example, if the selected action is North, it moves you one cell to the North of your current position with probability 0.8, and one cell to the East, West, or South of your current position with probability 0.2/3 each. Transitions that would take you off the grid do not result in any change of state. There are no transitions available from the “Goal” state. (A minimal sketch of this transition model is given after this list.)
• Rewards: You receive a reward of 0 for all transitions except transitions into the “Goal” state. Any transition into the “Goal” state gives you a reward of +10.
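The sketch below gives one way to realize the noisy move model in Python; it assumes (row, column) indexing on the 10 × 10 grid and leaves out the wormholes and the Goal state, whose positions must be read off Figure 1 and handled as special cases before the noisy move is applied.

N = 10                                    # the grid is N x N
ACTIONS = {"North": (-1, 0), "South": (1, 0), "East": (0, 1), "West": (0, -1)}

def next_state_distribution(s, a):
    # Chosen direction with probability 0.8, each other direction with 0.2/3;
    # a move that would leave the grid keeps the state unchanged.
    dist = {}
    for direction, (dr, dc) in ACTIONS.items():
        p = 0.8 if direction == a else 0.2 / 3
        r, c = s[0] + dr, s[1] + dc
        if not (0 <= r < N and 0 <= c < N):
            r, c = s                      # off-grid: no change of state
        dist[(r, c)] = dist.get((r, c), 0.0) + p
    return dist

# Example: in the corner (0, 0), part of the probability mass stays at (0, 0).
print(next_state_distribution((0, 0), "North"))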
2.1: Find optimal values using value iteration (Algorithm 1). Start with J0(s) = 0 and π0(s) = North, ∀s. Let the discount factor γ = 0.7.
2.2: Find optimal policy for each state using policy iteration (Algorithm 3). Start with J0(s) = 0 and π0(s) = North, ∀s. Let the discount factor γ = 0.7.
2.3: Implement TD(λ) (Algorithm 5). Choose the “Grey IN” state as the start state. Let the discount factor γ = 0.7, and step-size α = 0.5. Let the maximum number of episodes N = 1000. Consider the policy π as given in Algorithm 6.
2.4: Run TD(λ) for different values of λ ranging from 0 to 1 with intervals of 0.05 and display the results.
2.a Compare value iteration from 2.1 and policy iteration from 2.2 by plotting J(s) vs iterations (the outer loop iterations) for states “Brown IN", “Brown OUT", and “Grey IN". Which algorithm converges faster? Why?
2.b How do different values of λ affect TD(λ) from 2.4? Explain your findings.
Let Jv(s), ∀s ∈ S, be the optimal values (final values) obtained using value iteration from 2.1. Then the root mean squared (RMS) error of J with respect to Jv, averaged over states, can be calculated as
error(J) = √((1/|S|) ∑_{s∈S} (J(s) − Jv(s))²).    (1)
2.c Plot the graph of error(Ji) from policy iteration in 2.2 vs iterations i. Here Ji is the value vector obtained after the ith iteration of the outer loop of the algorithm.
2.d Plot the graph of error(Ji) from TD(λ) in 2.3 vs iterations i, for λ ∈ {0, 0.25, 0.5, 0.75, 1}. Here Ji is the value vector obtained after the ith iteration of the outer loop of the algorithm.
The pseudocode for the algorithms is given below (courtesy: Sutton & Barto, 1998).
Algorithm 1 Value Iteration
1: Initialize: J(s) = 0, ∀s ∈ S;
2: repeat
3: δ = 0;
4: for each s ∈ S do
5: H(s) = max_{a∈A} ∑_{s′∈S} P_{ss′}(a) [r(s, a, s′) + γJ(s′)];
6: δ = max(δ, |J(s) − H(s)|);
7: end for
8: for each s ∈ S do
9: J(s) = H(s);
10: end for
11: until (δ < 1e−8)
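A minimal NumPy rendering of Algorithm 1, assuming the MDP is stored as arrays P of shape (|S|, |A|, |S|) and r of shape (|S|, |A|, |S|); the array layout and function name are assumptions made for this sketch.

import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    # Algorithm 1: synchronous value iteration.
    # P[s, a, s2] is the transition probability, r[s, a, s2] the reward.
    J = np.zeros(P.shape[0])
    while True:
        H = np.max(np.sum(P * (r + gamma * J), axis=2), axis=1)
        delta = np.max(np.abs(J - H))
        J = H
        if delta < tol:
            return J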
Algorithm 2 Gauss-Seidel Value Iteration
1: Initialize: J(s) = 0, ∀s ∈ S;
2: repeat
3: δ = 0;
4: for each s ∈ S do
5: j = J(s);
6: J(s) = max_{a∈A} ∑_{s′∈S} P_{ss′}(a) [r(s, a, s′) + γJ(s′)];
7: δ = max(δ, |j − J(s)|);
8: end for
9: until (δ < 1e−8)
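Algorithm 2 differs from Algorithm 1 only in that J(s) is overwritten in place during the sweep, so states visited later in the same sweep already use the updated values. A sketch under the same array conventions as above:

import numpy as np

def gauss_seidel_value_iteration(P, r, gamma, tol=1e-8):
    # Algorithm 2: in-place (Gauss-Seidel) value iteration.
    nS = P.shape[0]
    J = np.zeros(nS)
    while True:
        delta = 0.0
        for s in range(nS):
            old = J[s]
            # J already contains the values updated earlier in this sweep.
            J[s] = np.max(np.sum(P[s] * (r[s] + gamma * J), axis=1))
            delta = max(delta, abs(old - J[s]))
        if delta < tol:
            return J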
Algorithm 3 Policy Iteration
1: Input: π0(s), ∀s ∈ S;
2: Initialize: J(s) = 0, π(s) = π0(s), ∀s ∈ S;
3: repeat
4: repeat
5: δ = 0;
6: for each s ∈ S do
7: j = J(s);
8: J(s) = ∑_{s′∈S} P_{ss′}(π(s)) [r(s, π(s), s′) + γJ(s′)];
9: δ = max(δ, |j − J(s)|);
10: end for
11: until (δ < 1e−8)
12: done = 1;
13: for each s ∈ S do
14: b = π(s);
15: π(s) = argmax_{a∈A} ∑_{s′∈S} P_{ss′}(a) [r(s, a, s′) + γJ(s′)];
16: if b ≠ π(s) then
17: done = 0;
18: end if
19: end for
20: until done = 1
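A sketch of Algorithm 3 in the same conventions; the policy is evaluated by repeated in-place sweeps to the tolerance, exactly as in the inner repeat-until loop of the pseudocode, rather than by solving the linear system.

import numpy as np

def policy_iteration(P, r, gamma, pi0, tol=1e-8):
    # Algorithm 3: policy iteration; pi0 holds one initial action per state.
    nS = P.shape[0]
    J, pi = np.zeros(nS), np.array(pi0, dtype=int)
    while True:
        # Policy evaluation.
        while True:
            delta = 0.0
            for s in range(nS):
                old = J[s]
                J[s] = np.sum(P[s, pi[s]] * (r[s, pi[s]] + gamma * J))
                delta = max(delta, abs(old - J[s]))
            if delta < tol:
                break
        # Policy improvement.
        Q = np.sum(P * (r + gamma * J), axis=2)       # Q[s, a]
        new_pi = np.argmax(Q, axis=1)
        if np.array_equal(new_pi, pi):
            return pi, J
        pi = new_pi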
Algorithm 4 Modified Policy Iteration
1: Input: π0(s), ∀s ∈ S, m;
2: Initialize: J(s) = 0, π(s) = π0(s), ∀s ∈ S;
3: repeat
4: for k = 0, ··· , m do
5: for each s ∈ S do
6: J(s) = ∑_{s′∈S} P_{ss′}(π(s)) [r(s, π(s), s′) + γJ(s′)];
7: end for
8: end for
9: done = 1;
10: for each s ∈ S do
11: b = π(s);
12: π(s) = argmax_{a∈A} ∑_{s′∈S} P_{ss′}(a) [r(s, a, s′) + γJ(s′)];
13: if b ≠ π(s) then
14: done = 0;
15: end if
16: end for
17: until done = 1
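Algorithm 4 is the same loop with the evaluation step truncated to a fixed number of sweeps (k = 0, ..., m, i.e. m + 1 sweeps) instead of iterating to a tolerance; only the inner loop of the policy-iteration sketch above changes:

import numpy as np

def modified_policy_iteration(P, r, gamma, pi0, m):
    # Algorithm 4: m + 1 partial evaluation sweeps per improvement step.
    nS = P.shape[0]
    J, pi = np.zeros(nS), np.array(pi0, dtype=int)
    while True:
        for _ in range(m + 1):                        # k = 0, ..., m
            for s in range(nS):
                J[s] = np.sum(P[s, pi[s]] * (r[s, pi[s]] + gamma * J))
        Q = np.sum(P * (r + gamma * J), axis=2)
        new_pi = np.argmax(Q, axis=1)
        if np.array_equal(new_pi, pi):
            return pi, J
        pi = new_pi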
Algorithm 5 TD(λ)
1: Input: Start state s0, the policy π(s), ∀s ∈ S, maximum number of episodes N, step-size α, and discount factor γ;
2: Initialize: J(s) = e(s) = 0, ∀s ∈ S;
3: for i = 1, ··· , N do
4: s = s0;
5: repeat {each step of episode}
6: a = π(s);
7: Take action a, observe reward r and next state s′.
8: δ = r + γJ(s′) − J(s);
9: e(s) = e(s) + 1;
10: for each s ∈ S do
11: J(s) = J(s) + αδe(s);
12: e(s) = γλe(s);
13: end for
14: s = s′;
15: until s is terminal
16: end for
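A sketch of Algorithm 5, assuming the environment is exposed through a hypothetical env_step(s, a) function returning (r, s_next, done) and the policy through a function policy(s); these interfaces, and the dictionary-based value table, are assumptions made for the sketch. The state set passed in should include the terminal Goal state.

def td_lambda(env_step, policy, s0, states, gamma, lam, alpha, num_episodes):
    # Algorithm 5: TD(lambda) with accumulating eligibility traces.
    J = {s: 0.0 for s in states}
    e = {s: 0.0 for s in states}   # traces initialized once, as in Algorithm 5
    for _ in range(num_episodes):
        s = s0
        done = False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)    # hypothetical environment call
            delta = r + gamma * J[s_next] - J[s]
            e[s] += 1.0                         # accumulating trace
            for u in states:
                J[u] += alpha * delta * e[u]
                e[u] *= gamma * lam
            s = s_next
    return J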
Algorithm 6 policy for TD(λ)
1: Input: Current state position (sx, sy), Goal state position (gx, gy);
2: Output: The direction of the action d ∈ {(1,0), (−1,0), (0,1), (0, −1)};
3: δx = gx − sx; δy = gy − sy;
4: if |δx| ≥ |δy| then
5: d = (sign(δx),0);
6: else
7: d = (0, sign(δy));
8: end if
9: return d;
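Algorithm 6 is a simple deterministic "head toward the Goal" policy: move along the axis with the larger absolute offset to the Goal. A direct Python transcription (returning a direction vector, as in the pseudocode):

def goal_directed_policy(s, goal):
    # Algorithm 6: s and goal are (x, y) positions; returns the move direction.
    # At the Goal itself (dx = dy = 0) the episode is already terminal.
    dx, dy = goal[0] - s[0], goal[1] - s[1]
    sign = lambda v: (v > 0) - (v < 0)
    if abs(dx) >= abs(dy):
        return (sign(dx), 0)
    return (0, sign(dy))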