$30
Reinforcement Learning (RL) is the task of learning from interaction to achieve a goal. The learner and the decision maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. These interact continually, the agent selecting actions and the environment responding to those actions by presenting rewards and new states.
In the first part of the project, we will learn the optimal policy of an agent navigating in a 2-D environment. We will implement the Value iteration algorithm to learn the optimal policy.
Inverse Reinforcement Learning (IRL) is the task of extracting an expert’s reward function by observing the optimal policy of the expert. In the second part of the project, we will explore the application of IRL in the context of apprenticeship learning.
1 Reinforcement learning (RL)
The two main objects in Reinforcement learning are:
• Agent
• Environment
In this project, we will learn the optimal policy of a single agent navigating in a 2-D environment.
1.1 Environment
In this project, we assume that the environment of the agent is modeled by a Markov Decision Process (MDP). In a MDP, agents occupy a state of the environment and perform actions to change the state they are in. After taking an action, they are given some representation of the new state and some reward value associated with the new state.
An MDP formally is a tuple ( ) where:
• S is a set of states
• A is a set of actions
is a set of transition probabilities is the probability of transitioning from state
s ∈ S to state s0 ∈ S after taking action a ∈ A
–
• Given any current state and action, s and a, together with any next state, s0, the expected value of the next reward is Rass0
–
• γ ∈ [0,1) is the discount factor, and it is used to compute the present value of future reward
– If γ is close to 1 then the future rewards are discounted less
– If γ is close to 0 then the future rewards are discounted more
In the next few subsections, we will discuss the parameters that will be used to generate the environment for the project.
1.1.1 State space
In this project, we consider the state space to be a 2-D square grid with 100 states. The 2-D square grid along with the numbering of the states is shown in figure 1
Figure 1: 2-D square grid with state numbering
1.1.2 Action set
In this project, we consider the action set(A) to contain the 4 following actions:
• Move Right
• Move Left
• Move Up
• Move Down
The 4 types of actions are displayed in figure 2
Figure 2: 4 types of action
From the above figure, we can see that the agent can take 4 actions from the state marked with a dot.
1.1.3 Transition probabilities
In this project, we define the transition probabilities in the following manner: 1. If state s0 and s are not neighboring states in the 2-D grid, then
P(st+1 = s0|st = s,at = a) = 0
s0 and s are neighbors in the 2-D grid if you can move to s0 from s by taking an action a from the action set A. We will consider a state s to be a neighbor of itself. For example, from figure 1 we can observe that states 1 and 11 are neighbors (we can transition from 1 to 11 by moving right) but states 1 and 12 are not neighbors.
2. Each action corresponds to a movement in the intended direction with probability 1 − w, but has a probability of w of moving in a random direction instead due to wind. To illustrate this, let’s consider the states shown in figure 3
Figure 3: Inner grid states (Non-boundary states)
The transition probabilities for the non-boundary states shown in figure 3 are given below:
From the above calculation it can be observed that if the agent is at a non-boundary state then it has 4 neighbors excluding itself and the probability w is uniformly distributed over the 4 neighbors. Also, if the agent is at a non-boundary state then it transitions to a new state after taking an action (P(st+1 = 44|st = 44,at =↑) = 0)
3. If the agent is at one of the four corner states (0,9,90,99), the agent stays at the current state if it takes an action to move off the grid or is blown off the grid by wind. The actions can be divided into two categories:
• Action to move off the grid
• Action to stay in the grid
To illustrate this, let’s consider the states shown in figure 4
Figure 4: Corner states
The transition probabilities for taking an action to move off the grid are given below:
The transition probabilities for taking an action to stay in the grid are given below:
At a corner state, you can be blown off the grid in two directions. As a result, we have P(st+1 = since we can be blown off the grid in two directions and in both the
cases we stay at the current state.
4. If the agent is at one of the edge states, the agent stays at the current state if it takes an action to move off the grid or is blown off the grid by wind. The actions can be divided into two categories:
• Action to move off the grid
• Action to stay in the grid
To illustrate this, let’s consider the states shown in figure 5
Figure 5: Edge states
The transition probabilities for taking an action to move off the grid are given below:
The transition probabilities for taking an action to stay in the grid are given below:
At an edge state, you can be blown off the grid in one direction. As a result, we have P(st+1 = since we can be blown off the grid in one direction and in that case we stay
at the current state.
The main difference between a corner state and an edge state is that a corner state has 2 neighbors and an edge state has 3 neighbors.
1.1.4 Reward function
To simplify the project, we will assume that the reward function is independent of the current state (s) and the action that you take at the current state (a). To be specific, reward function only depends on the state that you transition to (s0). With this simplification, we have
In this project, we will learn the optimal policy of an agent for two different reward functions:
• Reward function 1
• Reward function 2
The two different reward functions are displayed in figures 6 and 7 respectively
Figure 6: Reward function 1
Figure 7: Reward function 2
Question 1: For visualization purpose, generate heat maps of Reward function 1 and Reward function 2. For the heat maps, make sure you display the coloring scale. You will have 2 plots for this question
For solving question 1, you might find the following function useful:
2 Optimal policy learning using RL algorithms
In this part of the project, we will use reinforcement learning (RL) algorithm to find the optimal policy. The main steps in RL algorithm are:
• Find optimal state-value or action-value
• Use the optimal state-value or action-value to determine the deterministic optimal policy
There are a couple of RL algorithms, but we will use the Value iteration algorithm since it was discussed in detail in the lecture. We will skip the derivation of the algorithm here because it was covered in the lecture (for the derivation details please refer to the lecture slides on Reinforcement learning). We will just reproduce the algorithm below for the ease of implementation:
1: procedure Value Iteration
2: for all s ∈ S do
. Initialization
3:
V (s) ← 0
4:
end for
5:
∆ ← ∞
6:
while do
. Estimation
7:
∆ ← 0
8:
for all s ∈ S do
9:
v ← V (s);
10:
V (s) ← max X Pssa 0[Rass0 + γV (s0)]; a∈A
s0∈S
11:
∆ ← max(∆,|v − V (s)|);
12:
end for
13:
end while
14:
for all s ∈ S do
. Computation
15:
)];
s0∈S
16: end for
17: end procedure return π
Question 2: Create the environment of the agent using the information provided in section 2. To be specific, create the MDP by setting up the state-space, action set, transition probabilities, discount factor, and reward function. For creating the environment, use the following set of parameters:
• Number of states = 100 (state space is a 10 by 10 square grid as displayed in figure 1)
• Number of actions = 4 (set of possible actions is displayed in figure 2)
• w = 0.1
• Discount factor = 0.8
• Reward function 1
After you have created the environment, then write an optimal state-value function that takes as input the environment of the agent and outputs the optimal value of each state in the grid. For the optimal state-value function, you have to implement the Initialization (lines 2-4) and Estimation (lines 5-13) steps of the Value Iteration algorithm. For the estimation step, use 01. For visualization purpose, you should generate a figure similar to that of figure 1 but with the number of state replaced by the optimal value of that state. In this part of question, you should have 1 plot.
Let’s assume that your value iteration algorithm converges in N steps. Plot snapshots of state values in 5 different steps linearly distributed from 1 to N. Report N and your step numbers. What observations do you have from the plots?
Question 3: Generate a heat map of the optimal state values across the 2-D grid. For generating the heat map, you can use the same function provided in the hint earlier (see the hint after question 1).
Question 4: (15 points) Explain the distribution of the optimal state values across the 2-D grid. (Hint: Use the figure generated in question 3 to explain)
Question 5: (Implement the computation step of the value iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2-D state-space. For visualization purpose, you should generate a figure similar to that of figure 1 but with the number of state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. Is it possible for the agent to compute the optimal action to take at each state by observing the optimal values of it’s neighboring states? In this question, you should have 1 plot.
Question 6: Modify the environment of the agent by replacing Reward function 1 with Reward function 2. Use the optimal state-value function implemented in question 2 to compute the optimal value of each state in the grid. For visualization purpose, you should generate a figure similar to that of figure 1 but with the number of state replaced by the optimal value of that state. In this question, you should have 1 plot.
Question 7: Generate a heat map of the optimal state values (found in question 6) across the 2-D grid. For generating the heat map, you can use the same function provided in the hint earlier. Explain the distribution of the optimal state values across the 2-D grid. (Hint: Use the figure generated in this question to explain)
Question 8: Implement the computation step of the value iteration algorithm (lines 14-17) to compute the optimal policy of the agent navigating the 2-D state-space. For visualization purpose, you should generate a figure similar to that of figure 1 but with the number of state replaced by the optimal action at that state. The optimal actions should be displayed using arrows. Does the optimal policy of the agent match your intuition? Please provide a brief explanation. In this question, you should have 1 plot.
Question 9Change the hyper parameter w to 0.6 and find the optimal policy map similar to previous question for reward functions. Explain the differences you observe. What do you think about value of new w compared to previous value? Choose the w that you think give rise to better optimal policy and use that w for the next stages of the project.
3 Inverse Reinforcement learning (IRL)
Inverse Reinforcement learning (IRL) is the task of learning an expert’s reward function by observing the optimal behavior of the expert. The motivation for IRL comes from apprenticeship learning. In apprenticeship learning, the goal of the agent is to learn a policy by observing the behavior of an expert. This task can be accomplished in two ways:
1. Learn the policy directly from expert behavior
2. Learn the expert’s reward function and use it to generate the optimal policy
The second way is preferred because the reward function provides a much more parsimonious description of behavior. Reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task. Therefore, extracting the reward function of an expert would help design more robust agents.
In this part of the project, we will use IRL algorithm to extract the reward function. We will use the optimal policy computed in the previous section as the expert behavior and use the algorithm to extract the reward function of the expert. Then, we will use the extracted reward function to compute the optimal policy of the agent. We will compare the optimal policy of the agent to the optimal policy of the expert and use some similarity metric between the two to measure the performance of the IRL algorithm.
3.1 IRL algorithm
For finite state spaces, there are a couple of IRL algorithms for extracting the reward function:
• Linear Programming (LP) formulation
• Maximum Entropy formulation
Since we covered LP formulation in the lecture and it is the simplest IRL algorithm, so we will use the LP formulation in this project. We will skip the derivation of the algorithm (for details on the derivation please refer to the lecture slides) here. The LP formulation of the IRL is given by equation 1
maximize
R,ti,ui
subject to [(Pa1(i) − Pa(i))(I − γPa1)−1R] ≥ ti, ∀a ∈ A \ a1,∀i
(1)
Ri max = 1,2,··· ,
In the LP given by equation 1, R is the reward vector (R(i) = R(si)), Pa is the transition probability matrix corresponding to the action a, λ is the adjustable penalty coefficient, and ti’s and ui’s are the extra optimization variables (please note that u(i) = ui). Use the maximum absolute value of the ground truth reward as Rmax. For the ease of implementation, we can recast the LP in equation 1 into an equivalent form given by equation 2 using block matrices.
maximize cTx
x(2)
subject to Dx
Question 10: Express c, x, D, b in terms of R,Pa,Pa1,ti,u,λ and Rmax
3.2 Performance measure
In this project, we use a very simple measure to evaluate the performance of the IRL algorithm. Before we state the performance measure, let’s introduce some notation:
• OA(s): Optimal action of the agent at state s
• OE(s): Optimal action of the expert at state s •
,OA(s) = OE(s)
,else
Then with the above notation, accuracy is given by equation 3
(3)
Since we are using the optimal policy found in the previous section as the expert behavior, so we will use the optimal policy found in the previous section to fill the OE(s) values. Please note that these values will be different depending on whether we used Reward Function 1 or Reward Function 2 to create the environment.
To compute OA(s), we will solve the linear program given by equation 2 to extract the reward function of the expert. For solving the linear program you can use the LP solver in python (from cvxopt import solvers and then use solvers.lp). Then, we will use the extracted reward function to compute the optimal policy of the agent using the value iteration algorithm you implemented in the previous section. The optimal policy of the agent found in this manner will be used to fill the OA(s) values. Please note that these values will depend on the adjustable penalty coefficient λ. We will tune λ to maximize the accuracy.
Question 11: Sweep λ from 0 to 5 to get 500 evenly spaced values for λ. For each value of λ compute OA(s) by following the process described above. For this problem, use the optimal policy of the agent found in question 5 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of λ. You need to repeat the above process for all 500 values of λ to get 500 data points. Plot λ (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.
Question 12: Use the plot in question 11 to compute the value of λ for which accuracy is
(1) (1) maximum. For future reference we will denote this value as λmax. Please report λmax
(1)
Question 13: For λmax, generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is the Reward function 1 and the extracted reward is computed by solving the linear program given by equation 2 with the λ parameter set to . In this question, you should have 2 plots.
Question 14: Use the extracted reward function computed in question 13, to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that you wrote in question 2. For visualization purpose, generate a heat map of the optimal state values across the 2-D grid (similar to the figure generated in question 3). In this question, you should have 1 plot.
Question 15: Compare the heat maps of Question 3 and Question 14 and provide a brief explanation on their similarities and differences.
Question 16: Use the extracted reward function found in question 13 to compute the optimal policy of the agent. For computing the optimal policy of the agent you need to use the function that you wrote in question 5. For visualization purpose, you should generate a figure similar to that of figure 1 but with the number of state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 17: Compare the figures of Question 5 and Question 16 and provide a brief explanation on their similarities and differences.
Question 18: Sweep λ from 0 to 5 to get 500 evenly spaced values for λ. For each value of λ compute OA(s) by following the process described above. For this problem, use the optimal policy of the agent found in question 9 to fill in the OE(s) values. Then use equation 3 to compute the accuracy of the IRL algorithm for this value of λ. You need to repeat the above process for all 500 values of λ to get 500 data points. Plot λ (x-axis) against Accuracy (y-axis). In this question, you should have 1 plot.
Question 19: Use the plot in question 18 to compute the value of λ for which accuracy is
(2) (2) maximum. For future reference we will denote this value as λmax. Please report λmax
(2)
Question 20: For λmax, generate heat maps of the ground truth reward and the extracted reward. Please note that the ground truth reward is the Reward function 2 and the extracted reward is computed by solving the linear program given by equation 2 with the λ parameter set to . In this question, you should have 2 plots.
Question 21: Use the extracted reward function computed in question 20, to compute the optimal values of the states in the 2-D grid. For computing the optimal values you need to use the optimal state-value function that you wrote in question 2. For visualization purpose, generate a heat map of the optimal state values across the 2-D grid (similar to the figure generated in question 7). In this question, you should have 1 plot.
Question 22: Compare the heat maps of Question 7 and Question 21 and provide a brief explanation on their similarities and differences.
Question 23: Use the extracted reward function found in question 20 to compute the optimal policy of the agent. For computing the optimal policy of the agent you need to use the function that you wrote in question 9. For visualization purpose, you should generate a figure similar to that of figure 1 but with the number of state replaced by the optimal action at that state. The actions should be displayed using arrows. In this question, you should have 1 plot.
Question 24: Compare the figures of Question 9 and Question 23 and provide a brief explanation on their similarities and differences.
Question 25: From the figure in question 23, you should observe that the optimal policy of the agent has two major discrepancies. Please identify and provide the causes for these two discrepancies. One of the discrepancy can be fixed easily by a slight modification to the value iteration algorithm. Perform this modification and re-run the modified value iteration algorithm to compute the optimal policy of the agent. Also, recompute the maximum accuracy after this modification. Is there a change in maximum accuracy? The second discrepancy is harder to fix and is a limitation of the simple IRL algorithm.