SYSU2022-Homework 1 Solved

Task 1 Frozen Lake MDP                                                                                                            
Now you will implement value iteration and policy iteration for the Frozen Lake environment from OpenAI Gym. We have provided custom versions of this environment.

1. (coding) Read through vi_and_pi.py and implement policy_evaluation, policy_improvement, and policy_iteration. The stopping tolerance (defined as $\max_s |V_{\text{old}}(s) - V_{\text{new}}(s)|$) is tol = ___. Use discount factor $\gamma$ = ___. Return the optimal value function and the optimal policy.

2. (coding) Implement value_iteration in vi_and_pi.py. The stopping tolerance is tol = ___ and the discount factor is the same $\gamma$ = ___ as above. Return the optimal value function and the optimal policy. A sketch of the expected structure of these functions follows.
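To make the expected structure concrete, below is a minimal sketch of the four functions. It assumes the Gym FrozenLake-style dynamics model, where P[s][a] is a list of (prob, next_state, reward, done) tuples and nS, nA are the numbers of states and actions; the exact signatures in vi_and_pi.py may differ, and gamma/tol should be set to the values specified above.

```python
import numpy as np

def policy_evaluation(P, nS, nA, policy, gamma, tol):
    """Iteratively evaluate V^pi for a fixed deterministic policy."""
    V = np.zeros(nS)
    while True:
        V_new = np.zeros(nS)
        for s in range(nS):
            # Expected one-step return of the policy's action in state s.
            for prob, next_s, reward, done in P[s][int(policy[s])]:
                V_new[s] += prob * (reward + gamma * V[next_s] * (not done))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, nS, nA, V, gamma):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = np.zeros(nS, dtype=int)
    for s in range(nS):
        q = np.zeros(nA)
        for a in range(nA):
            for prob, next_s, reward, done in P[s][a]:
                q[a] += prob * (reward + gamma * V[next_s] * (not done))
        policy[s] = np.argmax(q)
    return policy

def policy_iteration(P, nS, nA, gamma, tol):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = np.zeros(nS, dtype=int)
    while True:
        V = policy_evaluation(P, nS, nA, policy, gamma, tol)
        new_policy = policy_improvement(P, nS, nA, V, gamma)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy

def value_iteration(P, nS, nA, gamma, tol):
    """Apply the Bellman optimality backup until V converges, then act greedily."""
    V = np.zeros(nS)
    while True:
        V_new = np.zeros(nS)
        for s in range(nS):
            q = np.zeros(nA)
            for a in range(nA):
                for prob, next_s, reward, done in P[s][a]:
                    q[a] += prob * (reward + gamma * V[next_s] * (not done))
            V_new[s] = np.max(q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, policy_improvement(P, nS, nA, V_new, gamma)
        V = V_new
```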

Task 2 Test Environment                                                                                                           
It is crucial to test our code on a test environment. In this problem, you will reason about optimality in the provided test environment by hand; later, to sanity-check your code, you will verify that your implementation is able to achieve this optimality. You should be able to run your models on CPU in no more than a few minutes on the following environment:  

4 states: 0, 1, 2, 3

5 actions: 0, 1, 2, 3, 4. Action $i \in \{0, 1, 2, 3\}$ goes to state $i$, while action 4 makes the agent stay in the same state.

Rewards: Going to state $i$ from states 0, 1, and 3 gives a reward $R(i)$, where $R(0) = 0.1$, $R(1) = -0.3$, $R(2) = 0.0$, $R(3) = -0.2$. If we start in state 2, then the rewards defined above are multiplied by $-10$. See Table 1 for the full transition and reward structure.

One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the initial state).  

An example of a trajectory (or episode) in the test environment is shown in Figure 1; the trajectory can be represented as a sequence of state, action, and reward tuples $(s_t, a_t, r_t)$.

State (s)    Action (a)    Next State (s')    Reward (R)
0            0             0                   0.1
0            1             1                  -0.3
0            2             2                   0.0
0            3             3                  -0.2
0            4             0                   0.1
1            0             0                   0.1
1            1             1                  -0.3
1            2             2                   0.0
1            3             3                  -0.2
1            4             1                  -0.3
2            0             0                  -1.0
2            1             1                   3.0
2            2             2                   0.0
2            3             3                   2.0
2            4             2                   0.0
3            0             0                   0.1
3            1             1                  -0.3
3            2             2                   0.0
3            3             3                  -0.2
3            4             3                  -0.2

Table 1: Transition table for the Test Environment

 

Figure 1: Example of a trajectory in the Test Environment
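For sanity-checking, the test environment can be written down directly as a small deterministic transition/reward model. The sketch below is an illustrative encoding derived from Table 1; the class name and interface are hypothetical and are not the provided starter code. Enumerating all 5-step action sequences through such an encoding is one way to check your answer to the written question below.

```python
import numpy as np

# Reward for arriving in each state when leaving states 0, 1, or 3 (Table 1).
BASE_REWARD = {0: 0.1, 1: -0.3, 2: 0.0, 3: -0.2}

class TestEnv:
    """Illustrative 4-state / 5-action test environment built from Table 1."""

    def __init__(self):
        self.state = 0
        self.t = 0

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        # Actions 0-3 move to that state; action 4 stays in the current state.
        next_state = self.state if action == 4 else action
        reward = BASE_REWARD[next_state]
        if self.state == 2:          # leaving state 2 multiplies the reward by -10
            reward *= -10
        self.state = next_state
        self.t += 1
        done = self.t >= 5           # one episode lasts 5 time steps
        return next_state, reward, done
```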

1. (written) What is the maximum sum of rewards that can be achieved in a single trajectory in the test environment, assuming $\gamma = 1$? Show first that this value is attainable in a single trajectory, and then briefly argue why no other trajectory can achieve a greater cumulative reward.

Task 3 Tabular Q-Learning                                                                                                        
If the state and action spaces are sufficiently small, we can simply maintain a table containing the value of $Q(s, a)$, an estimate of $Q^*(s, a)$, for every $(s, a)$ pair. In this tabular setting, given an experience sample $(s, a, r, s')$, the update rule is

$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big),$

where $\alpha > 0$ is the learning rate and $\gamma$ is the discount factor. For exploration, an $\epsilon$-greedy strategy is used: with probability $\epsilon$ the action is chosen uniformly at random from the action space, and with probability $1 - \epsilon$ the greedy action $\arg\max_a Q(s, a)$ is chosen.
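As a concrete illustration of the update above, here is a minimal tabular sketch. The Q array shape and the eps/alpha/gamma values are placeholders, not the settings used in the assignment.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, eps, rng):
    """With probability eps pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply the tabular Q-learning backup for one experience sample (s, a, r, s')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example usage (placeholder sizes): Q = np.zeros((n_states, n_actions)),
# rng = np.random.default_rng(0), then interleave epsilon_greedy and q_learning_update.
```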

Task 4 Maze Example                                                                                                                
You will implement Sarsa and Q-learning for the Maze environment from OpenAI Gym. We have provided custom versions of this environment. In this scenario, a red rectangle (the agent) is initialized in a maze made up of grid cells and can only observe its own location. At each time step, the agent can move to one of the four neighboring cells. The reward is +1 if the agent reaches the yellow cell, -1 if it reaches the black cell, and 0 otherwise.

1. (coding) Implement Sarsa in RL_sarsa.py.

2. (coding) Implement Q_learning in RL_q_learning.py. A sketch contrasting the two update rules follows this list.
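The key difference between the two algorithms is the bootstrap target: Sarsa (on-policy) uses the action actually taken in the next state, while Q-learning (off-policy) uses the greedy action. Below is a minimal sketch of the Sarsa backup, assuming the same tabular Q array as in the Task 3 sketch; the provided RL_sarsa.py and RL_q_learning.py may organize this differently (e.g. as a class), but the update rule is the same.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy Sarsa backup: bootstrap from the action actually chosen in s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Q-learning would instead use: td_target = r + gamma * np.max(Q[s_next])
```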
