CZ3005 Lab 2: Reinforcement Learning

In this project, you need to implement one reinforcement learning algorithm (e.g., value iteration, policy iteration, Q-learning) for a grid-world-based environment: Treasure Hunting.

 

Figure 1: Illustration of treasure hunting in a cube. (a) 3D grid world; smiley faces represent terminal states, which give reward 1. (b) Illustration of a transition, e.g., when the intended action is RIGHT.

1         Treasure Hunting in a Cube
The environment is a 3D grid world. The MDP formulation is described as follows:

•    State: a 3D coordinate indicating the agent's current position. The initial state is (0, 0, 0) and there is only one terminal state: (3, 3, 3).

•    Action: The action space is (forward, backward, left, right, up, down). The agent selects one of these actions to navigate in the environment.

•    Reward: The agent receives a reward of 1 when it reaches the terminal state, and a reward of -0.1 otherwise.

•    Transition: The intended movement happens with probability 0.6. With probability 0.1 each, the agent instead moves in one of the four directions perpendicular to the intended one. If the movement would collide with a wall, the agent stays in the same state. (A small sketch of this noise model follows the list.)
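For concreteness, the transition noise can be read as follows. This is a minimal sketch only; the actual transition logic is implemented in the provided environment.py, and the names below (AXIS, sample_actual_direction) are illustrative, not part of the provided code.

import random

# Illustrative sketch of the transition noise described above.
# Each action moves along one of the three axes; the four actions on the
# other two axes are "perpendicular" to the intended action.
AXIS = {'left': 0, 'right': 0, 'forward': 1, 'backward': 1, 'up': 2, 'down': 2}
ACTIONS = list(AXIS.keys())

def sample_actual_direction(intended):
    perpendicular = [a for a in ACTIONS if AXIS[a] != AXIS[intended]]
    directions = [intended] + perpendicular
    # 0.6 for the intended direction, 0.1 for each perpendicular direction
    return random.choices(directions, weights=[0.6] + [0.1] * 4)[0]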

2         Code Example
We provide the environment code environment.py and the example code test.py. In environment.py, we provide the environment class TreasureCube.

In test.py, we provide a random agent; you can modify it to implement your own agent. You additionally need to install the numpy package to run the code.

from collections import defaultdict
import argparse
import random
import numpy as np
from environment import TreasureCube


# you need to implement your agent based on one RL algorithm
class RandomAgent(object):
    def __init__(self):
        self.action_space = ['left', 'right', 'forward', 'backward', 'up', 'down']  # in TreasureCube
        self.Q = defaultdict(lambda: np.zeros(len(self.action_space)))

    def take_action(self, state):
        action = random.choice(self.action_space)
        return action

    # implement your train/update function to update self.V or self.Q
    # you should pass arguments to the train function
    def train(self, state, action, next_state, reward):
        pass
Besides, in test.py, we implement a test function. You should replace the random agent with your agent on the line agent = RandomAgent().

def test_cube(max_episode, max_step):
    env = TreasureCube(max_step=max_step)
    agent = RandomAgent()
    for episode_num in range(0, max_episode):
        state = env.reset()
        terminate = False
        t = 0
        episode_reward = 0
        while not terminate:
            action = agent.take_action(state)
            reward, terminate, next_state = env.step(action)
            episode_reward += reward
            # env.render()
            # print(f'step: {t}, action: {action}, reward: {reward}')
            t += 1
            agent.train(state, action, next_state, reward)
            state = next_state
        print(f'episode: {episode_num}, total_steps: {t} episode reward: {episode_reward}')
If you use Q-learning, you can use the parameters: discount factor γ = 0.99, learning rate α = 0.5, and an exploration rate ε.
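As an illustration, a minimal Q-learning agent that fits the RandomAgent interface above might look as follows. This is a sketch, not the reference solution: it reuses the imports from test.py, and the ε-greedy action selection and the value ε = 0.1 are assumptions, since the handout does not fix the exploration strategy.

class QLearningAgent(object):
    def __init__(self, gamma=0.99, alpha=0.5, epsilon=0.1):  # epsilon value is an assumption
        self.action_space = ['left', 'right', 'forward', 'backward', 'up', 'down']
        self.Q = defaultdict(lambda: np.zeros(len(self.action_space)))
        self.gamma = gamma      # discount factor
        self.alpha = alpha      # learning rate
        self.epsilon = epsilon  # exploration rate

    def take_action(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if random.random() < self.epsilon:
            return random.choice(self.action_space)
        return self.action_space[int(np.argmax(self.Q[state]))]

    def train(self, state, action, next_state, reward):
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        a = self.action_space.index(action)
        td_target = reward + self.gamma * np.max(self.Q[next_state])
        self.Q[state][a] += self.alpha * (td_target - self.Q[state][a])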

You can run the following code to generate output and test your agent.
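For example, a minimal entry point consistent with the imports above is sketched below; the flag names --max_episode and --max_step and their default values are assumptions, so check the provided test.py for the actual interface.

if __name__ == '__main__':
    # assumed command-line flags and defaults; the provided test.py may differ
    parser = argparse.ArgumentParser(description='Treasure Cube test')
    parser.add_argument('--max_episode', type=int, default=500)
    parser.add_argument('--max_step', type=int, default=500)
    args = parser.parse_args()
    test_cube(args.max_episode, args.max_step)

With this entry point, a run could be started with, e.g., python test.py --max_episode 500 --max_step 500.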
