Problem 1
Figure 1: Hidden Markov Model (HMM)
Consider the HMM described in Figure 1. Let the states of Hi be {1, 2, 3}, and the states of Oi be {"the", "cat", "jumps", "up"}.
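For reference, one standard way to write the model's parameters and joint distribution is given below; the symbol b_i(o) for the emission probabilities is our own choice of notation (the problem does not fix one), while \pi_i and p_{ij} match the notation used in part (b):

\pi_i = P(H_1 = i), \qquad p_{ij} = P(H_{t+1} = j \mid H_t = i), \qquad b_i(o) = P(O_t = o \mid H_t = i),

P(H_1, \dots, H_T, O_1, \dots, O_T) = \pi_{H_1} \, b_{H_1}(O_1) \prod_{t=2}^{T} p_{H_{t-1} H_t} \, b_{H_t}(O_t).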
(a) Let T = 3. Write out the factorized form of P(O1 = "the", O2 = "cat", O3 = "jumps") in terms of the initial, transition, and emission probabilities. Specify all summations and products clearly.
(b) Show that if any initial parameter πi or transition parameter pij is initially set to 0, then it will remain 0 in all subsequent updates of the EM algorithm.
Problem 2
In this problem, we will study the multi-armed (k = 10) bandit problem. Begin by including the following code in your script:
import numpy as np
import matplotlib.pyplot as plt

T = 1000
(a) Define the ε-greedy and UCB policy functions
def e_greedy(Q, N, t, e):
    ...

def UCB(Q, N, t, c):
    ...
which both take in Q, an array of current action-value estimates; N, an array recording the number of times each action has been taken; t, the total number of actions taken so far; and, finally, a policy-specific parameter (e or c, respectively).
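For reference, one possible shape for these two functions is sketched below. The uniform random exploration, the handling of untried arms in UCB, and the exact exploration bonus c*sqrt(ln(t)/N) are assumptions of this sketch, not a prescribed implementation:

def e_greedy(Q, N, t, e):
    # With probability e, explore by picking a uniformly random arm;
    # otherwise exploit the current best estimate.
    if np.random.random() < e:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

def UCB(Q, N, t, c):
    # Play any arm that has not been tried yet, then pick the arm
    # maximizing the upper confidence bound Q + c*sqrt(ln(t)/N).
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))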
(b) Write a function that performs one run (1000 time steps), updates Q incrementally and records the reward received at each time step:
def test_run(policy, param):
    true_means = np.random.normal(0, 1, 10)
    reward = np.zeros(T + 1)
    Q = np.zeros(10)
    ...
    r = np.random.normal(true_means[a], 1)
    ...
    return reward
At each time step, when action a is taken, the reward r is sampled from a normal distribution with mean true_means[a] and standard deviation 1.
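For the incremental part, the standard sample-average update is shown below as a fragment; it assumes you also keep an array N of action counts alongside Q, which the skeleton leaves to the ellipses:

N[a] += 1
Q[a] += (r - Q[a]) / N[a]   # incremental sample-average estimate of action a's value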
(c) Use the following function to average over 2000 runs and plot the results:
def main():
    ave_g = np.zeros(T + 1)
    ave_eg = np.zeros(T + 1)
    ave_ucb = np.zeros(T + 1)
    for i in range(2000):
        g = test_run(e_greedy, 0.0)
        eg = test_run(e_greedy, ...)   # choose parameter
        ucb = test_run(UCB, ...)       # choose parameter
        ave_g += (g - ave_g) / (i + 1)
        ave_eg += (eg - ave_eg) / (i + 1)
        ave_ucb += (ucb - ave_ucb) / (i + 1)
    t = np.arange(T + 1)
    plt.plot(t, ave_g, 'b-', t, ave_eg, 'r-', t, ave_ucb, 'g-')
    plt.show()
At approximately what value of ε does the ε-greedy method switch from being better than the greedy policy to being worse than it?
Problem 3: Robot vs Snakes
In this problem, we will simulate a robot navigating a room trying to find the exit. The robot is bleeding and loses 1 hitpoint for every step it takes before it finds the exit. To make matters worse, there are 2 snakes in the centre of the room. If the robot steps onto a square with a snake, the snake will bite it and cause it to lose 15 hitpoints! As the robot is terrified of being bitten, it is very nervous and does not strictly respond to commands; if told to move in a certain direction, it will only do so half the time. One quarter of the time it will instead move to the left of the commanded direction, and another quarter of the time it will move to the right of the commanded direction. Can you help train the robot to find its way out?
Begin by installing OpenAI’s gym module (pip install gym), and include the following code in your script.
import numpy as np
import sys
from gym.envs.toy_text import discrete

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3
GOAL = 4      # upper-right corner
START = 20    # lower-left corner
SNAKE1 = 7
SNAKE2 = 17
eps = 0.25
class Robot_vs_snakes_world(discrete.DiscreteEnv):
    def __init__(self):
        self.shape = [5, 5]
        nS = np.prod(self.shape)   # total number of states
        nA = 4                     # total number of actions per state
        MAX_Y = self.shape[0]
        MAX_X = self.shape[1]
        P = {}
        grid = np.arange(nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index
            P[s] = {a: [] for a in range(nA)}
            is_done = lambda s: s == GOAL
            if is_done(s):
                reward = 0.0
            elif s == SNAKE1 or s == SNAKE2:
                reward = -15.0
            else:
                reward = -1.0
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1 - (2 * eps), ns_up, reward, is_done(ns_up)),
                            (eps, ns_right, reward, is_done(ns_right)),
                            (eps, ns_left, reward, is_done(ns_left))]
                P[s][RIGHT] = [(1 - (2 * eps), ns_right, reward, is_done(ns_right)),
                               (eps, ns_up, reward, is_done(ns_up)),
                               (eps, ns_down, reward, is_done(ns_down))]
                P[s][DOWN] = [(1 - (2 * eps), ns_down, reward, is_done(ns_down)),
                              (eps, ns_right, reward, is_done(ns_right)),
                              (eps, ns_left, reward, is_done(ns_left))]
                P[s][LEFT] = [(1 - (2 * eps), ns_left, reward, is_done(ns_left)),
                              (eps, ns_up, reward, is_done(ns_up)),
                              (eps, ns_down, reward, is_done(ns_down))]
            it.iternext()
        isd = np.zeros(nS)
        isd[START] = 1.0
        self.P = P
        super(Robot_vs_snakes_world, self).__init__(nS, nA, P, isd)

    def render(self):
        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index
            if self.s == s:
                output = " R "
            elif s == GOAL:
                output = " G "
            elif s == SNAKE1 or s == SNAKE2:
                output = " S "
            else:
                output = " o "
            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()
            sys.stdout.write(output)
            if x == self.shape[1] - 1:
                sys.stdout.write("\n")
            it.iternext()
        sys.stdout.write("\n")
You can write

env = Robot_vs_snakes_world()

to start the environment, which is a 5 x 5 grid. Locations on the grid are numbered starting from 0 at the upper-left and increasing row-wise; i.e. the upper-right corner is numbered 4 (the exit/goal state), the lower-left corner (the robot's starting location) is numbered 20, the lower-right corner is numbered 24, and so on. You can access the robot's location at any time using

env.s

and issue a command to move using

env.step(DIR)

where DIR is UP, DOWN, LEFT or RIGHT. Finally,

env.P[state][action]

returns a list of tuples, one per possible next state j, each of the form (probability of transitioning to state j, j, reward, goal state?).
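As a quick sanity check, the following usage sketch (assuming the class defined above) prints the transition tuples for commanding UP from the starting square:

env = Robot_vs_snakes_world()
print(env.s)   # the robot starts at square 20 (START)
for prob, next_state, reward, is_goal in env.P[START][UP]:
    print(prob, next_state, reward, is_goal)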
(a) Begin by writing a function which performs value iteration (note that there is no discounting of the rewards, i.e. the discount factor is 1):
def value_iteration(env):
    policy = np.zeros([env.nS, env.nA])
    V = np.zeros(env.nS)
    ...
    ...
    return policy, V
This function should return the value function as well as the greedy policy. Print out both the policy and the value function.
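One possible way to fill in the body is sketched below. The convergence threshold theta and the sweep structure are assumptions of this sketch; the discount factor is 1 as the problem specifies, so it never appears explicitly:

def value_iteration(env, theta=1e-8):
    def q_values(s, V):
        # Expected return of each action in state s under the current values.
        q = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_s, r, done in env.P[s][a]:
                q[a] += prob * (r + V[next_s])
        return q

    V = np.zeros(env.nS)
    # Repeat Bellman optimality backups until the values stop changing.
    while True:
        delta = 0.0
        for s in range(env.nS):
            best = np.max(q_values(s, V))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break

    # Greedy policy: put probability 1 on the best action in each state.
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        policy[s, np.argmax(q_values(s, V))] = 1.0
    return policy, V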
(b) Next, use the policy function obtained above to navigate the robot through the room. At each time step, use env.render()
to print out the layout of the room with the robot’s current location. Terminate the program when the robot reaches the exit.
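A minimal sketch of such a loop, assuming value_iteration from part (a) and the environment defined above (variable names are illustrative):

env = Robot_vs_snakes_world()
policy, V = value_iteration(env)
done = False
while not done:
    env.render()                         # show the room and the robot's position
    a = int(np.argmax(policy[env.s]))    # greedy action for the current square
    _, _, done, _ = env.step(a)          # classic gym step: (obs, reward, done, info)
env.render()                             # final frame: robot at the exit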
(c) Does the robot try to avoid the snakes? If so, refer to the policy function obtained in (a) to explain in what way it does so.