40.319 Homework 4

Problem 1
 

Figure 1: Hidden Markov Model (HMM)

Consider the HMM described in Figure 1. Let the states of H_i be {1, 2, 3}, and the states of O_i be {"the", "cat", "jumps", "up"}.

(a)    Let T = 3. Write out the factorized form of P(O_1 = "the", O_2 = "cat", O_3 = "jumps") in terms of the initial, transition, and emission probabilities. Specify all summations and products clearly.
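For concreteness, a hedged sketch of the general factorization, written in LaTeX; the emission notation b_h(o) = P(O_t = o | H_t = h) is an assumption of this sketch and is not fixed by the problem:

\[
P(O_1=\text{"the"},\,O_2=\text{"cat"},\,O_3=\text{"jumps"})
= \sum_{h_1=1}^{3}\sum_{h_2=1}^{3}\sum_{h_3=1}^{3}
\pi_{h_1}\, b_{h_1}(\text{"the"})\; p_{h_1 h_2}\, b_{h_2}(\text{"cat"})\; p_{h_2 h_3}\, b_{h_3}(\text{"jumps"})
\]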

(b)    Show that if any initial-state parameter π_i or transition parameter p_ij is set to 0 at initialization, then it will remain 0 in all subsequent updates of the EM algorithm.
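A hedged outline of the argument, using standard Baum-Welch notation that the problem itself does not fix (forward/backward variables \alpha_t, \beta_t, expected transition counts \xi_t(i,j), expected occupancies \gamma_t(i)): the E-step gives

\[
\xi_t(i,j) \propto \alpha_t(i)\, p_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j),
\]

so p_ij = 0 forces \xi_t(i,j) = 0 for every t, and the M-step re-estimate

\[
\hat{p}_{ij} = \frac{\sum_t \xi_t(i,j)}{\sum_t \gamma_t(i)}
\]

is again 0. The same reasoning applies to π_i, whose update \hat{\pi}_i = \gamma_1(i) carries a factor of π_i.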

Problem 2
In this problem, we will study the multi-armed (k = 10) bandit problem. Begin by including the following code in your script:

import numpy as np
import matplotlib.pyplot as plt

T = 1000

(a)    Define the ε-greedy and UCB policy functions

def e_greedy(Q, N, t, e):
    ...

def UCB(Q, N, t, c):
    ...

which both take in Q, an array representing the action-values; N, an array representing the number of times the different actions have been taken; t, the total number of actions taken thus far; and finally a policy-specific parameter.
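One possible way to fill these in, as a sketch rather than the required implementation; it relies on the numpy import from the snippet above, and the tie-breaking and untried-arm handling are assumptions:

def e_greedy(Q, N, t, e):
    # With probability e pick a random arm, otherwise act greedily on Q.
    if np.random.rand() < e:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

def UCB(Q, N, t, c):
    # Give untried arms priority, otherwise maximize Q plus the exploration bonus.
    if np.any(N == 0):
        return int(np.argmax(N == 0))
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))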

(b)    Write a function that performs one run (1000 time steps), updates Q incrementally and records the reward received at each time step:

def test_run(policy, param):
    true_means = np.random.normal(0, 1, 10)
    reward = np.zeros(T + 1)
    Q = np.zeros(10)
    ...
    r = np.random.normal(true_means[a], 1)
    ...
    return reward

At each time step, when action a is taken, the reward r is sampled from a normal distribution with mean true_means[a] and standard deviation 1.
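A hedged sketch of how the elided body might look; the incremental sample-average update for Q is one standard choice, and the loop bounds are an assumption:

def test_run(policy, param):
    true_means = np.random.normal(0, 1, 10)   # true action values, fixed for the run
    reward = np.zeros(T + 1)
    Q = np.zeros(10)                          # action-value estimates
    N = np.zeros(10)                          # number of times each action was taken
    for t in range(1, T + 1):
        a = policy(Q, N, t, param)            # select an action under the given policy
        r = np.random.normal(true_means[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental sample-average update
        reward[t] = r
    return reward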

(c)    Use the following function to average over 2000 runs and plot the results:

def main():
    ave_g = np.zeros(T + 1)
    ave_eg = np.zeros(T + 1)
    ave_ucb = np.zeros(T + 1)
    for i in range(2000):
        g = test_run(e_greedy, 0.0)
        eg = test_run(e_greedy, )    # choose parameter
        ucb = test_run(UCB, )        # choose parameter
        ave_g += (g - ave_g) / (i + 1)
        ave_eg += (eg - ave_eg) / (i + 1)
        ave_ucb += (ucb - ave_ucb) / (i + 1)
    t = np.arange(T + 1)
    plt.plot(t, ave_g, 'b-', t, ave_eg, 'r-', t, ave_ucb, 'g-')
    plt.show()

At approximately what value of ε does the ε-greedy method switch from being better than the greedy policy to being worse than it?

Problem 3: Robot vs Snakes
In this problem, we will simulate a robot navigating a room trying to find the exit. The robot is bleeding and loses 1 hitpoint for every step it takes before it finds the exit. To make matters worse, there are 2 snakes in the centre of the room. If the robot steps onto a square with a snake, the snake will bite it and cause it to lose 15 hitpoints! As the robot is terrified of being bitten, it is very nervous and does not strictly respond to commands; if told to move in a certain direction, it will only do so half the time. One quarter of the time it will instead move to the left of the commanded direction, and another quarter of the time it will move to the right of the commanded direction. Can you help train the robot to find its way out?

Begin by installing OpenAI’s gym module (pip install gym), and include the following code in your script.

import numpy as np
import sys
from gym.envs.toy_text import discrete

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

GOAL = 4      # upper-right corner
START = 20    # lower-left corner
SNAKE1 = 7
SNAKE2 = 17

eps = 0.25

class Robot_vs_snakes_world(discrete.DiscreteEnv):

    def __init__(self):
        self.shape = [5, 5]
        nS = np.prod(self.shape)   # total number of states
        nA = 4                     # total number of actions per state
        MAX_Y = self.shape[0]
        MAX_X = self.shape[1]
        P = {}
        grid = np.arange(nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index
            P[s] = {a: [] for a in range(nA)}
            is_done = lambda s: s == GOAL
            if is_done(s):
                reward = 0.0
            elif s == SNAKE1 or s == SNAKE2:
                reward = -15.0     # snake bite
            else:
                reward = -1.0      # one hitpoint lost per step
            if is_done(s):
                # Terminal state: every action self-loops with zero reward.
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            else:
                # Neighbouring states (moves off the grid leave the robot in place).
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                # The commanded direction is taken with probability 1 - 2*eps;
                # with probability eps each, the robot slips to either side.
                P[s][UP] = [(1 - (2 * eps), ns_up, reward, is_done(ns_up)),
                            (eps, ns_right, reward, is_done(ns_right)),
                            (eps, ns_left, reward, is_done(ns_left))]
                P[s][RIGHT] = [(1 - (2 * eps), ns_right, reward, is_done(ns_right)),
                               (eps, ns_up, reward, is_done(ns_up)),
                               (eps, ns_down, reward, is_done(ns_down))]
                P[s][DOWN] = [(1 - (2 * eps), ns_down, reward, is_done(ns_down)),
                              (eps, ns_right, reward, is_done(ns_right)),
                              (eps, ns_left, reward, is_done(ns_left))]
                P[s][LEFT] = [(1 - (2 * eps), ns_left, reward, is_done(ns_left)),
                              (eps, ns_up, reward, is_done(ns_up)),
                              (eps, ns_down, reward, is_done(ns_down))]
            it.iternext()

        isd = np.zeros(nS)
        isd[START] = 1.0           # the robot always starts in the lower-left corner
        self.P = P
        super(Robot_vs_snakes_world, self).__init__(nS, nA, P, isd)

    def _render(self):
        # Print the grid: R = robot, G = goal, S = snake, o = empty square.
        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index
            if self.s == s:
                output = " R "
            elif s == GOAL:
                output = " G "
            elif s == SNAKE1 or s == SNAKE2:
                output = " S "
            else:
                output = " o "
            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()
            sys.stdout.write(output)
            if x == self.shape[1] - 1:
                sys.stdout.write("\n")
            it.iternext()
        sys.stdout.write("\n")

You can write

env = Robot_vs_snakes_world()

to start the environment, which is a 5 x 5 grid. Locations on the grid are numbered starting from 0 at the upper-left and increasing in a row-wise manner; i.e. the upper-right is numbered 4 (the exit/goal state), the lower-left (the robot's starting location) is numbered 20, the lower-right corner is numbered 24, and so on. You can access the robot's location at any time using

env.s

and issue a command to move using

env.step(DIR)

where DIR is UP, DOWN, LEFT or RIGHT. Finally,

env.P[state][action]

returns a list of tuples, each corresponding to a possible next state j and of the form (probability of transitioning to state j, j, reward, goal state?).
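For example, a short hedged interaction with the environment (the printed values are not reproduced here, and the _render name follows the class definition above):

env = Robot_vs_snakes_world()
print(env.s)              # 20: the start state in the lower-left corner
print(env.P[env.s][UP])   # transition tuples for commanding UP from the start
env.step(UP)              # command UP; the robot only complies half the time
env._render()             # print the room layout with the robot's position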

(a)    Begin by writing a function which performs value iteration (note that there is no discounting of the rewards, i.e. the discount factor is 1):

def value_iteration(env):
    policy = np.zeros([env.nS, env.nA])
    V = np.zeros(env.nS)
    ...
    ...
    return policy, V

This function should return the value function as well as the greedy policy. Print out both the policy and the value function.
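A hedged sketch of one standard implementation; the convergence threshold theta and the one-hot encoding of the greedy policy are assumptions, not requirements of the problem:

def value_iteration(env, theta=1e-8):
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            # One-step lookahead: expected return of each action under the current V
            q = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, ns, reward, done in env.P[s][a]:
                    q[a] += prob * (reward + V[ns])   # no discounting (gamma = 1)
            best = np.max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy (one-hot) policy with respect to the converged value function
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        q = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, ns, reward, done in env.P[s][a]:
                q[a] += prob * (reward + V[ns])
        policy[s, np.argmax(q)] = 1.0
    return policy, V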

(b)    Next, use the policy obtained above to navigate the robot through the room. At each time step, use env._render()

to print out the layout of the room with the robot’s current location. Terminate the program when the robot reaches the exit.
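A hedged sketch of such a loop, assuming the one-hot greedy policy format from the sketch in (a); any equivalent way of reading off the greedy action works:

env = Robot_vs_snakes_world()
policy, V = value_iteration(env)
env._render()
while env.s != GOAL:
    a = int(np.argmax(policy[env.s]))   # greedy action in the current state
    env.step(a)                         # the robot obeys only half the time
    env._render()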

(c)    Does the robot try to avoid the snakes? If so, refer to the policy function obtained in (a) to explain in what way it does so.
