ML-Homework 1: Introduction to RL using OpenAI Gym

In this lab, we are required to work with Reinforcement Learning, a Machine Learning technique that can train an agent in an environment. The agent will navigate the classic 4x4 grid-world environment to a specific goal. The agent will learn an optimal policy through Q-Learning, which will allow it to take actions that reach the goal while avoiding the boundaries. We use a platform called OpenAI Gym to facilitate the construction of the agent and the environment.


1. Introduction

1.1 What is Reinforcement Learning?

Reinforcement Learning (RL) is one of the hottest research topics in the field of modern Artificial Intelligence, and its popularity is only growing. Reinforcement Learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.

Though both supervised and reinforcement learning use a mapping between input and output, unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior.

Compared to unsupervised learning, reinforcement learning is different in terms of goals. While the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the total cumulative reward of the agent. The figure below illustrates the action-reward feedback loop of a generic RL model.

[Figure: the action-reward feedback loop of a generic RL model]

2. OpenAI Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano. The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.

3. The Environment

The environment is a 4 by 4 grid described by two things: the grid itself and the agent. It has an observation space, which is defined as a vector of elements. This can be particularly useful for environments that return measurements, such as robotic environments.

The core gym interface is env, which is the unified environment interface. The following are the methods that will be quite helpful to us:

env.reset: Resets the environment and returns a random initial state.

env.step(action): Steps the environment by one timestep. Returns:

observation: Observations of the environment

reward: Whether your action was beneficial or not

done: Indicates whether the current episode has ended (in our case, whether the agent has reached the goal)

info: Additional info such as performance and latency, for debugging purposes

env.render: Renders one frame of the environment (helpful in visualizing the environment)
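To make the interface concrete, here is a minimal sketch of how these calls fit together. It assumes the classic Gym API described above (reset returning a state, step returning a 4-tuple) and uses the built-in FrozenLake-v1 environment as a stand-in for our 4x4 grid world; the assignment's own environment may be registered under a different name.

```python
import gym

# Minimal interaction loop using the classic Gym API described above.
# FrozenLake-v1 is a built-in 4x4 grid world used here as a stand-in;
# older Gym releases register it as "FrozenLake-v0".
env = gym.make("FrozenLake-v1")

state = env.reset()                               # random initial state
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()            # random action for now
    state, reward, done, info = env.step(action)  # advance one timestep
    total_reward += reward
    env.render()                                  # draw the grid

print("Episode finished with total reward:", total_reward)
```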

4. We have an Action Space of size 4:

0 = down

1 = up

2 = right

3 = left
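For readability, this encoding can be kept in a small lookup table; the sketch below simply mirrors the numbering above (note that other Gym grid worlds, such as FrozenLake, use a different ordering).

```python
# Action encoding used in this report; purely for readable logging/debugging.
ACTIONS = {0: "down", 1: "up", 2: "right", 3: "left"}

action = 2
print(f"Agent moves {ACTIONS[action]}")  # -> "Agent moves right"
```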

5. The Multi-Armed Bandit Problem (Epsilon-Greedy Algorithm)

The multi-armed bandit problem is a classic reinforcement learning example where we are given a slot machine with n arms (bandits), each arm having its own rigged probability distribution of success. Pulling any one of the arms gives you a stochastic reward of either R = +1 for success or R = 0 for failure. Our objective is to pull the arms one by one in sequence such that we maximize our total reward collected in the long run.
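As a concrete illustration, such a bandit can be simulated in a few lines. The arm success probabilities below are made up for the example and are hidden from the agent, which only observes the 0/1 rewards.

```python
import numpy as np

# Hypothetical rigged success probabilities for n = 4 arms (unknown to the agent).
true_probs = np.array([0.10, 0.40, 0.65, 0.30])

def pull(arm):
    """Pull one arm: stochastic reward of +1 on success, 0 on failure."""
    return 1 if np.random.rand() < true_probs[arm] else 0
```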


The non-triviality of the multi-armed bandit problem lies in the fact that we (the agent) cannot access the true bandit probability distributions — all learning is carried out via trial and error and value estimation. So the question is: how do we devise a strategy that learns which bandit is best while still maximizing the reward collected along the way? This is our goal for the multi-armed bandit problem, and having such a strategy would prove very useful in many real-world situations where one would like to select the "best" bandit out of a group of bandits.

In this project, we approach the multi-armed bandit problem with the classical reinforcement learning technique of an epsilon-greedy agent, using a learning framework of reward-average sampling to compute the action value Q(a), which helps the agent improve its future action decisions for long-term reward maximization.

In a nutshell, the epsilon-greedy agent is a hybrid of (1) a completely exploratory agent and (2) a completely greedy agent. In the multi-armed bandit problem, a completely exploratory agent will sample all the bandits at a uniform rate and acquire knowledge about every bandit over time; the caveat of such an agent is that this knowledge is never utilized to help it make better future decisions! At the other extreme, a completely greedy agent will choose a bandit and stick with its choice for the rest of eternity; it will not make an effort to try out other bandits in the system to see whether they have better success rates that would help it maximize its long-term rewards, and is thus very narrow-minded!

How do we perform this in our code?

We perform this by assigning a variable called epsilon. This epsilon switches between the exploratory and the greedy agent. We choose a random number between 0 and 1: if this number is less than epsilon, we tell the agent to explore; if it is greater, we tell the agent to be greedy. This tactic is used in the policy part of our code, as sketched below.
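A sketch of that policy step, combined with the reward-average sampling update of Q(a) mentioned earlier, might look like the following. The variable names are illustrative rather than taken from the actual assignment code, and the training loop reuses the pull() bandit simulator sketched earlier in this section.

```python
import numpy as np

n_arms = 4
Q = np.zeros(n_arms)        # running average reward estimate per arm
counts = np.zeros(n_arms)   # number of times each arm has been pulled
epsilon = 0.1               # probability of exploring

def choose_action():
    # Epsilon-greedy policy: explore with probability epsilon, else act greedily.
    if np.random.rand() < epsilon:
        return np.random.randint(n_arms)   # explore: random arm
    return int(np.argmax(Q))               # exploit: best estimate so far

def update(action, reward):
    # Reward-average sampling: Q(a) <- Q(a) + (r - Q(a)) / N(a)
    counts[action] += 1
    Q[action] += (reward - Q[action]) / counts[action]

# Training loop, reusing the pull() bandit simulator sketched above.
for _ in range(1000):
    a = choose_action()
    update(a, pull(a))

print("Estimated Q(a):", np.round(Q, 2))
```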

Q-Learning Algorithm

Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.

In our environment, we have a reward table that the agent will learn from. It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.

The values stored in the Q-table are called Q-values, and they map to a (state, action) combination.

A Q-value for a particular state-action combination is representative of the "quality" of an action taken from that state. Better Q-values imply better chances of getting greater rewards.

Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:

Q(state, action) ← (1 − α) Q(state, action) + α (reward + γ max_a Q(next state, all actions))

Where:

- α (alpha) is the learning rate (0 < α ≤ 1). Just like in supervised learning settings, α is the extent to which our Q-values are updated in every iteration.

- γ (gamma) is the discount factor (0 ≤ γ ≤ 1). It determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.

What is this saying?

We are assigning (←), or updating, the Q-value of the agent's current state and action by first taking a weight (1 − α) of the old Q-value, then adding the learned value. The learned value is a combination of the reward for taking the current action in the current state and the discounted maximum reward from the next state we will be in once we take the current action.

Basically, we are learning the proper action to take in the current state by looking at the reward for the current state/action combo and the maximum reward for the next state. This will eventually cause our agent to consider the route with the best rewards strung together.

The Q-value of a state-action pair is the sum of the instant reward and the discounted future reward (of the resulting state). The way we store the Q-values for each state and action is through a Q-table.

Q-Table

The Q-table is a matrix where we have a row for every state and a column for every action. It is first initialized to 0, and then values are updated during training.
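Putting the table and the update rule together, the core of the update might look like this sketch. The 16-state/4-action shape matches the 4x4 grid; the alpha and gamma values are placeholders, not necessarily the ones used in the actual assignment.

```python
import numpy as np

n_states, n_actions = 16, 4            # 4x4 grid world, 4 actions
Q = np.zeros((n_states, n_actions))    # Q-table: one row per state, one column per action
alpha, gamma = 0.1, 0.9                # learning rate and discount factor (placeholder values)

def q_update(state, action, reward, next_state):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a'))
    old_value = Q[state, action]
    next_max = np.max(Q[next_state])
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
```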


6. Epsilon

We want the odds of the agent exploring to decrease as time goes on. One way to do this is by updating epsilon at every move of the agent during the training phase. Choosing a decay rate was the difficult part: we want epsilon not to decay too soon, so the agent has time to explore. The way I went about this is that I want the agent to be able to explore the region for at least as many moves as the size of the grid, so within 25 moves the agent should still have a good chance of being in the exploratory phase. I chose decay = 1/1.01 because (1/1.01)^25 ≈ 0.78, which is a good chance of still being in the exploratory phase within 0-25 moves. After 50 moves epsilon drops to about 0.61, which gives the agent good odds of both exploring and exploiting.
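A sketch of that decay schedule, assuming epsilon starts at 1.0 and using the decay constant chosen above, confirms these numbers:

```python
epsilon = 1.0
decay = 1 / 1.01          # epsilon is multiplied by this after every move

for move in range(51):
    if move in (0, 25, 50):
        print(f"move {move:2d}: epsilon = {epsilon:.2f}")
    epsilon *= decay
# move  0: epsilon = 1.00
# move 25: epsilon = 0.78
# move 50: epsilon = 0.61
```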


7. Results and Charts

Here we plot epsilon vs. episode. We see that we have an exponential decay going on here.

[Figure: epsilon vs. episode]

Then we plot total reward vs. episode.

[Figure: total reward vs. episode]

Here we see that the total reward slowly goes up and then plateaus after hitting 8. This is because 8 steps is the optimal path for our algorithm.

CONCLUSION

Our reinforcement learning algorithm has done fairly well. Our agent has been able to learn in an effective manner and has been able to reach its goal in 8 steps.
