Starting from:

$24.99

CS234 Assignment 2 Solution


Please review any additional instructions posted on the assignment page. When you are ready to submit, please follow the instructions on the course website. Make sure you test your code using the provided commands and do not edit outside of the marked areas.
You’ll need to download the starter code and fill the appropriate functions following the instructions from the handout and the code’s documentation. Training DeepMind’s network on Pong takes roughly 12 hours on GPU, so please start early! (Only a completed run will recieve full credit) We will give you access to an Azure GPU cluster. You’ll find the setup instructions on the course assignment page.
Introduction
1 Test Environment (5 pts)
Before running our code on Pong, it is crucial to test our code on a test environment. You should be able to run your models on CPU in no more than a few minutes on the following environment:
• 4 states: 0,1,2,3
• 5 actions: 0,1,2,3,4. Action 0 ≤ i ≤ 3 goes to state i, while action 4 makes the agent stay in the same state.
• Rewards: Going to state i from states 0, 1, and 3 gives a reward R(i), where R(0) = 0.1,R(1) = −0.2,R(2) = 0,R(3) = −0.1. If we start in state 2, then the rewards defind above are multiplied by −10. See Table 1 for the full transition and reward structure.
1
• One episode lasts 5 time steps (for a total of 5 actions) and always starts in state 0 (no rewards at the initial state).
State (s) Action (a) Next State (s’) Reward (R)
0 0 0 0.1
0 1 1 -0.2
0 2 2 0.0
0 3 3 -0.1
0 4 0 0.1
1 0 0 0.1
1 1 1 -0.2
1 2 2 0.0
1 3 3 -0.1
1 4 1 -0.2
2 0 0 -1.0
2 1 1 2.0
2 2 2 0.0
2 3 3 1.0
2 4 2 0.0
3 0 0 0.1
3 1 1 -0.2
3 2 2 0.0
3 3 3 -0.1
3 4 3 -0.1
Table 1: Transition table for the Test Environment
An example of a path (or an episode) in the test environment is shown in Figure 1, and the trajectory can be represented in terms of st,at,Rt as: s0 = 0,a0 = 1,R0 = −0.2,s1 = 1,a1 = 2,R1 = 0,s2 = 2,a2 = 4,R2 = 0,s3 = 2,a3 = 3,R3 = (−0.1) ∗ (−10) = 1,s4 = 3,a4 = 0,R4 = 0.1,s5 = 0.

Figure 1: Example of a path in the Test Environment
1. (written 5pts) What is the maximum sum of rewards that can be achieved in a single episode in the test environment, assuming γ = 1?
2 Q-learning (12 pts)
Tabular setting In the tabular setting, we maintain a table Q(s,a) for each tuple state-action. Given an experience sample (s,a,r,s0), our update rule is
, (1)
where α ∈ R is the learning rate, γ the discount factor.
w . (2)
In other words, we are try to minimize
(3)
w . (4)
The target network’s parameters are updated with the Q-network’s parameters occasionally and are kept fixed between individual updates. Note that when computing the update, we don’t compute gradients with respect to w− (these are considered fixed weights).
Replay Memory As we play, we store our transitions (s,a,r,s0) in a buffer. Old examples are deleted as we store new transitions. To update our parameters, we sample a minibatch from the buffer and perform a stochastic gradient descent update.
There are several things to be noted:
• In this assignment, we will update w every learningfreq steps by using a minibatch of experiences sampled from the replay buffer.
• DeepMind’s deep Q network takes as input the state s and outputs a vector of size = number of actions.
In the Pong environment, we have 6 actions, thus ˆq(s,w) ∈ R6.
• The input of the deep Q network is the concatenation 4 consecutive steps, which results in an input after preprocessing of shape (80 × 80 × 4).
We will now examine these assumptions and implement the epsilon-greedy strategy.
1. (written 3pts) What is one benefit of using experience replay?
2. (written 3pts) What is one benefit of the target network?
3. (written 3pts) What is one benefit of representing the Q function as ˆq(s,w) ∈ RK
4. (coding 3pts) Implement the getaction and update functions in q1schedule.py. Test your implementation by running python q1schedule.py.
3 Linear Approximation (26 pts)
1. (written 3pts) Show that Equations (1) and (2) from section 2 above are exactly the same when qˆ(s,a,w) = wTx(s,a), where w ∈ R|S||A| and x : S × A → R|S||A| such that
1 if s0 = s,a0 = a
s,a s0,a0
0 otherwise
for all (s,a) ∈ S×A, x(s,a) is a vector of length |S||A| where the element corresponding to s0 ∈ S,a0 ∈ A is 1 when s0 = s,a0 = a and is 0 otherwise.
2. (written 3pts) Derive the gradient with regard to the value function parameter w ∈ Rn given qˆ(s,a,w) = wTx(s,a) for any function x(s,a) 7→ x ∈ Rn and write the update rule for w.
3. (coding 15pts) We will now implement linear approximation in Tensorflow. This question will setup the whole pipeline for the remiander of the assignment. You’ll need to implement the following functions in q2linear.py (pleasd read throughq2linear.py) :
• addplaceholdersop
• getqvaluesop
• addupdatetargetop
• addlossop
• addoptimizerop
Test your code by running python q2linear.py locally on CPU. This will run linear approximation with Tensorflow on the test environment. Running this implementation should only take a minute or two.
4. (written 5pts) Do you reach the optimal achievable reward on the test environment? Attach the plot scores.png from the directory results/q2linear to your writeup.
4 Implementing DeepMind’s DQN (15 pts)
1. (coding 10pts) Implement the deep Q-network as described in [mnih2015human] by implementing getq valuesop in q3nature.py. The rest of the code inherits from what you wrote for linear approximation. Test your implementation locally on CPU on the test environment by running python q3nature.py. Running this implementation should only take a minute or two.
2. (written 5pts) Attach the plot of scores, scores.png, from the directory results/q3nature to your writeup. Compare this model with linear approximation. How do the final performances compare?
How about the training time?
5 DQN on Atari (27 pts)
The Atari environment from OpenAI gym returns observations (or original frames) of size (210×160×3), the last dimension corresponds to the RGB channels filled with values between 0 and 255 (uint8). Following DeepMind’s paper [mnih2015human], we will apply some preprocessing to the observations:
• Single frame encoding: To encode a single frame, we take the maximum value for each pixel color value over the frame being encoded and the previous frame. In other words, we return a pixel-wise max-pooling of the last 2 observations.
• Dimensionality reduction: Convert the encoded frame to grey scale, and rescale it to (80 × 80 × 1). (See Figure 2)
The above preprocessing is applied to the 4 most recent observations and these encoded frames are stacked together to produce the input (of shape (80×80×4)) to the Q-function. Also, for each time we decide on an action, we perform that action for 4 time steps. This reduces the frequency of decisions without impacting the performance too much and enables us to play 4 times as many games while training. You can refer to the Methods Section of [mnih2015human] for more details.

(a) Original input (210 × 160 × 3) with RGB colors (b) After preprocessing in grey scale of shape (80×80×1) Figure 2: Pong-v0 environment
1. (written 2pts) Why do we use the last 4 time steps as input to the network for playing Atari games?
2. (written 5pts) What’s the number of parameters of the DQN model (for Pong) if the input to the Q-network is a tensor of shape (80,80,4) and we use ”SAME” padding? How many parameters are required for the linear Q-network, assuming the input is still of shape (80,80,4)? How do the number of parameters compare between the two models?
3. (coding and written 5pts). Now, we’re ready to train on the Atari Pong-v0 environment. First, launch linear approximation on pong with python q4trainatarilinear.pyon Azure’s GPU. This will train the model for 500,000 steps and should take approximately an hour. What do you notice about the performance?
4. (coding and written 10 pts). In this question, we’ll train the agent with DeepMind’s architecture on the Atari Pong-v0 environment. Run python q5trainatarinature.py on Azure’s GPU. This will train the model for 5 million steps and should take around 12 hours. Attach the plot scores.png from the directory results/q5trainatarinature to your writeup. You should get a score of around 13-15 after 5 million time steps. As stated previously, the Deepmind paper claims average human performance is −3.
• The evaluation score printed on terminal should start at -21 and increase.
• The max of the q values should also be increasing
• The standard deviation of q shouldn’t be too small. Otherwise it means that all states have similar q values

5. (written 5pts) Compare the performance of the DeepMind architecure with the linear Q-network approximation. How can you explain the gap in performance?
6 Real world RL with neural networks (10 pts)
Given a stream of batches of n environment interactions (si,ai,ri,s0i) we want to learn the optimal value function using a neural network. The underlying MDP has a finite sized action space.
1. (written 4pts) Your friend first suggests the following approach
(a) Initialize parameters φ of neural network Vφ
(b) For each batch of k tuples ( ) do Stochastic Gradient Descent with loss function Vφ(si)|2 where yi = maxai[ri + γVφ(s0i)]
What is the problem with this approach? (Hint: Think about the type of data we have)
2. (written 3pts) Your friend now suggests the following
(a) Initialize parameters φ of neural network for state-action value function Qφ(s,a)
(b) For each batch of k tuples (si,ai,ri,s0i) do Stochastic Gradient Descent with loss function Qφ(si,ai)|2 where
Now as we just have the network Qφ(s,a) how would you determine V (s) needed for the above training procedure?
3. (written 3pts) Is the above method of learning the Q network guaranteed to give us an approximation of the optimal state action value function?

More products