CS 446/546  
Advanced Topics in Machine Learning 
Programming Assignment #2 

In this homework you will write code to implement tic-tac-toe and to train an agent to play it using techniques from Reinforcement Learning.

 

  

 

 

Your Assignment:  

 

Part 1:  Write a simple simulator for tic-tac-toe that encodes (and displays) a standard 3x3 grid, with checks for legal moves and for the "goal state" (i.e., did someone win, or is the game a draw).
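
For concreteness, here is a minimal sketch of one way such a simulator might look in Python; the board representation and function names are illustrative choices, not requirements.

def display(board):
    """Print the 3x3 grid (board is a 9-element list of 'X', 'O', or ' ')."""
    for r in range(3):
        print("|".join(board[3 * r:3 * r + 3]))
        if r < 2:
            print("-" * 5)

def legal_moves(board):
    """Indices 0-8 of empty cells (the legal moves)."""
    return [i for i, cell in enumerate(board) if cell == " "]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, 'draw' if the board is
    full with no winner, or None if the game is still in progress."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return "draw" if " " not in board else None

if __name__ == "__main__":
    board = list("X O XO   ")
    display(board)
    print("legal:", legal_moves(board), "result:", winner(board))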

 

Implement a form of Q-learning to train a tic-tac-toe playing “agent”. One straightforward option for Q-learning is to use a Q-matrix, in which the rows correspond to states and the columns correspond to actions (you have the option of using a more sophisticated method than a Q-matrix if you wish, such as a Q-network). The Q-matrix is commonly initialized to all zeros at the beginning of a run (although you may use a different initialization strategy if you prefer – please note this in your write-up).  
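
As one illustration (an assumption about representation, not a requirement), the 3^9 = 19,683 possible cell configurations can be indexed by reading the board as a base-3 number, giving a Q-matrix with 19,683 rows (many of which are never visited) and 9 columns, one per cell the agent could mark.

import numpy as np

# Illustrative state indexing for a Q-matrix: each cell is a base-3 digit
# (0 = empty, 1 = X, 2 = O); the 9 columns correspond to marking cells 0-8.
N_STATES, N_ACTIONS = 3 ** 9, 9
Q = np.zeros((N_STATES, N_ACTIONS))   # common all-zeros initialization

def state_index(board):
    """Map a 9-element board ('X', 'O', ' ') to a row index in Q."""
    digits = {" ": 0, "X": 1, "O": 2}
    idx = 0
    for cell in board:
        idx = idx * 3 + digits[cell]
    return idx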

 

At each time step t during an episode, your code should do the following:

•       Observe the current state s_t

•       Choose an action a_t using ε-greedy action selection (I recommend training on-policy)

•       Perform the action

•       Receive the reward r_t (at the conclusion of a game, include a "reward" for the outcome: win, lose, or draw)

•       Observe the new state s_{t+1}

•       Update Q(s_t, a_t) ← Q(s_t, a_t) + η [ r_t + γ max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) ] (see the code sketch after this list)
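
A sketch of one such time step, reusing the Q-matrix and helper functions from the sketches above; the step function here (apply the agent's move, let the opponent respond, and return the next board, the reward, and a done flag) is hypothetical glue code you would write as part of your simulator.

import random
import numpy as np

def q_learning_step(board, Q, epsilon, eta, gamma, step):
    s = state_index(board)                       # observe the current state s_t
    moves = legal_moves(board)
    if random.random() < epsilon:                # epsilon-greedy action selection
        a = random.choice(moves)
    else:
        a = max(moves, key=lambda m: Q[s, m])
    next_board, reward, done = step(board, a)    # perform action, get r_t and s_{t+1}
    s_next = state_index(next_board)
    if done:
        target = reward
    else:
        target = reward + gamma * np.max(Q[s_next, legal_moves(next_board)])
    # Q(s_t, a_t) += eta * (r_t + gamma * max_a' Q(s_{t+1}, a') - Q(s_t, a_t))
    Q[s, a] += eta * (target - Q[s, a])
    return next_board, done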

 

For choosing actions with ε-greedy action selection, initialize ε (for example: ε = 0.1), and decrease it by a constant value Δ every m epochs until it reaches a small final value (I encourage you to experiment with different choices for Δ and m, as well as the initial value for ε).
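
A minimal sketch of one possible decay schedule; the particular numbers below are placeholders to experiment with, not recommended settings.

epsilon, delta, m = 0.1, 0.005, 50      # initial epsilon, decrement, epochs between decrements
epsilon_min = 0.01                      # assumed small floor on exploration

for epoch in range(1, 2001):
    # ... run one training epoch (a batch of games) here ...
    if epoch % m == 0:
        epsilon = max(epsilon_min, epsilon - delta)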

 

After each training epoch, test your agent on 10 games against a baseline random opponent. Record the total score of your agent against the baseline over these 10 games (+1 for a win, +0.5 for a draw, 0 for a loss). After training is completed, produce a plot of the training progress for your agent, with the epoch number on the horizontal axis and the (total score) / 10 against the baseline opponent on the vertical axis.
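
A sketch of this evaluation-and-plotting loop, assuming a hypothetical helper play_game(Q, epsilon=0.0) that plays one greedy game against the random opponent and returns "win", "draw", or "loss".

import matplotlib.pyplot as plt

n_epochs = 2000                          # placeholder number of training epochs
scores = []
for epoch in range(n_epochs):
    # ... run one training epoch here ...
    results = [play_game(Q, epsilon=0.0) for _ in range(10)]   # hypothetical helper
    score = sum(1.0 if r == "win" else 0.5 if r == "draw" else 0.0 for r in results)
    scores.append(score / 10)            # fraction of possible points over 10 games

plt.plot(range(1, n_epochs + 1), scores)
plt.xlabel("epoch")
plt.ylabel("score vs. random baseline (total / 10)")
plt.show()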

 

When you have completed training, play 10 games against your agent and report these results.  

 

Parts 2 and 3 are optional.

 

Part 2:  Experiment with Learning Rate.  Choose 4 different values for the learning rate η, approximately evenly spaced in the range [0, 1], keeping the other parameters set as in Part 1.  For each value, give a performance plot as described above.  Discuss how changing the learning rate changes these results.
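
A sketch of how such a sweep might be organized, assuming a hypothetical train_agent wrapper around your Part 1 training loop that returns the per-epoch score curve against the baseline.

import matplotlib.pyplot as plt

for eta in [0.2, 0.4, 0.6, 0.8]:          # roughly evenly spaced in [0, 1]
    curve = train_agent(eta=eta, gamma=0.9, n_epochs=2000)   # hypothetical wrapper
    plt.plot(curve, label=f"eta = {eta}")
plt.xlabel("epoch")
plt.ylabel("score vs. random baseline (total / 10)")
plt.legend()
plt.show()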

 

Part 3: Q-network.   Use a Q-network in place of a Q-matrix; describe your network architecture and how you trained it. Provide a performance plot.
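
As one possibility (an illustrative architecture, not a required one), a small fully connected network over a one-hot encoding of the board could be used, e.g. in PyTorch; training would minimize the squared TD error between Q(s_t, a_t) and r_t + γ max_{a'} Q(s_{t+1}, a').

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a one-hot board encoding (27 inputs) to one Q-value per action (9 outputs)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(27, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 9),
        )

    def forward(self, x):
        return self.net(x)

def encode(board):
    """One-hot encode a 9-element board ('X', 'O', ' ') as a length-27 tensor."""
    order = {" ": 0, "X": 1, "O": 2}
    x = torch.zeros(27)
    for i, cell in enumerate(board):
        x[3 * i + order[cell]] = 1.0
    return x

The one-hot encoding is a design choice: it avoids imposing an artificial ordering on empty/X/O cell values.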

 

Here is what you need to turn in: 

 

Your spell-checked, double-spaced report with the information requested above, along with your (briefly) commented code and instructions for how to run it.  Your report should include a detailed summary of your approach and results.

 

How to turn it in: 

 

•       Send these items in electronic format to me (arhodes@pdx.edu) on the due date. No hard copy please!  

•       The report should be in PDF format and the code should be in its original format (e.g., .py, .m)

 
