CS 7642: Reinforcement Learning and Decision Making
Project #1: TD(λ)

1 Problem

1.1 Description

For this project, you will read Richard Sutton's 1988 paper "Learning to Predict by the Methods of Temporal Differences." You will then create an implementation and replication of the results found in Figures 3, 4, and 5. It may also be informative to compare these results with those in Chapter 7 of Sutton's textbook [1].

You will present your work in a written report of at most 5 pages. The report should include a description of the experiments replicated, how the experiments were implemented (the environment, the algorithms, etc.), and the outcomes of the experiments. You should provide an analysis of these results. What exactly do the results demonstrate? Are there any significant differences between your results and the results in the original paper? How can you explain those differences? Describe any pitfalls you ran into while trying to replicate the experiments from the paper (e.g., unclear parameters, contradictory descriptions of the procedure to follow, results that differ wildly from the published results). What steps did you take to overcome those pitfalls? What assumptions did you make, and why were those assumptions justified? Add anything else that you think is relevant to discuss.

1.2 Procedure

As noted, replicating results can be challenging. Expect some issues along the way and be prepared to resolve them.

• Read Sutton's paper.
• Write the code necessary to replicate Sutton's experiments (a minimal sketch of one possible starting point follows this list).
  – You will be replicating Figures 3, 4, and 5 (check the erratum at the end of the paper).
• Create the graphs.
  – Replicate Figures 3, 4, and 5.
  – Make sure to include a README.md file for your repository.
    ∗ Include thorough and detailed instructions on how to run your source code in the README.md.
    ∗ If you work in a notebook, like Jupyter, include an export of your code in a .py file along with your notebook.
    ∗ The README.md file should be placed in the project 1 folder in your repository.
  – You will be penalized by 25 points if you:
    ∗ Do not have any code or do not submit your full code to the GitHub repository.
    ∗ Do not include the git hash for your last commit in your paper.
• Write a paper describing the experiments, how you replicated them, and any other relevant information.
  – Include the hash for your last commit to the GitHub repository in the paper's header.
  – 5 pages maximum – really, you will lose points for longer papers.
  – Make sure your graphs are legible and that you cite sources properly. While it is not required, we recommend you use a conference paper format.
  – Describe the problem.
    ∗ You should assume your reader has not read Sutton (1988) and provide sufficient background for them to understand your work and its significance. Don't cut corners here; we have never read your take on, and analysis of, the random walk.
  – Present your graphs.
    ∗ Include discussions regarding them.
  – Describe the experiments.
    ∗ Discuss the implementation.
    ∗ Discuss the outcome.
    ∗ Discuss the generated data.
  – Analyze your results.
    ∗ How do they match?
    ∗ How do they differ?
    ∗ Why is this the case, and why is it important? Analyze your results in the context of the problem and the approach. Your analysis is where you demonstrate your understanding to the reader.
  – Describe any problems/pitfalls you encountered.
    ∗ How did you overcome them?
    ∗ What were your assumptions/justifications for this solution?
  – Yes, it can be done within 5 pages and in a normal font size.
  – Save the paper in PDF format.
  – Submit!
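To make the replication target concrete, the following is a minimal sketch (in Python with NumPy) of the bounded random walk from Sutton (1988) and a per-sequence TD(λ) weight update. It is a starting point under stated assumptions, not the required implementation: the function names (generate_walk, td_lambda_episode), the α and λ values, the random seed, the 0.5 weight initialization, and the choice to apply weight changes after every sequence (as in Figures 4 and 5) are illustrative choices of this sketch, not requirements of the assignment.

    # Minimal, hedged sketch: the 7-state bounded random walk (terminals A and G,
    # non-terminal states B..F, start at D) and the TD(lambda) rule from Sutton (1988).
    import numpy as np

    N_STATES = 5                       # non-terminal states B..F
    TRUE_VALUES = np.arange(1, 6) / 6  # ideal predictions: P(terminate on the right)

    def generate_walk(rng):
        """Return the sequence of one-hot feature vectors and the terminal reward."""
        state = 2                                   # start in the center state (D)
        features = []
        while 0 <= state <= N_STATES - 1:
            x = np.zeros(N_STATES)
            x[state] = 1.0                          # unit basis (one-hot) representation
            features.append(x)
            state += rng.choice((-1, 1))            # step left or right with equal probability
        reward = 1.0 if state == N_STATES else 0.0  # outcome z = 1 only if the walk ends on the right
        return features, reward

    def td_lambda_episode(w, features, reward, alpha, lam):
        """Accumulate the TD(lambda) weight change for one sequence."""
        delta_w = np.zeros_like(w)
        trace = np.zeros_like(w)                    # eligibility trace of past gradients
        for t, x in enumerate(features):
            trace = lam * trace + x
            P_t = w @ x                             # current prediction for state x_t
            # the "next prediction" is the outcome z at the terminal step
            P_next = reward if t == len(features) - 1 else w @ features[t + 1]
            delta_w += alpha * (P_next - P_t) * trace
        return delta_w

    # Example: online training in the style of Figures 4 and 5, on one training
    # set of 10 sequences, with weights applied after each sequence.
    rng = np.random.default_rng(0)
    training_set = [generate_walk(rng) for _ in range(10)]
    w = np.full(N_STATES, 0.5)                      # assumed initial guess of 0.5 per state
    for features, reward in training_set:
        w += td_lambda_episode(w, features, reward, alpha=0.1, lam=0.3)
    rmse = np.sqrt(np.mean((w - TRUE_VALUES) ** 2))
    print(f"RMSE vs. true predictions: {rmse:.3f}")

For a Figure 3-style experiment, the same pieces would be used differently: accumulate the weight changes over all ten sequences of a training set, apply them only after each complete presentation, and repeat presentations of that same training set until the weights converge.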
2 Resources

2.1 Lectures

• Lesson 4: TD and Friends

2.2 Readings

• Sutton (1988) [2]
• Chapter 7 (7.1 n-step TD Prediction) and Chapter 12 (12.2 TD(λ)) of [1]

3 Submission Details

• Your written report in PDF format (make sure to include the git hash of your last commit)

To complete the assignment, submit your written report to Project 1 under your Assignments on Canvas: https://gatech.instructure.com

3.1 Grading and Regrading

When your projects are graded, you will receive feedback explaining your errors (and your successes!) in some level of detail. This feedback is for your benefit, both on this assignment and on future assignments. Internalizing this feedback is considered part of your learning goals for this course, alongside goals such as understanding game theory, random variables, and noise.

Because we consider your ability to internalize feedback a learning goal, we also assess it: it counts for 10% of each assignment. We default to assigning you full credit. If you request a regrade and do not receive at least 5 points as a result of the request, you will lose those 10 points.

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press, 2018.
[2] Richard S. Sutton. "Learning to Predict by the Methods of Temporal Differences". In: Machine Learning 3 (Aug. 1988), pp. 9–44.