The goal of this assignment is to experiment with imitation learning, including direct behavior cloning and the DAgger algorithm.
Section 1. Behavioral Cloning
1. The starter code provides an expert policy for each of the MuJoCo tasks in OpenAI Gym. Fill in the blanks in the code marked with Todo to implement behavioral cloning. A command for running behavioral cloning is given in the Readme file.
The following files have blanks in them and can be read in this order:
• scripts/run_hw1_behavior_cloning.py
• infrastructure/rl_trainer.py
• agents/bc_agent.py
• policies/MLP_policy.py
• infrastructure/replay_buffer.py
• infrastructure/utils.py
• infrastructure/tf_utils.py
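At its core, behavioral cloning is plain supervised learning: fit a policy to expert (observation, action) pairs by minimizing a regression loss. The toy sketch below is not the starter code; it uses a linear policy fit in closed form, whereas the assignment's MLP policy is trained by gradient descent on the same mean-squared-error objective.

```python
import numpy as np

# Toy behavioral-cloning sketch: regress actions on observations.
# A synthetic "expert" generates slightly noisy linear demonstrations.
rng = np.random.default_rng(0)
obs = rng.normal(size=(1000, 4))                         # expert observations
true_W = rng.normal(size=(4, 2))
acts = obs @ true_W + 0.01 * rng.normal(size=(1000, 2))  # expert actions

# Supervised fit: argmin_W ||obs @ W - acts||^2 (closed-form least squares).
W, *_ = np.linalg.lstsq(obs, acts, rcond=None)

mse = np.mean((obs @ W - acts) ** 2)  # training loss of the cloned policy
```

The starter code's MLP policy replaces the closed-form solve with minibatch SGD, but the objective, imitating expert actions on expert states, is the same.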
2. Run behavioral cloning (BC) and report results on two tasks: one task where a behavioral cloning agent achieves at least 30% of the performance of the expert, and one task where it does not. When providing results, report the mean and standard deviation of the return over multiple rollouts in a table, and state which task was used. Be sure to set up a fair comparison, in terms of network size, amount of data, and number of training iterations, and provide these details (and any others you feel are appropriate) in the table caption.
Tip: to speed up run times, the video logging can be disabled by setting --video_log_freq -1
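The reporting protocol above amounts to: run several independent rollouts, sum the rewards in each, and report the mean and standard deviation across rollouts. A minimal sketch, using a placeholder rollout in place of the starter code's sampling utilities:

```python
import numpy as np

# Stand-in for one policy rollout: in the assignment this would step a Gym
# environment with the learned policy and sum the per-step rewards.
def rollout_return(seed):
    rng = np.random.default_rng(seed)
    rewards = rng.uniform(0.0, 1.0, size=100)  # placeholder per-step rewards
    return rewards.sum()

# Evaluate over multiple rollouts and report mean ± std of the return.
returns = np.array([rollout_return(s) for s in range(10)])
mean_ret, std_ret = returns.mean(), returns.std()
```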
3. Experiment with one hyperparameter that affects the performance of the behavioral cloning agent, such as the number of demonstrations, the number of training epochs, the variance of the expert policy, or something that you come up with yourself. For one of the tasks used in the previous question, show a graph of how the BC agent’s performance varies with the value of this hyperparameter, and state the hyperparameter and a brief rationale for why you chose it in the caption for the graph.
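The sweep itself is a simple loop: retrain the agent at each hyperparameter value and record the metric you will plot. A toy sketch, sweeping the number of demonstrations with the same linear stand-in for a BC policy (in the assignment you would retrain the MLP and plot mean return instead of held-out error):

```python
import numpy as np

rng = np.random.default_rng(0)
true_W = rng.normal(size=(4, 2))          # synthetic "expert"
test_obs = rng.normal(size=(500, 4))      # held-out states for evaluation

# Sweep the hyperparameter (here: dataset size), retraining at each value.
errors = []
for n_demos in [10, 50, 200, 1000]:
    obs = rng.normal(size=(n_demos, 4))
    acts = obs @ true_W + 0.1 * rng.normal(size=(n_demos, 2))
    W, *_ = np.linalg.lstsq(obs, acts, rcond=None)
    # Held-out error of the cloned policy against the expert's actions.
    errors.append(np.mean((test_obs @ W - test_obs @ true_W) ** 2))
```

Plotting `errors` (or mean return) against the swept values gives the graph the question asks for.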
Section 2. DAgger
1. Implement DAgger by filling out all the remaining blanks in the code marked with Todo. A command for running DAgger is provided in the Readme file.
2. Run DAgger and report results on one task in which DAgger can learn a better policy than behavioral cloning. Report your results in the form of a learning curve, plotting the number of DAgger iterations vs. the policy’s mean return, with error bars to show the standard deviation. Include the performance of the expert policy and the behavioral cloning agent on the same plot. In the caption, state which task you used, and any details regarding network architecture, amount of data, etc.
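The DAgger loop differs from behavioral cloning in one step: new data comes from states the *learner* visits, relabeled with the expert's actions, and is aggregated into the training set. A toy sketch with a linear policy and a queryable linear expert (names here are illustrative, not the starter code's):

```python
import numpy as np

rng = np.random.default_rng(0)
expert_W = rng.normal(size=(4, 2))
expert = lambda o: o @ expert_W                     # queryable expert policy

obs = rng.normal(size=(50, 4))                      # initial expert demos
acts = expert(obs)
for _ in range(5):
    # (1) Supervised fit on the aggregated dataset (as in BC).
    W, *_ = np.linalg.lstsq(obs, acts, rcond=None)
    # (2) Collect observations from states the learner visits
    #     (a real implementation would roll out the current policy W).
    new_obs = rng.normal(size=(50, 4))
    # (3) Relabel those observations by querying the expert.
    new_acts = expert(new_obs)
    # (4) Aggregate into the dataset and repeat.
    obs = np.vstack([obs, new_obs])
    acts = np.vstack([acts, new_acts])
```

Because training data is drawn from the learner's own state distribution, DAgger avoids the compounding distribution shift that limits plain behavioral cloning.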