CS7642-Reinforcement Learning Maze Solved

(c) Next, implement the EM algorithm for Gaussian mixtures. Write three functions: log-likelihood, gm_e_step, and gm_m_step, as given in the lecture. Identify the correct arguments and the order in which to run them. Initialize the algorithm with means as in (1.1), covariances $\hat{\Sigma}_1 = \hat{\Sigma}_2 = I$, and mixing weights $\hat{\pi}_1 = \hat{\pi}_2$.

Run the algorithm until convergence and show the resulting cluster assignments on a scatter plot using different colors, marker shapes, or both. Also plot the log-likelihood vs. the number of iterations. Report your misclassification error.

(d) Comment on the results:

(a) Compare the performance of k-Means and EM based on the resulting cluster assignments.

(b) Compare the performance of k-Means and EM based on their convergence rate. What is the bottleneck for each method?

(c) Experiment with 5 different data realizations (generate new data), run your algorithms, and summarize your findings. Does the algorithm performance depend on the particular realization of the data?
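The three EM functions requested in part (c) of the implementation could be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, the argument order `(X, weights, means, covs)` is a guess rather than the lecture's signature, and the helpers `log_gaussian`, `component_log_probs`, and `run_em` are hypothetical names introduced here.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) at each row of X (shape n x d)."""
    d = X.shape[1]
    diff = X - mean
    # Mahalanobis term via a linear solve rather than an explicit inverse.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def component_log_probs(X, weights, means, covs):
    """n x K matrix with entries log(pi_k) + log N(x_i | mu_k, Sigma_k)."""
    return np.column_stack([
        np.log(w) + log_gaussian(X, m, c)
        for w, m, c in zip(weights, means, covs)
    ])

def log_likelihood(X, weights, means, covs):
    """Total data log-likelihood, computed with a log-sum-exp for stability."""
    log_p = component_log_probs(X, weights, means, covs)
    top = log_p.max(axis=1, keepdims=True)
    return float(np.sum(top[:, 0] + np.log(np.exp(log_p - top).sum(axis=1))))

def gm_e_step(X, weights, means, covs):
    """E-step: posterior probability (responsibility) of each component."""
    log_p = component_log_probs(X, weights, means, covs)
    log_p = log_p - log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    return resp / resp.sum(axis=1, keepdims=True)

def gm_m_step(X, resp):
    """M-step: re-estimate weights, means, and covariances."""
    n = X.shape[0]
    nk = resp.sum(axis=0)                     # effective count per component
    weights = nk / n
    means = (resp.T @ X) / nk[:, None]
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / nk[k])
    return weights, means, np.array(covs)

def run_em(X, weights, means, covs, tol=1e-6, max_iter=200):
    """Alternate E and M steps until the log-likelihood gain drops below tol."""
    history = [log_likelihood(X, weights, means, covs)]
    for _ in range(max_iter):
        resp = gm_e_step(X, weights, means, covs)
        weights, means, covs = gm_m_step(X, resp)
        history.append(log_likelihood(X, weights, means, covs))
        if history[-1] - history[-2] < tol:
            break
    return weights, means, covs, resp, history
```

Cluster assignments can then be read off as `resp.argmax(axis=1)`, and `history` gives the log-likelihood curve to plot against the iteration number.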

2. Reinforcement Learning, 65 pts. In this portion of the assignment, you will write software to implement a simulated environment and build a reinforcement learning agent that discovers the optimal (shortest) path to a goal. The agent’s environment will look like:

 

Each cell is a state in the environment. The cell labeled “I” is the initial state of the agent. The cell labeled “G” is the goal state of the agent. The black cells are barriers, that is, states that are inaccessible to the agent. At each time step, the agent may move one cell to the left, right, up, or down. The environment does not wrap around. Thus, if the agent is in the lower left corner and tries to move down, it will remain in the same cell. Likewise, if the agent is in the initial state and tries to move up (into a barrier), it will remain in the same cell.
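The movement rules above can be sketched as a small transition function. The grid below is a made-up layout for illustration only (the real one is defined by the assignment's figure), and the names `GRID`, `ACTIONS`, `find`, and `step` are hypothetical.

```python
# Hypothetical layout: '#' is a barrier, 'I' the initial state, 'G' the goal.
GRID = [
    "#..G",
    "I.#.",
    "....",
]

# Row/column offsets, with (0, 0) at the top left corner.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def find(grid, symbol):
    """Return the (row, col) of the first cell containing symbol."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in grid")

def step(grid, state, action):
    """Apply one move; stay in place when blocked by an edge or a barrier."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    off_grid = not (0 <= r < len(grid) and 0 <= c < len(grid[0]))
    if off_grid or grid[r][c] == "#":
        return state
    return (r, c)
```

In this toy layout, moving up from the initial state runs into a barrier and moving down from the lower left corner runs off the grid, so both calls return the unchanged state.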

2.1. Implementation. You should implement a Q-learning algorithm that selects moves for the agent. The algorithm should explore by choosing the action with the maximum Q-value 90% of the time and choosing one of the four actions at random the remaining 10% of the time. Break ties when the Q-values are zero for all the actions (which happens initially) by choosing uniformly among the actions. You therefore have two conditions under which to act randomly: a fraction $\epsilon$ of the time, or when the Q-values are all zero.

The simulation consists of a series of trials, each of which runs until the agent reaches the goal state or until it reaches a maximum number of steps, which you can set to 100. The reward at the goal is 10, and the reward at every other state is 0. You can set the discount factor $\gamma$ to 0.9.
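One possible shape for the action selection, update, and trial loop is sketched below. Several details are assumptions rather than part of the handout: the Q-table is a dictionary keyed by `(state, action)`, the reward of 10 is given on entering the goal, the learning rate defaults to 1, the 0.9 parameter is treated as the discount factor, ties among equal maximal Q-values are also broken at random, and `step_fn` is a hypothetical callable mapping `(state, action)` to the next state.

```python
import random

ACTIONS = ["left", "right", "up", "down"]

def choose_action(q, state, epsilon, rng=random):
    """Epsilon-greedy choice; also random when every Q-value is zero."""
    values = [q.get((state, a), 0.0) for a in ACTIONS]
    if rng.random() < epsilon or all(v == 0.0 for v in values):
        return rng.choice(ACTIONS)
    best = max(values)
    # Assumption: remaining ties among maximal values are broken at random.
    return rng.choice([a for a, v in zip(ACTIONS, values) if v == best])

def q_update(q, state, action, reward, next_state, alpha=1.0, gamma=0.9):
    """Standard Q-learning update toward reward + gamma * max Q(next, .)."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

def run_trial(q, step_fn, start, goal, epsilon=0.1, max_steps=100, rng=random):
    """Run one trial and return the number of steps taken."""
    state, steps = start, 0
    while state != goal and steps < max_steps:
        action = choose_action(q, state, epsilon, rng)
        next_state = step_fn(state, action)
        reward = 10.0 if next_state == goal else 0.0
        q_update(q, state, action, reward, next_state)
        state, steps = next_state, steps + 1
    return steps
```

Calling `run_trial` repeatedly with the same `q` dictionary and recording the returned step counts gives the data for a steps-versus-trials graph.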

2.2. Experiments.

1. Basic Q-learning experiments.

(a)     Run your algorithm several times on the given environment.

(b)    Run your algorithm by passing in a list of 2 goal locations: (1,8) and (5,6). Note: we are using 0-indexing, where (0,0) is top left corner. Report on the results.

2. Experiment with the exploration strategy, in the original environment (20 points).

(a) Try different $\epsilon$ values in $\epsilon$-greedy exploration: We asked you to use a rate of $\epsilon = 0.1$, but try also 0.5 and 0.01. Graph the results (for the 3 $\epsilon$-values) and discuss the costs and benefits of higher and lower exploration rates.

(b) Try exploring with a policy derived from the softmax of Q-values described in the Q-learning lecture. Use the values $\beta \in \{1, 3, 6\}$ for your experiment, keeping $\beta$ fixed throughout the training.

(c) Instead of fixing $\beta = \beta_0$ at its initial value, we will increase the value of $\beta$ as the number of episodes $t$ increases:

(2.1) $\beta(t) = \beta_0 e^{kt}$

That is, the value of $\beta$ is fixed for a particular episode. Run the training again for different values of $k \in \{0.05, 0.1, 0.25, 0.5\}$, keeping $\beta_0 = 1.0$. Compare the results obtained with this approach to those obtained with a static value of $\beta$.
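Softmax exploration with the per-episode schedule in (2.1) might look like the sketch below. The lecture's exact softmax form is not reproduced in this handout, so the code assumes action probabilities proportional to the exponential of the softmax parameter times the Q-value; `softmax_probs`, `beta_schedule`, and `softmax_action` are hypothetical names.

```python
import math
import random

def softmax_probs(q_values, beta):
    """Probabilities proportional to exp(beta * Q); assumed softmax form."""
    top = max(q_values)
    # Subtracting the maximum leaves the probabilities unchanged
    # and avoids overflow in exp for large beta.
    weights = [math.exp(beta * (v - top)) for v in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def beta_schedule(t, beta0=1.0, k=0.1):
    """Equation (2.1): parameter value used throughout episode t."""
    return beta0 * math.exp(k * t)

def softmax_action(q_values, beta, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, beta)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

The schedule is evaluated once at the start of each episode and held fixed inside it. For long runs with large `k`, `math.exp(k * t)` eventually overflows, so capping the value may be needed in practice.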

3. Stochastic environments.

(a) Make the environment stochastic (uncertain), such that the agent has, say, only a 95% chance of moving in the chosen direction and a 5% chance of moving in a random direction.

(b) Change the learning rule to handle the non-determinism, and experiment with different values of the probability that the environment performs a random action, $p_{rand} \in \{0.05, 0.1, 0.25, 0.5\}$, in this new rule. How does performance vary as the environment becomes more stochastic?

Use the same parameters as in the first part, except set the learning rate $\alpha$ to a value less than 1, e.g., $\alpha = 0.5$. Report your results.
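The action noise and the effect of a learning rate below 1 can be illustrated with the short sketch below. It assumes that the random direction is drawn uniformly from all four actions (so it may coincide with the chosen one); `noisy_action` and `soft_update` are hypothetical helper names.

```python
import random

ACTIONS = ["left", "right", "up", "down"]

def noisy_action(action, p_rand, rng=random):
    """With probability p_rand, replace the chosen action by a random one."""
    if rng.random() < p_rand:
        # Assumption: drawn from all four actions, including the chosen one.
        return rng.choice(ACTIONS)
    return action

def soft_update(old, target, alpha=0.5):
    """Move the old estimate a fraction alpha of the way toward the target."""
    return (1.0 - alpha) * old + alpha * target
```

With a learning rate of 1, each update overwrites the estimate with the latest sampled target, so one unlucky random move can erase what was learned. A smaller value averages over several outcomes instead: starting from 0 with a target of 10 and a rate of 0.5, two updates give 5 and then 7.5.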

2.3. Write-up. Hand in a brief summary of your experiments. For each sub-section, this summary should include a one-paragraph overview of the problem and your implementation. It should include a graph showing the number of steps required to reach the goal as a function of learning trials (one trial is one run of the agent through the environment until it reaches the goal or the maximum number of steps). You should also make a figure showing the policy of your agent for the first section (2.1.1). The policy can be summarized by making an array of cells corresponding to the states of the environment and indicating the direction (up, down, left, right) that the agent is most likely to move if it is in that state.
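The policy array described above could be produced as plain text along the following lines. The sketch assumes a Q-table stored as a dictionary keyed by `((row, col), action)` and a grid given as a list of strings with '#' for barriers and 'G' for the goal; both conventions, and the name `policy_grid`, are hypothetical.

```python
ARROWS = {"up": "^", "down": "v", "left": "<", "right": ">"}

def policy_grid(q, grid):
    """Render the greedy action of each cell as an arrow character."""
    rows = []
    for r, row in enumerate(grid):
        cells = []
        for c, ch in enumerate(row):
            if ch in ("#", "G"):
                cells.append(ch)          # barriers and the goal have no action
                continue
            values = {a: q.get(((r, c), a), 0.0) for a in ARROWS}
            best = max(values, key=values.get)
            # '?' marks cells whose Q-values are still all zero.
            cells.append(ARROWS[best] if values[best] != 0.0 else "?")
        rows.append(" ".join(cells))
    return "\n".join(rows)
```

Printing the returned string after training gives one row of arrows per grid row, which can be pasted into the write-up or used as the basis for a plotted figure.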
