The code in the ipynb file should do Problem 1 if you set hw7MC.algorithm = 'value'.
It should do Problem 2 if you set hw7MC.algorithm = 'policy'.
Problem 1
Complete the coding of the provided ipynb file, which prices the Bermudan put option under GBM, with the same parameters as in the Excel worksheet from class (which has been posted on Canvas), using the Longstaff-Schwartz method.
Report an estimated price, based on 10000 paths.
At each exercise date, do the regression using only the paths that are in-the-money at that specific date (so there may be different subsamples on different dates), not all of the paths.
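For concreteness, here is a minimal sketch of one backward-induction step of this regression, under assumed (hypothetical) names: S_n for the simulated prices at $t_n$, future_cashflow for each path's cashflow valued at $t_{n+1}$, df for the one-period discount factor, and a simple polynomial basis; the provided notebook's variables and basis functions may differ.

```python
import numpy as np

def lsm_step(S_n, future_cashflow, K, df, degree=2):
    """One Longstaff-Schwartz exercise date: regress discounted continuation
    values on the stock price using only in-the-money paths, then decide
    exercise vs. continue on each path."""
    payoff = np.maximum(K - S_n, 0.0)           # immediate exercise value of the put
    itm = payoff > 0.0                          # in-the-money paths at this date
    cont_value = np.full_like(S_n, np.inf)      # +inf => "never exercise" by default

    if itm.sum() > degree:                      # need enough points for the fit
        coeffs = np.polyfit(S_n[itm], df * future_cashflow[itm], degree)
        cont_value[itm] = np.polyval(coeffs, S_n[itm])

    exercise = itm & (payoff >= cont_value)     # exercise where payoff beats the fitted continuation value
    return np.where(exercise, payoff, df * future_cashflow)
```

Stepping this backward from the last exercise date to the first, then discounting the resulting cashflows to time 0 and averaging over the 10000 paths, would give the reported price estimate.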
Problem 2
The Longstaff-Schwartz method can be regarded as an example of a Reinforcement Learning (RL) algorithm. It selects actions (exercise vs. continue) to try to maximize an expected reward (the option payoff) that depends on the transitions of a state variable (the underlying X).
In particular, Longstaff-Schwartz takes a Value-function approach to solving the dynamic programming formulation of the Reinforcement Learning problem. It finds an estimate $\hat f_n$ (same notation as L7) of the value function for the continuation action, by OLS regression of simulated continuation payoffs on the state variable. This estimated continuation value $\hat f_n$ is compared against the value function for the exercise action, which is just the payoff function (for example $\mathrm{Payoff}(X) = K - X$ in the case of a put):
If $\hat f_n(X_{t_n}) > \mathrm{Payoff}(X_{t_n})$, then continue to hold at time $t_n$.
If $\hat f_n(X_{t_n}) \le \mathrm{Payoff}(X_{t_n})$, then exercise at time $t_n$.
Here we will consider a different approach to RL.
In contrast to Value-function RL, another approach to Reinforcement Learning is the Policy-based approach. Rather than trying to estimate the value function (for the continuation action), it tries to more directly optimize the time-$t_n$ policy function, let's denote it $\Phi$, which maps each $X$ to one of two outputs, $\{0, 1\}$, where 0 denotes continuing to hold, while 1 denotes stopping (exercising).
If $\Phi(X_{t_n}) = 0$, then continue to hold at time $t_n$.
If $\Phi(X_{t_n}) = 1$, then exercise at time $t_n$.
In the particular one-dimensional example of put pricing that we have been studying, we know what form the stopping policy function should take. In theory it should be an indicator function
$$\Phi_{c_n}(X) = \mathbf{1}_{\{X \le c_n\}}$$
where the parameter $c_n$ is a specific critical or threshold level of the stock price $X$. Below $c_n$ you should exercise, and above $c_n$ you should continue to hold the put. So, in principle, we could try to estimate the optimal threshold $c_n$ by choosing it to maximize the average, across all simulated paths, of the simulated payout resulting from the policy $\Phi_{c_n}$ at time $t_n$.
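Purely as an illustration (not required by the assignment), evaluating this average for a candidate threshold could look like the following sketch, with hypothetical names S_n for the prices at $t_n$ and cont_payout for each path's continuation payout discounted to $t_n$:

```python
import numpy as np

def average_payout_hard(c, S_n, cont_payout, K):
    """Average time-t_n payout across paths under the hard policy
    'exercise iff S <= c', i.e. Phi_c(X) = 1_{X <= c}."""
    exercise = S_n <= c                        # hard 0/1 stopping decision
    payoff = np.maximum(K - S_n, 0.0)          # immediate exercise value
    return np.mean(np.where(exercise, payoff, cont_payout))
```

Note that this sample average is a step function of $c$: it changes only when $c$ crosses one of the simulated prices, so its gradient is zero almost everywhere, which is the numerical difficulty discussed next.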
However, this optimization has some numerical difficulties, due to the discontinuity of this hard stopping decision function $\Phi$, which has only two outputs $\{0,1\}$. So suppose that we instead optimize a smoother function, a soft stopping decision function $\varphi$ which produces outputs in the interval between 0 and 1. Let $\varphi$ have two parameters $a, b$ (which may depend on the time slice $n$), and specifically let $\varphi$ be[1] a sigmoid or logistic function of $b(X - a)$:
$$\varphi_{a,b}(X) = \frac{1}{1 + e^{-b(X-a)}} \qquad (*)$$
For large negative $b$, $\varphi_{a,b}$ will behave similarly to $\Phi_a$, in that it is near 1 for $X < a$ and near 0 for $X > a$. But unlike the hard stopping decision function, the soft decision function $\varphi$ is more optimizer-friendly, because it varies continuously between 0 and 1. It can be interpreted as making the exercise decision randomly, with probability $\varphi_{a,b}(X)$ of exercising and probability $1 - \varphi_{a,b}(X)$ of continuing to hold, conditional on $X$. At time $t_n$ the optimizer should maximize, over the parameters $a$ and $b$,
$$\frac{1}{M}\sum_{m=1}^{M}\Big[\varphi_{a,b}(X^m_{t_n})\,\mathrm{Payoff}(X^m_{t_n}) + \big(1 - \varphi_{a,b}(X^m_{t_n})\big)\times(\text{Continuation payout on the } m\text{th path})\Big]$$
where $X^m$ denotes the $m$th simulated path and $M$ is the number of simulated paths. Then calculate payouts by converting this optimized soft stopping decision into a hard stopping decision by
$$\Phi(X_{t_n}) = \mathbf{1}_{\{\varphi_{\hat a,\hat b}(X_{t_n}) \ge 0.5\}} \times \mathbf{1}_{\{\mathrm{Payoff}(X_{t_n}) > 0\}}$$
where $\hat a$ and $\hat b$ denote the optimized parameter values. Multiplying by $\mathbf{1}_{\{\mathrm{Payoff}(X_{t_n}) > 0\}}$ makes sure that you are not exercising OTM options. It should not be needed if your $\varphi$ has been trained correctly, but we include it as a precaution.
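Here is a minimal sketch of the time-$t_n$ soft-policy step, under the same hypothetical variable names as above and using scipy's Nelder-Mead optimizer as one possible choice; the provided notebook may organize this step differently.

```python
import numpy as np
from scipy.optimize import minimize

def soft_exercise_prob(X, a, b):
    """phi_{a,b}(X) = 1 / (1 + exp(-b * (X - a))), equation (*).
    Overflow in exp just saturates the probability at 0 or 1."""
    return 1.0 / (1.0 + np.exp(-b * (X - a)))

def soft_policy_step(S_n, cont_payout, K, a0, b0=-1.0):
    """Optimize (a, b), then convert the soft decision into a hard one."""
    payoff = np.maximum(K - S_n, 0.0)

    def neg_objective(params):
        a, b = params
        p = soft_exercise_prob(S_n, a, b)
        # average payout: exercise with probability p, continue with probability 1 - p
        return -np.mean(p * payoff + (1.0 - p) * cont_payout)

    a_hat, b_hat = minimize(neg_objective, x0=[a0, b0], method="Nelder-Mead").x

    # hard stopping decision, never exercising OTM paths
    exercise = (soft_exercise_prob(S_n, a_hat, b_hat) >= 0.5) & (payoff > 0.0)
    return np.where(exercise, payoff, cont_payout)
```

A natural starting guess is $a_0$ near the strike (or the previous date's $\hat a$) and a negative $b_0$, matching the "large negative $b$" orientation of (*).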
Implement this policy optimization approach by completing the code in the ipynb file. Most of the coding is already provided.
[1] On this problem, which is simple in the sense that the exercise region in X-space is just a one-dimensional interval, a single sigmoid function (*) is sufficient to approximate the optimal stopping policy.
On harder problems, where the exercise region may be a complicated subset of a multidimensional X-space, the function (*) can be upgraded to a deep neural network.
For instance see http://jmlr.org/papers/volume20/18-232/18-232.pdf
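Purely to illustrate that remark (it is not part of this assignment), a soft stopping policy with a small neural network in place of the single sigmoid might look like the following sketch in PyTorch; the class name and layer sizes here are made up.

```python
import torch
import torch.nn as nn

class SoftStoppingPolicy(nn.Module):
    """Maps a (possibly multi-dimensional) state X to a soft exercise probability in (0, 1)."""
    def __init__(self, state_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # final sigmoid plays the role of (*)
        )

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        return self.net(X).squeeze(-1)

# The time-t_n objective is unchanged: maximize the average of
#   p * Payoff(X) + (1 - p) * (continuation payout),
# now by gradient ascent on the network weights instead of on (a, b).
```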