$30
Have you ever wondered how Matlab generates Gaussian random numbers? The method it uses, called the Ziggurat algorithm, is the topic of this midterm project. Each of you will implement the Ziggurat algorithm for a different continuous distribution. (See the end of this document for your assigned distribution.)
First, read and understand this document explaining the algorithm. Then write code implementing it. (As usual, Matlab is recommended but not required.) In addition, the project requires you to do various analytical tasks to correctly set up the algorithm and analyze it. You should turn in a report which includes your well-documented code, as well as a write-up describing your implementation details, and the answers to specific questions asked below. Specific tasks that must be covered in your project report will be labeled by the bold word Task.
1 Ziggurat algorithm overview
The Ziggurat algorithm is a version of rejection sampling, optimized in a way so that some significant pre-computation is required, but once this pre-computation is done, generating each sample from the distribution is, on average, extremely fast. For example, Matlab can generate Gaussian random numbers almost as fast as uniform random numbers, even though the Gaussian distribution is much more complicated.
As described in Homework 4 problem 3, the idea of rejection sampling is to define a twodimensional region R that contains the target PDF fS. This idea can be generalized slightly by using a function g(x) with is proportional to the target PDF fS(x); that is, fS(x) = cg(x) for some constant c. Then the region R should be such that it includes the function g(x). (See Figure 1.) A random pair X,Y is generated uniformly in the region R. If Y < g(X), then the sample X is accepted. Otherwise the sample is rejected and another random pair is generated, repeating until a sample is accepted. To make rejection sampling efficient, two objectives must be satisfied:
1. The region R should be designed so that it is easy to generate random pairs X,Y uniformly.
2. The probability of accepting a given pair should be as large as possible. This requires that theregion R does not include much area above g(x).
The Ziggurat algorithm assumes that the function g(x) is monotonically decreasing. The region R consists of n sub-regions stacked on top of each other: n−1 of these sub-regions are rectangles, and the nth is rectangle adjoined to the tail of the distribution. This idea is illustrated in Figure 2, where each of the n regions (in the figure n = 4) is shaded in a different color.
The rectangles are set up so that each of the n sub-regions (including the nth region, which is not a rectangle) have exactly the same area. Thus, a point X,Y can be uniformly chosen from the full region as follows: First, choose a sub-region from a discrete uniform distribution—i.e.,
Figure 1: The basic rejection sampling concept.
Figure 2: Diagram of the Ziggurat algorithm with n = 4. The name “Ziggurat algorithm” comes from the fact that this picture looks a little bit like a Ziggurat.
each with probability 1/n—in Matlab this can be done with the function randi. Then, for the rectangular regions, X can be chosen from a uniform distribution along the length of the rectangle, and Y can be chosen from a uniform distribution along the height of the rectangle. The nth region requires a more complicated algorithm, but since this is only needed with probability 1/n, this more complicated algorithm only needs to be run rarely. Thus, choosing a candidate X,Y is easy, which satisfies objective 1. If the number of regions n is large enough, then the stack of regions closely matches the function g(x), which satisfies objective 2.
If we focus on a single one of the rectangular regions, we can see another feature that makes the Ziggurat algorithm so fast. Figure 3 shows the kth region, for some k < n. The kth region consists of a rectangle extending horizontally from x1 to xk+1, and vertically from yk+1 to yk. As shown in Figure 2, yk = g(xk) for each k. Since g(x) is decreasing, yk+1 < yk. We can easily choose a point uniformly from this rectangle by generating X ∼ U(x1,xk+1) and Y ∼ U(yk+1,yk). After generating this point, we need to check whether Y < g(X). However, if X < xk, then Y < g(X) no matter what Y is! This means that we do not even need to generate Y , nor evaluate the function g(X), unless X > xk. If n is chosen to be large enough, each rectangle will be much wider than it is tall, so the case where X > xk occurs only rarely.
To implement the Ziggurat algorithm, first the values x1,x2,...,xn and y1,y2,...,yn must be pre-computed. This pre-computation requires some effort, but it only needs to be done once. After
Figure 3: Diagram of the kth region of the Ziggurat algorithm.
these values are computed, the algorithm to generate one sample from the target distribution is summarized by the following pseudo-code.
The above pseudo-code does not give details about how do deal with the nth region. We’ll get back to that! Note that, most of the time, to generate a sample of the target distribution, all we need to do is generate one uniform discrete variable K, generate one continuous uniform random variable X, and do one comparison X < xk. Even though the alternative paths—when k = n, or X > xk, or Y > g(X)—are more complicated, these alternative paths are rare, so most of the time we do not need to execute them.
2 Implement a baseline sampling algorithm
Each of you is assigned a continuous distribution with a PDF of the form
where g(x) is a monotonically decreasing function in the range x > x1. See Section 6 for your distribution assignments.
Task 1 Find the constant c so that your PDF is normalized correctly.
Calculating c requires evaluating the integral
Z ∞
g(x)dx.
x1
For some distributions, this integral is solvable in closed form. If it is not, you may use numerical integration software to calculate c numerically. (In Matlab, you should use the integral function.)
Task 2 Write a function that computes the CDF of your distribution.
As above, for some distributions the CDF will have a closed form expression, but for others it will require numerical integration.
Task 3 Implement a baseline sampling algorithm via the transformation S = FS−1(U) where U ∼ U(0,1).
This baseline algorithm will be used to compare against the more efficient Ziggurat algorithm. To implement this baseline algorithm requires inverting the CDF. If you cannot find a closed-form expression for this inverse function, you may need to implement the bisection algorithm, which is summarized below. This algorithm can be used for any monotonic function h(x). The idea is, in order to find a value x where h(x) = u, maintain values xmin and xmax where h(xmin) < u and h(xmax) > u. A point x is chosen at the midpoint between xmin and xmax. Based on the value of h(x), either xmin or xmax is updated to halve the difference between them. (This pseudocode assumes that h is increasing; the same algorithm can be used for monotonically decreasing functions with a slight variation.)
Input: u, xmin, xmax where h(xmin) < u and h(xmax) > u
The tolerance parameter tol determines exactly how precise the result is. Setting tol = 10−12 or smaller is often a good choice.
Task 4 Generate at least 1000 samples from your baseline sampling algorithm. Keep track of the time it takes to run on your computer. (In Matlab, this can be done with the commands tic and toc.) Plot an estimated PDF from these samples, and make sure it matches the true PDF.
3 Set up the Ziggurat
Before generating samples with the Ziggurat algorithm, the values x1,x2,...,xn must be precomputed, as well as the corresponding values y1,y2,...,yn, where yk = g(xk) for each k. These
Figure 4: Diagram of the nth region of the Ziggurat algorithm.
numbers should be selected so that each of the n regions have exactly the same area. For k < n, the kth region is a rectangle, so its area is
(xk+1 − x1)(yk − yk+1).
For the nth region, the area is
To set up the Ziggurat, first choose a guess A for the area of each region. Based on this A, find x2 so that the 1st region has area A. This can be done using the bisection algorithm. Then find x3 so that the 2nd region has area A. Continue computing x4,...,xn. Given the value that you get for xn, compute the resulting area of the nth region. If this area is less than A, then the original guess for A must have been too large; if greater than A, then the original choice of A must have been too small. Again, the bisection algorithm can be used to find the exact value of A.
Task 5 Calculate x1,x2,...,xn and y1,y2,...,yn for three values of n: 4,32,256. Check to make sure x1 < x2 < ··· < xn. Be sure to include the code that you used. Also list the xk and yk numbers for the n = 4 case. (Your report does not need to list the numbers for n = 32,256.)
4 Sampling from the nth region
Since the nth region of the Ziggurat is not a rectangle, it requires a different algorithm to sample from. However, since this region only comes up with probability 1/n, the algorithm does not need to be especially efficient. Figure 4 shows the nth region. It can be separated into two parts: (i) a rectangle extending horizontally from x1 to xn and vertically from 0 to yn, and (ii) an infinitely long tail extending from xn to ∞, and upper bounded by g(x).
Task 6 If the point X,Y is chosen uniformly from this region, compute the probability p that it falls into the rectangular part of the nth region.
Remember that we do not really need Y , we only need X. When X,Y is in the rectangular part, then X is simply uniform from x1 to xn. Thus, we can do the following: with probability p, generate X ∼ U(x1,xn). Otherwise, generate X from the tail distribution.
Generating X from the tail distribution is the last piece of the Ziggurat algorithm. This requires generating a random variable from the PDF
where the constant c0 is chosen appropriately.
Task 7 Find and implement a method to sample from the tail distribution fT. One of two methods can be used, depending on your distribution:
1. If g(x) has a closed-form integral, then a version of the inverse-CDF transformation can be used, as in your baseline sampling algorithm.
2. If g(x) does not have a closed-form integral, you can use another rejection sampling algorithm. Here you should use the method of Homework 4 problem 3: choose a distribution fR defined on x > xn that is easy to sample from, and a constant M such that MfR(x) > g(x) for all x > xn. For the distribution fR, you may wish to use the exponential distribution restricted to x > xn; another option is the Pareto distribution, given by PDF
where α > 0 is a parameter. The exact choice of distribution fR is up to you, but be sure to explain your choice. To implement this rejection sampling algorithm, generate X ∼ fR, and Y ∼ U(0,MfR(X)). Accept if Y < g(X), and repeat if not. Note that this rejection sampling loop should be done inside a single iteration of the outer rejection sampling loop.
5 Implement and analyze the Ziggurat algorithm
Task 8 Complete the implementation of the Ziggurat algorithm for n = 4, n = 32, and n = 256.
Task 9 For each of the three n values, run your Ziggurat algorithm to generate at least 1,000,000 samples. Make sure you compute the x1,...,xn and y1,...,yn constants only once. For each of the n values, use these samples to plot an estimated PDF, and compare against the true PDF. Carefully check that these match for your n = 4 algorithm, to make sure that your tail algorithm is working correctly.
Task 10 As you generate samples from the three variants of your algorithm, keep track of the following data:
1. How long it takes to run. Compare the time per sample with your baseline algorithm.[1] Make sure you run your code on the same computer, so that the comparison in meaningful.
2. How often each of the following possible outcomes occurs in the rejection loop:
(a) X is accepted because X < xk,
(b) X > xk but X is accepted because Y < g(X),
(c) Y > g(X) so X is rejected,
(d) k = n and a sample is drawn from the rectangular part of the nth region, (e) k = n and the tail algorithm is run.
Task 11 Can you think of any ways to further improve the performance of your algorithm?
6 Distribution assignments
The following table lists the distribution assignments for each student. Recall that the distribution you are simulating is
where c is a constant that you must determine. The table lists the function g(x) and the boundary point x1.
Student Distribution name g(x) x1
Arkan Abuyazid Half-Student’s t (1+4x2)−5/8 0
Curtis Anderson Half-Cauchy
−x3
Cooper Bertke Half-generalized Gaussian e 0
0.5
−x4
Mihir Kotak Half-generalized Gaussian e 0
-Gaussian
Yuye Ran Restricted inverse-gamma x−2e−1/x 0.5
Samuel Roark Restricted Erlang x3e−x
0.5
−0.5
Note: Φ is the standard Gaussian CDF, which can be computed in Matlab using the normcdf function
[1] Some of you may find that your baseline algorithm actually runs faster than the Ziggurat algorithm. If this is happens even for n = 256, be sure to check your Ziggurat implementation to make sure it is working correctly. However, even if everything is working right, the baseline algorithm may still run faster. This could happen for two reasons. First, for some distributions with closed-form CDFs, computing the inverse CDF FS−1 can be quite fast. Second, since Matlab is an interpreted language, it sometimes runs quite slowly compared to the same algorithm implemented in a language like C. For example, Matlab is notoriously slow when it comes to loops. Since the goal of this project is for you to understand the algorithm rather than to build a commercial product, this is nothing to worry about.