Parallel Space Solution of a Nonlinear PDE using OpenMP
The Fisher’s equation as an example of a reaction-diffusion PDE
Simple OpenMP exercises such as the π-computation or the parallel Mandelbrot set are good examples for building a basic understanding of OpenMP constructs, but due to their simplicity they are far removed from practical applications. In the HPC Lab for CSE we will use a parallel PDE mini-application that solves something more sophisticated, while keeping the code relatively short and easy to understand. Our mini-application solves a prototypical reaction-diffusion equation, Fisher's equation (1), which can be used to simulate travelling waves and simple population dynamics, and is given by
∂s/∂t = D (∂²s/∂x² + ∂²s/∂y²) + R s(1 − s)   in Ω,   (1)
where s := s(x,y,t), D is the diffusion constant and R is the reaction constant. The left-hand side represents the rate of change of s over time. On the right-hand side, the first term describes the diffusion of s in space and the second term describes the reaction, i.e. the growth of the wave or population. We set the domain to be the unit square, i.e. Ω = (0,1)², and prescribe Dirichlet boundary conditions on the whole boundary ∂Ω with a fixed constant value of 0.1. The initial condition s_init is also set to 0.1 over the entire domain, except for a circular region in the bottom left corner that is initialized to the value 0.2. The domain Ω is discretized using a uniform grid with (n + 2)×(n + 2) points, where a grid point is denoted by x_{i,j}, for i,j ∈ {0,1,...,n + 1}, and where s^k_{i,j} is the approximation of s at the grid point x_{i,j} at time-step k, see Figure 1.
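For orientation, the sketch below shows one way such a grid can be laid out in memory: the (n + 2)×(n + 2) values, boundary included, stored in a flat row-major array. This is only an illustrative assumption about the data layout (the struct name Field, the row-major ordering and the 0.1 fill value are ours, not necessarily the mini-app's); always follow the provided code.

// Illustrative sketch only: a 2D field of (n+2) x (n+2) grid values stored in
// a flat, row-major 1D array, so that field(i,j) corresponds to s_{i,j}
// for i,j in {0, ..., n+1}.
#include <cstddef>
#include <vector>

struct Field {
    int n;                      // number of interior grid points per direction
    std::vector<double> data;   // (n+2)*(n+2) values, row-major, boundary included

    // initialize every value to 0.1 (boundary and initial condition in this project)
    explicit Field(int n_) : n(n_), data((std::size_t)(n_ + 2) * (n_ + 2), 0.1) {}

    // map the grid indices (i,j) to the position in the flat array
    std::size_t pos(int i, int j) const { return (std::size_t)j * (n + 2) + i; }

    double&       operator()(int i, int j)       { return data[pos(i, j)]; }
    const double& operator()(int i, int j) const { return data[pos(i, j)]; }
};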
We use a second-order finite difference discretization to approximate the spatial derivatives of s at all inner grid points, i.e.,
(∂²s/∂x² + ∂²s/∂y²)|_{x_{i,j}} ≈ (s_{i−1,j} + s_{i+1,j} + s_{i,j−1} + s_{i,j+1} − 4 s_{i,j}) / ∆x²,   (2)
for all i,j ∈ {1,...,n}, where ∆x = 1/(n + 1) is assumed to be sufficiently small. To approximate the time derivative, we use a first-order finite difference scheme, which at time-step k gives
∂s/∂t |_{x_{i,j}, t_k} ≈ (s^k_{i,j} − s^{k−1}_{i,j}) / ∆t,   (3)
where ∆t is the time-step size. Putting together the components from Equations (2) and (3), we obtain the following discretization of Equation (1):
(s^k_{i,j} − s^{k−1}_{i,j}) / ∆t = D (s^k_{i−1,j} + s^k_{i+1,j} + s^k_{i,j−1} + s^k_{i,j+1} − 4 s^k_{i,j}) / ∆x² + R s^k_{i,j}(1 − s^k_{i,j}),   (4)
which we will solve in order to obtain an approximate solution for s. By moving all terms to the right-hand side and multiplying by ∆x²/D, we can reformulate (4) as
f^k_{i,j} := −(4 + α) s^k_{i,j} + s^k_{i−1,j} + s^k_{i+1,j} + s^k_{i,j−1} + s^k_{i,j+1} + β s^k_{i,j}(1 − s^k_{i,j}) + α s^{k−1}_{i,j} = 0   (5)
for each tuple (i,j) and time-step k, with α := ∆x²/(D∆t) and β := R∆x²/D. At time-step k = 1, we initialize s⁰ := s_init, and we look for approximate values s^k_{i,j} that fulfill Equation (5). For all (i,j) at a fixed k,
Figure 1: Discretization of the domain Ω.
we obtain a system of N = n² equations. Each equation is quadratic in s^k_{i,j} and therefore nonlinear, which makes the problem more complicated to solve. We tackle this problem using Newton's method, with which we iteratively find better approximations of the solution of Equation (5). In order to formulate the Newton iteration we introduce the following notation: let s^k be a vectorized version of the approximate solution at time-step k. Then we can consider the equations f^k_{i,j} as functions depending on s^k and collect them into a vector-valued function f : R^N → R^N. For Newton's method, we then have
y^{l+1} = y^l − [J_f(y^l)]^{−1} f(y^l),   (6)
where J_f(y^l) ∈ R^{N×N} is the Jacobian of f. We start with the initial guess y⁰ := s^{k−1}. However, the inverse of the Jacobian, [J_f(y^l)]^{−1}, is not readily available in each iteration. We do not compute it directly; instead we use a matrix-free Conjugate Gradient solver that solves the following linear system of equations for δy^{l+1}:
[J_f(y^l)] δy^{l+1} = f(y^l),   (7)
from which it follows that y^{l+1} = y^l − δy^{l+1}. We iterate over l in Equation (6) until a stopping criterion is reached and we obtain a final solution y_fin. This is the approximate solution for time-step k, i.e. s^k := y_fin, that we originally set out to find in Equation (5). It is then used in turn as the initial guess for computing the approximate solution at the next time-step k + 1.
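A matrix-free solver never assembles J_f(y^l); all it needs is the action of the Jacobian on a vector. One common way to obtain this action, shown here purely as an illustrative sketch (the toy residual and the function names are assumptions, not the mini-app's interface), is a first-order finite difference of the residual, J_f(y)v ≈ (f(y + εv) − f(y)) / ε:

// Illustrative sketch (not the mini-app's code): approximating the
// Jacobian-vector product J_f(y)*v by a finite difference of the residual,
// which is all a matrix-free CG solver needs.
#include <cstdio>
#include <vector>

// Toy nonlinear residual f_i(y) = y_i*(1 - y_i) - 0.25, standing in for the stencil.
static void residual(const std::vector<double>& y, std::vector<double>& f) {
    for (std::size_t i = 0; i < y.size(); ++i)
        f[i] = y[i] * (1.0 - y[i]) - 0.25;
}

// J_f(y) v  ~=  ( f(y + eps*v) - f(y) ) / eps
static void jacobian_times_vector(const std::vector<double>& y,
                                  const std::vector<double>& fy,   // f(y), already computed
                                  const std::vector<double>& v,
                                  std::vector<double>& Jv,
                                  double eps = 1e-7) {
    std::vector<double> yp(y.size()), fp(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) yp[i] = y[i] + eps * v[i];
    residual(yp, fp);
    for (std::size_t i = 0; i < y.size(); ++i) Jv[i] = (fp[i] - fy[i]) / eps;
}

int main() {
    std::vector<double> y = {0.1, 0.2, 0.3}, fy(3), v = {1.0, 1.0, 1.0}, Jv(3);
    residual(y, fy);
    jacobian_times_vector(y, fy, v, Jv);
    for (std::size_t i = 0; i < 3; ++i)   // exact derivative of the toy residual is 1 - 2*y_i
        std::printf("Jv[%zu] = % .6f (exact % .6f)\n", i, Jv[i], 1.0 - 2.0 * y[i]);
    return 0;
}

With such an approximation, every matrix-vector product inside the CG iteration costs one extra residual evaluation instead of a Jacobian assembly.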
Code Walkthrough
The provided code for the project already contains most of the functionality described above; a brief overview of the code is presented in this section. The main tasks of this project are (i) to complete some parts of the sequential code and (ii) to parallelize the code using OpenMP. This project will also serve as an example for using the Message Passing Interface (MPI) in another project. There are three files of interest:
• main.cpp: initialization and main time-stepping loop
• linalg.cpp: the BLAS level 1 (vector-vector) kernels and conjugate gradient solver
(a) Interior grid points
(b) Boundary grid points
Figure 2: Visualization of stencil operators. The blue grid points indicate the grid points on the boundary of the domain with fixed values (see right side, set to zero in our setting). The orange grid points indicate inner grid points of the domain, whose stencil does not depend on the boundary values. The green grid points are still inside the domain but their stencils require boundary values.
– This file defines simple kernels for operating on 1D vectors, including
* dot product: x · y: hpcdot()
* linear combination: z = alpha ∗ x + beta ∗ y: hpclcomb()
* ...
– All the kernels of interest in this HPC Lab for CSE start with hpcxxxxx
• operators.cpp: the stencil operator for the finite difference discretization
– This file has a function/subroutine that defines the stencil operator
– The stencil operators differ depending on the position of the grid point; see Figure 2 and Algorithms 1 and 2 (a C++ sketch of the corresponding residual loop is given after Algorithm 2):
for j = 2 : ydim − 1 do
    for i = 2 : xdim − 1 do
        f_{i,j} = [−(4 + α) s_{i,j} + s_{i−1,j} + s_{i+1,j} + s_{i,j−1} + s_{i,j+1} + β s_{i,j}(1 − s_{i,j})]^{k+1} + α s^k_{i,j} = 0
    end
end
Algorithm 1: Stencil: interior grid points
i = xdim
for j = 2 : ydim − 1 do
    f_{i,j} = [−(4 + α) s_{i,j} + s_{i−1,j} + s_{i,j−1} + s_{i,j+1} + β s_{i,j}(1 − s_{i,j})]^{k+1} + α s^k_{i,j} + (boundary value replacing the missing neighbour s_{i+1,j}) = 0
end
Algorithm 2: Stencil: boundary grid points (shown for the right boundary column i = xdim; the stencil is the same as in Algorithm 1, with the missing neighbour taken from the fixed boundary values)
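As a mental model for Algorithms 1 and 2, the sketch below evaluates the residual of Equation (5) over all inner grid points, reusing the illustrative Field type from the discretization section. Because this sketch stores the fixed Dirichlet boundary values directly in the (n + 2)×(n + 2) array, the interior and boundary cases collapse into a single loop; the names and index ranges are assumptions and differ from operators.cpp, which treats the boundary-adjacent points as separate cases.

// Sketch of the residual evaluation from Equation (5), using the illustrative
// Field type sketched earlier (names are assumptions, not those of operators.cpp).
// U:     current Newton iterate (new time step)
// U_old: solution at the previous time step
// F:     residual, one entry per inner grid point
void residual_stencil(const Field& U, const Field& U_old, Field& F,
                      double alpha, double beta)
{
    const int n = U.n;
    for (int j = 1; j <= n; ++j) {
        for (int i = 1; i <= n; ++i) {
            F(i, j) = -(4.0 + alpha) * U(i, j)
                    + U(i - 1, j) + U(i + 1, j)
                    + U(i, j - 1) + U(i, j + 1)
                    + beta * U(i, j) * (1.0 - U(i, j))
                    + alpha * U_old(i, j);
        }
    }
}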
Compile and run the PDE mini-app on the ICS cluster
Log in to the ICS cluster and then load the gcc and python modules.
[user@login]$ module load gcc python
Go to the Project3 directory and use the makefile to compile the code
[user@login]$ make
Run the application on a compute node with selected parameters, e.g. domain size 128 × 128, 100 time steps and simulation time 0 − 0.005s, for now using a single OpenMP thread.
[user@login]$ salloc
[user@icsnodeXX]$ export OMP_NUM_THREADS=1
[user@icsnodeXX]$ ./main 128 100 0.005
After you implement the first part of the assignment, the output of the mini-application should look like this:
========================================================================
                        Welcome to mini-stencil!
version   :: C++ Serial
mesh      :: 128 * 128  dx = 0.00787402
time      :: 100 time steps from 0 .. 0.005
iteration :: CG 200, Newton 50, tolerance 1e-06
========================================================================
--------------------------------------------------------------------------------
simulation took 0.438155 seconds
1513 conjugate gradient iterations, at rate of 3453.12 iters/second
300 newton iterations
--------------------------------------------------------------------------------
Goodbye!
1. Task: Implementing the linear algebra functions and the stencil operators [35 Points]
The provided implementation is a serial version of the PDE mini-application with some code missing. Your first task is to implement the missing code to get a working PDE mini-application.
• Open the file linalg.cpp and implement the functions hpcxxxxx(). Follow the comments in the code as they are there to help you with the implementation. [18 Points]
• Open file operators.cpp and implement the missing stencil kernels. [15 Points]
After completion of the above steps, the mini-app should produce correct results. Compare the number of conjugate gradient iterations and the Newton iterations with the reference output above. If the numbers are about the same, you have probably implemented everything correctly.
• Plot the solution with the script plotting.py and include it in your report. It should look like Figure 3. [2 Points]
$ ./plotting.py
Figure 3: Output of the mini-app for a domain discretization into 128×128 grid points, 100 time steps with a simulation time from 0 − 0.005s
Note
The script plotting.py reads the results computed by ./main and creates an image output.png.
If X11 forwarding is enabled, it can also show the image in a window. To enable X11 forwarding, connect to the ICS cluster with the -Y option:
$ ssh -Y icsmaster
Then allocate a compute node with X11 forwarding and execute the plotting script on the compute node
$ salloc --x11
Note that to use X11 forwarding, you need an X server installed on your computer. On most Linux distributions it is already installed. On macOS you have to download and install XQuartz, and on Windows you have to download and install Xming. After the installation, you might need to reboot your computer. If the plotting script prints the error
No module named ’matplotlib’
you probably forgot to load the python module. Please call
$ module load python
2. Task: Adding OpenMP to the nonlinear PDE mini-app [50 Points]
Once the serial version of the mini-app is working, we can add OpenMP directives. This allows you to use all cores of one compute node. In this project you will measure and improve the scalability of your PDE simulation code. Scalability describes how the performance of a parallel program changes as it uses more computing resources. We are interested in a performance analysis of both strong scalability and weak scalability. Strong scaling measures how much faster a threaded PDE solver becomes for a fixed problem size: we keep the same number of discretization points per direction, but run with more and more threads, and we hope/expect that the result is computed faster. Weak scaling addresses a complementary question: how effectively can we increase the size of the problem? In a weak scaling study, each compute core has the same amount of work, which means that running on more threads increases the total size of the PDE simulation.
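When analyzing the measurements below, a common choice of metrics (one possibility, not mandated here) is speedup and parallel efficiency. With T(p) denoting the measured time to solution on p threads:

S(p) = T(1) / T(p),   E_strong(p) = S(p) / p,   E_weak(p) = T(1) / T(p),

where, for the weak-scaling efficiency, T(p) is measured with the problem size increased in proportion to p, so that ideal scaling corresponds to E_weak(p) = 1.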
Replace welcome message in main.cpp [2 Points]
Replace the welcome message in main.cpp with a message that informs the user about the following points:
• That this is the OpenMP version.
• The number of threads it is using.
For example:
[user@login]$ salloc
[user@icsnodeXX]$ export OMP_NUM_THREADS=8
[user@icsnodeXX]$ ./main 128 100 0.01
========================================================================
                        Welcome to mini-stencil!
version   :: C++ OpenMP
threads   :: 8
...
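One way to obtain the thread count for this message is the OpenMP runtime API. The sketch below is only an illustration of how such a welcome message could be printed; the function name print_welcome and the exact format are assumptions, not the mini-app's code.

// Sketch: printing the OpenMP thread count in the welcome message.
// omp_get_max_threads() reports how many threads a parallel region may use
// (it honours OMP_NUM_THREADS).
#include <cstdio>
#include <omp.h>

void print_welcome(int nx, int ny, double dx, int nt, double t_end)
{
    std::printf("========================================================================\n");
    std::printf("                        Welcome to mini-stencil!\n");
    std::printf("version   :: C++ OpenMP\n");
    std::printf("threads   :: %d\n", omp_get_max_threads());
    std::printf("mesh      :: %d * %d  dx = %g\n", nx, ny, dx);
    std::printf("time      :: %d time steps from 0 .. %g\n", nt, t_end);
    std::printf("========================================================================\n");
}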
Linear algebra kernel [15 Points]
• Open linalg.cpp and add OpenMP directives to all functions hpcXXXX(), except for hpccg().
To maintain the correctness of your results, recompile frequently and run with multiple threads to check that you are still getting the right answer. Once you are finished with this file, you can already check whether your changes improved performance, e.g. by comparing run times for the grid sizes 128×128 and 256×256. A sketch of one such kernel is given below.
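The sketch below shows the typical OpenMP pattern for a reduction kernel such as the dot product; the name hpc_dot_sketch and its signature are assumptions, so follow the actual declarations in linalg.cpp.

// Sketch of an OpenMP-parallel dot product along the lines of hpcdot().
// The reduction clause gives each thread a private partial sum and combines
// the partial sums at the end of the parallel loop.
double hpc_dot_sketch(const double* x, const double* y, int n)
{
    double result = 0.0;
    #pragma omp parallel for reduction(+ : result)
    for (int i = 0; i < n; ++i)
        result += x[i] * y[i];
    return result;
}

Note that the parallel sum may differ from the serial one in the last few bits, because the order of the floating-point additions changes with the number of threads.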
The diffusion stencil [10 Points]
• The final step is to parallelize the stencil operator in operators.cpp.
The nested for loops over the inner grid points are an obvious target; a possible starting point is sketched below. What role do the boundary loops play?
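As an illustration, the directive below parallelizes the loop nest from the earlier residual sketch (the names and bounds are the same assumptions as before, not those of operators.cpp). collapse(2) merges the two loops into a single iteration space so that all n*n inner points are distributed across the threads; since every iteration writes a distinct F(i,j), no further synchronization is needed.

// Sketch: OpenMP parallelization of the interior residual loop.
#pragma omp parallel for collapse(2)
for (int j = 1; j <= n; ++j) {
    for (int i = 1; i <= n; ++i) {
        F(i, j) = -(4.0 + alpha) * U(i, j)
                + U(i - 1, j) + U(i + 1, j)
                + U(i, j - 1) + U(i, j + 1)
                + beta * U(i, j) * (1.0 - U(i, j))
                + alpha * U_old(i, j);
    }
}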
2.1. Strong scaling [10+3 Points]
How does your solution scale for different resolutions? Plot the time to solution for the grid sizes below for 1-24 threads (depending on the compute node), and then analyze and interpret your results.
• 64×64
• 128×128
• 256×256
• 512×512
• 1024×1024
Additionally, argue whether or not a threaded OpenMP PDE solver can be implemented that produces bitwise-identical results, free of any parallel side effects.
2.2. Weak scaling [10 Points]
Produce a weak scaling plot by running the PDE code with different numbers of threads and a fixed constant grid size per thread. Analyze and interpret your results.
When performing this weak-scaling study, we are asking a complementary question to the one asked in a strong-scaling study. Instead of keeping the problem size fixed, we increase the problem size relative to the number of cores. For example, we might initially run a PDE simulation on a grid of 64×64 using 4 cores. Later, we might want to compare 128×128 using 16 cores. Will the time remain constant, since the work per core has remained the same, or will the time increase due, for example, to increased communication overhead associated with the larger problem size, the larger number of cores, or a higher number of nonlinear iterations? A weak-scaling study is one method of documenting this behavior for your application, allowing you to estimate the amount of computing time you need.
3. Task: Quality of the Report [15 Points]
Each project will have 100 points (out of which 15 points will be given to the general quality of the written report).
Bonus Question [5-10 Points]
• Can you make your implementation even faster by using SIMD instructions?
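As a starting point, OpenMP's SIMD directives ask the compiler to vectorize a loop explicitly. The sketch below applies the combined parallel-for-SIMD construct to a linear-combination kernel; the name hpc_lcomb_sketch and its signature are illustrative assumptions, not those of linalg.cpp, and whether this helps depends on how well the compiler already auto-vectorizes at high optimization levels.

// Sketch: thread-parallel and SIMD-vectorized BLAS-1 style kernel.
// The combined construct distributes chunks of the loop across threads and
// vectorizes each chunk with SIMD instructions.
void hpc_lcomb_sketch(double* z, double alpha, const double* x,
                      double beta, const double* y, int n)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        z[i] = alpha * x[i] + beta * y[i];
}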
Additional notes and submission details
We only accept submissions using our LaTeX template and C/C++ code.
Submit the source code files (together with your Makefile) in an archive file (tar, zip, etc.), and summarize your results and observations for all exercises in an extended LaTeX report. Use the LaTeX template provided on the webpage and upload the report as a PDF to iCorsi.
• Your submission should be a single archive (zip or gzipped tar), named project_number_lastname_firstname.zip or project_number_lastname_firstname.tgz. It should contain:
– all the source code of your OpenMP solution,
– your write-up with your name, project_number_lastname_firstname.pdf.
• Submit your .zip/.tgz through iCorsi.