$25
For both programming questions you need to write multithreaded code, and you may use pthreads, or C++ threads.
Q1. Programming question – calculating 𝛑
Improve the performance of an existing single-threaded calcpi program by converting it to a multi-threaded implementation. Download the starter code, compile it, and run it:
$ git clone https://gitlab.com/cpsc457f21/pi-calc.git
$ cd pi-calc
$ make
$ ./calcpi
Usage: ./calcpi radius n_threads where 0 <= radius <= 100000 and 1 <= n_threads <= 256
The calcpi program estimates the value of π using an algorithm described in https://en.wikipedia.org/wiki/Approximations_of_%CF%80#Summing_a_circle's_area, and it is implemented inside the function count_pi() in file calcpi.cpp.
The included driver (main.cpp) parses the command line arguments, calls count_pi() and prints the results. The driver takes 2 command line arguments: an integer radius and number of threads. For example, to estimate the value of π using radius of 10 and 2 threads, you would run it like this:
$ ./calcpi 10 2
Calculating PI with r=10 and n_threads=2
count: 317 PI: 3.17
The function uint64_t count_pi(int r, int N) takes two parameters – the radius and number of threads, and returns the number of pixels inside the circle or radius 𝑟 centered at (0,0) for every pixel (𝑥, 𝑦) in squre −𝑟 ≤ 𝑥, 𝑦 ≤ 𝑟. The current implementation is single threaded, so it ignores the N argument. Your job is to re-implement the function so that it uses N threads to speed up its execution, such that it runs N times faster with N threads on hardware where N threads can run concurrently. Please note that your assignment will be marked both for correctness and the speedup it achieves.
You need to find a way to parallelize the algorithm without using any synchronization mechanisms, such as mutexes, semaphores, atomic types, etc. You are only allowed to create and join threads.
Write all code into calcpi.cpp and submit this file for grading. Make sure your calcpi.cpp works with the included driver program. Your TAs may use a different driver program during marking, so it is important that your code follows the correct API. Make sure your program runs on linuxlab.cpsc.ucalgary.ca.
Assume 0 ≤ 𝑟 ≤ 100,000 and 1 ≤ 𝑛𝑡ℎ𝑟𝑒𝑎𝑑𝑠 ≤ 256.
Timing on linuxlab
Please note that not all linuxlab machines are the same. Some have 4 core CPUs, some have 6 cores CPUs, some have 3.6GHz CPUs, others have 3.2GHz. You can check which CPU you have by running lscpu command. When you are running your timings, please make sure you do all of them on the same machine. Otherwise, you will get very inconsistent results.
My basic multi-threaded implementation achieves the following timings using r=100000. I expect your solutions to achieve similar results.
CPU
1 thread
2 threads
4 threads
8 threads
16 threads
i7-4770
12.900
6.470
3.410
2.733
2.740
i7-4790
12.571
6.306
3.311
2.653
2.688
i7-7700
11.473
5.857
3.006
1.765
1.801
i7-8700
10.476
5.374
2.793
1.610
1.161
Q2 – Written answer
Time your multi-threaded solution from Q1 with 𝑟 = 50000 using the time command on linuxlab.cpsc.ucalgary.ca. Record the real-time for 1, 2, 3, 4, 6, 8, 12, 16, 24 and 32 threads. Also record the timings of the original single-threaded program.
Make a table with these timings, and a bar graph, both formatted similar to these:
The numbers in the above table and graph are random. Your timings should look different.
When you run your implementation with N threads, you should see N-times speed up compared to the original single threaded program. Do you observe this in your timings for all values of N?
Why do you stop seeing the speed up after some value of N?
Q3. Programming question – detecting primes
Convert a single-threaded program detectPrimes to a multi-threaded implementation. Start by downloading the single-threaded code, then compile it and run it:
$ git clone https://gitlab.com/cpsc457f21/detect-primes.git
$ cd detect-primes
$ make
$ cat example.txt
0 3 19 25
4012009 165 1033
$ ./detectPrimes 5 < example.txt
Using 5 threads.
Identified 3 primes:
3 19 1033
Finished in 0.0000s
$ seq 100000000000000000 100000000000000300 | ./detectPrimes 2
Using 2 threads.
Identified 9 primes:
100000000000000003 100000000000000013 100000000000000019 100000000000000021
100000000000000049 100000000000000081 100000000000000099 100000000000000141
100000000000000181 Finished in 5.6863s
The detectPrimes program reads integers in range [2, 263 − 2] from standard input, and then prints out the ones that are prime numbers. The first invocation example above detects prime numbers 3, 19 and 1033 in a file example.txt. The second invocation uses the program to find all primes in the range [1017, 1017 + 300]. If duplicate primes appear in the input, they will be duplicated in the output.
detectPrimes accepts a single command line argument – a number of threads. This parameter is ignored in the current implementation because it is single threaded. Your job is to improve the execution time of detectPrimes by making it multi-threaded, and your implementation should use the number of threads given on the command line. To do this, you will need to re-implement the function:
std::vector<int64_t>
detect_primes(const std::vector<int64_t> & nums, int n_threads);
which is defined in detectPrimes.cpp. The function takes two parameters: the list of numbers to test, and the number of threads to use. The function is called by the driver (main.cpp) after parsing the standard input and command line. Your implementation should use n_threads number of threads. Ideally, if the original single-threaded program takes time 𝑇 to complete a test, then your multi-threaded implementation should finish that same test in 𝑇/𝑁 time when using 𝑁 threads. For example, if it takes 10 seconds to complete a test for the original singlethreaded program, then it should take your multi-threaded program only 2.5 seconds to complete that same test with 4 threads. To achieve this goal, you will need to design your program so that:
You give each thread the same amount of work;
your multi-threaded implementation does the same amount of work as the single-threaded version; and
the synchronization mechanisms you utilize are efficient.
Your TAs will mark your assignment by running the code against multiple different inputs and using different numbers of threads. To get full marks for this assignment, your program needs to output correct results but also achieve near optimal speedup for the given number of threads and available cores. If your code does not achieve optimal speedup on all inputs, you will lose some marks for those tests.
You may assume that there will be no more than 100,000 numbers in the input, and that all numbers will be in the range [2, 263 − 2]. Some inputs will include many numbers, some inputs will include just few numbers, some numbers will be large, some small, some will be prime numbers, others will be large composite numbers, etc… For some numbers it will take long time to compute their smallest factor, for others it will take very little time. You need to take all these possibilities into consideration. Design your own test inputs and test your code thoroughly.
Write all code into detectPrimes.cpp and submit it for grading. Make sure your detectPrimes.cpp works with the included main.cpp driver. We may use a different driver during marking, so it is important that your code follows the correct API. Make sure your program runs on linuxlab.cpsc.ucalgary.ca.
You may use any of the synchronization mechanisms we covered in lectures, such as semaphores, mutexes, condition variables, spinlocks, atomic variables, and barriers. Don’t forget to make sure your code compiles and runs on linuxlab.cpsc.ucalgary.ca.
Please note that the purpose of this question is NOT to find a more efficient factorization algorithm. You must implement the exact same factorization algorithm as given in the skeleton code, except you need to make it multi-threaded.
Q4 – Written question
Time the original single-threaded detectPrimes.cpp as well as your multi-threaded version on three files: medium.txt, hard.txt and hard2.txt. For each of these files, you will run your solution 6 times, using 1, 2, 3, 4, 8 and 16 threads. You will record your results in 3 tables, one for each file, formatted like this:
medium.txt
# threads
Observed timing
Observed speedup compared to original
Expected speedup
original program
1.0
1.0
1
1.0
2
2.0
3
3.0
4
4.0
8
8.0
16
16.0
The ‘Observed timing’ column will contain the raw timing results of your runs. The ‘Observed speedup’ column will be calculated as a ratio of your raw timing with respect to the timing of the original single-threaded program. Once you have created the tables, explain the results you obtained. Are the timings what you expected them to be? If not, explain why they differ.