Write a parallel program to search a given corpus and return the most relevant search results. You are given a corpus called Aristo Mini Corpus (https://www.kaggle.com/allenai/aristo-mini-corpus).
Aristo Mini Corpus:
The Aristo Mini corpus contains 1,197,377 science-relevant sentences drawn from public data. It provides simple science-relevant text that may be useful to help answer elementary science questions. You will work on only 1,500 sentences, divided across 50 files of 30 lines each.
Input: a given query in the form of a sentence or a question.
Output: search results that contain all the words of the query.
Example:
Search query:
Capital of Egypt
If the corpus has the following sentences:
File1:
There is a capital for each country.
Capital of Egypt is Cairo.
File2:
The Capital of Egypt is Cairo.
You can visit the country you want.
Output should be:
Capital of Egypt is Cairo.
The Capital of Egypt is Cairo.
Pseudo code of search steps applied for each file:
For each Sentence in File:
    Match = true;
    For each word in the query:
        IF word not in Sentence:
            Match = false;
    IF Match is true:
        Store Sentence;
        ResultsFound += 1;
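The search steps above can be sketched in plain Python as follows. This is only an illustration of the matching logic, not a required implementation: the function names tokenize and search_sentences are hypothetical, and the sketch assumes case-insensitive, whole-word matching with punctuation ignored, which the assignment does not specify.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_sentences(sentences, query):
    """Return the sentences that contain every word of the query."""
    query_words = set(tokenize(query))
    matches = []
    for sentence in sentences:
        sentence_words = set(tokenize(sentence))
        # A sentence matches only if no query word is missing from it.
        if query_words <= sentence_words:
            matches.append(sentence)
    return matches


# The two example files from the assignment text.
file1 = ["There is a capital for each country.", "Capital of Egypt is Cairo."]
file2 = ["The Capital of Egypt is Cairo.", "You can visit the country you want."]

results = search_sentences(file1 + file2, "Capital of Egypt")
print(len(results))
for sentence in results:
    print(sentence)
```

Tokenizing both the query and the sentence means that a word followed by a comma or full stop still matches, which the expected output in this assignment appears to rely on.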
Parallel Scenario:
• You will use the Master-Slave paradigm.
• The master will distribute the corpus files to the slaves.
• Each slave will search its assigned part of the corpus.
• Each slave will return the number of search results found and the corresponding relevant sentences.
• The master will collect the number of search results and the sentences and write them to a file.
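When the 50 files do not divide evenly among the processes, the master has to decide who takes the leftover files. The sketch below shows one possible way to compute the split in plain Python, giving one extra file to the first few processes. The function name split_files is hypothetical, and the choice of 4 processes is only an example. The two lists it returns have the same shape as the sendcounts and displs arguments of MPI_Scatterv, although no MPI call is made here.

```python
def split_files(num_files, num_workers):
    """Return (counts, displs): files per worker and each worker's start index."""
    base, remainder = divmod(num_files, num_workers)
    # The first `remainder` workers take one extra file each.
    counts = [base + 1 if rank < remainder else base
              for rank in range(num_workers)]
    # Each worker starts where the previous workers' shares end.
    displs = [sum(counts[:rank]) for rank in range(num_workers)]
    return counts, displs


counts, displs = split_files(50, 4)
print(counts)  # [13, 13, 12, 12]
print(displs)  # [0, 13, 26, 38]
```

With this split, no process receives more than one file above any other, so the remaining workload is spread rather than left to a single process.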
Expected input/output format:
Enter your query: sunlight energy nutrients
Output File:
Search Results Found = 2
Chlorophyll can make food the plant can use from carbon dioxide, water, nutrients, and energy from sunlight.
A process by which a plant produces its food using energy from sunlight, carbon dioxide from the air,and water and nutrients from the soil.
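The output format above can be produced from the per-process results roughly as follows. This is a sketch under assumptions: the function name format_results and the shape of the gathered data (one pair of count and sentence list per process) are hypothetical, and how the results actually reach the master through MPI is not shown.

```python
def format_results(per_worker_results):
    """Build the output file text from each worker's (count, sentences) pair."""
    total = sum(count for count, _ in per_worker_results)
    lines = ["Search Results Found = " + str(total)]
    for _, sentences in per_worker_results:
        lines.extend(sentences)
    return "\n".join(lines) + "\n"


# Made-up results from three processes, the last of which found nothing.
gathered = [
    (1, ["Capital of Egypt is Cairo."]),
    (1, ["The Capital of Egypt is Cairo."]),
    (0, []),
]
text = format_results(gathered)
print(text, end="")
```

The total on the first line is the sum of the per-process counts, which is the kind of value a reduction with a sum operator would deliver to the master.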
Requirements:
1- Study the MPI lab on the scatter and gather methods.
2- You have one week for questions about the assignment and the lab (22 Mar. to 28 Mar.).
3- Use all the functions you have learned so far in the MPI library. (Using Allreduce and Allgather is optional.)
4- Choose your functions carefully: if a value should be sent to all slaves, use MPI_Bcast; if values are to be reduced using a specific operator, use MPI_Reduce; and so on.
5- Calculate the running time of the parallel program.
6- Run your code on the attached test cases to verify that your results are correct.
Grading Criteria:
Master workload distribution across slaves (using suitable MPI functions): 50
Slave work: 60
• Reading files and tokenizing queries.
• Performing the search and sending the results back to the master.
Master collection of results and writing them to a file (number of search results and the results themselves): 50
Handling remaining workload: 30
Running and valid output: 30
Calculating the parallel running time: 10
Total: 230