EECS 485 Project 5: Wikipedia Search Engine

Build a scalable search engine that is similar to a commercial search engine. The search engine in this assignment has several features:

Indexing implemented with MapReduce so it can scale to very large corpus sizes

Information retrieval based on both tf-idf and PageRank scores

A new search engine interface with two special features: user-driven scoring and summarization.

The learning goals of this project are information retrieval concepts like PageRank and tf-idf, parallel data processing with MapReduce, and writing an end-to-end search engine.

Setup


Group registration

Register your group on the Autograder.

AWS account and instance
You will use Amazon Web Services (AWS) to deploy your project. AWS account setup may take up to 24 hours, so get started now. Create an account, launch and configure the instance. Don’t deploy yet. AWS Tutorial.

Project folder
Create a folder for this project (instructions). Your folder location might be different.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

Version control
Set up version control using the Version control tutorial.

Be sure to check out the Version control for a team tutorial.

After you’re done, you should have a local repository with a “clean” status and your local repository should be connected to a remote GitLab repository.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ git status

On branch master

Your branch is up-to-date with 'origin/master'.

 nothing to commit, working tree clean

$ git remote -v

origin	https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (fetch)
origin	https://gitlab.eecs.umich.edu/awdeorio/p5-search-engine.git (push)

You should have a .gitignore file (instructions).

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ head .gitignore

# This is a sample .gitignore file that's useful for EECS 485 projects.

...

Python virtual environment
Create a Python virtual environment using the Python Virtual Environment Tutorial.

Check that you have a Python virtual environment, and that it’s activated (remember source env/bin/activate ).

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ ls -d env
env

$ echo $VIRTUAL_ENV

/Users/awdeorio/src/eecs485/p5-search-engine/env

Starter files
Download and unpack the starter files.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ wget https://eecs485staff.github.io/p5-search-engine/starter_files.tar.gz
$ tar -xvzf starter_files.tar.gz

Move the starter files to your project directory and remove the original starter_files/ directory and tarball.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ mv starter_files/* .

$ rm -rf starter_files starter_files.tar.gz

This project involves very few starter files. You have learned everything you need to build this project from (almost) scratch. You are responsible for converting the starter files structure into the final structure. Each part of the spec will walk you through what we expect in terms of structure.

bin/ : As in previous projects, this is where convenience shell scripts will be located. You will write search, index, and indexdb.

hadoop/ : MapReduce programs live here. See Inverted Index with MapReduce and the following Hadoop setup and example instructions for more details.

hadoop/word_count/ : Sample MapReduce program, with input.

hadoop/inverted_index/ : Build a pipeline of MapReduce programs in this directory. You will write several map*.py and reduce*.py files, as well as a pipeline.sh script.

index/index/ : Python code for the index server. See Index server for more details.

search/search/ : Search interface app. See Search interface for more details.

The index server and the search server are separate Flask apps. The search server will serve a bundle.js built from your React source code. The index server will be styled after the Project 3 REST API. As in previous projects, both will be Python packages, each with its own setup.py.

At the end of this project your directory structure should look something like this:

$ tree --matchdirs -I 'env|__pycache__'

.

├── bin

│   ├── index

│   ├── indexdb

│   └── search

├── hadoop

│   ├── hadoop-streaming-2.7.2.jar

│   ├── inverted_index

│   │   ├── input.txt

│   │   ├── input_split.py

│   │   ├── map0.py

│   │   ├── map1.py

│   │   ├── ...

│   │   ├── output_sample.txt

│   │   ├── pipeline.sh

│   │   ├── reduce0.py

│   │   ├── reduce1.py

│   │   ├── ...

│   │   └── stopwords.txt

│   └── word_count

│       ├── input

│       │   ├── file01

│       │   └── file02

│       ├── map.py

│       └── reduce.py

├── index

│   ├── index

│   │   ├── __init__.py

│   │   ├── api

│   │   │   └── *.py

│   │   ├── inverted_index.txt

│   │   ├── pagerank.out

│   │   └── stopwords.txt

│   └── setup.py

├── search

│   ├── node_modules

│   ├── package-lock.json

│   ├── package.json

│   ├── search

│   │   ├── __init__.py

│   │   ├── api

│   │   │   └── *.py

│   │   ├── config.py
│   │   ├── js

│   │   │   └── *.jsx

│   │   ├── sql

│   │   │   └── wikipedia.sql

│   │   ├── static

│   │   │   └── js

│   │   │       └── bundle.js

│   │   ├── templates

│   │   │   └── *.html

│   │   ├── var

│   │   │   └── wikipedia.sqlite3

│   │   └── views

│   │       └── *.py

│   ├── setup.py

│   └── webpack.config.js

└── tests

Install Hadoop
Hadoop does not work on all machines. For this project, we will use a lightweight Python implementation of Hadoop. The source code is in tests/utils/hadoop.py .

$ pwd # Make sure you are in the root directory

/Users/awdeorio/src/eecs485/p5-search-engine

$ echo $VIRTUAL_ENV # Check that you have a virtual environment

/Users/awdeorio/src/eecs485/p5-search-engine/env

$ pip install sh  # Install a required third-party sh module
Collecting sh

...

$ pushd $VIRTUAL_ENV/bin

$ ln -sf ../../tests/utils/hadoop.py hadoop # Link to provided hadoop implementation

$ popd

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ which hadoop

/Users/awdeorio/src/eecs485/p5-search-engine/env/bin/hadoop 

$ hadoop -h  # hadoop command usage
usage: hadoop [-h] -D PROPERTIES -input INPUT -output OUTPUT -mapper MAPPER
              -reducer REDUCER

Lightweight Hadoop work-alike.

optional arguments:
  -h, --help        show this help message and exit

required arguments:
  -D PROPERTIES
  -input INPUT
  -output OUTPUT
  -mapper MAPPER
  -reducer REDUCER

If you want to play around with the real Hadoop, Appendix B lists the installation steps. However, for this project we will be using our Python implementation of Hadoop. You do not need to use the real Hadoop for this project.

End-to-End Testing

See the End-to-End Testing Tutorial in Project 3 to install Google Chrome and ChromeDriver.

Install script
Installing the whole toolchain requires a lot of steps! Follow the Project 3 Tutorial and write a bash script bin/install to install your app. However, you will need to make some small changes compared to your previous install script.

This project has a different folder structure, so the server and front-end installation instructions need to change. You will also have two back-end servers running for this project, so you will need to install two different Python packages.

# Install back end
pip install -e index    # index server
pip install -e search   # search server

# Install front end
pushd search
npm install .
popd

If running the install script on CAEN, pip install might throw PermissionDenied errors when updating packages. To allow installation and updates of packages using pip, add the following lines before the first pip install command.

# Tell pip to write to a different tmp directory.
mkdir -p tmp
export TMPDIR=tmp

You will also need to install the Python implementation of Hadoop. Add this to the end of your install script.

# Install the hadoop implementation
pushd $VIRTUAL_ENV/bin
ln -sf ../../tests/utils/hadoop.py hadoop
popd

“Hello World” with Hadoop
Next, we will run a sample MapReduce word count program with Hadoop. This example will run on both real Hadoop and the provided lightweight Python implementation.

The example MapReduce program lives in hadoop/word_count/ . Change directory.

$ cd hadoop/word_count/

$ ls

input  map.py  reduce.py

The inputs live in hadoop/word_count/input/ .

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count

$ tree

.

├── input

│   ├── file01

│   └── file02

├── map.py

└── reduce.py

$ cat input/file*
hadoop map reduce file map
map streaming file reduce
map reduce is cool
hadoop file system
google file system
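Before running the job, it helps to see what a streaming word count looks like. The provided map.py and reduce.py may differ in detail; a minimal sketch of the two executables is:

#!/usr/bin/env python3
"""map.py sketch: emit one "word\t1" pair for every word on stdin."""
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
"""reduce.py sketch: sum the counts of every word this reducer receives."""
import collections
import sys

counts = collections.defaultdict(int)
for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    counts[word] += int(count)

for word, total in counts.items():
    print(f"{word} {total}")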

Run the Hadoop mapreduce job.

jar hadoop/hadoop-streaming-2.7.2.jar is required by the real Hadoop. The provided implementation ignores this argument.

-D mapreduce.job.maps=N number of mappers

-D mapreduce.job.reduces=N number of reducers

-input DIRECTORY input directory

-output DIRECTORY output directory

-mapper FILE mapper executable

-reducer FILE reducer executable

$ hadoop \
  jar ../hadoop-streaming-2.7.2.jar \
  -D mapreduce.job.maps=2 \
  -D mapreduce.job.reduces=2 \
  -input input \
  -output output \
  -mapper ./map.py \
  -reducer ./reduce.py
Starting map stage
+ ./map.py < output/hadooptmp/mapper-input/part-00000 > output/hadooptmp/mapper-output/part-00000
+ ./map.py < output/hadooptmp/mapper-input/part-00001 > output/hadooptmp/mapper-output/part-00001
Starting group stage
+ cat output/hadooptmp/mapper-output/* | sort > output/hadooptmp/grouper-output/sorted.out
Starting reduce stage
+ ./reduce.py < output/hadooptmp/grouper-output/part-00000 > output/hadooptmp/reducer-output/part-00000
+ ./reduce.py < output/hadooptmp/grouper-output/part-00001 > output/hadooptmp/reducer-output/part-00001
Output directory: output



You will see an output directory created. This is where the output of the MapReduce job lives. You are interested in all part-XXXXX files.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/word_count

$ ls output/
hadooptmp  part-00000  part-00001

$ cat output/part-*
cool 1
google 1
is 1
reduce 3
system 2
file 4

hadoop 2

map 4

streaming 1

Inverted Index with MapReduce


You will build and use a MapReduce-based indexer to process the large dataset for this project. You will use Hadoop's command line streaming interface, which lets you write your indexer in the language of your choice instead of Java (the language Hadoop is written in). However, you are limited to Python 3 so that the course staff can provide better support to students. In addition, all of your mappers must output both a key and a value (not just a key). Moreover, keys and values BOTH must be non-empty. If you do not need a key or a value, common practice is to emit a "1".

There is one key difference between the MapReduce discussed in class and the Hadoop streaming interface implementation: In the Java Interface (and in lecture) one instance of the reduce function was called for each intermediate key. In the streaming interface, one instance of the reduce function may receive multiple keys. However, each reduce function will receive all the values for any key it receives as input.
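One way to handle this in Python is to group the stdin lines by key before reducing. The skeleton below is only a sketch (the helper names are not part of the starter files); the group stage sorts by key, so lines sharing a key arrive adjacent to each other:

#!/usr/bin/env python3
"""Sketch of a streaming reducer that may receive several keys."""
import itertools
import sys

def keyfunc(line):
    """Return the key from a tab-separated key-value line."""
    return line.partition("\t")[0]

def reduce_one_group(key, group):
    """Process all lines sharing one key.  Placeholder logic: count the values."""
    values = [line.partition("\t")[2].strip() for line in group]
    print(f"{key}\t{len(values)}")

# Input is sorted by key, so groupby sees each key exactly once.
for key, group in itertools.groupby(sys.stdin, key=keyfunc):
    reduce_one_group(key, group)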

You will not actually run your program on hundreds of nodes; it is possible to run a MapReduce program on just one machine: your local one. However, a good MapReduce program that can run on a single node will run fine on clusters of any size. In principle, we could take your MapReduce program and use it to build an index on 100B web pages.

For this project you will create an inverted index for the documents in hadoop/inverted_index/input.txt through a series of MapReduce jobs. This inverted index will contain the idf, term frequency, and document normalization factor as specified in the information retrieval lecture slides. The format of your inverted index must follow this format, with each data element separated by a space:

<term> <idf> <doc_id_x> <occurrences in doc_id_x> <doc_id_x normalization factor before sqrt> ...



A sample of the correctly formatted output for the inverted index can be found in hadoop/inverted_index/output_sample.txt. You can also use this sample output to check the accuracy of your inverted index.

For your reference, here are the definitions of term frequency and idf:

tf_ik = number of occurrences of term t_k in document d_i

idf_k = log10(N / n_k), where N is the total number of documents and n_k is the number of documents that contain term t_k

w_ik = tf_ik * idf_k

And here is the definition for the normalization factor for a document (before the square root is applied):

normalization factor of d_i = sum over all terms t_k in d_i of (tf_ik * idf_k)^2

Splitter script
Each document has three properties: doc_id, doc_title, and doc_body. Your mapper code will receive input via standard input, line by line. The input is newline-separated and each document is represented by 3 lines: the 1st is the doc_id, the 2nd is the doc_title, and the 3rd is the doc_body. This makes it easy to read the input in your mapper function.

Input for Hadoop programs is an input directory, not a file. As you will notice, our input is in one large file called hadoop/inverted_index/input.txt, but your code runs with multiple mappers, and each mapper gets one input file. Consequently, you must write a custom Python script to break the large input.txt file, which contains over 3,000 documents (3 lines each), into smaller files and place these files in the hadoop/inverted_index/input directory. You can name your smaller input files however you like, as long as they are under the input/ directory. Do not split the input file into one file per document; this will make your MapReduce pipeline run very slowly! We do not require a specific number of smaller input files, but ~30 is fine.
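Here is one possible sketch of such a splitter; the chunk size and output filenames are arbitrary choices, not requirements:

#!/usr/bin/env python3
"""input_split.py sketch: split input.txt into smaller files under input/."""
import pathlib

DOCS_PER_FILE = 120  # roughly 30 output files for ~3,300 documents

def main():
    """Write consecutive chunks of whole documents (3 lines each) to input/."""
    lines = pathlib.Path("input.txt").read_text().splitlines(keepends=True)
    outdir = pathlib.Path("input")
    outdir.mkdir(exist_ok=True)
    chunk = 3 * DOCS_PER_FILE  # never split a document across files
    for start in range(0, len(lines), chunk):
        part = outdir / "part-{:05d}".format(start // chunk)
        part.write_text("".join(lines[start:start + chunk]))
    print("Output directory: input/")

if __name__ == "__main__":
    main()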

Example

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index

$ ./input_split.py 

Output directory: input/

$ ls -d input.txt input
input/  input.txt

At this point, you should be able to pass test_pipeline03_split_file in test_pipeline_public.py .

$ pytest -v ./tests/test_pipeline_public.py::TestPipelinePublic::test_pipeline03_split_file



Hadoop pipeline script


This script will execute several MapReduce jobs. The first one will be responsible for counting the number of documents in the input data. Name the mapper and reducer for this job map0.py and reduce0.py, respectively. The next N jobs will be a "pipeline", meaning that you will chain multiple jobs together such that the input of a job is the output of the previous job.

We have provided hadoop/inverted_index/pipeline.sh , which has helpful comments for getting started.

Job 0
The first MapReduce job that you will create counts the total number of documents. This should be run with as many mapper workers as input files and only ONE reducer worker. The mapper and reducer executables should be named map0.py and reduce0.py, respectively. The reducer should save the total number of documents in a file called total_document_count.txt. The only data in this file will be an integer representing the total document count. The total_document_count.txt file should be created in whichever directory the pipeline is executed from.
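A sketch of what this job could look like follows; the exact counting strategy is up to you, and these file names match the required map0.py/reduce0.py convention:

#!/usr/bin/env python3
"""map0.py sketch: emit one count per document (each document is 3 input lines)."""
import sys

# Works because the splitter keeps whole documents together in each input file.
for i, _ in enumerate(sys.stdin):
    if i % 3 == 0:          # first line of each 3-line document
        print("documents\t1")

#!/usr/bin/env python3
"""reduce0.py sketch: sum the document counts and save the total."""
import sys

total = sum(int(line.partition("\t")[2]) for line in sys.stdin)
with open("total_document_count.txt", "w") as outfile:
    outfile.write(f"{total}\n")
print(f"documents\t{total}")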

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index

$ ./pipeline.sh

...

Output directory: output

$ cat total_document_count.txt

3268

MapReduce pipeline
You will be going from large datasets to an inverted index, which involves calculating quite a few values for each word. As a result, you will need to write several MapReduce jobs and run them in a pipeline to be able to generate the inverted index. The first MapReduce job in this pipeline will get input from input/ and write its output to output/ . The starter file pipeline.sh shows how to pipe the output from one MapReduce job into the next one.

To test your MapReduce program, we recommend that you make a new test file, with only 10-20 of the documents from the original large file. Once you know that your program works with a single input file, try breaking the input into more files, and using more mappers.

Each of your MapReduce jobs in this pipeline should have mappers and reducers equal to the number of your input files. However, your code should still work when the number of input files, and correspondingly the number of mappers, is changed. You may use a maximum of 9 MapReduce jobs in this pipeline (but the inverted index can be produced in fewer). The first job in the pipeline (the document counter) must have mapper and reducer executables named map0.py and reduce0.py respectively, the second job map1.py and reduce1.py, etc.

The format of your inverted index must follow the format, with each data element separated by a space: word - idf - doc_id - number of occurrences in doc_id - doc_id's normalization factor (before sqrt)

Sample of formatted output can be found in hadoop/inverted_index/output_sample.txt . The order of words in the inverted index does not matter. If a word appears in more than one doc it does not matter which doc_id comes first in the inverted index. Note that the log for idf is computed with base 10 and that some of your decimals may be slightly off due to rounding errors. In general, you should be within 5% of these sample output decimals.
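For example, in the three-document sample below, a term that appears in exactly one document has idf = log10(3/1) ≈ 0.47712, which is the value repeated throughout the sample output.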

When building the inverted index file, you should use a list of predefined stopwords to remove words that are so common that they do not add anything to search quality (“and”, “the”, etc). We have given you a list of stopwords to use in hadoop/inverted_index/stopwords.txt .

When creating your index you should treat capital and lowercase letters as the same (case insensitive). You should also only include alphanumeric words in your index and ignore any other symbols. If a word contains a symbol, simply remove it from the string. Do this with the following code snippet:

import re

word = re.sub(r'[^a-zA-Z0-9]+', '', word)

You should remove non-alphanumeric characters before you remove stopwords. Your inverted index should include both doc_title and doc_body for each document.
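Putting those rules together, a helper along these lines can be reused by both your mappers and your index server; this is a sketch and the function name is arbitrary:

import re

def clean_terms(text, stopwords):
    """Lowercase, strip non-alphanumeric characters, then drop stopwords."""
    terms = []
    for word in text.split():
        word = re.sub(r"[^a-zA-Z0-9]+", "", word).lower()  # remove symbols first
        if word and word not in stopwords:                  # then drop stopwords
            terms.append(word)
    return terms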

To construct the inverted index file used by the index server, concatenate all the files in the output directory from the final MapReduce job, and put them into a new file. Make sure to copy any updated inverted index file to your index Flask app (index/index). This can either be done at the end of pipeline.sh or run manually after the output files are created.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine/hadoop/inverted_index

$ cat output/* > inverted_index.txt

MapReduce specifications
There are a few things you must ensure to get your mappers and reducers playing nicely with both hadoop and our autograder. Here is the list:

Emit both keys and values from all mappers.

Do not emit empty keys or values.

Split key-value pairs using a tab, i.e., "key\tvalue".

Please have a newline at the end of each of your mapper and reducer outputs at every stage.

Your MapReduce output must emit more than a single line.

Sample input.

1
The Document: A
This document is about Mike Bostock. He made d3.js and he's really cool
2
The Document: B
Another flaw in the human character is that everybody wants to build and nobody wants to do maintenance - Kurt Vonnegut
3
Document C:
Originality is the fine art of remembering what you hear but forgetting where you heard it - Laurence Peter



Sample output.

character 0.47712125471966244 2 1 1.593512841936855
maintenance 0.47712125471966244 2 1 1.593512841936855
mike 0.47712125471966244 1 1 1.138223458526325
kurt 0.47712125471966244 2 1 1.593512841936855
peter 0.47712125471966244 3 1 2.048802225347385
flaw 0.47712125471966244 2 1 1.593512841936855
heard 0.47712125471966244 3 1 2.048802225347385
cool 0.47712125471966244 1 1 1.138223458526325
remembering 0.47712125471966244 3 1 2.048802225347385
laurence 0.47712125471966244 3 1 2.048802225347385
d3js 0.47712125471966244 1 1 1.138223458526325
made 0.47712125471966244 1 1 1.138223458526325
build 0.47712125471966244 2 1 1.593512841936855
document 0.0 2 1 1.593512841936855 3 1 2.048802225347385 1 2 1.138223458526325
originality 0.47712125471966244 3 1 2.048802225347385
bostock 0.47712125471966244 1 1 1.138223458526325
forgetting 0.47712125471966244 3 1 2.048802225347385
hear 0.47712125471966244 3 1 2.048802225347385
art 0.47712125471966244 3 1 2.048802225347385
human 0.47712125471966244 2 1 1.593512841936855
fine 0.47712125471966244 3 1 2.048802225347385
vonnegut 0.47712125471966244 2 1 1.593512841936855

Common problems
Your map and reduce programs should use relative paths when opening an input file. For example, to access stopwords.txt :

with open("stopwords.txt", "r") as stopwords:     for line in stopwords:

        # Do something with line

Testing
Once you have implemented your MapReduce pipeline to create the inverted index, you should be able to pass all tests in test_doc_counts_public.py and test_pipeline_public.py .

$ pytest -v tests/test_doc_counts_public.py tests/test_pipeline_public.py

Index server


The index server is a separate service from the search interface that handles search queries and returns a list of relevant results. Your index server is a RESTful API that returns search results in JSON format that the search server processes to display results to the user.

Directory structure
In this project, since you have two servers, the index server and the search server, you will need separate directories for each server application. Your index server code goes inside the directory index/index and its setup.py goes inside the directory index. This allows you to use Python packages for two different servers in the same project root directory. Note that setup.py always lives one directory above the directory containing your application's code. Below is an example of what your final index directory structure should look like.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ tree index
index/

├── index

│   ├── __init__.py

│   ├── api

│   │   ├── __init__.py

│   │   └── *.py

│   ├── inverted_index.txt

│   ├── pagerank.out

│   └── stopwords.txt

Your index and index/api directories need to be Python packages.

PageRank integration
In lecture, you learned about PageRank, an algorithm used to determine the relative importance of a website based on the sites that link to that site, and the links to other sites that appear on that website. Sites that are more important will have a higher PageRank score, so they should be returned closer to the top of the search results.

In this project, you are given a set of pages and their PageRank scores in pagerank.out , so you do not need to implement PageRank. However, it is still important to understand how the PageRank algorithm works!

Your search engine will rank documents based on both the query-dependent tf-idf score and the query-independent PageRank score. The formula for the score of a query q on a single document d should be:

Score(q, d, w) = w * PageRank(d) + (1 - w) * tfIdf(q, d)

where w is a decimal between 0 and 1, inclusive. The value w will be a URL parameter. The w parameter represents how much weight we want to give the document's PageRank versus its cosine similarity with the query. This is a completely different variable from the "wik" given by the formula in Inverted Index with MapReduce. The final score contains two parts, one from PageRank and the other from a tf-idf based cosine similarity score. PageRank(d) is the PageRank score of document d, and tfIdf(q, d) is the cosine similarity between the query and the document tf-idf weight vectors. Each weight in a tf-idf weight vector is calculated via the formula in Inverted Index with MapReduce for "wik". Treat query q as a simple AND, non-phrase query (that is, assume the words in a query do not have to appear consecutively and in sequence).
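To make the calculation concrete, here is a sketch of how the two parts could be computed in Python. The argument names and data layout are illustrative only, and Appendix A remains the authoritative walkthrough:

import math

def tfidf_similarity(query_terms, doc_tf, idf, doc_norm_before_sqrt):
    """Cosine similarity between the query and document tf-idf vectors (sketch).

    query_terms: cleaned query words (may repeat)
    doc_tf: {term: occurrences of the term in this document}, from the inverted index
    idf: {term: idf}, from the inverted index
    doc_norm_before_sqrt: the document's normalization factor from the index
    """
    query_tf = {}
    for term in query_terms:
        query_tf[term] = query_tf.get(term, 0) + 1
    # Dot product of query weights (tf * idf) and document weights (tf * idf).
    dot = sum(tf * idf[t] * doc_tf[t] * idf[t] for t, tf in query_tf.items())
    query_norm = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in query_tf.items()))
    doc_norm = math.sqrt(doc_norm_before_sqrt)
    return dot / (query_norm * doc_norm)

def weighted_score(w, pagerank_score, tfidf_score):
    """w * PageRank(d) + (1 - w) * tfIdf(q, d)."""
    return w * pagerank_score + (1 - w) * tfidf_score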

Integrating PageRank scores will require creating a second index, which maps each doc_id to its corresponding precomputed PageRank score, which is given to you in pagerank.out . You can do this where you read the inverted index. This index should be accessed at query time by your index server.
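For example, you might load pagerank.out once at startup into a dictionary keyed by doc_id. This sketch assumes one doc_id and score per line separated by a comma, so check the actual file for its delimiter:

import pathlib

def load_pagerank(path):
    """Return {doc_id: pagerank_score} read from pagerank.out (sketch)."""
    pagerank = {}
    for line in pathlib.Path(path).read_text().splitlines():
        doc_id, _, score = line.partition(",")   # assumed comma-separated format
        pagerank[int(doc_id)] = float(score)
    return pagerank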

If you are still confused by these calculations, please refer to Appendix A.

Returning hits
When the Index server is run, it will load the inverted index file, pagerank file, and stopwords file into memory and wait for queries. It should only load the inverted index and pagerank into memory once, when the index server is first started.

Note: The t2.micro instance used to create the AWS instance has a small memory capacity, so please make sure the data structure used to store the inverted index is memory efficient where possible. You can check how much memory your server is using by starting the index server and checking its memory footprint with top or other macOS/Windows memory analysis tools.

Every time you send an appropriate request to the index server, it will process the user’s query and use the inverted index loaded into memory to return a list of all of the “hit” doc_ids . A hit will be a document that is similar to the user’s query. When finding similar documents, only include documents that have every word from the query in the document. The index server should not load the inverted index or pagerank into memory every time it receives a query!
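One simple way to find the hits is to intersect the posting sets of the query terms. The sketch below assumes the inverted index was loaded into a nested dictionary, which is just one reasonable in-memory layout:

def find_hits(query_terms, inverted_index):
    """Return the doc_ids that contain every query term (an AND query) -- a sketch.

    inverted_index: {term: {doc_id: (occurrences, norm_before_sqrt)}} loaded once
    at server startup.
    """
    if not query_terms:
        return set()
    hits = None
    for term in query_terms:
        docs = set(inverted_index.get(term, {}))
        hits = docs if hits is None else hits & docs
        if not hits:          # some term matches nothing, so no document can match
            break
    return hits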

Index Server Endpoint
Route: http://{host}:{port}/?w=<w>&q=<query>

Your index server should only have one endpoint, which receives the pagerank weight and query as URL parameters w and q, and returns the search results, including the relevance score for each result (calculated according to the previous section). For example, the endpoint ?w=0.3&q=michigan%20wolverine would correspond to a pagerank weight of 0.3 and a query "michigan wolverine".

Your index server should return a JSON object in the following format. A full walkthrough of the example is shown in Appendix A.

{

  "hits": [

    {

      "docid": 868657,

      "score": 0.071872572435374

    }

  ]

}

The documents in the hits array must be sorted in order of relevance score, with the most relevant documents at the beginning, and the least relevant documents at the end. If multiple documents have the same relevance score, sort them in order of doc_id (with the smallest doc_id first).

When you get a user's query, make sure you remove all stopwords. Since we removed them from our inverted index, we need to remove them from the query. Also clean the user's query of any non-alphanumeric characters, as you did with the inverted index.
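Tying these requirements together, the single endpoint could be organized roughly as below. The module layout (index.app), helper names (clean_terms, find_hits, tfidf_similarity), and the in-memory structures (STOPWORDS, INVERTED_INDEX, PAGERANK, DOC_TF, IDF, DOC_NORM) are placeholders for whatever you build, not part of the starter files:

"""index/index/api/views.py (sketch)."""
import flask
import index

@index.app.route("/")
def get_hits():
    """Return {"hits": [...]} for the w and q URL parameters."""
    w = float(flask.request.args["w"])
    query = flask.request.args.get("q", "")
    terms = clean_terms(query, STOPWORDS)      # strip symbols and stopwords
    hits = []
    for doc_id in find_hits(terms, INVERTED_INDEX):
        score = w * PAGERANK[doc_id] + (1 - w) * tfidf_similarity(
            terms, DOC_TF[doc_id], IDF, DOC_NORM[doc_id])
        hits.append({"docid": doc_id, "score": score})
    # Most relevant first; ties broken by smallest doc_id.
    hits.sort(key=lambda hit: (-hit["score"], hit["docid"]))
    return flask.jsonify({"hits": hits})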

Running the index server
Install your index server app by running the command below in your project root directory:

$ pip install -e index

To run the index server, you will use environment variables to put the development server in debug mode, specify the name of your app’s python module and locate an additional config file. (These commands assume you are using the bash shell.)

$ export FLASK_DEBUG=True

$ export FLASK_APP=index

$ export INDEX_SETTINGS=config.py

$ flask run --host 0.0.0.0 --port 8001

*  Serving Flask app "index"

*  Forcing debug mode on

*  Running on http://127.0.0.1:8001/ (Press CTRL+C to quit)

Common problems
Make sure that your index server reads stopwords.txt and pagerank.out from the index/index/ directory.

Pro-tip: use an absolute path built from Python's __file__ variable. The following code produces a stopwords_filename that ends in index/index/stopwords.txt.

"""index/index/api/views.py"""

index_package_dir = os.path.dirname(os.path.dirname(__file__)) stopwords_filename = os.path.join(index_package_dir, "stopwords.txt")

Testing
Once you have implemented your Index Server, you should be able to pass all tests in test_index_server_public.py .

$ pytest -v tests/test_index_server_public.py

Search interface


The third component of the project is a React-driven interface for the search engine. The search interface app will provide a GUI for the user to enter a query and specify a PageRank weight, and will then send a request to your index server. When it receives the search results from the index server, it will display them on the webpage.



Directory structure
As mentioned earlier, since you have two separate servers in this project, the index server and the search server, you will need a separate directory for your search server. Your search interface server code goes inside the directory search/search and its setup.py goes inside the directory search. Your search server should be a Python package. Below is an example of what your final search directory structure should look like.

$ tree search

├── search

│   ├── package.json

│   ├── package-lock.json

│   ├── node_modules

│   │   └── */

│   ├── search

│   │   ├── api

│   │   │   └── *.py

│   │   ├── js

│   │   │   └── *.jsx

│   │   ├── config.py

│   │   ├── sql

│   │   │   └── wikipedia.sql

│   │   ├── static

│   │   │   └── js

│   │   │       └── bundle.js

│   │   ├── templates

│   │   │   └── *.html

│   │   ├── var

│   │   │   └── wikipedia.sqlite3

│   │   └── views

│   │       └── *.py

│   ├── setup.py

│   └── webpack.config.js

Your search , views , and api directories need to be Python packages.

Database
You will need to create a new database for project 5, with a table called Documents as follows:

docid : INT and PRIMARY KEY
title : VARCHAR with max length 100
categories : VARCHAR with max length 5000
image : VARCHAR with max length 200
summary : VARCHAR with max length 5000

The SQL to create this table and load the necessary data into it is provided in search/search/sql/wikipedia.sql inside the starter files.
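On the search-server side, the hit doc_ids returned by the index server can then be looked up in this table. Here is a sketch using Python's sqlite3 module directly; your app may instead use a get_db()-style helper as in previous projects:

import sqlite3

def fetch_documents(db_path, doc_ids):
    """Return (docid, title, summary) rows for the first 10 hits (sketch)."""
    connection = sqlite3.connect(db_path)
    connection.row_factory = sqlite3.Row
    documents = []
    for doc_id in doc_ids[:10]:                 # show at most 10 results
        cur = connection.execute(
            "SELECT docid, title, summary FROM Documents WHERE docid = ?",
            (doc_id,),
        )
        row = cur.fetchone()
        if row is not None:
            documents.append(dict(row))
    connection.close()
    return documents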

GUI
Make a simple search interface for your database of Wikipedia articles which allows users to input a query and a w value, and view a ranked list of relevant docs. This is the main route ( http://{host}:{port}/ ).



Users can enter their query into a text input box (<input type="text"/>), and specify the w value using a slider. The query is executed when a user presses a submit button (<input type="submit"/>). You can assume anyone can use your search engine; you do not need to worry about access rights or sessions like in past projects. Your engine should receive a simple AND, non-phrase query (that is, assume the words in a query do not have to appear consecutively and in sequence), and return the ranked list of docs. Your search results will be titles of Wikipedia documents, displayed exactly as they appear in the database. Search results will show at most 10 documents. Any other content should appear below.



This page must include a slider (<input type="range"/>) that will set the w GET query parameter in the URL, which will be a decimal value between 0 and 1, inclusive. You should set the range slider's step value to 0.01.

When the user clicks the submit button of the search GUI, you should fetch the appropriate titles from the index server. For each title returned, display the title in a p tag with class="doc_title" (<p className="doc_title">), as well as a button to display the summary for that title with value="Show Summary" (<input type="submit" value="Show Summary"/>).

If there are no search results, no action is required.

Upon clicking the "Show Summary" button for a title, the list of results should be replaced with a summary of the document. You may simply place the summary data within a <p> tag with class="doc_summary" (<p className="doc_summary">). In addition to this summary, the page will also have a "Similar Documents" section, which will show at most 10 documents using w=0.15 and the current document's title as the search query. For each related document title, render a p tag with class="doc_title" (<p className="doc_title">).



Running the search server
Install your search server app by running the command below in your project root directory:

$ pip install -e search

To run the search server, you will use environment variables to put the development server in debug mode, specify the name of your app’s python module and locate an additional config file. (These commands assume you are using the bash shell.)

$ export FLASK_DEBUG=True
$ export FLASK_APP=search

$ export SEARCH_SETTINGS=config.py

$ flask run --host 0.0.0.0 --port 8000

*  Serving Flask app "search"

*  Forcing debug mode on

*  Running on http://127.0.0.1:8000/ (Press CTRL+C to quit)

In order to search from the search server, your index server must be up and running, since the search server uses the index server's response to display the search results.

Shell scripts to launch both servers
You will be responsible for writing shell scripts to launch both the index server and the search server. Hint: these will be somewhat similar to the previous project's bin/mapreduce script. The names of these scripts must be bin/index and bin/search.

Example: start search.

$ ./bin/search start
starting search server ...
+ export FLASK_APP=search
+ export SEARCH_SETTINGS=config.py
+ flask run --host 0.0.0.0 --port 8000 &> /dev/null &

Example: stop search.

$ ./bin/search stop
stopping search server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8000'

Example: restart search. Your PIDs may be different.

$ ./bin/search restart
stopping search server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8000'
starting search server ...
+ export FLASK_APP=search
+ export SEARCH_SETTINGS=config.py
+ flask run --host 0.0.0.0 --port 8000 &> /dev/null &

Example: start search when something is already running on port 8000.

$ ./bin/search start
Error: a process is already using port 8000

Example: start index.

$ ./bin/index start
starting index server ...
+ export FLASK_APP=index
+ flask run --host 0.0.0.0 --port 8001 &> /dev/null &

Example: stop index.

$ ./bin/index stop
stopping index server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8001'

Example: restart index.

$ ./bin/index restart
stopping index server ...
+ pkill -f 'flask run --host 0.0.0.0 --port 8001'
starting index server ...
+ export FLASK_APP=index
+ flask run --host 0.0.0.0 --port 8001 &> /dev/null &

Example: start index when something is already running on port 8001.

$ ./bin/index start
Error: a process is already using port 8001

flask run --host 0.0.0.0 --port 8000 &> /dev/null & will start up your Flask server in the background. The &> /dev/null will prevent your Flask app from outputting text to the terminal.

Index database management script
You will also write a database script for the index server.

Example create.

$ ./bin/indexdb create

+ mkdir -p search/search/var/

+ sqlite3 search/search/var/wikipedia.sqlite3 < search/search/sql/wikipedia.sql

Example create when database already exists.

$ ./bin/indexdb create
Error: database already exists

Example destroy.

$ ./bin/indexdb destroy
+ rm -f search/search/var/wikipedia.sqlite3

Example reset.

$ ./bin/indexdb reset

+ rm -f search/search/var/wikipedia.sqlite3

+ mkdir -p search/search/var/

+ sqlite3 search/search/var/wikipedia.sqlite3 < search/search/sql/wikipedia.sql

Deploy to AWS
You should have already created an AWS account and instance (instructions). Resume the AWS Tutorial - Deploy a web app.

Once you have installed your front-end, go to your instance. In the “Description”, click on the name of your security group.



In this case, the security group is “launch-wizard 2”.

On this page, click on “Actions”, then “Edit inbound rules”.




Then, add a “Custom TCP Protocol” for port 8001.



Remember that you need to run two servers, the index server and search server.

$ gunicorn -b localhost:8000 -D search:app

$ gunicorn -b 0.0.0.0:8001 -D index:app

After you have deployed your site, download the search page along with a log. Also download the index server’s response with a log. Do this from your local machine.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ curl -v "<Public DNS (IPv4)>/" > deployed_search.html 2> deployed_search.log

$ curl -v "<Public DNS (IPv4)>:8001/?q=world%20flags&w=0.5" > deployed_index.json 2> deployed_index.log



Verify that the output in deployed_search.log and deployed_index.log doesn’t include errors like “Couldn’t connect to server”.

Testing


Run published unit tests.

$ pwd

/Users/awdeorio/src/eecs485/p5-search-engine

$ pytest -v

Code style
As in previous projects, all your Python code for the Flask servers is expected to be pycodestyle, pydocstyle, and pylint compliant. You don't need to lint the mapper/reducer executables (although it may still be a good idea to do so!).

Also as usual, you may not use any external dependencies aside from what is provided in the setup.py files.
