# Overview
In this project you will practice inheritance, graph search, and web
scraping. You'll hand in a module called `scrape.py`.
Make sure to run the tests before handing in. During development, we
recommend having a `debug.ipynb` notebook for making calls to your module.
## Testing
Run `python3 tester.py` often and work on fixing any issues.
# Group Part (75%)
For this portion of the project, you may collaborate with your group
members in any way (even looking at working code). You may also seek
help from 320 staff (mentors, TAs, instructor). You **may not**
seek or receive help from other 320 students (outside your group) or
from anybody outside the course.
## Part 1: DFS on Matrices (`MatrixSearcher`)
Complete the lab exercises if you haven't already done so: https://github.com/cs320-wisc/s22/blob/main/labs/lab6.md
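If you need a refresher, the structure from the lab looks roughly like this (a sketch only; your lab code is authoritative, and `tester.py` defines exactly what is expected):
```python
# Rough sketch of the lab's GraphSearcher structure (your version may differ).
class GraphSearcher:
    def __init__(self):
        self.visited = set()
        self.order = []

    def go(self, node):
        # subclasses override this to return a node's children
        raise NotImplementedError("must be overridden in a subclass")

    def dfs_search(self, node):
        self.visited = set()   # reset state so a searcher can be reused
        self.order = []
        self.dfs_visit(node)

    def dfs_visit(self, node):
        if node in self.visited:
            return
        self.visited.add(node)
        self.order.append(node)
        for child in self.go(node):
            self.dfs_visit(child)
```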
## Part 2: BFS on Files (`FileSearcher`)
Add a `bfs_search` method to `GraphSearcher`. It should behave the same as
`dfs_search`, but use the BFS algorithm instead of DFS. The
difference will be evident in the `.order` attribute after a search
completes.
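Here is a minimal sketch of one way to write it, assuming the `GraphSearcher` structure sketched above (`visited` and `order` attributes, plus a `go` method that returns a node's children):
```python
from collections import deque

class GraphSearcher:
    # ... __init__, go, and the DFS methods from the lab ...

    def bfs_search(self, node):
        self.visited = set()
        self.order = []
        queue = deque([node])
        self.visited.add(node)
        while queue:
            curr = queue.popleft()   # FIFO queue is what makes this BFS
            self.order.append(curr)
            for child in self.go(curr):
                if child not in self.visited:
                    self.visited.add(child)
                    queue.append(child)
```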
Note that without changing `MatrixSearcher`, it now supports both DFS
and BFS search since it inherits from `GraphSearcher`.
Add another class, `FileSearcher`, which also inherits from
`GraphSearcher`. It should have three methods (besides those
inherited): `__init__`, `go`, and `message`.
The nodes of this graph are files in the `file_nodes` directory. For
example, `1.txt` contains this:
```
M
2.txt,4.txt
```
This means the value for node `1.txt` is "M", and the children of
`1.txt` are `2.txt` and `4.txt`.
All the files will have two lines like this, with a value on the first
line, and a comma-separated list of children on the second line.
The `go` method should read a node file and return a list of children.
For example:
```python
import scrape
f = scrape.FileSearcher()
print(f.go("1.txt"))
```
Expected result: `['2.txt', '4.txt']`. You could test this by pasting
the sample code into a `debug.ipynb` notebook in your `p3` directory.
`go` should also record (in some way of your choosing) the values
(first lines) of the nodes that are visited, in order. The `message`
method should return all the values concatenated together. Take a
look at `bfs_test` in `tester.py` for an example of how this should work.
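Putting those requirements together, here is one possible shape for `FileSearcher` (a sketch under the assumptions above; the attribute name and the use of `os.path.join` are choices, not requirements):
```python
import os

# One possible FileSearcher sketch; assumes GraphSearcher is defined in scrape.py.
class FileSearcher(GraphSearcher):
    def __init__(self):
        super().__init__()
        self.values = []   # node values (first lines), in visit order

    def go(self, node):
        with open(os.path.join("file_nodes", node)) as f:
            value = f.readline().strip()                # line 1: node value
            children = f.readline().strip().split(",")  # line 2: children
        self.values.append(value)
        return children

    def message(self):
        return "".join(self.values)
```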
In general, reading test cases is a great way to see how your classes
are supposed to work. Specifications, like this document you're
reading now, are sometimes ambiguous (sorry!), but test cases tell you
what is expected from your code with complete precision.
It's often useful to copy/paste code snippets from `tester.py` into your
`debug.ipynb` when you're debugging an issue, too.
## Part 3: Web Crawling (`WebSearcher`)
Don't start this part until we've covered Selenium in class, including
how to run it in "headless" mode.
For this part of the project, you'll need to install the Chromium
browser and a few Python packages on your VM:
```
pip3 install selenium==4.1.2 Flask lxml html5lib
sudo apt -y install chromium-browser
```
When the installation is done, run both of the following commands and
verify that they print the same version, and that it is 98+ (like
"98.X.X.X", though it may be a bigger number if there are browser
updates before P3 is complete):
```
chromium-browser --version
chromium.chromedriver --version
```
**Note**: launching many web browsers via code can quickly eat up
all the memory on your VM. You can run the `htop` command to see
how much memory you have (hit "q" to quit when done). If you're low
on memory (you might notice your VM being sluggish), you can run
`pkill -f -9 chromium` to shut down all browser instances hanging
around in the background.
### Launching the Website
You'll be scraping a website implemented as a web application built
using the Flask framework (you don't need to know Flask for this
project, though you'll learn it soon and get a chance to build your
own website in the next project). In an SSH session, run the
following to launch it:
```
python3 application.py
```
Then, open `http://<YOUR-VM-IP>:5000` in your web browser. **Do not**
use the IP address that is printed to the console in the SSH session
(it won't work). The page should look like this:
<img src="webpage.png" width=600>
If you click "TRAVEL HISTORY", you'll enter a graph of pages, each
with a table fragment. Your job is to search the graph (using the
search methods you wrote earlier), collect all the table fragments,
and concatenate them into one big DataFrame.
### `WebSearcher` Class
Write a `WebSearcher` class that inherits from `GraphSearcher`. The
constructor should take a Chrome webdriver object as a parameter so
that it is possible to create a `WebSearcher` object with `ws =
WebSearcher(some_driver)`.
For example, one could run the following:
```python
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
options = Options()
options.headless = True
service = Service(executable_path="chromium.chromedriver")
driver = webdriver.Chrome(options=options, service=service)
ws = WebSearcher(driver)
```
The `go` method of `WebSearcher` should treat the node as a URL. It
should use the webdriver to visit that page and return the URLs of
other pages to which the visited page has hyperlinks. See `web_test`
in the tester for examples of how it should behave.
The `go` method should also use
[`pandas.read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)
to read any table fragments on a visited page and store them
somewhere (for example, in an attribute).
`WebSearcher` should have a `table()` method that
[concatenates](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
all the fragments in the order they were visited and returns one big
DataFrame. Use `ignore_index=True` when concatenating.
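Putting the pieces together, a `WebSearcher` might look roughly like this (a sketch that assumes `GraphSearcher`'s constructor takes no arguments; storing fragments in a list attribute is one choice among many):
```python
import pandas as pd
from selenium.webdriver.common.by import By

# Rough WebSearcher sketch; assumes GraphSearcher is defined in scrape.py.
class WebSearcher(GraphSearcher):
    def __init__(self, driver):
        super().__init__()
        self.driver = driver
        self.fragments = []   # table fragments (DataFrames), in visit order

    def go(self, url):
        self.driver.get(url)
        # collect any table fragments on this page
        self.fragments.extend(pd.read_html(self.driver.page_source))
        # this node's children are the pages it links to
        links = self.driver.find_elements(By.TAG_NAME, "a")
        return [link.get_attribute("href") for link in links]

    def table(self):
        return pd.concat(self.fragments, ignore_index=True)
```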
### Manual Debugging
Here is a code snippet you can use as you write your methods to help
test whether they're working:
```python
import os
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
import scrape
# kill previous chrome instance if still around (to conserve memory)
os.system("pkill -f -9 chromium")
options = Options()
options.headless = True
service = Service(executable_path="chromium.chromedriver")
driver = webdriver.Chrome(options=options, service=service)
# TODO: use IP address of your VM
start_url = "http://YOUR_VM_IP:5000/Node_1.html"
s = scrape.WebSearcher(driver)
print(s.go(start_url))
s = scrape.WebSearcher(driver)
s.bfs_search(start_url)
print(s.table())
driver.close()
```
Expected output:
```
['http://YOUR_VM_IP:5000/Node_2.html', 'http://YOUR_VM_IP:5000/Node_4.html']
clue latitude longitude description
0 1 43.089034 -89.416128 Picnic Point in Madison
1 7 38.105507 126.910613 Silver Beach in Hawaii
2 1 65.044901 -16.712836 Shore of a Volcanic Lake in Iceland
3 3 48.860945 2.335773 The Louvre in Paris
4 8 51.180315 -1.829659 Stonehenge in the UK
5 5 37.434183 -122.321990 Redwood forest in San Francisco
6 2 27.987586 86.925002 Mt. Everest in Nepal
7 4 34.134117 -118.321495 Hollywood Sign in LA
8 5 38.655100 90.061800 Cahokia Mounds in IL
9 9 40.748400 73.985700 Empire State Building in NY
10 4 29.975300 31.137600 Great Sphinx of Giza in Egypt
11 1 47.557600 10.749800 Neuschwanstein Castle in Germany
12 5 38.624700 90.184800 The Gateway Arch in St. Louis
13 3 30.328500 35.444400 Petra in Jordan
14 2 41.480800 82.683400 Cedar Point in OH
15 6 43.070010 -89.409450 Quick Trip on Monroe Street
```
# Individual Part (25%)
You have to do the remainder of this project on your own. Do not
discuss it with anybody except 320 staff (mentors, TAs, instructor).
## Part 4: `reveal_secrets` function
Write a function (remember that functions aren't inside any class) in
`scrape.py` like the following:
```python
def reveal_secrets(driver, url, travellog):
....
```
The function should do the following:
1. generate a password from the `clue` column of the `travellog` DataFrame. For example, if `travellog` is the big DataFrame built after doing BFS (as shown earlier), the password will start with "17138..." (a sketch of this appears after the hints)
2. visit `url` with the `driver`
3. automate typing the password in the box and clicking "GO"
4. wait until the page is loaded (perhaps with `time.sleep`)
5. click the "View Location" button and wait until the result finishes loading
6. save the image that appears to a file named 'Current_Location.jpg' (use the `requests` module to do the download, once you get the URL from selenium)
7. return the current location that appears on the page (should be "BASCOM HALL")
**Hints for step 6:** JPEG files are a binary format (they don't contain text for a human to read). You'll need to do some searching online to learn how to (a) download binary data and (b) write it to a file. Remember to cite any code you copy/paste. Here are some example Google searches you might start with (a rough sketch follows after the list):
* "how to write bytes to a file in python"
* "how to fetch a binary file with python requests"