COMP598 Homework 7 – Data Scraping Solution


Non-standard (i.e., not built-in) Python libraries you can use:
- pandas
- requests
- BeautifulSoup
Task 1: Scraping relationships (10 pts)

Write a script collect_relationships.py that collects the relationships for a set of celebrities provided in a JSON configuration file as follows:

python scripts/collect_relationships.py -c <config-file.json> -o <output_file.json>

where config-file.json contains a single JSON dictionary with the following structure (the exact path and list of celebrities can, obviously, change):

{
"cache_dir": ".data/wdw_cache",
"target_people": [ "robert-downey-jr", "justin-bieber" ]
}


The output format for the file is:

{
"robert-downey-jr": [ "person-1", "person-2", "person-3" ],
"justin-bieber": []
}

The identifiers in each list are the people that person has had a relationship with. If a person has had no relationships, their list is empty.
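
A minimal sketch of the command-line plumbing for Task 1, assuming only the -c and -o flags and the config and output formats described above. The function names and the scrape callback are hypothetical placeholders, not part of the assignment, and the scraping logic itself is left out.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the -c (config file) and -o (output file) options."""
    parser = argparse.ArgumentParser(description="Collect celebrity relationships.")
    parser.add_argument("-c", "--config", required=True, help="path to the JSON config file")
    parser.add_argument("-o", "--output", required=True, help="path of the JSON file to write")
    return parser.parse_args(argv)


def load_config(path):
    """Return (cache_dir, target_people) from the JSON config file."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    return Path(config["cache_dir"]), list(config["target_people"])


def write_relationships(relationships, path):
    """Write the {person: [partners]} dictionary as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(relationships, fh, indent=2)


def main(argv=None, scrape=lambda person, cache_dir: []):
    # `scrape` is a hypothetical hook: it should return the list of
    # identifiers for one person. The default returns an empty list.
    args = parse_args(argv)
    cache_dir, people = load_config(args.config)
    relationships = {person: scrape(person, cache_dir) for person in people}
    write_relationships(relationships, args.output)
```

In a real script, main() would be called from an `if __name__ == "__main__":` guard with an actual scraping function passed in place of the default.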
Task 2: Getting course information (20 pts)
python scripts/scrape_courses.py -c <caching_dir> <page#>
Your script must cache to the directory specified. The page# indicates which URL will be loaded. The courses should be printed in CSV format to stdout with the following columns (header included): CourseID, Course Name, # of credits
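
One possible way to handle the caching and CSV requirements is sketched below. The helper names, the hashed cache filenames, and the fetch callback are all assumptions made for illustration; the assignment only requires that pages be cached in the given directory and that the CSV go to stdout with the listed header.

```python
import csv
import hashlib
import sys
from pathlib import Path


def cached_fetch(url, cache_dir, fetch):
    """Return the page body for `url`, using `cache_dir` when possible.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page text; it is only called when no cached copy exists.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe filename (an assumption,
    # not a naming scheme required by the assignment).
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / name
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html


def print_courses(rows, out=None):
    """Write the header and (CourseID, Course Name, credits) rows as CSV."""
    writer = csv.writer(out if out is not None else sys.stdout, lineterminator="\n")
    writer.writerow(["CourseID", "Course Name", "# of credits"])
    writer.writerows(rows)
```

With the requests library, the fetch argument could be something like `lambda u: requests.get(u).text`, so the network is only hit the first time a page is requested.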
You should assume that all courses will be delivered with structure like this:

Where “ACCT 626” is the CourseID, “Data Analytics in Accounting” is the course name, and “1.5” is the # of credits. If the course encountered does NOT have this structure, ignore it. (Note that the course number in the CourseID can contain letters as well, e.g., “ACCT 645D1”.)
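
The example structure itself is not reproduced above, so the following is only a sketch under a stated assumption: that each course title reads like "ACCT 626 Data Analytics in Accounting (1.5 credits)". The regular expression and the parse_course helper are hypothetical and would need adjusting to the real page markup. Titles that do not match return None, so they can be ignored.

```python
import re

# Assumed title shape: "<4-letter subject> <3 digits + optional suffix> <name> (<n> credits)".
COURSE_RE = re.compile(
    r"^\s*(?P<id>[A-Z]{4} \d{3}[A-Z0-9]*)\s+"
    r"(?P<name>.+?)\s+"
    r"\((?P<credits>\d+(?:\.\d+)?) credits?\)\s*$"
)


def parse_course(title):
    """Return (course_id, name, credits) or None if the title does not match."""
    match = COURSE_RE.match(title)
    if match is None:
        return None
    return match.group("id"), match.group("name"), match.group("credits")
```

Credits are kept as a string so that a value such as "1.5" is printed exactly as it appeared on the page.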
Submission Instructions
Your MyCourses submission must be a single zip file titled HW7_<studentid>.zip. It should contain the following items:
- scripts/
  - collect_relationships.py – script for Task 1
  - scrape_courses.py – script for Task 2
