CS 4395 Homework 6 - Solved


1. Build a web crawler function that starts with a URL representing a topic (a sport, your 
favorite film, a celebrity, a political issue, etc.) and outputs a list of at least 15 relevant 
URLs. The URLs can be pages within the original domain but should have a few outside 
the original domain. 
2. Write a function to loop through your URLs and scrape all text off each page. Store each 
page’s text in its own file. 
3. Write a function to clean up the text from each file. You might need to delete newlines 
and tabs first. Extract sentences with NLTK’s sentence tokenizer. Write the sentences for 
each file to a new file. That is, if you have 15 files in, you have 15 files out. 
4. Write a function to extract at least 25 important terms from the pages using an 
importance measure such as term frequency or tf-idf. First, it’s a good idea to lowercase 
everything and remove stopwords and punctuation. Print the top 25-40 terms. 
5. Manually determine the top 10 terms from step 4, based on your domain knowledge. 
6. Build a searchable knowledge base of facts that a chatbot (to be developed later) can 
share related to the 10 terms. The “knowledge base” can be as simple as a Python dict 
which you can pickle. More points for something more sophisticated, like SQL. 
7. In a doc: (1) describe how you created your knowledge base, include screenshots of the 
knowledge base, and indicate your top 10 terms; (2) write up a sample dialog you would 
like to create with a chatbot based on your knowledge base. 
8. Create a link to the report and code on your index page. 

CS 4395 Intro to NLP 
Dr. Karen Mazidi 
Caution: All course work is run through plagiarism detection software comparing 
students’ work as well as work from previous semesters and other sources. 
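
As a starting point for step 4, here is a minimal sketch of one way to rank terms by tf-idf using only the Python standard library. The names `STOPWORDS`, `tokenize`, and `top_terms` are hypothetical, the stopword set is a tiny placeholder (NLTK's stopword list is far more complete), and the smoothed idf formula is one common variant chosen for illustration, not a required one.

```python
import math
import re
from collections import Counter

# Tiny placeholder stopword set (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
             "on", "for", "with", "was"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def top_terms(docs, k=25):
    """Return the k terms with the highest summed tf-idf across docs."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue  # skip pages that had no usable text
        tf = Counter(tokens)
        for term, count in tf.items():
            # Smoothed idf: terms found in every document keep a weight of 1.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += (count / len(tokens)) * idf
    return scores.most_common(k)


if __name__ == "__main__":
    # Placeholder sentences standing in for the cleaned page files.
    pages = [
        "The striker scored a goal.",
        "The goal was offside.",
        "Fans cheered the striker and the goal.",
    ]
    for term, score in top_terms(pages, k=5):
        print(f"{term}\t{score:.3f}")
```

Summing tf-idf over all pages favors terms that are both frequent and spread across the crawl. Taking the maximum per-page score instead would favor terms that dominate a single page.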
Be prepared to present your results to class: 
- what was your starter site 
- what kind of data did you get 
- how did you clean up the data 
- what were your top terms 
- show us your knowledge base 
- how might you use this data for a chatbot 
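
For the knowledge base in step 6, one possible SQL-backed layout is sketched below using Python's built-in `sqlite3` module. The table name `facts`, its two columns, and the helpers `build_kb` and `lookup` are assumptions made for illustration, and the sample rows are placeholders rather than real scraped sentences.

```python
import sqlite3


def build_kb(facts, db_path=":memory:"):
    """Create the facts table if needed and load (term, fact) pairs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts "
        "(term TEXT NOT NULL, fact TEXT NOT NULL)"
    )
    # Index the term column so lookups stay fast as the table grows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_facts_term ON facts (term)")
    conn.executemany(
        "INSERT INTO facts (term, fact) VALUES (?, ?)",
        [(term.lower(), fact) for term, fact in facts],
    )
    conn.commit()
    return conn


def lookup(conn, term):
    """Return all facts stored for a term, matched case-insensitively."""
    rows = conn.execute(
        "SELECT fact FROM facts WHERE term = ? ORDER BY rowid",
        (term.lower(),),
    )
    return [fact for (fact,) in rows]


if __name__ == "__main__":
    # Placeholder rows; real rows would be sentences from the cleaned files.
    kb = build_kb([
        ("goal", "Placeholder sentence mentioning the term goal."),
        ("striker", "Placeholder sentence mentioning the term striker."),
    ])
    print(lookup(kb, "Goal"))
```

Passing a file path instead of `":memory:"` persists the knowledge base to disk, so a chatbot could reopen it later without re-crawling.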
