SI630 Homework 0: Regular Expressions

1 Background
Email addresses are everywhere online. On personal web pages especially, people list their email address as an easy way for others to get in touch with them. However, unscrupulous spammers also harvest these addresses to send unwanted email. As a result, some web page authors obfuscate their address so that a human can still work out what it is but a machine cannot easily detect it. For example, someone might write myname@domain.edu as myname at domain dot edu.
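To make the obfuscation example concrete, here is a minimal Python sketch that undoes only the single "name at domain dot tld" spelling shown above. The pattern and the function name deobfuscate are illustrative assumptions, not part of the assignment, and real pages will use many other spellings.

```python
import re

# Hypothetical pattern: covers only the "name at domain dot tld" spelling
# from the example, with one word for each of the three parts.
OBFUSCATED = re.compile(r"(\w+)\s+at\s+(\w+)\s+dot\s+(\w+)", re.IGNORECASE)


def deobfuscate(text):
    """Return name@domain.tld if the obfuscated spelling is found, else None."""
    match = OBFUSCATED.search(text)
    if match is None:
        return None
    # The three capture groups hold the user name, the domain, and the suffix.
    user, domain, tld = match.groups()
    return f"{user}@{domain}.{tld}"


print(deobfuscate("myname at domain dot edu"))  # myname@domain.edu
```

The capture groups do the work here: the words "at" and "dot" are matched but discarded, and only the captured pieces are reassembled into the standard form.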

2 Task
You’ve been asked to perform a security audit for a large university. They want to know what kinds of email addresses might be recoverable from each web page. Conveniently, they’ve already collected all of the web pages into a single file, where the HTML for each page is on one line. Further, every page is guaranteed to contain at most one email address, since no one lists two email addresses for themselves on a page. However, not everyone lists their email, so some pages contain no email address at all. The big challenge is that there is no consistency in how the addresses are formatted!

Problem 1. Write a program that uses regular expressions to extract and canonicalize email addresses from web pages. Hint: regex groups may come in handy here. You will be provided with a large file of web pages on Canvas, where each page is on a separate line. Your program will produce a new file containing, for each page, either the canonicalized email address found on that page or the word None if no email address was found. By canonicalized, we mean that if the author wrote myname at domain dot edu, you would output myname@domain.edu in your file. Your output file should have the same number of lines as the input file.
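One possible shape for such a program is sketched below in Python, under stated assumptions. The separator spellings in AT and DOT, the names canonicalize and process, and the file paths in the usage note are all hypothetical. The pattern handles only plain addresses and the "at"/"dot" spelling, so the real data will almost certainly need more alternatives, and ordinary prose such as "look at page dot html" can produce false matches.

```python
import re

# Assumed separator spellings; extend these after inspecting the real data.
AT = r"(?:@|\s+at\s+)"
DOT = r"(?:\.|\s+dot\s+)"

# Group 1 is the user part, group 2 is the domain (at least two labels).
EMAIL = re.compile(
    rf"([\w+-]+(?:{DOT}[\w+-]+)*){AT}([\w-]+(?:{DOT}[\w-]+)+)",
    re.IGNORECASE,
)
DOT_RE = re.compile(DOT, re.IGNORECASE)


def canonicalize(page):
    """Return the canonical address found in one page of HTML, or "None"."""
    match = EMAIL.search(page)
    if match is None:
        return "None"
    # Rewrite every dot spelling inside each captured part as a literal ".".
    user, domain = (DOT_RE.sub(".", part) for part in match.groups())
    return f"{user}@{domain}"


def process(in_path, out_path):
    """Write exactly one output line per input line (one page per line)."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for page in src:
            dst.write(canonicalize(page) + "\n")
```

Calling process("pages.txt", "emails.txt") with placeholder paths would write one line per page. Because the loop writes a line for every input line, including the None case, the output line count matches the input by construction.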
