Lab 9 :: Regex and Web Crawling
This lab focuses on regular expressions. We will use regular expressions to identify links in webpages and to extract useful information such as phone numbers and email addresses.
Step 0: Lab Preparation
- Review Lecture 26 on regex.
- The HTML "a" tag and its href attribute.
- The surprisingly complex email address format.
- The North American telephone number format.
- Rules for web crawling.
Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone git@github.com:williams-cs/<git-username>-lab9.git
Remember, you can always get the repo address by using the ssh copy-to-clipboard link on GitHub.
- Once inside your <git-username>-lab9 directory, create a virtual environment using
$ pyvenv venv
Remember to use pyvenv instead of virtualenv.
- Activate your environment by typing:
$ . venv/bin/activate
- Remember that you must always activate your virtual environment whenever you open a new terminal.
- Install the requests library using:
$ pip install requests
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch crawl
- Check out this branch by typing
$ git checkout crawl
- Any changes you make to the repository are now isolated on this branch.
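To confirm, you can run git branch again; the asterisk should now mark the crawl branch, and the output should look something like this:
$ git branch
* crawl
  master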
Step 2: Filtering text for URLs, email addresses, and phone numbers
The file filters.py contains three functions that search for information in text:
- filter_urls: a function that returns a list of valid urls contained in the given string. This function is provided for you! Please read it carefully.
- filter_emails: a function that returns a list of valid email addresses contained in the given string.
- filter_phones: a function that returns a list of valid phone numbers contained in the given string. The list should have uniform formatting.
Here are some implementation notes.
- The filter_urls function, which is provided for you, extracts all urls that:
  - Appear inside an "a" tag: <a href="url-goes-here">test</a>
  - Start with "http:"
  - Are part of the "williams.edu" domain (by default)
  - Are not media files (.jpeg, .svg, .mp3, .png, ...)
- In filter_phones, you should extract valid 10-digit phone numbers that conform to the North American Numbering Plan. These phone numbers may appear in many formats, but they all have common characteristics:
  - A 3-digit area code (NPA)
  - A 7-digit subscriber number with a 3-digit prefix (NXX) and 4-digit suffix (XXXX)
  You may find re.finditer useful in conjunction with groups. You may also find the \D pattern helpful, as it matches all non-digit characters.
Some of these functions can be broken into smaller steps. One or more helper functions might make your life easier.
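If you want a starting point, here is a rough sketch of one possible filter_phones that uses re.finditer with groups and the \D pattern. The pattern below is only an illustration: it handles the separators in the test string shown below, but it does not enforce every NANP rule (for example, it accepts area codes beginning with 0 or 1).

import re

def filter_phones(text):
    """Return a list of phone numbers, each formatted as NPA-NXX-XXXX."""
    # Optional parentheses around the area code; between digit groups we allow
    # a short run of non-digit characters (\D), which covers "-", ".", " - ", etc.
    pattern = r'\(?(\d{3})\)?\D{1,3}(\d{3})\D{1,3}(\d{4})'
    numbers = []
    for match in re.finditer(pattern, text):
        npa, nxx, xxxx = match.groups()
        numbers.append('{}-{}-{}'.format(npa, nxx, xxxx))
    return numbers

Try a sketch like this against trickier inputs of your own before settling on a final pattern.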
Test your code:
>>> from filters import *
>>> phones = "413-597-2309 (413) 597-2211 413.597.4444 (413) 597 - 4711"
>>> filter_phones(phones)
['413-597-2309', '413-597-2211', '413-597-4444', '413-597-4711']
>>> emails = "someone@gmail.com prof@cs.williams.edu l337-h@xors.biz invalide@ .com fake@bogus"
>>> filter_emails(emails)
['someone@gmail.com', 'prof@cs.williams.edu', 'l337-h@xors.biz']
These are very basic examples. When writing good code, we should think of and evaluate corner-case behaviors. Please think of other inputs and check how well your functions handle those strings.
Step 3: The WebPage class
Develop a class called WebPage in crawl.py that encapsulates the contents of a particular webpage. Besides __init__, your class should support a method called populate. This method uses the requests library to fetch the content associated with the url and populate the WebPage object's internal state with a set of urls, a set of emails, and a set of phone numbers. You should also write a magic method called __eq__(self, page). This method is called automatically when you compare two WebPage objects with ==. In other words, __eq__ should return True if and only if self.url() is equal to page.url(). The __hash__ method is provided for you; it says the hash of a web page is the hash of its url. This allows you to add WebPage objects to Python sets.
class WebPage:
    def __init__(self, url):
        """
        Initializes a WebPage's state including data structures to store:
            - the set of urls in the WebPage's source
            - the set of emails in the WebPage's source
            - the set of phone numbers in the WebPage's source
        Args:
            url (str): the url to search
        """

    def __hash__(self):
        """Return the hash of the URL"""
        return hash(self.url())

    def __eq__(self, page):
        """
        return True if and only if the url of this page equals the url of page.
        """

    def populate(self):
        """
        fetch this WebPage object's webpage text and populate its content
        """

    def url(self):
        """return the url associated with the WebPage"""

    def phone_numbers(self):
        """return the set of phone numbers associated with this WebPage"""

    def emails(self):
        """return the set of emails associated with this WebPage"""

    def urls(self):
        """return the set of urls embedded in this WebPage"""
In populate, you will use the requests library to fetch the text associated with a url. Recall the syntax:
>>> import requests
>>> r = requests.get('http://www.williams.edu')
The response object returned by the get call has a public member variable called status_code, which you can compare against requests.codes.ok to see whether the request was successful:
>>> r.status_code == requests.codes.ok
True
If your request is successful, then you can grab the text of the response (i.e., the underlying HTML) using the text member variable:
>>> r.text
'\n<!DOCTYPE html>\n<html lang="en-US" class="no-js">\n<head>\n...'
If you observe an error, then it is acceptable to gracefully return from the populate method. You would then have empty sets of urls, emails, and phones.
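To make this concrete, here is one possible sketch of __eq__ and populate. The attribute names (self._url, self._urls, self._emails, self._phones) are placeholders for whatever your own __init__ defines, and the sketch assumes the filter functions from filters.py:

import requests
from filters import filter_urls, filter_emails, filter_phones

class WebPage:
    # ... __init__, __hash__, and the accessor methods as in the skeleton above ...

    def __eq__(self, page):
        """Return True if and only if this page's url equals page's url."""
        return self.url() == page.url()

    def populate(self):
        """Fetch this WebPage's text and fill in its urls, emails, and phones."""
        r = requests.get(self._url)
        if r.status_code != requests.codes.ok:
            return  # unsuccessful request: leave the sets empty
        text = r.text
        self._urls = set(filter_urls(text))
        self._emails = set(filter_emails(text))
        self._phones = set(filter_phones(text))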
Here is some example usage:
>>> w = WebPage("http://www.cs.williams.edu")
>>> w.populate()
>>> sorted(w.urls())[:5]
['http://admission.williams.edu/visit/plan/gettinghere', 'http://science.williams.edu/', 'http://www.cs.williams.edu/2014-2015/', 'http://www.cs.williams.edu/about-the-computer-science-department/', 'http://www.cs.williams.edu/about-the-computer-science-department/#more-82']
>>> w.emails()
{'requestinfo@cs.williams.edu'}
>>> w.phone_numbers()
{'413-597-3218', '920-350-1000', '374-374-1723', '614-614-7524', '456-456-2243', '373-373-7523', '625-038-1802', '400-400-2017', '437-437-7522', '413-597-4250', '376-376-2389'}
>>> w2 = WebPage("http://www.cs.williams.edu")
>>> w2 == w
True
Step 4: The WebCrawler class
In crawl.py, develop a class called WebCrawler that programmatically traverses the web in a controlled way. The WebCrawler class supports a crawl method, which, starting from an initial URL, creates a WebPage object and adds the URLs found on that web page to the end of a list. The crawling continues by popping the first element off the list and using this new link as the next URL to search. It should stop crawling completely after traversing max_links total links or if there are no new web pages to crawl.
Your WebCrawler class should support the following methods:
- __init__: accepts both a base_url parameter and a max_links parameter. An instance of WebCrawler should also maintain a set of WebPage objects that it has visited.
- crawl: starting from the base_url and until the stopping conditions are met, crawl creates a WebPage object, adds the URLs on this webpage to the end of a list, pops the first URL off the front of the list, and then continues exploring. crawl should never visit a web page twice. In this way, your WebCrawler should maintain a collection of visited web pages that serves both as a set from which you can test visitation (please test before populating) and as a source from which to assemble global sets of emails and phone numbers.
- all_emails: returns the set of all email addresses harvested during a successful crawl
- all_phones: returns the set of all phone numbers harvested during a successful crawl
- all_urls: returns the set of all urls that were successfully traversed
- output_results: writes the urls, emails, and phone numbers to the specified file in an easy-to-read format.
class WebCrawler:
    def __init__(self, base_url, max_links=50):
        """
        Initialize the data structures required to crawl the web.
        Args:
            base_url (str): the starting point of our crawl
            max_links (int): after traversing this many links, stop the crawl
        """

    def crawl(self):
        """
        starting from self._base_url and until stopping conditions are met,
        creates WebPage objects and recursively explores their links.
        """

    def all_emails(self):
        """
        returns the set of all email addresses harvested during a successful crawl
        """

    def all_phones(self):
        """
        returns the set of all phone numbers harvested during a successful crawl
        """

    def all_urls(self):
        """
        returns the set of all urls traversed during a crawl
        """

    def output_results(self, filename):
        """
        In an easy-to-read format, writes the report of a successful crawl
        to the file specified by 'filename'. This includes the starting url,
        the set of urls traversed, all emails encountered, and the set of
        phone numbers (recorded in a standardized format of NPA-NXX-XXXX).
        """
Some implementation notes:
- In crawl, you may want to take the following steps (a rough sketch appears after these notes):
  - Initialize a counter to 0 and a to_visit list with the base URL.
  - While your list isn't empty and you haven't searched max_links URLs:
    - Remove a URL from the front of your list.
    - Decide if you have traversed this url before by creating a WebPage from it and checking if it is in your set of visited pages.
    - If not, populate your newly created WebPage object, add it to your collection of traversed web pages, and increment your counter.
    - Add the set of urls found by that WebPage to your to_visit list.
- You may find it helpful to add informative prints that indicate when certain events have occurred, for example, when you have reached a stopping condition or when you encounter a candidate link that has already been traversed.
- In output_results, we suggest the following format:
  - A header line that indicates the name of the starting url and any other parameters. This will allow someone who reads your results to understand and recreate your process.
  - A section header for each of the three types of information (URLs:/Emails:/Phones:)
  - The information for each category, with one element per line
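Putting those notes together, here is a rough sketch of crawl and output_results. The attribute names (self._base_url, self._max_links, and a self._visited set created in __init__) are placeholders; adapt them to your own design.

class WebCrawler:
    # ... __init__ and the all_* accessor methods as in the skeleton above ...

    def crawl(self):
        """Traverse the web starting from self._base_url."""
        to_visit = [self._base_url]
        count = 0
        while to_visit and count < self._max_links:
            url = to_visit.pop(0)          # take the next URL off the front
            page = WebPage(url)
            if page in self._visited:      # relies on WebPage.__eq__ and __hash__
                continue                   # already traversed this page: skip it
            page.populate()
            self._visited.add(page)
            count += 1
            to_visit.extend(page.urls())   # queue this page's links for later

    def output_results(self, filename):
        """Write the crawl report to 'filename' in the suggested format."""
        with open(filename, 'w') as report:
            report.write("Crawl of {} (max_links={})\n".format(self._base_url, self._max_links))
            for header, items in [("URLs:", self.all_urls()),
                                  ("Emails:", self.all_emails()),
                                  ("Phones:", self.all_phones())]:
                report.write(header + "\n")
                for item in sorted(items):
                    report.write(item + "\n")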
You can run your code from the command line using the following:
$ python3 crawl.py http://www.cs.williams.edu report.txt
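If you have not written a command-line entry point before, a minimal sketch (with no argument checking) at the bottom of crawl.py might look like this:

import sys

if __name__ == "__main__":
    start_url = sys.argv[1]      # e.g. http://www.cs.williams.edu
    report_file = sys.argv[2]    # e.g. report.txt
    crawler = WebCrawler(start_url)
    crawler.crawl()
    crawler.output_results(report_file)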
Step 5: Submission
- Now commit your additions to the repository:
$ git commit -a -m "replace with your log message"
- Push your changes back to your GitHub repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin crawl
which you should do. This pushes your crawl branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.