Lab 9 :: Regex and Web Crawling
This lab focuses on regular expressions. We will use regular expressions to identify links in webpages and to extract useful information such as phone numbers and email addresses.
Step 0: Lab Preparation
- Review Lecture 26 on regex.
- The HTML "a" tag and its href attribute.
- The surprisingly complex email address format.
- The North American telephone number format.
- Rules for web crawling.
Step 1: Source Code
- Clone your private repo to an appropriate directory in your home folder (~/labs is a good choice):
$ git clone git@github.com:williams-cs/<git-username>-lab9.git
Remember, you can always get the repo address by using the ssh copy-to-clipboard link on GitHub.
- Once inside your <git-username>-lab9 directory, create a virtual environment using
$ pyvenv venv
Remember to use pyvenv instead of virtualenv.
- Activate your environment by typing:
$ . venv/bin/activate
- Remember that you must always activate your virtual environment whenever you open a new terminal.
- Install the requests library using:
$ pip install requests
- Type
$ git branch
and notice that you are currently editing the master branch.
- Create a new branch with
$ git branch crawl
- Check out this branch by typing
$ git checkout crawl
- Any changes you make to the repository are now isolated on this branch.
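To confirm, you can run git branch again; the asterisk should now mark the crawl branch, and the output should look something like this:
$ git branch
* crawl
  master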
Step 2: Filtering text for URLs, email addresses, and phone numbers
The file filters.py contains three functions that search for information in text:
- filter_urls: a function that returns a list of valid urls contained in the given string. This function is provided for you! Please read it carefully.
- filter_emails: a function that returns a list of valid email addresses contained in the given string.
- filter_phones: a function that returns a list of valid phone numbers contained in the given string. The list should have uniform formatting.
Here are some implementation notes.
- The filter_urls function, which is provided for you, extracts all urls that:
  - Appear inside an "a" tag: <a href="url-goes-here">test</a>
  - Start with "http:"
  - Are part of the "williams.edu" domain (by default)
  - Are not media files (.jpeg, .svg, .mp3, .png, ...)
- In filter_phones, you should extract valid 10-digit phone numbers that conform to the North American Numbering Plan. These phone numbers may appear in many formats, but they all have common characteristics:
  - A 3-digit area code (NPA)
  - A 7-digit subscriber number with a 3-digit prefix (NXX) and 4-digit suffix (XXXX)
  You may find re.finditer useful in conjunction with groups. You may also find the \D pattern helpful, as it matches all non-digit characters.
Some of these functions can be broken into smaller steps. One or more helper functions might make your life easier.
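If you want a starting point, here is a rough sketch of one possible filter_phones that uses re.finditer with groups and the \D pattern. The pattern below is only an illustration: it handles the separators in the test string shown below, but it does not enforce every NANP rule (for example, it accepts area codes beginning with 0 or 1).

import re

def filter_phones(text):
    """Return a list of phone numbers, each formatted as NPA-NXX-XXXX."""
    # Optional parentheses around the area code; between digit groups we allow
    # a short run of non-digit characters (\D), which covers "-", ".", " - ", etc.
    pattern = r'\(?(\d{3})\)?\D{1,3}(\d{3})\D{1,3}(\d{4})'
    numbers = []
    for match in re.finditer(pattern, text):
        npa, nxx, xxxx = match.groups()
        numbers.append('{}-{}-{}'.format(npa, nxx, xxxx))
    return numbers

Try a sketch like this against trickier inputs of your own before settling on a final pattern.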
Test your code:
>>> from filters import *
>>> phones = "413-597-2309 (413) 597-2211 413.597.4444 (413) 597 - 4711"
>>> filter_phones(phones)
['413-597-2309', '413-597-2211', '413-597-4444', '413-597-4711']
>>> emails = "someone@gmail.com prof@cs.williams.edu l337-h@xors.biz invalide@ .com fake@bogus"
>>> filter_emails(emails)
['someone@gmail.com', 'prof@cs.williams.edu', 'l337-h@xors.biz']
These are very basic examples. When writing good code, we should think of and evaluate corner-case behaviors. Please think of other inputs and check how well your functions handle those strings.
Step 3: The WebPage class
Develop a class called WebPage in crawl.py that encapsulates the contents of a particular webpage. Besides __init__, your class should support a method called populate. This method uses the requests library to fetch the content associated with the url and populate the WebPage object's internal state with a set of urls, a set of emails, and a set of phone numbers. You should also write a magic method called __eq__(self, page). This method is called automatically when you compare two WebPage objects with ==. In other words, __eq__ should return True if and only if self.url() is equal to page.url(). The __hash__ method is provided for you; it says the hash of a web page is the hash of its url. This allows you to add WebPage objects to Python sets.
class WebPage:
    def __init__(self, url):
        """
        Initializes a WebPage's state including data structures to store:
            - the set of urls in the WebPage's source
            - the set of emails in the WebPage's source
            - the set of phone numbers in the WebPage's source
        Args:
            url (str): the url to search
        """

    def __hash__(self):
        """Return the hash of the URL"""
        return hash(self.url())

    def __eq__(self, page):
        """
        return True if and only if the url of this page equals the url of page.
        """

    def populate(self):
        """
        fetch this WebPage object's webpage text and populate its content
        """

    def url(self):
        """return the url associated with the WebPage"""

    def phone_numbers(self):
        """return the set of phone numbers associated with this WebPage"""

    def emails(self):
        """return the set of emails associated with this WebPage"""

    def urls(self):
        """return the set of urls embedded in this WebPage"""
In populate, you will use the requests library to fetch the text associated with a url. Recall the syntax:
>>> import requests
>>> r = requests.get('http://www.williams.edu')
The response object returned by the get call has a public member variable called status_code, which you can compare against requests.codes.ok to see whether the request was successful:
>>> r.status_code == requests.codes.ok
True
If your request is successful, then you can grab the text of the response (i.e., the underlying HTML) using the text member variable:
>>> r.text
'\n<!DOCTYPE html>\n<html lang="en-US" class="no-js">\n<head>\n...'
If you observe an error, then it is acceptable to gracefully return from the populate method. You would then have empty sets of urls, emails, and phones.
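To make this concrete, here is one possible sketch of __eq__ and populate. The attribute names (self._url, self._urls, self._emails, self._phones) are placeholders for whatever your own __init__ defines, and the sketch assumes the filter functions from filters.py:

import requests
from filters import filter_urls, filter_emails, filter_phones

class WebPage:
    # ... __init__, __hash__, and the accessor methods as in the skeleton above ...

    def __eq__(self, page):
        """Return True if and only if this page's url equals page's url."""
        return self.url() == page.url()

    def populate(self):
        """Fetch this WebPage's text and fill in its urls, emails, and phones."""
        r = requests.get(self._url)
        if r.status_code != requests.codes.ok:
            return  # unsuccessful request: leave the sets empty
        text = r.text
        self._urls = set(filter_urls(text))
        self._emails = set(filter_emails(text))
        self._phones = set(filter_phones(text))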
Here is some example usage:
>>> w = WebPage("http://www.cs.williams.edu")
>>> w.populate()
>>> sorted(w.urls())[:5]
['http://admission.williams.edu/visit/plan/gettinghere', 'http://science.williams.edu/', 'http://www.cs.williams.edu/2014-2015/', 'http://www.cs.williams.edu/about-the-computer-science-department/', 'http://www.cs.williams.edu/about-the-computer-science-department/#more-82']
>>> w.emails()
{'requestinfo@cs.williams.edu'}
>>> w.phone_numbers()
{'413-597-3218', '920-350-1000', '374-374-1723', '614-614-7524', '456-456-2243', '373-373-7523', '625-038-1802', '400-400-2017', '437-437-7522', '413-597-4250', '376-376-2389'}
>>> w2 = WebPage("http://www.cs.williams.edu")
>>> w2 == w
True
Step 4: The WebCrawler class
In crawl.py, develop a class called WebCrawler that programmatically traverses the web in a controlled way. The WebCrawler class supports a crawl method, which, starting from an initial URL, creates a WebPage object and adds the URLs found on that web page to the end of a list. The crawling continues by popping the first element off the list and using this new link as the next URL to search. It should stop crawling completely after traversing max_links total links or if there are no new web pages to crawl.
Your WebCrawler class should support the following methods:
- __init__: accepts both a base_url parameter and a max_links parameter. An instance of WebCrawler should also maintain a set of WebPage objects that it has visited.
- crawl: starting from the base_url and until the stopping conditions are met, crawl creates a WebPage object, adds the URLs on this webpage to the end of a list, pops the first URL off the front of the list, and then continues exploring. crawl should never visit a web page twice. In this way, your WebCrawler should maintain a collection of visited web pages that serves both as a set from which you can test visitation (please test before populating) and as a source from which to assemble global sets of emails and phone numbers.
- all_emails: returns the set of all email addresses harvested during a successful crawl
- all_phones: returns the set of all phone numbers harvested during a successful crawl
- all_urls: returns the set of all urls that were successfully traversed
- output_results: writes the urls, emails, and phone numbers to the specified file in an easy-to-read format.
class WebCrawler:
    def __init__(self, base_url, max_links=50):
        """
        Initialize the data structures required to crawl the web.
        Args:
            base_url (str): the starting point of our crawl
            max_links (int): after traversing this many links, stop the crawl
        """

    def crawl(self):
        """
        starting from self._base_url and until stopping conditions are met,
        creates WebPage objects and recursively explores their links.
        """

    def all_emails(self):
        """
        returns the set of all email addresses harvested during a successful crawl
        """

    def all_phones(self):
        """
        returns the set of all phone numbers harvested during a successful crawl
        """

    def all_urls(self):
        """
        returns the set of all urls traversed during a crawl
        """

    def output_results(self, filename):
        """
        In an easy-to-read format, writes the report of a successful crawl
        to the file specified by 'filename'. This includes the starting url,
        the set of urls traversed, all emails encountered, and the set of
        phone numbers (recorded in a standardized format of NPA-NXX-XXXX).
        """
Some implementation notes:
- In crawl, you may want to take the following steps (a rough sketch appears after these notes):
  - Initialize a counter to 0 and a to_visit list with the base URL.
  - While your list isn't empty and you haven't searched max_links URLs:
    - Remove a URL from the front of your list.
    - Decide if you have traversed this url before by creating a WebPage from it and checking if it is in your set of visited pages.
    - If not, populate your newly created WebPage object, add it to your collection of traversed web pages, and increment your counter.
    - Add the set of urls found by that WebPage to your to_visit list.
- You may find it helpful to add informative prints that indicate when certain events have occurred, for example, when you have reached a stopping condition or when you encounter a candidate link that has already been traversed.
- In output_results, we suggest the following format:
  - A header line that indicates the name of the starting url and any other parameters. This will allow someone who reads your results to understand and recreate your process.
  - A section header for each of the three types of information (URLs:/Emails:/Phones:)
  - The information for each category, with one element per line
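Putting those notes together, here is a rough sketch of crawl and output_results. The attribute names (self._base_url, self._max_links, and a self._visited set created in __init__) are placeholders; adapt them to your own design.

class WebCrawler:
    # ... __init__ and the all_* accessor methods as in the skeleton above ...

    def crawl(self):
        """Traverse the web starting from self._base_url."""
        to_visit = [self._base_url]
        count = 0
        while to_visit and count < self._max_links:
            url = to_visit.pop(0)          # take the next URL off the front
            page = WebPage(url)
            if page in self._visited:      # relies on WebPage.__eq__ and __hash__
                continue                   # already traversed this page: skip it
            page.populate()
            self._visited.add(page)
            count += 1
            to_visit.extend(page.urls())   # queue this page's links for later

    def output_results(self, filename):
        """Write the crawl report to 'filename' in the suggested format."""
        with open(filename, 'w') as report:
            report.write("Crawl of {} (max_links={})\n".format(self._base_url, self._max_links))
            for header, items in [("URLs:", self.all_urls()),
                                  ("Emails:", self.all_emails()),
                                  ("Phones:", self.all_phones())]:
                report.write(header + "\n")
                for item in sorted(items):
                    report.write(item + "\n")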
You can run your code from the command line using the following:
$ python3 crawl.py http://www.cs.williams.edu report.txt
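If you have not written a command-line entry point before, a minimal sketch (with no argument checking) at the bottom of crawl.py might look like this:

import sys

if __name__ == "__main__":
    start_url = sys.argv[1]      # e.g. http://www.cs.williams.edu
    report_file = sys.argv[2]    # e.g. report.txt
    crawler = WebCrawler(start_url)
    crawler.crawl()
    crawler.output_results(report_file)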
Step 5: Submission
- Now commit your additions to the repository:
$ git commit -a -m "replace with your log message"
- Push your changes back to your GitHub repo:
$ git push
You will probably be asked to type
$ git push --set-upstream origin crawl
which you should do. This pushes your crawl branch back up to the GitHub repo.
- Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with links to compare and pull request. Go ahead and issue a PR.