Diving Into the Deluge of Data :: Computer Science 135

Lab 9 :: Regex and Web Crawling

This lab focuses on regular expressions. We will use regular expressions to identify links in webpages and extract useful information such as phone numbers and email addresses.

Step 0: Lab Preparation

  • Review Lecture 26 on regular expressions.
  • The HTML href attribute (a short illustration follows this list).
  • The surprisingly complex email address format.
  • The North American telephone number format.
  • Rules for web crawling.
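
For the href item above, the snippet below shows what an href attribute looks like in HTML source and one deliberately loose regular expression for pulling out the quoted URL. It is only an illustration of the idea, not the pattern you are required to use.

    import re

    # Links in HTML usually appear as href attributes inside anchor tags,
    # e.g. <a href="http://www.cs.williams.edu">CS Department</a>.
    # A simple (intentionally loose) pattern for the quoted value:
    HREF_PATTERN = re.compile(r'href="(http[^"]+)"')

    html = '<a href="http://www.cs.williams.edu">CS Department</a>'
    print(HREF_PATTERN.findall(html))    # ['http://www.cs.williams.edu']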

Step 1: Source Code

Step 2: Filtering Text for URLs, Email Addresses, and Phone Numbers

The file filters.py contains three functions that search for information in text: one that finds URLs, one that finds email addresses, and one that finds phone numbers.

Here are some implementation notes.

Some of these functions can be broken into smaller steps. One or more helper functions might make your life easier.
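
To make this concrete, here is a hedged sketch of one way filter_phones could be written with a single regular expression, normalizing each match to the NPA-NXX-XXXX form shown in the expected output below. The pattern and the PHONE_PATTERN name are illustrative, not required; your solution may be structured differently.

    import re

    # One possible pattern for North American numbers: an optional parenthesized
    # area code, then two more digit groups separated by '-', '.', or spaces.
    PHONE_PATTERN = re.compile(r'\(?(\d{3})\)?[-. ]*(\d{3})[-. ]*(\d{4})')

    def filter_phones(text):
        """Return the phone numbers found in text, normalized to NPA-NXX-XXXX."""
        return ['{}-{}-{}'.format(npa, nxx, last4)
                for npa, nxx, last4 in PHONE_PATTERN.findall(text)]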


Test your code:

    >>> from filters import *
    >>> phones = "413-597-2309 (413) 597-2211 413.597.4444 (413) 597 - 4711"
    >>> filter_phones(phones)
    ['413-597-2309', '413-597-2211', '413-597-4444', '413-597-4711']
    >>> emails = "someone@gmail.com prof@cs.williams.edu l337-h@xors.biz invalide@ .com fake@bogus"
    >>> filter_emails(emails)
    ['someone@gmail.com', 'prof@cs.williams.edu', 'l337-h@xors.biz']

These are very basic examples. When writing good code, we should think of and evaluate corner-case behaviors. Please think of other inputs and check how well your functions handle those strings.

Step 3: The WebPage class

Develop a class called WebPage in crawl.py that encapsulates the contents of a particular webpage. Besides __init__, your class should support a method called populate. This method uses the requests library to fetch the content associated with the url and populate the WebPage object's internal state with a set of urls, a set of emails, and a set of phone numbers. You should also write a magic method called __eq__(self, page), which is called automatically when you compare two WebPage objects with ==. In other words, __eq__ should return True if and only if self.url() is equal to page.url(). The __hash__ method is provided for you; it says the hash of a web page is the hash of its url. Together with __eq__, this allows you to add WebPage objects to Python sets.

class WebPage:
    def __init__(self, url):
        """
        Initializes a WebPage's state including data structures to store:
        - the set of urls in the WebPage's source
        - the set of emails in the WebPage's source
        - the set of phone numbers in the WebPage's source
        Args:
            url (str): the url to search
        """

    def __hash__(self):
        """Return the hash of the URL"""
        return hash(self.url())

    def __eq__(self, page):
        """
        return True if and only if the url of this page equals the url
        of page.
        """

    def populate(self):
        """
        fetch this WebPage object's webpage text and populate its content
        """

    def url(self):
        """return the url asssociated with the WebPage"""

    def phone_numbers(self):
        """return the set of phone numbers associated with this WebPage"""

    def emails(self):
        """return the set of emails associated with this WebPage"""

    def urls(self):
        """return the set of urls embedded with this WebPage"""
  

In populate, you will use the requests library to fetch the text associated with a url. Recall the syntax:

    >>> import requests
    >>> r = requests.get('http://www.williams.edu')
    

The response object returned by the get call has a public member variable called status_code, which you can compare against requests.codes.ok to see whether the request was successful:

    >>> r.status_code == requests.codes.ok
    True
    

If your request is successful, then you can grab the text of the request (i.e., the underlying HTML) using the text member variable:

    >>> r.text
    '\n<!DOCTYPE html>\n<html lang="en-US" class="no-js">\n<head>\n...'
    

If you observe an error, then it is acceptable to return gracefully from the populate method. The WebPage would then have empty sets of urls, emails, and phone numbers.
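
One possible, deliberately minimal shape for populate is sketched below. It assumes the constructor stores the url and three empty sets under illustrative attribute names, and that filters.py also provides a URL filter (the name filter_urls here is a guess); the remaining methods are omitted.

    import requests
    from filters import filter_emails, filter_phones, filter_urls  # filter_urls is assumed

    class WebPage:
        def __init__(self, url):
            self._url = url          # attribute names here are illustrative
            self._urls = set()
            self._emails = set()
            self._phones = set()

        def populate(self):
            """Fetch this page's HTML and fill in the url, email, and phone sets."""
            try:
                r = requests.get(self._url)
            except requests.exceptions.RequestException:
                return                               # network failure: keep the empty sets
            if r.status_code != requests.codes.ok:
                return                               # unsuccessful request: keep the empty sets
            self._urls = set(filter_urls(r.text))
            self._emails = set(filter_emails(r.text))
            self._phones = set(filter_phones(r.text))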

Here is some example usage:

  >>> w = WebPage("http://www.cs.williams.edu")
  >>> w.populate()
  >>> sorted(w.urls())[:5]
  ['http://admission.williams.edu/visit/plan/gettinghere',
   'http://science.williams.edu/', 'http://www.cs.williams.edu/2014-2015/',
   'http://www.cs.williams.edu/about-the-computer-science-department/',
   'http://www.cs.williams.edu/about-the-computer-science-department/#more-82']
  >>> w.emails()
  {'requestinfo@cs.williams.edu'}
  >>> w.phone_numbers()
  {'413-597-3218', '920-350-1000', '374-374-1723', '614-614-7524', '456-456-2243', '373-373-7523', '625-038-1802', '400-400-2017', '437-437-7522', '413-597-4250', '376-376-2389'}
  >>> w2 = WebPage("http://www.cs.williams.edu")
  >>> w2 == w
  True
  

Step 4: The WebCrawler class

In crawl.py, develop a class called WebCrawler that programmatically traverses the web in a controlled way. The WebCrawler class supports a crawl method, which, starting from an initial URL, creates a WebPage object and adds the URLs found on that web page to the end of a list. The crawling continues by popping the first element off the list and using this new link as the next URL to search. It should stop crawling completely after traversing max_links total links or if there are no new web pages to crawl.

Your WebCrawler class should support the following methods:

class WebCrawler:
    def __init__(self, base_url, max_links=50):
        """
        Initialize the data structures required to crawl the web.
        Args:
           base_url (str): the starting point of our crawl
           max_links (int): after traversing this many links, stop the crawl
        """

    def crawl(self):
        """
        starting from self._base_url and until stopping conditions are met,
        creates WebPage objects and recursively explores their links.
        """

    def all_emails(self):
        """
        returns the set of all email addresses harvested during a
        successful crawl
        """

    def all_phones(self):
        """
        returns the set of all phone numbers harvested during a
        successful crawl
        """

    def all_urls(self):
        """
        returns the set of all urls traversed during a crawl
        """

    def output_results(self, filename):
        """
        In an easy-to-read format, writes the report of a successful crawl
        to the file specified by 'filename'.
        This includes the starting url, the set of urls traversed,
        all emails encountered, and the set of phone numbers (recorded in
        a standardized format of NPA-NXX-XXXX).
        """
  

Some implementation notes:



You can run your code from the command line using the following:

  $ python3 crawl.py http://www.cs.williams.edu report.txt
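
The command-line handling is not prescribed by the skeleton above, but a simple entry point along the following lines (added at the bottom of crawl.py and using sys.argv) would support that invocation; treat it as a sketch, not a requirement.

    import sys

    if __name__ == "__main__":
        # usage: python3 crawl.py <start_url> <output_file>
        start_url, report_file = sys.argv[1], sys.argv[2]
        crawler = WebCrawler(start_url)
        crawler.crawl()
        crawler.output_results(report_file)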
  

Step 5: Submission

  • Now commit those additions to the repository:
    $ git commit -a -m "replace with your log message"
  • Push your changes back to the GitHub repo:
    $ git push
    You will probably be asked to type $ git push --set-upstream origin crawl, which you should do. This pushes your crawl branch back up to the GitHub repo.
  • Now navigate to your GitHub repo using a web browser. You should see a list of recently pushed branches with a link to compare and open a pull request. Go ahead and issue a PR.