import requests
from bs4 import BeautifulSoup

Fun Stuff: Python & Webpages

You might find it useful to pull data from the web…

Try This

  1. Right-click a webpage

  2. Select “View Page Source”

  3. Copy/Paste/Save the text with a .html file extension in a text editor (like VS Code).

For this notebook, we’re going to use index.html that’s in our directory. The text from this file looks as follows:

<html>
  <head>
    <title>CS134 Simple Page</title>
  </head>

  <body>
    Hello CS 134!  This is a simple web page.

    <br><br>
    Looking for Jeannie?  Click <a href="http://www.cs.williams.edu/~jeannie">here</a>.

    <br><br>
    Looking for Iris?  Click <a href="http://www.cs.williams.edu/~iris">here</a>.

    <br><br>
    Looking for Pixel?  Click <a href="https://www.cs.williams.edu/~iris/website/img/HAILab.jpg">
    here</a>.

    <br><br>
    Here are some images.
    <br><br>
    <img src="http://sysnet.cs.williams.edu/Williams-Logo.jpg" alt="purple cow">
    <br><br>
    <img src="http://sysnet.cs.williams.edu/williams.gif" alt="seal">
    <br><br>
    <img src="http://sysnet.cs.williams.edu/reading-cow.jpg" alt="reading cow">

  </body>

</html>

HTML

This text is written in a computer language called HyperText Markup Language, or HTML. It uses tags to tell your web browser how to display text. HTML is a markup language, not a programming language!

Just like in Python, we can modify our file and then view how the output differs. In this example, we add <font color="blue"> to the beginning and </font> to the end of the first line after the <body> tag. To view the result of these changes, we have to open the file in a web browser.

<html>
  <head>
    <title>CS134 Simple Page</title>
  </head>

  <body>
    <font color="blue">Hello CS 134!  This is a simple web page.</font>

    <br><br>
    Looking for Jeannie?  Click <a href="http://www.cs.williams.edu/~jeannie">here</a>.

    <br><br>
    Looking for Iris?  Click <a href="http://www.cs.williams.edu/~iris">here</a>.

    <br><br>
    Looking for Pixel?  Click <a href="https://www.cs.williams.edu/~iris/website/img/HAILab.jpg">
    here</a>.

    <br><br>
    Here are some images.
    <br><br>
    <img src="http://sysnet.cs.williams.edu/Williams-Logo.jpg" alt="purple cow">
    <br><br>
    <img src="http://sysnet.cs.williams.edu/williams.gif" alt="seal">
    <br><br>
    <img src="http://sysnet.cs.williams.edu/reading-cow.jpg" alt="reading cow">

  </body>

</html>

Notice that the <font color="blue"> tag has made the text between the opening and closing tags blue. We can change blue to another color, or replace color="blue" with size="+4", to see how else we can change the webpage. Each time you edit the .html file, you’ll need to refresh the page in your browser to see the changes.

HTML Tags

There are many other HTML tags that are useful for formatting text!

  • <h1>Text goes here</h1> – Makes a level-1 heading

  • Guess: there’s also an <h2></h2>, and <h3></h3>, and …

  • <b>Text goes here</b> – Makes the text bold (also <strong>)

  • <i>Text goes here</i> – Makes the text italic (also <em>)

  • <a href="http://url-here.edu">Link Text here</a> – Makes a hyperlink

  • <font face="courier">Text goes here</font> – Changes the font

  • <font size="+2">Text goes here</font> – Changes font size

  • <font color="green">Text goes here</font> – Changes font color

  • <p>Text goes here</p> – Paragraph definition (~2 newlines)

  • <br> – Line break (~1 newline)

We can also make lists with HTML tags:

  • <ol> – Begins numbered list (i.e., ordered list)

    • <li>Text goes here</li>

    • <li>Another numbered item</li>

  • </ol> – Ends numbered list

  • <ul> – Begins bulleted list (i.e., unordered list)

    • <li>Text goes here</li>

    • <li>Another bulleted item</li>

    • <li>Yet another bulleted item</li>

  • </ul> – Ends bulleted list

Or even tables:

  • <table> – Begins the table

    • <tr> – Begins a row

      • <td>Text in cell 1</td> – Adds a cell (column) within the row

      • <td>Text in cell 2</td> – Adds a cell (column) within the row

      • <td>Text in cell 3</td> – Adds a cell (column) within the row

    • </tr> – Ends a row

    • <tr> – Begins 2nd row

      • <td>Text in cell 4</td> – Adds a cell (column) within 2nd row

    • </tr> – Ends 2nd row

  • </table> – Ends the table
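
Because the table tags always nest in the same pattern (table, then rows, then cells), they are easy to generate from Python. The following is a minimal sketch, not part of the original notebook; the function name make_table is made up for illustration.

```python
from html import escape

def make_table(rows):
    """Build an HTML table string from a list of rows (each row is a list of cell values)."""
    lines = ["<table>"]
    for row in rows:
        lines.append("  <tr>")
        for cell in row:
            # escape() keeps characters like < and & from being read as markup
            lines.append(f"    <td>{escape(str(cell))}</td>")
        lines.append("  </tr>")
    lines.append("</table>")
    return "\n".join(lines)

# Same layout as the bullets above: three cells in row 1, one cell in row 2
table_html = make_table([
    ["Text in cell 1", "Text in cell 2", "Text in cell 3"],
    ["Text in cell 4"],
])
print(table_html)
```

Pasting the printed text between the <body> tags of an .html file and refreshing the browser should show the table.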

HTML File Structure

Well-formed HTML files have a few additional structural requirements that are handy to be aware of:

  • <html> – Begins the HTML file (everything else goes inside this tag)

  • <head> Text & Tags in here are part of the header </head>

  • <title> This title appears in the web browser </title>

  • <body> Text & Tags in here are part of the body text </body>

  • </html> – Ends HTML file
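
To see how these pieces fit together, here is a small sketch that assembles a complete page as a Python string. It is only an illustration under our own assumptions: the helper name make_page is hypothetical, and it assumes the title and body are plain text that you trust.

```python
def make_page(title, body):
    """Wrap a title and some body text in the standard HTML file structure."""
    return (
        "<html>\n"
        "  <head>\n"
        f"    <title>{title}</title>\n"   # the title sits inside the head
        "  </head>\n"
        "  <body>\n"
        f"    {body}\n"                   # visible page content goes in the body
        "  </body>\n"
        "</html>\n"
    )

page = make_page("CS134 Simple Page", "Hello CS 134!  This is a simple web page.")
print(page)
```

Saving the printed text in a file with a .html extension and opening it in a browser should show the body text, with the title in the browser tab.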

Pulling Source Code from Web Pages

At the beginning of this notebook, we stepped you through copying a page’s source code by hand, but we can also do this through Python! You’ll need to pip install requests. The following bit of code requests our simple page from the course website, and r.text gives us its source as one long string:

import requests
r = requests.get('http://www.cs.williams.edu/~cs134/basic.html')
r.text
'<html>\n  <head>\n    <title>CS134 Simple Page</title>\n  </head>\n\n  <body>\n    Hello CS 134!  This is a simple web page.\n\n    <br><br>\n    Looking for Jeannie?  Click <a href="http://www.cs.williams.edu/~jeannie">here</a>.\n\n    <br><br>\n    Looking for Iris?  Click <a href="http://www.cs.williams.edu/~iris">here</a>.\n\n    <br><br>\n    Looking for Pixel?  Click <a href="https://www.cs.williams.edu/~iris/website/img/HAILab.jpg">here</a>.\n\n    <br><br>\n    Here are some images.\n\n    <br><br>\n    <img src="http://sysnet.cs.williams.edu/Williams-Logo.jpg" alt="purple cow">\n\n    <br><br>\n    <img src="http://sysnet.cs.williams.edu/williams.gif" alt="seal">\n\n    <br><br>\n    <img src="http://sysnet.cs.williams.edu/reading-cow.jpg" alt="reading cow">\n\n  </body>\n\n</html>\n    \n\n    \n'

But this string isn’t easy to read! If we want to parse the HTML text from a string, we’ll need to pip install beautifulsoup4 and use the bs4 module as below:

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())
<html>
 <head>
  <title>
   CS134 Simple Page
  </title>
 </head>
 <body>
  Hello CS 134!  This is a simple web page.
  <br/>
  <br/>
  Looking for Jeannie?  Click
  <a href="http://www.cs.williams.edu/~jeannie">
   here
  </a>
  .
  <br/>
  <br/>
  Looking for Iris?  Click
  <a href="http://www.cs.williams.edu/~iris">
   here
  </a>
  .
  <br/>
  <br/>
  Looking for Pixel?  Click
  <a href="https://www.cs.williams.edu/~iris/website/img/HAILab.jpg">
   here
  </a>
  .
  <br/>
  <br/>
  Here are some images.
  <br/>
  <br/>
  <img alt="purple cow" src="http://sysnet.cs.williams.edu/Williams-Logo.jpg"/>
  <br/>
  <br/>
  <img alt="seal" src="http://sysnet.cs.williams.edu/williams.gif"/>
  <br/>
  <br/>
  <img alt="reading cow" src="http://sysnet.cs.williams.edu/reading-cow.jpg"/>
 </body>
</html>

What did the .prettify() method do?

Here are some other useful attributes and methods of the soup object from the bs4 module:

# What does this do?
print(soup.title)
<title>CS134 Simple Page</title>
# What does this do?
soup.title.name
'title'
# What does this do?
soup.title.string
'CS134 Simple Page'
# What does this do?
soup.title.parent.name
'head'
# What does this do?
soup.img
<img alt="purple cow" src="http://sysnet.cs.williams.edu/Williams-Logo.jpg"/>
# What does this do?
soup.a
<a href="http://www.cs.williams.edu/~jeannie">here</a>
# What does this do?
soup.find_all('a')
[<a href="http://www.cs.williams.edu/~jeannie">here</a>,
 <a href="http://www.cs.williams.edu/~iris">here</a>,
 <a href="https://www.cs.williams.edu/~iris/website/img/HAILab.jpg">here</a>]
# What does this do?
for link in soup.find_all('a'): 
    print(link.get("href"))
http://www.cs.williams.edu/~jeannie
http://www.cs.williams.edu/~iris
https://www.cs.williams.edu/~iris/website/img/HAILab.jpg
# What does this do?
for link in soup.find_all('a'): 
    print(link.get_text())
here
here
here
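
If you are curious what a parser has to do to find those links, the sketch below does a similar job with Python’s built-in html.parser module instead of bs4, so it runs without installing anything. This is our own illustration, not how bs4 is implemented; the class name LinkCollector and the shortened sample page are made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from the <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside, else None
        self._text = []     # pieces of text seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# A shortened version of the page used in this notebook
sample = '''
<body>
  Looking for Jeannie?  Click <a href="http://www.cs.williams.edu/~jeannie">here</a>.
  <br><br>
  Looking for Iris?  Click <a href="http://www.cs.williams.edu/~iris">here</a>.
</body>
'''

collector = LinkCollector()
collector.feed(sample)
for href, text in collector.links:
    print(href, text)
```

Compare how much shorter the bs4 version is: find_all('a') together with get("href") and get_text() gives the same information in two lines.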

beautifulsoup4 Documentation

There’s a lot more the beautifulsoup4 module can do, but you’ll have to go to the documentation to find out: https://beautiful-soup-4.readthedocs.io/en/latest/

Python has lots more accessible modules that do other fun things, like play or create music, process images, generate text, and perform statistical operations, among many others! But you’ll have to consult the online documentation to find out!

Why Scrape HTML Data?!

Can you think of a reason why scraping HTML data might be useful?

Maybe you’re building a web crawler, documenting all the webpages on the Internet so their text can be searchable

Maybe you’re a sports recruiter and you need to pull wins/losses data from the local amateur leagues

Maybe you’re a designer building software that will make stock market transactions based on the weather. You’ll need local weather data, and stock market performance…

Maybe you’re a PR firm tracking in vivo mentions of particular products or brands

Maybe you’re a humanitarian gathering evidence on organized crime groups at the recruitment stage of human trafficking

Maybe you’re an Artificial Intelligence researcher gathering data on paint color names so that you can create an AI model to generate new paint color names

Take-away

Python is a powerful tool that:

  • processes, manipulates, organizes data

  • accesses data

  • creates beautiful things: art, solutions, puzzles, …

  • expands human capabilities

But also! It communicates complex computational ideas!