An Introduction to BeautifulSoup

Beautiful Soup, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, beautiful Soup!

The BeautifulSoup library was named after a Lewis Carroll poem in Alice’s Adventures in Wonderland. In the story, this poem is sung by a character called the Mock Turtle (a pun on the popular Victorian dish Mock Turtle Soup made not of turtle but of cow).

Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

Installing BeautifulSoup

BeautifulSoup is not part of the Python standard library, so it must be installed separately. Experienced Python developers can use their preferred installer; everyone else can follow the general instructions below.

To get started with BeautifulSoup, you’ll need to install the BeautifulSoup 4 (BS4) library. The official documentation and installation instructions are available at Crummy.com. If you’re familiar with ScraperWiz, a web data scraper desktop app, you may find BeautifulSoup similarly helpful for organizing web data.

Using pip to Install BeautifulSoup

Most Python developers use pip to install libraries. Check if pip is already installed by typing:

$ pip

If the command isn’t recognized, you may need to install pip using apt-get on Linux, brew on macOS, or by downloading get-pip.py, saving it, and running:

$ python get-pip.py

Once you have pip, use it to install BeautifulSoup:

$ pip install beautifulsoup4

If you have multiple Python versions installed, you might need to use:

$ pip3 install beautifulsoup4
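
One way to confirm that the installation worked is to ask Python whether the bs4 module can be found. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and is not part of BeautifulSoup or pip.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(module_name) is not None


# BeautifulSoup 4 is imported under the module name "bs4".
print("bs4 available:", is_installed("bs4"))
```

If this prints False, the library was probably installed for a different Python interpreter than the one running the script.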

Keeping Libraries Organized with Virtual Environments

If you plan to work on multiple Python projects, virtual environments help you avoid conflicts between the libraries each project needs. To create a virtual environment, run:

$ virtualenv scrapingEnv

Activate it with:

$ cd scrapingEnv/
$ source bin/activate

Once activated, install BeautifulSoup:

(scrapingEnv) $ pip install beautifulsoup4

Leave the environment by using:

(scrapingEnv) $ deactivate
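
If it is ever unclear whether a script is running inside an activated environment, Python itself can report it. This is a rough sketch, assuming a reasonably recent Python 3; the function name in_virtualenv is hypothetical, and the check on sys.real_prefix is only there for older virtualenv releases that set that attribute.

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix.
    if getattr(sys, "real_prefix", None) is not None:
        return True
    # Inside an environment, sys.prefix points at it, while
    # sys.base_prefix still points at the underlying installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtualenv())
print("Interpreter prefix:", sys.prefix)
```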

Running BeautifulSoup

The most commonly used object in the BeautifulSoup library is the BeautifulSoup object. Here’s an example:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page, then parse the raw HTML with Python's built-in parser
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')

# Print the first <h1> tag found in the document
print(bs.h1)

This outputs:

<h1>An Interesting Title</h1>

BeautifulSoup makes it easy to navigate complex HTML structures. For instance, bs.h1, bs.body.h1, and bs.html.body.h1 all refer to the same h1 tag.
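
The equivalence of these access paths can be checked without touching the network. The sketch below parses a small inline document; the HTML string is made up for illustration and only loosely resembles the page fetched above.

```python
from bs4 import BeautifulSoup

# A small, made-up document used in place of a downloaded page.
html_doc = """
<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Lorem ipsum dolor sit amet.</div>
</body>
</html>
"""

bs = BeautifulSoup(html_doc, "html.parser")

# Three ways of walking down the tree to the same tag.
first = bs.h1
via_body = bs.body.h1
via_html = bs.html.body.h1

print(first)
# All three names point at the same Tag object in the parsed tree.
print(first is via_body is via_html)
```

Each attribute lookup returns the first matching tag beneath the object it is called on, so the shorter forms are convenient when a page contains only one tag of that kind.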

Choosing a Parser

When creating a BeautifulSoup object, you specify the parser:

bs = BeautifulSoup(html.read(), 'html.parser')

Popular alternatives include lxml and html5lib. Install lxml with:

$ pip install lxml

And use it like this:

bs = BeautifulSoup(html.read(), 'lxml')

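lxml generally handles malformed HTML better than html.parser and is usually faster, but it must be installed separately. html5lib, which also requires a separate install, is even more forgiving because it parses pages the way a web browser does, at the cost of speed.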
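
To see how the parsers differ in practice, it can help to feed the same broken markup to each one and compare the repaired output. The following is an illustrative sketch: the HTML fragment and the helper parse_with are made up for this example, and lxml and html5lib are simply skipped if they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed: unclosed <li> tags and an unclosed <p>.
broken_html = "<ul><li>First<li>Second</ul><p>Unclosed paragraph"


def parse_with(parser_name):
    """Parse broken_html with the named parser, or return None if it is unavailable."""
    try:
        return BeautifulSoup(broken_html, parser_name)
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return None


for name in ("html.parser", "lxml", "html5lib"):
    soup = parse_with(name)
    if soup is None:
        print(f"{name}: not installed")
    else:
        # Show how this parser chose to repair the markup.
        print(f"{name}: {soup}")
```

Comparing the printed results side by side shows where each parser decided to close the open tags and whether it wrapped the fragment in html and body elements.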

Final Thoughts

This introduction to BeautifulSoup highlights its utility in web scraping. With it, you can extract data from HTML or XML, provided there’s an identifying tag. More complex uses of BeautifulSoup, including regular expressions, are covered in later sections.
