Beautiful Soup, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, beautiful Soup!
The BeautifulSoup library was named after a Lewis Carroll poem in Alice’s Adventures in Wonderland. In the story, this poem is sung by a character called the Mock Turtle (a pun on the popular Victorian dish Mock Turtle Soup made not of turtle but of cow).
Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.
Installing BeautifulSoup
BeautifulSoup is not a default Python library, so it must be installed. If you are an experienced Python developer, simply use your preferred installer; otherwise, follow the general instructions below.
To get started with BeautifulSoup, you’ll need to install the BeautifulSoup 4 (BS4) library. The official documentation and installation instructions are available at Crummy.com. If you’re familiar with ScraperWiz, a web data scraper desktop app, you may find BeautifulSoup similarly helpful for organizing web data.
Using pip to Install BeautifulSoup
Most Python developers use pip to install libraries. Check if pip is already installed by typing:
$ pip
If the command isn’t recognized, you may need to install pip using apt-get on Linux, brew on macOS, or by downloading get-pip.py, saving it, and running:
$ python get-pip.py
Once you have pip, use it to install BeautifulSoup:
$ pip install beautifulsoup4
If you have multiple Python versions installed, you might need to use:
$ pip3 install beautifulsoup4
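To confirm that the install landed in the interpreter you intend to use, a short check from Python can help. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and chosen for illustration, and it assumes Python 3.8 or later for importlib.metadata.

```python
from importlib import metadata, util


def installed_version(dist_name="beautifulsoup4", module_name="bs4"):
    """Return the installed version string, or None if the module is missing."""
    # find_spec looks the module up without actually importing it.
    if util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The module is importable but has no metadata under this name.
        return "unknown"


version = installed_version()
if version is None:
    print("bs4 is not installed for this interpreter")
else:
    print(f"bs4 is installed (beautifulsoup4 version: {version})")
```

If this reports that bs4 is missing even though pip succeeded, pip and python probably point at different Python installations.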
Keeping Libraries Organized with Virtual Environments
If you plan to work on multiple Python projects, virtual environments are essential for avoiding conflicts between libraries. To create a virtual environment, run:
$ virtualenv scrapingEnv
Activate it with:
$ cd scrapingEnv/
$ source bin/activate
Once activated, install BeautifulSoup:
(scrapingEnv) $ pip install beautifulsoup4
Leave the environment by using:
(scrapingEnv) $ deactivate
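If you are ever unsure whether an environment is currently active, you can ask the interpreter itself. This is a rough sketch rather than an official check; the function name in_virtual_environment is hypothetical, and it relies on the fact that sys.prefix differs from sys.base_prefix inside an environment (older virtualenv releases set sys.real_prefix instead).

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter is running inside a virtual environment."""
    # Older virtualenv releases mark the environment with sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtual_environment())
print("packages install under:", sys.prefix)
```

Run it once with the environment activated and once after deactivate to see the difference.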
Running BeautifulSoup
The most commonly used object in the BeautifulSoup library is the BeautifulSoup object. Here’s an example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)
This outputs:
<h1>An Interesting Title</h1>
BeautifulSoup makes it easy to navigate complex HTML structures. For instance, bs.h1, bs.body.h1, and bs.html.body.h1 all refer to the same h1 tag.
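The equivalence of these three paths can be checked without a network connection by parsing a small HTML string. The markup below is a hypothetical stand-in written for this sketch (only the h1 text is taken from the output shown earlier), so treat it as an illustration and not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; only the h1 text comes from the earlier output.
html = (
    "<html><head><title>A Useful Page</title></head>"
    "<body><h1>An Interesting Title</h1><div>Some body text.</div></body></html>"
)
bs = BeautifulSoup(html, "html.parser")

# Each path walks a different number of levels down the tree.
direct = bs.h1
via_body = bs.body.h1
via_html = bs.html.body.h1

print(direct)
print(direct == via_body == via_html)

# A tag that does not exist comes back as None instead of raising an error.
print(bs.h2)
```

Because a missing tag is returned as None, chaining through it (for example bs.h2.h1) raises an AttributeError, so it is worth checking for None before going deeper.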
Choosing a Parser
When creating a BeautifulSoup object, you specify the parser:
bs = BeautifulSoup(html.read(), 'html.parser')
Popular alternatives include lxml and html5lib. Install lxml with:
$ pip install lxml
And use it like this:
bs = BeautifulSoup(html.read(), 'lxml')
lxml can handle malformed HTML better than html.parser, while html5lib is even more forgiving but slower.
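Since lxml and html5lib are optional installs, a script that may run on different machines can fall back to html.parser when they are absent. The sketch below shows one possible approach using only the standard library; choose_parser is a hypothetical helper and not part of BeautifulSoup, and it assumes each parser name matches its importable module name (true for lxml and html5lib).

```python
from importlib import util


def choose_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser from `preferred`, else 'html.parser'."""
    for name in preferred:
        # find_spec returns None when the module cannot be found.
        if util.find_spec(name) is not None:
            return name
    # html.parser ships with Python, so it is always available.
    return "html.parser"


parser_name = choose_parser()
print("using parser:", parser_name)
```

The returned name can be passed as the second argument when creating the object, for example BeautifulSoup(html.read(), choose_parser()). Keep in mind that different parsers may build slightly different trees from the same malformed HTML, so pinning one parser is the safer choice when results must be reproducible.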
Final Thoughts
This introduction to BeautifulSoup highlights its utility in web scraping. With it, you can extract data from HTML or XML, provided there’s an identifying tag. More complex uses of BeautifulSoup, including regular expressions, are covered in later sections.