
Writing Your First Web Scraper

Once you start web scraping, you start to appreciate all the little things that browsers do for you. The web, without its layers of HTML formatting, CSS styling, JavaScript execution, and image rendering, can look a little intimidating at first. In this article, we’ll begin to look at how to format and interpret this bare data without the help of a web browser.

This article starts with the basics of sending a GET request (a request to fetch, or “get,” the content of a web page) to a web server for a specific page, reading the HTML output from that page, and doing some simple data extraction in order to isolate the content you are looking for.

Installing and Using Jupyter

The code for this tutorial can be found in the accompanying GitHub repository. In most cases, the code samples are Jupyter Notebook files, with an .ipynb extension.

If you haven’t used Jupyter Notebooks before, they are an excellent way to organize and work with small pieces of Python code. Each piece of code lives in a box called a “cell,” which you can run by pressing Shift + Enter, or by clicking the “Run” button at the top of the page.

To install Jupyter Notebooks, simply run:

$ pip install notebook

Once installed, you can start the web server by running:

$ jupyter notebook

This starts the server on port 8888 by default. If your browser doesn’t open automatically, copy the URL with the provided token from the terminal and paste it into your browser.

Making Your First Web Scraping Request

In this section, we’ll take a closer look at how the web works. When you type a URL into your browser, like google.com, your computer sends an HTTP request to a web server, and the server responds with an HTML file containing the page content.

A web browser isn’t required for any of this, however. You can make the same request yourself using Python in just a few lines of code:

from urllib.request import urlopen

# urlopen fetches the page and returns a file-like response object
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

Running this will output the full HTML of the page, without any of the styling, images, or other elements that the browser normally handles.
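Note that read() returns the response body as raw bytes, which is why the printed output starts with a b' prefix. If you’d rather work with an ordinary string, you can decode the bytes yourself. A minimal sketch, assuming the page is UTF-8 encoded:

from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
# read() returns bytes; decode them to get a regular Python string
# (UTF-8 is an assumption here -- check the page's actual charset)
text = html.read().decode('utf-8')
print(text)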

Introducing BeautifulSoup

BeautifulSoup is a Python library that makes it easy to scrape data from web pages. It parses the HTML and presents it as a Python object that is easy to navigate and search.

To install BeautifulSoup, you can use the following command:

$ pip install beautifulsoup4

Once installed, you can use it like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
# Parse the raw HTML into a searchable BeautifulSoup object
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

This will print the first <h1> tag found on the page. BeautifulSoup makes it easy to extract specific parts of the page based on HTML tags and attributes.
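Beyond direct tag access like bs.h1, BeautifulSoup’s find() and find_all() let you search by tag name and attributes. Here’s a quick sketch of both; the sample page may not actually contain these tags, in which case find() simply returns None and find_all() returns an empty list:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')

# find() returns the first matching tag (or None);
# find_all() returns a list of every match
first_div = bs.find('div')
print(first_div)

for link in bs.find_all('a'):
    # .get() safely retrieves an attribute, returning None if it's absent
    print(link.get('href'))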

Scraping with ScraperWiz

If you’re looking for a user-friendly way to automate your web scraping tasks, check out ScraperWiz, a desktop app that simplifies gathering data from websites without writing complex code. Whether you’re a beginner or an experienced programmer, ScraperWiz can help you collect data with minimal effort.

Handling Exceptions in Web Scraping

Web scraping is not without its challenges. Sometimes, a webpage might not be found, or the server might be down. You can handle these errors gracefully in your code using exception handling.

Here’s an example of how to handle HTTP errors:

from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
    # Handle the error (e.g., return None, break out of a loop, etc.)
else:
    print(html.read())
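
Putting the pieces together, a common pattern is to wrap the request and the parsing in a single function that returns None whenever anything goes wrong. Here’s a minimal sketch (the get_title name is just illustrative; note that URLError covers the case where the server is down or unreachable, which HTTPError does not):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def get_title(url):
    """Return the page's first <h1> tag, or None if it can't be retrieved."""
    try:
        html = urlopen(url)
    except (HTTPError, URLError):
        # Page not found, server error, or server unreachable
        return None
    bs = BeautifulSoup(html.read(), 'html.parser')
    return bs.h1  # None if the page has no <h1> tag

title = get_title('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)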

Conclusion

Learning how to scrape data from websites is a valuable skill, whether you’re extracting content for research, business intelligence, or personal projects. With Python libraries like urllib and BeautifulSoup, you can easily send requests, parse HTML, and extract the data you need.

For an even more efficient web scraping experience, consider using ScraperWiz, a desktop app designed to make web data extraction easier for everyone.
