Discover how web scraping works, from basic requests to data extraction.
Appreciating the Little Things in Web Scraping
Once you start web scraping, you start to appreciate all the little things that browsers do for you. The web, without its layers of HTML formatting, CSS styling, JavaScript execution, and image rendering, can look a little intimidating at first. In this post, we’ll begin to look at how to format and interpret this bare data without the help of a web browser.
This post starts with the basics of sending a GET request (a request to fetch, or “get,” the content of a web page) to a web server for a specific page, reading the HTML output from that page, and doing some simple data extraction to isolate the content you are looking for.
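The three steps just described (send a GET request, read the HTML, isolate one piece of content) can be sketched with nothing but Python's standard library. This is a minimal illustration, not the course's own code: the helper names `fetch_html`, `TitleParser`, and `extract_title` are hypothetical, and the URL in the usage comment is only a placeholder.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Send a GET request to url and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)


def extract_title(html):
    """Return the page title found in html, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


# Hypothetical usage (requires network access; the URL is a placeholder):
# print(extract_title(fetch_html("https://example.com")))
```

Keeping the request and the extraction in separate functions means the extraction step can be tried on any HTML string without touching the network.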
Installing and Using Jupyter
If you’re new to Python or web scraping, you might find Jupyter Notebooks to be an excellent tool. The code for this course can be found at https://github.com/REMitchell/python-scraping. In most cases, the code samples are in the form of Jupyter Notebook files with an .ipynb extension.
If you haven’t used Jupyter Notebooks already, they are a great way to organize and work with many small but related pieces of Python code. Each piece of code is contained in a box called a cell. The code within each cell can be run by pressing Shift + Enter, or by clicking the Run button at the top of the page.
What is Jupyter?
Project Jupyter began in 2014 as a spin-off of the IPython (Interactive Python) project. Its notebooks were designed to run Python code in the browser in an accessible and interactive way that would lend itself to teaching and presenting.
Installing Jupyter Notebooks
To install Jupyter Notebooks, you can use the following command:
Install with pip:
$ pip install notebook
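If you want to confirm from Python that the install succeeded, one possible check is to look up the installed package version. This is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the notebook package is installed, otherwise None.
print(installed_version("notebook"))
```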
After installation, you should have access to the jupyter command, which will allow you to start the web server. Navigate to the directory containing the downloaded exercise files for this course, and run:
Start the Jupyter server:
$ jupyter notebook
This will start the web server, by default on port 8888. If you have a web browser running, a new tab should open automatically. If it doesn’t, copy the URL shown in the terminal, including the provided token, into your web browser.
