Scraping Dynamic Websites with Webdriver and Python

Scraping dynamic web pages can mean wading through multiple XHR requests, juggling cookies, and parsing responses from various scripts. Webdriver streamlines this process and makes web scraping dynamic pages nothing short of delightful!

Web scraping is the practice of programmatically extracting data from web pages. Python is an essential tool for this work and has an ecosystem rich with scraping-oriented libraries. However, many of those libraries fall short when it comes to scraping dynamic pages.

Dynamic pages often require parsing scripts, authenticating, or otherwise interacting with a webpage to reveal the desired content. Simple HTTP request libraries like requests offer no easy answers for these pages. Fortunately, Selenium’s Webdriver provides a robust solution for scraping dynamic content!

Introduction

Selenium is an ecosystem of software designed to make software testing more seamless. Arguably the most popular library in that ecosystem is webdriver, an automated browser tool that lets developers program user interactions for regression testing. In the hands of a data scientist, however, it becomes a robust tool for extracting data from web pages.

There are plenty of “how to scrape with Webdriver” tutorials out there—this isn’t going to be another one of those. Rather, this guide will cover how to use seleniumwire and webdriver_manager along with webdriver to create a more seamless and environment-agnostic tool. First, let’s go over the common gotchas of webdriver to better understand why we need these tools in the first place.

Webdriver Common Gotchas

Webdriver is a browser simulation tool: it can be instructed to use Chrome, Firefox, or a host of other common browsers. Webdriver provides APIs for developers to issue commands that allow the parsing of, loading of, and interaction with dynamic content. It’s not a web-scraping tool in and of itself, however, so we’ll need to get some other components set up as well.

Incorrect Driver Version

Webdriver relies on a separate driver executable that matches the browser being simulated. For this guide, we’ll be using the ChromeDriver executable, which can be downloaded from the official ChromeDriver distribution page. After downloading the executable to a local directory, a new webdriver instance can be created as such:

from selenium import webdriver

# Create new Driver
driver = webdriver.Chrome('./chromedriver.exe')

# Get a webpage
driver.get('https://www.alpharithms.com')

Depending on which version of Chrome you have installed on your local machine, you might see this error:

selenium.common.exceptions.SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 95
Current browser version is 94.0.4606.81 with binary path C:\Program Files\Google\Chrome\Application\chrome.exe

The easiest way around this is to return to the ChromeDriver downloads page and get the version that matches the major release installed on your local machine. However, this becomes quite brittle when distributing across various environments. Fortunately, the webdriver_manager library exists and can lend us a hand. This library ensures the correct webdriver version is fetched automatically and is implemented as such:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager


# Create new Driver
driver = webdriver.Chrome(ChromeDriverManager().install())

# Get a webpage
driver.get('https://www.alpharithms.com')


# print out confirmation
print(driver.title)
>>> αlphαrithms - Staying Ahead of the Curve

When running the code above, you’ll note that it successfully launches a new webdriver instance (first downloading the required executable if necessary) and then prints out the page title to confirm the page loaded successfully.

The webdriver_manager library has a robust caching feature that avoids re-downloading any executable it has already fetched.

TL;DR – the first time you run a script may take a few seconds but the following iterations will be faster.

Accessing HTTP Response Data

Web scraping is as much of an art as it is a science—doubly so for dynamic pages. Each site presents data with a unique structure and oftentimes developers find themselves having to wade through tricky code to get to the data they are after.

As such, it proves beneficial to have access to as much data as possible, including status codes, request and response headers, and cookies. Libraries like requests make this data easily accessible, but the closest one can hope for with the vanilla webdriver class is the page_source attribute. Fortunately, the selenium-wire library is here to help:

from webdriver_manager.chrome import ChromeDriverManager
from seleniumwire import webdriver

# Create a selenium-wire webdriver instance
driver = webdriver.Chrome(ChromeDriverManager().install())

# Make A GET request
driver.get('https://www.alpharithms.com')

# Print some underlying HTTP request data
print(driver.requests[0].headers, driver.requests[0].response)

>>>
content-length: 1
origin: https://www.google.com
content-type: application/x-www-form-urlencoded
sec-fetch-site: none
sec-fetch-mode: no-cors
sec-fetch-dest: empty
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9

 200

Here we see all kinds of useful information! There are plenty of other methods available via the selenium-wire library. Seleniumwire makes more advanced HTTP requests simple but can also come with a few issues one might need to iron out. Mostly, these are permission-based, Windows-centric issues (no surprise there).
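Since driver.requests is just a list of captured request objects (each exposing url and response attributes), a small helper can narrow it down to the traffic you care about. The sketch below is illustrative: the stand-in classes merely mimic selenium-wire's request objects so the filtering logic can be demonstrated without launching a browser.

```python
def filter_requests(requests, url_part="", status=None):
    """Return captured requests whose URL contains url_part and, optionally,
    whose response carries the given status code."""
    matches = []
    for req in requests:
        if url_part not in req.url:
            continue
        if status is not None:
            # Requests may not have completed yet, so response can be None
            if req.response is None or req.response.status_code != status:
                continue
        matches.append(req)
    return matches


# Minimal stand-ins mimicking selenium-wire's request/response objects
class _Response:
    def __init__(self, status_code):
        self.status_code = status_code

class _Request:
    def __init__(self, url, status_code=None):
        self.url = url
        self.response = _Response(status_code) if status_code else None

captured = [
    _Request("https://www.alpharithms.com/", 200),
    _Request("https://www.alpharithms.com/wp-json/posts", 200),
    _Request("https://cdn.example.com/script.js"),  # no response yet
]

print(len(filter_requests(captured, url_part="alpharithms", status=200)))  # 2
```

With a real selenium-wire session, pass driver.requests in place of the stub list.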

PermissionError

If your project is being executed from a directory that requires admin privileges you may receive the following error:

PermissionError: [WinError 5] Access is denied: 'C:\\Users\\<USERNAME>\\AppData\\Local\\Temp\\.seleniumwire\\storage-eaf61cf8-3a4e-41b0-a545-b3d54b417974'

There are two options to deal with this:

  1. Move the project to a different directory
  2. Launch the terminal/IDE with admin privileges

This is mostly a clerical issue: Windows simply needs to exclude your project directory from the firewall. If you launch an IDE like PyCharm in administrator mode and re-run the webdriver_manager script, you will see the following prompt:

Click the “exclude directories” to remove the error on Windows systems.
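A third option worth considering: selenium-wire documents a request_storage_base_dir setting that relocates the temporary storage shown in the error. Pointing it at a directory you know is writable sidesteps the permissions issue entirely. Verify the option name against your installed selenium-wire version; the sketch below only builds the options dictionary.

```python
import tempfile

# Relocate seleniumwire's temporary request storage to a writable directory.
# request_storage_base_dir is documented by selenium-wire, but confirm it
# exists in the version you have installed.
storage_dir = tempfile.mkdtemp(prefix="sw-storage-")

sw_options = {
    "request_storage_base_dir": storage_dir,
}

# The driver would then be created as usual:
# driver = webdriver.Chrome(ChromeDriverManager().install(),
#                           seleniumwire_options=sw_options)
print(sw_options["request_storage_base_dir"])
```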

Headless vs. Full Browser

Selenium’s webdriver is a full-fledged web browser. When running webdriver the first thing most developers notice is the launch of another window on their local machine. When a new webdriver instance is created, it’s the equivalent of double-clicking an icon on one’s desktop and launching an application.

Depending on preference, this might be unwanted behavior. It can be avoided by instructing webdriver to run in headless mode. Each browser requires a slightly different syntax to configure headless browsing, but each is relatively simple. Below is some example code instructing webdriver to run Chrome in headless mode:

# Import required classes
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Get regular chrome driver options
options = Options()
options.add_argument("--headless")

# Instantiate webdriver with --headless options enabled.
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

Back in the day, one had to download PhantomJS to integrate headless browsing. Today, it’s as easy as adding in a few lines of code!
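Since the headless configuration is ultimately just a handful of string flags, it can help to keep them in one reusable function. The extra flags below are common companions to --headless rather than requirements; treat this as an illustrative starting point:

```python
def headless_chrome_args(width=1920, height=1080):
    """Return Chrome flags commonly paired with headless scraping."""
    return [
        "--headless",                       # run without a visible window
        "--disable-gpu",                    # historically needed on Windows
        f"--window-size={width},{height}",  # give pages a realistic viewport
    ]

# Feeding the flags into an Options object (selenium imports omitted here):
# options = Options()
# for arg in headless_chrome_args():
#     options.add_argument(arg)
print(headless_chrome_args()[0])  # --headless
```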

Configuring Webdriver Proxies

Web scraping often leads developers to recognize the need for web proxies. These are software solutions that act as intermediaries between clients and the servers they communicate with. Essentially, a proxy is a server that makes a request to another server on behalf of a client, allowing clients to make requests without revealing their identity.

In the context of web scraping, this can help avoid geographic firewalls, rate limiting, and IP-based restrictions. Configuring proxies with webdriver is simple and can be done as such:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Define a proxy address
PROXY = "11.23.58.13:2134"

# Create the options object
options = Options()

# Add the proxy server
options.add_argument(f"--proxy-server={PROXY}")

# Create new webdriver instance
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

This works great for public proxies in the host:port format. For those familiar with such public proxies, the performance of these servers is often abysmal: they are frequently blacklisted, congested, or limited in bandwidth. Most web scraping projects, even at the hobbyist level, stand to benefit from more premium proxies. Such proxy use will, in most cases, require authentication. This is where webdriver comes up short.
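Because Chrome accepts the proxy as a plain host:port string, it is worth validating the address before handing it to the browser, where a malformed value tends to surface only as failed page loads. A minimal sketch:

```python
def proxy_server_arg(proxy):
    """Build Chrome's --proxy-server flag from a host:port string,
    raising early on an obviously malformed address."""
    host, sep, port = proxy.partition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {proxy!r}")
    return f"--proxy-server={host}:{port}"

print(proxy_server_arg("11.23.58.13:2134"))  # --proxy-server=11.23.58.13:2134
```

The returned string can then be passed straight to options.add_argument().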

Authenticated Proxies with Webdriver

Webdriver doesn’t provide an API to allow authenticated proxy specification by default. There are some common workarounds with varying degrees of support/complexity/effectiveness. Here are some good options:

  1. Creating a custom config file.
  2. Using a browser extension (configuring on each launch)
  3. Authenticating via User/Password Dialog prompt on launch

Each of these solutions gets the job done. However, each is either overly complex, not compatible across different browsers, or lacking support for certain requirements like headless mode. Fortunately, the authors of selenium-wire have again come up with an excellent solution shown in the following code:

from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager


# Create options via seleniumwire for authenticated proxies
sw_options = {
    "proxy": {
        'http': "user:password@host:port",
        'https': "user:password@host:port",
        'no_proxy': 'localhost,127.0.0.1'
    },
}

# Create the driver, passing the proxy options to seleniumwire
driver = webdriver.Chrome(ChromeDriverManager().install(), seleniumwire_options=sw_options)

This code still uses the webdriver-manager library to instantiate a new webdriver object. This time, however, we create a dictionary options object to pass along to our webdriver imported from seleniumwire. Again, seleniumwire proves its merit.
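Since the seleniumwire_options dictionary is plain Python, the credentials can be assembled programmatically rather than hard-coded. The function below is an illustrative sketch using the same user:password@host:port format shown above; all example values are placeholders.

```python
def seleniumwire_proxy_options(user, password, host, port):
    """Assemble the seleniumwire_options dict for an authenticated proxy.
    All arguments are placeholders to be replaced with real credentials."""
    upstream = f"{user}:{password}@{host}:{port}"
    return {
        "proxy": {
            "http": upstream,
            "https": upstream,
            "no_proxy": "localhost,127.0.0.1",
        },
    }

opts = seleniumwire_proxy_options("user", "password", "proxy.example.com", 8080)
print(opts["proxy"]["https"])  # user:password@proxy.example.com:8080
```

Keeping the credentials out of source control (e.g., reading them from environment variables) pairs naturally with this approach.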

Service Objects

Depending on your version of Selenium, you may get a deprecation warning that looks like this:

DeprecationWarning: executable_path has been deprecated, please pass in a Service object
driver = webdriver.Chrome(ChromeDriverManager().install())

To be clear—this is just a warning and won’t prevent webdriver from launching. To get around this warning one need only implement the following Service object workflow:

# Import service
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


# Create a service object (still using webdriver-manager)
service = Service(ChromeDriverManager().install())

# Create new driver instance
driver = webdriver.Chrome(service=service)

With this approach, we will be ready for the future of webdriver best practices and rid of that pesky warning. Otherwise, not much has changed.

Putting it all Together

We’ve covered a lot of ground in a short time here. We have leveraged webdriver, seleniumwire, and webdriver-manager to accomplish the following:

  1. Easy webdriver executable configuration
  2. Headless browsing
  3. Requests data access via seleniumwire
  4. Authenticated proxy configuration via seleniumwire.

These four approaches allow for the robust use of webdriver when scraping dynamic pages. The following code puts everything together, leaving us with a new webdriver instance in headless mode, with accessible lower-level HTTP data and authenticated proxy integration (replace the proxy values with your own server and credentials):

from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service 

# Create an options object for Chrome
options = Options()
options.add_argument("--headless")

# Create options via seleniumwire for authenticated proxies
sw_options = {
    "proxy": {
        'http': "user:password@host:port",      # customize this
        'https': "user:password@host:port",     # customize this
        'no_proxy': 'localhost,127.0.0.1'
    },
}

# Create service instance
service = Service(ChromeDriverManager().install())

# Create the driver with all options applied
driver = webdriver.Chrome(
    service=service,
    options=options,
    seleniumwire_options=sw_options
)

Final Thoughts

Webdriver is an incredible tool for automating browser-based testing. It has also found a home among web scraping developers as a powerful solution for dealing with troublesome dynamic pages. With its friendly APIs, however, come some common gotchas. In addition to those discussed here, the official webdriver documentation has a Worst Practices page that should be essential reading for all who use webdriver.

Zαck West
Entrepreneur, programmer, designer, and lifelong learner. Can be found taking notes from Mother Nature when not hammering away at the keyboard.