Web scraping is the practice of programmatically extracting data from web pages. Python is an essential tool for this practice, with an ecosystem rich in web scraping-oriented libraries. However, many of those libraries fall short when it comes to scraping dynamic pages.
Dynamic pages often require parsing scripts, authenticating, or otherwise interacting with a webpage to reveal the desired content. Simple HTTP request libraries like requests don't provide easy solutions for these pages. Fortunately, Selenium's Webdriver provides a robust solution for scraping dynamic content!
Introduction
Selenium is an ecosystem of software designed to make software testing more seamless. Arguably the most popular library in the Selenium ecosystem is webdriver, an automated browser tool that lets developers program user interactions for regression testing. In the hands of a data scientist, however, it can be a robust tool for extracting data from web pages.
There are plenty of “how to scrape with Webdriver” tutorials out there; this isn't going to be another one of those. Rather, this guide covers how to use seleniumwire and webdriver_manager along with webdriver to create a more seamless, environment-agnostic tool. First, let's go over the common gotchas of webdriver to better understand why we need these tools in the first place.
Webdriver Common Gotchas
Webdriver is a browser simulation tool: it can be instructed to use Chrome, Firefox, or a host of other common browsers. Webdriver provides APIs for developers to issue commands that load pages, parse content, and interact with dynamic elements. It is not a web-scraping tool in and of itself, however, so we'll need to set up some other components as well.
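To give a taste of those APIs before digging into setup, here is a minimal sketch that loads a page and extracts text from an element. The URL and the h1 selector are illustrative assumptions, and the bare webdriver.Chrome() call presumes a chromedriver executable is already available, which is exactly the gotcha covered next:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a browser; assumes chromedriver is already on the PATH (see below)
driver = webdriver.Chrome()

# Wait up to 10 seconds for elements to appear as dynamic content loads
driver.implicitly_wait(10)

# Load a page and extract the text of its first <h1> element
driver.get('https://www.alpharithms.com')
heading = driver.find_element(By.CSS_SELECTOR, 'h1')
print(heading.text)

driver.quit()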
Incorrect Driver Version
Webdriver relies on a browser-specific driver executable to control the browser being simulated. For this guide, we'll be using the ChromeDriver executable, which can be downloaded from the official ChromeDriver distribution page. After downloading the executable to a local directory, a new webdriver instance can be created as such:
from selenium import webdriver

# Create new Driver
driver = webdriver.Chrome('./chromedriver.exe')

# Get a webpage
driver.get('https://www.alpharithms.com')
Depending on which version of Chrome you have installed on your local machine, you might see this error:
selenium.common.exceptions.SessionNotCreatedException: Message: session not created:
This version of ChromeDriver only supports Chrome version 95
Current browser version is 94.0.4606.81 with binary path C:\Program Files\Google\Chrome\Application\chrome.exe
The easiest way around this is to return to the ChromeDriver downloads page and get the version that matches the major Chrome release installed on your machine. However, that approach becomes quite brittle when distributing across various environments. Fortunately, the webdriver_manager library exists and can lend us a hand. It ensures the correct webdriver version is used, and is implemented as such:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Create new Driver
driver = webdriver.Chrome(ChromeDriverManager().install())

# Get a webpage
driver.get('https://www.alpharithms.com')

# Print out confirmation
print(driver.title)

>>> αlphαrithms - Staying Ahead of the Curve
Running this code, you'll note that it successfully launches a new webdriver instance (first downloading the specified executable if necessary) and then prints out the page title to confirm the page loaded successfully.
The webdriver_manager library has a robust caching feature that avoids re-downloading any executable it detects as having already been downloaded. TL;DR: the first time you run a script may take a few seconds, but subsequent runs will be faster.
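To see the cache at work, here is a quick sketch (assuming only the standard library and webdriver_manager) that times two consecutive install() calls:

import time
from webdriver_manager.chrome import ChromeDriverManager

# First call may download the executable for your installed Chrome version
start = time.perf_counter()
path = ChromeDriverManager().install()
print(f"first install(): {time.perf_counter() - start:.2f}s -> {path}")

# Second call should resolve from the local cache almost instantly
start = time.perf_counter()
path = ChromeDriverManager().install()
print(f"second install(): {time.perf_counter() - start:.2f}s -> {path}")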
Accessing HTTP Response Data
Web scraping is as much of an art as it is a science—doubly so for dynamic pages. Each site presents data with a unique structure and oftentimes developers find themselves having to wade through tricky code to get to the data they are after.
As such, it proves beneficial to have access to as much data as possible, including status codes, request and response headers, and cookies. Libraries like requests make this data easily accessible, but the closest one can hope for with the vanilla webdriver class is the page_source attribute. Fortunately, the selenium-wire library is here to help:
from webdriver_manager.chrome import ChromeDriverManager
from seleniumwire import webdriver

# Create a selenium-wire webdriver instance
driver = webdriver.Chrome(ChromeDriverManager().install())

# Make a GET request
driver.get('https://www.alpharithms.com')

# Print some underlying HTTP request data
print(driver.requests[0].headers, driver.requests[0].response)

>>> content-length: 1
origin: https://www.google.com
content-type: application/x-www-form-urlencoded
sec-fetch-site: none
sec-fetch-mode: no-cors
sec-fetch-dest: empty
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
200
Here we see all kinds of useful information! There are plenty of other methods available via the seleniumwire library.
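For example, one can iterate over every captured request and filter by URL; below is a minimal sketch continuing from the driver above (the URL substring check is an illustrative assumption):

# Iterate over all requests captured by selenium-wire
for request in driver.requests:
    # Skip requests whose responses have not arrived yet
    if request.response and 'alpharithms' in request.url:
        print(request.url, request.response.status_code)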
Seleniumwire makes more advanced HTTP requests simple but can also come with a few issues one might need to iron out. Mostly, these are permission-based, Windows-centric issues (no surprise there).
PermissionError
If your project is being executed from a directory that requires admin privileges, you may receive the following warning:
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\<USERNAME>\\AppData\\Local\\Temp\\.seleniumwire\\storage-eaf61cf8-3a4e-41b0-a545-b3d54b417974'
There are two options to deal with this:
- Move the project to a different directory
- Launch the terminal/IDE with admin privileges
This is mostly a clerical issue: Windows simply needs to allow access to your project directory. If you launch an IDE like PyCharm in administrator mode and re-run the webdriver_manager script, Windows will display a prompt asking you to allow the necessary access.
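A third workaround, assuming a selenium-wire version that supports the request_storage_base_dir option, is to point its temporary storage at a directory you know is writable (the path below is an illustrative assumption):

from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Relocate selenium-wire's captured-request storage to a writable directory
sw_options = {
    'request_storage_base_dir': 'C:\\Temp\\seleniumwire'
}

driver = webdriver.Chrome(ChromeDriverManager().install(), seleniumwire_options=sw_options)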
Headless vs. Full Browser
Selenium's webdriver is a full-fledged web browser. When running webdriver, the first thing most developers notice is the launch of another window on their local machine. Creating a new webdriver instance is the equivalent of double-clicking an icon on one's desktop and launching an application.
Depending on preference, this might be unwanted behavior. It can be avoided by instructing webdriver to run in headless mode. Each browser requires slightly different syntax to configure headless browsing, but each is relatively simple. Below is example code instructing webdriver to run Chrome in headless mode:
# Import Options class for Chrome
from selenium.webdriver.chrome.options import Options

# Get regular Chrome driver options
options = Options()
options.add_argument("--headless")

# Instantiate webdriver with the --headless option enabled
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
Back in the day, one had to download PhantomJS to integrate headless browsing. Today, it's as easy as adding a few lines of code!
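For comparison, here is a sketch of the Firefox equivalent, assuming geckodriver managed by webdriver_manager; note that on newer Selenium versions the executable_path keyword triggers the deprecation warning covered later in this guide:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from webdriver_manager.firefox import GeckoDriverManager

# Configure Firefox to run without a visible window
options = Options()
options.add_argument("--headless")

# GeckoDriverManager resolves the correct geckodriver, mirroring the Chrome workflow
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)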
Configuring Webdriver Proxies
Web scraping often results in developers recognizing the need for web proxies: servers that act as intermediaries between end-user clients and the servers they communicate with. Essentially, a proxy makes a request to another server on behalf of a client, allowing clients to make requests without revealing their identity.
In the context of web scraping, this can help avoid geographic firewalls, rate limiting, and IP-based restrictions. Configuring proxies with webdriver is simple and can be done as such:
from selenium.webdriver.chrome.options import Options

# Define a proxy address
PROXY = "11.23.58.13:2134"

# Create the options object
options = Options()

# Add the proxy server
options.add_argument(f"--proxy-server={PROXY}")

# Create new webdriver instance
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
This works great for public proxies in the host:port format. As those familiar with such public proxies know, their performance is often abysmal: public proxies are frequently blacklisted, congested, or limited in bandwidth. Most web scraping projects, even at the hobbyist level, stand to benefit from more premium proxies. Such proxies will, in most cases, require authentication, and this is where webdriver comes up short.
Authenticated Proxies with Webdriver
Webdriver doesn't provide an API for specifying authenticated proxies by default. There are some common workarounds with varying degrees of support, complexity, and effectiveness. Here are some good options:
- Creating a custom config file
- Using a browser extension (configuring on each launch)
- Authenticating via user/password dialog prompt on launch
Each of these solutions gets the job done. However, each is either overly complex, incompatible across different browsers, or lacking support for certain requirements like headless mode. Fortunately, the authors of selenium-wire have again come up with an excellent solution, shown in the following code:
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Create options via seleniumwire for authenticated proxies
sw_options = {
    "proxy": {
        'http': "user:password@host:port",
        'https': "user:password@host:port",
        'no_proxy': 'localhost,127.0.0.1'
    },
}

# Create the driver, passing the proxy configuration along
driver = webdriver.Chrome(ChromeDriverManager().install(), seleniumwire_options=sw_options)
This code still uses the webdriver-manager library to instantiate a new webdriver object. This time, however, we create a dictionary of options to pass along to the webdriver imported from seleniumwire. Again, seleniumwire proves its merit.
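As a quick sanity check (assuming valid credentials), one can request an IP-echo endpoint such as httpbin.org and confirm that the proxy's address, not your own, comes back:

# Request an IP-echo service through the proxied driver
driver.get('https://httpbin.org/ip')

# The reported origin should be the proxy's IP address, not your own
print(driver.page_source)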
Service Objects
For those using Python (anyone following this tutorial), you may get a deprecation warning that looks like this:
DeprecationWarning: executable_path has been deprecated, please pass in a Service object
  driver = webdriver.Chrome(ChromeDriverManager().install())
To be clear: this is just a warning and won't prevent webdriver from launching. To silence it, one need only adopt the following Service object workflow:
# Import Service
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Create a service object (still using webdriver-manager)
service = Service(ChromeDriverManager().install())

# Create new driver instance
driver = webdriver.Chrome(service=service)
With this approach, we're ready for the future of webdriver best practices and rid of that pesky warning. Otherwise, not much has changed.
Putting it all Together
We've covered a lot of ground in a short time here. We have leveraged webdriver, seleniumwire, and webdriver-manager to accomplish the following:
- Easy webdriver executable configuration
- Headless browsing
- Request data access via seleniumwire
- Authenticated proxy configuration via seleniumwire
These four approaches allow for the robust use of webdriver when scraping dynamic pages. The following code puts everything together, leaving one with a new webdriver instance in headless mode, with accessible lower-level HTTP data and authenticated proxy integration (replace the proxy values with your server and credentials):
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Create an options object for Chrome, enabling headless mode
options = Options()
options.add_argument("--headless")

# Create options via seleniumwire for authenticated proxies
sw_options = {
    "proxy": {
        'http': "user:password@host:port",   # customize this
        'https': "user:password@host:port",  # customize this
        'no_proxy': 'localhost,127.0.0.1'
    },
}

# Create service instance
service = Service(ChromeDriverManager().install())

# Create the final driver: headless, proxied, with HTTP data accessible
driver = webdriver.Chrome(
    service=service,
    options=options,
    seleniumwire_options=sw_options
)
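One final housekeeping sketch: whichever configuration you settle on, wrap the scraping logic so the browser and its driver process are always released, even when a request fails:

try:
    # Scraping logic goes here
    driver.get('https://www.alpharithms.com')
    print(driver.title)
finally:
    # Always shut down the browser and its chromedriver process
    driver.quit()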
Final Thoughts
Webdriver is an incredible tool for automating browser-based testing. It has also found a home among web scraping developers as a powerful solution for dealing with troublesome dynamic pages. With its friendly APIs, however, come some common gotchas. In addition to those discussed here, the official webdriver documentation has a Worst Practices page that should be essential reading for all who use webdriver.