Decoding Brotli-Compressed Text in Python

Web scraping with Python is all fun and games until you bump into an encoding that outputs a bunch of garbled text instead of the content that's most definitely not copyright protected. This article shows how to deal with one such encoding.

Brotli is a modern, open-source compression algorithm developed by Google, designed to enable faster and more efficient data transmission over the web. It typically achieves better compression ratios than older methods like gzip while remaining fast to decompress. While useful, certain HTTP-request libraries aren’t able to handle Brotli without an extra step. In this article, we’ll discuss how to handle Brotli compression with Python and the requests library.

Quick Intro: Spotting Brotli in the Wild

Brotli is supported by most modern web browsers, including Chrome, Firefox, and Safari. It was formally specified in 2016 as RFC 7932 and has since seen creeping adoption, mostly by large websites, with network compatibility issues slowing things down elsewhere. Its main draw is a higher compression ratio than gzip, paired with decompression that stays fast enough for everyday browsing.

Chances are you are seeing Brotli out in the wild while browsing and simply haven’t noticed. Only through inspection of the response headers of an HTTP request is one likely to discern servers using Brotli. The Content-Encoding: br header is what tells a client (browser) that the server (website) is serving content (loading pages) using the Brotli compression standard.
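As a minimal sketch of that header check, the helper below reports whether a set of response headers lists br as a content encoding. The name uses_brotli is hypothetical (it is not part of requests or any other library), and the plain dicts stand in for real response headers.

```python
def uses_brotli(headers):
    """Return True if the Content-Encoding header value lists 'br'."""
    # Header names are case-insensitive, so normalize the keys before lookup.
    normalized = {str(k).lower(): str(v) for k, v in headers.items()}
    value = normalized.get("content-encoding", "")
    # Several encodings may be listed, comma-separated.
    encodings = [token.strip().lower() for token in value.split(",")]
    return "br" in encodings


# Plain dicts standing in for the headers of an HTTP response.
print(uses_brotli({"Content-Encoding": "br"}))    # True
print(uses_brotli({"content-encoding": "gzip"}))  # False
```

With requests, the same check should work on a response object by passing resp.headers, since that mapping also exposes items().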

For those who might be web scraping, you’re likely to first notice Brotli via visual inspection of the content component of an HTTP response, finding something that resembles the following:

 �NVP�^�_�VV�`��	����ξb����"]�
�E@2pZ�Fy裬��V��V��V�ņ��Y�bu�ox\u�m�Nǖ�,؈�Ȃ�ɉ��S�(��}
^�%S���4�'��R4�9�ƛY
��)��MyL�N�;�������2r������F� G/�dy1�	�����caϽ�C�r5h� �L@ �#���HR��//�zv�O�:������bR� ����<%Y�G�
b&3&c,0KW�fȤ�@sv6M����v�g#�r��ҼR�pzoR'�!�k�_��OO�4-��m��W��ǘ)�M3g����(E���|�kO	�
т��=�ൿDׇ�uM�=���Ǎё@�Kf���
'���,�;yB�\�<���õi��9߾���X�y���t��^��*�y���-@�����}�i�9��:�?B�{Wdپ��Ib�2��V}�������/���e��y�/{����[u�K�ߐ��k߷$SI�]��}�'fŒ&���r�0�ؔ�����24��24�ƘKA-�J	F�yŧ�*.���|q`�cݾ
p)Kyu��D���<���}Zl���~�d3�O�ł������>��q�￑!�w@����F4>�Ɂ����d@^�h�D +��%x�d(|��I��Q�g���>4�)��s2�a��w�48�U�����6�_�a@<r@ Vw����u�/}
@$�†W��C�4.�]th��{����̟ٓ��«��]�������]�� �td 34�Н������G>p���뢠��x�?6�F�-ۇ�Q���r��7�8�[nij7�q{�����o��W��������/ �L���MC���x��P�?VLX��|e����V\ҳ�뻧Dž��l�7�_g���,�Wt�.iYK/�77/���s3���Ҥe�Ѣ�u�
~���`TEH�'�VA���)1@-v	�?c��ʳA1f�E%���PT��Bب��(��ꯁcZ� Rm\l\c��TJ��LMw*�
�J<ʵ>�L��VF)=RC��&~u�caE�@�L-0����?�EqF5�?

This is the still-encoded version of the text. Whether it gets decoded automatically depends on the client: Requests for Python does not do so out of the box, while Axios for JavaScript/Node does.

Handling Brotli in Python with Requests

Brotli is currently in a grey area of adoption, with most of the friction coming from slow CDN support. It’s out there, however, and likely to become more prevalent given the significant improvements it offers. In all likelihood, any widely used HTTP library for any modern language will support Brotli encoding eventually. Python’s most popular third-party HTTP library, requests, does not decode it out of the box as of November 2023. Let’s take a look at a typical requests-based HTTP request in Python:

import requests

# make the HTTP request
resp = requests.get("https://www.google.com")

# output the decoded response content
print(resp.text)

Chances are the resp.text is going to come out garbled in stdout. This is because A.) Google is likely using Brotli as the Content-Encoding (possibly depending on your geographic location) and B.) Python’s requests library lacks native support for Brotli. The garbling is most likely when the request advertises br in its Accept-Encoding header, as headers copied from a browser usually do.

Note: Unless you’re making requests from a high-quality IP with proper headers included, Google may respond with a “Prove You’re not a Robot” type reCaptcha page.
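Servers normally pick a content encoding from the ones the client advertises in its Accept-Encoding request header. One workaround, sketched below, is to stop advertising br so the server falls back to something like gzip. The helper name without_brotli is hypothetical, and a server is not guaranteed to honor the header.

```python
def without_brotli(accept_encoding):
    """Drop the 'br' token from an Accept-Encoding header value."""
    kept = []
    for token in accept_encoding.split(","):
        token = token.strip()
        # Ignore any quality value such as ";q=0.8" when comparing names.
        name = token.split(";", 1)[0].strip().lower()
        if token and name != "br":
            kept.append(token)
    return ", ".join(kept)


# Headers copied from a browser typically advertise Brotli support.
browser_value = "gzip, deflate, br"
print(without_brotli(browser_value))  # gzip, deflate
```

The result can then be sent with a request, for example requests.get(url, headers={"Accept-Encoding": without_brotli(browser_value)}), where url is whatever page is being scraped.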

While the requests library doesn’t offer native Brotli support, the fix is as simple as installing one more third-party library. Run pip install brotli, after which requests will decode Brotli responses automatically. That means no function calls, no imports, no nothing: just use requests like you’re used to and the content will be decoded like any other encoding.
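For cases where it helps to confirm that a Brotli package is actually importable, or to decode raw bytes by hand, here is a rough sketch. It assumes the brotli package (or the alternative brotlicffi package) exposes a decompress function, which both do at the time of writing. The helper names load_brotli and decode_body are hypothetical.

```python
import gzip
import importlib
import zlib


def load_brotli():
    """Return the first importable Brotli module, or None if none is found."""
    for name in ("brotli", "brotlicffi"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def decode_body(raw, content_encoding):
    """Decompress raw response bytes according to a Content-Encoding value."""
    encoding = content_encoding.strip().lower()
    if encoding == "br":
        brotli = load_brotli()
        if brotli is None:
            raise RuntimeError("Brotli body received but no Brotli package is installed")
        return brotli.decompress(raw)
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form; some servers send raw deflate instead.
        return zlib.decompress(raw)
    # Anything else (including "identity" or an empty value) is returned unchanged.
    return raw


print("Brotli available:", load_brotli() is not None)
```

This only applies to bytes that are still compressed. Once a Brotli package is installed, resp.content and resp.text from requests should already be decoded, so passing them to decode_body again would fail.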

Discussion

I almost didn’t bother writing this article because I would bet that native Brotli support will find its way into the requests library sooner rather than later. It’s still an issue for now, though. Dealing with Brotli from a web scraping perspective is fairly trivial: if you notice a Content-Encoding: br header or start getting non-decoded text unexpectedly, there’s a chance you’ve encountered a Brotli-encoded response. Serving Brotli-compressed content from your own application is another subject altogether!

Zαck West
Full-Stack Software Engineer with 10+ years of experience. Expertise in developing distributed systems, implementing object-oriented models with a focus on semantic clarity, driving development with TDD, enhancing interfaces through thoughtful visual design, and developing deep learning agents.