Chapter 1. Introduction to Regular Expressions

Having opened this cookbook, you are probably eager to inject some of the ungainly strings of parentheses and question marks you find in its chapters right into your code. If you are ready to plug and play, be our guest: the practical regular expressions are listed and described in Chapters 4 through 8.

But the initial chapters of this book may save you a lot of time in the long run. For instance, this chapter introduces you to a number of utilities (some of them created by one of the authors, Jan) that let you test and debug a regular expression before you bury it in code where errors are harder to find. These initial chapters also show you how to use various features and options of regular expressions to make your life easier, help you understand regular expressions well enough to improve their performance, and teach you the subtle differences between how regular expressions are handled by different programming languages, and even by different versions of your favorite programming language.

So we’ve put a lot of effort into these background matters, confident that you’ll read them before you start, or when you get frustrated by your use of regular expressions and want to bolster your understanding.

Regular Expressions Defined

In the context of this book, a regular expression is a specific kind of text pattern that you can use with many modern applications and programming languages. You can use regular expressions to verify whether input fits the text pattern, to find text that matches the pattern within a larger body of text, to replace text matching the pattern with other text or rearranged bits of the matched text, to split a block of text into a list of subtexts, and to shoot yourself in the foot. This book helps you understand exactly what you’re doing and avoid disaster.
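As a minimal sketch of the four everyday uses just named, here is how they might look with Python’s re module. The date pattern and the sample text are made up for illustration; they are not taken from a recipe in this book.

```python
import re

# Hypothetical pattern for illustration: a date such as 2008-06-30.
date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

text = "Released 2008-06-30, revised 2009-05-15."

# 1. Verify: does the whole input fit the pattern?
is_date = date.fullmatch("2008-06-30") is not None

# 2. Find: locate every match within a larger body of text.
found = [m.group(0) for m in date.finditer(text)]

# 3. Replace: rearrange bits of the matched text
#    (year-month-day becomes day/month/year).
swapped = date.sub(r"\3/\2/\1", text)

# 4. Split: break a block of text into a list of subtexts.
parts = re.split(r"\s*,\s*", "Perl, PCRE ,Java")
```

Each operation reuses the same idea: the pattern describes the text, and the function decides what to do with whatever matches.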

If you use regular expressions with skill, they simplify many programming and text processing tasks, and make possible many that wouldn’t be at all feasible without them. You would need dozens if not hundreds of lines of procedural code to extract all email addresses from a document, and that code would be tedious to write and hard to maintain. But with the proper regular expression, as shown in Recipe 4.1, it takes just a few lines of code, or maybe even one line.
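To give a feel for how short such code can be, the following sketch pulls email-like strings out of a document with Python’s re module. The pattern is a deliberately simplified assumption, not the regex from Recipe 4.1, and it will miss or over-match some real-world addresses. The helper name extract_emails and the sample addresses are hypothetical.

```python
import re

# Simplified, hypothetical pattern: a local part, "@", then two or more
# dot-separated domain labels. Not a full email address validator.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def extract_emails(document):
    """Return every email-like substring, in order of appearance."""
    return EMAIL.findall(document)

sample = "Write to alice@example.com or bob@example.org for details."
emails = extract_emails(sample)
```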

But if you try to do too much with just one regular expression, or use regexes where they’re not really appropriate, you’ll find out why some people say:[1]

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The second problem those people have is that they didn’t read the owner’s manual, which you are holding now. Read on. Regular expressions are a powerful tool. If your job involves manipulating or extracting text on a computer, a firm grasp of regular expressions will save you plenty of overtime.

Many Flavors of Regular Expressions

All right, the title of the previous section was a lie. We didn’t define what regular expressions are. We can’t. There is no official standard that defines exactly which text patterns are regular expressions and which aren’t. As you can imagine, every designer of programming languages and every developer of text processing applications has a different idea of exactly what a regular expression should be. So now we’re stuck with a whole palate of regular expression flavors.

Fortunately, most designers and developers are lazy. Why create something totally new when you can copy what has already been done? As a result, all modern regular expression flavors, including those discussed in this book, can trace their history back to the Perl programming language. We call these flavors Perl-style regular expressions. Their regular expression syntax is very similar, and mostly compatible, but not completely so.

Writers are lazy, too. We’ll usually type regex or regexp to denote a single regular expression, and regexes to denote the plural.

Regex flavors do not correspond one-to-one with programming languages. Scripting languages tend to have their own, built-in regular expression flavor. Other programming languages rely on libraries for regex support. Some libraries are available for multiple languages, while certain languages can draw on a choice of different libraries.

This introductory chapter deals with regular expression flavors only and completely ignores any programming considerations. Chapter 3 begins the code listings, so you can peek ahead to Programming Languages and Regex Flavors in Chapter 3 to find out which flavors you’ll be working with. But ignore all the programming stuff for now. The tools listed in the next section are an easier way to explore the regex syntax through “learning by doing.”

Regex Flavors Covered by This Book

For this book, we selected the most popular regex flavors in use today. These are all Perl-style regex flavors. Some flavors have more features than others. But if two flavors have the same feature, they tend to use the same syntax. We’ll point out the few annoying inconsistencies as we encounter them.

All these regex flavors are part of programming languages and libraries that are in active development. Further along in the book, we mention the flavor without any versions if the presented regex works the same way with all covered versions of that flavor. This is almost always the case. Aside from bug fixes that affect corner cases, regex flavors tend not to change, except to add features by giving new meaning to syntax that was previously treated as an error. The following list of flavors tells you which versions this book covers:

Perl

Perl’s built-in support for regular expressions is the main reason why regexes are popular today. This book covers Perl 5.6, 5.8, and 5.10.

Many applications and regex libraries that claim to use Perl or Perl-compatible regular expressions in reality merely use Perl-style regular expressions. They use a regex syntax similar to Perl’s, but don’t support the same set of regex features. Quite likely, they’re using one of the regex flavors further down this list. Those flavors are all Perl-style.

PCRE

PCRE is the “Perl-Compatible Regular Expressions” C library developed by Philip Hazel. You can download this open source library at http://www.pcre.org. This book covers versions 4 through 7 of PCRE.

Though PCRE claims to be Perl-compatible, and probably is more than any other flavor in this book, it really is just Perl-style. Some features, such as Unicode support, are slightly different, and you can’t mix Perl code into your regex, as Perl itself allows.

Because of its open source license and solid programming, PCRE has found its way into many programming languages and applications. It is built into PHP and wrapped into numerous Delphi components. If an application claims to support “Perl-compatible” regular expressions without specifically listing the actual regex flavor being used, it’s likely PCRE.

.NET

The Microsoft .NET Framework provides a full-featured Perl-style regex flavor through the System.Text.RegularExpressions namespace. This book covers .NET versions 1.0 through 3.5. Strictly speaking, there are only two versions of System.Text.RegularExpressions: 1.0 and 2.0. No changes were made to the Regex classes in .NET 1.1, 3.0, and 3.5.

Any .NET programming language, including C#, VB.NET, Delphi for .NET, and even COBOL.NET, has full access to the .NET regex flavor. If an application developed with .NET offers you regex support, you can be quite certain it uses the .NET flavor, even if it claims to use “Perl regular expressions.” A glaring exception is Visual Studio (VS) itself. The VS integrated development environment (IDE) still uses the same old regex flavor it has had from the beginning, which is not Perl-style at all.

Java

Java 4 is the first Java release to provide built-in regular expression support through the java.util.regex package. It has quickly eclipsed the various third-party regex libraries for Java. Besides being standard and built in, it offers a full-featured Perl-style regex flavor and excellent performance, even when compared with applications written in C. This book covers the java.util.regex package in Java 4, 5, and 6.

If you’re using software developed with Java during the past few years, any regular expression support it offers likely uses the Java flavor.

JavaScript

In this book, we use the term JavaScript to indicate the regular expression flavor defined in Edition 3 of the ECMA-262 standard. This standard defines the ECMAScript programming language, which is better known through its JavaScript and JScript implementations in various web browsers. Internet Explorer 5.5 through 8.0, Firefox, Opera, and Safari all implement Edition 3 of ECMA-262. However, all browsers have various corner case bugs causing them to deviate from the standard. We point out such issues in situations where they matter.

If a website allows you to search or filter using a regular expression without waiting for a response from the web server, it uses the JavaScript regex flavor, which is the only cross-browser client-side regex flavor. Even Microsoft’s VBScript and Adobe’s ActionScript 3 use it.

Python

Python supports regular expressions through its re module. This book covers Python 2.4 and 2.5. Python’s regex support has remained unchanged for many years.

Ruby

Ruby’s regular expression support is part of the Ruby language itself, similar to Perl. This book covers Ruby 1.8 and 1.9. A default compilation of Ruby 1.8 uses the regular expression flavor provided directly by the Ruby source code. A default compilation of Ruby 1.9 uses the Oniguruma regular expression library. Ruby 1.8 can be compiled to use Oniguruma, and Ruby 1.9 can be compiled to use the older Ruby regex flavor. In this book, we denote the native Ruby flavor as Ruby 1.8, and the Oniguruma flavor as Ruby 1.9.

To test which Ruby regex flavor your site uses, try to use the regular expression a++. Ruby 1.8 will say the regular expression is invalid, because it does not support possessive quantifiers, whereas Ruby 1.9 will match a string of one or more a characters.
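The same probing trick carries over to other flavors. As a sketch in Python rather than Ruby, the hypothetical helper flavor_accepts below reports whether the re module will compile a given pattern. Python’s re module is understood to reject a++ before version 3.11 and to accept it from 3.11 onward, so treat the result for a++ as dependent on the version you run.

```python
import re

def flavor_accepts(pattern):
    """Return True if this Python's re module compiles the pattern, else False."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Probe for possessive quantifiers; the answer depends on the Python version.
has_possessive = flavor_accepts(r"a++")
```

Compiling a pattern without running it against any text is a cheap way to check for syntax support, though it says nothing about whether two flavors give the accepted syntax the same meaning.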

The Oniguruma library is designed to be backward-compatible with Ruby 1.8, simply adding new features that will not break existing regexes. The implementors even left in features that arguably should have been changed, such as using (?m) to mean “the dot matches line breaks,” where other regex flavors use (?s).
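For contrast, here is a sketch of what (?s) and (?m) mean in one of those other flavors, Python’s re module. The sample text is made up for illustration, and the comments describe Python’s behavior only, not Ruby’s.

```python
import re

text = "one\ntwo"

# Without a mode modifier, the dot stops at the line break.
plain = re.findall(r"one.two", text)

# (?s) lets the dot match the line break as well.
dotall = re.findall(r"(?s)one.two", text)

# In Python, (?m) does not affect the dot at all...
multiline_dot = re.findall(r"(?m)one.two", text)

# ...it lets ^ and $ match at line breaks instead.
line_starts = re.findall(r"(?m)^\w+", text)
```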



[1] Jeffrey Friedl traces the history of this quote in his blog at http://regex.info/blog/2006-09-15/247.
