PHP in Action

Readable regular expressions

dagfinn | 18 March, 2007 21:41

It bothers me: I can dream up more ideas in one afternoon than I can write down in a week. So now I've decided to write down as many of today's ideas as I can before it gets too late in the evening. It might be only a fraction of what I've been thinking, but I guess I'll have to learn to live with that.

Anyway, when I use regular expressions (regexes), I often feel the need to make them more readable. I do all sorts of things to achieve that. Commenting them is a pretty basic way.

The x modifier allows you to insert comments into the regex itself. Unescaped whitespace in the pattern is ignored (except inside a character class), and so is everything from an unescaped # to the end of the line:

$regex =
'/(\w+\s+){6}  # Word followed by spaces, repeated six times
 \w+           # Last word
 /x';
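
To see the commented form at work, here is a minimal sketch that runs it through preg_match(). The two sample strings are made up for illustration only.

```php
<?php
// The pattern from above, using the x (extended) modifier.
$regex = '/
    (\w+\s+){6}   # Word followed by spaces, repeated six times
    \w+           # Last word
/x';

// Hypothetical sample inputs, chosen only for illustration.
$sevenWords = 'one two three four five six seven';
$threeWords = 'one two three';

var_dump(preg_match($regex, $sevenWords)); // int(1)
var_dump(preg_match($regex, $threeWords)); // int(0)
```

One thing to watch: PHP treats the first unescaped delimiter as the end of the pattern, so a / inside one of these comments would cut the regex short.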

I always thought this was wonderful, but today I finally realized it's completely unnecessary. Using concatenation to split the regex costs us a few more characters, but it's more flexible, since we can split it anywhere we want:

$regex =
'/(\w+\s+)'. // Word followed by spaces
'{6}'.       // Repeated six times
'\w+'.       // Last word
'/';

In ordinary program code, it's often useful to extract well-named methods instead of using comments. So naturally I think this way with regexes as well. It's possible to name the parts of the expression. The plainest way, just to show the principle, is to use variables:

$word = '\w+';
$spaces = '\s+';
$regex = "/($word$spaces){6}$word/";
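
Taking the "extract a well-named method" idea literally, the named parts could also be small functions. This is only a sketch: the function names word(), spaces() and repeated() are invented here, the return types assume PHP 7 or later, and repeated() uses a non-capturing group rather than the capturing one above.

```php
<?php
// Hypothetical helper names, invented for this sketch.
function word(): string
{
    return '\w+';
}

function spaces(): string
{
    return '\s+';
}

// Wrap a sub-pattern in a non-capturing group and repeat it exactly $count times.
function repeated(string $pattern, int $count): string
{
    return '(?:' . $pattern . '){' . $count . '}';
}

$regex = '/' . repeated(word() . spaces(), 6) . word() . '/';

echo $regex, "\n"; // /(?:\w+\s+){6}\w+/
var_dump(preg_match($regex, 'one two three four five six seven')); // int(1)
```

Unlike plain variables, a function such as repeated() can take care of the grouping, so the caller doesn't have to remember the parentheses.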

But why aren't regular expressions slightly more verbose in the first place? Wouldn't that make them more readable? I came across a blog post by Joshua Flanagan. He has constructed an object-oriented API for generating regular expressions, and there's a lot of interesting discussion in the comments.
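
To give a rough idea of what an object-oriented approach can look like in PHP, here is a minimal sketch of a fluent builder. It is not Joshua Flanagan's API; the class PatternBuilder and all of its methods are hypothetical, and the typed property assumes PHP 7.4 or later.

```php
<?php
// Hypothetical class for illustration only; not the API from the blog post.
final class PatternBuilder
{
    private string $pattern = '';

    public function word(): self
    {
        $this->pattern .= '\w+';
        return $this;
    }

    public function whitespace(): self
    {
        $this->pattern .= '\s+';
        return $this;
    }

    // Append the sub-pattern built by $build, repeated exactly $count times.
    public function repeat(int $count, callable $build): self
    {
        $inner = $build(new self())->pattern;
        $this->pattern .= '(?:' . $inner . '){' . $count . '}';
        return $this;
    }

    // Return the finished pattern wrapped in / delimiters.
    public function toRegex(): string
    {
        return '/' . $this->pattern . '/';
    }
}

$regex = (new PatternBuilder())
    ->repeat(6, function (PatternBuilder $p) {
        return $p->word()->whitespace();
    })
    ->word()
    ->toRegex();

echo $regex, "\n"; // /(?:\w+\s+){6}\w+/
```

The call chain reads almost like the comments in the earlier examples, at the cost of being far longer than the regex it produces.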

Apart from all the points made there, there's one that no one seems to have spotted: Simple regular expressions gain readability by being similar to the strings they match. Take this expression to match a date and time in ISO format:

'\d\d\d\d-\d\d-\d\d \d\d:\d\d'

It's simple enough that the eye catches the similarity to an actual date string such as 

'2007-03-18 23:00'

But as soon as expressions get more complex, the terse syntax starts to look cryptic.
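
A quick sketch of that contrast: the first pattern below is the simple one, and the second is a stricter variant that also checks the month, day, hour and minute ranges. The stricter pattern, the ^ and $ anchors, and the test strings are my own assumptions, added only for illustration.

```php
<?php
// The simple pattern mirrors the shape of the string it matches.
$simple = '/^\d\d\d\d-\d\d-\d\d \d\d:\d\d$/';

// A stricter variant (an assumption for illustration) that also limits the
// month, day, hour and minute ranges; the visual resemblance is mostly gone.
$strict = '/^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) ([01]\d|2[0-3]):[0-5]\d$/';

var_dump(preg_match($simple, '2007-03-18 23:00')); // int(1)
var_dump(preg_match($strict, '2007-03-18 23:00')); // int(1)
var_dump(preg_match($simple, '2007-13-45 99:99')); // int(1), only the shape is checked
var_dump(preg_match($strict, '2007-13-45 99:99')); // int(0)
```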

(To be continued, I think...)
