dagfinn | 18 March, 2007 21:41
It bothers me: I can dream up more ideas in one afternoon than I can write down in a week. So now I've decided to write down as many of today's ideas as I can before it gets too late in the evening. It might be only a fraction of what I've been thinking, but I guess I'll have to learn to live with that.
Anyway, when I use regular expressions (regexes), I often feel the need to make them more readable. I do all sorts of things to achieve that. Commenting them is a pretty basic way.
The x modifier allows you to insert comments into the regex itself. Whitespace in the pattern is ignored (unless it is escaped or inside a character class), and so is everything from an unescaped # to the end of the line:
$regex =
'/(\w+\s+){6} # Word followed by spaces, repeated six times
\w+ # Last word
/x';
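To see the x modifier in action, here is a minimal sketch that runs the commented pattern through preg_match. The sample sentence in $text is made up for illustration; the extra indentation inside the pattern is harmless because the x modifier ignores it.

```php
<?php
// The commented pattern from above; whitespace and # comments are ignored
// because of the trailing x modifier.
$regex = '/(\w+\s+){6}  # Word followed by spaces, repeated six times
           \w+          # Last word
          /x';

// Hypothetical sample input with exactly seven words.
$text = 'one two three four five six seven';

var_dump(preg_match($regex, $text, $matches)); // int(1)
echo $matches[0], "\n";                        // one two three four five six seven
```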
I always thought this was wonderful, but today I finally realized it's completely unnecessary. Using concatenation to split the regex costs us a few more characters, but it's more flexible, since we can split it anywhere we want:
$regex =
'/(\w+\s+)'. // Word followed by spaces
'{6}'. // Repeated six times
'\w+'. // Last word
'/';
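A quick way to convince yourself that the concatenated version is the same regex: the comments live in PHP, not in the pattern, so the resulting string is just the plain one-liner and needs no x modifier. This is only a sketch, and the test string is an invented example.

```php
<?php
$regex =
    '/(\w+\s+)' . // Word followed by spaces
    '{6}' .       // Repeated six times
    '\w+' .       // Last word
    '/';

// The pieces add up to the ordinary, uncommented pattern.
var_dump($regex === '/(\w+\s+){6}\w+/');       // bool(true)

// Hypothetical input: seven one-letter words.
var_dump(preg_match($regex, 'a b c d e f g')); // int(1)
```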
In ordinary program code, it's often useful to extract well-named methods instead of using comments. So naturally I think this way with regexes as well. It's possible to name the parts of the expression. The plainest way, just to show the principle, is to use variables:
$word = '\w+';
$spaces = '\s+';
$regex = "/($word$spaces){6}$word/";
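Taking the "extract a well-named method" idea one step further, the named parts could be wrapped in a function. The sketch below is only one way to do it: wordSequence() is a hypothetical helper I made up for illustration, and it uses a non-capturing group (?:...) instead of the capturing group in the example above.

```php
<?php
// Hypothetical helper: returns a regex fragment (without delimiters) that
// matches $count words separated by whitespace.
function wordSequence(int $count): string
{
    $word = '\w+';
    $spaces = '\s+';
    // ($count - 1) words followed by spaces, then the last word.
    return sprintf('(?:%s%s){%d}%s', $word, $spaces, $count - 1, $word);
}

$regex = '/' . wordSequence(7) . '/';

echo $regex, "\n"; // /(?:\w+\s+){6}\w+/
var_dump(preg_match($regex, 'one two three four five six seven')); // int(1)
var_dump(preg_match($regex, 'one two three'));                     // int(0)
```

The function name now says what the fragment means, so the call site needs no comment.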
But why aren't regular expressions slightly more verbose in the first place? Wouldn't that make them more readable? I came across this blog post by Joshua Flanagan. He has constructed an object-oriented API for generating regular expressions, and there's a lot of interesting discussion in the comments.
Apart from all the points made there, there's one that no one seems to have spotted: Simple regular expressions gain readability by being similar to the strings they match. Take this expression to match a date and time in ISO format:
'\d\d\d\d-\d\d-\d\d \d\d:\d\d'
It's simple enough that the eye catches the similarity to an actual date string such as
'2007-03-18 23:00'
But as soon as they start getting more complex, the terse syntax starts looking cryptic.
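Here is a small sketch of that trade-off, with the date pattern wrapped in / delimiters so preg_match accepts it. The $terse variant is my own addition for comparison: it matches the same strings but no longer looks like a date. Note that \d only checks the shape, not whether the date is valid.

```php
<?php
// The "looks like what it matches" version from above.
$readable = '/\d\d\d\d-\d\d-\d\d \d\d:\d\d/';

// A terser equivalent using counted repetition (added for comparison).
$terse = '/\d{4}-\d{2}-\d{2} \d{2}:\d{2}/';

var_dump(preg_match($readable, '2007-03-18 23:00'));     // int(1)
var_dump(preg_match($terse, '2007-03-18 23:00'));        // int(1)

// A differently formatted date does not match.
var_dump(preg_match($readable, '18 March, 2007 21:41')); // int(0)

// Shape only: this is not a real date, but it still matches.
var_dump(preg_match($readable, '9999-99-99 99:99'));     // int(1)
```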
(To be continued, I think...)