We’re always searching for something – the file where we wrote that
recipe (Python or baking); the comment in 100,000 lines of code that
points to an unfinished module; the log entry about an iffy connection.
Regular expressions (abbreviated as regexps hereafter, but you’ll also
see regex and re) are a codified method of searching which, to the
unenlightened, suggests line noise. Yet, despite a history that
stretches back to Ken Thompson’s 1968 QED editor, they’re still a
powerful tool today, thanks to grep – ‘global regular expression print’.
Using grep exposes only the limited Basic Regular Expressions (BRE);
grep -E (or egrep) gives Extended Regular Expressions (ERE). Most other
languages adopt PCRE (Perl Compatible Regular Expressions), developed in
1997 by Philip Hazel and now widely supported, though not always
implemented in exactly the same way. We'll use grep -P when we need to
access these. Emacs has its own regexp style, closer to BRE, with
grouping and alternation needing a backslash.
This introduction is mostly aimed at searching from the shell, but
you should easily be able to adapt it to standalone Perl scripts and
other languages which use PCRE.
Even the simplest regexp can make you more productive at the command line.
Resources
Your favourite editor
Perl 5.10 (or later)
Step-by-step
Step 01 Word up!
You’re probably used to searching a text file for occurrences of a
word with grep – in that case, the word is the regular expression. More
complicated regexps are simply concise ways of searching for parts of
words, or character strings, in particular positions.
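For instance, to find every occurrence of the word TODO in a tree of
source files (the directory name here is just illustrative):
  grep -rn 'TODO' src/    # -r recurses, -n prints line numbers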
Step 02 Reserved character
Some characters mean special things in regexp pattern matching: . * [
] ^ $ \ in Basic Regular Expressions. The ‘.’ matches any character, so
searching for ‘.’ doesn’t just find the full stop unless grep’s -F option
is used to make the string entirely literal.
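A quick sketch with a made-up file name shows the difference:
  grep '2.5' prices.txt      # . matches any character: 2.5, 205, 2x5...
  grep -F '2.5' prices.txt   # -F is entirely literal: only 2.5
  grep '2\.5' prices.txt     # or escape just the dot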
Step 03 Atlantic crossing
Extended Regular Expressions add ? | { } ( ) to the metacharacters.
grep -E or egrep lets you use them, so that
‘standardise|standardize’ can match British or American (and ‘Oxford’)
spellings of ‘standardise’.
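For example (the file name is invented):
  grep -E 'standardise|standardize' style-guide.txt   # plain grep would need standardise\|standardize (a GNU extension)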
Step 04 Colourful?
‘|’ gives a choice between the two characters in the parentheses –
standardi(s|z)e – saving unnecessary typing. Another way to find both
British and American spellings is ‘?’ to indicate one or zero of the
preceding element, such as the u in colour.
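Both shortened forms work with grep -E (file name invented):
  grep -E 'standardi(s|z)e' style-guide.txt
  grep -E 'colou?r' style-guide.txt    # matches color and colour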
Step 05 Mmmmm, cooooool
The other quantifiers are + for at least one occurrence of the preceding
element (‘_+’ finds lines with at least one underscore) and * for zero or
more (coo*l matches col, cool, coooooooool, but not cl, useful for
different spellings of mmmmmmmmm or zzzzzzzzzz).
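A quick test at the shell, feeding grep from printf:
  printf 'cl\ncol\ncool\ncooool\n' | grep 'coo*l'     # col, cool, cooool
  printf 'cl\ncol\ncool\ncooool\n' | grep -E 'coo+l'  # cool and cooool only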
Step 06 No number
Feeling confident? Good, time for more goodies. [0-9] is short for
[0123456789] and matches any single character listed in the square
brackets. A ^ inside the brackets is a negation, so [^0-9] matches any
non-digit – but the other ^? …
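Try it on any text file you have handy (the name below is made up):
  grep '[0-9]' notes.txt     # lines containing at least one digit
  grep '[^0-9]' notes.txt    # lines containing at least one non-digit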
Step 07 Start to finish
The ^ anchors a match to the beginning of the line; a $ anchors it to
the end. Now you can sort your document.text from text.doc and
find lines beginning with # or ending in a punctuation mark other than a
period.
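A couple of sketches, again with invented file names:
  grep '^#' setup.sh          # comment lines starting with #
  grep '[,;:!?]$' draft.txt   # lines ending in punctuation other than a full stop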
Step 08 A to Z Guide
The range in [] can be anything from the ASCII character set, so
[ \t\r\n\v\f] indicates the whitespace characters (tab, newline et al).
[^bd]oom$ matches all words ending in ‘oom’, occurring at the end of the
line, except boom and doom.
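Note that escapes like \t inside brackets need grep -P; the second
example is plain BRE (file names invented):
  grep -P '[ \t]+$' report.txt   # trailing spaces or tabs at line ends
  grep '[^bd]oom$' words.txt     # room, zoom, gloom... but not boom or doom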
Step 09 POSIX classes
The POSIX classes for character ranges save a lot of typing of ranges
like [A-Za-z0-9], but perhaps most useful is the non-POSIX addition of
[:word:], which matches [A-Za-z0-9_], the added underscore helping to
match identifiers in many programming languages.
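The classes go inside a bracket expression, hence the doubled brackets;
[:word:] is a PCRE extension, so it is shown here with grep -P (file
names invented):
  grep '[[:digit:]]' data.csv    # same as [0-9]
  grep -P '[[:word:]]+' main.c   # [A-Za-z0-9_]+; \w+ is the short form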
Step 10 ASCII style
Where character classes aren’t implemented, knowledge of ASCII’s
underpinnings can save you time: so [ -~] is all printable ASCII
characters (character codes 32-126) and its inverse [^ -~] matches
anything outside that range.
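Handy for spotting stray control characters or non-ASCII bytes; forcing
the C locale makes grep work byte by byte (file name invented):
  LC_ALL=C grep -n '[^ -~]' config.ini   # lines with control characters or non-ASCII bytes (tabs included)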
Step 11 Beyond grep
find and locate both work well with regexps. In The Linux Command
Line (reviewed in LUD 111), William Shotts gave the great example of
find . -regex '.*[^-_./0-9a-zA-Z].*' to find filenames with embedded
spaces and other nasties.
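A quick demonstration, creating a throwaway badly named file first:
  touch 'annual report.txt' annual_report.txt
  find . -regex '.*[^-_./0-9a-zA-Z].*'    # prints ./annual report.txt, not ./annual_report.txt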
Step 12 Nice one Cyril
Speaking of non-standard characters, while [:alpha:] depends on your
locale settings and may only find ASCII text, you can still search for
characters of other alphabets – from accented French and Welsh letters
to the Greek or Russian alphabet.
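In a UTF-8 locale you can simply put the characters in the pattern, and
grep -P understands Unicode script properties if its PCRE was built with
Unicode support (file names invented):
  grep 'café' menu.txt              # literal accented characters
  grep -P '\p{Greek}' lexicon.txt   # any character from the Greek script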
Step 13 Ranging repeat
While {4} would match the preceding element if it occurred four
times, putting in two numbers gives a range. So [0-9]{1,3} finds one-,
two- or three-digit numbers – a quick find for
dotted quads, although it won’t filter out 256-999.
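Strung together for a rough IPv4 match (log name invented):
  grep -E '([0-9]{1,3}\.){3}[0-9]{1,3}' access.log   # dotted quads, 999.999.999.999 included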
Step 14 Bye bye, IPv4
FOSDEM was all IPv6 this year, so let’s not waste any more time on
IPv4 validation, as the future may actually be here. A glance at
published IPv6 validators shows that, despite some Perl ‘line noise’,
they boil down to checking appropriate amounts of hex.
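A heavily simplified sketch: this matches only full, uncompressed
addresses and ignores the :: abbreviation that real validators handle:
  grep -E '^([0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}$' hosts.txt   # eight groups of 1-4 hex digits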
Step 15 Validation
By now regexps should be looking a lot less like line noise, so it’s
time to put together a longer one, just building from some of the
simpler parts. A common programming task, particularly with web forms,
is validating that input is in the correct format – such as dates.
In this case we’re looking at validating dates, eg for date-of-birth
fields (future dates could then be filtered out using the current date).
Note that (0[1-9]|[12][0-9]|3[01]) checks numbers 01-31, but won’t
prevent 31st February.
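A sketch for DD/MM/YYYY dates (the separator and field order are
assumptions, and it still lets 31/02 through):
  grep -E '^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[0-2])/[0-9]{4}$' dates.txt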
Step 16 Back to basics
Now that we have the basics and can string them together, don’t neglect
the grep basics – here we’re looking at how many attempts at
unauthorised access were made over SSH in a given period. An unnecessary
pipe through wc -l can be replaced with grep -c.
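A minimal sketch – the log path and message wording vary between
distributions:
  grep -c 'Failed password' /var/log/auth.log   # count, rather than print, matching lines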
Step 17 Why vi?
Whatever your position in the venerable and affectionate vi/Emacs
war, there will be times and servers where vi is your only tool, so grab
yourself a cheat-sheet. Vi and Vim mostly follow BRE, adding extras such
as the \< \> word boundaries.
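In vi or Vim, searching with /\<hat\> finds ‘hat’ only as a whole word;
GNU grep understands the same boundaries (file name invented):
  grep '\<hat\>' notes.txt    # matches 'hat' but not 'chatter'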
Step 18 Boundary guard
As well as ^ and $ for line ends, word boundaries can be matched in
regexps with \b – enabling matches on, say, ‘hat’ without matching
‘chatter’. The escape character, \, is used to add a number of extra
elements, such as \d for a numerical digit.
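Both are quick to try; \d needs PCRE, so grep -P (file names invented):
  grep '\bhat\b' notes.txt     # \b works as a GNU extension in plain grep
  grep -P '\d\d:\d\d' log.txt  # times like 09:15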
Step 19 Literally meta
Speaking of boundaries, placing \Q \E around part of a regexp (a
Perl/PCRE feature) will treat everything within as literals rather than
metacharacters – meaning you can quote just a part of the regexp, unlike
grep -F where everything becomes a literal.
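A sketch with grep -P and an invented file name – the quoted section is
literal, the rest is still a regexp:
  grep -P '\Q(2.5)\E x \d+' sums.txt    # matches lines like '(2.5) x 40'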
Step 20 Lazy = good
Time to think about good practice. * is a greedy operator: something
like <.*> grabs everything up to the last ‘>’ on the line, including any
further tags in between. <.*?> is non-greedy (lazy), stopping at the
first ‘>’.
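Lazy quantifiers need PCRE, so grep -P here; -o prints each match:
  echo '<p>hello</p>' | grep -oP '<.*>'     # greedy: <p>hello</p>
  echo '<p>hello</p>' | grep -oP '<.*?>'    # lazy: <p>, then </p>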
Step 21 Perl -pie
Aside from grep, Perl remains the most comfortable fit with regexps,
and is far more powerful than the former. With perl -pi -e (say it as
‘perl pie’) on the command line, you can perform anything from simple
substitutions on one or more files, to…
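For instance, an in-place substitution, keeping a .bak backup of each
file (file name invented):
  perl -pi.bak -e 's/colour/color/g' report.txt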
Step 22 Perl one-liner
…counting the empty lines in a text file (this from Peteris Krumins’s
Perl One-Liners, see next month’s book reviews). /^$/ matches an empty
line; note Perl’s use of // to delimit a regexp; with an explicit m you
can pick other delimiters, such as m,, if / is one of the literals used.
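A sketch of such a one-liner (file name invented):
  perl -lne '$n++ if /^$/; END { print $n + 0 }' chapter.txt   # count empty lines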
Step 23 A regexp too far
Now you know the basics, you can build slightly more complicated
regexps – but, as Jeff Atwood said: “Regular expressions are like a
particularly spicy hot sauce – to be used in moderation and with
restraint, only when appropriate.”
Step 24 Tagged offender
Finally, know the limitations of regexps. Don’t use them on HTML, as
regexps don’t parse complex languages well. The legendary Stack Overflow
reply by bobince to a query on their use with HTML expresses the
passion this question engenders.