
Regular Expressions

Regular expressions are a text-matching system that is much more flexible than simple wildcard comparisons. They are a little harder to learn, but it is worth learning at least the basics, because similar systems are widely used in programs such as Microsoft Word and other editors, as well as in common programming languages such as JavaScript.

There are several types of regular expression system with slight differences in their advanced features. VPOP3 uses the PCRE (Perl Compatible Regular Expression) library for its regular expression system, so any tutorial or book describing that system will work with VPOP3's regular expressions.

A good online tutorial is at http://www.regular-expressions.info/ .

In some places in VPOP3, you can specify either a wildcard or a regular expression to match. In that case, you indicate that you are providing a regular expression by surrounding it with / characters, and specifying any flags after the last /. This is a common way of indicating regular expressions (e.g. in JavaScript and other programming languages). In VPOP3, regular expressions are not automatically anchored to the start and end of the text, but you can explicitly anchor them using the ^ and $ characters.
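As a rough illustration of the /pattern/flags notation, the sketch below splits such a string into a pattern and its flags. It uses Python's re module as a stand-in for PCRE; the helper name compile_slashed and the flag mapping are hypothetical and are not VPOP3's own code.

```python
import re

# Hypothetical mapping from the modifier letters described in this document
# to Python's equivalent flags (Python's re is a stand-in for PCRE here).
FLAG_MAP = {"i": re.IGNORECASE, "s": re.DOTALL, "m": re.MULTILINE}


def compile_slashed(text):
    """Compile a '/pattern/flags' string; return None if it is not in that form."""
    if len(text) < 2 or not text.startswith("/"):
        return None
    last = text.rfind("/")
    if last == 0:
        return None  # only one slash, so not a delimited regular expression
    pattern, letters = text[1:last], text[last + 1:]
    flags = 0
    for letter in letters:
        if letter not in FLAG_MAP:
            return None  # unknown modifier letter
        flags |= FLAG_MAP[letter]
    return re.compile(pattern, flags)


# Patterns are not anchored, so search() looks anywhere in the text.
unanchored = compile_slashed("/cad/")
anchored = compile_slashed("/^cad/")
insensitive = compile_slashed("/ABRA$/i")

found_anywhere = unanchored.search("abracadabra") is not None
found_at_start = anchored.search("abracadabra") is not None
found_ignoring_case = insensitive.search("abracadabra") is not None
print(found_anywhere, found_at_start, found_ignoring_case)
```

Here /cad/ is found in the middle of abracadabra, /^cad/ is not because of the explicit anchor, and /ABRA$/i matches at the end only because of the i flag.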

Below is a basic introduction to regular expressions.

In a regular expression, most characters will match themselves. There are 12 special characters in regular expressions: the backslash \, the dollar symbol $, the caret ^, the dot ., the vertical bar |, the question mark ?, the asterisk *, the plus sign +, parentheses ( and ), the opening square bracket [ , and the opening curly brace {. To match one of the special characters you have to put a backslash \ in front of it (this is called "escaping" the character).

By default comparisons are all case sensitive!

So, some simple regular expressions would be
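As a hedged illustration of literal matching, escaping and case sensitivity, the sketch below uses Python's re module, whose basic syntax agrees with PCRE on these points. The sample strings are invented for the example.

```python
import re

# Most characters match themselves, anywhere in the text.
literal_hit = re.search(r"cat", "concatenate") is not None

# Comparisons are case sensitive by default.
case_miss = re.search(r"cat", "CAT") is None

# An unescaped dot is special, so 1.5 also matches "125"...
dot_loose = re.search(r"1.5", "125") is not None
# ...while an escaped dot matches only a literal dot.
dot_strict = re.search(r"1\.5", "125") is None
dot_literal = re.search(r"1\.5", "version 1.5") is not None

# re.escape() adds the backslashes for you (a Python helper, not part of PCRE).
escaped = re.escape("a+b")
plus_literal = re.search(escaped, "a+b=c") is not None

print(literal_hit, case_miss, dot_loose, dot_strict, dot_literal, plus_literal)
```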


Special Characters

The special character meanings are:

. - match any single character except a newline (line-break) character. Spaces and tabs are matched like any other character.

? - match 0 or 1 of the preceding token. E.g. a? will match "" or "a". .? will match any character or the absence of any character.

* - match 0 or more of the preceding token. E.g. a* will match "" or "a" or "aaaaaaaaaaa". .* will match zero or more of any character (the characters don't have to be the same).

+ - match 1 or more of the preceding token. E.g. a+ will match "a" or "aaaaaaaaa" but not "".

{m} - match m of the preceding token. E.g. a{5} will match only "aaaaa".

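{m,n} - match from m to n (inclusive) of the preceding token. So a{2,4} means 2, 3, or 4 'a' characters. If the second number is omitted there is no upper limit, so a{2,} means 2 or more 'a' characters. For 0 to 5 'a' characters, write a{0,5}; the shorter form a{,5} is accepted by some regular expression engines, but older versions of PCRE treat it as literal text.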

[...] - defines a "character class". You can put characters inside the square brackets, or ranges using a '-' character. This will match any one of the characters in the character class. E.g. [a-z] will match any lower-case letter. [aeiou] will match any lower-case vowel. [aeiouAEIOU] will match any lower- or upper-case vowel. Most special characters can be used inside a character class without escaping; the main exceptions are the ] character and the backslash itself, which must be escaped, and a ^ placed first, which negates the class. E.g. you can have [[\]] to match either [ or ]. If you want to put a - character in the character class, put it at the end, with nothing after it, or it will be interpreted as a range. E.g. [+-*/] will be interpreted as the range "+ to *" followed by /; instead use [+*/-].

| - this is called 'alternation'. It means whatever is before, or whatever is after - so cat|dog will match either cat or dog. To limit the alternation use parentheses. E.g. there is a (cat|dog) over there.

^ - this "anchors" the comparison to the start of the text. Normally regular expressions will match anywhere in the text, but with a ^ it must match at the beginning. E.g. cad will match in abracadabra, but ^cad won't. ^.*cad will also match, but it is less efficient than plain cad. ^abra will match at the start of abracadabra.

$ - this "anchors" the comparison to the end of the text. So bra$ will match at the end of abracadabra.

(...) - parentheses group things together. For instance (cat)+ will match 1 or more instances of "cat", so will match "catcatcat" or "cat", but not "tac". You can use parentheses for many more things, such as lookaheads, lookbehinds, captures, modifiers etc., but you will need to read a more advanced regular expression manual for that.

\ - this "escapes" the following special character, so that it is matched literally. Do not put it in front of alphanumeric characters unless you mean to, because a backslash followed by a letter or digit usually has a special meaning of its own. For instance, \d will match any digit (0-9), \D will match any non-digit character, \b will anchor the comparison to the start or end of a word, \s will match any whitespace character, \S will match any non-whitespace character, \w will match [a-zA-Z0-9_], \t matches a tab character, \n matches a newline character, \r matches a carriage-return character, and so on.
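The special characters above can be tried out with a short script. This is a minimal sketch using Python's re module as a stand-in for PCRE (the two agree on everything shown here); the helper names whole and anywhere are made up for the example. Every check is written so that it evaluates to True.

```python
import re


def whole(pattern, text):
    """True if the pattern matches the entire text (implicitly anchored)."""
    return re.fullmatch(pattern, text) is not None


def anywhere(pattern, text):
    """True if the pattern matches somewhere in the text (unanchored)."""
    return re.search(pattern, text) is not None


checks = {
    # Quantifiers
    "a? matches empty": whole(r"a?", ""),
    "a+ rejects empty": not whole(r"a+", ""),
    "a{5} matches aaaaa": whole(r"a{5}", "aaaaa"),
    "a{2,4} rejects aaaaa": not whole(r"a{2,4}", "aaaaa"),
    # Character classes; the - is placed last so it is taken literally
    "[aeiou] matches e": whole(r"[aeiou]", "e"),
    "[+*/-] matches -": whole(r"[+*/-]", "-"),
    # Alternation limited by parentheses
    "(cat|dog) in sentence": anywhere(
        r"there is a (cat|dog) over there", "look, there is a dog over there"
    ),
    # Anchors
    "cad found inside": anywhere(r"cad", "abracadabra"),
    "^cad not at start": not anywhere(r"^cad", "abracadabra"),
    "bra$ at end": anywhere(r"bra$", "abracadabra"),
    # Grouping
    "(cat)+ matches catcatcat": whole(r"(cat)+", "catcatcat"),
    "(cat)+ rejects tac": not whole(r"(cat)+", "tac"),
    # Backslash shorthands
    "\\d+ matches 2024": whole(r"\d+", "2024"),
    "\\bcat\\b skips concatenate": not anywhere(r"\bcat\b", "concatenate"),
    "\\w+\\s\\w+ matches two words": whole(r"\w+\s\w+", "hello world"),
}

for description, passed in checks.items():
    print(f"{description}: {passed}")
```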


In VPOP3, modifiers come after the terminating / in a regular expression:

i - make the comparison case insensitive

s - make the . character match line-break characters as well

m - switch the comparison to multi-line mode. In this mode, ^ and $ match at the start and end of each line, rather than only at the start and end of the full text.
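The effect of each modifier can be sketched as follows. Python's re module is used as a stand-in for PCRE, with re.IGNORECASE, re.DOTALL and re.MULTILINE playing the roles of i, s and m; the two-line sample text is invented for the example.

```python
import re

text = "first line\nSecond line"

# i: case-insensitive comparison
without_i = re.search(r"second", text) is not None
with_i = re.search(r"second", text, re.IGNORECASE) is not None

# s: let . match line breaks too, so .* can run across both lines
without_s = re.search(r"first.*Second", text) is not None
with_s = re.search(r"first.*Second", text, re.DOTALL) is not None

# m: ^ and $ match at the start and end of each line
without_m = re.search(r"^Second", text) is not None
with_m = re.search(r"^Second", text, re.MULTILINE) is not None

print(without_i, with_i)
print(without_s, with_s)
print(without_m, with_m)
```

In each pair the first search fails and the second succeeds, which shows what the modifier changes.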