Regular Expressions May Cause Irregularity

Published: February 24th, 2009 by:

Regular expressions can seem complex, even like a foreign language, to beginning programmers.  The mixture of symbols and characters can bring you to tears if you have a limited understanding, but once you know the significance of each symbol and construct, regular expressions can bring you from irregularity to peaceful, contented bliss.


A regular expression (regex for short) is a pattern describing a string of text.  There are levels of complexity for regular expressions, from simple strings of literal characters to repetition and grouping.

The most basic regular expression consists of a single literal character like a.  It matches the first occurrence of that character in a string; to find later occurrences you must ask the engine to keep searching (for example with preg_match_all in PHP) or make the pattern more specific.  To match one of the special characters [, \, ^, $, ., |, ?, *, +, (, ) literally, you must escape it with a backslash (\).  For example, 1\+1=2 is the correct regular expression to match 1+1=2.
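
As a rough sketch of these points in PHP (the sample strings are made up for illustration), preg_match stops at the first match, preg_match_all keeps counting, and preg_quote escapes the special characters for you:

```php
<?php
// Hypothetical sample text, chosen only for illustration.
$text = "banana";

// preg_match() stops at the first match and returns 1 (found) or 0 (not found).
$found = preg_match('/a/', $text, $first, PREG_OFFSET_CAPTURE);
$firstOffset = $first[0][1];   // byte offset of the first "a"

// preg_match_all() returns how many times the pattern matched.
$count = preg_match_all('/a/', $text, $all);

// Unescaped, + means "one or more", so this pattern looks for "11=2" instead.
$unescaped = preg_match('/1+1=2/', '1+1=2');

// Escaped by hand, or with preg_quote(), the + is treated as a literal character.
$escaped = preg_match('/1\+1=2/', '1+1=2');
$quoted  = preg_match('/' . preg_quote('1+1=2', '/') . '/', '1+1=2');

echo "$found $firstOffset $count $unescaped $escaped $quoted\n";
```

The second argument to preg_quote names the delimiter so that it gets escaped too, which matters as soon as the text you are quoting contains a slash.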

Next are character classes, also called character sets.  Instead of listing each letter of the alphabet or each digit, you can use a set of characters to match a string.  For instance, six[a-z]+ matches sixteen, sixty, and sixlet, among others.  Character sets are placed within square brackets and can include any individual characters as well as ranges of lower-case letters, upper-case letters, and digits from 0 to 9.  To include a literal hyphen, place it first or last in the set so that it is not read as part of a range.
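
A small sketch of a character class at work, assuming a made-up word list and PHP's preg_grep to filter it:

```php
<?php
// Hypothetical word list for illustration.
$words = array('sixteen', 'sixty', 'sixlet', 'six', 'six7', 'Sixty');

// ^ and $ force the whole word to match; [a-z]+ needs at least one lower-case letter.
$matches = array_values(preg_grep('/^six[a-z]+$/', $words));

// A hyphen placed first in a class is a literal hyphen, not part of a range.
$slugPattern = '/^[-a-zA-Z0-9_]+$/';
$slugOk  = preg_match($slugPattern, 'Board-Name_2');
$slugBad = preg_match($slugPattern, 'Board Name');   // the space is not in the class

print_r($matches);
```

Here "six" fails because the plus sign demands at least one more letter, "six7" fails because 7 is outside a-z, and "Sixty" fails because the class is case-sensitive.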

Regular expressions do have some shorthand character classes to ease your mind:  \d for all digits, \w for all alphanumeric characters or the underscore (“_”), and \s for whitespace characters including tabs and line breaks.  There are also escapes for individual non-printable characters: \t for a tab, \r for a carriage return, and \n for a line feed.  Remember that Windows text files use \r\n to terminate lines while UNIX text files simply use \n.
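
The following sketch tries the shorthand classes on a few invented strings, including a conversion from Windows to UNIX line endings:

```php
<?php
// Hypothetical input: text with Windows-style \r\n line endings.
$windowsText = "line one\r\nline two\r\n";

// Replace each \r\n pair with a single \n (UNIX line ending).
$unixText = preg_replace('/\r\n/', "\n", $windowsText);

// \d+ pulls out runs of digits.
preg_match_all('/\d+/', 'Order 66 shipped on 2009-02-24', $digits);

// \w+ pulls out runs of letters, digits, and underscores.
preg_match_all('/\w+/', 'user_name = editor42;', $wordRuns);

// \s+ matches any run of whitespace, including tabs and line breaks.
$collapsed = preg_replace('/\s+/', ' ', "too   many\t\tspaces\nhere");

echo $collapsed, "\n";
```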

The dot or period (“.”) stands for any character except line break characters, which in most engines means the same as [^\n]; a ^ placed directly after the opening bracket negates the whole class (“anything but …”).  A character class or negated character class is often faster and more precise, so be cautious when using the dot.
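
To illustrate why the dot deserves caution, here is a sketch that pulls a quoted value out of an invented two-line string, first with the dot and then with a negated class:

```php
<?php
// Hypothetical two-line input for illustration.
$text = "name=\"Board-Name\" id=\"42\"\nsecond line";

// By default the dot does not match a line break, so .+ stops at the end of line one.
preg_match('/.+/', $text, $firstLine);

// Greedy dot: .+ runs to the LAST quote on the line and swallows too much.
preg_match('/"(.+)"/', $text, $greedy);

// Negated class: [^"]+ cannot cross a quote, so it stops at the first closing quote.
preg_match('/"([^"]+)"/', $text, $precise);

echo $greedy[1], "\n", $precise[1], "\n";
```

The negated class says exactly what may appear between the quotes, so the engine has no reason to overshoot and backtrack.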

Alternation allows for some choice in your regular expressions.  Using the vertical bar or pipe (“|”), you can make sixteen, sixty, and sixlet all match the following regular expression: six(teen|ty|let).  The question mark (“?”) makes the preceding character or group optional, for instance when dealing with British-English and American-English spellings (colou?r matches color and colour).
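
Both patterns can be tried quickly in PHP. The sketch below adds ^ and $ anchors (an addition for this example) so that each word must match in full:

```php
<?php
// Alternation: any one of the three endings may follow "six".
$pattern = '/^six(teen|ty|let)$/';

$results = array();
foreach (array('sixteen', 'sixty', 'sixlet', 'sixes') as $word) {
    $results[$word] = preg_match($pattern, $word);   // 1 = match, 0 = no match
}

// Optional character: the "u" may appear once or not at all.
$american = preg_match('/^colou?r$/', 'color');
$british  = preg_match('/^colou?r$/', 'colour');
$typo     = preg_match('/^colou?r$/', 'colouur');

print_r($results);
```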

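Lastly is the issue of repetition.  The asterisk (“*”) tells the engine to attempt to match the preceding character, class, or group zero or more times.  The plus sign (“+”) tries to match it one or more times.  An integer in curly brackets specifies an exact number of repetitions ({3} means exactly three), and two integers separated by a comma give a minimum and a maximum ({2,4} means two to four).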
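
A short sketch of the repetition operators on made-up inputs; it also shows the {min,max} range form that the e-mail pattern further down relies on:

```php
<?php
// * allows zero or more of the preceding character, so "ac" is fine.
$starZero = preg_match('/^ab*c$/', 'ac');

// + needs at least one, so "ac" fails while "abbbc" passes.
$plusNone = preg_match('/^ab+c$/', 'ac');
$plusMany = preg_match('/^ab+c$/', 'abbbc');

// {4} demands exactly four digits.
$exact      = preg_match('/^\d{4}$/', '2009');
$exactShort = preg_match('/^\d{4}$/', '209');

// {2,4} accepts between two and four letters.
$range     = preg_match('/^[a-z]{2,4}$/', 'com');
$rangeLong = preg_match('/^[a-z]{2,4}$/', 'museum');

echo "$starZero $plusNone $plusMany $exact $exactShort $range $rangeLong\n";
```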

Now for some examples.  The description above can seem quite weighty and boring, but some applications should solidify the ideas and clarify any confusion.

Format of an e-mail address


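<?php

// Sample addresses: only the first one is well-formed.
$addresses = array(
    'goodEmail'  => "yourname@yourdomain.com",
    'badEmail'   => "thisIsNoEmailAtAll",
    'niceTry'    => "thisIsClose@domain",
    'tooManyAts' => "thisIsCloseToo@@domain",
);

// ^ and $ anchor the pattern so the whole string must match, not just a part of it.
// The optional group at the end allows a second two- to four-letter part, as in "co.uk".
$pattern = '/^[-a-zA-Z0-9]+@[-a-zA-Z0-9]+\.[a-zA-Z]{2,4}(\.[a-zA-Z]{2,4})?$/';

foreach ($addresses as $name => $email) {
    if (preg_match($pattern, $email)) {
        echo "$name: Good email!\n";
    } else {
        echo "$name: The format of your e-mail address is unacceptable.\n";
    }
}

?>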

Run each e-mail address above through the check, and the one named as good will pass while the others will fail.  Notice that the pattern allows for two- to four-letter top-level domains as well as those like “co.uk” where the domain has two periods (“.”).

Another common use of regular expressions is in URL rewriting.  For instance, on my website, JeoReview, when a user accesses a URL like ‘http://www.jeoreview.com/board/Board-Name’ a different page is served to show the board itself.  In truth, no file exists at that URL.  The regular expression, as placed in the .htaccess file, looks like this:


Options FollowSymLinks
RewriteEngine On
RewriteRule ^board/(.+)$ loadboard.php?name=$1

The caret (“^”) marks the beginning of the string and the dollar sign (“$”) marks the end.  Because nearly any character can be used for the board name, a dot followed by a plus sign (“.+”) is used to match the name itself.  In truth, the better alternative would be to match exactly the characters that can be used, to increase speed and accuracy.  In any case, the $1 refers to the text captured by the first pair of parentheses (the first capture group); $2, $3, and so on refer to subsequent groups, though in this example there is only one.
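
To see what the rule captures without touching the server, the same pattern can be tried in PHP. This is only a sketch: the $path value is assumed to mimic the path Apache tests against the rule, and the stricter character set at the end is a guess at which characters a board name may contain.

```php
<?php
// The pattern from the RewriteRule, wrapped in # delimiters so "/" needs no escaping.
$rule = '#^board/(.+)$#';

// Assumed to mimic the path Apache would test; hypothetical for this sketch.
$path = 'board/Board-Name';

// In the replacement string, $1 is filled with whatever the parentheses captured.
$rewritten = preg_replace($rule, 'loadboard.php?name=$1', $path);

// preg_match() exposes the same capture group as $m[1].
preg_match($rule, $path, $m);

// A stricter alternative: list the allowed characters instead of using the dot.
// Which characters are legal in a board name is an assumption here.
$strict    = '#^board/([-A-Za-z0-9_]+)$#';
$okStrict  = preg_match($strict, 'board/Board-Name');
$badStrict = preg_match($strict, 'board/../secret');

echo $rewritten, "\n";
```

With the stricter class, a path containing dots or extra slashes never reaches loadboard.php at all.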

There are plenty of other great uses of regular expressions.



About the Author

Kurtis has been working with PHP for nearly four years, and he has moderate experience with MySQL as well as other programming languages, like Java and C++.