Regular expressions

From PresenceWiki
Revision as of 11:19, 7 January 2010 by Mattpryor (Talk | contribs)

Regular Expressions

Regular Expressions are a concise, efficient and powerful tool for searching for patterns of text. Presence allows users to make use of them via Decision Point Nodes, Switch Nodes and Functions.

Consider the following scenario: You have the job of checking the pages on a web server for doubled words (such as "this this"), a common problem with documents subject to heavy editing. Your job is to create a solution that will:

  • Accept any number of files to check, report each line of each file that has doubled words, highlight (using standard ANSI escape sequences) each doubled word, and ensure that the source filename appears with each line in the report.
  • Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
  • Find doubled words despite capitalization differences, such as with "The the", and allow differing amounts of whitespace (spaces, tabs, newlines and the like) to lie between the words.
  • Find doubled words even when separated by HTML tags. HTML tags are for marking up text on a World Wide Web page, for example, to make a word bold.

There are many software solutions one could use to solve the problem, but one with regular expression support can make the job substantially easier.
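
As a rough illustration of why regular expressions suit this problem, the sketch below covers only the matching part of the task: capitalization differences, whitespace (including newlines) and intervening HTML tags. It does not deal with file names or ANSI highlighting. Python's re module is used purely for illustration, and the regular expression dialect available in Presence may differ. The function name find_doubled_words, the sample text and the use of the backreference \1 (meaning "the same text that group 1 matched") are assumptions, not taken from this page.

```python
import re

# \b(\w+)            a whole word, captured as group 1
# (?:\s|<[^>]+>)+    one or more whitespace characters or HTML tags
# (\1)\b             the same word again (compared case-insensitively)
DOUBLED = re.compile(r"\b(\w+)(?:\s|<[^>]+>)+(\1)\b", re.IGNORECASE)

def find_doubled_words(text):
    """Return (first, second) pairs for each doubled word found in text."""
    return [(m.group(1), m.group(2)) for m in DOUBLED.finditer(text)]

sample = "This is is a test. The <b>the</b> cat\ncat sat."
print(find_doubled_words(sample))
# [('is', 'is'), ('The', 'the'), ('cat', 'cat')]
```

Note that "This is" is not reported: the \b word boundary prevents the "is" inside "This" from counting as a separate word.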

Regular Expression Structure

A Regular Expression is a sequence of literal characters and special characters (metacharacters), which the regular expression engine compiles into matching instructions. Here are some examples of commonly used character sequences:

. (period) - matches any single character (in most implementations, any character except a newline)

For example, a regular expression of this form:

c.t

Will match "cat", "cot", "cut", etc.
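
A minimal sketch of the period in action, assuming the pattern c.t (which is consistent with the matches listed above) and using Python's re module purely for illustration. The word list is made up for the example.

```python
import re

# "c.t": a literal "c", any single character, then a literal "t".
pattern = re.compile(r"c.t")

words = ["cat", "cot", "cut", "ct", "coat"]

# fullmatch() requires the whole word to fit the pattern.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['cat', 'cot', 'cut']
```

"ct" fails because the period must consume exactly one character, and "coat" fails because it has two characters between "c" and "t".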

Other useful sequences include:

\w - word characters (a-z, A-Z, 0-9 and the underscore)

\d - digits (0-9)

\s - whitespace characters (newlines, tabs, spaces)
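
The three sequences above can be tried out with a short sketch such as the following. It uses Python's re module for illustration only, and the sample string is invented for the example.

```python
import re

sample = "Order 66 shipped\ton 2010-01-07"

# Runs of word characters (letters, digits, underscore).
word_runs = re.findall(r"\w+", sample)
# Runs of digits.
digit_runs = re.findall(r"\d+", sample)
# Individual whitespace characters (spaces and the tab).
whitespace = re.findall(r"\s", sample)

print(word_runs)   # ['Order', '66', 'shipped', 'on', '2010', '01', '07']
print(digit_runs)  # ['66', '2010', '01', '07']
print(whitespace)  # [' ', ' ', '\t', ' ']
```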

So the following regular expression:


Will match "Presence" or "Personae". The first "." matches any character, and "\w" matches any word character.
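
As a sketch of how "." and "\w" can be combined, the example below uses the pattern P..s\wn\we. This pattern is an assumption chosen to fit the description (it matches both "Presence" and "Personae") and is not necessarily the one originally intended on this page. Python's re module is used for illustration only.

```python
import re

# Hypothetical pattern: "P", any two characters, "s",
# a word character, "n", a word character, then "e".
pattern = re.compile(r"P..s\wn\we")

words = ["Presence", "Personae", "Patience", "Pressure"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['Presence', 'Personae']
```

"Patience" and "Pressure" are rejected because the literal "s" and "n" must appear in the fourth and sixth positions respectively.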

Multiple Characters

Instead of matching just one character, we can match several by following a character or sequence with one of these repetition characters (quantifiers):

  • * - the preceding character occurs zero or more times
  • ? - zero or one times
  • + - one or more times
  • {n} - exactly n times
  • {n1,n2} - between n1 and n2 times (inclusive)
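
The repetition characters listed above can be compared side by side with a sketch like this one. Each pattern applies a different quantifier to the letter "b". The candidate strings and the helper name full_matches are made up for the example, and Python's re module is used for illustration only.

```python
import re

candidates = ["ac", "abc", "abbc", "abbbc"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(full_matches(r"ab*c"))      # ['ac', 'abc', 'abbc', 'abbbc']
print(full_matches(r"ab?c"))      # ['ac', 'abc']
print(full_matches(r"ab+c"))      # ['abc', 'abbc', 'abbbc']
print(full_matches(r"ab{2}c"))    # ['abbc']
print(full_matches(r"ab{1,2}c"))  # ['abc', 'abbc']
```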

For example, let's say we have the following string of text:

"International Presence is a great company to work for"

We want to find out what adjective is used to describe our company. To achieve this, we'd look for the surrounding words and a pattern to describe the adjective, thus:


The first "\s" represents the space before the word. "\w+" matches one or more word characters, which are followed by " company".
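
A minimal sketch of this search, assuming a pattern of the form \s\w+ company as described above. The parentheses around \w+ (a capturing group, used to pull out just the adjective) are an addition not covered in the text, and Python's re module is used for illustration only.

```python
import re

text = "International Presence is a great company to work for"

# \s        the space before the adjective
# (\w+)     the adjective itself, captured as group 1
#  company  a literal space followed by the word "company"
match = re.search(r"\s(\w+) company", text)

adjective = match.group(1) if match else None
print(adjective)  # great
```

Earlier words such as "Presence" and "is" are skipped because they are not immediately followed by " company".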