Difference between revisions of "Regular expressions"

From PresenceWiki

Revision as of 11:28, 7 January 2010

Regular Expressions

Regular Expressions are a concise, efficient and powerful tool for searching for patterns of text. Presence allows users to make use of them via Decision Point Nodes, Switch Nodes and Functions.

Consider the following scenario: You have the job of checking the pages on a web server for doubled words (such as "this this"), a common problem with documents subject to heavy editing. Your job is to create a solution that will:

  • Accept any number of files to check, report each line of each file that has doubled words, highlight (using standard ANSI escape sequences) each doubled word, and ensure that the source filename appears with each line in the report.
  • Work across lines, even finding situations where a word at the end of one line is repeated at the beginning of the next.
  • Find doubled words despite capitalization differences, such as with "The the", and allow differing amounts of whitespace (spaces, tabs, newlines and the like) to lie between the words.
  • Find doubled words even when separated by HTML tags. HTML tags are for marking up text on a World Wide Web page, for example, to make a word bold.

There are many software solutions one could use to solve the problem, but one with regular expression support can make the job substantially easier.

Regular Expression Structure

A Regular Expression consists of a sequence of characters, some literal and some special, which the regular expression compiler translates into matching instructions. The simplest form of regular expression is a literal string, which matches exactly the text specified. Examples of special characters include:

. (period) - matches any single character (in most engines, any character except a newline)

For example, a regular expression of this form:

"c.t"

Will match "cat", "cot", "cut", etc.
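As a quick check of the "c.t" example, here is a minimal sketch in Python. Python's re module is used purely for illustration and is an assumption; it is not the Presence engine, whose details may differ. The word list is made up.

```python
import re

# "c", then any single character, then "t"
pattern = re.compile(r"c.t")

# Illustrative word list (not from the original article)
words = ["cat", "cot", "cut", "ct", "coat"]

# fullmatch() requires the whole word to fit the pattern,
# so "ct" (too short) and "coat" (too long) are rejected
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

Only the three-letter words are kept, because "." stands for exactly one character.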

\w - word characters (a-z, A-Z, 0-9 and the underscore)

\d - digits (0-9)

\s - whitespace characters (newlines, tabs, spaces)

So the following regular expression:

 "P.\ws\wn.e"

Will match "Presence" or "Personae". Each "." matches any single character and each "\w" matches any word character; the remaining letters ("P", "s", "n" and "e") must appear literally.
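The pattern above can be tried out with a short Python sketch. Python's re module is assumed here only for illustration and may differ in detail from the engine Presence uses; the extra candidate words and sample strings are invented for the example.

```python
import re

pattern = re.compile(r"P.\ws\wn.e")

# "Pretense" and "Presents" are hypothetical near-misses added for contrast
candidates = ["Presence", "Personae", "Pretense", "Presents"]
results = {word: pattern.fullmatch(word) is not None for word in candidates}
print(results)

# \d picks out the digits in a string
digits = re.findall(r"\d", "Line 26")

# \s splits on any single whitespace character (tab, space, newline)
fields = re.split(r"\s", "tab\tspace newline\nend")
print(digits, fields)
```

"Pretense" fails because its fourth letter is not the literal "s", and "Presents" fails because it does not end in the literal "e".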

Multiple Characters

Instead of matching a character just once, we can match it repeatedly by following it with one of these repetition characters:

  • * - the preceding character occurs zero or more times
  • ? - zero or one times
  • + - one or more times
  • {n} - exactly n times
  • {n1,n2} - between n1 and n2 times (inclusive)
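Each repetition character in the list can be seen at work in the following sketch. It assumes Python's re module for illustration only; the helper name fully_matches and the sample strings are hypothetical.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# * : "b" may occur zero or more times
star = [fully_matches(r"ab*c", s) for s in ("ac", "abc", "abbbc")]

# ? : "b" may occur zero or one times
optional = [fully_matches(r"ab?c", s) for s in ("ac", "abc", "abbc")]

# + : "b" must occur one or more times
plus = [fully_matches(r"ab+c", s) for s in ("ac", "abc", "abbbc")]

# {n} : exactly four digits
exact = [fully_matches(r"\d{4}", s) for s in ("2010", "201", "20100")]

# {n1,n2} : between two and four digits
ranged = [fully_matches(r"\d{2,4}", s) for s in ("1", "12", "1234", "12345")]

print(star, optional, plus, exact, ranged)
```

Note that a repetition character applies to the single element immediately before it, so in "ab*c" only the "b" is repeated.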

For example, let's say we have the following string of text:

"International Presence is a great company to work for"

We want to find out what adjective is used to describe our company. To achieve this, we'd look for the surrounding words and a pattern to describe the adjective, thus:

\s\w+\scompany

The first "\s" represents the space before the word. "\w+" matches one or more word characters, which must be followed by a space (the second "\s") and the literal text "company".
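The search can be sketched in Python as follows. Python's re module is an assumption made for illustration and is not necessarily how Presence evaluates the expression. The second pattern adds parentheses (a capturing group), which goes slightly beyond the expression above, so that the adjective can be read back without the surrounding text.

```python
import re

text = "International Presence is a great company to work for"

# The expression from the article: the match includes the spaces and "company"
whole = re.search(r"\s\w+\scompany", text)
print(repr(whole.group(0)))

# Variation (an addition for illustration): parentheses capture just the word
captured = re.search(r"\s(\w+)\scompany", text)
adjective = captured.group(1)
print(adjective)
```

The plain expression returns " great company", so some extra step such as the capturing group is needed to isolate the adjective itself.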

A good introduction to Regular Expressions can be found in the book "Mastering Regular Expressions" by Jeffrey E. F. Friedl.