
CSRegEx Class Reference

The main class for Regular Expressions. More...

#include <csregex.h>


Public Types

enum  RGError {
  rge_ok = 0, rge_too_many_refs = 1, rge_missing_round_bracket = 2, rge_overlapping_chars = 3, rge_esc_eof = 4, rge_missing_square_bracket = 5, rge_invalid_esc_hex = 6,
  rge_invalid_repeat_format = 7, rge_invalid_repeat_range = 8, rge_unbalanced_round_bracket = 9, rge_invalid_range = 10, rge_invalid_backreference = 11, rge_regex_too_long = 12
}
 Error codes for compilation. More...

Public Member Functions

bool Compile (const unsigned char *str)
 Compile a regex before scanning.
bool MatchRE (const unsigned char *str, const unsigned char *re)
 Match a string. re is the plaintext regular expression.
bool Match (const unsigned char *str)
 Match a string.
bool Match (const unsigned char *str, const unsigned char *cmp)
 Match a string. cmp is the compiled regular expression string.
unsigned char * GetCompiledString () const
 Returns a copy of the compiled string. Must delete[] return pointer.
void SetCompiledString (unsigned char *str)
 Sets the compiled string.

Public Attributes

int error
 -1 if OK; otherwise the index in str where the error occurred.
enum RGError error_code
 Error code.
char * error_str
 Human readable error information.
bool bMatchHead
 Set to true if you want to always match start of search string.
int MatchStart
 Results from a match. Index into search string.
int MatchEnd
 Results from a match. Non-inclusive. Index into search string.
int BackStart [10]
 Backreferences.
int BackEnd [10]
 End of Backreferences. Non-inclusive.


Detailed Description

The main class for Regular Expressions.

Implementation of regular expressions. It supports both compiling and matching. Compiling a regular expression converts it into a form that the matching engine can use. This compiled string is stored internally for future use so that there's no need for recompiling the regular expression every time.

All functions support UNICODE! Simply compile with UNICODE support on and all functions will use wchar_t instead of char.

Matching is done non-recursively.

Here is a list of its features.

Elements of a regular expression.

  • letters Ordinary characters match themselves, one for one.
  • . matches any char, even newlines.
  • \c Any character can be escaped; an escaped character has no special meaning, with the exceptions listed below.
  • \# The digits are an exception to the above: \0 to \9 are reserved for backreferences.
  • [] matches a single character from a list. - signifies a range. ^ as the first character negates the set. The character immediately after the opening [ (or after the ^) is treated as a normal character with no special meaning, unless it is a backslash, in which case it starts an escape sequence. This means [, ], - or any other character can be placed there and will be inserted into the set. [[] and []] are valid sets although they look strange: the first consists of only '[' and the second of only ']'.
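The set rules above can be modelled with a small stand-alone parser. This is only a sketch of the documented behaviour, not the library's code: CharSet and parse_set are hypothetical names, escaped members are taken literally rather than decoded, a '-' directly before the closing ] is assumed to be a literal, and the rge_overlapping_chars check is not modelled.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative model of a parsed set; not a CSRegEx type.
struct CharSet {
    std::bitset<256> members;
    bool negated;
    CharSet() : negated(false) {}
    bool contains(unsigned char c) const { return members.test(c) != negated; }
};

// Parses a set such as "[a-z]" that starts at pattern[0].
// Returns false if the closing ']' is missing or a range is reversed.
bool parse_set(const std::string& pattern, CharSet& out) {
    out = CharSet();
    if (pattern.empty() || pattern[0] != '[') return false;
    std::size_t pos = 1;
    if (pos < pattern.size() && pattern[pos] == '^') {
        out.negated = true;
        ++pos;
    }
    bool first = true;  // the first member is literal, even if it is ']' or '-'
    while (pos < pattern.size()) {
        unsigned char lo = static_cast<unsigned char>(pattern[pos]);
        if (lo == ']' && !first) return true;  // end of set
        first = false;
        if (lo == '\\') {  // escaped member: take the next character literally
            if (++pos >= pattern.size()) return false;
            lo = static_cast<unsigned char>(pattern[pos]);
        }
        ++pos;
        // "x-y" is a range unless the '-' is the last member before ']'.
        if (pos + 1 < pattern.size() && pattern[pos] == '-' && pattern[pos + 1] != ']') {
            unsigned char hi = static_cast<unsigned char>(pattern[pos + 1]);
            if (hi < lo) return false;  // reversed range, e.g. [z-a]
            for (int v = lo; v <= hi; ++v) out.members.set(static_cast<std::size_t>(v));
            pos += 2;
        } else {
            out.members.set(lo);
        }
    }
    return false;  // ran out of input before ']'
}
```

Under these assumptions, parse_set("[[]", s) yields a set holding only '[' and parse_set("[]]", s) a set holding only ']', which matches the two odd-looking sets described above.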


Special escape characters:

  • \n 0x0A newline (Unix and DOS line-ending differences are not handled; you have to deal with them yourself).
  • \r 0x0D carriage return
  • \t 0x09 tab
  • \a 0x07 bell
  • \b 0x08 backspace
  • \f 0x0C formfeed
  • \v 0x0B vertical tab
  • \x## or \X## hex code for character.
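As a quick reference, the table above can be written as a lookup function. This is a sketch only: escape_value, hex_digit and hex_escape_value are hypothetical helpers, not part of CSRegEx, and the hex form is assumed to take exactly two hex digits as the \x## notation suggests.

```cpp
#include <cassert>

// Byte value for the letter that follows a backslash, or -1 if the letter
// is not one of the single-letter escapes in the table.
int escape_value(char letter) {
    switch (letter) {
        case 'n': return 0x0A;  // newline
        case 'r': return 0x0D;  // carriage return
        case 't': return 0x09;  // tab
        case 'a': return 0x07;  // bell
        case 'b': return 0x08;  // backspace
        case 'f': return 0x0C;  // formfeed
        case 'v': return 0x0B;  // vertical tab
        default:  return -1;
    }
}

// Value of one hex digit, or -1 if c is not a hex digit.
int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Value of the two digits after \x or \X, or -1 if either digit is invalid
// (the situation that rge_invalid_esc_hex reports).
int hex_escape_value(char hi, char lo) {
    int h = hex_digit(hi);
    int l = hex_digit(lo);
    if (h < 0 || l < 0) return -1;
    return h * 16 + l;
}
```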


Grouping

( ) You can group any items together. These can be backreferenced with \1 to \9 depending on the opening bracket count. Backreferences will match the exact string matched inside the ().

(?: ) If you don't need backreferences, place a ?: after the opening round bracket. This will save time and space during the matching process.

| Match an alternate set of items if the previous set did not match.
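The grouping constructs can be tried out without the library. The sketch below uses std::regex from the C++ standard library as a stand-in, not CSRegEx, so it only illustrates the semantics of ( ), (?: ), | and \1; GroupResult and find_with_group are hypothetical names. The start and end fields are meant to mirror MatchStart and MatchEnd (non-inclusive).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Result of one search. start/end are indexes into the search string, end non-inclusive.
struct GroupResult {
    bool found;
    int start;
    int end;
    std::string whole;   // text of the whole match
    std::string group1;  // text of the first capturing group, if any
};

// Searches text for pattern using std::regex (ECMAScript syntax).
GroupResult find_with_group(const std::string& text, const std::string& pattern) {
    GroupResult r;
    r.found = false;
    r.start = -1;
    r.end = -1;
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) return r;
    r.found = true;
    r.start = static_cast<int>(m.position(0));
    r.end = r.start + static_cast<int>(m.length(0));
    r.whole = m[0].str();
    if (m.size() > 1) r.group1 = m[1].str();  // (?: ) groups are not counted
    return r;
}
```

For example, find_with_group("xxababyy", "(ab)\\1") finds "abab" at indexes 2 to 6 with "ab" as the first group, because \1 must repeat exactly what the group matched.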

Quantifiers

? match previous item 0 or 1 time.
* match previous item 0 to infinite times.
+ match previous item 1 to infinite times.
{n} match previous item exactly n times. n is a number from 1 to 253.
{n,} match previous item n to infinite times. n is a number from 0 to 253.
{n,m} match previous item n to m times. n and m are numbers from 1 to 253. n can also be 0.

The values of n and m in the last three quantifiers are limited to 253 because each is stored in one byte. The value 255 is reserved to mean infinite, and 0 cannot be used internally if the compiled string is to remain usable as a normal string (a 0 byte would terminate it). With two values reserved, 255 - 2 = 253 is the largest value you can use.
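The documented limits can be checked before building a pattern. The helpers below are hypothetical and not part of CSRegEx; they encode only the ranges listed above, plus the assumption that n must not exceed m in the {n,m} form.

```cpp
#include <cassert>

// Largest repeat count: one byte, minus the two reserved values.
const int kMaxRepeat = 255 - 2;  // 253

// {n}: exactly n times, n from 1 to 253.
bool valid_exact(int n) { return n >= 1 && n <= kMaxRepeat; }

// {n,}: n to infinite times, n from 0 to 253.
bool valid_at_least(int n) { return n >= 0 && n <= kMaxRepeat; }

// {n,m}: n from 0 to 253, m from 1 to 253.
// Assumption: n <= m, otherwise the range would be empty.
bool valid_between(int n, int m) {
    return n >= 0 && m >= 1 && m <= kMaxRepeat && n <= m;
}
```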

? can be placed after any of the above quantifiers to make it lazy, i.e. it will match as little as necessary. Ex: c.*?t will return "cat" from "catabctch" instead of the usual "catabct".
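The difference between the greedy and the lazy form can be reproduced with std::regex from the C++ standard library. This is a stand-in for illustration, not CSRegEx itself, and first_match is a hypothetical helper.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the text of the first match of pattern in text, or "" if there is none.
// Uses std::regex (ECMAScript syntax), which supports the same lazy ? suffix.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) return "";
    return m[0].str();
}
```

With the search string "catabctch", the lazy pattern c.*?t stops at the first t and yields "cat", while the greedy c.*t runs on to the last t and yields "catabct".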

^ anchors to the front of the string. Can only be used as 1st char, otherwise, normal char.
$ anchors to the end of the string. Can only be used as last char, otherwise, normal char.

Note about the following examples: I use printf. Sorry if you're used to cout, but I despise streams. I think they are a bad idea that should have never seen the light of day.

Simple example:

 // Search for floats.
 CSRegEx re;  // Note: "CSRegEx re();" would declare a function, not an object.
 if (re.MatchRE("bla blae 0.457","[-+]?([0-9]*\\.)?[0-9]+([eE][-+]?[0-9]+)?"))
      printf("Found Match!\n");
 else printf("Fail!\n");

Example with full error checking:

 const char *regex = "[-+]?([0-9]*\\.)?[0-9]+([eE][-+]?[0-9]+)?";
 const char *mystr = "bla blae 0.457";  // Our search string.
 CSRegEx *re = new CSRegEx(); // Create instance of CSRegEx.

 // Compile our regular expression to find floats.
 if (!re->Compile(regex)) 
 {
   // Print the human readable error.
   printf("Error: %s\n",re->error_str); 
   // The following will print the regular expression and below 
   // it will print a ^ where the error occurred.
   printf("Reg expr: %s\n",regex);
   printf("          ");
   for(int i=0;i<re->error;i++) printf(" ");
   printf("^\n");
 }
 else if (!re->Match(mystr)) // Try and find a match.
 {
   printf("No match\n");
 }
 else
 {
   // Success. re->MatchStart and re->MatchEnd (non-inclusive) 
   // contain the location of the match.
   printf("Match: ");
   for(int i=re->MatchStart;i<re->MatchEnd;i++) printf("%c",mystr[i]);
   printf("\n");
 }
 delete re;


Member Enumeration Documentation

enum CSRegEx::RGError
 

Error codes for compilation.

After compilation, error_code will be set to one of these values. If the value is anything other than rge_ok, error will hold the location of the error in the source string.

Enumerator:
rge_ok  OK.
rge_too_many_refs  Too many references. ie. More than 9.
rge_missing_round_bracket  Missing closing round bracket.
rge_overlapping_chars  Inside a set [], some characters have been specified multiple times.
rge_esc_eof  Escape char at end of string is not allowed.
rge_missing_square_bracket  Closing square bracket for set not found.
rge_invalid_esc_hex  Invalid hex characters in escape sequence.
rge_invalid_repeat_format  Repeat format {} is invalid.
rge_invalid_repeat_range  Ranges in repeat {} are invalid.
rge_unbalanced_round_bracket  Unbalanced round brackets.
rge_invalid_range  Invalid range in set []. Usually because 1st item is higher than 2nd. ie. [z-a].
rge_invalid_backreference  Backreference specified not yet defined.
rge_regex_too_long  Regular expression is too long (>65535 chars) when compiled.


Member Function Documentation

bool CSRegEx::Compile (const unsigned char *str)
 

Compile a regex before scanning.

After successful compilation, the compiled version of the regular expression is stored internally. You can use GetCompiledString() to retrieve a copy. Also, after success of this function, you can use the Match() function with only the search string.

On error (return false), CSRegEx::error_code and CSRegEx::error_str will be set. Also, CSRegEx::error will point to the location in the plaintext regular expression where the error occurred.
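The error index lends itself to a caret display under the regular expression. The helper below is a hypothetical sketch and not part of CSRegEx; it builds the two-line report as a string from a plaintext regular expression and an index such as the one stored in CSRegEx::error, treating a negative index as "no error".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds "Reg expr: <regex>" followed by a line with a ^ under the character
// at error_index. If error_index is negative (-1 means OK), only the first
// line is returned.
std::string error_pointer(const std::string& regex, int error_index) {
    const std::string label = "Reg expr: ";
    std::string out = label + regex + "\n";
    if (error_index < 0) return out;
    // Pad past the label, then past error_index characters of the regex.
    out += std::string(label.size() + static_cast<std::size_t>(error_index), ' ');
    out += "^\n";
    return out;
}
```

Assuming a failed Compile(), the result of error_pointer(regex, re.error) could be printed directly with printf("%s", ...).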

Parameters:
str Regular expression.
Returns:
True if success.
See also:
GetCompiledString()

Match()

unsigned char * CSRegEx::GetCompiledString () const
 

Returns a copy of the compiled string. Must delete[] return pointer.

Returns:
Copy of compiled string.
Warning:
You must delete[] the returned pointer when done.
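One way to avoid forgetting the delete[] is to hand the pointer to a std::unique_ptr for arrays, which calls delete[] automatically. The sketch below is an assumption-laden illustration: copy_compiled is a hypothetical stand-in for a function that, like GetCompiledString(), returns a buffer allocated with new[], and CompiledBuffer and take_ownership are names invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical stand-in for GetCompiledString(): returns a new[] copy of src
// with a terminating 0 byte. The caller owns the result.
unsigned char* copy_compiled(const unsigned char* src, std::size_t len) {
    unsigned char* out = new unsigned char[len + 1];
    std::memcpy(out, src, len);
    out[len] = 0;
    return out;
}

// unique_ptr<T[]> releases its buffer with delete[], as the warning requires.
typedef std::unique_ptr<unsigned char[]> CompiledBuffer;

// Wraps a pointer obtained from new[] so that it is freed automatically.
CompiledBuffer take_ownership(unsigned char* raw) {
    return CompiledBuffer(raw);
}
```

With the real class, the same wrapper would presumably be used as take_ownership(re.GetCompiledString()), and buffer.get() would yield the raw pointer when one is needed.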

bool CSRegEx::Match (const unsigned char *str, const unsigned char *cmp)
 

Match a string. cmp is the compiled regular expression string.

Uses cmp for the regular expression instead of the internally stored one. The internal regular expression is not touched and can be reused after a call to this function.

See Match() to find out more about successful matching.

See also:
Match()
Parameters:
str Search string.
cmp Compiled regular expression.
Returns:
True if match found. False otherwise.
Note:
There exist overloaded versions of this function for signed versions of str and cmp.

bool CSRegEx::Match (const unsigned char *str)
 

Match a string.

On success (return true), CSRegEx::MatchStart and CSRegEx::MatchEnd (non-inclusive) will be indexes into the search string where the match was found.

CSRegEx::BackStart[10] and CSRegEx::BackEnd[10] are indexes like the above for backreferences. Array index 0 is the same as MatchStart and MatchEnd.
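Because the end index is non-inclusive, the length of a match is simply end minus start. The helper below is a hypothetical sketch, not part of CSRegEx, that turns such a pair of indexes into a string; the same function would apply to a BackStart[i] and BackEnd[i] pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns text[start, end), i.e. end is non-inclusive.
// Returns "" if the indexes are out of range or reversed.
std::string slice(const std::string& text, int start, int end) {
    if (start < 0 || end < start) return "";
    if (static_cast<std::size_t>(end) > text.size()) return "";
    return text.substr(static_cast<std::size_t>(start),
                       static_cast<std::size_t>(end - start));
}
```

For instance, with the search string "bla blae 0.457" and the illustrative values start = 9 and end = 14, slice returns "0.457".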

The regular expression used is the one stored internally. It can be set with Compile(), SetCompiledString() or MatchRE().

Parameters:
str Search string.
Returns:
True if match found. False otherwise.
See also:
Compile()

SetCompiledString()

MatchRE()

bool CSRegEx::MatchRE (const unsigned char *str, const unsigned char *re)
 

Match a string. re is the plaintext regular expression.

The internal regular expression is replaced with the compiled version of re.

Because the internal regular expression is replaced, if you wish to use the same regular expression again, you can simply call Match() with only the search string after a call to this function.

On failure (return false), check error_code. If error_code == rge_ok, compilation succeeded but no match was found. If error_code != rge_ok, the regular expression failed to compile.

See Compile() to find out more about the errors.
See Match() to find out more about successful matching.
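The three possible outcomes of MatchRE() can be told apart from its return value and error_code. The sketch below is stand-alone and hypothetical: Outcome and classify are invented names, and the local kRgeOk constant mirrors the documented value rge_ok = 0 so that the example does not need csregex.h.

```cpp
#include <cassert>

// Mirrors the documented value of CSRegEx::rge_ok.
const int kRgeOk = 0;

enum Outcome {
    kMatched,       // MatchRE returned true
    kNoMatch,       // returned false, error_code == rge_ok
    kCompileError   // returned false, error_code != rge_ok
};

// Classifies a MatchRE() call from its return value and the error_code
// left behind in the object.
Outcome classify(bool matched, int error_code) {
    if (matched) return kMatched;
    return error_code == kRgeOk ? kNoMatch : kCompileError;
}
```

With the real class this would presumably be called as classify(re.MatchRE(str, pattern), re.error_code), after which error_str and error are only of interest in the kCompileError case.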

See also:
Compile()

Match()

Parameters:
str Search string.
re Plaintext regular expression.
Returns:
True if match found. False otherwise.
Note:
There exist overloaded versions of this function for signed versions of str and re.

void CSRegEx::SetCompiledString (unsigned char *str)
 

Sets the compiled string.

Allows the use of previously compiled strings.

Parameters:
str The compiled string.
Note:
The data is copied, so you retain ownership of the pointer and its data.


The documentation for this class was generated from the following files:

Docs for CSRegEx created on Tue Dec 11 14:35:40 2007 by Doxygen 1.4.3


Webmaster: Cléo Saulnier