CSRegEx Class Reference
The main class for Regular Expressions.
#include <csregex.h>
|
Public Types |
enum | RGError {
rge_ok = 0,
rge_too_many_refs = 1,
rge_missing_round_bracket = 2,
rge_overlapping_chars = 3,
rge_esc_eof = 4,
rge_missing_square_bracket = 5,
rge_invalid_esc_hex = 6,
rge_invalid_repeat_format = 7,
rge_invalid_repeat_range = 8,
rge_unbalanced_round_bracket = 9,
rge_invalid_range = 10,
rge_invalid_backreference = 11,
rge_regex_too_long = 12
} |
| Error codes for compilation.
|
Public Member Functions |
bool | Compile (const unsigned char *str) |
| Compile a regex before scanning.
|
bool | MatchRE (const unsigned char *str, const unsigned char *re) |
| Match a string. re is the plaintext regular expression.
|
bool | Match (const unsigned char *str) |
| Match a string.
|
bool | Match (const unsigned char *str, const unsigned char *cmp) |
| Match a string. cmp is the compiled regular expression string.
|
unsigned char * | GetCompiledString () const |
| Returns a copy of the compiled string. Must delete[] return pointer.
|
void | SetCompiledString (unsigned char *str) |
| Sets the compiled string.
|
Public Attributes |
int | error |
| -1 is ok, anything else is the index in str where the error occurred.
|
enum RGError | error_code |
| Error code.
|
char * | error_str |
| Human readable error information.
|
bool | bMatchHead |
| Set to true if you want to always match start of search string.
|
int | MatchStart |
| Results from a match. Index into search string.
|
int | MatchEnd |
| Results from a match. Non-inclusive. Index into search string.
|
int | BackStart [10] |
| Backreferences.
|
int | BackEnd [10] |
| End of Backreferences. Non-inclusive.
|
Detailed Description
The main class for Regular Expressions.
Implementation of regular expressions. It supports both compiling and matching. Compiling a regular expression converts it into a form that the matching engine can use. This compiled string is stored internally for future use so that there's no need for recompiling the regular expression every time.
All functions support UNICODE! Simply compile with UNICODE support on and all functions will use wchar_t instead of char.
Matching is done non-recursively.
Here is a list of its features.
Elements of a regular expression.
- letters letters are matched on a one on one basis.
- . matches any char, even newlines.
- \c characters can be escaped at any time and will have no special meaning except for below.
- \# Exceptions to the above are the numbers # = 0 to 9 which are reserved for backreferences.
- [] matches a single character from a list. - signifies a range. ^ as 1st char negates the sets. anything immediately after the opening [ or ^ is treated as a normal character and has no special meaning unless it is a backslash (ie. becomes an escape sequence). This means you can put [ ] - or any other character and that character will be inserted into the set. [[] and []] are valid sets although they look strange. The first one consists of only '[' and the other ']'.
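The set rules above can be sketched as a small stand-alone parser. This is only an illustration of the rules as described, not the CSRegEx implementation: the names CharSet and parse_set are hypothetical, a backslash is simplified to "take the next character literally", a - right before the closing ] is assumed to be a literal, and duplicate characters are not rejected the way rge_overlapping_chars suggests the library does.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct CharSet {
    std::bitset<256> members;  // one bit per possible byte value
    bool negated = false;      // true when the set started with ^
    bool valid = false;        // false on a missing ] or an invalid range
};

// Parses one set; p must start at the opening '['.
CharSet parse_set(const std::string& p) {
    CharSet out;
    if (p.empty() || p[0] != '[') return out;
    std::size_t i = 1;
    if (i < p.size() && p[i] == '^') { out.negated = true; ++i; }
    bool first = true;  // the character right after [ or ^ is always literal
    while (i < p.size()) {
        unsigned char c = static_cast<unsigned char>(p[i]);
        if (c == ']' && !first) { out.valid = true; return out; }
        if (c == '\\') {                      // simplified escape handling
            if (++i >= p.size()) return out;  // escape at end of string
            c = static_cast<unsigned char>(p[i]);
        }
        ++i;
        // A '-' followed by something other than ']' forms a range c-hi.
        if (i + 1 < p.size() && p[i] == '-' && p[i + 1] != ']') {
            unsigned char hi = static_cast<unsigned char>(p[i + 1]);
            if (hi < c) return out;           // e.g. [z-a]
            for (unsigned v = c; v <= hi; ++v) out.members.set(v);
            i += 2;
        } else {
            out.members.set(c);
        }
        first = false;
    }
    return out;  // ran off the end: closing ] not found
}
```

With this sketch, parse_set("[[]") yields a set holding only '[' and parse_set("[]]") a set holding only ']', matching the two odd-looking sets mentioned above.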
Special escape characters:
- \n 0x0A newline (unix & dos differences not parsed. You have to handle that on your own).
- \r 0x0D carriage return
- \t 0x09 tab
- \a 0x07 bell
- \b 0x08 backspace
- \f 0x0C formfeed
- \v 0x0B vertical tab
- \x## or \X## hex code for character.
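The escape table above maps directly onto a lookup function. The following is a minimal sketch under assumptions: simple_escape and hex_escape are hypothetical helper names, and the hex form is assumed to take exactly two digits, as the ## notation suggests. It is not taken from the CSRegEx source.

```cpp
#include <cctype>
#include <string>

// Returns the byte value for a single-letter escape from the table,
// or -1 when the letter has no special meaning.
int simple_escape(char letter) {
    switch (letter) {
        case 'n': return 0x0A;  // newline
        case 'r': return 0x0D;  // carriage return
        case 't': return 0x09;  // tab
        case 'a': return 0x07;  // bell
        case 'b': return 0x08;  // backspace
        case 'f': return 0x0C;  // formfeed
        case 'v': return 0x0B;  // vertical tab
        default:  return -1;
    }
}

// Decodes the two hex digits following \x or \X.
// Returns -1 when the input is not exactly two valid hex digits.
int hex_escape(const std::string& digits) {
    if (digits.size() != 2) return -1;
    int value = 0;
    for (char ch : digits) {
        unsigned char u = static_cast<unsigned char>(ch);
        if (!std::isxdigit(u)) return -1;
        int d = std::isdigit(u) ? u - '0' : std::tolower(u) - 'a' + 10;
        value = value * 16 + d;
    }
    return value;
}
```

A -1 from hex_escape corresponds to the situation that the library reports as rge_invalid_esc_hex.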
Grouping
( ) You can group any items together. These can be backreferenced with \1 to \9 depending on the opening bracket count. Backreferences will match the exact string matched inside the ().
(?: ) If you don't need backreferences, place a ?: after the opening round bracket. This will save time and space during the matching process.
| Match an alternate set of items if the previous set did not match.
Quantifiers
? match previous item 0 or 1 time.
* match previous item 0 to infinite times.
+ match previous item 1 to infinite times.
{n} match previous item exactly n times. n is a number from 1 to 253.
{n,} match previous item n to infinite times. n is a number from 0 to 253.
{n,m} match previous item n to m times. n and m are numbers from 1 to 253. n can also be 0.
The ranges for n and m in the last 3 quantifiers have a maximum of 253 because one byte is used for the range. The value 255 is reserved for infinite. And if you want to be able to use the compiled string just as a normal string, 0 can't be used internally either. That makes 2 reserved values, so 255-2 = 253 is the largest value you can use.
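One byte encoding that is consistent with this arithmetic is sketched below. It is an assumption made for illustration, not the documented internal format: counts 0 to 253 are stored as n+1 (bytes 1 to 254), so the byte 0 never appears, and 255 stands for infinite. The names encode_repeat, decode_repeat, kMaxRepeat and kInfinite are hypothetical.

```cpp
#include <cassert>

const int kMaxRepeat = 253;  // largest count a quantifier may carry
const int kInfinite  = -1;   // hypothetical sentinel for an open-ended repeat

// Assumed encoding: n -> n + 1, infinite -> 255. The byte 0 is never produced,
// so the compiled string could still be treated as a zero-terminated string.
unsigned char encode_repeat(int n) {
    if (n == kInfinite) return 255;
    assert(n >= 0 && n <= kMaxRepeat);
    return static_cast<unsigned char>(n + 1);
}

// Inverse of encode_repeat.
int decode_repeat(unsigned char b) {
    assert(b != 0);  // 0 is reserved as the string terminator
    return b == 255 ? kInfinite : b - 1;
}
```

Under this assumption the 254 usable byte values (1 to 254) cover exactly the counts 0 to 253.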
? can be used after the above quantifiers to make them lazy, i.e. they will match as little as necessary. Ex: c.*?t will return "cat" from "catabctch" instead of the usual "catabct".
^ anchors to the front of the string. Can only be used as 1st char, otherwise, normal char.
$ anchors to the end of the string. Can only be used as last char, otherwise, normal char.
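The greedy versus lazy behaviour described for c.*t and c.*?t can be checked without CSRegEx. The sketch below uses std::regex from the C++ standard library as a stand-in engine, since it has the same greedy and lazy semantics for this pattern. It says nothing about CSRegEx internals, and note that std::regex differs in other respects (for instance, its . does not match newlines by default). The helper name first_match is hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the first match of pattern in text, or an empty string if none.
// Uses std::regex (ECMAScript grammar), not CSRegEx.
std::string first_match(const std::string& text, const std::string& pattern) {
    std::smatch m;
    std::regex re(pattern);
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

Calling first_match("catabctch", "c.*t") gives "catabct", while first_match("catabctch", "c.*?t") gives "cat".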
Note about the following examples: I use printf. Sorry if you're used to cout, but I despise streams. I think they are a bad idea that should have never seen the light of day.
Simple example:
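CSRegEx re;   // note: "CSRegEx re();" would declare a function, not an object
// The backslash must be doubled inside a C string literal.
if (re.MatchRE("bla blae 0.457","[-+]?([0-9]*\\.)?[0-9]+([eE][-+]?[0-9]+)?"))
    printf("Found Match!\n");
else
    printf("Fail!\n");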
Example with full error checking:
const char *regex = "[-+]?([0-9]*\\.)?[0-9]+([eE][-+]?[0-9]+)?";
const char *mystr = "bla blae 0.457";
CSRegEx *re = new CSRegEx();
if (!re->Compile(regex))
{
    printf("Error: %s\n",re->error_str);
    printf("Reg expr: %s\n",regex);
    // Indent by the width of "Reg expr: " plus the error index so the caret lines up.
    printf("          ");
    for(int i=0;i<re->error;i++) printf(" ");
    printf("^\n");
}
else if (!re->Match(mystr))
{
    printf("No match\n");
}
else
{
    // MatchEnd is non-inclusive, so print [MatchStart, MatchEnd).
    printf("Match: ");
    for(int i=re->MatchStart;i<re->MatchEnd;i++) printf("%c",mystr[i]);
    printf("\n");
}
delete re;
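The character loop in the example relies on MatchStart and MatchEnd forming a half-open range. A small helper can turn such an index pair into a string. This is a sketch only: slice is a hypothetical name, and it works on plain char strings rather than the wchar_t variant used in UNICODE builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Copies the characters in the half-open range [start, end) out of str.
std::string slice(const char* str, int start, int end) {
    assert(str != nullptr);
    assert(start >= 0 && start <= end);
    assert(static_cast<std::size_t>(end) <= std::strlen(str));
    return std::string(str + start, str + end);
}
```

In the example, slice(mystr, re->MatchStart, re->MatchEnd) would replace the printing loop. The BackStart and BackEnd arrays are documented with the same non-inclusive end, so a pair taken from them could be passed the same way.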
Member Enumeration Documentation
|
Error codes for compilation.
After compilation, error_code will be set to one of these values. If it is anything other than rge_ok, error will hold the location of the error in the source string.
- Enumerator:
-
rge_ok |
OK. |
rge_too_many_refs |
Too many references. ie. More than 9. |
rge_missing_round_bracket |
Missing closing round bracket. |
rge_overlapping_chars |
Inside a set [], some characters have been specified multiple times. |
rge_esc_eof |
Escape char at end of string is not allowed. |
rge_missing_square_bracket |
Closing square bracket for set not found. |
rge_invalid_esc_hex |
Invalid hex characters in escape sequence. |
rge_invalid_repeat_format |
Repeat format {} is invalid. |
rge_invalid_repeat_range |
Ranges in repeat {} are invalid. |
rge_unbalanced_round_bracket |
Unbalanced round brackets. |
rge_invalid_range |
Invalid range in set []. Usually because 1st item is higher than 2nd. ie. [z-a]. |
rge_invalid_backreference |
Backreference specified not yet defined. |
rge_regex_too_long |
Regular expression is too long (>65535 chars) when compiled. |
|
Member Function Documentation
bool CSRegEx::Compile |
( |
const unsigned char * |
str |
) |
|
|
|
Compile a regex before scanning.
After successful compilation, the compiled version of the regular expression is stored internally. You can use GetCompiledString() to retrieve a copy. Also, after success of this function, you can use the Match() function with only the search string.
On error (return false), CSRegEx::error_code and CSRegEx::error_str will be set. Also, CSRegEx::error will point to the location in the plaintext regular expression where the error occurred.
- Parameters:
-
| str | Plaintext regular expression to compile. |
- Returns:
- True if success.
- See also:
- GetCompiledString()
Match()
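Because CSRegEx::error is an index into the plaintext regular expression, it can be turned into a caret marker under the offending character. The helper below is a minimal sketch with a hypothetical name (mark_error). It only formats text and does not call into CSRegEx.

```cpp
#include <cstddef>
#include <string>

// Returns the regex followed by a second line with a caret under position
// `error`. A negative index (-1 means "ok") returns the regex unchanged.
std::string mark_error(const std::string& regex, int error) {
    if (error < 0) return regex;
    return regex + "\n" + std::string(static_cast<std::size_t>(error), ' ') + "^";
}
```

After a failed Compile(), printing mark_error(regex, re.error) would show the expression with the error position marked on the line below it.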
|
unsigned char * CSRegEx::GetCompiledString |
( |
|
) |
const |
|
|
Returns a copy of the compiled string. Must delete[] return pointer.
- Returns:
- Copy of compiled string.
- Warning:
- You must delete[] the returned pointer when done.
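Since the caller owns the returned buffer and must release it with delete[], one option is to hand the pointer to a std::unique_ptr<unsigned char[]> straight away. The sketch below is an assumption-laden illustration: copy_bytes is a hypothetical stand-in for any function that returns a new[] allocated copy, and CompiledBuffer and adopt are names invented here.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Stand-in for a function that, like GetCompiledString(), returns a
// new[]-allocated, zero-terminated copy that the caller must delete[].
unsigned char* copy_bytes(const unsigned char* src) {
    std::size_t len = std::strlen(reinterpret_cast<const char*>(src)) + 1;
    unsigned char* out = new unsigned char[len];
    std::memcpy(out, src, len);
    return out;
}

// The array form of unique_ptr calls delete[] when it goes out of scope.
using CompiledBuffer = std::unique_ptr<unsigned char[]>;

// Takes ownership of a raw new[] pointer.
CompiledBuffer adopt(unsigned char* raw) {
    return CompiledBuffer(raw);
}
```

With the real class this would read CompiledBuffer compiled = adopt(re.GetCompiledString()); and the buffer is then freed automatically. Passing compiled.get() to SetCompiledString() on another object is safe in this scheme because that function copies the data.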
|
bool CSRegEx::Match |
( |
const unsigned char * |
str, |
|
|
const unsigned char * |
cmp |
|
) |
|
|
|
Match a string. cmp is the compiled regular expression string.
Uses cmp for the regular expression instead of the internally stored one. The internal regular expression is not touched and can be reused after a call to this function.
See Match() to find out more about successful matching.
- See also:
- Match()
- Parameters:
-
| str | Search string. |
| cmp | Compiled regular expression. |
- Returns:
- True if match found. False otherwise.
- Note:
- There exist overloaded versions of this function for signed versions of str and cmp.
|
bool CSRegEx::Match |
( |
const unsigned char * |
str |
) |
|
|
bool CSRegEx::MatchRE |
( |
const unsigned char * |
str, |
|
|
const unsigned char * |
re |
|
) |
|
|
|
Match a string. re is the plaintext regular expression.
The internal regular expression is replaced with the compiled version of re.
Because the internal regular expression is replaced, if you wish to use the same regular expression again, you can simply call Match() with only the search string after a call to this function.
On failure, check error_code. If error_code == rge_ok, compilation succeeded but no match was found. If error_code != rge_ok, there was an error during compilation.
See Compile() to find out more about the errors.
See Match() to find out more about successful matching.
- See also:
- Compile()
Match()
- Parameters:
-
| str | Search string. |
| re | Plaintext regular expression. |
- Returns:
- True if match found. False otherwise.
- Note:
- There exist overloaded versions of this function for signed versions of str and re.
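The three possible outcomes of MatchRE() (match, no match, compile error) can be told apart from the return value and error_code. The sketch below shows that decision in isolation. Outcome and classify are hypothetical names, and error_code is passed as a plain int (rge_ok is 0 in the RGError listing) so that the snippet compiles without csregex.h.

```cpp
// Possible results of a MatchRE() call, as described in the text.
enum class Outcome { Matched, NoMatch, CompileError };

// matched:    the bool returned by MatchRE()
// error_code: the value of CSRegEx::error_code afterwards (0 == rge_ok)
Outcome classify(bool matched, int error_code) {
    if (matched) return Outcome::Matched;
    return error_code == 0 ? Outcome::NoMatch : Outcome::CompileError;
}
```

With the real class the call would look like classify(re.MatchRE(str, pattern), re.error_code), and only the CompileError case needs to look at error and error_str.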
|
void CSRegEx::SetCompiledString |
( |
unsigned char * |
str |
) |
|
|
|
Sets the compiled string.
Allows the use of previously compiled strings.
- Parameters:
-
| str | Previously compiled regular expression string. |
- Note:
- The data is copied, so you retain ownership of the pointer and its data.
|
The documentation for this class was generated from the following files:
Docs for CSRegEx created on Tue Dec 11 14:35:40 2007 by Doxygen 1.4.3
Webmaster: Cléo Saulnier
|