REGEX.E

V 2.1

22 APR 02

General:

Overview:     -- General Description
Installation: -- Package installation
ReadMe:       -- Credits and licence
PCRE:         -- PCRE specification (all the gory details)
Manifest:     -- What files are included

Constants:

PCRE_CASELESS        =  #0001,   -- ignore case in search
PCRE_MULTILINE       =  #0002,   -- ^ & $ match at each newline
PCRE_DOTALL          =  #0004,   -- '.' matches any including newline
PCRE_EXTENDED        =  #0008,   -- allows comments in the pattern
PCRE_ANCHORED        =  #0010,   -- match only at start of string
PCRE_DOLLAR_ENDONLY  =  #0020,   -- match '$' only at end of string
PCRE_EXTRA           =  #0040,   -- not yet implemented
PCRE_NOTBOL          =  #0080,   -- start of search string is not BOL
PCRE_NOTEOL          =  #0100,   -- end of search string is not EOL
PCRE_UNGREEDY        =  #0200,   -- quantifiers are not greedy by default
PCRE_NOTEMPTY        =  #0400,   -- 'empty' patterns cannot match
PCRE_IGNORE_ERROR    =  #2000,   -- Errors do not abort
PCRE_HIDE_ERROR      =  #4000    -- Errors are not shown to STDERR

Data:

integer  RGXcnt     -- # of subexpressions which matched

sequence RGXhead    -- Portion of subject before the match.
sequence RGXmatch   -- The matched portion of the subject string.
sequence RGXtail    -- Portion of subject after the match.

Methods:

func RGXscan (aPat, sSubj, iOffset) -- Scan the subject for the pattern. Start at Offset.
func RGXstart (iN)         -- Return the index of the first char of matching substring N
func RGXend (iN)           -- Return the index of the last char of matching substring N

func RGXcompile (sPat)     -- compile the pattern and return handle
proc RGXfree (aHndl)       -- Free a handle returned by RGXcompile()
func RGXsubstring (iN)     -- Return the N'th matching substring
proc RGXsetOptions (iOpts) -- Set options for subsequent searches
func RGXgetErr ()          -- Get {description, offset} of last error.

proc RGXfind (aPat, sSubj) -- Create RGXhead, RGXmatch, and RGXtail from subject
proc RGXsetOffset (iOffs)  -- Set offset for subsequent calls to RGXfind



RGXcompile

Syntax: atom handle = RGXcompile (sPattern)
Description: Compile a search pattern into PCRE's internal format.
Comments: RGXcompile() takes a regular expression and returns a handle suitable for passing to RGXfind(). If the pattern is used repeatedly, you may wish to do this to avoid compiling it more than once. The compiled pattern uses a small amount of memory for storage; it may be recovered with RGXfree().
Example:  
 
include ctx\regex.e

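atom digits
sequence lines
lines = {"abc 123", "x 45 y"}     -- sample subjects (illustrative)

digits = RGXcompile ("[0-9]+")    -- compile once, outside the loop
for i = 1 to length (lines) do
    RGXfind (digits, lines[i])
    puts (1, RGXmatch & "\n")     -- show the matched digits
end for
RGXfree (digits)                  -- RGXfree is a procedure; nothing to assign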
See Also:  




RGXsubstring

Syntax: string = RGXsubstring (iSubStringNumber)
Description: Return a substring from the last search.
Comments: Substrings are defined by parentheses in the pattern. RGXsubstring(0) is the entire match; RGXsubstring(1) is the part which matched the first parenthesized part of the pattern; etc. A constant in regex.e limits the maximum number of substrings to nine.
Example:  
 
include ctx\regex.e

--find a number with a decimal point
RGXfind ("([0-9]+)\\.([0-9]+)", "1st value = 22.3")

-- show it if found
if RGXcnt = 3 then
    puts (1, "Number is " & RGXsubstring (0) & "\n")
    puts (1, "Integer part is " & RGXsubstring (1) & "\n")
    puts (1, "Fractional part is " & RGXsubstring (2) & "\n")
end if
See Also:  


RGXfree

Syntax: RGXfree (pattern)
Description: Free the memory associated with a compiled pattern.
Comments: Each call to RGXcompile() causes PCRE to allocate a small amount of memory which is hidden from Euphoria's garbage collector, so this procedure can be used to explicitly free the storage. This is unlikely to be important unless RGXcompile() is being called from within a loop. NOTE! RGXfree() does not modify the handle. If a handle is freed and then used for a call to RGXfind(), the results are unpredictable.
Example:  
 
include ctx\regex.e

--a search for many patterns
atom pat
for i = 1 to 10000 do
    pat = RGXcompile(getnextpattern())
    ...
    RGXfree (pat)   -- !reclaim memory!
end for
pat = 0    -- erase the handle in case we use it again!
See Also:  


RGXsetOptions

Syntax: RGXsetOptions (integer options)
Description: Set the options for subsequent calls to RGXfind().
Comments: The possible options are listed in the Constants: section. The options set remain in effect until the next RGXsetOptions() command. RGXsetOptions(0) restores the default settings.
Example:  
 
include ctx\regex.e

--Caseless searching
RGXfind (".tag.", "test !TAG! *tag*") -- finds "*tag*"
RGXsetOptions (PCRE_CASELESS)     -- start caseless searches
RGXfind (".tag.", "test !Tag! *tag*") -- finds "!Tag!"
RGXsetOptions (0)                 -- restore default options
See Also:  Constants:
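
Combining options: the option constants each occupy a separate bit, so it should be possible to set several at once by OR'ing them together with Euphoria's or_bits(). This is an assumption drawn from the constant values; it is not stated explicitly for regex.e. A minimal sketch:

```euphoria
include ctx\regex.e

-- Assumption: flags may be combined with or_bits() because each
-- constant occupies its own bit.
RGXsetOptions (or_bits (PCRE_CASELESS, PCRE_MULTILINE))

-- With PCRE_MULTILINE, '^' can match after the newline; with
-- PCRE_CASELESS, "tag" can match "TAG".
RGXfind ("^tag", "first line\nTAG on second line")
if RGXstart (0) != -1 then          -- -1 means no match was found
    puts (1, "Matched: " & RGXmatch & "\n")
end if

RGXsetOptions (0)                   -- restore default options
```

Because the options stay in effect until the next RGXsetOptions() call, restoring the defaults afterwards keeps later searches from inheriting these flags.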



RGXgetErr

Syntax: RGXgetErr ( )
Description: Return a sequence containing the last error message and offset from RGXcompile().
Comments: The default behavior for RGXcompile() is to abort on error, so this function is only useful if RGXsetOptions() has been used to set the PCRE_IGNORE_ERROR option. Normally a bad pattern is indicative of a programming bug so an abort is appropriate, but code which generates patterns on the fly might, rarely, need this.
Example:  
 
include ctx\regex.e

--Recovering from a bad pattern
sequence err
RGXsetOptions (PCRE_IGNORE_ERROR)     -- don't abort on compile error
RGXfind ("", "test !Tag! *tag*")     -- null pattern is not good!
err = RGXgetErr ()
puts (2, "Error message--" & err[1])    -- show the message
See Also:  
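
A variation on the example above, for code that builds patterns at run time. This is a sketch under assumptions: that PCRE_IGNORE_ERROR and PCRE_HIDE_ERROR may be combined with or_bits(), and that the second element returned by RGXgetErr() is an integer offset into the pattern, as the {description, offset} summary suggests. The value RGXcompile() returns on failure is not documented here, so the handle is deliberately not used or freed.

```euphoria
include ctx\regex.e

atom pat
sequence err

-- Assumption: the two error flags can be OR'ed together.
RGXsetOptions (or_bits (PCRE_IGNORE_ERROR, PCRE_HIDE_ERROR))

pat = RGXcompile ("([0-9]+")    -- unbalanced '(' : PCRE rejects this
err = RGXgetErr ()              -- {description, offset}

-- Assumption: err[2] is an integer offset into the pattern.
printf (2, "Bad pattern: %s (at pattern offset %d)\n", {err[1], err[2]})

RGXsetOptions (0)               -- restore default options
```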


RGXscan

Syntax: RGXscan (atom HandleOrPattern, sequence Subject, integer Offset)
Description: Scan Subject, starting at Offset, for Pattern.
Comments: This is an 'index' version of RGXfind(). Where RGXfind() creates the RGXhead, RGXmatch, and RGXtail sequences, RGXscan() just stores the indexes of the match and substrings. See RGXstart() and RGXend() for access to these indexes. It is faster than RGXfind(), under some circumstances much faster. Note that Offset is a cardinal (the start of the string is offset 0), whereas the indexes returned by RGXstart() and RGXend() are ordinary Euphoria indexes (the first character is 1).
Example:  
 
include ctx\regex.e
sequence subject, numtext1, numtext2
subject = "foo = 12 zot = 34"
--find the first number
RGXscan ("[0-9]+", subject, 0)
numtext1 = subject[RGXstart(0)..RGXend(0)]
--find the second number
RGXscan ("[0-9]+", subject, RGXend(0))
numtext2 = subject[RGXstart(0)..RGXend(0)]
See Also:  RGXstart(), RGXend()


RGXstart

Syntax: RGXstart (integer Substring )
Description: Return the starting index of the found substring number 'Substring' resulting from the last RGXscan() or RGXfind().
Comments: Substring 0 is the primary match, Substring 1 is the match from the first set of parens in the pattern (if any), etc. If the substring was not found, a -1 is returned.
Example:  
 
include ctx\regex.e

atom pat
pat = RGXcompile ("([0-9]+)\\.([0-9]*)")

RGXscan (pat, "x = 23.77", 0)
?RGXstart (0)              -- 5 (start of matched number)
?RGXstart (1)              -- 5 (start of digits before '.')
?RGXstart (2)              -- 8 (start of digits after '.')
See Also:  RGXscan(), RGXend()


RGXend

Syntax: RGXend (integer Substring )
Description: Return the ending index of the found substring number 'Substring' resulting from the last RGXscan() or RGXfind().
Comments: Substring 0 is the primary match, Substring 1 is the match from the first set of parens in the pattern (if any), etc. If the substring was not found, a -1 is returned.
Example:  
 
include ctx\regex.e

atom pat
pat = RGXcompile ("([0-9]+)\\.([0-9]*)")

RGXscan (pat, "x = 23.77test", 0)
?RGXend (0)              -- 9 (end of matched number)
?RGXend (1)              -- 6 (end of the digits before '.')
?RGXend (2)              -- 9 (end of the digits after '.')
See Also:  RGXscan(), RGXstart()
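
Since RGXstart() and RGXend() return the first and last index of a substring, the pair can be used directly as a Euphoria slice. The sketch below uses RGXfind() to run the search (the index functions are described as valid after either RGXscan() or RGXfind()); the subject string is illustrative only. The backslash is doubled because a Euphoria string literal needs "\\" to produce a single backslash.

```euphoria
include ctx\regex.e

sequence subject
subject = "x = 23.77"

RGXfind ("([0-9]+)\\.([0-9]*)", subject)

-- 0 is the whole match; 1 and 2 are the parenthesized parts
for i = 0 to 2 do
    if RGXstart (i) != -1 then      -- -1 means this substring was not found
        printf (1, "substring %d: %s\n",
                {i, subject[RGXstart (i)..RGXend (i)]})
    end if
end for
```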


ReadMe

    Regex is a Euphoria module for doing regular expression searches. It is implemented as a wrapper around the PCRE (Perl-Compatible Regular Expression) library written by:

Philip Hazel <ph10@cam.ac.uk>

The latest release of PCRE is always available from:

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.gz

The following licence is taken from the PCRE package: I intend regex to be covered by no more and no less than this.

PCRE LICENCE
------------

    PCRE is a library of functions to support regular expressions whose syntax and semantics are as close as possible to those of the Perl 5 language.

Written by: Philip Hazel <ph10@cam.ac.uk>

University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.

Copyright (c) 1997-2000 University of Cambridge

    Permission is granted to anyone to use this software for any purpose on any computer system, and to redistribute it freely, subject to the following restrictions:

1. This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

2. The origin of this software must not be misrepresented, either by explicit claim or by omission.

3. Altered versions must be plainly marked as such, and must not be misrepresented as being the original software.

4. If PCRE is embedded in any software that is released under the GNU General Purpose Licence (GPL), then the terms of that licence shall
supersede any condition above with which it is incompatible.


END OF LICENCE

Changes in Version 2.1:

Changes in Version 2.0:




Overview

     Regex.e is a Euphoria module for doing regular expression searches. It is implemented as a wrapper around Philip Hazel's PCRE (Perl-Compatible Regular Expression) library.
The latest release of PCRE is always available from

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-xxx.tar.gz

PCRE operates by compiling a regular expression (the pattern) into an internal program and then running that program on the string to be searched (the subject). It has many options, allowing quite sophisticated searches. In general, it fills a large middle ground between simple scanning loops in the host language and full-blown lexers like LEX, FLEX, etc.

The basic API is RGXscan(), RGXstart(), and RGXend(). Use RGXscan() to look for a pattern in a subject string, then RGXstart() and RGXend() to get the start and end indexes of the found pattern.

Patterns which are used many times may be pre-compiled with RGXcompile() to save time.

A 'higher-level' and sometimes more convenient API is provided by RGXfind(), RGXhead, RGXmatch, and RGXtail. RGXfind() automatically generates the head, match and tail sequences, which may not be necessary, and under some circumstances can be remarkably slow. (Thanks to Andy Serpa for pointing this out). If speed is an issue, the RGXscan(), RGXstart() and RGXend() interface is favored.
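
As an illustration of the higher-level interface, the following sketch walks through every number in a subject by searching again in RGXtail after each match. It assumes RGXstart(0) reports -1 after an unsuccessful RGXfind(), as described for substrings that were not found; the subject string is illustrative only.

```euphoria
include ctx\regex.e

atom pat
sequence rest

pat = RGXcompile ("[0-9]+")     -- compile once, reuse in the loop
rest = "foo = 12 zot = 34"

while 1 do
    RGXfind (pat, rest)
    if RGXstart (0) = -1 then   -- -1: no (further) match
        exit
    end if
    puts (1, "found " & RGXmatch & "\n")
    rest = RGXtail              -- continue with the text after this match
end while

RGXfree (pat)                   -- release the compiled pattern
```

Each pass rebuilds the head, match and tail sequences, which is the overhead mentioned above; on long subjects the index-based interface avoids it.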


Installation

 

Zip file Installation:

Create a subdirectory 'Ctx' of your Euphoria include directory ("Euphoria\Include\Ctx", also known as "%EUDIR%\Include\Ctx"). Extract the zip archive into this directory. The contents of the zip file are stored without any path, so they will end up in whatever directory they are extracted to. Any files with the extension .tst may be deleted.


Documentation Installation:

The best way to install the HTML documentation (this file) is to modify the Euphoria\Html\library.htm file as follows:

1) Find the following text in library.htm:

 2. Routines by Application Area
</font></center>
<p>

2) Immediately after the above text, insert:

<!-- Custom additions-------------------------------->
<font color="#FF0099" size=+1><br>2.0 Added Libraries</font>
 <br><br>
<table border=0 cellspacing=2 cellpadding=2>
    <tr>
    <td valign=top><a href="../include/Ctx/regex.htm"><b>REGEX</b></a></td>
    <td width=10 align=center valign=top>-</td>
    <td>Perl-Compatible Regular Expressions (regex.e)</td>
    </tr>
</table> <p> <hr>
<!-- Custom additions-------------------------------->

3) Be sure to modify the string after href= to reflect where you have installed regex.htm



Manifest

pcre.dll:     The generated DLL
pcre.html:    Documentation for PCRE taken from the web.
regex.e:      Euphoria include file interfacing to the DLL
regex.htm:    Documentation for REGEX (this file)
test.ex:      A small test program.
manual1.css:  Cascading style sheet for regex.htm
release:      Brief release notes