Myrddin: libregex

Summary

Regexes are useful tools for parsing and matching text. This library implements a simple but powerful regex variant, which executes in O(text*pat).

pkg regex =
            type ast = union
                    /* basic string building */
                    `Alt    (ast#, ast#)
                    `Cat    (ast#, ast#)

                    /* repetition */
                    `Star   ast#
                    `Rstar  ast#
                    `Plus   ast#
                    `Rplus  ast#
                    `Quest  ast#    

                    /* end matches */
                    `Chr    char
                    `Ranges char[2][:]

                    /* meta */
                    `Cap    (std.size, ast#) /* id, ast */
                    `Bol    /* beginning of line */
                    `Eol    /* end of line */
                    `Bow    /* beginning of word */
                    `Eow    /* end of word */
            ;;

            type status = union
                    `Noimpl
                    `Incomplete
                    `Unbalanced char
                    `Emptyparen
                    `Badrep char
                    `Badrange byte[:]
                    `Badescape char
            ;;

/* regex compilation */
    const parse : (re : byte[:] -> std.result(ast#, status))
    const compile   : (re : byte[:] -> std.result(regex#, status))
    const dbgcompile    : (re : byte[:] -> std.result(regex#, status))
    const free  : (re : regex# -> void)

    /* regex execution */
    const exec  : (re : regex#, str : byte[:] -> std.option(byte[:][:]))
    const search    : (re : regex#, str : byte[:] -> std.option(byte[:][:]))

    const sub   : (re : regex#, str : byte[:], subst : byte[:][:] -> std.option(byte[:]))
    const sbsub : (sb : std.strbuf#, re : regex#, str : byte[:], subst : byte[:][:] -> bool)
    const suball    : (re : regex#, str : byte[:], subst : byte[:][:] -> byte[:])
    const sbsuball  : (sb : std.strbuf#, re : regex#, str : byte[:], subst : byte[:][:] -> void)

    const matchfree : (pat : byte[:][:] -> void)
;;

Overview

Libregex is a simple regex API that uses a parallel NFA implementation. This means that while it is not blazingly fast, it does not exhibit pathological behavior on regexes like (aa|aab?)\* that many common regex APIs will see.

Regex Syntax

The grammar for regexes that are accepted is sketched out below.

       regex       : altexpr
       altexpr     : catexpr ('|' altexpr)+
       catexpr     : repexpr (catexpr)+
       repexpr     : baseexpr[*+?][?]
       baseexpr    : literal
                   | charclass
                   | charrange
                   | '.'
                   | '^'
                   | '$'
                   | '\<'
                   | '\>'
                   | '(' regex ')'
       charclass   : see below
       charrange   : '[' (literal('-' literal)?)+']'

The following metacharacters have the meanings listed below:

Matches a single unicode character

Metachar Description
\< Matches the beginning of a word. Does not consume any characters.
\> Matches the end of a word. Does not consume any characters.
^ Matches the beginning of a line. Does not consume any characters.
$ Matches the end of a line. Does not consume any characters.
* Matches any number of repetitions of the preceding regex fragment.
+ Matches one or more repetitions of the preceding regex fragment.
? Matches zero or one of the preceding regex fragment.

In order to match a literal metacharacter, it needs to be preceded by a '\' character.

The following character classes are supported:

Charclass Description
\d ASCII digits
\D Negation of ASCII digits
\x ASCII Hex digits
\X Negation of ASCII Hex digits
\s ASCII spaces
\S Negation of ASCII spaces
\w ASCII word characters
\W Negation of ASCII word characters
\h ASCII whitespace characters
\H Negation of ASCII whitespace characters
\pX Characters with unicode property 'X'
\PX Negation of characters with property 'X'

The current list of supported Unicode character classes X are

Abbrev Full name Description
L Letter All letters, including lowercase, uppercase, titlecase, and uncased.
Lu Uppercase_Letter All uppercase letters.
Ll Lowercase_Letter All lowercase letters.
Lt Titlecase_Letter All titlecase letters.
N Number All numbers.
Z Separator All separators, including spaces and control characers.
Zs Space_Separator All space separators, including tabs and ASCII spaces.

Functions

const parse : (re : byte[:] -> std.result(ast#, status))

Parse takes a regex string, and converts it to a regex syntax tree, returning `std.Ok ast# if the regex was valid, or a `std.Err e if the regex could not be parsed. This AST can be used to further process the regex, possibly turning it into a multiregex as in Hairless, or using it for NFA and DFA tricks.

const compile   : (re : byte[:] -> std.result(regex#, status))
const dbgcompile    : (re : byte[:] -> std.result(regex#, status))

compile takes a regex string, and converts it to a compiled regex, returning `std.Ok regex if the regex was valid, or a `std.Err e with the reason that the compilation failed. dbgcompile is similar, however, the regex is compiled so it will spit out a good deal of debugging output. Unless you are intent on debugging the internals of the regex engine, this is likely only of academic interest.

const free  : (re : regex# -> void)

free must be called on a compiled regex to release it's resources after you are finished using it.

const exec  : (re : regex#, str : byte[:] -> std.option(byte[:][:])

exec runs the regex over the specified text, returning an `std.Some matches if the text matched, or `std.None if the text did not match. matches[0] is always the full text that was matched, and will always be returned regardless of whether capture groups are specified.

const search    : (re : regex#, str : byte[:] -> std.option(byte[:][:]))

search searches for a matching sub-segment of the regex over the specified text, returning an `std.Some matches if the text matched, or `std.None if the text did not match. matches[0] is always the full text that was matched, and will always be returned regardless of whether capture groups are specified. search returns the the earliest match in the string provided.

const sub   : (re : regex#, str : byte[:], subst : byte[:][:] -> std.option(byte[:]))
const sbsub : (sb : std.strbuf#, re : regex#, str : byte[:], subst : byte[:][:] -> bool)

sub will take a pattern, an input string, and a set of substitutions, and attempt to match. If the match is successful, it will replace each group within str with subst, returning a freshly allocated string. sbsub behaves identically, however it inserts the new string into the string buffer provided, instead of allocating a new string.

If there is no match, then `std.None will be returned.

const suball    : (re : regex#, str : byte[:], subst : byte[:][:] -> byte[:])
const sbsuball  : (sb : std.strbuf#, re : regex#, str : byte[:], subst : byte[:][:] -> void)

suball replaces every match within the string using the given substitutions. Only captured groups will be substituted. The remaining text will be left in place.

Example

Pattern matching

Substitution