Date: Wed, 14 Sep 2011 21:08:02 +0000 (UTC)
From: Gabor Kovesdan <gabor@FreeBSD.org>
To: src-committers@freebsd.org, svn-src-user@freebsd.org
Subject: svn commit: r225561 - user/gabor/tre-integration/lib/libc/regex
Message-ID: <201109142108.p8EL82vN042595@svn.freebsd.org>

Author: gabor
Date: Wed Sep 14 21:08:02 2011
New Revision: 225561
URL: http://svn.freebsd.org/changeset/base/225561

Log:
  - Update the old manual page to match better the specific features of TRE
    and drop those parts that are specific to the old regex code.

Modified:
  user/gabor/tre-integration/lib/libc/regex/regex.3

Modified: user/gabor/tre-integration/lib/libc/regex/regex.3
==============================================================================
--- user/gabor/tre-integration/lib/libc/regex/regex.3 Wed Sep 14 20:13:10 2011 (r225560)
+++ user/gabor/tre-integration/lib/libc/regex/regex.3 Wed Sep 14 21:08:02 2011 (r225561)
@@ -1,3 +1,4 @@
+.\" Copyright (c) 2011 Gabor Kovesdan <gabor@FreeBSD.org>.
 .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
 .\" Copyright (c) 1992, 1993, 1994
 .\" The Regents of the University of California. All rights reserved.
@@ -32,12 +33,18 @@
 .\" @(#)regex.3 8.4 (Berkeley) 3/20/94
 .\" $FreeBSD$
 .\"
-.Dd August 17, 2005
+.Dd September 14, 2011
 .Dt REGEX 3
 .Os
 .Sh NAME
 .Nm regcomp ,
+.Nm regncomp ,
+.Nm regwcomp ,
+.Nm regwncomp ,
 .Nm regexec ,
+.Nm regnexec ,
+.Nm regwexec ,
+.Nm regwnexec ,
 .Nm regerror ,
 .Nm regfree
 .Nd regular-expression library
@@ -47,12 +54,39 @@
 .In regex.h
 .Ft int
 .Fo regcomp
-.Fa "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
+.Fa "regex_t * preg" "const char * pattern" "int cflags"
+.Fc
+.Ft int
+.Fo regncomp
+.Fa "regex_t * preg" "const char * pattern" "size_t len" "int cflags"
+.Fc
+.Ft int
+.Fo regwcomp
+.Fa "regex_t * preg" "const wchar_t * pattern" "int cflags"
+.Fc
+.Ft int
+.Fo regwncomp
+.Fa "regex_t * preg" "const wchar_t * pattern" "size_t len" "int cflags"
 .Fc
 .Ft int
 .Fo regexec
-.Fa "const regex_t * restrict preg" "const char * restrict string"
-.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
+.Fa "const regex_t * preg" "const char * string"
+.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
+.Fc
+.Ft int
+.Fo regnexec
+.Fa "const regex_t * preg" "const char * string" "size_t len"
+.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
+.Fc
+.Ft int
+.Fo regwexec
+.Fa "const regex_t * preg" "const wchar_t * string"
+.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
+.Fc
+.Ft int
+.Fo regwnexec
+.Fa "const regex_t * preg" "const wchar_t * string" "size_t len"
+.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
 .Fc
 .Ft size_t
 .Fo regerror
@@ -62,24 +96,57 @@
 .Ft void
 .Fn regfree "regex_t *preg"
 .Sh DESCRIPTION
-These routines implement
+These routines implement pattern matchinf of
 .St -p1003.2
-regular expressions
-.Pq Do RE Dc Ns s ;
-see
-.Xr re_format 7 .
+regular expressions.
+The
+.Xr re_format 7
+manual can be consulted for the syntax and use of these.
+.Pp
 The
 .Fn regcomp
 function
-compiles an RE written as a string into an internal form,
+compiles a regular expression written as a string into an internal form.
+The
+.Fn regncomp
+function works in the very same way,
+but takes another argument to specify the length of the pattern.
+This function can accept patterns with NUL bytes inside because.
+The
+.Fn regwcomp
+and
+.Fn regwncomp
+functions work like the two former ones but take the pattern in
+the wide string form.
+.Pp
+The
 .Fn regexec
-matches that internal form against a string and reports results,
-.Fn regerror
-transforms error codes from either into human-readable messages,
+function matches that internal form against a string and reports results.
+The
+.Fn regnexec
+function works in the same way but takes another argument to specify
+the length of the pattern,
+allowing NUL bytes in the input string.
+Besides,
+for long inputs strings it is more efficient to call this function if
+the length is already known beause it will not require the matcher to
+calculate the length and read the input bytes one by one.
+The
+.Fn regwexec
 and
+.Fn regwnexec
+functions work like the two former ones but take the input as a
+wide string.
+.Pp
+The
+.Fn regerror
+function transforms error codes from the above functions into
+human-readable messages.
+.Pp
+The
 .Fn regfree
-frees any dynamically-allocated storage used by the internal form
-of an RE.
+function frees any dynamically-allocated storage used by the internal form
+of a regular expression.
 .Pp
 The header
 .In regex.h
@@ -87,8 +154,8 @@ declares two structure types,
 .Ft regex_t
 and
 .Ft regmatch_t ,
-the former for compiled internal forms and the latter for match reporting.
-It also declares the four functions,
+the former for compiled internal forms and the latter for submatch reporting.
+It also declares the functions mentioned above,
 a type
 .Ft regoff_t ,
 and a number of constants with names starting with
@@ -96,8 +163,7 @@ and a number of constants with names sta
 .Pp
 The
 .Fn regcomp
-function
-compiles the regular expression contained in the
+family of functions compile the regular expression contained in the
 .Fa pattern
 string,
 subject to the flags in
@@ -106,19 +172,19 @@ and places the results in the
 .Ft regex_t
 structure pointed to by
 .Fa preg .
+Some variants of the function also take the length of the pattern in
+.Fa len.
 The
 .Fa cflags
 argument
 is the bitwise OR of zero or more of the following flags:
 .Bl -tag -width REG_EXTENDED
 .It Dv REG_EXTENDED
-Compile modern
-.Pq Dq extended
-REs,
-rather than the obsolete
-.Pq Dq basic
-REs that
-are the default.
+Compile extended regular expressions
+.Pq Dq EREs ,
+rather than the obsolete basic regular expressions
+.Pq Dq BREs
+that are the default.
 .It Dv REG_BASIC
 This is a synonym for 0,
 provided as a counterpart to
@@ -127,31 +193,26 @@ to improve readability.
 .It Dv REG_NOSPEC
 Compile with recognition of all special characters turned off.
 All characters are thus considered ordinary,
-so the
-.Dq RE
-is a literal string.
-This is an extension,
-compatible with but not specified by
-.St -p1003.2 ,
-and should be used with
-caution in software intended to be portable to other systems.
-.Dv REG_EXTENDED
-and
+so the reqular expression is a literal string.
+.It Dv REG_LITERAL
+Synonim for
+.Dv REG_NOSPEC.
+.It Dv REG_EXTENDED
+may not be used together with
 .Dv REG_NOSPEC
-may not be used
+or
+.Dv REG_LITERAL
 in the same call to
 .Fn regcomp .
 .It Dv REG_ICASE
 Compile for matching that ignores upper/lower case distinctions.
-See
-.Xr re_format 7 .
 .It Dv REG_NOSUB
 Compile for matching that need only report success or failure,
 not what was matched.
 .It Dv REG_NEWLINE
 Compile for newline-sensitive matching.
 By default, newline is a completely ordinary character with no special
-meaning in either REs or strings.
+meaning in either regular expressins or strings.
 With this flag,
 .Ql [^
 bracket expressions and
@@ -170,66 +231,79 @@ The regular expression ends,
 not at the first NUL,
 but just before the character pointed to by the
 .Va re_endp
+or
+.Va re_wendp
 member of the structure pointed to by
 .Fa preg .
+The former is used for the functions that take a single- or multi-byte
+string,
+while the second is used for those taking a wide string.
 The
 .Va re_endp
 member is of type
-.Ft "const char *" .
-This flag permits inclusion of NULs in the RE;
+.Ft "const char *"
+and the
+.Va re_wendp
+member is of type
+.Ft "const wchar_t *" .
+This flag permits inclusion of NULs in the regular expression;
 they are considered ordinary characters.
-This is an extension,
-compatible with but not specified by
-.St -p1003.2 ,
-and should be used with
-caution in software intended to be portable to other systems.
 .El
 .Pp
 When successful,
+the
 .Fn regcomp
-returns 0 and fills in the structure pointed to by
+family of functions returns
+.Dv REG_OK
+and fills in the structure pointed to by
 .Fa preg .
-One member of that structure
-(other than
-.Va re_endp )
-is publicized:
+The
 .Va re_nsub ,
-of type
+member of the structure of type
 .Ft size_t ,
-contains the number of parenthesized subexpressions within the RE
-(except that the value of this member is undefined if the
+contains the number of parenthesized subexpressions within the regular
+expression (except when the
 .Dv REG_NOSUB
-flag was used).
+flag was used for the compilation of the pattern).
 If
 .Fn regcomp
 fails, it returns a non-zero error code;
 see
-.Sx DIAGNOSTICS .
+.Sx RETURN VALUES .
 .Pp
 The
 .Fn regexec
-function
-matches the compiled RE pointed to by
+family of functions match the compiled regular expression pointed to by
 .Fa preg
 against the
-.Fa string ,
+.Fa string
+(possibly having a length of
+.Fa len
+when using the variants that take the input length),
 subject to the flags in
 .Fa eflags ,
-and reports results using
+and reports match through its return value.
+The
 .Fa nmatch ,
 .Fa pmatch ,
-and the returned value.
-The RE must have been compiled by a previous invocation of
-.Fn regcomp .
+arguments are also filled in to hold submatches unless the pattern was
+compiled using the
+.Dv REG_NOSUB
+falg.
+The regular expression must have been compiled by a previous invocation of
+.Fn regcomp
+or any of its alternative forms.
 The compiled form is not altered during execution of
-.Fn regexec ,
-so a single compiled RE can be used simultaneously by multiple threads.
+.Fn regexec
+or its alternatives,
+so a single compiled regular expression can be used simultaneously by
+multiple threads,
+and it can be used with any variant of the
+.Fn regexec
+functions.
+(I.e. a multi-byte pattern can be matched to wide string input and
+vice versa.)
 .Pp
-By default,
-the NUL-terminated string pointed to by
-.Fa string
-is considered to be the text of an entire line, minus any terminating
-newline.
 The
 .Fa eflags
 argument is the bitwise OR of zero or more of the following flags:
@@ -266,11 +340,6 @@ See below for the definition of
 .Fa pmatch
 and
 .Fa nmatch .
-This is an extension,
-compatible with but not specified by
-.St -p1003.2 ,
-and should be used with
-caution in software intended to be portable to other systems.
 Note that a non-zero
 .Va rm_so
 does not imply
@@ -278,22 +347,17 @@ does not imply
 .Dv REG_STARTEND
 affects only the location of the string,
 not how it is matched.
-.El
 .Pp
+The function indicates a match by returning
+.Dv REG_OK ,
+no match with
+.Dv REG_NOMATCH ,
+or returns an error code different from the above two values
+if an error has occured during the execution.
 See
-.Xr re_format 7
-for a discussion of what is matched in situations where an RE or a
-portion thereof could match any of several substrings of
-.Fa string .
-.Pp
-Normally,
-.Fn regexec
-returns 0 for success and the non-zero code
-.Dv REG_NOMATCH
-for failure.
-Other non-zero error codes may be returned in exceptional situations;
-see
-.Sx DIAGNOSTICS .
+.Sx RETURN VALUES
+for the detailed description of error codes.
+.El
 .Pp
 If
 .Dv REG_NOSUB
@@ -338,16 +402,16 @@ array is filled in to indicate what subs
 .Fa string
 was matched by the entire RE.
 Remaining members report what substring was matched by parenthesized
-subexpressions within the RE;
+subexpressions within the regular expression;
 member
 .Va i
 reports subexpression
 .Va i ,
 with subexpressions counted (starting at 1) by the order of their opening
-parentheses in the RE, left to right.
+parentheses in the regular expression, left to right.
 Unused entries in the array (corresponding either to subexpressions that
 did not participate in the match at all, or to subexpressions that do not
-exist in the RE (that is,
+exist in the regular expression (that is,
 .Va i
 >
 .Fa preg Ns -> Ns Va re_nsub ) )
@@ -358,7 +422,7 @@ and
 set to -1.
 If a subexpression participated in the match several times,
 the reported substring is the last one it matched.
-(Note, as an example in particular, that when the RE
+(Note, as an example in particular, that when the regular expression
 .Ql "(b*)+"
 matches
 .Ql bbb ,
@@ -443,55 +507,15 @@ is 0,
 .Fa errbuf
 is ignored but the return value is still correct.
 .Pp
-If the
-.Fa errcode
-given to
-.Fn regerror
-is first ORed with
-.Dv REG_ITOA ,
-the
-.Dq message
-that results is the printable name of the error code,
-e.g.\&
-.Dq Dv REG_NOMATCH ,
-rather than an explanation thereof.
-If
-.Fa errcode
-is
-.Dv REG_ATOI ,
-then
-.Fa preg
-shall be
-.No non\- Ns Dv NULL
-and the
-.Va re_endp
-member of the structure it points to
-must point to the printable name of an error code;
-in this case, the result in
-.Fa errbuf
-is the decimal digits of
-the numeric value of the error code
-(0 if the name is not recognized).
-.Dv REG_ITOA
-and
-.Dv REG_ATOI
-are intended primarily as debugging facilities;
-they are extensions,
-compatible with but not specified by
-.St -p1003.2 ,
-and should be used with
-caution in software intended to be portable to other systems.
-Be warned also that they are considered experimental and changes are possible.
-.Pp
 The
 .Fn regfree
 function
-frees any dynamically-allocated storage associated with the compiled RE
-pointed to by
+frees any dynamically-allocated storage associated with the compiled
+regular expression pointed to by
 .Fa preg .
 The remaining
 .Ft regex_t
-is no longer a valid compiled RE
+is no longer a valid compiled regular expression
 and the effect of supplying it to
 .Fn regexec
 or
@@ -500,148 +524,67 @@ is undefined.
 .Pp
 None of these functions references global variables except for tables
 of constants;
-all are safe for use from multiple threads if the arguments are safe.
-.Sh IMPLEMENTATION CHOICES
-There are a number of decisions that
-.St -p1003.2
-leaves up to the implementor,
-either by explicitly saying
-.Dq undefined
-or by virtue of them being
-forbidden by the RE grammar.
-This implementation treats them as follows.
-.Pp
-See
-.Xr re_format 7
-for a discussion of the definition of case-independent matching.
-.Pp
-There is no particular limit on the length of REs,
-except insofar as memory is limited.
-Memory usage is approximately linear in RE size, and largely insensitive
-to RE complexity, except for bounded repetitions.
-See
-.Sx BUGS
-for one short RE using them
-that will run almost any system out of memory.
-.Pp
-A backslashed character other than one specifically given a magic meaning
-by
-.St -p1003.2
-(such magic meanings occur only in obsolete
-.Bq Dq basic
-REs)
-is taken as an ordinary character.
-.Pp
-Any unmatched
-.Ql [\&
-is a
-.Dv REG_EBRACK
-error.
-.Pp
-Equivalence classes cannot begin or end bracket-expression ranges.
-The endpoint of one range cannot begin another.
-.Pp
-.Dv RE_DUP_MAX ,
-the limit on repetition counts in bounded repetitions, is 255.
-.Pp
-A repetition operator
-.Ql ( ?\& ,
-.Ql *\& ,
-.Ql +\& ,
-or bounds)
-cannot follow another
-repetition operator.
-A repetition operator cannot begin an expression or subexpression
-or follow
-.Ql ^\&
-or
-.Ql |\& .
-.Pp
-.Ql |\&
-cannot appear first or last in a (sub)expression or after another
-.Ql |\& ,
-i.e., an operand of
-.Ql |\&
-cannot be an empty subexpression.
-An empty parenthesized subexpression,
-.Ql "()" ,
-is legal and matches an
-empty (sub)string.
-An empty string is not a legal RE.
-.Pp
-A
-.Ql {\&
-followed by a digit is considered the beginning of bounds for a
-bounded repetition, which must then follow the syntax for bounds.
-A
-.Ql {\&
-.Em not
-followed by a digit is considered an ordinary character.
-.Pp
-.Ql ^\&
-and
-.Ql $\&
-beginning and ending subexpressions in obsolete
-.Pq Dq basic
-REs are anchors, not ordinary characters.
-.Sh DIAGNOSTICS
-Non-zero error codes from
+thus all of them are thread-safe.
+.Sh RETURN VALUES
+Non-zero error codes from the
 .Fn regcomp
 and
 .Fn regexec
+family of functions
 include the following:
 .Pp
 .Bl -tag -width REG_ECOLLATE -compact
+.It Dv REG_OK
+Operation successfully executed.
+Synonim for 0,
+to provide better code readability.
 .It Dv REG_NOMATCH
 The
 .Fn regexec
-function
-failed to match
+functions
+failed to match.
 .It Dv REG_BADPAT
-invalid regular expression
+Invalid regular expression.
+This implementation only returns this code when the regular expression
+passed to
+.Fn regcomp
+contains an illegal multibyte sequence.
 .It Dv REG_ECOLLATE
-invalid collating element
+Invalid collating element.
+Returned whenever equivalence classes or multicharacter collating elements
+are used in a bracket expression.
+.Pq They are not supported yet.
 .It Dv REG_ECTYPE
-invalid character class
+Invalid character class name.
 .It Dv REG_EESCAPE
-.Ql \e
-applied to unescapable character
+The last character was a backslash.
 .It Dv REG_ESUBREG
-invalid backreference number
+Invalid backreference number.
 .It Dv REG_EBRACK
-brackets
+Brackets
 .Ql "[ ]"
-not balanced
+not balanced.
 .It Dv REG_EPAREN
-parentheses
+Parentheses
 .Ql "( )"
-not balanced
+not balanced.
 .It Dv REG_EBRACE
-braces
+Braces
 .Ql "{ }"
-not balanced
+not balanced.
 .It Dv REG_BADBR
-invalid repetition count(s) in
-.Ql "{ }"
+Invalid repetition count(s) in
+.Ql "{ }" :
+not a number, more than two numbers, first larger than second, or number too large.
 .It Dv REG_ERANGE
-invalid character range in
-.Ql "[ ]"
+Invalid character range in
+.Ql "[ ]" ,
+i.e. ending point is earlier in the collating order than the starting point.
 .It Dv REG_ESPACE
-ran out of memory
+Out of memory.
 .It Dv REG_BADRPT
-.Ql ?\& ,
-.Ql *\& ,
-or
-.Ql +\&
-operand invalid
-.It Dv REG_EMPTY
-empty (sub)expression
-.It Dv REG_ASSERT
-cannot happen - you found a bug
-.It Dv REG_INVARG
-invalid argument, e.g.\& negative-length string
-.It Dv REG_ILLSEQ
-illegal byte sequence (bad multibyte character)
+Invalid use of repetition operators: two or more repetition operators have been
+chained in an undefined way.
 .El
 .Sh SEE ALSO
 .Xr grep 1 ,
@@ -651,77 +594,55 @@ illegal byte sequence (bad multibyte cha
 sections 2.8 (Regular Expression Notation)
 and
 B.5 (C Binding for Regular Expression Matching).
-.Sh HISTORY
-Originally written by
-.An Henry Spencer .
-Altered for inclusion in the
-.Bx 4.4
-distribution.
-.Sh BUGS
-This is an alpha release with known defects.
-Please report problems.
-.Pp
-The back-reference code is subtle and doubts linger about its correctness
-in complex cases.
-.Pp
-The
-.Fn regexec
-function
-performance is poor.
-This will improve with later releases.
-The
-.Fa nmatch
-argument
-exceeding 0 is expensive;
-.Fa nmatch
-exceeding 1 is worse.
-The
-.Fn regexec
-function
-is largely insensitive to RE complexity
-.Em except
-that back
-references are massively expensive.
-RE length does matter; in particular, there is a strong speed bonus
-for keeping RE length under about 30 characters,
-with most special characters counting roughly double.
-.Pp
+.Sh STANDARDS
 The
-.Fn regcomp
-function
-implements bounded repetitions by macro expansion,
-which is costly in time and space if counts are large
-or bounded repetitions are nested.
-An RE like, say,
-.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
-will (eventually) run almost any existing machine out of swap space.
-.Pp
-There are suspected problems with response to obscure error conditions.
-Notably,
-certain kinds of internal overflow,
-produced only by truly enormous REs or by multiply nested bounded repetitions,
-are probably not handled well.
-.Pp
-Due to a mistake in
-.St -p1003.2 ,
-things like
-.Ql "a)b"
-are legal REs because
-.Ql )\&
-is
-a special character only in the presence of a previous unmatched
-.Ql (\& .
-This cannot be fixed until the spec is fixed.
-.Pp
-The standard's definition of back references is vague.
-For example, does
-.Ql "a\e(\e(b\e)*\e2\e)*d"
-match
-.Ql "abbbd" ?
-Until the standard is clarified,
-behavior in such cases should not be relied on.
-.Pp
-The implementation of word-boundary matching is a bit of a kludge,
-and bugs may lurk in combinations of word-boundary matching and anchoring.
-.Pp
-Word-boundary matching does not work properly in multibyte locales.
+.Fn regcomp ,
+.Fn regexec ,
+.Fn regerror
+and
+.Fn regfree
+functions,
+the header file
+.In regex.h
+and the two structure types
+.Ft regex_t
+and
+.Ft regmatch_t
+(except the
+.Va re_endp
+and
+.Va re_wendp
+fields),
+the type
+.Ft regoff_t ,
+the macros
+.Dv REG_EXTENDED ,
+.Dv REG_ICASE ,
+.Dv REG_NOSUB ,
+.Dv REG_NEWLINE ,
+.Dv REG_NOTBOL ,
+.Dv REG_NOTEOL
+and all the error codes except
+.Dv REG_OK
+conform to the standard
+.St -p1003.2 .
+.Pp
+The alternative forms of the functions taking the length of the input and/or
+taking wide strings, the flags that are not listed above, the
+.Va re_end
+and
+.Va re_wendp
+fields in
+.Ft regex_t
+and the
+.Dv REG_OK error code are extensions and thus are not expected to be
+portable.
+.Sh HISTORY
+This regex implementation comes from the TRE project
+and it was included first in
+.Fx 10-CURRENT.
+This manual was originally written by
+.An Henry Spencer
+for an older implementation and later extended and
+tailored or TRE by
+.An Gabor Kovesdan .