Date: Thu, 6 Oct 2011 11:20:21 +0000 (UTC) From: Gabor Kovesdan <gabor@FreeBSD.org> To: src-committers@freebsd.org, svn-src-user@freebsd.org Subject: svn commit: r226056 - user/gabor/tre-integration/lib/libc/regex Message-ID: <201110061120.p96BKLJN058310@svn.freebsd.org>
Author: gabor Date: Thu Oct 6 11:20:21 2011 New Revision: 226056 URL: http://svn.freebsd.org/changeset/base/226056 Log: - Clean up and update manual page Modified: user/gabor/tre-integration/lib/libc/regex/re_format.7 Modified: user/gabor/tre-integration/lib/libc/regex/re_format.7 ============================================================================== --- user/gabor/tre-integration/lib/libc/regex/re_format.7 Thu Oct 6 11:17:54 2011 (r226055) +++ user/gabor/tre-integration/lib/libc/regex/re_format.7 Thu Oct 6 11:20:21 2011 (r226056) @@ -1,3 +1,4 @@ +.\" Copyright (c) 2011 Gabor Kovesdan <gabor@FreeBSD.org>. .\" Copyright (c) 1992, 1993, 1994 Henry Spencer. .\" Copyright (c) 1992, 1993, 1994 .\" The Regents of the University of California. All rights reserved. @@ -36,7 +37,7 @@ .\" @(#)re_format.7 8.3 (Berkeley) 3/20/94 .\" $FreeBSD$ .\" -.Dd March 20, 1994 +.Dd October 6, 2011 .Dt RE_FORMAT 7 .Os .Sh NAME @@ -48,32 +49,33 @@ Regular expressions as defined in .St -p1003.2 , come in two forms: -modern REs (roughly those of +modern regular expressions (roughly those of .Xr egrep 1 ; 1003.2 calls these .Dq extended -REs) -and obsolete REs (roughly those of +regular expressions or +.Dq EREs ) +and obsolete regular expressions (roughly those of .Xr ed 1 ; -1003.2 +1003.2 calls these .Dq basic -REs). -Obsolete REs mostly exist for backward compatibility in some old programs; +regular expressions or +.Dq BREs ) . +BREs mostly exist for backward compatibility in some old programs; they will be discussed at the end. .St -p1003.2 -leaves some aspects of RE syntax and semantics open; -`\(dd' marks decisions on these aspects that -may not be fully portable to other -.St -p1003.2 -implementations. +leaves some aspects of regular expression syntax and semantics open, +so this manual will describe the behavior of this implementation +instead of just reproducing the same information that is already +available in the standard. 
.Pp -A (modern) RE is one\(dd or more non-empty\(dd +An extended regular expression is one or more non-empty .Em branches , separated by .Ql \&| . It matches anything that matches one of the branches. .Pp -A branch is one\(dd or more +A branch is one or more .Em pieces , concatenated. It matches a match for the first, followed by a match for the second, etc. @@ -81,7 +83,7 @@ It matches a match for the first, follow A piece is an .Em atom possibly followed -by a single\(dd +by a single .Ql \&* , .Ql \&+ , .Ql \&? , @@ -99,42 +101,30 @@ matches a sequence of 0 or 1 matches of .Pp A .Em bound -is -.Ql \&{ -followed by an unsigned decimal integer, -possibly followed by -.Ql \&, -possibly followed by another unsigned decimal integer, -always followed by -.Ql \&} . +is an expression that allows the repetition of the atom +according to the specified constraints. +A +.Em bound +starts with an opening brace +.Pq Ql \&{ +character, followed by an unsigned decimal integer, an optional comma +.Pq Ql \&, +followed by another unsigned decimal integer, +always followed by a closing brace +.Pq Ql \&} . The integers must lie between 0 and .Dv RE_DUP_MAX -(255\(dd) inclusive, -and if there are two of them, the first may not exceed the second. -An atom followed by a bound containing one integer -.Em i -and no comma matches -a sequence of exactly -.Em i -matches of the atom. -An atom followed by a bound -containing one integer -.Em i -and a comma matches -a sequence of -.Em i -or more matches of the atom. -An atom followed by a bound -containing two integers -.Em i -and -.Em j -matches -a sequence of -.Em i -through -.Em j -(inclusive) matches of the atom. +inclusive. +The integers restrict the minimum and maximum repetition count of the atom +and the first number may not exceed the second. +The second integer is optional and if it is missing but the comma is present, +there is no upper limit of the repetition. 
+If there is only one integer specified and the comma is also missing, +exactly the specified number of repetitions is required. +In this implementation, +it is also possible to leave out the first integer and only specify the +comma and the upper limit. +In this case 0 is implied as a minimum repetition count. .Pp An atom is a regular expression enclosed in .Ql () @@ -142,7 +132,7 @@ An atom is a regular expression enclosed regular expression), an empty set of .Ql () -(matching the null string)\(dd, +(matching the null string), a .Em bracket expression (see below), @@ -155,47 +145,46 @@ a .Ql \e followed by one of the characters .Ql ^.[$()|*+?{\e -(matching that character taken as an ordinary character), -a -.Ql \e -followed by any other character\(dd -(matching that character taken as an ordinary character, -as if the -.Ql \e -had not been present\(dd), -or a single character with no other significance (matching that character). +(matching the escaped character taken as an ordinary character) +or a single character with no other significance (matching the +same character). A .Ql \&{ followed by a character other than a digit is an ordinary -character, not the beginning of a bound\(dd. -It is illegal to end an RE with +character, not the beginning of a bound. +It is illegal to end a regular expression with .Ql \e . .Pp A .Em bracket expression is a list of characters enclosed in .Ql [] . -It normally matches any single character from the list (but see below). +It always matches a single character but the set of matching characters +is determined by more specific rules. If the list begins with .Ql \&^ , -it matches any single character -(but see below) -.Em not -from the rest of the list. -If two characters in the list are separated by +it matches any single character that is not present in the rest of the +list. +If the list does not begin with +.Ql \&^ , +normally all characters that are listed in the brackets will match. 
+An exception from this is the use of collating ranges. +If there is a .Ql \&- , -this is shorthand -for the full -.Em range -of characters between those two (inclusive) in the -collating sequence, -.No e.g. Ql [0-9] -in ASCII matches any decimal digit. -It is illegal\(dd for two ranges to share an -endpoint, -.No e.g. Ql a-c-e . -Ranges are very collating-sequence-dependent, -and portable programs should avoid relying on them. +which is not the first character in the bracket, +it will be interpreted as a collating range and will match all +characters that fall in between the preceding and following characters +(inclusive) in the current locale's collating order. +.No For example, Ql [a0-9] +in ASCII matches +.Ql a +or any decimal digit. +.No For example, Ql [^agh] +matches any character that is not +.Ql a , +.Ql g , +or +.Ql h . .Pp To include a literal .Ql \&] @@ -235,7 +224,7 @@ can thus match more than one character, e.g.\& if the collating sequence includes a .Ql ch collating element, -then the RE +then the regular expression .Ql [[.ch.]]*c matches the first five characters of @@ -263,7 +252,7 @@ then and .Ql [xy] are all synonymous. -An equivalence class may not\(dd be an endpoint +An equivalence class may not be an endpoint of a range. .Pp Within a bracket expression, the name of a @@ -284,7 +273,7 @@ Standard character class names are: .Pp These stand for the character classes defined in .Xr ctype 3 . -A locale may provide others. +A particular locale may provide others. A character class may not be used as an endpoint of a range. .Pp A bracketed expression like @@ -295,35 +284,16 @@ The reverse, matching any character that class, the negation operator of bracket expressions may be used: .Ql [^[:class:]] . .Pp -There are two special cases\(dd of bracket expressions: -the bracket expressions -.Ql [[:<:]] -and -.Ql [[:>:]] -match the null string at the beginning and end of a word respectively. 
-A word is defined as a sequence of word characters -which is neither preceded nor followed by -word characters. -A word character is an -.Em alnum -character (as defined by -.Xr ctype 3 ) -or an underscore. -This is an extension, -compatible with but not specified by -.St -p1003.2 , -and should be used with -caution in software intended to be portable to other systems. -.Pp -In the event that an RE could match more than one substring of a given -string, -the RE matches the one starting earliest in the string. -If the RE could match more than one substring starting at that point, +In the event that a regular expression could match more than one +substring of a given string, +the regular expression matches the one starting earliest in the string. +If the regular expression could match more than one substring starting +at that point, it matches the longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible, -with subexpressions starting earlier in the RE taking priority over -ones starting later. +with subexpressions starting earlier in the regular expression taking +priority over ones starting later. Note that higher-level subexpressions thus take priority over their lower-level component subexpressions. .Pp @@ -346,15 +316,14 @@ when .Ql (a*)* is matched against .Ql bc -both the whole RE and the parenthesized +both the whole regular expression and the parenthesized subexpression match the null string. .Pp -If case-independent matching is specified, -the effect is much as if all case distinctions had vanished from the -alphabet. -When an alphabetic that exists in multiple cases appears as an -ordinary character outside a bracket expression, it is effectively -transformed into a bracket expression containing both cases, +The effect of case-independent match is like as if all case distinctions +vanished from the alphabet. 
+It can also be modelled as if each and every character were replaced +by a bracket expression, +containing both cases of the same letter, .No e.g. Ql x becomes .Ql [xX] . @@ -368,15 +337,13 @@ and becomes .Ql [^xX] . .Pp -No particular limit is imposed on the length of REs\(dd. -Programs intended to be portable should not employ REs longer -than 256 bytes, -as an implementation can refuse to accept such REs and remain -POSIX-compliant. -.Pp -Obsolete -.Pq Dq basic -regular expressions differ in several respects. +No particular limit is imposed on the length of regular expressions. +Programs intended to be portable should not employ regular expressions +longer than 256 bytes, +as an implementation can refuse to accept such regular expressions and +remain POSIX-compliant. +.Pp +Basic regular expressions differ in several respects. .Ql \&| is an ordinary character and there is no equivalent for its functionality. @@ -391,7 +358,7 @@ or respectively). Also note that .Ql x+ -in modern REs is equivalent to +in extended regular expressions is equivalent to .Ql xx* . The delimiters for bounds are .Ql \e{ @@ -412,15 +379,15 @@ and .Ql \&) by themselves ordinary characters. .Ql \&^ -is an ordinary character except at the beginning of the -RE or\(dd the beginning of a parenthesized subexpression, +is an ordinary character except at the beginning of the regular expression +or the beginning of a parenthesized subexpression, .Ql \&$ is an ordinary character except at the end of the -RE or\(dd the end of a parenthesized subexpression, +regular expression or the end of a parenthesized subexpression, and .Ql \&* is an ordinary character if it appears at the beginning of the -RE or the beginning of a parenthesized subexpression +regular expression or the beginning of a parenthesized subexpression (after a possible leading .Ql \&^ ) . Finally, there is one new type of atom, a @@ -442,6 +409,9 @@ or .Ql cc but not .Ql bc . 
+.Pp +Back references are not defined for extended regular expressions but +most implementations (including this) implement them. .Sh SEE ALSO .Xr regex 3 .Rs @@ -450,34 +420,12 @@ but not .%N 1003.2 .%P section 2.8 .Re -.Sh BUGS -Having two kinds of REs is a botch. -.Pp -The current -.St -p1003.2 -spec says that -.Ql \&) -is an ordinary character in -the absence of an unmatched -.Ql \&( ; -this was an unintentional result of a wording error, -and change is likely. -Avoid relying on it. -.Pp -Back references are a dreadful botch, -posing major problems for efficient implementations. -They are also somewhat vaguely defined -(does -.Ql a\e(\e(b\e)*\e2\e)*d -match -.Ql abbbd ? ) . -Avoid using them. -.Pp -.St -p1003.2 -specification of case-independent matching is vague. -The -.Dq one case implies all cases -definition given above -is current consensus among implementors as to the right interpretation. -.Pp -The syntax for word boundaries is incredibly ugly. +.Sh HISTORY +This manual was originally written by +.An Henry Spencer +for an older implementation and later extended and +tailored for TRE by +.An Gabor Kovesdan . +The regex implementation comes from the TRE project +and it was included first in +.Fx 10-CURRENT.
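The bound forms described in the updated manual page ({i}, {i,}, {i,j}, and the {,j} form this implementation also accepts) can be tried out with a short script. The sketch below uses Python's re module purely as a stand-in: it is not the TRE or libc implementation, and it does not follow POSIX leftmost-longest rules in general. For a single atom followed by a bound and matched at the start of the string, though, greedy matching yields the longest match, so the lengths shown should agree with the text above. The helper name bound_match_length is hypothetical.

```python
import re


def bound_match_length(pattern: str, text: str) -> int:
    """Return the length of the match anchored at the start of text, or -1.

    Assumption: Python's re is used only for illustration; it is not the
    POSIX/TRE matcher that re_format(7) documents.
    """
    m = re.match(pattern, text)
    return m.end() if m else -1


if __name__ == "__main__":
    # {i}: exactly i repetitions of the atom
    print(bound_match_length(r"a{2}", "aaaa"))
    # {i,}: i or more repetitions, no upper limit
    print(bound_match_length(r"a{2,}", "aaaa"))
    # {i,j}: between i and j repetitions, inclusive
    print(bound_match_length(r"a{2,3}", "aaaa"))
    # {,j}: first integer omitted, so 0 is implied as the minimum
    print(bound_match_length(r"a{,3}", "aaaa"))
    print(bound_match_length(r"a{,3}", "bbb"))
    # Too few repetitions: no match at all
    print(bound_match_length(r"a{2}", "ab"))
```

To check the actual libc behaviour, the same patterns would have to be compiled with regcomp(3) and REG_EXTENDED on a system built from this branch.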