From: mfv
To: "James B. Byrne via freebsd-questions"
Cc: byrnejb@harte-lyne.ca
Date: Thu, 9 Nov 2017 16:36:44 -0500
Subject: Re: Regex character and collation calss documentation
Reply-To: mfv@bway.net

> On Wed, 2017-11-08 at 12:47 "James B. Byrne via freebsd-questions"
> wrote:
>
>I have been perusing the available documentation respecting regex on
>FreeBSD and cannot find a reference to [.NUL.]. Everything that I have
>found points to ctype.h.
>The only class names I can find therein are:
>
>int     isalnum(int);      [:alnum:]
>int     isalpha(int);      [:alpha:]
>int     iscntrl(int);      [:cntrl:]
>int     isdigit(int);      [:digit:]
>int     isgraph(int);      [:graph:]
>int     islower(int);      [:lower:]
>int     isprint(int);      [:print:]
>int     ispunct(int);      [:punct:]
>int     isspace(int);      [:space:]
>int     isupper(int);      [:upper:]
>int     isxdigit(int);     [:xdigit:]
>
>From reading the reference at
>https://docs.freebsd.org/info/regex/regex.pdf and comparing it to the
>uncommented lines in ctype.h on my FreeBSD-11.1 desktop host one could
>reasonably deduce that the following should be available on FreeBSD in
>addition to the above:
>
>int     isascii(int);      [:ascii:]
>
>int     isblank(int);      [:blank:]
>
>int     ishexnumber(int);  [:hexnumber:]
>int     isideogram(int);   [:ideogram:]
>int     isnumber(int);     [:number:]
>int     isphonogram(int);  [:phonogram:]
>int     isrune(int);       [:rune:]
>int     isspecial(int);    [:special:]
>
>But of these only [[:blank:]] is recognized by grep; whatever else
>might employ the rest.
>
>[[:ascii:]]
>grep: Invalid character class name
>[[:hexnumber:]]
>grep: Invalid character class name
>[[:ideogram:]]
>grep: Invalid character class name
>[[:number:]]
>grep: Invalid character class name
>[[:phonogram:]]
>grep: Invalid character class name
>[[:rune:]]
>grep: Invalid character class name
>[[:special:]]
>grep: Invalid character class name
>
>
>However I see no reference to [.NUL.] anywhere. The sed man page has
>no reference to nul or NUL at all and tr only has this to say:
>
>  The tr utility has historically not permitted the manipulation
>  of NUL bytes in its input and, additionally, stripped NUL's from
>  its input stream. This implementation has removed this behavior
>  as a bug.
>
>
>Is there a master list of character/collation classes for FreeBSD
>regex? I have read the man pages for grep and re_format. In no case
>is the character or collation class NUL mentioned.
>
>Where is the usage of [.NUL.] documented?
>

Hello James,

This may help you with a bit of hacking.
I asked myself the same question but could not find a satisfactory
answer.  After remembering that "man ascii" lists names for all
non-printable ASCII characters, I placed some of these characters in a
text file and then removed them by name.  Thus:

  - the character ^@ was removed using [[.NUL.]]
  - the character ^G was removed using [[.BEL.]]
  - the character ^F was removed using [[.ACK.]]
  - etc.

I did not try all non-printable characters, but a large sampling
followed this pattern.  Trying to use SP for a space produced the
following error:

  sed: 1: "/[[.SP.]]/d": RE error: invalid collating element

Perhaps there are other exceptions similar to SP.

This syntax also recognises printable characters.  For example, the
character 'A' was removed using 's/[[.A.]]//g'.

I would have preferred some formal documentation on this matter but,
like yourself, am still searching.

Cheers ...

Marek
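
The experiment described above can be sketched roughly as follows.
This is only an illustration, not the exact commands that were run: the
sample text and the sample() helper are made up for the example, and
the names BEL and ACK with their octal codes 007 and 006 are taken from
"man ascii".  The collating-symbol names were accepted by sed on
FreeBSD in the test above; other regex libraries may reject them, so
the sketch falls back to tr with octal codes when sed reports an error.

```shell
#!/bin/sh
# Hypothetical sample input: 'a', BEL (^G, octal 007), 'b',
# ACK (^F, octal 006), 'c', newline.
sample() { printf 'a\007b\006c\n'; }

# Try the symbolic names first.  If the local regex library does not
# know them, sed exits non-zero and the else branch runs instead.
if cleaned=$(sample | sed 's/[[.BEL.]]//g; s/[[.ACK.]]//g' 2>/dev/null); then
    method='sed collating symbols'
else
    # Assumed fallback: delete the same two bytes by octal code.
    cleaned=$(sample | tr -d '\007\006')
    method='tr octal fallback'
fi

printf '%s (%s)\n' "$cleaned" "$method"
```

Either branch should leave only the printable letters; the text in
parentheses shows which method the local tools accepted.  NUL is left
out of the sketch on purpose, since shell command substitution does not
carry NUL bytes reliably.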