Date: Fri, 2 Sep 2011 08:03:38 +0200
From: Wolfgang Zenker <wolfgang@lyxys.ka.sub.org>
To: Gabor Kovesdan <gabor@freebsd.org>
Cc: Andrey Chernov <ache@nagual.pp.ru>, freebsd-standards@freebsd.org, freebsd-i18n@freebsd.org
Subject: Re: POSIX regex VS. multi-byte characters
Message-ID: <20110902060338.GA8192@lyxys.ka.sub.org>
In-Reply-To: <4E603AA3.1040204@FreeBSD.org>
References: <4E603AA3.1040204@FreeBSD.org>

Hi Gabor,

* Gabor Kovesdan <gabor@freebsd.org> [110902 04:08]:
> While working on bringing in a new regex code to FreeBSD, I came into an
> issue. POSIX says here:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

> "Matching shall be based on the bit pattern used for encoding the
> character, not on the graphic representation of the character. This
> means that if a character set contains two or more encodings for a
> graphic symbol, or if the strings searched contain text encoded in more
> than one codeset, no attempt is made to search for any other
> representation of the encoded symbol. If that is required, the user can
> specify equivalence classes containing all variations of the desired
> graphic symbol."

> According to my interpretation of this text, if someone specifies a
> single bit as pattern that can be a prefix of a multi-byte character
> that shall match, since match is based on bit pattern not semantical
> meaning. Besides, in a consistent environment that uses a single
> encoding and also supposing a user with common sense that would not
> enter meaningless input, only whole characters should occur in the
> pattern. However, GNU grep has a test in its regression test suite that
> contradicts to this and chooses the opposite approach, i.e. it shall not
> match a fragment of a character. Looking at the standard, I think GNU
> grep is incorrect and my interpretation is the correct one.

I think you are misinterpreting the standard here. As I read it, the
phrase "bit pattern used for encoding the character" means the complete
byte sequence that encodes the character, not just a single byte. The
paragraph quoted above is about characters that have several different
encodings, for example characters that exist as a single codepoint but
can also be encoded as a base character plus diacritical marks.

Wolfgang
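
To make the two readings concrete, here is a minimal sketch in Python. It assumes UTF-8 as the encoding and uses Python's `re` module purely as a stand-in; `re` is not a POSIX regex implementation, and the variable names are made up for this illustration. It shows what matching a lone lead byte would look like, what whole-character matching looks like, and the case the quoted POSIX paragraph appears to be about: the same graphic symbol encoded in two different ways.

```python
import re
import unicodedata

# "e with acute accent" as a single precomposed codepoint (U+00E9).
precomposed = "\u00e9"
encoded = precomposed.encode("utf-8")   # two bytes: 0xC3 0xA9

# Byte-oriented reading: a pattern consisting only of the lead byte
# 0xC3 matches inside the encoded character.
byte_hit = re.search(b"\xc3", encoded) is not None

# Character-oriented reading: the pattern is a whole character and is
# compared against whole characters in the subject string.
char_hit = re.search(precomposed, precomposed) is not None

# The same graphic symbol as base letter + combining acute accent
# (U+0065 U+0301), obtained here via NFD normalization.
decomposed = unicodedata.normalize("NFD", precomposed)

# Matching is on the encoding, not the graphic representation, so the
# precomposed pattern does not find the decomposed form.
cross_hit = re.search(precomposed, decomposed) is not None

print(byte_hit, char_hit, cross_hit)
```

Under these assumptions the byte-level search succeeds on a character fragment, the whole-character search succeeds, and the search across the two representations fails, which is the behaviour the last sentence of the quoted paragraph suggests working around with equivalence classes.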