Date:      Wed, 18 Jun 2008 12:40:24 +0200
From:      Dag-Erling Smørgrav <des@des.no>
To:        Konrad Jankowski <konrad.jankowski@bluemedia.pl>
Cc:        Doug Barton <dougb@FreeBSD.org>, current@FreeBSD.org, Andrey Chernov <ache@nagual.pp.ru>, Diomidis Spinellis <dds@aueb.gr>, hackers@FreeBSD.org, Gabor Kovesdan <gabor@FreeBSD.org>, Max Khon <fjoe@samodelkin.net>, "Sean C. Farley" <scf@FreeBSD.org>, Kövesdán Gábor <gabor@t-hosting.hu>
Subject:   Re: CFT: BSD-licensed grep [Fwd: cvs commit: ports/textproc/bsdgrep Makefile distinfo]
Message-ID:  <86skvbc9gn.fsf@ds4.des.no>
In-Reply-To: <4858DBF6.5070001@bluemedia.pl> (Konrad Jankowski's message of "Wed, 18 Jun 2008 11:57:10 +0200")
References:  <20080617002224.GA16122@nagual.pp.ru> <20080617002808.GB16122@nagual.pp.ru> <20080617004647.GA16546@nagual.pp.ru> <48576610.9080808@FreeBSD.org> <48577510.4020007@aueb.gr> <48577BD2.4070205@bluemedia.pl> <20080617102900.GA46479@nagual.pp.ru> <485798C4.2050605@FreeBSD.org> <20080618055851.GA85018@nagual.pp.ru> <86zlpjduew.fsf@ds4.des.no> <20080618083739.GA87100@nagual.pp.ru> <867icndqv5.fsf@ds4.des.no> <4858DBF6.5070001@bluemedia.pl>

Konrad Jankowski <konrad.jankowski@bluemedia.pl> writes:
> Dag-Erling Smørgrav <des@des.no> writes:
> > In any case, this is a libc issue, right?  As long as sort / grep
> > uses the API correctly, they will work fine once libc is fixed?
> Correct, given that sort uses strcoll()/wcscoll()/strxfrm()/wcsxfrm()
> and calls setlocale().  I don't know about grep.

For grep, I believe it should simply be a matter of calling setlocale(),
using wide strings, and using a multibyte regex engine (for appropriate
values of "simply").
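
As a rough illustration of the "simple" case, here is a minimal sketch that
leans on the POSIX regcomp()/regexec() interface, which is specified to
interpret patterns and input according to the current locale.  The helper
name line_matches() is hypothetical, and the sketch assumes the libc regex
engine really is multibyte-aware for the locale in use, which is exactly the
part that depends on libc being fixed.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from bsdgrep): returns 1 if line matches the
 * extended regular expression pattern, 0 if it does not, and -1 if the
 * pattern cannot be compiled or matching fails for another reason.
 */
int line_matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we only need a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

A caller would be expected to run setlocale(LC_ALL, "") once at startup,
before the first call to line_matches(), so that LC_CTYPE and LC_COLLATE
reflect the user's environment rather than the default "C" locale.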

Another thing I'm unsure about is the matter of input and output.  Do
mbstowcs() / mbtowc() simply trust the input to conform to LC_CTYPE and
convert accordingly?  When reading UTF input, do they recognize and
handle BOMs, or simply treat them as zero-width no-break spaces?  In the
absence of a BOM, do they assume that the input follows the system's
native byte order?
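
One way to probe these questions empirically is a small decoder built on
mbrtowc(), which the C standard says returns (size_t)-1 for a byte sequence
that is invalid in the current LC_CTYPE encoding and (size_t)-2 for one that
is incomplete.  The helper name decode_line() is hypothetical, and the sketch
makes no assumption about BOM handling: a BOM would either appear in the
output as U+FEFF or not, which can then be checked directly.

```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode the multibyte string s into out under the
 * current LC_CTYPE, storing at most cap wide characters.  Returns the
 * number of wide characters stored, or -1 if the input contains a byte
 * sequence that is invalid or incomplete in that encoding.
 */
long decode_line(const char *s, wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t left = strlen(s);
    size_t n = 0;

    memset(&st, 0, sizeof st);      /* begin in the initial conversion state */

    while (left > 0 && n < cap) {
        size_t used = mbrtowc(&out[n], s, left, &st);

        if (used == (size_t)-1 || used == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (used == 0)
            break;                  /* decoded a null wide character */

        s += used;
        left -= used;
        n++;
    }
    return (long)n;
}
```

Running this after setlocale(LC_CTYPE, "") on input that does not match the
locale's encoding should show whether the conversion trusts the input or
rejects it.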

(IMHO, the API is broken, since there is no way for the same program to
simultaneously handle streams with different encodings, but I guess it's
too late to fix that)

DES
-- 
Dag-Erling Smørgrav - des@des.no
