Date: Wed, 09 Sep 2009 08:16:09 -0700
From: Tim Kientzle <kientzle@freebsd.org>
To: Andrey Chernov <ache@nagual.pp.ru>, Roman Divacky <rdivacky@freebsd.org>,
    src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r196981 - head/usr.bin/unzip
Message-ID: <4AA7C6B9.1020600@freebsd.org>
In-Reply-To: <20090909132616.GA35808@nagual.pp.ru>
References: <200909081555.n88FtDwe052523@svn.freebsd.org>
    <20090909132616.GA35808@nagual.pp.ru>

Andrey Chernov wrote:
> On Tue, Sep 08, 2009 at 03:55:13PM +0000, Roman Divacky wrote:
>> +         * Detect whether this is a text file. ... but libarchive
>> +         * does not read the central directory, so we have to
>> +         * guess ...
>> +         */
>> +        if (a_opt && n == 0) {
>> +            for (p = buffer; p < end; ++p) {
>> +                if (!isascii((unsigned char)*p)) {
>> +                    text = 0;
>> +                    break;
>> +                }
>> +            }
>> +        }
>> +
>
> If I understand the purpose of this code right, better use
> isalnum()+ispunct()+isspace()
> combination to count non-ASCII people too.
> Also setlocale() call must be added to the main() for that.

Personally, I would rather see unzip just ignore the -a
option entirely, but I suppose that's probably infeasible.

Since this is only to support -a (which does end-of-line
conversions), I would suggest using a rather different
set of heuristics that examines end-of-line sequences
and control characters only:

 * Any byte value <31 that's not CR or LF: not text
 * LF not preceded by CR: not text
 * CR not followed by LF: not text (or at least, not DOS text)
 * Otherwise, it is text.

At a minimum, this dodges the locale issue.

Someday, I'll get around to filling in the seek support
that libarchive needs for reading central directories,
then unzip can look at the "text file" bit (which is no
more reliable than anything described above) and this
code can just go away.

Tim
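
For illustration, the end-of-line heuristic listed above could be
sketched in C roughly as follows. This is not code from unzip or
libarchive: the function name looks_like_dos_text() and its signature
are hypothetical, and the sketch assumes the whole buffer to be
examined is in memory, so a CR in the last position is treated as
"not followed by LF". The threshold is kept at <31 exactly as written
in the message.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of unzip or libarchive.
 * Returns 1 if the buffer looks like DOS text under the heuristic
 * from the message, 0 otherwise.  No ctype or locale calls are used.
 */
int
looks_like_dos_text(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		unsigned char c = buf[i];

		if (c == '\r') {
			/* CR not followed by LF: not (DOS) text. */
			if (i + 1 >= len || buf[i + 1] != '\n')
				return (0);
			i++;	/* consume the LF of this CR LF pair */
		} else if (c == '\n') {
			/* LF not preceded by CR: not text. */
			return (0);
		} else if (c < 31) {
			/* Any other low byte value: not text. */
			return (0);
		}
	}
	/* Otherwise, it is text.  Bytes >= 0x80 are accepted. */
	return (1);
}
```

Since only end-of-line sequences and low byte values are examined,
high-bit bytes pass through untouched and no setlocale() call is
needed. Note that, read literally, the <31 rule also rejects TAB; a
real implementation would have to decide whether that is intended.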