From owner-freebsd-questions@FreeBSD.ORG Wed Nov 9 01:16:51 2011
Delivered-To: freebsd-questions@freebsd.org
Date: Tue, 8 Nov 2011 19:17:27 -0600 (CST)
From: Robert Bonomi
Message-Id: <201111090117.pA91HRDo065662@mail.r-bonomi.com>
To: conrads@cox.net
In-Reply-To: <20111108184236.3a78ebf6@cox.net>
Cc: freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
List-Id: User questions

On Tue, 8 Nov 2011 18:42:36 -0600, "Conrad J. Sabatier" wrote:
>
> I've been trying to understand what the deal is with regards to the
> displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> with the MSB set.

Quite simply, Unix dates from the days when the 8th bit was used as a
'parity' bit, allowing detection of *all* single-bit errors -- especially
over the notoriously unreliable connections known as 'serial ports'.

>
> More specifically, I'm trying to figure out how to get the "ls" command
> to properly display filenames containing characters in this extended
> set.  I have some MP3 files, for instance, whose names contain certain
> European characters, such as the lowercase "u" with umlaut (code 0xfc
> in the Latin set, according to gucharmap), that I just can't get ls to
> display properly.
> These characters seem to be considered by ls as
> "unprintable", and the best I've been able to produce in the ls
> output is backslash interpretations of the characters using either the
> -B or -b options, otherwise the default "?" is displayed in their place.
>
> The strange thing is that these characters will display just fine in
> xterm, gnome-terminal, etc.  I can copy and paste them from the
> gucharmap utility into a shell command line or other application, and
> they appear as they should, but ls simply refuses to display them.  I
> can print them using the printf command, even bash's builtin echo seems
> to have no problem with them.  Only ls appears to have this problem.
>
> I've experimented with using various locales, using the LC_*
> variables, as well as the LANG variable (as documented in the
> environment section of the ls man page), all to no avail.

Obviously you never read as far as the '-w' switch.

> Is this an inherent limitation of ls,

It is -not- a limitation; rather it is a _desired_ behavior -- so that
one can _tell_ where there is an 'unprintable' character (like \r or \b)
in a filename.  There are *good*reasons*(TM) why -q is the default
behavior for 'terminal' output.

> or is there some workaround or
> other solution?  Do we need a new en_*.UTF-16 locale?  Should we
> consider extending the ls command to handle these characters?

There _are_ "improved" versions of ls that do understand the 'locale'
environment variables -- but those programs introduce a whole bunch of
*other* 'not necessarily desired' behaviors -- like sorting upper-case
and lower-case letters as 'equals', rather than regarding any upper-case
as sorting before any lower-case.

> Or is
> there just something about all of this that I'm just not "getting"?
>
> As an additional note, I notice that in the text console, this same
> character code (0xfc) produces an entirely different character (a
> lowercase n in a raised position, as for the exponent in a mathematical
> expression).
> Is there, in fact, no standardization re: the
> representation of these "high bit" characters?

"The nice thing about standards is that there are so many to choose from"
applies.  WITH A VENGEANCE!!

There are at least FIFTEEN different sets of glyphs for the 'high bit set'
byte codes *JUST* for the 'iso-8859' base charset.  Plus 'utf-8'.  And not
counting the various bastardizations (e.g. 'CP-1252', etc.) that Microsoft
has introduced.

> Thanks to anyone who can help clear up this long-standing mystery for
> me.

Reading the fine manpage -- with particular attention to the '-q' and '-w'
options -- should provide some enlightenment.
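
To make the "so many standards" point concrete, here is a minimal sketch,
assuming Python 3 and its bundled codecs, of how the single byte 0xfc
decodes under a few different charsets.  The helper name decode_byte is
made up for this illustration, and the choice of charsets is only a sample.
CP437 is used here as a stand-in for the text-console font, which is an
assumption about that console, not something stated in the thread.

```python
def decode_byte(value: int, charset: str) -> str:
    """Decode one byte under `charset`; return a marker if it is not valid there."""
    try:
        return bytes([value]).decode(charset)
    except UnicodeDecodeError:
        # A lone 0xfc is not a complete, valid sequence in UTF-8.
        return "<invalid>"


# Sample charsets only; the codec names are Python's, not locale names.
for charset in ("latin-1", "iso8859-15", "iso8859-5", "cp437", "utf-8"):
    # !a prints non-ASCII results as escapes, so the output is safe on any terminal.
    print(f"{charset:12} 0xfc -> {decode_byte(0xFC, charset)!a}")
```

Under latin-1 the byte comes out as u-umlaut (U+00FC), under cp437 as the
superscript n (U+207F), and under utf-8 it is not a valid character at all,
which matches the difference between the xterm and the text console
described above.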