Date: Tue, 8 Nov 2011 19:58:04 -0600
From: "Conrad J. Sabatier"
To: Robert Bonomi
Cc: freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
Message-ID: <20111108195804.6dfa47c8@cox.net>
In-Reply-To: <201111090117.pA91HRDo065662@mail.r-bonomi.com>
References: <20111108184236.3a78ebf6@cox.net> <201111090117.pA91HRDo065662@mail.r-bonomi.com>

On Tue, 8 Nov 2011 19:17:27 -0600 (CST) Robert Bonomi wrote:

>
> On Tue, 8 Nov 2011 18:42:36 -0600, "Conrad J. Sabatier" wrote:
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit
> > characters with the MSB set.
>
> Quite simply Unix dates from the days where the 8th bit was used as a
> 'parity' bit. Allowing detection of *all* single-bit errors --
> especially over the notoriously un-reliable connections known as
> 'serial ports'.

Ah, yes! The "good old days". :-)

> > More specifically, I'm trying to figure out how to get the "ls"
> > command to properly display filenames containing characters in this
> > extended set. I have some MP3 files, for instance, whose names
> > contain certain European characters, such as the lowercase "u" with
> > umlaut (code 0xfc in the Latin set, according to gucharmap), that I
> > just can't get ls to display properly. These characters seem to be
> > considered by ls as "unprintable", and the best I've been able to
> > produce in the ls output is backslash interpretations of the
> > characters using either the -B or -b options, otherwise the default
> > "?" is displayed in their place.
> >
> > The strange thing is that these characters will display just fine in
> > xterm, gnome-terminal, etc. I can copy and paste them from the
> > gucharmap utility into a shell command line or other application,
> > and they appear as they should, but ls simply refuses to display
> > them. I can print them using the printf command, even bash's
> > builtin echo seems to have no problem with them. Only ls appears
> > to have this problem.
> >
> > I've experimented with using various locales, using the LC_*
> > variables, as well as the LANG variable (as documented in the
> > environment section of the ls man page), all to no avail.
>
> Obviously you never read as far as the '-w' switch.

Yes, somehow that one went right past me. Haste makes waste! :-)

> > Is this an inherent limitation of ls,
>
> It is -not- a limitation; rather it is a _desired_ behavior -- so
> that one can _tell_ where there is an 'unprintable' character (like
> \r, or \b) in a filename. There are *good*reasons*(TM) why -q is the
> default behavior for 'terminal' output.

OK, I can see that. :-)

> > or is there some workaround or
> > other solution? Do we need a new en_*.UTF-16 locale? Should we
> > consider extending the ls command to handle these characters?
>
> There _are_ "improved" versions of ls that do understand the 'locale'
> environment variables -- but those programs introduce a whole bunch of
> *other* 'not necessarily desired' behaviors -- like sorting
> upper-case and lower-case letters as 'equals', rather than regarding
> any upper-case as sorting before any lowercase.

Well, *that* certainly won't do! That should be the exception, not the
rule.

> > Or is
> > there just something about all of this that I'm just not "getting"?
> >
> > As an additional note, I notice that in the text console, this same
> > character code (0xfc) produces an entirely different character (a
> > lowercase n in a raised position, as for the exponent in a
> > mathematical expression).
> > Is there, in fact, no standardization
> > re: the representation of these "high bit" characters?
>
> "The nice thing about standards is that there are so many to choose
> from" applies. WITH A VENGEANCE!!
>
> There are at least FIFTEEN different sets of glyphs for the 'high bit
> set' byte codes *JUST* for the 'iso-8859' base charset. Plus
> 'utf-8'. And not counting the various bastardizations (e.g. 'CP-1252',
> etc.) that Microsoft has introduced.
>
> > Thanks to anyone who can help clear up this long-standing mystery
> > for me.
>
> eading he ine anpage -- with particular attention to the
> '-q' and '-w' options should provide some enlightenment.

Thank you very much. Some of this matched the suspicions I already had
re: this matter. Don't know how I completely missed the -w switch. Mea
culpa. :-)

So, what would be the safest bet as far as the most "universal"
representation for these characters? Something I've long wondered about
when I've e-mailed people and copied/pasted these characters (are they
really seeing what I'm seeing?). :-)

-- 
Conrad J. Sabatier
conrads@cox.net
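
The point about one byte value having many possible glyphs can be
checked directly. What follows is only a rough sketch in Python, not
anything taken from the ls sources: it decodes the single byte 0xfc
from the message above under a handful of single-byte character sets.
The list of code pages is my own choice for illustration. The raised
"n" seen on the text console is consistent with an IBM code page 437
font, but the message does not say which console font was in use, so
treat that as an assumption.

```python
# Illustration only: the byte value 0xFC comes from the message above;
# the set of code pages compared here is an assumption, not from the thread.
BYTE = b"\xfc"

# Python codec names for a few single-byte character sets.
CODE_PAGES = ["latin-1", "iso8859-15", "cp1252", "cp437", "iso8859-5", "iso8859-7"]


def decode_everywhere(raw, encodings):
    """Return {encoding: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in encodings}


glyphs = decode_everywhere(BYTE, CODE_PAGES)
for name, char in glyphs.items():
    print(f"{name:12} 0x{BYTE[0]:02x} -> {char!r} (U+{ord(char):04X})")

# The same letter, u with umlaut, written as a UTF-8 byte sequence.
utf8_bytes = "\u00fc".encode("utf-8")
print("UTF-8 bytes for U+00FC:", utf8_bytes)

# A lone 0xFC byte is not a valid UTF-8 sequence, so a filename stored
# in a single-byte charset cannot simply be reinterpreted as UTF-8.
try:
    BYTE.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False
print("0xFC alone is valid UTF-8:", lone_byte_is_valid_utf8)
```

Under latin-1 the byte comes out as the "u" with umlaut, and under
cp437 as the superscript "n", which matches the two behaviours
described in the message. Whether a recipient sees the same glyph
therefore depends on both sides agreeing on the charset, which is what
the Content-Type charset parameter in a mail header is there to state.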