From owner-freebsd-questions@FreeBSD.ORG  Wed Nov  9 01:16:51 2011
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 86DE3106566C
	for <freebsd-questions@freebsd.org>;
	Wed,  9 Nov 2011 01:16:51 +0000 (UTC)
	(envelope-from bonomi@mail.r-bonomi.com)
Received: from mail.r-bonomi.com (mx-out.r-bonomi.com [204.87.227.120])
	by mx1.freebsd.org (Postfix) with ESMTP id 451EE8FC0C
	for <freebsd-questions@freebsd.org>;
	Wed,  9 Nov 2011 01:16:50 +0000 (UTC)
Received: (from bonomi@localhost)
	by mail.r-bonomi.com (8.14.4/rdb1) id pA91HRDo065662;
	Tue, 8 Nov 2011 19:17:27 -0600 (CST)
Date: Tue, 8 Nov 2011 19:17:27 -0600 (CST)
From: Robert Bonomi <bonomi@mail.r-bonomi.com>
Message-Id: <201111090117.pA91HRDo065662@mail.r-bonomi.com>
To: conrads@cox.net
In-Reply-To: <20111108184236.3a78ebf6@cox.net>
Cc: freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters


On Tue, 8 Nov 2011 18:42:36 -0600, "Conrad J. Sabatier" wrote:
>
> I've been trying to understand what the deal is with regards to the
> displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> with the MSB set.

Quite simply, Unix dates from the days when the 8th bit was used as a 'parity'
bit, allowing detection of *all* single-bit errors, especially over the
notoriously unreliable connections known as 'serial ports'.
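
To make the parity idea concrete, here is a minimal sketch in Python.  It
assumes even parity carried in the MSB of each byte; the function names are
made up for illustration and do not come from any serial driver.

```python
def add_even_parity(ch: str) -> int:
    """Return the 8-bit value for a 7-bit ASCII character, with the MSB
    set so that the total number of 1 bits in the byte is even."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("only 7-bit ASCII leaves room for a parity bit")
    parity = bin(code).count("1") % 2   # 1 when the 7 data bits hold an odd count
    return code | (parity << 7)


def parity_ok(byte: int) -> bool:
    """True when the byte still contains an even number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


sent = add_even_parity("u")      # 'u' is 0x75, which has five 1 bits
corrupted = sent ^ 0x04          # flip a single bit "on the wire"
print(hex(sent), parity_ok(sent), parity_ok(corrupted))
```

Flipping any one bit makes the count of 1 bits odd, which is why every
single-bit error is caught.  It also shows why a byte with the high bit set
could not safely be treated as a character in its own right.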
>
> More specifically, I'm trying to figure out how to get the "ls" command
> to properly display filenames containing characters in this extended
> set.  I have some MP3 files, for instance, whose names contain certain
> European characters, such as the lowercase "u" with umlaut (code 0xfc
> in the Latin set, according to gucharmap), that I just can't get ls to
> display properly.  These characters seem to be considered by ls as
> "unprintable", and the best I've been able to produce in the ls
> output is backslash interpretations of the characters using either the
> -B or -b options, otherwise the default "?" is displayed in their place.
>
> The strange thing is that these characters will display just fine in
> xterm, gnome-terminal, etc.  I can copy and paste them from the
> gucharmap utility into a shell command line or other application, and
> they appear as they should, but ls simply refuses to display them.  I
> can print them using the printf command, even bash's builtin echo seems
> to have no problem with them.  Only ls appears to have this problem.
>
> I've experimented with using various locales, using the LC_*
> variables, as well as the LANG variable (as documented in the
> environment section of the ls man page), all to no avail.

Obviously you never read as far as the '-w' switch.  <grin>

> Is this an inherent limitation of ls, 

It is -not- a limitation; rather it is a _desired_ behavior, so that
one can _tell_ where there is an 'unprintable' character (like \r or \b)
in a filename.  There are *good* *reasons* (TM) why -q is the default behavior
for 'terminal' output.
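
As a rough illustration of what that substitution amounts to, here is a
Python sketch.  It is -not- the ls source; it assumes a "C"-locale notion of
'printable' (bytes 0x20 through 0x7e only), and the helper names are
invented for this example.

```python
def is_printable(byte: int) -> bool:
    # Assumption: "C"-locale rule, only plain ASCII graphics and space count.
    return 0x20 <= byte <= 0x7E


def q_style(name: bytes) -> str:
    """Show every non-printable byte as '?', roughly what -q produces."""
    return "".join(chr(b) if is_printable(b) else "?" for b in name)


def octal_style(name: bytes) -> str:
    """Show every non-printable byte as a backslash and three octal digits."""
    return "".join(chr(b) if is_printable(b) else "\\%03o" % b for b in name)


for raw in (b"Gl\xfcck.mp3", b"bad\rname"):
    print(q_style(raw), octal_style(raw))
```

With the carriage return made visible, a filename can no longer overwrite
part of its own line in the listing, which is the point of the default.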

>                                       or is there some workaround or
> other solution?  Do we need a new en_*.UTF-16 locale?  Should we
> consider extending the ls command to handle these characters?

There _are_ "improved" versions of ls that do understand the 'locale'
environment variables -- but those programs introduce a whole bunch of
*other* 'not necessarily desired' behaviors -- like sorting upper-case and
lower-case letters as 'equals', rather than regarding any upper-case as 
sorting before any lowercase.
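
The difference between the two orderings is easy to see in a small Python
sketch.  The file names are made up, and str.casefold is only a stand-in
for real locale collation, which is more involved than plain case folding.

```python
names = ["readme", "Makefile", "zebra", "Apple", "banana"]

# Plain code-point order: every upper-case letter sorts before any lower-case one.
byte_order = sorted(names)

# Case-insensitive order, approximating what a locale-aware sort tends to give.
folded_order = sorted(names, key=str.casefold)

print(byte_order)
print(folded_order)
```

In the first list 'Makefile' floats to the top next to 'Apple'; in the second
it lands between 'banana' and 'readme'.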

>                                                                Or is
> there just something about all of this that I'm just not "getting"?
>
> As an additional note, I notice that in the text console, this same
> character code (0xfc) produces an entirely different character (a
> lowercase n in a raised position, as for the exponent in a mathematical
> expression).  Is there, in fact, no standardization re: the
> representation of these "high bit" characters?

"The nice thing about standards is that there are so many to choose from"
applies.  WITH A VENGEANCE!!

There are at least FIFTEEN different sets of glyphs for the 'high bit set'
byte codes *JUST* for the 'iso-8859' base charset.  Plus 'utf-8'.  And that is
not counting the various bastardizations (e.g. 'CP-1252') that Microsoft
has introduced.
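
A short Python sketch shows how the one byte 0xfc comes out under a handful
of these charsets.  The selection of codecs is arbitrary, and the guess that
the text console follows IBM code page 437 is an assumption on my part, not
something checked against the console font.

```python
import unicodedata

BYTE = bytes([0xFC])

# Codec names as spelled in Python's standard codec registry.
CODECS = ("iso8859_1", "iso8859_5", "iso8859_7", "cp1252", "cp437")

# Map each charset to the character it assigns to byte 0xfc.
glyphs = {codec: BYTE.decode(codec) for codec in CODECS}

for codec, ch in glyphs.items():
    print(f"{codec:10} U+{ord(ch):04X} {unicodedata.name(ch)}")

# On its own, 0xfc is not a valid UTF-8 sequence at all.
try:
    BYTE.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)
```

Under code page 437 the byte decodes to a superscript 'n', which matches
what was described for the text console, while in UTF-8 the u-umlaut is
the two-byte sequence 0xc3 0xbc.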

> Thanks to anyone who can help clear up this long-standing mystery for
> me.

<R>eading <t>he <f>ine <m>anpage, with particular attention to the '-q'
and '-w' options, should provide some enlightenment.