Date: Tue, 8 Nov 2011 18:42:36 -0600
From: "Conrad J. Sabatier" <conrads@cox.net>
To: freebsd-questions@FreeBSD.org
Subject: "Unprintable" 8-bit characters
Message-ID: <20111108184236.3a78ebf6@cox.net>
Pardon me if this seems like a stupid question, but this has been bugging me for a long time, and none of my research has turned up anything useful yet.

I've been trying to understand how the "extended" 8-bit character set, i.e., 8-bit characters with the MSB set, is supposed to be displayed. More specifically, I'm trying to figure out how to get the "ls" command to properly display filenames containing characters in this extended set.

I have some MP3 files, for instance, whose names contain certain European characters, such as the lowercase "u" with umlaut (code 0xfc in the Latin set, according to gucharmap), that I can't get ls to display properly. ls seems to consider these characters "unprintable". The best I've been able to produce in the ls output is backslash escapes for the characters, using either the -B or -b option; otherwise the default "?" is displayed in their place.

The strange thing is that these characters display just fine in xterm, gnome-terminal, etc. I can copy and paste them from the gucharmap utility into a shell command line or another application, and they appear as they should, but ls simply refuses to display them. I can print them using the printf command, and even bash's builtin echo seems to have no problem with them. Only ls appears to have this problem.

I've experimented with various locales, using the LC_* variables as well as the LANG variable (as documented in the environment section of the ls man page), all to no avail.

Is this an inherent limitation of ls, or is there some workaround or other solution? Do we need a new en_*.UTF-16 locale? Should we consider extending the ls command to handle these characters? Or is there something about all of this that I'm just not "getting"?
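To make the byte-level situation concrete, here is a minimal sketch, assuming Python 3 is available. It only illustrates how the single byte 0xfc relates to the Latin-1 and UTF-8 encodings; it does not show how ls itself decides what is printable, and the helper name `decodes_as_utf8` is made up for this example.

```python
# The byte found in the filename, per gucharmap's Latin listing.
raw = bytes([0xFC])

# In ISO-8859-1 (Latin-1), 0xfc is the lowercase u with umlaut (U+00FC).
latin1_char = raw.decode("latin-1")

# In UTF-8, the same character is encoded as two bytes, not one.
utf8_bytes = "\u00fc".encode("utf-8")


def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether data is a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(repr(latin1_char))          # the character as Latin-1 sees it
print(utf8_bytes.hex())           # the UTF-8 byte sequence, in hex
print(decodes_as_utf8(raw))       # is a lone 0xfc valid UTF-8?
print(decodes_as_utf8(utf8_bytes))
```

If this is right, a filename stored with a bare 0xfc byte would only be meaningful to a program interpreting it under a Latin-1 style character set, which may be part of why different tools disagree about it.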
As an additional note, I notice that in the text console, this same character code (0xfc) produces an entirely different character (a lowercase n in a raised position, as for the exponent in a mathematical expression). Is there, in fact, no standardization re: the representation of these "high bit" characters?

Thanks to anyone who can help clear up this long-standing mystery for me.

-- 
Conrad J. Sabatier
conrads@cox.net