Date: Tue, 8 Nov 2011 20:24:18 -0600
From: "Conrad J. Sabatier" <conrads@cox.net>
To: "Michael Ross" <gmx@ross.cx>
Cc: freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
Message-ID: <20111108202418.05081f25@cox.net>
In-Reply-To: <op.v4nor5dhg7njmm@michael-think>
References: <20111108184236.3a78ebf6@cox.net> <op.v4nor5dhg7njmm@michael-think>

On Wed, 09 Nov 2011 02:51:31 +0100
"Michael Ross" <gmx@ross.cx> wrote:

> On 09.11.2011 at 01:42, Conrad J. Sabatier <conrads@cox.net> wrote:
>
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit
> > characters with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls"
> > command to properly display filenames containing characters in this
> > extended set. I have some MP3 files, for instance, whose names
> > contain certain European characters, such as the lowercase "u" with
> > umlaut (code 0xfc in the Latin set, according to gucharmap), that I
> > just can't get ls to display properly. These characters seem to be
> > considered by ls as "unprintable", and the best I've been able to
> > produce in the ls output is backslash interpretations of the
> > characters using either the -B or -b options, otherwise the default
> > "?" is displayed in their place.
>
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.

That is to say, "8-bit characters with the most significant bit set", or
"characters greater than 0x7f". I can certainly appreciate your
confusion; this is definitely a confusing area.
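
To make "MSB set" concrete, here is a minimal Python sketch. The filename is hypothetical, and octal_escape() is only a rough approximation of the backslash output described above for ls -B, not the code ls actually uses.

```python
def has_high_bit(name: bytes) -> bool:
    """True if any byte has its most significant bit set (value > 0x7f)."""
    return any(b > 0x7F for b in name)


def octal_escape(name: bytes) -> str:
    """Keep printable ASCII; show every other byte as a \\ooo octal escape.

    Simplified illustration only; a literal backslash is not escaped.
    """
    parts = []
    for b in name:
        if 0x20 <= b <= 0x7E:
            parts.append(chr(b))
        else:
            parts.append("\\%03o" % b)
    return "".join(parts)


# Hypothetical filename containing the single byte 0xfc (Latin-1 u-umlaut).
name = b"M\xfcller.mp3"

print(has_high_bit(name))   # True
print(octal_escape(name))   # M\374ller.mp3
```

Whether such a byte counts as printable depends on the locale in effect, which is why the same name can show up as "?", as an escape, or as the intended character.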

In gucharmap, when I select the umlauted "u" in the Latin set, the
"Character Details" tab reveals the following:

U+00FC LATIN SMALL LETTER U WITH DIAERESIS

General Character Properties

    In Unicode since: 1.1
    Unicode category: Letter, Lowercase

Canonical decomposition:
    U+0075 LATIN SMALL LETTER U + U+0308 COMBINING DIAERESIS

Various Useful Representations

    UTF-8: 0xC3 0xBC
    UTF-16: 0x00FC
    C octal escaped UTF-8: \303\274
    XML decimal entity: &#252;

So apparently, it's a "wide" (multibyte) character in UTF-8, which
really throws a monkey wrench into the works in certain situations (for
example, one of the little scripts I've written to process MP3 files
uses the "cut" command, which complains about an "illegal byte
sequence").

Even more confusing, when I select the character and copy it to the
clipboard, the UTF-16 representation (0xfc) is what actually gets used.
Pasting this single-byte version into an X terminal (any of them:
xterm, gnome-terminal, etc.) does display the correct character, an
umlauted "u", even if using an 8-bit locale, such as UTF-8. Majorly
confusing!

> Assuming you want \0xfc displayed as "ü",

Yes, exactly.

> > cat test.py && python test.py && ls -l
>
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
>
> f=open('\xfc','w')
> f.close()
> total 2
>
> -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
> -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü
>
>
> here is what works for me:
>
> in my login class in /etc/login.conf:
>
> :charset=ISO-8859-1:\
> :lang=de_DE.ISO8859-1:\
>
> ``cap_mkdb /etc/login.conf'' after changes
>
>
> in /etc/rc.conf:
>
> scrnmap="iso-8859-1_to_cp437"
> font8x8="cp850-8x8"
> font8x14="cp850-8x14"
> font8x16="cp850-8x16"
>
>
> and in /etc/ttys, console type is set to ``cons25l1''

Thanks, I hadn't considered making those sorts of changes for the
console. I work so seldom nowadays in the console, I'd forgotten all
about that stuff (use it or lose it, as they say!). I'll certainly give
that a try.
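
The representations gucharmap lists for U+00FC can be checked with a short Python 3 sketch. The helper is_valid_utf8() is a name made up for this illustration. It shows that the lone byte 0xfc is the ISO 8859-1 encoding of the character but is not valid UTF-8 on its own, which may well be what is behind the "illegal byte sequence" complaint from cut under a UTF-8 locale.

```python
# U+00FC, LATIN SMALL LETTER U WITH DIAERESIS
ch = "\u00fc"

latin1 = ch.encode("iso-8859-1")   # one byte:  0xfc
utf8 = ch.encode("utf-8")          # two bytes: 0xc3 0xbc
utf16 = ch.encode("utf-16-be")     # one 16-bit unit: 0x00fc


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(latin1.hex(), utf8.hex(), utf16.hex())   # fc c3bc 00fc
print(is_valid_utf8(latin1))                   # False
print(is_valid_utf8(utf8))                     # True
```

In other words, a filename stored with the single byte 0xfc only makes sense to tools running under an ISO 8859-1 (or similar single-byte) locale; under a UTF-8 locale the same character would need to be stored as 0xc3 0xbc.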

Much appreciation for both your and Robert's replies.

-- 
Conrad J. Sabatier
conrads@cox.net