Date: Wed, 9 Nov 2011 03:10:24 +0100
From: Polytropon <freebsd@edvax.de>
To: "Michael Ross" <gmx@ross.cx>
Cc: "Conrad J. Sabatier" <conrads@cox.net>, freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
Message-ID: <20111109031024.fb4c617e.freebsd@edvax.de>
In-Reply-To: <op.v4nor5dhg7njmm@michael-think>
References: <20111108184236.3a78ebf6@cox.net> <op.v4nor5dhg7njmm@michael-think>

On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> On 09.11.2011 at 01:42, Conrad J. Sabatier <conrads@cox.net> wrote:
>
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> > with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls" command
> > to properly display filenames containing characters in this extended
> > set. I have some MP3 files, for instance, whose names contain certain
> > European characters, such as the lowercase "u" with umlaut (code 0xfc
> > in the Latin set, according to gucharmap), that I just can't get ls to
> > display properly. These characters seem to be considered by ls as
> > "unprintable", and the best I've been able to produce in the ls
> > output is backslash interpretations of the characters using either the
> > -B or -b options, otherwise the default "?" is displayed in their place.
>
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.
>
> Assuming you want \0xfc displayed as "ü",
>
> > cat test.py && python test.py && ls -l
>
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
>
> f=open('\xfc','w')
> f.close()
> total 2
>
> -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
> -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü
>
>
> here is what works for me:
>
> in my login class in /etc/login.conf:
>
> :charset=ISO-8859-1:\
> :lang=de_DE.ISO8859-1:\
>
> ``cap_mkdb /etc/login.conf'' after changes

Ah, thanks - that seems to be the proper way to have the
environment variables set - instead of my (ab)use of
setenv's in the csh config file. :-)

Note the "precedence" of $LANG vs.
$LC_* (as they can be used to configure things more precisely,
e. g. regarding system messages or date formats; see example
following).

> in /etc/rc.conf:
>
> scrnmap="iso-8859-1_to_cp437"

Hm? CP437? Codepage? Isn't that some MS-DOS thing? I've
never needed a screenmap to make "extended characters"
(everything beyond US-ASCII) work.

> font8x8="cp850-8x8"
> font8x14="cp850-8x14"
> font8x16="cp850-8x16"
>
>
> and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work
with UTF-8 encoded characters. If I want to use them, I have
to change some environment variables, from

	#-------GERMAN/ENGLISH------------------------ <=== DEFAULT
	setenv LC_ALL       en_US.ISO8859-1
	setenv LC_MESSAGES  en_US.ISO8859-1
	setenv LC_COLLATE   de_DE.ISO8859-1
	setenv LC_CTYPE     de_DE.ISO8859-1
	setenv LC_MONETARY  de_DE.ISO8859-1
	setenv LC_NUMERIC   de_DE.ISO8859-1
	setenv LC_TIME      de_DE.ISO8859-1
	unsetenv LANG

to

	#-------INTERNATIONAL-------------------------
	setenv LC_ALL       en_US.UTF-8
	setenv LC_MESSAGES  en_US.UTF-8
	setenv LC_COLLATE   de_DE.UTF-8
	setenv LC_CTYPE     de_DE.UTF-8
	setenv LC_MONETARY  de_DE.UTF-8
	setenv LC_NUMERIC   de_DE.UTF-8
	setenv LC_TIME      de_DE.UTF-8
	setenv LANG         de_DE.UTF-8

Then I can use UTF-8 characters inside rxvt-unicode. Of
course, text mode console is limited to the first set
of configuration, using the ISO 8859-1 character set.
This worked long before UTF-8 arrived with the glorious
idea that I should have 2 bytes where one is sufficient,
to describe our (German) 6 umlauts and the Eszett ligature. :-)

Improper settings will result in [][] or A-tilde three quarters
upside-down question mark, depending on editor or terminal used.

But returning to the original question, I think Robert did
explain it very well: There is no real consensus about what
the different codings should mean.
They were meant to unify the representation of a very large
set of characters, but basically there are many interpretations
now, and how they show up to the user depends on the font in
use, _if_ it has this mapping or that, or none.

For running ls, -w is the right option to use - but IN
COMBINATION with correct settings for the terminal emulation
AND the presence of a font that will do.

Again a fine demonstration why file names should be limited
to printable ASCII and no spaces if you want them to work
everywhere. :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
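
For illustration, the "2 bytes where one is sufficient" point and the
garbage produced by mismatched settings can be reproduced at the byte
level. This is only a minimal sketch, assuming Python 3; the helper
names show_bytes and misread are made up for this example and do not
come from any tool mentioned in the thread.

```python
# Illustration only: how one character is stored, and how it is
# mis-read, under the two encodings discussed in this thread.
# show_bytes() and misread() are hypothetical helper names.

def show_bytes(text, encoding):
    """Return the bytes of text in the given encoding as hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


def misread(text, stored_as, read_as):
    """Encode text as stored_as, then decode those bytes as read_as.

    Bytes that are invalid in read_as become U+FFFD, the Unicode
    replacement character, which many terminals draw as a box or "?".
    """
    return text.encode(stored_as).decode(read_as, errors="replace")


u_umlaut = "\u00fc"  # lowercase u with umlaut, code 0xfc in ISO 8859-1

latin1_hex = show_bytes(u_umlaut, "iso-8859-1")  # a single byte
utf8_hex = show_bytes(u_umlaut, "utf-8")         # two bytes

# UTF-8 file name shown by a terminal that assumes ISO 8859-1:
utf8_seen_as_latin1 = misread(u_umlaut, "utf-8", "iso-8859-1")

# ISO 8859-1 file name shown by a terminal that assumes UTF-8:
latin1_seen_as_utf8 = misread(u_umlaut, "iso-8859-1", "utf-8")

# ascii() keeps the output readable whatever the current locale is.
print("ISO 8859-1 bytes:", latin1_hex)
print("UTF-8 bytes:     ", utf8_hex)
print("UTF-8 read as ISO 8859-1:", ascii(utf8_seen_as_latin1))
print("ISO 8859-1 read as UTF-8:", ascii(latin1_seen_as_utf8))
```

The first mismatch turns the one character into two unrelated Latin-1
characters (an A-tilde followed by a fraction sign); the second yields
a byte that is not valid UTF-8 at all, which is the case where ls falls
back to "?" or a backslash escape.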