Date: Wed, 9 Nov 2011 03:10:24 +0100
From: Polytropon
Organization: EDVAX
To: "Michael Ross"
Cc: "Conrad J. Sabatier", freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
Message-Id: <20111109031024.fb4c617e.freebsd@edvax.de>

On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> On 09.11.2011, 01:42, Conrad J. Sabatier wrote:
>
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> > with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls" command
> > to properly display filenames containing characters in this extended
> > set. I have some MP3 files, for instance, whose names contain certain
> > European characters, such as the lowercase "u" with umlaut (code 0xfc
> > in the Latin set, according to gucharmap), that I just can't get ls to
> > display properly. These characters seem to be considered by ls as
> > "unprintable", and the best I've been able to produce in the ls
> > output is backslash interpretations of the characters using either the
> > -B or -b options, otherwise the default "?" is displayed in their place.
>
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.
>
> Assuming you want \0xfc displayed as "ü",
>
> > cat test.py && python test.py && ls -l
>
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
>
> f=open('\xfc','w')
> f.close()
> total 2
>
> -rw-r--r--  1 michael  wheel  29  9 Nov 02:43 test.py
> -rw-r--r--  1 michael  wheel   0  9 Nov 02:44 ü
>
>
> here is what works for me:
>
> in my login class in /etc/login.conf:
>
>     :charset=ISO-8859-1:\
>     :lang=de_DE.ISO8859-1:\
>
> ``cap_mkdb /etc/login.conf'' after changes

Ah, thanks - that seems to be the proper way to have the
environment variables set - instead of my (ab)use of
setenv's in the csh config file. :-)

Note the "precedence" of $LANG vs. $LC_* (as they can be
used to configure things more precisely, e. g. regarding
system messages or date formats; see example following).

> in /etc/rc.conf:
>
> scrnmap="iso-8859-1_to_cp437"

Hm? CP437? Codepage? Isn't that some MS-DOS thing?
I've never needed a screenmap to make "extended characters"
(everything beyond US-ASCII) work.

> font8x8="cp850-8x8"
> font8x14="cp850-8x14"
> font8x16="cp850-8x16"
>
>
> and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work
with UTF-8 encoded characters. If I want to use them, I have
to change some environment variables, from

    #-------GERMAN/ENGLISH------------------------ <=== DEFAULT
    setenv LC_ALL       en_US.ISO8859-1
    setenv LC_MESSAGES  en_US.ISO8859-1
    setenv LC_COLLATE   de_DE.ISO8859-1
    setenv LC_CTYPE     de_DE.ISO8859-1
    setenv LC_MONETARY  de_DE.ISO8859-1
    setenv LC_NUMERIC   de_DE.ISO8859-1
    setenv LC_TIME      de_DE.ISO8859-1
    unsetenv LANG

to

    #-------INTERNATIONAL-------------------------
    setenv LC_ALL       en_US.UTF-8
    setenv LC_MESSAGES  en_US.UTF-8
    setenv LC_COLLATE   de_DE.UTF-8
    setenv LC_CTYPE     de_DE.UTF-8
    setenv LC_MONETARY  de_DE.UTF-8
    setenv LC_NUMERIC   de_DE.UTF-8
    setenv LC_TIME      de_DE.UTF-8
    setenv LANG         de_DE.UTF-8

Then I can use UTF-8 characters inside rxvt-unicode. Of
course, the text mode console is limited to the first set of
configuration, using the ISO 8859-1 character set. This
worked long before UTF-8 arrived with the glorious idea
that I should have 2 bytes where one is sufficient, to
describe our (German) 6 umlauts and the Eszett ligature. :-)

Improper settings will result in [][] or A-tilde three quarters
upside-down question mark, depending on the editor or terminal
used.

But returning to the original question, I think Robert
did explain it very well: There is no real consensus about
what the different codings should mean. They were meant
to unify the representation of a very large set of characters,
but basically there are many interpretations now, and how
they show up to the user depends on the font in use, _if_
it has this mapping or that, or none.
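
To make the byte-level side of this concrete, here is a small Python
sketch (an illustration only, not something taken from the thread; the
function name describe() and the dictionary keys are made up for the
example). It assumes Python 3 and shows that each umlaut and the Eszett
is one byte in ISO 8859-1 but two bytes in UTF-8, and what roughly comes
out when a terminal interprets the bytes with the wrong assumption:

```python
# Sketch: one character, two encodings, and what a program sees when it
# decodes the bytes with the wrong assumption. Names are hypothetical.

GERMAN_SPECIALS = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df"  # umlauts + Eszett


def describe(ch):
    """Return the byte forms of ch and their mis-decoded appearance."""
    latin1 = ch.encode("iso-8859-1")  # one byte for these characters
    utf8 = ch.encode("utf-8")         # two bytes for these characters
    return {
        "latin1_bytes": latin1,
        "utf8_bytes": utf8,
        # UTF-8 bytes displayed by something that assumes ISO 8859-1
        "utf8_seen_as_latin1": utf8.decode("iso-8859-1"),
        # ISO 8859-1 byte displayed by something that assumes UTF-8;
        # the lone byte is invalid UTF-8 and becomes U+FFFD
        "latin1_seen_as_utf8": latin1.decode("utf-8", errors="replace"),
    }


if __name__ == "__main__":
    for ch in GERMAN_SPECIALS:
        # ascii() keeps the output printable whatever the terminal locale is
        print(ascii(ch), ascii(describe(ch)))
```

For the lowercase u-umlaut, the ISO 8859-1 form is the single byte 0xfc
from the original question, while the UTF-8 form is 0xc3 0xbc; which
glyphs a real terminal draws for the mismatched cases still depends on
its font and settings.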

For running ls, -w is the right option to use - but IN COMBINATION
with correct settings for the terminal emulation AND the presence
of a font that will do.

Again a fine demonstration why file names should be limited to
printable ASCII and no spaces if you want them to work everywhere. :-)

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...