Date: Wed, 9 Nov 2011 03:10:24 +0100
From: Polytropon <freebsd@edvax.de>
To: "Michael Ross" <gmx@ross.cx>
Cc: "Conrad J. Sabatier" <conrads@cox.net>, freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
Message-ID: <20111109031024.fb4c617e.freebsd@edvax.de>
In-Reply-To: <op.v4nor5dhg7njmm@michael-think>
References: <20111108184236.3a78ebf6@cox.net> <op.v4nor5dhg7njmm@michael-think>

On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> On 09.11.2011 at 01:42, Conrad J. Sabatier <conrads@cox.net> wrote:
>
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> > with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls" command
> > to properly display filenames containing characters in this extended
> > set. I have some MP3 files, for instance, whose names contain certain
> > European characters, such as the lowercase "u" with umlaut (code 0xfc
> > in the Latin set, according to gucharmap), that I just can't get ls to
> > display properly. These characters seem to be considered by ls as
> > "unprintable", and the best I've been able to produce in the ls
> > output is backslash interpretations of the characters using either the
> > -B or -b options, otherwise the default "?" is displayed in their place.
>
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.
>
> Assuming you want \0xfc displayed as "ü",
>
> > cat test.py && python test.py && ls -l
>
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
>
> f=open('\xfc','w')
> f.close()
> total 2
>
> -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
> -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü
>
>
> here is what works for me:
>
> in my login class in /etc/login.conf:
>
> :charset=ISO-8859-1:\
> :lang=de_DE.ISO8859-1:\
>
> ``cap_mkdb /etc/login.conf'' after changes
Ah, thanks - that seems to be the proper way to have
the environment variables set, instead of my (ab)use
of setenv in the csh config file. :-)

Note the "precedence" of $LANG vs. $LC_* (the latter can
be used to configure things more precisely, e.g.
regarding system messages or date formats; see the example
following).
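
To make the precedence concrete, here is a minimal sketch in Python.
It assumes the usual POSIX lookup order (LC_ALL first, then the
specific LC_* variable, then LANG); the function name
effective_locale and the sample environment are made up for
illustration and are not part of any system tool.

```python
def effective_locale(category, env):
    """Return the locale name that applies to one category, e.g. 'LC_TIME'."""
    # Assumed POSIX order: LC_ALL overrides everything, then the specific
    # LC_* variable, then LANG as the general fallback.
    # Unset or empty variables are skipped.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # default when nothing is set


# Hypothetical environment: German date formats, English everything else.
env = {"LC_TIME": "de_DE.ISO8859-1", "LANG": "en_US.ISO8859-1"}
print(effective_locale("LC_TIME", env))      # de_DE.ISO8859-1
print(effective_locale("LC_MESSAGES", env))  # en_US.ISO8859-1
```

Under this rule a non-empty LC_ALL takes priority over every
individual LC_* variable, so the fine-grained settings only take
effect when LC_ALL is unset or empty.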
> in /etc/rc.conf:
>
> scrnmap="iso-8859-1_to_cp437"
Hm? CP437? Codepage? Isn't that some MS-DOS thing?
I've never needed a screenmap to make "extended
characters" (everything beyond US-ASCII) work.
> font8x8="cp850-8x8"
> font8x14="cp850-8x14"
> font8x16="cp850-8x16"
>
>
> and in /etc/ttys, console type is set to ``cons25l1''
I have a similar setting here, but that does _not_ work
with UTF-8 encoded characters. If I want to use them, I
have to change some environment variables, from
#-------GERMAN/ENGLISH------------------------ <=== DEFAULT
setenv LC_ALL en_US.ISO8859-1
setenv LC_MESSAGES en_US.ISO8859-1
setenv LC_COLLATE de_DE.ISO8859-1
setenv LC_CTYPE de_DE.ISO8859-1
setenv LC_MONETARY de_DE.ISO8859-1
setenv LC_NUMERIC de_DE.ISO8859-1
setenv LC_TIME de_DE.ISO8859-1
unsetenv LANG
to
#-------INTERNATIONAL-------------------------
setenv LC_ALL en_US.UTF-8
setenv LC_MESSAGES en_US.UTF-8
setenv LC_COLLATE de_DE.UTF-8
setenv LC_CTYPE de_DE.UTF-8
setenv LC_MONETARY de_DE.UTF-8
setenv LC_NUMERIC de_DE.UTF-8
setenv LC_TIME de_DE.UTF-8
setenv LANG de_DE.UTF-8
Then I can use UTF-8 characters inside rxvt-unicode. Of
course, the text mode console is limited to the first set
of settings, using the ISO 8859-1 character set.
This worked long before UTF-8 arrived with the glorious
idea that I should have 2 bytes where one is sufficient,
to describe our (German) 6 umlauts and the Eszett ligature. :-)

Improper settings will result in [][] or A-tilde three
quarters upside-down question mark, depending on the editor
or terminal used.
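
The "2 bytes where one is sufficient" point and the garbage
characters can be reproduced with a short Python sketch. It only
uses the standard codecs; the choice of the u-umlaut as the sample
character is mine.

```python
import unicodedata

text = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

latin1 = text.encode("iso-8859-1")  # one byte
utf8 = text.encode("utf-8")         # two bytes
print(latin1.hex(), utf8.hex())     # fc c3bc

# UTF-8 bytes shown by a terminal that assumes ISO 8859-1:
# each of the two bytes is displayed as a character of its own.
mojibake = utf8.decode("iso-8859-1")
print([unicodedata.name(c) for c in mojibake])
# ['LATIN CAPITAL LETTER A WITH TILDE', 'VULGAR FRACTION ONE QUARTER']

# ISO 8859-1 byte shown by a terminal that assumes UTF-8:
# 0xfc alone is not valid UTF-8, so a replacement character appears.
broken = latin1.decode("utf-8", errors="replace")
print([unicodedata.name(c) for c in broken])
# ['REPLACEMENT CHARACTER']
```

Which garbage you actually see depends on the character and on how
the terminal or editor handles invalid input; a box or a question
mark is a common rendering of the replacement character.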
But returning to the original question, I think Robert
explained it very well: There is no real consensus
about what the different encodings should mean. They
were meant to unify the representation of a very large
set of characters, but basically there are many
interpretations now, and how they show up to the user depends
on the font in use, _if_ it has this mapping or that,
or none.
For running ls, -w is the right option to use - but IN
COMBINATION with correct settings for the terminal
emulation AND the presence of a font that will do.
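
As a rough model of why the locale matters here, the following
Python sketch mimics the behaviour described in this thread. It is
not taken from the ls source: the function ls_display, the
single-byte-locale assumption, the simplified "printable" test and
the file name are all made up for illustration.

```python
def ls_display(name: bytes, high_bytes_printable: bool, raw: bool = False) -> bytes:
    """Simplified model of how a file name is shown on a terminal."""
    if raw:
        # Comparable to -w: every byte is passed through unchanged and
        # the terminal and font decide what appears on screen.
        return name
    out = bytearray()
    for b in name:
        # Assumption: ASCII 0x20..0x7e is always printable; bytes from
        # 0xa0 upwards are printable only in an ISO 8859-1 style locale.
        printable = 0x20 <= b < 0x7F or (high_bytes_printable and b >= 0xA0)
        out.append(b if printable else ord("?"))
    return bytes(out)


name = b"M\xfcller.mp3"  # hypothetical file name with a Latin-1 u-umlaut
print(ls_display(name, high_bytes_printable=False))  # b'M?ller.mp3'
print(ls_display(name, high_bytes_printable=True))   # b'M\xfcller.mp3'
```

In the raw case the byte 0xfc reaches the terminal untouched, which
is why the terminal's character set and the font still have to
match before an umlaut actually appears.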
Again a fine demonstration why file names should be
limited to printable ASCII and no spaces if you want
them to work everywhere. :-)
-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
