Date:      Wed, 9 Nov 2011 03:10:24 +0100
From:      Polytropon <freebsd@edvax.de>
To:        "Michael Ross" <gmx@ross.cx>
Cc:        "Conrad J. Sabatier" <conrads@cox.net>, freebsd-questions@freebsd.org
Subject:   Re: "Unprintable" 8-bit characters
Message-ID:  <20111109031024.fb4c617e.freebsd@edvax.de>
In-Reply-To: <op.v4nor5dhg7njmm@michael-think>
References:  <20111108184236.3a78ebf6@cox.net> <op.v4nor5dhg7njmm@michael-think>

On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> On 09.11.2011 at 01:42, Conrad J. Sabatier <conrads@cox.net> wrote:
>
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> > with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls" command
> > to properly display filenames containing characters in this extended
> > set.  I have some MP3 files, for instance, whose names contain certain
> > European characters, such as the lowercase "u" with umlaut (code 0xfc
> > in the Latin set, according to gucharmap), that I just can't get ls to
> > display properly.  These characters seem to be considered by ls as
> > "unprintable", and the best I've been able to produce in the ls
> > output is backslash interpretations of the characters using either the
> > -B or -b options, otherwise the default "?" is displayed in their place.
>
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.
>
> Assuming you want \0xfc displayed as "ü",
>
> > cat test.py && python test.py && ls -l
>
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
>
> f=open('\xfc','w')
> f.close()
> total 2
>
> -rw-r--r--  1 michael  wheel  29  9 Nov 02:43 test.py
> -rw-r--r--  1 michael  wheel   0  9 Nov 02:44 ü
>
>
> here is what works for me:
>
> in my login class in /etc/login.conf:
>
>          :charset=ISO-8859-1:\
>          :lang=de_DE.ISO8859-1:\
>
> ``cap_mkdb /etc/login.conf'' after changes

Ah, thanks - that seems to be the proper way to have
the environment variables set, instead of my (ab)use
of setenv in the csh config file. :-)

Note the precedence of $LANG vs. $LC_*: the LC_* variables
override $LANG per category, so they can be used to
configure things more precisely, e.g. system messages
or date formats (see the example further down).
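
A minimal Python sketch of that precedence, assuming the usual
POSIX lookup order (LC_ALL first, then the category's own LC_*
variable, then LANG, then the "C" locale). The function name
effective_locale is made up for illustration and is not part of
any system API:

```python
def effective_locale(category, env):
    """Return the locale name that applies to one category.

    Assumed lookup order (POSIX): LC_ALL, then the category's own
    variable (e.g. LC_CTYPE), then LANG, then the default "C".
    Empty strings are treated like unset variables.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    env = {"LANG": "de_DE.ISO8859-1", "LC_MESSAGES": "en_US.ISO8859-1"}
    for cat in ("LC_MESSAGES", "LC_CTYPE", "LC_TIME"):
        print(cat, "->", effective_locale(cat, env))
```

One consequence worth keeping in mind: whenever LC_ALL is set, it
wins over every individual LC_* variable, so per-category settings
only take effect while LC_ALL is unset.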



> in /etc/rc.conf:
>
> 	scrnmap="iso-8859-1_to_cp437"

Hm? CP437? Codepage? Isn't that some MS-DOS thing?
I've never needed a screenmap to make "extended
characters" (everything beyond US-ASCII) work.



> 	font8x8="cp850-8x8"
> 	font8x14="cp850-8x14"
> 	font8x16="cp850-8x16"
>
>
> and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work
with UTF-8 encoded characters. If I want to use them, I
have to change some environment variables, from

	#-------GERMAN/ENGLISH------------------------ <=== DEFAULT
	setenv	LC_ALL		en_US.ISO8859-1
	setenv	LC_MESSAGES	en_US.ISO8859-1
	setenv	LC_COLLATE	de_DE.ISO8859-1
	setenv	LC_CTYPE	de_DE.ISO8859-1
	setenv	LC_MONETARY	de_DE.ISO8859-1
	setenv	LC_NUMERIC	de_DE.ISO8859-1
	setenv	LC_TIME		de_DE.ISO8859-1
	unsetenv LANG

to

	#-------INTERNATIONAL-------------------------
	setenv	LC_ALL		en_US.UTF-8
	setenv	LC_MESSAGES	en_US.UTF-8
	setenv	LC_COLLATE	de_DE.UTF-8
	setenv	LC_CTYPE	de_DE.UTF-8
	setenv	LC_MONETARY	de_DE.UTF-8
	setenv	LC_NUMERIC	de_DE.UTF-8
	setenv	LC_TIME		de_DE.UTF-8
	setenv	LANG		de_DE.UTF-8

Then I can use UTF-8 characters inside rxvt-unicode. Of
course, text mode console is limited to the first set
of configuration, using the ISO 8859-1 character set.

This worked long before UTF-8 arrived with the glorious
idea that I should have 2 bytes where one is sufficient
to describe our (German) 6 umlauts and the Eszett ligature. :-)
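
To make the byte counts concrete, here is a small Python 3 sketch
that encodes those seven characters both ways. The helper name
encoded_sizes is only for illustration:

```python
def encoded_sizes(text):
    """Return (char, ISO 8859-1 bytes, UTF-8 bytes) for each character."""
    return [(ch, ch.encode("iso-8859-1"), ch.encode("utf-8")) for ch in text]


if __name__ == "__main__":
    # The six German umlauts plus the Eszett.
    for ch, latin1, utf8 in encoded_sizes("äöüÄÖÜß"):
        print(ch, latin1.hex(), len(latin1), utf8.hex(), len(utf8))
```

Each of these characters is a single byte in ISO 8859-1 (the "u"
with umlaut is 0xfc) and two bytes in UTF-8 (0xc3 0xbc for the
same character).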

Improper settings will result in [][] or A-tilde three
quarters upside-down question mark, depending on editor
or terminal used.
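
The mismatch can be reproduced in Python 3 by encoding a character
with one charset and decoding it with the other. This is only a
sketch of the effect; the helper name misread is hypothetical, and
what a given editor or terminal draws for an undecodable byte
(box, question mark, ...) depends on that program:

```python
def misread(text, written_as, read_as):
    """Encode text in one charset, then decode the bytes in another.

    Undecodable bytes become U+FFFD (the replacement character),
    which is roughly what shows up as a box or "?" on screen.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # UTF-8 bytes shown by something that expects ISO 8859-1:
    print(misread("ü", "utf-8", "iso-8859-1"))
    # ISO 8859-1 byte shown by something that expects UTF-8:
    print(misread("ü", "iso-8859-1", "utf-8"))
```

In the first case the two UTF-8 bytes of the "u" with umlaut come
out as two Latin characters (A-tilde followed by the one-quarter
sign); in the second case the single byte 0xfc is not valid UTF-8
and gets replaced.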


But returning to the original question, I think Robert
explained it very well: There is no real consensus
about what the different encodings should mean. They
were meant to unify the representation of a very large
set of characters, but there are many interpretations
now, and how they show up to the user depends on the
font in use, _if_ it has this mapping or that, or none.

For running ls, -w is the right option to use - but IN
COMBINATION with correct settings for the terminal
emulation AND the presence of a font that will do.
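
A simplified Python 3 model of why ls ends up printing "?" may
help here. It is not how ls is implemented; it only assumes that a
byte counts as printable when it decodes to a printable character
in the current single-byte charset, and that the modes behave like
the ls options of the same name (-q: "?", -B: octal escape, -w: raw
byte). The function name render_name is made up for illustration:

```python
def render_name(raw, encoding, mode="q"):
    """Model how a file name's bytes are emitted for a terminal.

    raw      -- file name as bytes
    encoding -- single-byte charset of the locale ("ascii" for C)
    mode     -- "q": replace unprintable bytes with "?"
                "B": replace them with a backslash octal escape
                "w": pass them through unchanged
    """
    out = bytearray()
    for byte in raw:
        try:
            printable = bytes([byte]).decode(encoding).isprintable()
        except UnicodeDecodeError:
            printable = False
        if printable or mode == "w":
            out.append(byte)
        elif mode == "B":
            out += b"\\%03o" % byte
        else:
            out += b"?"
    return bytes(out)


if __name__ == "__main__":
    name = b"\xfc.mp3"
    print(render_name(name, "ascii", "q"))
    print(render_name(name, "ascii", "B"))
    print(render_name(name, "iso-8859-1", "q"))
```

In this model the byte 0xfc is unprintable under the C locale, so
it becomes "?" or "\374", while under an ISO 8859-1 locale it is
passed through as is. With "w" the raw byte always reaches the
terminal, which is why the terminal settings and the font still
have to match.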

Again a fine demonstration why file names should be
limited to printable ASCII and no spaces if you want
them to work everywhere. :-)



-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...


