Date:      Thu, 11 Apr 2002 04:21:58 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        dan@langille.org
Cc:        chat@freebsd.org
Subject:   Re: what are these characters please?
Message-ID:  <3CB571D6.2C10B9AA@mindspring.com>
References:  <20020411102024.3E6283F30@bast.unixathome.org>

Dan Langille wrote:
> On 10 Apr 2002 at 19:59, Terry Lambert wrote:
> > > Ville Skytt^[,Ad^[(B <ville.skytta@iki.fi>
> >
> > ANSI character set selector escape sequence for 7 bit representation
> > of 8 bit characters.
> >
> > If I had to guess, I would say "eth", which is a "D" with a bar in it,
> > unlike "thorn", which is an "O" with a forward slash through it.  8-).
> >
> > Obviously a deficiency in the encapsulation of a cut-and-paste
> > that was not attributed by encoding, because CVS commit logs are
> > not MIME encapsulated.
> 
> Given that I'm trying to process the cvs-all messages into XML documents
> (using the perl module XML::Writer which does not do any encoding beyond
> characters such as >, <, etc), any suggestions as to how to deal with such
> characters?  I've been looking through cpan but I suspect I'm using the
> wrong search criteria ("encoding").  Any clues?

The character sets selected are documented in ANSI X3.64; you can
also find them in the VT220 and VT320 programming guides.  Given
that the committer was likely using EUC encoding for JIS X 0208, the
original character seems unrecoverable.

Most likely, you are going to have to live with it.

The problem is that the character set attribution was lost in the
cut-and-paste job, and it was the input method of the session doing
the cut-and-paste that probably replaced it with the escape sequence.

So you would need to know the original character set (ISO-8859-1 is
my guess, given the poster's Finnish email address), and the input
method and display character set used (I would say it was cut from
a "kterm" and pasted through a Kanji EUC or Shift-JIS input method,
given the committer's email address).
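
If you want to experiment with one particular guess, here is a minimal
sketch in Python.  It assumes (this is an assumption, not something the
log itself tells you) that "ESC , A" selects the upper half of
ISO-8859-1 in place of ASCII and that "ESC ( B" switches back to ASCII,
so each 7 bit code in between stands for the same code with the eighth
bit set.  The function name is made up for illustration.

```python
# Assumption: ESC , A selects the ISO-8859-1 upper half (sent as 7 bit codes),
# and ESC ( B selects plain ASCII again.
LATIN1_UPPER = b"\x1b,A"
ASCII = b"\x1b(B"

def decode_7bit_latin1(raw: bytes) -> str:
    """Undo a 7 bit representation of ISO-8859-1 text under the assumption above."""
    out = []
    upper = False  # True while the ISO-8859-1 upper half is selected
    i = 0
    while i < len(raw):
        if raw.startswith(LATIN1_UPPER, i):
            upper = True
            i += len(LATIN1_UPPER)
        elif raw.startswith(ASCII, i):
            upper = False
            i += len(ASCII)
        else:
            byte = raw[i]
            if upper and 0x20 <= byte <= 0x7F:
                byte |= 0x80  # put the eighth bit back
            out.append(bytes([byte]).decode("latin-1"))
            i += 1
    return "".join(out)
```

This only handles that one pair of sequences.  Text pasted through some
other input method, or with no escape sequence at all, carries no marker
to key on, so the guess has to be made per message.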

Basically, anything that isn't ISO-8859-1 is pretty much lost, since
that's what CVS stores.

If you want to get complicated, the email address is actually
<ville.skytta@iki.fi>, and anything not inside the "<" ">" is
comments.  Email addresses aren't allowed to have special
characters in them (US ASCII strikes again!).
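
A rough sketch of keying on the address alone, in Python.  The pattern
and the function name are illustrative assumptions; this is not a full
RFC 822 parser, it just takes the first thing that looks like an address
between angle brackets and ignores everything outside them.

```python
import re

# Something@something between "<" and ">", with no whitespace or nested brackets.
ADDR_IN_BRACKETS = re.compile(r"<([^<>\s]+@[^<>\s]+)>")

def extract_address(header_value: str):
    """Return the address inside "<" ">", or None if there isn't one."""
    match = ADDR_IN_BRACKETS.search(header_value)
    return match.group(1) if match else None
```

Since the address itself is plain US ASCII, it survives whatever happened
to the name in front of it, which makes it a safer key for identifying
the person than the display name.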

I don't think you are going to be able to automate it into a
particular character set because the posting isn't in a particular
character set.  You're basically going to get whatever is in the
CVS logs, as is, which will mean some strange stuff, occasionally.
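
One thing you can do mechanically is keep the strange stuff from breaking
the XML.  XML 1.0 does not allow most control characters, ESC (0x1B)
included, so a log line containing an escape sequence will not be
well-formed if written out as is.  A minimal sketch in Python (the
function name and the choice of U+FFFD as the replacement are my
assumptions, pick whatever marker suits you):

```python
import re

# Anything outside the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF are allowed.
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def xml_safe(text: str, replacement: str = "\uFFFD") -> str:
    """Replace characters that XML 1.0 forbids with a visible marker."""
    return _XML_ILLEGAL.sub(replacement, text)
```

This does not recover the original character; it only marks where
something was lost so the document still parses.  The usual escaping of
"<", ">" and "&" still has to happen separately.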

-- Terry




