FreeBSD Mail Archives

Date:      Thu, 11 Apr 2002 07:38:04 -0400
From:      "Dan Langille" <dan@langille.org>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        chat@freebsd.org
Subject:   Re: what are these characters please?
Message-ID:  <20020411113858.E48BB3F30@bast.unixathome.org>
In-Reply-To: <3CB571D6.2C10B9AA@mindspring.com>

On 11 Apr 2002 at 4:21, Terry Lambert wrote:

> Dan Langille wrote:
> > On 10 Apr 2002 at 19:59, Terry Lambert wrote:
> > > > Ville Skytt^[,Ad^[(B <ville.skytta@iki.fi>
> > >
> > > ANSI character set selector escape sequence for 7 bit representation of
> > > 8 bit characters.
> > >
> > > If I had to guess, I would say "eth", which is a "D" with a bar in it,
> > > unlike "thorn", which is an "O" with a forward slash through it.  8-).
> > >
> > > Obviously a deficiency in the encapsulation of a cut-and-paste
> > > that was not attributed by encoding, because CVS commit logs are
> > > not MIME encapsulated.
> > 
> > Given that I'm trying to process the cvs-all messages into XML documents
> > (using the perl module XML::Writer which does not do any encoding beyond
> > characters such as >, <, etc), any suggestions as to how to deal with
> > such characters?  I've been looking through cpan but I suspect I'm using
> > the wrong search criteria ("encoding").  Any clues?
> 
> The character sets selected are documented in ANSI X3.64; you can
> also find them in the VT220 and VT320 programming guides.  Given
> that the committer was likely using EUC encoding for JIS-208, it
> seems unrecoverable.
> 
> Most likely, you are going to have to live with it.

I have to find a solution, as non-ISO-8859-1 characters are causing grief 
when it comes to reading in the XML.  See below.

> So you would need to know the original character set (ISO-8859-1 is
> my guess, given the poster's Finnish email address), and the input
> method and display character set used (I would say it was cut from
> a "kterm" and pasted through a Kanji EUC or Shift-JIS input method,
> given the committer's email address).

I'm not at all worried about restoring the original text.  I'm going for a 
"ignore what I can't use"-solution.

> Basically, anything that isn't ISO-8859-1 is pretty much lost, since
> that's what CVS stores.

ISO-8859-1 is fine by me.  FWIW, the XML headers include:

  <?xml version="1.0" encoding="ISO-8859-1"?>

The encoding problem actually occurs later when I try to process the XML 
with XML::Parser:

not well-formed (invalid token) at line 14, column 34, byte 559 at 
/usr/local/lib/perl5/site_perl/5.005/i386-freebsd/XML/Parser.pm line 185

And line 14 is:

        [Submitted by: Ville SkyttESC,AdESC(B &lt;ville.skytta@iki.fi&gt;]
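
A likely reading of that error: the ESC byte (0x1B) is a valid ISO-8859-1
code point, but XML 1.0 forbids C0 control characters other than tab, LF
and CR, so the parser rejects it whatever encoding the header declares.
The sketch below tries to reproduce this in Python rather than Perl, using
Python's expat binding (XML::Parser is built on the same expat library).
The sample documents and the name first_parse_error are made up for
illustration.

```python
import xml.parsers.expat


def first_parse_error(document: bytes):
    """Return expat's error message for `document`, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True: this is the final chunk
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


HEADER = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'

# Hypothetical sample: the raw escape sequences left in the text.
with_escapes = HEADER + b"<log>Ville Skytt\x1b,Ad\x1b(B</log>"

# Hypothetical sample: a plain ISO-8859-1 byte (0xE4) in the same place.
latin1_only = HEADER + b"<log>Ville Skytt\xe4</log>"

print(first_parse_error(with_escapes))
print(first_parse_error(latin1_only))
```

If this reading is right, the first document fails with the same "not
well-formed (invalid token)" message and the second parses cleanly.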

I think my goal here is to remove all non-ISO-8859-1 characters from the 
incoming cvs-all message.  I've been searching newsgroups (comp.lang.perl 
and comp.text.xml) trying to find a simple solution.
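
One way to get the "ignore what I can't use" behaviour is to drop the
escape sequences and any remaining control bytes before the text reaches
the XML writer. This is a minimal sketch in Python rather than Perl; it
assumes the sequences follow the usual ISO 2022 shape (ESC, optional
intermediate bytes 0x20-0x2F, one final byte 0x30-0x7E), and the name
strip_for_xml is hypothetical. The same two regular expressions should
carry over to Perl's s/// with little change.

```python
import re

# ISO 2022 escape sequence: ESC, zero or more intermediate bytes
# (0x20-0x2F), then one final byte (0x30-0x7E). Assumed shape.
ESCAPE_SEQUENCE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")

# C0 control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), LF (0x0A) and CR (0x0D).
ILLEGAL_CONTROL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_for_xml(raw: bytes) -> bytes:
    """Remove escape sequences, then any leftover illegal control bytes."""
    without_escapes = ESCAPE_SEQUENCE.sub(b"", raw)
    return ILLEGAL_CONTROL.sub(b"", without_escapes)


line = b"Ville Skytt\x1b,Ad\x1b(B <ville.skytta@iki.fi>"
print(strip_for_xml(line))  # b'Ville Skyttd <ville.skytta@iki.fi>'
```

Note that this does not restore the original character: the "d" that sat
between the two escape sequences survives as a plain ASCII "d". The output
still needs the usual < and > escaping before it goes into the document.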

> If you want to get complicated, the email address is actually
> <ville.skytta@iki.fi>, and anything not inside the "<" ">" is
> comments.  Email addresses aren't allowed to have special
> characters in them (US ASCII strikes again!).

I agree, it's too complicated for the objective at hand.
-- 
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20020411113858.E48BB3F30>
