From: "Dan Langille"
Organization: DVL Software Limited
To: Terry Lambert
Cc: chat@freebsd.org
Date: Thu, 11 Apr 2002 07:38:04 -0400
Subject: Re: what are these characters please?
Reply-To: dan@langille.org
In-reply-to: <3CB571D6.2C10B9AA@mindspring.com>
Message-Id: <20020411113858.E48BB3F30@bast.unixathome.org>

On 11 Apr 2002 at 4:21, Terry Lambert wrote:

> Dan Langille wrote:
> > On 10 Apr 2002 at 19:59, Terry Lambert wrote:
> > > > Ville Skytt^[,Ad^[(B
> > >
> > > ANSI character set selector escape sequence for 7 bit representation of
> > > 8 bit characters.
> > >
> > > If I had to guess, I would say "eth", which is a "D" with a bar in it,
> > > unlike "thorn", which is an "O" with a forward slash through it. 8-).
> > >
> > > Obviously a deficiency in the encapsulation of a cut-and-paste
> > > that was not attributed by encoding, because CVS commit logs are
> > > not MIME encapsulated.
> >
> > Given that I'm trying to process the cvs-all messages into XML documents
> > (using the perl module XML::Writer which does not do any encoding beyond
> > characters such as >, <, etc), any suggestions as to how to deal with
> > such characters?
> > I've been looking through cpan but I suspect I'm using
> > the wrong search criteria ("encoding"). Any clues?
>
> The character sets selected are documented in ANSI 3.64; you can
> also find them in the VT220 and VT320 programming guides. Given
> that the committer was likely using EUC encoding for JIS-208, it
> seems unrecoverable.
>
> Most likely, you are going to have to live with it.

I have to find a solution, as non-ISO-8859-1 characters are causing grief
when it comes to reading in the XML. See below.

> So you would need to know the original character set (ISO-8859-1 is
> my guess, given the poster's Finnish email address), and the input
> method and display character set used (I would say it was cut from
> a "kterm" and pasted through a Kanji EUC or Shift-JIS input method,
> given the committer's email address).

I'm not at all worried about restoring the original text. I'm going for
an "ignore what I can't use" solution.

> Basically, anything that isn't ISO-8859-1 is pretty much lost, since
> that's what CVS stores.

ISO-8859-1 is fine by me. FWIW, the XML headers include:

The encoding problem actually occurs later, when I try to process the XML
with XML::Parser:

not well-formed (invalid token) at line 14, column 34, byte 559 at
/usr/local/lib/perl5/site_perl/5.005/i386-freebsd/XML/Parser.pm line 185

And line 14 is:

[Submitted by: Ville SkyttESC,AdESC(B <ville.skytta@iki.fi>]

I think my goal here is to remove all non-ISO-8859-1 characters from the
incoming cvs-all message. I've been searching newsgroups (comp.lang.perl
and comp.text.xml) trying to find a simple solution.

> If you want to get complicated, the email address is actually
> , and anything not inside the "<" ">" is
> comments. Email addresses aren't allowed to have special
> characters in them (US ASCII strikes again!).

I agree, it's too complicated for the objective at hand.
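
For illustration, a minimal sketch of the "ignore what I can't use"
approach described above. It is written in Python rather than Perl, and
the name clean_for_xml is hypothetical. It assumes the message contains
raw ESC bytes (0x1B) where this mail shows "^[" or "ESC", and that the
parser rejects them because XML 1.0 forbids most control characters,
whatever encoding is declared. The sketch drops the character set
selector escape sequences and any remaining forbidden control characters;
it does not try to recover the original letter.

```python
import re

# An escape sequence: ESC, optional intermediate bytes (0x20-0x2F),
# then one final byte (0x30-0x7E). Covers ESC , A and ESC ( B.
ESCAPE_SEQ = re.compile(r"\x1b[\x20-\x2f]*[\x30-\x7e]")

# Control characters that XML 1.0 does not allow:
# everything below 0x20 except TAB (0x09), LF (0x0A) and CR (0x0D).
XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_xml(text):
    """Strip escape sequences, then any leftover illegal control characters."""
    text = ESCAPE_SEQ.sub("", text)
    return XML_ILLEGAL.sub("", text)


line = "[Submitted by: Ville Skytt\x1b,Ad\x1b(B <ville.skytta@iki.fi>]"
print(clean_for_xml(line))
# prints: [Submitted by: Ville Skyttd <ville.skytta@iki.fi>]
```

The 7-bit "d" that stood in for the 8-bit character survives as a plain
"d", which fits the goal of discarding what cannot be used rather than
restoring it. Both patterns use only basic regular expression syntax, so
they should carry over to a Perl substitution with little change.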
--
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples