From owner-freebsd-chat Thu Apr 11 4:22:44 2002
Delivered-To: freebsd-chat@freebsd.org
Received: from falcon.prod.itd.earthlink.net (falcon.mail.pas.earthlink.net [207.217.120.74])
	by hub.freebsd.org (Postfix) with ESMTP id 267EA37B400
	for ; Thu, 11 Apr 2002 04:22:37 -0700 (PDT)
Received: from pool0021.cvx21-bradley.dialup.earthlink.net ([209.179.192.21] helo=mindspring.com)
	by falcon.prod.itd.earthlink.net with esmtp (Exim 3.33 #1)
	id 16vcer-00063s-00; Thu, 11 Apr 2002 04:22:30 -0700
Message-ID: <3CB571D6.2C10B9AA@mindspring.com>
Date: Thu, 11 Apr 2002 04:21:58 -0700
From: Terry Lambert
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony} (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: dan@langille.org
Cc: chat@freebsd.org
Subject: Re: what are these characters please?
References: <20020411102024.3E6283F30@bast.unixathome.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-chat@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.org

Dan Langille wrote:
> On 10 Apr 2002 at 19:59, Terry Lambert wrote:
>
> > > Ville Skytt^[,Ad^[(B
> >
> > ANSI character set selector escape sequence for 7 bit representation
> > of 8 bit characters.
> >
> > If I had to guess, I would say "eth", which is a "D" with a bar in it,
> > unlike "thorn", which is an "O" with a forward slash through it. 8-).
> >
> > Obviously a deficiency in the encapsulation of a cut-and-paste
> > that was not attributed by encoding, because CVS commit logs are
> > not MIME encapsulated.
>
> Given that I'm trying to process the cvs-all messages into XML documents
> (using the perl module XML::Writer which does not do any encoding beyond
> characters such as >, <, etc), any suggestions as to how to deal with such
> characters? I've been looking through cpan but I suspect I'm using the
> wrong search criteria ("encoding"). Any clues?
The character sets selected are documented in ANSI X3.64; you can
also find them in the VT220 and VT320 programming guides.

Given that the committer was likely using EUC encoding for JIS-208,
it seems unrecoverable.

Most likely, you are going to have to live with it. The problem is
that the character set attribution was lost in the cut-and-paste
job, and it was probably the input method of the session doing the
cut-and-paste that replaced the character with the escape sequence.

So you would need to know the original character set (ISO-8859-1 is
my guess, given the poster's Finnish email address), and the input
method and display character set used (I would say it was cut from
a "kterm" and pasted through a Kanji EUC or Shift-JIS input method,
given the committer's email address).

Basically, anything that isn't ISO-8859-1 is pretty much lost, since
that's what CVS stores.

If you want to get complicated, the email address is actually , and
anything not inside the "<" ">" is a comment. Email addresses aren't
allowed to have special characters in them (US ASCII strikes again!).

I don't think you are going to be able to automate it into a
particular character set, because the posting isn't in a particular
character set. You're basically going to get whatever is in the CVS
logs, as is, which will occasionally mean some strange stuff.

-- Terry
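
For anyone who wants to try a best-effort cleanup before the text reaches XML, here is a minimal Python sketch. It rests on an assumption that nothing in the log itself confirms: that "^[,A" is the ISO 2022 style designation of the ISO-8859-1 right half into the 7-bit code space (a convention used by Emacs/Mule), and that "^[(B" switches back to ASCII. The function names decode_7bit_latin1 and xml_text are hypothetical, made up for this illustration. Other escape sequences, and bytes that were mangled before the commit, are passed through untouched.

```python
from xml.sax.saxutils import escape

ESC = 0x1B


def decode_7bit_latin1(data: bytes) -> str:
    """Decode text containing ESC , A ... ESC ( B runs.

    ASSUMPTION: ESC , A selects the ISO-8859-1 right half and
    ESC ( B returns to US-ASCII. Unshifted bytes are read as
    ISO-8859-1, since that is what the log is presumed to hold.
    """
    out = []
    high = False  # True while the Latin-1 right half is selected
    i = 0
    while i < len(data):
        if data[i] == ESC and data[i + 1:i + 3] == b",A":
            high = True
            i += 3
        elif data[i] == ESC and data[i + 1:i + 3] == b"(B":
            high = False
            i += 3
        else:
            b = data[i]
            # In the shifted state a 7-bit byte stands for the 8-bit
            # character with the top bit set (0x64 -> 0xE4).
            if high and 0x20 <= b <= 0x7F:
                b |= 0x80
            out.append(chr(b))
            i += 1
    return "".join(out)


def xml_text(s: str) -> str:
    """Escape &, <, > and turn non-ASCII into numeric character references."""
    return escape(s).encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    raw = b"Ville Skytt\x1b,Ad\x1b(B"
    print(xml_text(decode_7bit_latin1(raw)))
```

Under that assumption the byte after the selector, 0x64, maps to 0xE4, so the sample line comes out as "Ville Skytt&#228;". Emitting numeric character references keeps the XML pure ASCII, so it does not matter what encoding the XML writer declares. If the assumption about the escape sequence is wrong for a given log entry, the output will be wrong too, so it is worth spot-checking results against names you already know.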