From: Terry Lambert
Date: Thu, 11 Apr 2002 13:26:49 -0700
To: dan@langille.org
Cc: chat@freebsd.org
Subject: Re: what are these characters please?

Dan Langille wrote:
> > Most likely, you are going to have to live with it.
>
> I have to find a solution as non-ISO-8859-1 are causing grief when it
> comes to reading in the XML.  See below.

[ ... ]

> I'm not at all worried about restoring the original text.  I'm going for a
> "ignore what I can't use"-solution.
>
> > Basically, anything that isn't ISO-8859-1 is pretty much lost, since
> > that's what CVS stores.
>
> ISO-8859-1 is fine by me.
> FWIW, the XML headers include:
>
>
>
> The encoding problem actually occurs later when I try to process the XML
> with XML::Parser :
>
>   not well-formed (invalid token) at line 14, column 34, byte 559 at
>   /usr/local/lib/perl5/site_perl/5.005/i386-freebsd/XML/Parser.pm line 185
>
> And line 14 is:
>
>   [Submitted by: Ville SkyttESC,AdESC(B <ville.skytta@iki.fi>]
>
> I think my goal here is remove all non-ISO-8859-1 characters from the
> incoming cvs-all message.  I've been searching newsgroups (comp.lang.perl
> and comp.text.xml) trying to find a simple solution.

An "escape" character *is* a valid ISO-8859-1 character.

> > If you want to get complicated, the email address is actually
> > , and anything not inside the "<" ">" is
> > comments.  Email addresses aren't allowed to have special
> > characters in them (US ASCII strikes again!).
>
> I agree, it's too complicated for the objective at hand.

The only other option would be to pre-parse for ANSI escape
sequences and strip them.  This basically means eating everything
from the <ESC> through the next character between 0x40 and 0x80 (for
the most part; that should do it for what you have seen so far,
unless you hit something like sixels).

-- Terry
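
For what it's worth, the pre-parse described above could be sketched roughly as follows. This is Python rather than the Perl used in the thread, and the function name, the regular expression, and the reading of "between 0x40 and 0x80" as the byte range 0x40 through 0x7F are all assumptions made for illustration, not anything taken from the original messages.

```python
import re

# Assumed reading of the rule: an escape sequence is ESC (0x1B), then any
# run of bytes outside 0x40..0x7F, then one terminating byte in 0x40..0x7F.
ESC_SEQ = re.compile(rb"\x1b[^\x40-\x7f]*[\x40-\x7f]")


def strip_escape_sequences(data: bytes) -> bytes:
    """Remove every ESC ... <final byte> run from data (hypothetical helper)."""
    return ESC_SEQ.sub(b"", data)


# Line 14 from the quoted error, with the literal ESC bytes put back in.
line = b"[Submitted by: Ville Skytt\x1b,Ad\x1b(B <ville.skytta@iki.fi>]"
print(strip_escape_sequences(line))
# prints b'[Submitted by: Ville Skyttd <ville.skytta@iki.fi>]'
```

The result reads "Skyttd", which loses the original letter but fits the "ignore what I can't use" goal. Taken literally, the rule stops at the first byte in that range, so a sequence of the form ESC [ ... would lose only the ESC and the "[" and leave its parameters behind; that is presumably part of the "for the most part" caveat.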