Date:      Thu, 11 Apr 2002 13:26:49 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        dan@langille.org
Cc:        chat@freebsd.org
Subject:   Re: what are these characters please?
Message-ID:  <3CB5F189.3DEA9304@mindspring.com>
References:  <20020411113858.E48BB3F30@bast.unixathome.org>

Dan Langille wrote:
> > Most likely, you are going to have to live with it.
> 
> I have to find a solution as non-ISO-8859-1 are causing grief when it
> comes to reading in the XML.  See below.

[ ... ]

> I'm not at all worried about restoring the original text.  I'm going for a
> "ignore what I can't use"-solution.
> 
> > Basically, anything that isn't ISO-8859-1 is pretty much lost, since
> > that's what CVS stores.
> 
> ISO-8859-1 is fine by me.  FWIW, the XML headers include:
> 
>   <?xml version="1.0" encoding="ISO-8859-1"?>
> 
> The encoding problem actually occurs later when I try to process the XML
> with XML::Parser :
> 
> not well-formed (invalid token) at line 14, column 34, byte 559 at
> /usr/local/lib/perl5/site_perl/5.005/i386-freebsd/XML/Parser.pm line 185
> 
> And line 14 is:
> 
>         [Submitted by: Ville SkyttESC,AdESC(B &lt;ville.skytta@iki.fi&gt;]
> 
> I think my goal here is remove all non-ISO-8859-1 characters from the
> incoming cvs-all message.  I've been searching newsgroups (comp.lang.perl
> and comp.text.xml) trying to find a simple solution.


An "escape" character *is* a valid ISO-8859-1 character.
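As a rough illustration of why the parser can still reject it: the XML 1.0 "Char" production excludes most C0 control characters, ESC (0x1B) among them, regardless of the declared encoding. The sketch below is in Python rather than Perl, and the function name is_xml10_char is made up for this example.

```python
def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# ESC decodes without complaint as ISO-8859-1 ...
esc = b"\x1b".decode("iso-8859-1")
# ... but it is not a legal character inside an XML 1.0 document.
print(is_xml10_char(ord(esc)))
# A Latin-1 letter such as a-umlaut (0xE4) is fine.
print(is_xml10_char(0xE4))
```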

> > If you want to get complicated, the email address is actually
> > <ville.skytta@iki.fi>, and anything not inside the "<" ">" is
> > comments.  Email addresses aren't allowed to have special
> > characters in them (US ASCII strikes again!).
> 
> I agree, it's too complicated for the objective at hand.


The only other option would be to pre-parse for ANSI escape sequences,
and strip them.

This basically means eating everything between the <ESC> and
the next character between 0x40 and 0x80 (for the most part;
that should do it for what you have seen so far, unless you
hit something like sixels).
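A minimal sketch of that pre-parse step, in Python rather than Perl, reading the range as 0x40 through 0x7F. The pattern and the function name strip_escape_sequences are assumptions for illustration only. Note that for sequences starting with <ESC>[ this rule stops at the "[" itself, so any parameters after it would be left behind and need extra handling.

```python
import re

# ESC (0x1B), then any run of bytes outside 0x40-0x7F,
# then the first byte that falls inside 0x40-0x7F.
ESC_SEQ = re.compile(rb"\x1b[^\x40-\x7f]*[\x40-\x7f]")

def strip_escape_sequences(data: bytes) -> bytes:
    """Drop every escape sequence matched by ESC_SEQ; leave other bytes alone."""
    return ESC_SEQ.sub(b"", data)

# The line from the error report, with "ESC" written as the real 0x1B byte.
line = b"[Submitted by: Ville Skytt\x1b,Ad\x1b(B &lt;ville.skytta@iki.fi&gt;]"
print(strip_escape_sequences(line).decode("iso-8859-1"))
```

Run against that line, both sequences are removed and a bare "d" remains where the accented character was, which fits the "ignore what I can't use" goal but does not restore the original name.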

-- Terry




