Date: Thu, 11 Apr 2002 15:11:13 +0000 (UTC)
From: naddy@mips.inka.de (Christian Weisgerber)
To: freebsd-chat@freebsd.org
Subject: Re: what are these characters please?
Message-ID: <a9492h$2g43$1@kemoauc.mips.inka.de>
References: <3CB571D6.2C10B9AA@mindspring.com> <20020411113858.E48BB3F30@bast.unixathome.org>
Dan Langille <dan@langille.org> wrote:

> I have to find a solution as non-ISO-8859-1 are causing grief when it
> comes to reading in the XML. See below.

Note that there is stuff in the commit logs that is valid but doesn't
make sense in ISO 8859-1 encoding. For example, somebody by the name
of "Slaven Rezi<E6>" is credited. I very much doubt that the final
character is really an ae ligature (as per 8859-1); c with acute
(8859-2) seems more plausible. It gets worse for Cyrillic names.

So if you assume the input to be ISO-8859-1-encoded, you will preserve
the stuff that was actually input in 8859-1 but totally screw up the
stuff that was originally input in some other encoding.

> I'm not at all worried about restoring the original text. I'm going for a
> "ignore what I can't use"-solution.

Okay.

> I think my goal here is remove all non-ISO-8859-1 characters from the
> incoming cvs-all message.

It makes more sense to clobber everything that isn't ASCII.

    chomp($line);
    $line =~ tr/\x09\x20-\x7E/?/c;    # keep tab and printable ASCII

Putting a replacement character such as '?' or '#' there is probably
less confusing than outright deleting the offending bytes.

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de
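For illustration, the replacement idea above might be wrapped up as follows. This is only a sketch: the helper name sanitize_line and the sample input are assumptions made for the example, not anything taken from the cvs-all processing scripts. It relies on tr's /c modifier to complement the search list and on tr returning the number of characters it translated.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original message): replace every
# byte that is neither a tab nor printable ASCII with '?'.
sub sanitize_line {
    my ($line) = @_;
    chomp $line;
    # /c complements the search list, so everything NOT listed is
    # translated to '?'. tr returns how many characters it replaced.
    my $replaced = ($line =~ tr/\x09\x20-\x7E/?/c);
    return ($line, $replaced);
}

# Sample input: a name ending in the single byte 0xE6.
my ($clean, $count) = sanitize_line("Slaven Rezi\xE6\n");
print "$clean ($count replaced)\n";
```

Returning the count makes it possible to log or flag lines that were altered, so the '?' characters in the output can be traced back to a commit message that needs a closer look.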