From: naddy@mips.inka.de (Christian Weisgerber)
Subject: Re: what are these characters please?
Date: Thu, 11 Apr 2002 15:11:13 +0000 (UTC)
References: <3CB571D6.2C10B9AA@mindspring.com> <20020411113858.E48BB3F30@bast.unixathome.org>
To: freebsd-chat@freebsd.org

Dan Langille wrote:

> I have to find a solution as non-ISO-8859-1 are causing grief when it
> comes to reading in the XML. See below.

Note that there is stuff in the commit logs that is valid but doesn't
make sense when read as ISO 8859-1. For example, somebody by the name
of "Slaven Reziæ" is credited. I very much doubt that the final
character is really an ae ligature (as per 8859-1); c with acute
(8859-2) seems more plausible. It gets worse for Cyrillic names.

So if you assume the input to be ISO-8859-1-encoded, you will preserve
the stuff that was actually input in 8859-1 but totally screw up the
stuff that was originally input in some other encoding.
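
To illustrate the ambiguity, here is a minimal sketch. It assumes the byte at the end of that name is 0xE6 (the position that holds the ae ligature in 8859-1 and c with acute in 8859-2) and that Perl's core Encode module is available; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the problematic byte is 0xE6, which is a valid character
# in both ISO 8859-1 and ISO 8859-2.
my $bytes = "Slaven Rezi\xE6";

# Decode the very same bytes under each assumed encoding.
my $as_latin1 = decode('iso-8859-1', $bytes);
my $as_latin2 = decode('iso-8859-2', $bytes);

# Print the Unicode code point of the final character in each reading.
printf "8859-1: U+%04X\n", ord(substr($as_latin1, -1));   # U+00E6, ae ligature
printf "8859-2: U+%04X\n", ord(substr($as_latin2, -1));   # U+0107, c with acute
```

Both decodings succeed without error, so nothing in the bytes themselves tells you which reading was intended.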

> I'm not at all worried about restoring the original text. I'm going for a
> "ignore what I can't use"-solution.

Okay.

> I think my goal here is remove all non-ISO-8859-1 characters from the
> incoming cvs-all message.

Since you can't tell which encoding a given high-bit byte came from,
it makes more sense to clobber everything that isn't ASCII.

    chomp($line);
    # Replace every byte that is not a tab or printable ASCII with '?'.
    $line =~ tr/\x09\x20-\x7E/?/c;

Putting a replacement character such as '?' or '#' there is probably
less confusing than outright deleting the offending bytes.

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de