From: naddy@mips.inka.de (Christian Weisgerber)
Subject: Re: what are these characters please?
Date: Thu, 11 Apr 2002 14:47:41 +0000 (UTC)
References: <20020411125429.C73703F30@bast.unixathome.org>
To: freebsd-chat@freebsd.org

Dan Langille wrote:

> > Well what encoding do your XML documents use?
>
> It was UTF-8.  Some months ago it changed to ISO-8859-1 when I first
> encountered this type of issue (back then it was Lyngbl).

Seems like a bad choice to me, because how are you now going to
handle characters outside the meager repertoire of ISO 8859-1?

> Given that the incoming characters are supposed to be ISO-8859-1 (which is
> what CVS stores (see Tony's message), Terry

This is wrong.  CVS stores byte streams.  There is no implied
character set.  Nor is there a way to tag any data or CVS metadata
with a character set.
You can _by convention_ decide that all data stored in a particular
CVS repository is to be interpreted in one particular character set,
but I'm not aware of such a convention being in place for FreeBSD.

> I'm quite sure the best thing to do is just ignore the non-standard
> characters (i.e. by removing them).  What's your view on that approach?

I still don't know quite what you are trying to accomplish.  Are
you looking for a purely mechanical solution?  Or are you prepared
to do manual fix-ups?  Do you strive for accuracy?  Or do you only
want to crunch data quickly and not care whether people's names are
mutilated?

Since CVS doesn't store character set information, anything outside
the printable ASCII range (0x20..0x7E) is *undefined* and thus
basically an error condition.  There are two ways to deal with this:

1. You can just automatically strip the characters (or replace them
   by a placeholder like '?' or such) and move on.  This will
   mutilate some names, but since the input is already undefined,
   you can argue that you won't do any further damage anyway.

2. You can manually try to figure out what those characters are and
   fix them up in one of several ways: replace by UTF-8, convert
   to ASCII-only, etc.

If you go with (1), I strongly suggest that you kill everything
outside ASCII and do not consider the input to be ISO 8859-1.
Grepping over the FreeBSD commit logs, I see names that, although
technically valid ISO 8859-1 sequences, were clearly input in
ISO 8859-2 or KOI8-R environments.

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de
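
For illustration, option (1) and the ambiguity noted above can be
sketched in Python.  This is only a sketch under stated assumptions:
the function names strip_undefined and candidate_readings, the choice
to let tab and newline through, and the sample bytes are all made up
for the example and are not taken from any FreeBSD tooling.

```python
# Sketch only: all names and the sample input are hypothetical.

def strip_undefined(raw: bytes, placeholder: str = "?",
                    allowed_extra: bytes = b"\t\n") -> str:
    """Replace every byte outside printable ASCII (0x20..0x7E).

    allowed_extra is an assumption of this sketch: it lets tab and
    newline through so that multi-line log text keeps its layout.
    """
    out = []
    for b in raw:
        if 0x20 <= b <= 0x7E or b in allowed_extra:
            out.append(chr(b))
        else:
            out.append(placeholder)
    return "".join(out)


def candidate_readings(raw: bytes,
                       encodings=("iso-8859-1", "iso-8859-2", "koi8-r")):
    """Decode the same bytes under several single-byte character sets.

    Python's codecs for these three map every byte value, so decoding
    does not fail; it just yields different text for the same input,
    which is why the bytes alone cannot tell you the character set.
    """
    return {enc: raw.decode(enc) for enc in encodings}


if __name__ == "__main__":
    sample = b"J\xf8rgen"          # made-up name with one 8-bit byte
    print(strip_undefined(sample))
    for enc, text in candidate_readings(sample).items():
        print(enc, text)
```

Passing placeholder="" drops the undefined bytes instead of marking
them.  The second function is only a diagnostic aid for the manual
route (2): it shows the competing readings, it does not pick one.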