From: "Dan Langille"
Organization: DVL Software Limited
To: freebsd-chat@freebsd.org
Date: Thu, 11 Apr 2002 14:52:24 -0400
Subject: CVS log encoding (was Re: what are these characters please?)
Reply-To: dan@langille.org
Message-Id: <20020411185322.E34203F30@bast.unixathome.org>

I have combined your two replies into one message.

On 11 Apr 2002 at 14:47, Christian Weisgerber wrote:

> Dan Langille wrote:
>
> > > Well what encoding do your XML documents use?
> >
> > It was UTF-8.  Some months ago it changed to ISO-8859-1 when I first
> > encountered this type of issue (back then it was Lyngbl).
>
> Seems like a bad choice to me, because how are you now going to
> handle characters outside the meager repertoire of ISO 8859-1?
>
> > Given that the incoming characters are supposed to be ISO-8859-1 (which
> > is what CVS stores (see Tony's message),
>                           Terry
>
> This is wrong.  CVS stores byte streams.  There is no implied character
> set.  Nor is there a way to tag any data or CVS meta data with a
> character set.

[sorry Terry; I worked with a chap in New Zealand by the name of Tony
Lamberton and whenever I see your name...]

> You can _by convention_ decide that all data stored in a particular
> CVS repository is to be interpreted in a given character set,
> but I'm not aware of such a convention being in place for FreeBSD.

If there is no convention, then it will be up to me to pick an encoding
and stick with it.

> > I'm quite sure the best thing to do is just ignore the non-standard
> > characters (i.e. by removing them).  What's your view on that approach?
>
> I still don't know quite what you are trying to accomplish.  Are
> you looking for a purely mechanical solution?  Or are you prepared
> to do manual fix-ups?  Do you strive for accuracy?  Or do you only want
> to quickly crunch data and don't care if people's names are mutilated?

The goal is to accurately reflect the cvs log (see
http://test.freshports.org for the beta set).  But since I've started to
encounter these characters which are causing strife, I'm willing to take
what I can get.

> Since CVS doesn't store character set information, anything outside
> the printable ASCII range (0x20..0x7E) is *undefined* and thus
> basically an error condition.  There are two ways to deal with this:
>
> 1. You can just automatically strip the characters (or replace them
>    by a placeholder like '?' or such) and get on.  This will mutilate
>    some names, but since the input is already undefined, you can
>    argue that you really won't do any further damage anyway.
>
> 2. You can manually try to figure out what those characters are and
>    fix them up in one of several ways: replace by UTF-8, convert
>    to ASCII-only, etc.

I like a combination of the two:

- Fix any characters which are outside the chosen encoding and save
  the data immediately.  Flag the record as having been altered.
- Optionally fix flagged records at some future date.

This will achieve the primary goal of always having up-to-date
information and [optionally] achieve a not-so-primary goal of having
accurate data.

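The "fix now, flag for later" combination can be sketched roughly as
follows.  This is only an illustration, not code from FreshPorts: the
helper name sanitize_line and the sample log line are made up, and it
assumes the "chosen encoding" is plain ASCII plus tab, as suggested
elsewhere in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original thread): replace every
# byte outside tab and printable ASCII (0x20..0x7E) with '?', and
# report whether anything was changed so that the caller can flag the
# record for a later manual fix-up.
sub sanitize_line {
    my ($line) = @_;
    chomp $line;    # drop the trailing newline so it is not replaced
    # With /c the search list is complemented; tr/// returns the
    # number of characters it replaced.
    my $replaced = ($line =~ tr/\x09\x20-\x7E/?/c);
    return ($line, $replaced > 0 ? 1 : 0);
}

# Made-up sample input containing one byte outside ASCII (0xF6).
my ($clean, $altered) = sanitize_line("Fix by J\xF6rg, see PR\n");
print "$clean\n";
print "flag this record for manual fix-up\n" if $altered;
```

The flag is what makes the second, optional pass possible: only the
records that were actually altered need to be looked at again.
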
> If you go with (1), I strongly suggest that you kill everything
> outside ASCII and do not consider the input to be ISO 8859-1.
> Grepping over the FreeBSD commit logs, I see names that, although
> technically valid ISO 8859-1 sequences, were clearly input in ISO
> 8859-2 or KOI-8R environments.

Thank you for grepping those logs for me.  It would be good if we could
have one encoding which covers all possible characters.  I think I'll
settle for the UTF-8 encoding (unless you can recommend another).

On 11 Apr 2002 at 15:11, Christian Weisgerber wrote:

> Dan Langille wrote:
>
> > I have to find a solution as non-ISO-8859-1 characters are causing
> > grief when it comes to reading in the XML.  See below.
>
> Note that there is stuff in the commit logs that is valid but doesn't
> make sense in ISO 8859-1 encoding.  For example, somebody by the
> name of "Slaven Rezi" is credited.  I very much doubt that the
> final character is really ae ligature (as per 8859-1); c with acute
> (8859-2) seems more plausible.  It gets worse for Cyrillic names.

I'm beginning to see the extent of the problem.

> So if you assume the input to be ISO-8859-1-encoded, you will
> preserve the stuff that was actually input in 8859-1 but totally
> screw up the stuff that was originally input in some other encoding.

That points at using something like UTF-8 I think.

> > I'm not at all worried about restoring the original text.  I'm going
> > for an "ignore what I can't use"-solution.
>
> Okay.
>
> > I think my goal here is to remove all non-ISO-8859-1 characters from
> > the incoming cvs-all message.
>
> It makes more sense to clobber everything that isn't ASCII.
>
>   chomp($line);
>   $line =~ tr/\x09\x20-\x7E/?/c;    # tab, printable ASCII
>
> Putting a replacement character such as '?' or '#' there is probably
> less confusing than outright deleting the offending bytes.

Good point.  That will ease the manual fix-up process too.

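To make the ISO 8859-1 versus ISO 8859-2 point concrete, here is a small
sketch of how the same byte turns into different characters depending on
the charset you assume.  It relies on Perl's Encode module; the helper
name code_point_under is made up for this example, and the byte 0xE6 is
the value that corresponds to the ae ligature in 8859-1 and c with acute
in 8859-2.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode one byte under an assumed charset and
# return the Unicode code point of the resulting character.
sub code_point_under {
    my ($charset, $byte) = @_;
    return ord(decode($charset, $byte));
}

my $byte = "\xE6";

# Prints U+00E6 (ae ligature) for iso-8859-1 and
# U+0107 (c with acute) for iso-8859-2.
for my $charset ('iso-8859-1', 'iso-8859-2') {
    printf "%-10s -> U+%04X\n", $charset, code_point_under($charset, $byte);
}

# Once the source charset has been decided by hand, the text can be
# stored as UTF-8.  Here the byte is treated as ISO 8859-2.
my $utf8 = encode('UTF-8', decode('iso-8859-2', $byte));
printf "UTF-8 bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

Nothing in the byte itself says which reading is right, which is why the
charset has to come from a convention or from a manual fix-up.
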
--
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples