From owner-freebsd-chat  Thu Apr 11  8:12: 7 2002
Delivered-To: freebsd-chat@freebsd.org
Received: from mail.inka.de (quechua.inka.de [212.227.14.2])
	by hub.freebsd.org (Postfix) with ESMTP id D807837B417
	for <freebsd-chat@freebsd.org>; Thu, 11 Apr 2002 08:11:56 -0700 (PDT)
Received: from kemoauc.mips.inka.de (uucp@)
	by mail.inka.de with local-bsmtp 
	id 16vgEt-0001vI-00; Thu, 11 Apr 2002 17:11:55 +0200
Received: from kemoauc.mips.inka.de (localhost [127.0.0.1])
	by kemoauc.mips.inka.de (8.12.2/8.12.2) with ESMTP id g3BElgcU081606
	for <freebsd-chat@freebsd.org>; Thu, 11 Apr 2002 16:47:42 +0200 (CEST)
	(envelope-from mailnull@localhost.mips.inka.de)
Received: (from mailnull@localhost)
	by kemoauc.mips.inka.de (8.12.2/8.12.2/Submit) id g3BElgNW081603
	for freebsd-chat@freebsd.org; Thu, 11 Apr 2002 16:47:42 +0200 (CEST)
	(envelope-from mailnull)
From: naddy@mips.inka.de (Christian Weisgerber)
Subject: Re: what are these characters please?
Date: Thu, 11 Apr 2002 14:47:41 +0000 (UTC)
Message-ID: <a947md$2fli$1@kemoauc.mips.inka.de>
References: <a93ugk$155s$1@kemoauc.mips.inka.de> <20020411125429.C73703F30@bast.unixathome.org>
Originator: naddy@mips.inka.de (Christian Weisgerber)
To: freebsd-chat@freebsd.org
Sender: owner-freebsd-chat@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-chat.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-chat>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-chat>
X-Loop: FreeBSD.org

Dan Langille <dan@langille.org> wrote:

> > Well what encoding do your XML documents use?
> 
> It was UTF-8.  Some months ago it changed to ISO-8859-1 when I first 
> encountered this type of issue (back then it was Lyngbøl).

Seems like a bad choice to me, because how are you now going to
handle characters outside the meager repertoire of ISO 8859-1?

> Given that the incoming characters are supposed to be ISO-8859-1 (which is
> what CVS stores (see Tony's message),
This is wrong. CVS stores byte streams. There is no implied character
set. Nor is there a way to tag any data or CVS meta data with a
character set.

You can _by convention_ decide that all data stored in a particular
CVS repository is to be interpreted in the <mumble> character set,
but I'm not aware of such a convention being in place for FreeBSD.

> I'm quite sure the best thing to do is just ignore the non-standard
> characters (i.e. by removing them).  What's your view on that approach?

I still don't know quite what you are trying to accomplish.  Are
you looking for a purely mechanical solution?  Or are you prepared
to do manual fix-ups?  Do you strive for accuracy?  Or do you only want
to quickly crunch data and don't care if people's names are mutilated?

Since CVS doesn't store character set information, anything outside
the printable ASCII range (0x20..0x7E) is *undefined* and thus
basically an error condition.  There are two ways to deal with this:

1. You can just automatically strip the characters (or replace them
   by a placeholder like '?' or such) and get on.  This will mutilate
   some names, but since the input is already undefined, you can
   argue that you really won't do any further damage anyway.

2. You can manually try to figure out what those characters are and
   fix them up in one of several ways: replace by UTF-8, convert
   to ASCII-only, etc.
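
A minimal sketch of option 1 in Python, operating on raw bytes so that
no character set is assumed.  The function name scrub_to_ascii and the
choice to keep tab and newline are illustrative assumptions, not
something taken from CVS or any existing tool.

```python
def scrub_to_ascii(raw: bytes, placeholder: bytes = b"?") -> bytes:
    """Replace every byte outside printable ASCII (0x20..0x7E).

    Tab and newline are kept so multi-line log messages survive
    (an assumption of this sketch).  Pass placeholder=b"" to strip
    the offending bytes instead of marking them.
    """
    keep = {0x09, 0x0A}  # tab, newline
    out = bytearray()
    for b in raw:
        if 0x20 <= b <= 0x7E or b in keep:
            out.append(b)
        else:
            out += placeholder
    return bytes(out)


if __name__ == "__main__":
    print(scrub_to_ascii(b"Lyngb\xf8l"))        # b'Lyngb?l'
    print(scrub_to_ascii(b"Lyngb\xf8l", b""))   # b'Lyngbl'
```

Because the sketch works byte by byte, a multi-byte sequence such as
a UTF-8 encoded character yields one placeholder per byte.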

If you go with (1), I strongly suggest that you kill everything
outside ASCII and do not consider the input to be ISO 8859-1.
Grepping over the FreeBSD commit logs, I see names that, although
technically valid ISO 8859-1 sequences, were clearly input in ISO
8859-2 or KOI8-R environments.
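
To illustrate why the bytes alone cannot tell you the character set,
here is a small Python sketch that decodes the same byte under the
three encodings mentioned above.  The helper name decode_candidates
and the sample byte 0xF8 are illustrative assumptions; the codec names
are those of Python's standard library.

```python
import unicodedata

CANDIDATES = ("iso-8859-1", "iso-8859-2", "koi8-r")


def decode_candidates(raw: bytes) -> dict:
    """Decode the same bytes under each candidate single-byte encoding.

    All three codecs accept any byte value, so none of them can
    reject the input; they merely disagree about what it means.
    """
    return {codec: raw.decode(codec) for codec in CANDIDATES}


if __name__ == "__main__":
    for codec, text in decode_candidates(b"\xf8").items():
        # Print the Unicode name rather than the glyph, so the output
        # does not depend on the terminal's own character set.
        print(codec, unicodedata.name(text))
```

Every decoding succeeds, and each yields a different character, so a
"valid ISO 8859-1 sequence" says nothing about what the committer
actually typed.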

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de

