From owner-freebsd-questions@FreeBSD.ORG  Sun Apr 22 00:09:57 2012
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BFA63106564A
	for <freebsd-questions@freebsd.org>;
	Sun, 22 Apr 2012 00:09:57 +0000 (UTC)
	(envelope-from bonomi@mail.r-bonomi.com)
Received: from mail.r-bonomi.com (mx-out.r-bonomi.com [204.87.227.120])
	by mx1.freebsd.org (Postfix) with ESMTP id 6DA048FC08
	for <freebsd-questions@freebsd.org>;
	Sun, 22 Apr 2012 00:09:57 +0000 (UTC)
Received: (from bonomi@localhost)
	by mail.r-bonomi.com (8.14.4/rdb1) id q3M0ANH6081375
	for freebsd-questions@freebsd.org; Sat, 21 Apr 2012 19:10:23 -0500 (CDT)
Date: Sat, 21 Apr 2012 19:10:23 -0500 (CDT)
From: Robert Bonomi <bonomi@mail.r-bonomi.com>
Message-Id: <201204220010.q3M0ANH6081375@mail.r-bonomi.com>
To: freebsd-questions@freebsd.org
In-Reply-To: <20120421220703.86683bc9.freebsd@edvax.de>
Subject: Re: converting UTF-8 to HTML
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 22 Apr 2012 00:09:57 -0000


Polytropon <freebsd@edvax.de> wrote:
> On Sat, 21 Apr 2012 09:10:03 -0500 (CDT), Lars Eighner wrote:
> > On Sat, 21 Apr 2012, Erik Nurgaard wrote:
> > 
> > > When characters show up wrong in the user's browser it's usually
> > > because the browser is set to use a non-UTF-8 charset by default
> > > such as windows-1252, the web server sends the charset=ascii in 
> > > the http header and there is no or incorrect meta tag to resolve 
> > > the problem. Non-UTF-8 charsets are a leftover from the last millennium
> > > that we sometimes still choke on .. sorry for the rant ;)
> > 
> > UTF-8 is a waste of storage for most people and is incompatible with
> > text-mode tools: it's simply another bid to make it impossible to run
> > without a GUI.
>
> Regarding the fun of encodings, endianness, representation,
> use ("fi" the two letters vs. "fi" the ligature, or "a"
> the 1-byte sequence vs. "a" the two-byte sequence), see
> the following document:
>
> Matt Mayer: Love Hotels and Unicode
> http://www.reigndesign.com/blog/love-hotels-and-unicode/
>
> And finally it offers an interesting attack vector, given
> the fact that several unicode characters "look" the same,
> but in fact are different. So "two files with the 'same'
> name" is a possible means that malware implementers can
> utilize to mislead the users.
>
> Short example from MICROS~1 land here:
> http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx
>
> But this all doesn't negate the usefulness of unicode / UTF-8
> in general. Especially when you have collaborative settings
> with multi-language document processing requirements, it
> is a helpful thing, as working with "normal" (ASCII) letters,
> Cyrillic ones, Chinese and Japanese symbols, Arabic writing
> is no big deal as long as all the tools do properly support
> it the _same_ way.
>

Sorry, but UTF-8 is a *botch*, to put it charitably.

Correction -- UTF-8 is a particular implementation of the botch that is
the 'variable-width encoding' of the glyphs used to represent printed
information.

"Variable-width encoding" destroys the concept of addressability -within-
a text.  And, therefore, 'random access'/'direct access' is impossible.
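
A minimal sketch of the addressing problem, in Python (my choice of language,
nothing in this thread depends on it).  The sample string and the helper name
char_at_utf8 are made up for illustration.  It shows that byte offset N is
not character N, and that finding character N means scanning from the start:

```python
# Hypothetical sample text: "naive cafe" with two accented letters.
text = "na\u00efve caf\u00e9"          # 10 characters
utf8 = text.encode("utf-8")            # 12 bytes: two letters take 2 bytes each


def char_at_utf8(data: bytes, index: int) -> str:
    """Return character number `index` by scanning the bytes from offset 0."""
    count = -1
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a character; all others start one.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(index)


# Byte 9 is the letter 'f', but character 9 is the accented 'e'.
byte_nine = utf8[9:10]
char_nine = char_at_utf8(utf8, 9)
```

The cost of char_at_utf8 grows with the index, which is the sense in which
direct addressing is lost.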

Ditto for concepts like 'read backwards'. 

Not to mention the inevitable, and UNAVOIDABLE problems that occur when
the 'encoding' used for a particular set of data is not represented *IN*
the dataset (or in inextricably-coupled 'metadata').  When one has to
'guess' what the encoding for a particular file is.  

'Assume' -- with all that -that- word implies -- a particular encoding,
when the data is actually encoded with something 'different', and you
can encounter 'illegal' (in the 'assumed' encoding) byte sequences, 
from which there is *NO* means of recovery -- since the 'interpreter'
can't tell how long the 'illegal' code is, it can't tell where the 'next'
symbol should start, and it just _stops_cold_ ... an apparent 'end of
file'. 

I have had _that_ particular unfortunate experience, with an 'encoding-aware'
text editor (on a Debian Linux system, if it matters), which, on exit,
_SILENTLY_ *truncated* the original file at the point of the 'illegal' symbol.

The -correct- solution -- if you are in an environment where you need more
glyphs than can be represented by a single byte -- is to use *fixed-width*
multi-byte symbols for _everything_.  This is "relatively easy" to implement
within a single 'system' (be it a single machine, or 'corporate wide'), but
makes for major difficulties when 'external' communication is involved.
There is, unfortunately, simply -no- simple solution for that problem. :((
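
For illustration only, a sketch of the fixed-width approach, using UTF-32
(little-endian, no byte-order mark) as one possible 4-byte representation.
The choice of UTF-32 and the helper names are my assumptions, not a
recommendation of a particular format:

```python
WIDTH = 4                                   # bytes per symbol in UTF-32

# Hypothetical sample text: "naive cafe" with two accented letters.
text = "na\u00efve caf\u00e9"               # 10 characters
fixed = text.encode("utf-32-le")            # 40 bytes, no byte-order mark


def char_at_fixed(data: bytes, index: int) -> str:
    """Return symbol number `index` with one multiplication and no scanning."""
    start = index * WIDTH
    if index < 0 or start + WIDTH > len(data):
        raise IndexError(index)
    return data[start:start + WIDTH].decode("utf-32-le")


def read_backwards(data: bytes) -> str:
    """Walk the symbols from last to first using plain offset arithmetic."""
    last = len(data) // WIDTH - 1
    return "".join(char_at_fixed(data, i) for i in range(last, -1, -1))


ninth = char_at_fixed(fixed, 9)             # the accented 'e'
reversed_text = read_backwards(fixed)
```

The price is storage: this sample takes 40 bytes here against 12 in UTF-8.
Note too that the fixed unit is a code point, so an accented letter written
as base letter plus combining mark would still occupy two units.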