From owner-freebsd-questions@FreeBSD.ORG Sun Apr 22 00:09:57 2012
Date: Sat, 21 Apr 2012 19:10:23 -0500 (CDT)
From: Robert Bonomi
Message-Id: <201204220010.q3M0ANH6081375@mail.r-bonomi.com>
To: freebsd-questions@freebsd.org
In-Reply-To: <20120421220703.86683bc9.freebsd@edvax.de>
Subject: Re: converting UTF-8 to HTML

Polytropon wrote:
> On Sat, 21 Apr 2012 09:10:03 -0500 (CDT), Lars Eighner wrote:
> > On Sat, 21 Apr 2012, Erik Nurgaard wrote:
> >
> > > When characters show up wrong in the user's browser it's usually
> > > because the browser is set to use a non-UTF-8 charset by default
> > > such as windows-1252, the web server sends the charset=ascii in
> > > the http header and there is no or incorrect meta tag to resolve
> > > the problem. Non UTF-8 charsets are a leftover from last millennium
> > > that we sometimes still choke on .. sorry the rant ;)
> >
> > UTF-8 is a waste of storage for most people and is incompatible with
> > text-mode tools: it's simply another bid to make it impossible to run
> > without a GUI.
>
> Regarding the fun of encodings, endianness, representation,
> use ("fi" the two letters vs.
> "fi" the ligature, or "a"
> the 1-byte sequence vs. "a" the two-byte sequence), see
> the following document:
>
> Matt Mayer: Love Hotels and Unicode
> http://www.reigndesign.com/blog/love-hotels-and-unicode/
>
> And finally it offers an interesting attack vector, given
> the fact that several Unicode characters "look" the same,
> but in fact are different. So "two files with the 'same'
> name" is a possible means that malware implementers can
> utilize to mislead the users.
>
> Short example from MICROS~1 land here:
> http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx
>
> But this all doesn't negate the usefulness of Unicode / UTF-8
> in general. Especially when you have collaborative settings
> with multi-language document processing requirements, it
> is a helpful thing, as working with "normal" (ASCII) letters,
> Cyrillic ones, Chinese and Japanese symbols, Arabic writing
> is no big deal as long as all the tools do properly support
> it the _same_ way.
>

Sorry, but UTF-8 is a *botch*, to put it charitably.

Correction -- UTF-8 is a particular implementation of the botch that is
'variable-width encoding' representation of the glyphs used to represent
printed information.

"Variable-width encoding" destroys the concept of addressability -within-
a text. And, therefore, 'random access'/'direct access' is impossible.
Ditto for concepts like 'read backwards'.

Not to mention the inevitable, and UNAVOIDABLE problems that occur when
the 'encoding' used for a particular set of data is not represented *IN*
the dataset (or in inextricably-coupled 'metadata'). When one has to
'guess' what the encoding for a particular file is.
'Assume' -- with all that -that- word implies -- a particular encoding,
when the data is actually encoded with something 'different', and you can
encounter 'illegal' (in the 'assumed' encoding) byte sequences, from which
there is *NO* means of recovery -- since the 'interpreter' can't tell how
long the 'illegal' code is, it can't tell where the 'next' symbol should
start, and it just _stops_cold_ ... an apparent 'end of file'.

I have had _that_ particular unfortunate experience, with an
'encoding-aware' text editor (on a Debian Linux system, if it matters),
which, on exit, _SILENTLY_ *truncated* the original file at the point of
the 'illegal' symbol.

The -correct- solution -- if you are in an environment where you need more
glyphs than can be represented by a single byte -- is to use *fixed-width*
multi-byte symbols for _everything_.

This is "relatively easy" to implement within a single 'system' (be it a
single machine, or 'corporate wide'), but makes for major difficulties
when 'external' communication is involved. There is, unfortunately,
simply -no- simple solution for that problem. :((
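
A minimal sketch of the addressing trade-off described above, assuming
Python 3 and taking UTF-32 as one example of a fixed-width encoding (the
post does not name a specific one). All function names here are
hypothetical, chosen only for illustration. The first function reaches
symbol n by offset arithmetic alone; the second has to scan the UTF-8
bytes from the start to find it; the third reports the byte offset at
which a strict decode under an 'assumed' encoding gives up.

```python
def nth_char_fixed_width(data: bytes, n: int) -> str:
    """Return symbol n from UTF-32-BE data using offset arithmetic only."""
    start = 4 * n  # every symbol occupies exactly 4 bytes
    return data[start:start + 4].decode("utf-32-be")


def nth_char_utf8(data: bytes, n: int) -> str:
    """Return symbol n from UTF-8 data by scanning from the first byte."""
    seen = -1     # index of the most recent symbol start encountered
    start = None  # byte offset where symbol n begins, once found
    for i, byte in enumerate(data):
        # In UTF-8, bytes of the form 10xxxxxx continue a symbol;
        # any other byte value begins a new one.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == n:
                start = i
            elif seen == n + 1:
                return data[start:i].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")  # symbol n was the last one


def first_undecodable_offset(data: bytes, encoding: str):
    """Return the byte offset where strict decoding fails, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Sample text "na\u00efve" plus three CJK symbols, written with escapes
# so that this listing itself stays pure ASCII.
sample = "na\u00efve \u65e5\u672c\u8a9e"
print(nth_char_fixed_width(sample.encode("utf-32-be"), 6) == "\u65e5")
print(nth_char_utf8(sample.encode("utf-8"), 6) == "\u65e5")

# Latin-1 bytes read under an assumed UTF-8 encoding.
print(first_undecodable_offset("caf\u00e9 au lait".encode("latin-1"), "utf-8"))
```

The fixed-width version costs four bytes per symbol even for plain ASCII
text, which is the storage objection raised earlier in the thread; the
UTF-8 version trades that space for a linear scan.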