From owner-freebsd-questions@FreeBSD.ORG  Fri Nov 11 01:12:48 2011
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0CC13106566C
	for <freebsd-questions@freebsd.org>;
	Fri, 11 Nov 2011 01:12:48 +0000 (UTC) (envelope-from conrads@cox.net)
Received: from eastrmfepo103.cox.net (eastrmfepo103.cox.net [68.230.241.215])
	by mx1.freebsd.org (Postfix) with ESMTP id A392F8FC16
	for <freebsd-questions@freebsd.org>;
	Fri, 11 Nov 2011 01:12:47 +0000 (UTC)
Received: from eastrmimpo305.cox.net ([68.230.241.237])
	by eastrmfepo103.cox.net
	(InterMail vM.8.01.04.00 201-2260-137-20101110) with ESMTP id
	<20111111011242.EJJC28068.eastrmfepo103.cox.net@eastrmimpo305.cox.net>;
	Thu, 10 Nov 2011 20:12:42 -0500
Received: from serene.no-ip.org ([98.164.86.236])
	by eastrmimpo305.cox.net with bizsmtp
	id vdCg1h00M55wwzE02dChAs; Thu, 10 Nov 2011 20:12:41 -0500
X-CT-Class: Bulk
X-CT-Score: 5.00
X-CT-RefID: str=0001.0A02020B.4EBC7689.00AF,ss=3,re=0.000,fgs=0
X-CT-Spam: 0
X-Authority-Analysis: v=1.1 cv=BX0YEIBOusRIeQdDschwVvWAB1OmeRFmMWKQyT+Am3A=
	c=1 sm=1 a=G8Uczd0VNMoA:10 a=kj9zAlcOel0A:10
	a=uAbGmPAyUfLL1M3oYAsfuA==:17
	a=lM4-zUH5AAAA:8 a=kviXuzpPAAAA:8 a=0x-Y4APh3q6g4Mh_81QA:9
	a=u_3vePIcgR3jBwjXXEsA:7 a=CjuIK1q_8ugA:10 a=4vB-4DCPJfMA:10
	a=eR8K6Hi1c0XdF3Zz:21 a=d93vutSRP4LTgj0v:21
	a=uAbGmPAyUfLL1M3oYAsfuA==:117
X-CM-Score: 0.00
Authentication-Results: cox.net; none
Received: from cox.net (localhost [127.0.0.1])
	by serene.no-ip.org (8.14.5/8.14.5) with ESMTP id pAB1CdeX013965;
	Thu, 10 Nov 2011 19:12:40 -0600 (CST) (envelope-from conrads@cox.net)
Date: Thu, 10 Nov 2011 19:12:34 -0600
From: "Conrad J. Sabatier" <conrads@cox.net>
To: Robert Bonomi <bonomi@mail.r-bonomi.com>
Message-ID: <20111110191234.53611af7@cox.net>
In-Reply-To: <201111090504.pA954Pod066887@mail.r-bonomi.com>
References: <20111108205948.54daef43@cox.net>
	<201111090504.pA954Pod066887@mail.r-bonomi.com>
X-Mailer: Claws Mail 3.7.10 (GTK+ 2.24.6; amd64-portbld-freebsd9.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 11 Nov 2011 01:12:48 -0000

On Tue, 8 Nov 2011 23:04:25 -0600 (CST)
Robert Bonomi <bonomi@mail.r-bonomi.com> wrote:

> 
> "Conrad J. Sabatier" <conrads@cox.net> wrote:
> >
> > <grin>
> >
> > Yes, and this is one area where the labels are more than a little
> > misleading as well.  My natural inclination is to think of UTF-8 as
> > being a single-byte representation for each character in the set,
> > whereas UTF-16, as the name implies, would be the "wide", 2-byte
> > version.
> 
> "Not exactly."
> 
> > Nonetheless, as I posted earlier in this thread, according to the
> > info in gucharmap, the representations of the umlauted "u" are just
> > the opposite of this:
> 
> "not exactly." Again.
> 
> > UTF-8: 0xC3 0xBC
> > UTF-16: 0x00FC
> >  
> > Go figure, huh?  :-)
> 
> In UTF-16, everything _is_ a 16-bit entity.  Notice that 0x00FC has
> -four- nybbles after the '0x.'  Every character boundary is on a
> multiple of 16 bits.

Ah yes!  I hadn't noticed that.
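
To make the gucharmap figures concrete, here is a minimal Python sketch
(an illustration only, not anything gucharmap itself runs) that encodes
U+00FC both ways.  The big-endian "utf-16-be" codec is an assumption,
chosen so that no byte-order mark is emitted and the result lines up
with the 0x00FC display.

```python
# U+00FC is LATIN SMALL LETTER U WITH DIAERESIS (the umlauted "u").
ch = "\u00fc"

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")  # big-endian, no byte-order mark

# Show each encoding as a sequence of hex bytes.
print("UTF-8: ", " ".join(f"0x{b:02X}" for b in utf8))
print("UTF-16:", " ".join(f"0x{b:02X}" for b in utf16))
```

This should print 0xC3 0xBC for UTF-8 and 0x00 0xFC for UTF-16: two
bytes either way for this particular character, but in UTF-16 those two
bytes are one fixed-width 16-bit unit.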

What's really weird is that last night, as I mentioned in a later
private email to Polytropon, the copy-and-paste in gucharmap suddenly
decided to start copying the UTF-8 code instead of the UTF-16.  I have
no idea why that changed.

> In UTF-8, the 'base' charset -- 7-bit ASCII, which includes the 'C0'
> control group -- is represented by a single byte.  'Extended'
> characters are represented by two to four bytes (two for the accented
> Latin letters discussed here).  Thus, 'characters' have a
> *variable*length* representation -- one to four bytes.  A character,
> however many bytes represent it, can begin on -any- byte boundary
> within a data stream, depending on 'what came before it'.  UTF-8
> multi-byte representations are designed such that one can jump to any
> _byte_ offset within the file, and determine -- by looking *only* at
> the value of that byte -- whether it is (a) a single-byte character,
> (b) the first byte of a multi-byte sequence, or (c) a continuation
> byte of a multi-byte sequence.
> 
> With UTF-16 you can position directly to any -character-, by jumping
> to a _byte_ offset that is twice the index of the character you want,
> as long as the text stays within the Basic Multilingual Plane
> (characters beyond it take two 16-bit units).  Given a byte offset,
> you then know the 'equivalent' _character_ offset.
> 
> With UTF-8, you have to read the character stream, counting
> 'characters' as you go, to get to the desired point.  You can seek to
> an arbitrary _byte_ offset, but you do not know how many 'characters'
> into the file that offset is.

I see.  Yes, that could certainly complicate things.
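
The byte-level rule described above can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions: the
function names classify() and utf8_char_index() and the sample word are
made up for the example, and the classifier does no validation of the
input.

```python
def classify(byte):
    """Classify one UTF-8 byte by looking only at its high bits."""
    if byte < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    return "lead"              # 11xxxxxx: first byte of a multi-byte sequence
                               # (no check for values invalid in UTF-8)


def utf8_char_index(data, byte_offset):
    """Count the characters that start before byte_offset in UTF-8 data."""
    return sum(1 for b in data[:byte_offset] if classify(b) != "continuation")


# Hypothetical sample: 5 characters, two of them outside ASCII.
text = "gr\u00fc\u00dfe"
utf8 = text.encode("utf-8")        # 7 bytes
utf16 = text.encode("utf-16-be")   # 10 bytes, 2 per character here

print([classify(b) for b in utf8])

# UTF-8: the final "e" sits at byte 6; finding its character index
# means scanning everything before it.
print(utf8_char_index(utf8, 6))

# UTF-16 (BMP-only text): character 4 is simply at byte offset 2 * 4.
print(utf16[2 * 4:2 * 4 + 2].decode("utf-16-be"))
```

The scan in utf8_char_index() is the cost being described: it is linear
in the offset, whereas the UTF-16 lookup is a single multiplication, at
least for text with no characters beyond the Basic Multilingual Plane.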

> UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and 
> simplicity of addressing/representation (UTF-16).
> 
> > This seems rather unfortunate to me.  You would think that, by now,
> > some "standard" character set might have emerged that would allow
> > one to use, at the very least, the "Western" characters (as opposed
> > to the "Eastern" or "Oriental" or "Asian", if you will) with a
> > reasonable expectation that others will see what was intended.
> 
> Heh. 
> 
> How many 'character' codes are you willing to devote to national
> 'currency symbols', just for starters?  Probable minimum of two per
> currency -- one for the minimum coinage unit (cent, pence, pfennig,
> etc.) and one for the denomination unit (dollar, pound, mark, kroner,
> etc.)
> 
> Now, one (obviously) has to have the basic 'Roman' alphabet. 
> 
> Then there are all the diacritical markings (accent, accent grave,
> dot, umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels.  And
> cedilla, tilde, etc., for select consonants.  Plus language-specific
> symbols like ess-zett, 'thorn', etc.
> 
> How about phonetic symbols, like 'schwa' ?
> 
> And Greek for all sorts of scientific use?
> 
> What about Cyrillic characters, for many Eastern European languages?
> 
> Now, consider punctuation marks:
>    the 'typewriter' basics,
>    How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen'
>      are needed?
>    How many of 'accent, accent grave, apostrophe, opening/closing
>      single-quote' are needed?
>    opening/closing double-quotes, and/or a 'position neutral'
>      double-quote?
> 
> "Other symbols", like --
>    digits,
>    common fractions,
>    'Trademark', 'Registered trademark', 'copyright',
>    'paragraph', 'section',
>    superscripts  -- exponents, footnotes, etc.
>    subscripts -- chemical formulae, etc.
>    "Simple line-drawing graphics"
> 
> Diphthongs??  Ligatures??
> 
> Start counting things up. 
> 
> An 8-bit 'address space' gets used up _really_ quick.
> 
> <wry grin>

I certainly get the point.  :-)  Thanks for that very thorough
elucidation.  :-)
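
For a rough sense of scale, the counting exercise can be tried with
Python's standard unicodedata module.  This is only a sketch: the
mapping from the groups listed above to Unicode general categories is
an approximation of my own choosing, the letter categories cover all
scripts rather than just the "Western" ones, and the exact totals
depend on the Unicode version shipped with the interpreter.

```python
import unicodedata
from collections import Counter

# Assumed, approximate mapping of the groups above to general categories.
GROUPS = {
    "Sc": "currency symbols",
    "Lu": "uppercase letters (all scripts)",
    "Ll": "lowercase letters (all scripts)",
    "Pd": "dashes and hyphens",
    "Pi": "opening quotes",
    "Pf": "closing quotes",
    "No": "fractions, super/subscript digits, etc.",
}

# Tally every code point in the Basic Multilingual Plane by category.
counts = Counter(unicodedata.category(chr(cp)) for cp in range(0x10000))

for code, label in GROUPS.items():
    print(f"{label:42s} {counts[code]:6d}")

total = sum(counts[code] for code in GROUPS)
print(f"{'total':42s} {total:6d}   (an 8-bit set has 256 slots)")
```

Whatever the exact figures on a given system, the total for even these
few groups comes out far above 256.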

Now I just have to figure out what the heck's going on here, why
suddenly I'm seeing the exact opposite of what I was seeing yesterday.
Thought I had everything straightened out for a while there.  :-(

Oh, this is madness!  :-)

-- 
Conrad J. Sabatier
conrads@cox.net