Date: Thu, 10 Nov 2011 19:12:34 -0600
From: "Conrad J. Sabatier"
To: Robert Bonomi
Cc: freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
Message-ID: <20111110191234.53611af7@cox.net>
In-Reply-To: <201111090504.pA954Pod066887@mail.r-bonomi.com>

On Tue, 8 Nov 2011 23:04:25 -0600 (CST) Robert Bonomi wrote:

> "Conrad J. Sabatier" wrote:
>
> > Yes, and this is one area where the labels are more than a little
> > misleading as well. My natural inclination is to think of UTF-8 as
> > being a single-byte representation for each character in the set,
> > whereas UTF-16, as the name implies, would be the "wide", 2-byte
> > version.
>
> "Not exactly."
>
> > Nonetheless, as I posted earlier in this thread, according to the
> > info in gucharmap, the representations of the umlauted "u" are just
> > the opposite of this:
>
> "not exactly." Again.
>
> > UTF-8: 0xC3 0xBC
> > UTF-16: 0x00FC
> >
> > Go figure, huh? :-)
>
> In UTF-16, everything _is_ a 16-bit entity. Notice that 0x00FC has
> -four- nybbles after the '0x.' Every character boundary is on a
> multiple of 16 bits.

Ah yes! I hadn't noticed that. What's really weird is that, as I
mentioned in a later private email to Polytropon, last night the
copy-and-paste in gucharmap suddenly started copying the UTF-8 code
instead of the UTF-16. I have no idea why that changed.

> In UTF-8, the 'base' charset -- the 'C0' and 'C1' groups are
> represented by a single byte. 'extended' characters are represented
> by two bytes.
> Thus, 'characters' have a *variable*length* representation -- one
> or two bytes. A character, whether it is
> represented by one or two bytes, can begin on -any- byte boundary
> within a data stream, depending on 'what came before it'. UTF-8
> 2-byte representations are designed such that one can jump to any
> _byte_ offset within the file, and determine -- by looking *only* at
> the value of that byte -- whether it is (a) a single-byte character,
> (b) the first byte of a two-byte sequence, or (c) the second byte of
> a two-byte sequence.
>
> With UTF-16 you can position directly to any -character-, by jumping
> to a _byte_ offset that is twice the index of the character you want.
> Given a byte offset, you always know the 'equivalent' _character_
> offset.
>
> With UTF-8, you have to read the character stream, counting
> 'characters' as you go, to get to the desired point. You can seek to
> an arbitrary _byte_ offset, but you do not know how many 'characters'
> into the file that offset is.

I see. Yes, that could certainly complicate things.

> UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and
> simplicity of addressing/representation (UTF-16).
>
> > This seems rather unfortunate to me. You would think that, by now,
> > some "standard" character set might have emerged that would allow
> > one to use, at the very least, the "Western" characters (as opposed
> > to the "Eastern" or "Oriental" or "Asian", if you will) with a
> > reasonable expectation that others will see what was intended.
>
> Heh.
>
> How many 'character' codes are you willing to devote to national
> 'currency symbols', just for starters? Probable minimum of two per
> currency -- one for the minimum coinage unit (cent, pence, pfennig,
> etc.) and one for the denomination unit (dollar, pound, mark, kroner,
> etc.)
>
> Now, one (obviously) has to have the basic 'Roman' alphabet.
>
> Then there are all the diacritical markings (accent, accent grave, dot,
> umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels. And
> cedilla, tilde, etc., for select consonants. Plus language specific
> symbols like ess-zett, 'thorn', etc.
>
> How about phonetic symbols, like 'schwa'?
>
> And Greek for all sorts of scientific use?
>
> What about Cyrillic characters, for many Eastern European languages?
>
> Now, consider punctuation marks:
>   the 'typewriter' basics,
>   How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen'
>     are needed? How many of 'accent, accent grave, apostrophe,
>     opening/closing single-quote' are needed?
>   opening/closing double-quotes, and/or a 'position neutral'
>     double-quote?
>
> "Other symbols", like --
>   digits,
>   common fractions,
>   'Trademark','Registered trademark','copyright'
>   'paragraph','section',
>   superscripts -- exponents, footnotes, etc.
>   subscripts -- chemical formulae, etc.
>   "Simple line-drawing graphics"
>
> Diphthongs?? Ligatures??
>
> Start counting things up.
>
> An 8-bit 'address space' gets used up _really_ quick.

I certainly get the point. :-) Thanks for that very thorough
elucidation. :-)

Now I just have to figure out what the heck's going on here, why
suddenly I'm seeing the exact opposite of what I was seeing yesterday.
Thought I had everything straightened out for a while there. :-(

Oh, this is madness! :-)

-- 
Conrad J. Sabatier
conrads@cox.net
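
For anyone who wants to check the byte values discussed in this thread,
here is a minimal Python sketch. It encodes the umlauted "u" (U+00FC)
both ways, classifies a UTF-8 byte by its high bits alone, and counts
characters up to a byte offset by scanning. The function names
classify_utf8_byte and char_index_at are made up for illustration, and
the sketch assumes well-formed input. One caveat on the quoted
description: it covers only the one- and two-byte cases. In general a
UTF-8 sequence can run to four bytes, and UTF-16 uses two 16-bit units
(a surrogate pair) for characters above U+FFFF, so the "byte offset is
twice the character index" rule holds only for text confined to the
Basic Multilingual Plane.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of well-formed UTF-8 by its high bits alone."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: a trailing byte of a sequence
    return "lead"              # 11xxxxxx: first byte of a multi-byte sequence


def char_index_at(data: bytes, byte_offset: int) -> int:
    """Count the characters that start before byte_offset (linear scan)."""
    return sum(
        1 for b in data[:byte_offset]
        if classify_utf8_byte(b) != "continuation"
    )


u_umlaut = "\u00fc"                    # the umlauted "u"
utf8 = u_umlaut.encode("utf-8")        # two bytes: C3 BC
utf16 = u_umlaut.encode("utf-16-be")   # one 16-bit unit: 00FC

print(utf8.hex(" "), "|", utf16.hex())          # c3 bc | 00fc
print([classify_utf8_byte(b) for b in utf8])    # ['lead', 'continuation']

# "a", the umlauted "u", "b": byte offset 3 is where "b" starts,
# which is character index 2.
text = "a\u00fcb".encode("utf-8")
print(char_index_at(text, 3))                   # 2
```

The scan in char_index_at is the cost described above: with UTF-8 you
can tell what kind of byte you landed on, but not how many characters
precede it without reading them.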