Date: Tue, 8 Nov 2011 23:04:25 -0600 (CST)
From: Robert Bonomi <bonomi@mail.r-bonomi.com>
To: freebsd-questions@freebsd.org
Subject: Re: "Unprintable" 8-bit characters
Message-ID: <201111090504.pA954Pod066887@mail.r-bonomi.com>
In-Reply-To: <20111108205948.54daef43@cox.net>

"Conrad J. Sabatier" <conrads@cox.net> wrote:
>
> <grin>
>
> Yes, and this is one area where the labels are more than a little
> misleading as well.  My natural inclination is to think of UTF-8 as
> being a single-byte representation for each character in the set,
> whereas UTF-16, as the name implies, would be the "wide", 2-byte
> version.

"Not exactly."

> Nonetheless, as I posted earlier in this thread, according to the info
> in gucharmap, the representations of the umlauted "u" are just the
> opposite of this:

"Not exactly."  Again.

> UTF-8: 0xC3 0xBC
> UTF-16: 0x00FC
>
> Go figure, huh?  :-)

In UTF-16, everything is built from 16-bit units.  Notice that 0x00FC
has -four- nybbles after the '0x'.  Every character in the Basic
Multilingual Plane (which covers essentially all 'Western' text) is
exactly one such unit; characters above U+FFFF take a pair of units
(a 'surrogate pair').  Either way, every character boundary is on a
multiple of 16 bits.

In UTF-8, the 'base' charset -- 7-bit ASCII, including the 'C0' control
group -- is represented by a single byte.  'Extended' characters are
represented by two, three, or four bytes; the umlauted "u" takes two.
Thus, 'characters' have a *variable*length* representation -- one to
four bytes.  A character, however many bytes it takes, can begin on
-any- byte boundary within a data stream, depending on 'what came
before it'.

UTF-8 multi-byte representations are designed such that one can jump to
any _byte_ offset within the file, and determine -- by looking *only*
at the value of that byte -- whether it is (a) a single-byte character,
(b) the first byte of a multi-byte sequence, or (c) a continuation
(second or later) byte of a multi-byte sequence.

With UTF-16 (as long as there are no surrogate pairs in the text) you
can position directly to any -character-, by jumping to a _byte_ offset
that is twice the index of the character you want.  Given a byte
offset, you always know the 'equivalent' _character_ offset.

With UTF-8, you have to read the character stream, counting
'characters' as you go, to get to the desired point.  You can seek to
an arbitrary _byte_ offset, but you do not know how many 'characters'
into the file that offset is.

UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and
simplicity of addressing/representation (UTF-16).

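
For anyone who wants to poke at this, here is a minimal Python sketch of
the points above: the two encodings of the umlauted "u", classifying a
UTF-8 byte by its value alone, and the difference between 'seek by
arithmetic' (UTF-16) and 'count as you go' (UTF-8).  The function names
and the sample string "für" are made up for illustration.  The sketch
assumes well-formed UTF-8 (sequences of one to four bytes) and does no
validation, and the 'offset = 2 * index' rule for UTF-16 assumes the
text has no characters above U+FFFF.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte of a UTF-8 stream by looking only at its value.

    Illustrative helper; it does not check that the byte is legal UTF-8.
    """
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    return "lead"              # 11xxxxxx: first byte of a multi-byte sequence


def char_index_of_byte_offset(data: bytes, offset: int) -> int:
    """Count the characters that start before `offset`.

    There is no shortcut: every byte before `offset` has to be examined.
    """
    return sum(1 for b in data[:offset]
               if classify_utf8_byte(b) != "continuation")


text = "für"                       # sample string, chosen for the umlaut
utf8 = text.encode("utf-8")        # 4 bytes: 66 C3 BC 72
utf16 = text.encode("utf-16-be")   # 6 bytes: 00 66 00 FC 00 72

# UTF-8: each byte announces its own role.
roles = [classify_utf8_byte(b) for b in utf8]
print(roles)

# UTF-8: byte offset 3 is the 'r'; finding its character index needs a scan.
print(char_index_of_byte_offset(utf8, 3))

# UTF-16: character index 1 (the umlauted u) sits at byte offset 2 * 1.
index = 1
print(utf16[2 * index:2 * index + 2])
```

The big-endian codec ("utf-16-be") is used here so that no byte-order
mark is prepended and the byte offsets line up with the arithmetic.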
> This seems rather unfortunate to me.  You would think that, by now,
> some "standard" character set might have emerged that would allow one
> to use, at the very least, the "Western" characters (as opposed to
> the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
> expectation that others will see what was intended.

Heh.  How many 'character' codes are you willing to devote to national
'currency symbols', just for starters?  Probable minimum of two per
currency -- one for the minimum coinage unit (cent, pence, pfennig,
etc.) and one for the denomination unit (dollar, pound, mark, kroner,
etc.)

Now, one (obviously) has to have the basic 'Roman' alphabet.
Then there are all the diacritical markings (accent, accent grave, dot,
umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels.
And cedilla, tilde, etc., for select consonants.
Plus language-specific symbols like ess-zett, 'thorn', etc.
How about phonetic symbols, like 'schwa'?
And Greek for all sorts of scientific use?
What about Cyrillic characters, for many Eastern European languages?

Now, consider punctuation marks: the 'typewriter' basics.
How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen' are
needed?
How many of 'accent, accent grave, apostrophe, opening/closing
single-quote' are needed?
Opening/closing double-quotes, and/or a 'position neutral' double-quote?

"Other symbols", like --
   digits, common fractions,
   'Trademark', 'Registered trademark', 'copyright',
   'paragraph', 'section',
   superscripts -- exponents, footnotes, etc.
   subscripts -- chemical formulae, etc.
"Simple line-drawing graphics"
Diphthongs??  Ligatures??

Start counting things up.  An 8-bit 'address space' gets used up
_really_ quick.

<wry grin>