FreeBSD Mail Archives

Date:      Sun, 22 Apr 2012 11:45:45 +0100
From:      Matthew Seaman <m.seaman@infracaninophile.co.uk>
To:        freebsd-questions@freebsd.org
Subject:   Re: converting UTF-8 to HTML
Message-ID:  <4F93E159.7020807@infracaninophile.co.uk>
In-Reply-To: <4F93CC95.5050209@locolomo.org>
References:  <20120421055823.GA6788@tinyCurrent> <4F9253D7.7010609@locolomo.org> <4F9278A2.1020301@locolomo.org> <alpine.BSF.2.00.1204210909450.5338@abbf.6qbyyneqvnyhc.pbz> <4F93CC95.5050209@locolomo.org>

On 22/04/2012 10:17, Erik Nørgaard wrote:
> UTF-8 is variable with, ascii characters are stored as single bytes (not
> sure about iso-8859-1) while other characters are stored as two byte chars.

ascii uses the low 128 values that you can assign to an unsigned char,
ie. those where the high-order bit is zero.

iso-8859-1 and the various other iso-8859-X character sets fill in the
remaining 128 characters with various other glyphs useful in latin
alphabets, so it's still one char per glyph.  Other alphabets (greek,
cyrillic, etc) have similar one-byte-per-glyph encodings. But you have
to know what the encoding is to display the content correctly, and it is
difficult to mix chunks of text in different encodings in the same document.
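
To illustrate why you need to know the encoding, here is a minimal Python
sketch. The byte value 0xE6 and the three ISO-8859 variants are simply my
choice for the example, not anything from the thread; the point is only that
one and the same byte decodes to a different character under each of them.

```python
# One byte with the high-order bit set, decoded under three
# single-byte encodings. The byte value 0xE6 is an arbitrary example.
raw = bytes([0xE6])

as_latin = raw.decode("iso-8859-1")     # Western European
as_greek = raw.decode("iso-8859-7")     # Greek
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic

for name, text in [("iso-8859-1", as_latin),
                   ("iso-8859-7", as_greek),
                   ("iso-8859-5", as_cyrillic)]:
    # Each result is exactly one character, but a different one each time.
    print(name, repr(text), "U+%04X" % ord(text))
```

Nothing in the byte itself says which of the three was intended, which is why
mixing such encodings in one document is so awkward.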

UTF has various different forms, based on different word sizes, but the
commonly used UTF-8 works in units of 1-byte chars.  However, glyphs may
be represented by sequences of from 1 to 4 bytes.  The 1-byte glyphs are
identical to ascii.  Any byte with the high-order bit set belongs to a
multibyte glyph -- the first byte has its top two bits set to 11 and its
bit pattern indicates the number of bytes in the sequence, while all the
other bytes of that glyph have their top two bits set to 10.  All
million-plus code points available through Unicode can be expressed this
way, so the encoding is universal and suitable for all languages and
alphabets or non-alphabetic languages.
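
As a rough sketch of how the first byte announces the sequence length, here
is some Python. The function name utf8_sequence_length is made up for this
example, and the byte ranges are the ones I understand RFC 3629 to allow
(0xC0, 0xC1 and 0xF5 upwards never appear in valid UTF-8).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Length announced by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if first_byte < 0x80:             # 0xxxxxxx: plain ascii
        return 1
    if 0xC2 <= first_byte <= 0xDF:    # 110xxxxx: two bytes
        return 2
    if 0xE0 <= first_byte <= 0xEF:    # 1110xxxx: three bytes
        return 3
    if 0xF0 <= first_byte <= 0xF4:    # 11110xxx: four bytes
        return 4
    return 0                          # continuation byte (10xxxxxx) or unused value


# A, e-acute, the euro sign and an emoji: 1, 2, 3 and 4 bytes respectively.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print("U+%04X" % ord(ch), encoded.hex(), utf8_sequence_length(encoded[0]))
```

Every byte after the first in a sequence has the form 10xxxxxx, so it can
never be mistaken for ascii or for the start of another glyph.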

Not all possible byte sequences are valid UTF-8 text, but the design of
the encoding means that an interpreter can skip over an invalid sequence
of bytes and find the beginning of the next valid sequence easily.
Whoever it was upthread who had the misfortune to run into a text editor
that just gave up and truncated their document at an invalid sequence
needs to vent their ire on the lazy and stupid programmers of whatever
app it was, rather than on the concept of UTF-8 itself.
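
For what it's worth, resynchronising is only a few lines. The sketch below is
in Python; next_sequence_start is a hypothetical helper written for this
example, and the damaged input is constructed by hand by deleting one lead
byte. Python's own decoder already does the sensible thing when asked with
errors="replace".

```python
def next_sequence_start(data: bytes, pos: int) -> int:
    """Index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


good = "na\u00efve caf\u00e9".encode("utf-8")
# Drop the lead byte of the i-diaeresis, leaving a stray continuation byte.
damaged = good[:2] + good[3:]

# A tolerant decoder substitutes U+FFFD for the bad byte and carries on,
# rather than truncating the rest of the text.
recovered = damaged.decode("utf-8", errors="replace")
print(repr(recovered))

# Doing it by hand: skip continuation bytes until a sequence can start.
resume_at = next_sequence_start(damaged, 2)
print(resume_at, repr(damaged[resume_at:].decode("utf-8")))
```

Only the damaged glyph is lost; everything after it decodes normally.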

Yes, with UTF-8 encoded text, you can no longer equate the number of
glyphs[*] in a piece of text (and hence the space required to display
the text) with the memory required to store that text.  There's a lot of
legacy code out there which makes this assumption, but this is
overshadowed by the amount of legacy code out there which can only
handle ascii text.  Fixing all that code is pretty long-winded, but not
conceptually too difficult.  Programming a text-only display to assume
everything is UTF-8 would be quite viable, and backward compatible
with ascii-only displays.  The hard part is creating a font with a
more-or-less complete set of Unicode glyphs.
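
A quick Python illustration of the count-versus-storage mismatch. The sample
strings and the helper name counts are my own; note that len() on a Python
string counts code points, which is only an approximation to glyphs and says
nothing about display width.

```python
def counts(text: str) -> tuple:
    """Return (number of code points, number of bytes when stored as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


samples = [
    "hello",                      # ascii only
    "na\u00efve",                 # one accented latin letter
    "\u0396\u03b5\u03cd\u03c2",   # four greek letters
    "\u65e5\u672c\u8a9e",         # three CJK characters
]

for text in samples:
    chars, stored = counts(text)
    print("%d code points, %d bytes" % (chars, stored))

# Pure ascii text is byte-for-byte identical when stored as UTF-8.
assert "hello".encode("utf-8") == "hello".encode("ascii")
```

Code that allocates one byte per character gets the first case right and
every other case wrong.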

	Cheers,

	Matthew

[*] Let's not even mention the concept of 'combining characters' here.

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
JID: matthew@infracaninophile.co.uk               Kent, CT11 9PW


Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4F93E159.7020807>
