Date: Sun, 22 Apr 2012 11:45:45 +0100
From: Matthew Seaman <m.seaman@infracaninophile.co.uk>
To: freebsd-questions@freebsd.org
Subject: Re: converting UTF-8 to HTML
Message-ID: <4F93E159.7020807@infracaninophile.co.uk>
In-Reply-To: <4F93CC95.5050209@locolomo.org>
References: <20120421055823.GA6788@tinyCurrent> <4F9253D7.7010609@locolomo.org> <4F9278A2.1020301@locolomo.org> <alpine.BSF.2.00.1204210909450.5338@abbf.6qbyyneqvnyhc.pbz> <4F93CC95.5050209@locolomo.org>
On 22/04/2012 10:17, Erik Nørgaard wrote:

> UTF-8 is variable width, ascii characters are stored as single bytes (not
> sure about iso-8859-1) while other characters are stored as two byte chars.

ascii uses the low 128 values that you can assign to an unsigned char, i.e. those where the high-order bit is zero. iso-8859-1 and the various other iso-8859-X character sets fill in the remaining 128 values with various other glyphs useful in latin alphabets, so it's still one char per glyph. Other alphabets (greek, cyrillic, etc.) have similar one-byte-per-glyph encodings. But you have to know what the encoding is to display the content correctly, and it is difficult to mix chunks of text in different encodings in the same document.

UTF has various different forms, based on different word sizes, but the commonly used UTF-8 works in units of 1-byte chars. However, glyphs may be represented by sequences of from 1 to 4 bytes. The 1-byte glyphs are identical to ascii. Any byte with the high-order bit set is part of a multibyte glyph: the first byte starts with the bits 11 and its bit pattern indicates the number of bytes in the sequence, and all the other bytes of that glyph start with the bits 10. All million-plus code points available through Unicode can be expressed this way, so the encoding is universal and suitable for all languages and alphabets or non-alphabetic scripts.

Not all possible byte sequences are valid UTF-8 text, but the design of the encoding means that an interpreter can skip over an invalid sequence of bytes and find the beginning of the next valid sequence easily.
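As a rough illustration of that resynchronisation property, here is a minimal Python sketch. The function names (utf8_sequence_length, is_continuation, decode_lenient) are made up for this example, and substituting U+FFFD for each bad run is just one possible policy, not something any particular editor is known to do. It reads the sequence length from the first byte, and on an invalid sequence it skips forward past any continuation bytes to the next byte that can start a glyph.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length implied by a lead byte, or 0 if the byte cannot start a sequence."""
    if lead < 0x80:              # 0xxxxxxx: identical to ascii
        return 1
    if 0xC2 <= lead <= 0xDF:     # 110xxxxx (0xC0/0xC1 would only encode overlong forms)
        return 2
    if 0xE0 <= lead <= 0xEF:     # 1110xxxx
        return 3
    if 0xF0 <= lead <= 0xF4:     # 11110xxx, capped so the result stays <= U+10FFFF
        return 4
    return 0                     # continuation byte (10xxxxxx) or invalid lead


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, replacing each invalid run with U+FFFD and carrying on."""
    out = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunk = data[i:i + n]
        if n and len(chunk) == n and all(is_continuation(b) for b in chunk[1:]):
            try:
                out.append(chunk.decode("utf-8"))
                i += n
                continue
            except UnicodeDecodeError:
                pass             # right shape, but overlong or a surrogate
        # Invalid: emit one replacement character, then resynchronise by
        # skipping ahead to the next byte that is not a continuation byte.
        out.append("\ufffd")
        i += 1
        while i < len(data) and is_continuation(data[i]):
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # A stray 0xFF in the middle does not cost us the rest of the text.
    print(decode_lenient(b"ab\xffcd"))
```

In real code the built-in bytes.decode("utf-8", errors="replace") does a similar job; the loop is spelled out here only to show that recovering from a bad byte needs nothing more than a look at the top bits of each byte.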
Whoever it was upthread who had the misfortune to run into a text editor that just gave up and truncated their document at an invalid sequence needs to vent their ire on the lazy and stupid programmers of whatever app it was, rather than on the concept of UTF-8 itself.

Yes, with UTF-8 encoded text, you can no longer equate the number of glyphs[*] in a piece of text (and hence the space required to display the text) with the memory required to store that text. There's a lot of legacy code out there which makes this assumption, but this is overshadowed by the amount of legacy code out there which can only handle ascii text. Fixing all that code is pretty long-winded, but not conceptually too difficult.

Programming a text-only display to assume everything is UTF-8 would be quite viable, and backward compatible with ascii-only displays. The hard part is creating a font with a more-or-less complete set of Unicode glyphs.

Cheers,

Matthew

[*] Let's not even mention the concept of 'combining characters' here.

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
JID: matthew@infracaninophile.co.uk               Kent, CT11 9PW