Date: Sat, 21 Apr 2012 22:07:03 +0200
From: Polytropon <freebsd@edvax.de>
To: Lars Eighner <lars@larseighner.com>
Cc: freebsd-questions@freebsd.org
Subject: Re: converting UTF-8 to HTML
Message-ID: <20120421220703.86683bc9.freebsd@edvax.de>
In-Reply-To: <alpine.BSF.2.00.1204210909450.5338@abbf.6qbyyneqvnyhc.pbz>
References: <20120421055823.GA6788@tinyCurrent> <4F9253D7.7010609@locolomo.org> <4F9278A2.1020301@locolomo.org> <alpine.BSF.2.00.1204210909450.5338@abbf.6qbyyneqvnyhc.pbz>

On Sat, 21 Apr 2012 09:10:03 -0500 (CDT), Lars Eighner wrote:
> On Sat, 21 Apr 2012, Erik Nørgaard wrote:
>
> > When characters show up wrong in the user's browser it's usually because the
> > browser is set to use a non-UTF-8 charset by default such as windows-1252,
> > the web server sends the charset=ascii in the http header and there is no or
> > incorrect meta tag to resolve the problem. Non UTF-8 charsets are a leftover
> > from the last millennium that we sometimes still choke on .. sorry the rant ;)
>
> UTF-8 is a waste of storage for most people [...]

Disks and RAM are huge and cheap. Plenty of space that is
going to be used. Nobody cares.

> [...] and is incompatible with
> text-mode tools: it's simply another bid to make it impossible to run
> without a GUI.

Again, nobody cares - until, of course, it's too late and you
need to do some recovery or analytic tasks in a limited
environment or via a connection with limited means.

Regarding the fun of encodings, endianness, representation,
use ("fi" the two letters vs. "fi" the ligature, or "ß" the
1-byte sequence vs. "ß" the two-byte sequence), see the
following document:

	Matt Mayer: Love Hotels and Unicode
	http://www.reigndesign.com/blog/love-hotels-and-unicode/

And finally it offers an interesting attack vector, given
the fact that several Unicode characters "look" the same,
but in fact are different. So "two files with the 'same'
name" is a possible means that malware implementers can
utilize to mislead the users. Short example from MICROS~1
land here:

	http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx

But this all doesn't negate the usefulness of Unicode / UTF-8
in general.
Especially when you have collaborative settings with
multi-language document processing requirements, it is a
helpful thing, as working with "normal" (ASCII) letters,
Cyrillic ones, Chinese and Japanese symbols, and Arabic writing
is no big deal as long as all the tools properly support
it the _same_ way.

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
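
The encoding points raised in the message (one byte vs. two bytes for
"ß", the "fi" ligature, a wrong browser charset, and look-alike file
names) can be checked with a short sketch. It is only an illustration,
not something from the thread: it uses Python's standard library, and
the file names and the helper describe() are made up for the example.

```python
import unicodedata

# "ß" is one byte in ISO-8859-1 but two bytes in UTF-8.
sharp_s = "\u00df"
latin1_bytes = sharp_s.encode("iso-8859-1")
utf8_bytes = sharp_s.encode("utf-8")

# Reading UTF-8 bytes with the wrong charset (here windows-1252, as a
# browser default might do) produces the familiar garbage characters.
mojibake = utf8_bytes.decode("windows-1252")

# "fi" as two letters vs. the single ligature code point U+FB01.
two_letters = "fi"
ligature = "\ufb01"
ligature_folded = unicodedata.normalize("NFKC", ligature)

# Two hypothetical file names that render alike: the second one uses
# CYRILLIC SMALL LETTER IE (U+0435) in place of the Latin "e".
latin_name = "readme.txt"
cyrillic_name = "r\u0435adme.txt"


def describe(text):
    """List the code point and Unicode name of every character."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


if __name__ == "__main__":
    print(len(latin1_bytes), len(utf8_bytes))
    print(repr(mojibake))
    print(two_letters == ligature, two_letters == ligature_folded)
    print(latin_name == cyrillic_name)
    for line in describe(cyrillic_name):
        print(line)
```

Compatibility normalization (NFKC) folds the ligature back into two
letters, but it does not merge the Latin and Cyrillic look-alikes;
spotting those needs a per-character inspection such as describe(), or
a dedicated confusables check.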