From: Polytropon
To: "Conrad J. Sabatier"
Cc: freebsd-questions@freebsd.org
Date: Wed, 9 Nov 2011 18:25:44 +0100
Subject: Re: "Unprintable" 8-bit characters
Message-Id: <20111109182544.c807a82d.freebsd@edvax.de>
In-Reply-To: <20111108205948.54daef43@cox.net>

On Tue, 8 Nov 2011 20:59:48 -0600, Conrad J. Sabatier wrote:
> Same here. I've been "guilty" as well of neglecting to properly adjust
> my console configuration.

Sometimes "just works" in combination with laziness beats all proper
concepts of doing things.
:-)

> Doesn't using "LC_ALL" obviate the need to set any of the other LC_*
> variables? At least, that's always been my understanding of it.

I have to admit that I haven't fully understood everything in that
relation, but as far as I can tell, $LC_ALL, if set, overrides all
the others, while the individual $LC_* (!ALL) modify "subsets" of
what $LANG defines. Languages and character sets can be assigned
independently (e. g. English program messages, but German file names
properly displayed).

> But, getting back to something you said earlier, what did you mean
> exactly about the precedence of LANG vs. LC_*?

The idea, if I remember correctly, is that _if_ $LC_ALL is set, the
other $LC_* and $LANG won't be considered at all, even if they are
set. Without $LC_ALL, each $LC_* wins for its own category, and $LANG
is only the fallback for categories that have no $LC_* of their own.

http://www.freebsd.org/doc/handbook/using-localization.html

See 24.3.4.1.1.1 and 24.3.4.1.2.

> Yes, and this is one area where the labels are more than a little
> misleading as well. My natural inclination is think of UTF-8 as being a
> single-byte representation for each character in the set, whereas
> UTF-16, as the name implies, would be the "wide", 2-byte version.
> Nonetheless, as I posted earlier in this thread, according to the info
> in gucharmap, the representations of the umlauted "u" are just the
> opposite of this:
>
> UTF-8: 0xC3 0xBC
> UTF-16: 0x00FC
>
> Go figure, huh? :-)

I think Robert explained it very well: UTF-16 works in 2-byte units
(one unit for the umlauted "u", two units for characters outside the
Basic Multilingual Plane), whereas UTF-8 is "variable width" in single
bytes (1 byte for ASCII, 2 bytes for the umlauted "u", up to 4 bytes
in general).

> > But returning to the original question, I think Robert
> > did explain it very well: There is no real consensus
> > about what the different codings should mean. They
> > were meant to unify the representation of a very large
> > set of characters, but basically there are many
> > interpretations now, and how they show up to the user
> > depends on the font in use, _if_ it has this mapping or
> > that, or none.
>
> This seems rather unfortunate to me.
> You would think that, by now,
> some "standard" character set might have emerged that would allow one
> to use, at the very least, the "Western" characters (as opposed to
> the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
> expectation that others will see what was intended.

Assumptions, wishes, conclusions and hopes do differ from reality. :-)

For example, in October I had to assist working on a document
containing German text and Chinese characters. Decision: We use UTF-8
so the Chinese characters can appear in the input. A name:

	Weng Tonghe [][][].

The brackets should symbolize the three characters for that name.
They did show up properly in the editor, but on the printed page...

	Weng Tonghe [][].

What? Two? But there were three on input! As we found out, the "he"
used in the input was the wrong one (there are several "he"s), and
the font used to render the text did not have that particular "he".
When we found the correct one, three characters finally appeared, as
intended and correct.

This should show: You _never_ know where things are wrong when
something is missing - settings, fonts, who knows. In relation to
file names, this is not a problem of the file system, as it will
store any name you want, but whether you can actually SEE or USE that
file name - that's a completely different thing.

> > Again a fine demonstration why file names should be
> > limited to printable ASCII and no spaces if you want
> > them to work everywhere. :-)
>
> Well, for myself, personally, I'm a bit of a stickler for "language
> authenticity", you might call it.
> Having studied both German and
> French rather extensively in my younger days, I'm quite fond of both
> languages, and rather keen on seeing them represented accurately (I
> especially wince at the use of the plain, unaccented vowel followed by
> an "e" in place of the umlaut, and to a lesser degree, the use of "ss"
> in place of Esszett), which has caused me no small amount of confusion,
> aggravation and frustration over the years, to be sure! :-)

Make sure to call it "Eszett" ("Es" = S and "Zett" = Z). The
teletype convention suggests dissolving "ß" into "sz", because
recombining "sz" into "ß" is likely to be correct, whereas
recombining "ss" into "ß" is often wrong, as there are too many
legitimate "ss" in texts. Example:

	Mißwirtschaft -> Miszwirtschaft -> Mißwirtschaft ===> good.
	Messer -> Meßer ===> wrong.

In names (e. g. of towns): Staßfurt (right) != Stassfurt (wrong).

Note that "sz" <-> "ß" does not hold in all cases, and neither does
"ss" <-> "ß", as the rule states that only a non-separable "ss" is
to be set as Eszett. There are only a few "sz" that are "real 'sz'",
typically at the joint between the parts of a compound word, e. g.
Reiszange. :-)

The "funny" things start when diacritic marks and other elements that
US-ASCII cannot represent change the meaning of a word. In such
cases, it's often justified to use the proper localized
representation. However, this is also the point where problems may
start if you're doing it wrong (which means: others do not conform to
the language settings or fonts you're using). The (limited) US-ASCII
set of characters is the only easy way to avoid that. It may not
_always_ look pretty, but even in the worst case it works - and you
can RELY on that.

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
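
To make the UTF-8 vs. UTF-16 byte counts from the gucharmap discussion
above concrete, here is a minimal sketch in Python. The language, the
helper name encoded_widths and the sample characters are my own
choices for illustration and do not come from the thread; it only
shows how many bytes each encoding needs for a few characters.

```python
# Illustrative sketch: encoded_widths and the sample set are assumptions
# made up for this example, not something taken from the thread.

def encoded_widths(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character.

    UTF-16 is encoded big-endian and without a byte order mark, so the
    bytes can be compared directly with the code point.
    """
    return ch.encode("utf-8"), ch.encode("utf-16-be")


samples = [
    ("plain u", "u"),               # U+0075, inside ASCII
    ("umlauted u", "\u00fc"),       # U+00FC, the character from the thread
    ("musical clef", "\U0001d11e"), # U+1D11E, outside the 16-bit range
]

for name, ch in samples:
    utf8, utf16 = encoded_widths(ch)
    print("%-12s U+%04X  UTF-8: %s (%d)  UTF-16: %s (%d)" % (
        name, ord(ch), utf8.hex(), len(utf8), utf16.hex(), len(utf16)))
```

For the umlauted "u" this gives c3bc for UTF-8 and 00fc for UTF-16,
the same values gucharmap reported. The third sample needs 4 bytes in
both encodings, so neither of the two has one fixed size for all
characters.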