Date:      Fri, 25 Mar 2022 08:27:12 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Pau Amma <pauamma@gundo.com>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: What's the locale for system files (e.g. /etc/fstab)?
Message-ID:  <CANCZdfpk4Hv-5F8=BX4DpMJZ4-DszHQqij_Tnn-_n68F2kHpwQ@mail.gmail.com>
In-Reply-To: <7773a0c73c77649efaf9f748ee8bb0b4@gundo.com>
References:  <70B211BB-15BA-47A4-8F9C-C833AA8C1EAA@freebsd.org> <202203241519.22OFJ3Mk098649@gndrsh.dnsmgr.net> <CANCZdfp1oJdC2HfU63U_3y4y%2BQE0TswdVSg%2Big4uS3RJC3yK3w@mail.gmail.com> <71356.1648139436@kaos.jnpr.net> <CANCZdfrZjeU_%2BLRew9BOCdktDi3aTUoeEaBkrov9FccvwfaN0g@mail.gmail.com> <EC99D2B8-769D-46BA-AF87-7B48D90E70D1@juniper.net> <CANCZdfrJhva3sgm6HnZbYrvGs90R-sJWJPfuZfT0C3Ozzz37Hg@mail.gmail.com> <7773a0c73c77649efaf9f748ee8bb0b4@gundo.com>

--000000000000dfc90105db0bc2bc
Content-Type: text/plain; charset="UTF-8"

On Fri, Mar 25, 2022, 5:10 AM Pau Amma <pauamma@gundo.com> wrote:

> (pruned cc: to just the list)
>
> On 2022-03-25 04:08, Warner Losh wrote:
> > On Thu, Mar 24, 2022 at 2:51 PM Phil Shafer <phil@juniper.net> wrote:
> >
> >> On 24 Mar 2022, at 15:12, Warner Losh wrote:
> >> > That is the primary reason for system files always being C.UTF-8...
> >> > There is no way to tag it as anything else... and some of these files
> >> > are often parsed from a context that can't set the locale, like the
> >> > boot loader or the kernel... also, these files have a format that was
> >> > defined back in the 7bit ascii time frame. They also don't make use of
> >> > the text in a way that isn't literal...
> >>
> >> Exactly.  There's just no way to know in the current setup.  And
> >> declaring it UTF-8 will break anyone currently using locale-based
> >> values.  Using the symlink has the value of allowing a simple fix
> >> ("sudo ln -s $LANG /etc/locale").
> >
> > Except it's not a simple fix. Sure, you can find this value, but
> > nothing will use it, necessarily. Since there's little value and
> > little need, I think it would be more hassle than it's worth absent
> > a much more extensive audit. For system wide things like config
> > files, we assume C.UTF-8 or the lesser ASCII-7 (or maybe ASCII-8).
>
> There's no ASCII-8. (If you meant 8859-*, there's 15 or 16, which
> essentially means "no".) Assuming ASCII (and therefore 7-bit) went out
> of style last millennium. Anything that expects or enforces something
> other than Unicode (which for all practical purposes means UTF-8) needs
> to be fixed urgently.
>

ASCII-8 here is just sloppy shorthand for "no multi-byte character
support." All the parsing routines just scan sequences of bytes for
certain fixed separator bytes. This will likely never change; if it does,
a lot of work would be needed to prove correctness, and everything that
reads these files would have to change.
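
As a rough illustration of what "look for fixed byte separators" means,
here is a minimal sketch in Python. It is not the code any FreeBSD tool
actually uses; the function name split_fields, the separator set, and the
sample fstab-style line are all made up for the example. The point is only
that the splitter never decodes characters, yet a UTF-8 mount point comes
through intact.

```python
# Hypothetical sketch: split one fstab-style line into fields by looking
# only at fixed separator bytes (space and tab). Nothing here decodes
# characters or consults a locale.
SEPARATORS = b" \t"


def split_fields(line: bytes) -> list:
    """Return the whitespace-separated fields of a line, as raw bytes."""
    fields = []
    current = bytearray()
    for byte in line.rstrip(b"\n"):
        if byte in SEPARATORS:
            # A separator ends the current field, if one is in progress.
            if current:
                fields.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        fields.append(bytes(current))
    return fields


# Made-up sample line with a non-ASCII mount point, encoded as UTF-8.
line = "/dev/ada0p2\t/données\tufs\trw\t1\t1\n".encode("utf-8")
fields = split_fields(line)
```

The second field still decodes as UTF-8 afterwards, because none of the
bytes inside the multi-byte character matched a separator.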

UTF-8 works because it mostly avoids the encodings that would get in the
way of this naive code: the bytes of a multi-byte sequence can't take
7-bit ASCII values, and all the special characters are 7-bit ASCII.
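
The property can be checked directly. The sketch below is only an
illustration; the helper name ascii_bytes_outside_ascii_chars and the
sample strings are invented for it. It collects any ASCII-range bytes that
appear while encoding non-ASCII characters. Shift_JIS is used purely as a
contrasting example of an encoding where a trailing byte can collide with
an ASCII special character (here, backslash).

```python
def ascii_bytes_outside_ascii_chars(text: str, encoding: str) -> list:
    """Return ASCII-range bytes produced while encoding non-ASCII characters."""
    stray = []
    for ch in text:
        if ord(ch) < 0x80:
            continue  # a genuine ASCII character; its byte is expected
        # Keep any byte of this character's encoding that falls below 0x80.
        stray.extend(b for b in ch.encode(encoding) if b < 0x80)
    return stray


# In UTF-8, every byte of a multi-byte sequence has the high bit set.
utf8_stray = ascii_bytes_outside_ascii_chars("表 données", "utf-8")

# In Shift_JIS, the second byte of this character is 0x5C (backslash).
sjis_stray = ascii_bytes_outside_ascii_chars("表", "shift_jis")
```

With UTF-8 the list comes back empty, so a byte-oriented parser can't
mistake part of a character for a separator or escape. With Shift_JIS the
list contains 0x5C, which is exactly the kind of collision that would trip
up the naive code.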

Warner

-- 
> #BlackLivesMatter #TransWomenAreWomen #AccessibilityMatters
> #StandWithUkrainians
> English: he/him/his (singular they/them/their/theirs OK)
> French: il/le/lui (iel/iel and ielle/ielle OK)
> Tagalog: siya/niya/kaniya (please avoid sila/nila/kanila)
>
>

--000000000000dfc90105db0bc2bc--


