Date: Thu, 24 Mar 2022 22:08:47 -0600
From: Warner Losh <imp@bsdimp.com>
To: Phil Shafer <phil@juniper.net>
Cc: "Simon J. Gerraty" <sjg@juniper.net>, "Rodney W. Grimes" <freebsd-rwg@gndrsh.dnsmgr.net>, Phil Shafer <phil@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: What's the locale for system files (e.g. /etc/fstab)?
Message-ID: <CANCZdfrJhva3sgm6HnZbYrvGs90R-sJWJPfuZfT0C3Ozzz37Hg@mail.gmail.com>
In-Reply-To: <EC99D2B8-769D-46BA-AF87-7B48D90E70D1@juniper.net>
References: <70B211BB-15BA-47A4-8F9C-C833AA8C1EAA@freebsd.org> <202203241519.22OFJ3Mk098649@gndrsh.dnsmgr.net> <CANCZdfp1oJdC2HfU63U_3y4y%2BQE0TswdVSg%2Big4uS3RJC3yK3w@mail.gmail.com> <71356.1648139436@kaos.jnpr.net> <CANCZdfrZjeU_%2BLRew9BOCdktDi3aTUoeEaBkrov9FccvwfaN0g@mail.gmail.com> <EC99D2B8-769D-46BA-AF87-7B48D90E70D1@juniper.net>

On Thu, Mar 24, 2022 at 2:51 PM Phil Shafer <phil@juniper.net> wrote:

> On 24 Mar 2022, at 15:12, Warner Losh wrote:
> > On Thu, Mar 24, 2022, 10:30 AM Simon J. Gerraty <sjg@juniper.net> wrote:
> >> AFAIK virtually everything about locale support tells you about the
> >> locale for the current process - which does not necessarily inform you
> >> of the locale that was in effect when a system file was last edited.
>
> Exactly. The value of $LANG is transient, leaving no clue about the
> encoding of the data.
>
> >> There's probably something to be said for enforcing something like
> >> C.UTF-8 for system files.
>
> I'd like to have UTF-8 as a given, or at least something definitive like
> the symlink idea. Something that tells df, mount, etc how to treat the
> value, so that it knows if it's locale-based ("%hs" for libxo) or utf-8
> ("%s" for libxo).

Right now we use %s for these things in all the other utilities (or have
traditionally done so; I've not checked recently). We don't set up the
locale stuff in these programs at all, so to match historic practice,
I think libxo should use %s.

> > That is the primary reason for system files always being C.UTF-8...
> > There is no way to tag it as anything else... and some of these files
> > are often parsed from a context that can't set the locale, like the
> > boot loader or the kernel... also, these files have a format that was
> > defined back in the 7bit ascii time frame. They also don't make use of
> > the text in a way that isn't literal...
>
> Exactly. There's just no way to know in the current setup. And
> declaring it UTF-8 will break anyone currently using locale-based
> values. Using the symlink has the value of allowing a simple fix ("sudo
> ln -s $LANG /etc/locale").

Except it's not a simple fix. Sure, you can find this value, but nothing
will necessarily use it.
Since there's little value and little need, I think it would be more
hassle than it's worth absent a much more extensive audit. For system-wide
things like config files, we assume C.UTF-8 or the lesser ASCII-7 (or maybe
ASCII-8).

> > Having said that, I'm unsure how you'd mount /<kanji-for-neko> from
> > fstab, or if that is well defined. The kernel just presents a string
> > of bytes not containing /...
>
> Currently it's not well defined, just a string of bytes, which has
> worked fine so far, but it's a problem for adding libxo support to df
> and mount, since the strings being used don't have a known encoding.
> And JSON, XML, or HTML are UTF-8, so we need to know how to treat them.
> The patch under review changes mount to use "%hs" which means that
> strings will be locale-based, but that means they will be interpreted
> using the current process's $LANG, which may not be how the file was
> encoded.

Right. They are de facto C.UTF-8, at least at the top level these days.
That's why I think we should use %s unless someone does extensive testing
and auditing of these programs to see if they still work (along with test
suites to make sure they keep working). We should not be in the business of
promising that we can set the locale in any meaningful way and have it work
for system-level things. In addition, we'd need to add a test suite for the
boot loader so that the presence of non-C.UTF-8 encoded strings in
/etc/fstab doesn't cause it to misbehave. My big worry is that this will
open up a big can of worms for people who have a system-wide default set to
something other than C.UTF-8, and these changes will cause subtle behavior
changes that we have to play whack-a-mole with in the forums and PR
database. They may not be obviously related to this change at first, and
whatever comes up may be hard to track down and fix.
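
To make the encoding concern concrete, here is a minimal sketch in Python. It is not libxo and not the patch under review; it only imitates the two interpretations discussed above: treating the bytes as UTF-8 (the "%s" case) versus converting them from the current process's locale encoding (the "%hs" case). The mount-point name, the helper names, and the choice of EUC-JP as the "other" locale are all assumptions made for illustration.

```python
# Hypothetical mount-point name: "/" followed by the kanji for "cat" (neko).
NAME = "/\u732b"


def decode_as_utf8(raw: bytes) -> str:
    """Treat the bytes as UTF-8, regardless of the process locale."""
    return raw.decode("utf-8", errors="replace")


def decode_as_locale(raw: bytes, locale_encoding: str) -> str:
    """Treat the bytes as being in the current process's locale encoding."""
    return raw.decode(locale_encoding, errors="replace")


# The same name as it might sit in /etc/fstab, depending on the editor's locale.
utf8_bytes = NAME.encode("utf-8")
eucjp_bytes = NAME.encode("euc_jp")

if __name__ == "__main__":
    # Matching assumptions round-trip cleanly.
    print(decode_as_utf8(utf8_bytes) == NAME)
    print(decode_as_locale(eucjp_bytes, "euc_jp") == NAME)
    # Mismatched assumptions silently produce a different string.
    print(repr(decode_as_locale(utf8_bytes, "euc_jp")))
    print(repr(decode_as_utf8(eucjp_bytes)))
```

Nothing in the bytes themselves says which case applies, which is why the file's encoding has to be a convention rather than something inferred from $LANG at read time.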

I will be the first to admit that I feel a bit burned by the locale stuff
in global settings since I had to rip out a bunch of locale things in our
awk because they caused weird compatibility problems with awk scripts
written in other systems.

Warner

> Thanks,
> Phil

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfrJhva3sgm6HnZbYrvGs90R-sJWJPfuZfT0C3Ozzz37Hg>