From nobody Fri Mar 25 04:08:47 2022
X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 6BB9E1A364B5 for ; Fri, 25 Mar 2022 04:09:01 +0000 (UTC) (envelope-from wlosh@bsdimp.com)
Received: from mail-ua1-x932.google.com (mail-ua1-x932.google.com [IPv6:2607:f8b0:4864:20::932]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KPpW046zRz4VJ0 for ; Fri, 25 Mar 2022 04:09:00 +0000 (UTC) (envelope-from wlosh@bsdimp.com)
Received: by mail-ua1-x932.google.com with SMTP id i26so2889774uap.6 for ; Thu, 24 Mar 2022 21:09:00 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20210112.gappssmtp.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=r/yeDJx2xM2/YBuk8VA9ROcAgKr6rLeSmMn0AMXwIr4=; b=w63WoJGo7LWZv9O60e1Ze+DuKk4/ShxAIGDKQ+RAunPR5rvTQxG3k7D84Y1mfNnQSh 8hCX8bkKTXt40CjSWdKrpUGMKXQMdXpapO2mzF9//xfi+kcXAN8VPjPYXVG6KlObd8qX LDs1JAE2GsuSErx0h8/VUOg4bmvj5g0C+7RCkWO7bXYXpZa3hdX5CK8p1t+UTpJK6vJE q64eQnaoggKXpcRE+gEsbTldcR5tmIgtijZHZ5rY7PCcoaPBtzPquhquuFc2RyAPB1Ku McMSLsRTrtnSqYGcmcC+vYPtledsznGZ8VOoxMddwj9TcukLseu4Gkprxi2dNINpLZa7 Yfnw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=r/yeDJx2xM2/YBuk8VA9ROcAgKr6rLeSmMn0AMXwIr4=; b=vkEcvuR6L6TmLijFmft/ko53XxusluDWx+M93AeXYo5RdxZH99t9HNaz2KgP2atXi4 QrKbRmExzJx5xjoQHnmkvH25ja0JMqSs/jQgtC4/OtFXGBbrlGKll/EGp4vmeFCytSmD ceVuPuDBXd1oOoEK9VvRjqKGKtcqaGnYSn1/3pJ5es+CbrFROo2UStWhZUc0hkdM4FOn pHPIyJ+jX71tfMI/jdVpeVqbCPz5f/WZdU4ECKDNkAK93X9Or9BLNndCFqUjY36lPr6p SKTs/fMBeMITNaZQeGxwjLeziryleE7cv0vgdoQYQtXlD7+L7KZpwdswUO9YKYaC/5N8 Vo9g==
X-Gm-Message-State: AOAM530uoUcbuHFcq1OxZCvQfAem7XSiz7wrD34tnz44bG2R2G4NLEzw 1Y/vFDx5fnvN4GO3C+ZxARqqaNPyz+YJWaQOF4cPSA==
X-Google-Smtp-Source: ABdhPJzbF+yCFB0D2MVaMHOUksC+jWVgWRjusO1MHplpNouZc1JXY8bzE8n2qPWPQrBp+hCI4C39caCWwOA09Z2lDc0=
X-Received: by 2002:ab0:6804:0:b0:33c:6fe1:3266 with SMTP id z4-20020ab06804000000b0033c6fe13266mr3973712uar.91.1648181334131; Thu, 24 Mar 2022 21:08:54 -0700 (PDT)
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help:
List-Post:
List-Subscribe:
List-Unsubscribe:
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
References: <70B211BB-15BA-47A4-8F9C-C833AA8C1EAA@freebsd.org> <202203241519.22OFJ3Mk098649@gndrsh.dnsmgr.net> <71356.1648139436@kaos.jnpr.net>
In-Reply-To:
From: Warner Losh
Date: Thu, 24 Mar 2022 22:08:47 -0600
Message-ID:
Subject: Re: What's the locale for system files (e.g. /etc/fstab)?
To: Phil Shafer
Cc: "Simon J. Gerraty" , "Rodney W.
    Grimes" , Phil Shafer , FreeBSD Hackers
Content-Type: multipart/alternative; boundary="000000000000f25ecf05db031ebb"
X-Rspamd-Queue-Id: 4KPpW046zRz4VJ0
X-Spamd-Bar: -
Authentication-Results: mx1.freebsd.org; dkim=pass header.d=bsdimp-com.20210112.gappssmtp.com header.s=20210112 header.b=w63WoJGo; dmarc=none; spf=none (mx1.freebsd.org: domain of wlosh@bsdimp.com has no SPF policy when checking 2607:f8b0:4864:20::932) smtp.mailfrom=wlosh@bsdimp.com
X-Spamd-Result: default: False [-1.96 / 15.00]; RCVD_TLS_ALL(0.00)[]; ARC_NA(0.00)[]; R_DKIM_ALLOW(-0.20)[bsdimp-com.20210112.gappssmtp.com:s=20210112]; NEURAL_HAM_MEDIUM(-1.00)[-0.997]; FROM_HAS_DN(0.00)[]; NEURAL_HAM_LONG(-0.96)[-0.963]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; PREVIOUSLY_DELIVERED(0.00)[freebsd-hackers@freebsd.org]; DMARC_NA(0.00)[bsdimp.com]; RCPT_COUNT_FIVE(0.00)[5]; TO_MATCH_ENVRCPT_SOME(0.00)[]; TO_DN_ALL(0.00)[]; DKIM_TRACE(0.00)[bsdimp-com.20210112.gappssmtp.com:+]; NEURAL_HAM_SHORT(-1.00)[-1.000]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::932:from]; MLMMJ_DEST(0.00)[freebsd-hackers]; FORGED_SENDER(0.30)[imp@bsdimp.com,wlosh@bsdimp.com]; R_SPF_NA(0.00)[no SPF record]; MIME_TRACE(0.00)[0:+,1:+,2:~]; SUBJECT_ENDS_QUESTION(1.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; RCVD_COUNT_TWO(0.00)[2]; FROM_NEQ_ENVFROM(0.00)[imp@bsdimp.com,wlosh@bsdimp.com]
X-ThisMailContainsUnwantedMimeParts: N

--000000000000f25ecf05db031ebb
Content-Type: text/plain; charset="UTF-8"

On Thu, Mar 24, 2022 at 2:51 PM Phil Shafer wrote:

> On 24 Mar 2022, at 15:12, Warner Losh wrote:
> > On Thu, Mar 24, 2022, 10:30 AM Simon J. Gerraty wrote:
> >> AFAIK virtually everything about locale support tells you about the
> >> locale for the current process - which does not necessarily inform you
> >> of the locale that was in effect when a system file was last edited.
>
> Exactly.  The value of $LANG is transient, leaving no clue about the
> encoding of the data.
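
To make the ambiguity concrete, here is a minimal Python sketch. It is not
taken from df, mount, or libxo; the path and the helper name decode_report
are made up for illustration. It decodes the same mount-point bytes under
two candidate encodings, which is roughly the guess a tool has to make when
nothing records the locale the file was edited in:

```python
# Hypothetical mount-point bytes as they might sit in /etc/fstab.
# 0xE9 is "e with acute accent" in ISO-8859-1, but a lone 0xE9 is not
# valid UTF-8.
raw_latin1 = b"/mnt/caf\xe9"
# The same name as written by someone editing under a UTF-8 locale.
raw_utf8 = "/mnt/caf\u00e9".encode("utf-8")


def decode_report(raw: bytes) -> dict:
    """Decode the same bytes under two candidate encodings."""
    report = {}
    for encoding in ("utf-8", "iso-8859-1"):
        try:
            report[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding.
            report[encoding] = None
    return report


print(decode_report(raw_latin1))
print(decode_report(raw_utf8))
```

The ISO-8859-1 bytes fail to decode as UTF-8, so that wrong guess is at
least detectable. The UTF-8 bytes decode "successfully" as ISO-8859-1 but
yield the wrong characters, so the opposite wrong guess goes unnoticed.
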
>
> >> There's probably something to be said for enforcing something like
> >> C.UTF-8 for system files.
>
> I'd like to have UTF-8 as a given, or at least something definitive like
> the symlink idea.  Something that tells df, mount, etc how to treat the
> value, so that it knows if it's locale-based ("%hs" for libxo) or utf-8
> ("%s" for libxo).

Right now we use %s for these things in all the other utilities (or have
traditionally done so; I've not checked recently). We don't set up the
locale stuff in these programs at all, so to match historic practice,
I think libxo should use %s.

> > That is the primary reason for system files always being C.UTF-8...
> > There is no way to tag it as anything else... and some of these files
> > are often parsed from a context that can't set the locale, like the
> > boot loader or the kernel... also, these files have a format that was
> > defined back in the 7bit ascii time frame. They also don't make use of
> > the text in a way that isn't literal...
>
> Exactly.  There's just no way to know in the current setup.  And
> declaring it UTF-8 will break anyone currently using locale-based
> values.  Using the symlink has the value of allowing a simple fix ("sudo
> ln -s $LANG /etc/locale").

Except it's not a simple fix. Sure, you can find this value, but nothing
will necessarily use it. Since there's little value and little need, I
think it would be more hassle than it's worth absent a much more
extensive audit. For system-wide things like config files, we assume
C.UTF-8 or the lesser ASCII-7 (or maybe ASCII-8).

> > Having said that, I'm unsure how you'd mount /<kanji-for-neko> from
> > fstab, or if that is well defined. The kernel just presents a string
> > of bytes not containing /...
>
> Currently it's not well defined, just a string of bytes, which has
> worked fine so far, but it's a problem for adding libxo support to df
> and mount, since the strings being used don't have a known encoding.
> And JSON, XML, or HTML are UTF-8, so we need to know how to treat them.
> The patch under review changes mount to use "%hs" which means that
> strings will be locale-based, but that means they will be interpreted
> using the current process's $LANG, which may not be how the file was
> encoded.

Right. They are de facto C.UTF-8, at least at the top level these days. That's
why I think we should use %s unless someone does extensive testing and
auditing of these programs to see if they still work (along with test suites
to make sure they still work). We should not be in the business of promising
that we can set the locale in any meaningful way and have it work for
system-level things. In addition, we'd need to add a test suite for the boot
loader so that the presence of non-C.UTF-8 encoded strings in /etc/fstab
doesn't cause it to misbehave. My big worry is that this will open up a big
can of worms for people that have a system-wide default set to something other
than C.UTF-8, and these changes will cause subtle behavior changes that we
have to play whack-a-mole with in the forums and PR database. They may not be
obviously related to this change at first, and may be hard to track down and
fix.

I will be the first to admit that I feel a bit burned by the locale stuff in
global settings, since I had to rip out a bunch of locale things in our awk
because they caused weird compatibility problems with awk scripts written on
other systems.

Warner

> Thanks,
>  Phil
>
>

--000000000000f25ecf05db031ebb
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


--000000000000f25ecf05db031ebb--