Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 18 Nov 2023 22:10:16 -0600
From:      Mike Karels <mike@karels.net>
To:        Rick Macklem <rick.macklem@gmail.com>
Cc:        FreeBSD CURRENT <freebsd-current@freebsd.org>, Alexander Motin <mav@freebsd.org>, Martin Matuska <mm@freebsd.org>, Garrett Wollman <wollman@bimajority.org>
Subject:   Re: NFS exports of ZFS snapshots broken
Message-ID:  <7C7920E0-B0FC-4255-AD5C-ECBDDC18CF43@karels.net>
In-Reply-To: <CAM5tNy6jqH3nRrtAhTaOpAR1RCXKxygSKHtDMKeQStDnkC1skg@mail.gmail.com>
References:  <25943.60056.880614.452966@hergotha.csail.mit.edu> <DDC3BBBB-561A-4779-B076-83EC054859C2@karels.net> <CAM5tNy6gcThf0nrcs4iC8r6J8cMFOHdUNcUjfJVDSpTW_tqXKQ@mail.gmail.com> <ACAFC3CD-0024-4076-8C69-CAA3B1550B41@karels.net> <AB4690A9-A8F4-490B-82AA-358CD8118DF0@karels.net> <CAM5tNy5LGWF424i9v_w2u%2B41wtzmW9kn6RZWwo6Pi0Y4wj92UA@mail.gmail.com> <CAM5tNy6K6psrtem%2Bc3ta0pM-7fhy9LYci0DEeR0O-EF%2B-iGo6Q@mail.gmail.com> <91988E23-ED50-4379-AA5F-4B069E08D80F@karels.net> <CAM5tNy68tq24C_BA0xNYM-xUbCzpAve0jY-gsr42ixQLHR5gHg@mail.gmail.com> <AD48F344-DDD1-476F-B353-B0F5C05C9808@karels.net> <CAM5tNy6jqH3nRrtAhTaOpAR1RCXKxygSKHtDMKeQStDnkC1skg@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
On 18 Nov 2023, at 21:19, Rick Macklem wrote:

> On Sat, Nov 18, 2023 at 4:43=E2=80=AFPM Mike Karels <mike@karels.net> w=
rote:
>>
>> On 18 Nov 2023, at 17:23, Rick Macklem wrote:
>>
>>> On Sat, Nov 18, 2023 at 2:27=E2=80=AFPM Mike Karels <mike@karels.net>=
 wrote:
>>>>
>>>> On 18 Nov 2023, at 15:58, Rick Macklem wrote:
>>>>
>>>>> On Sat, Nov 18, 2023 at 8:09=E2=80=AFAM Rick Macklem <rick.macklem@=
gmail.com> wrote:
>>>>>>
>>>>>> On Fri, Nov 17, 2023 at 8:19=E2=80=AFPM Mike Karels <mike@karels.n=
et> wrote:
>>>>>>>
>>>>>>> On 17 Nov 2023, at 22:14, Mike Karels wrote:
>>>>>>>
>>>>>>>> On 17 Nov 2023, at 21:24, Rick Macklem wrote:
>>>>>>>>
>>>>>>>>> Most of the changes in stable/13 that are not in releng/13.2
>>>>>>>>> are the "make it work in a jail" stuff. Unfortunately, they are=

>>>>>>>>> a large # of changes (mostly trivial edits adding vnet macros),=

>>>>>>>>> but it also includes export check changes.
>>>>>>>>>
>>>>>>>>> I have attached a trivial patch that I think disables the expor=
t
>>>>>>>>> checks for jails. If either of you can try it and see if it fix=
es
>>>>>>>>> the problem, that would be great.
>>>>>>>>> (Note that this is only for testing, although it probably does =
not
>>>>>>>>>  matter unless you are running nfsd(8) in vnet jails.)
>>>>>>>>
>>>>>>>> Yes, I can see snapshots with the patch.  This system is just a =
test
>>>>>>>> system that doesn't normally run ZFS or NFS, so no problem messi=
ng
>>>>>>>> with permissions.  It's a bhyve VM, so I just added a small disk=
 and
>>>>>>>> enabled ZFS for testing.
>>>>>>>
>>>>>>> btw, you might try to get mm@ or maybe mav@ to help out from the =
ZFS
>>>>>>> side.  It must be doing something differently inside a snapshot t=
han
>>>>>>> outside, maybe with file handles or something like that.
>>>>>> Yes. I've added freebsd-current@ (although Garrett is not on it, h=
e is
>>>>>> cc'd) and these guys specifically...
>>>>>>
>>>>>> So, here's what appears to be the problem...
>>>>>> Commit 88175af (in main and stable/13, but not 13.2) added checks =
for
>>>>>> nfsd(8) running in jails by filling in mnt_exjail with a reference=
 to the cred
>>>>>> used when the file system is exported.
>>>>>> When mnt_exjail is found NULL, the current nfsd code assumes that =
there
>>>>>> is no access allowed for the mount.
>>>>>>
>>>>>> My vague understanding is that when a ZFS snapshot is accessed, it=
 is
>>>>>> "pseudo-mounted" by zfsctl_snapdir_lookup() and I am guessing that=

>>>>>> mnt_exjail is NULL as a result.
>>>>>> Since I do not know the ZFS code and don't even have an easy way t=
o
>>>>>> test this (thankfully Mike can test easily), I do not know what to=
 do from
>>>>>> here?
>>>>>>
>>>>>> Is there a "struct mount" constructed for this pseudo mount
>>>>>> (or it actually appears to be the lookup of ".." that fails, so it=

>>>>>> might be the parent of the snapshot subdir?)?
>>>>>>
>>>>>> One thought is that I can check to see if the mount pointer is in =
the
>>>>>> mountlist (I don't think the snapshot's mount is in the mountlist)=
 and
>>>>>> avoid the jail test for this case.  This would assume that snapsho=
ts are
>>>>>> always within the file system(s) exported via that jail (which inc=
ludes
>>>>>> the case of prison0, of course), so that they do not need a separa=
te
>>>>>> jail check.
>>>>>>
>>>>>> If this doesn't work, there will need to be some sort of messing a=
bout
>>>>>> in ZFS to set mnt_exjail for these.
>>>>> Ok, so now onto the hard part...
>>>>> Thanks to Mike and others, I did create a snapshot under .zfs and I=
 can
>>>>> see the problem. It is that mnt_exjail =3D=3D NULL.
>>>>> Now, is there a way that this "struct mount" can be recognized as "=
special"
>>>>> for snapshots, so I can avoid the mnt_exjail =3D=3D NULL test?
>>>>> (I had hoped that "mp->mnt_list.tqe_prev" would be NULL, but that i=
s not
>>>>>  the case.)
>>>>
>>>> Dumb question, is the mount point (mp presumably) different between =
the
>>>> snapshot and the main file system?
>>> Not a dump question and the answer is rather interesting...
>>> It is "sometimes" or "usually" according to my printf().
>>> It seems that when you first "cd <snapshot-name"" you get a different=
 mp
>>> where mnt_exjail =3D=3D NULL.. Then when you look at directories with=
in the
>>> snapshot, you get the mp of the file system that .zfs exists in, whic=
h does
>>> have mnt_exjail set non-NULL.
>>>
>>> There is this snippet of code in zfsctl_snapdir_lookup():
>>> /*
>>> * Fix up the root vnode mounted on .zfs/snapshot/<snapname>.
>>> *
>>> * This is where we lie about our v_vfsp in order to
>>> * make .zfs/snapshot/<snapname> accessible over NFS
>>> * without requiring manual mounts of <snapname>.
>>> */
>>> ASSERT3P(VTOZ(*vpp)->z_zfsvfs, !=3D, zfsvfs);
>>> VTOZ(*vpp)->z_zfsvfs->z_parent =3D zfsvfs;
>>>
>>> /* Clear the root flag (set via VFS_ROOT) as well. */
>>> (*vpp)->v_vflag &=3D ~VV_ROOT;
>>> which seems to set the mp to that of the parent, but it
>>> seems this does not happen for the initial lookup of
>>> the <snapname>?
>>>
>>> I'll note that there is code before this in
>>> zfsctl_snapdir_lookup() for handling cases
>>> like "." and ".." that return without doing this.
>>>
>>> Now, why does this work without the mnt_exjail
>>> check (as in 13.2)?
>>> I am not quite sure, but there is this "cheat" in the
>>> NFS server (it has been there for years, maybe decades):
>>>     /*
>>>      * Allow a Lookup, Getattr, GetFH, Secinfo on an
>>>      * non-exported directory if
>>>      * nfs_rootfhset. Do I need to allow any other Ops?
>>>      * (You can only have a non-exported vpnes if
>>>      *  nfs_rootfhset is true. See nfsd_fhtovp())
>>>      * Allow AUTH_SYS to be used for file systems
>>>      * exported GSS only for certain Ops, to allow
>>>      * clients to do mounts more easily.
>>>      */
>>>     if (nfsv4_opflag[op].needscfh && vp) {
>>> if (!NFSVNO_EXPORTED(&vpnes) &&
>>>     op !=3D NFSV4OP_LOOKUP &&
>>>     op !=3D NFSV4OP_GETATTR &&
>>>     op !=3D NFSV4OP_GETFH &&
>>>     op !=3D NFSV4OP_ACCESS &&
>>>     op !=3D NFSV4OP_READLINK &&
>>>     op !=3D NFSV4OP_SECINFO &&
>>>     op !=3D NFSV4OP_SECINFONONAME)
>>> nd->nd_repstat =3D NFSERR_NOFILEHANDLE;
>>> This allows certain operations to be done on
>>> non-exported file systems and I think that is enough
>>> to allow this to work when mnt_exjail is not checked.
>>> (Note that NFSV4OP_LOOKUPP is not in the list,
>>>  which might explain why it is the one that fails for
>>>  Garrett. I don't think it can be added to this list
>>>  safely, since that would allow a client to move above
>>>  the exported file system into "uncharted territory".)
>>>
>>>>  Just curious.  Also, what is mnt_exjail
>>>> normally set to for file systems not in a jail?
>>> mnt_exjail is set to the credentials of the thread/process
>>> that exported the file system (usually mountd(8)).
>>> When not in a jail, cr_prison for these credentials
>>> points to prison0.
>>>
>>> Btw, I checked and the "other mp that has mnt_exjail =3D=3D NULL
>>> is in the mountlist, so the idea of checking "not in mountlist"
>>> is a dead end.
>>>
>>> I am looking for something "unique" about this other mp,
>>> but haven't found anything yet.
>>> Alternately, it might be necessary to add code to
>>> zfsctl_snapdir_lookup() to "cheat and change the mp"
>>> in more cases, such as "." and ".." lookups?
>>
>> It seems to me that if ZFS is creating an additional mount structure,
>> it should be responsible for setting it up correctly.  That could
>> involve a vfs-level routine to do some of the cloning.  In any case,
>> it seems to me that mnt_exjail should be set properly, e.g. by duping
>> the one in the original mount structure.  Probably ZFS is the only
>> file system type that would need this added.
> I've created a patch that I think does this. It seemed to test ok for m=
y case.
> It's in D42672 on reviews.freebsd.org.
> I have also attached it here (this diff doesn't use -U999999, so it mig=
ht be
> easier to apply).

It works for me in 13-stable, and looks plausible.

		Mike

> Hopefully this fixes the problem. Sorry for the breakage.
>
> rick
>
>>
>>                 Mike
>>
>>> rick
>>> ps: I added all the cc's back in because I want the
>>>       ZFS folk to hopefully chime in.
>>>
>>>>
>>>>                 Mike
>>>>
>>>>> Do I need to search mountlist for it?
>>>>>
>>>>> rick
>>>>> ps: The hack patch attached should fix the problem, but can only be=

>>>>>       safely used if mountd/nfsd are not run in any jails.
>>>>>
>>>>>>
>>>>>> I will try and get a test setup going here, which leads me to..
>>>>>> how do I create a ZFS snapshot? (I do have a simple ZFS pool runni=
ng
>>>>>> on a test machine, but I've never done a snapshot.)
>>>>>>
>>>>>> Although this problem is not in 13.2, it will have shipped in 14.0=
=2E
>>>>>>
>>>>>> Any help with be appreciated, rick
>>>>>>
>>>>>>>
>>>>>>>                 Mike
>>>>>>>>
>>>>>>>>> rick
>>>>>>>>>
>>>>>>>>> On Fri, Nov 17, 2023 at 6:14=E2=80=AFPM Mike Karels <mike@karel=
s.net> wrote:
>>>>>>>>>>
>>>>>>>>>> CAUTION: This email originated from outside of the University =
of Guelph. Do not click links or open attachments unless you recognize th=
e sender and know the content is safe. If in doubt, forward suspicious em=
ails to IThelp@uoguelph.ca.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Rick, have you been following this thread on freebsd-stable?  =
I have been able
>>>>>>>>>> to reproduce this using a 13-stable server from Oct 7 and a 15=
-current system
>>>>>>>>>> that is up to date using NFSv3.  I did not reproduce with a 13=
=2E2 server.  The
>>>>>>>>>> client was running 13.2.  Any ideas?  A full bisect seems fair=
ly painful, but
>>>>>>>>>> maybe you have an idea of points to try.  Fortunately, these a=
re all test
>>>>>>>>>> systems that I can reboot at will.
>>>>>>>>>>
>>>>>>>>>>                 Mike
>>>>>>>>>>
>>>>>>>>>> Forwarded message:
>>>>>>>>>>
>>>>>>>>>>> From: Garrett Wollman <wollman@bimajority.org>
>>>>>>>>>>> To: Mike Karels <mike@karels.net>
>>>>>>>>>>> Cc: freebsd-stable@freebsd.org
>>>>>>>>>>> Subject: Re: NFS exports of ZFS snapshots broken
>>>>>>>>>>> Date: Fri, 17 Nov 2023 17:35:04 -0500
>>>>>>>>>>>
>>>>>>>>>>> <<On Fri, 17 Nov 2023 15:57:42 -0600, Mike Karels <mike@karel=
s.net> said:
>>>>>>>>>>>
>>>>>>>>>>>> I have not run into this, so I tried it just now.  I had no =
problem.
>>>>>>>>>>>> The server is 13.2, fully patched, the client is up-to-date =
-current,
>>>>>>>>>>>> and the mount is v4.
>>>>>>>>>>>
>>>>>>>>>>> On my 13.2 client and 13-stable server, I see:
>>>>>>>>>>>
>>>>>>>>>>>  25034 ls       CALL  open(0x237d32f9a000,0x120004<O_RDONLY|O=
_NONBLOCK|O_DIRECTORY|O_CLOEXEC>)
>>>>>>>>>>>  25034 ls       NAMI  "/mnt/tools/.zfs/snapshot/weekly-2023-4=
5"
>>>>>>>>>>>  25034 ls       RET   open 4
>>>>>>>>>>>  25034 ls       CALL  fcntl(0x4,F_ISUNIONSTACK,0x0)
>>>>>>>>>>>  25034 ls       RET   fcntl 0
>>>>>>>>>>>  25034 ls       CALL  getdirentries(0x4,0x237d32faa000,0x1000=
,0x237d32fa7028)
>>>>>>>>>>>  25034 ls       RET   getdirentries -1 errno 5 Input/output e=
rror
>>>>>>>>>>>  25034 ls       CALL  close(0x4)
>>>>>>>>>>>  25034 ls       RET   close 0
>>>>>>>>>>>  25034 ls       CALL  exit(0)
>>>>>>>>>>>
>>>>>>>>>>> Certainly a libc bug here that getdirentries(2) returning [EI=
O]
>>>>>>>>>>> results in ls(1) returning EXIT_SUCCESS, but the [EIO] error =
is
>>>>>>>>>>> consistent across both FreeBSD and Linux clients.
>>>>>>>>>>>
>>>>>>>>>>> Looking at this from the RPC side:
>>>>>>>>>>>
>>>>>>>>>>>       (PUTFH, GETATTR, LOOKUP(snapshotname), GETFH, GETATTR)
>>>>>>>>>>>               [NFS4_OK for all ops]
>>>>>>>>>>>       (PUTFH, GETATTR)
>>>>>>>>>>>               [NFS4_OK, NFS4_OK]
>>>>>>>>>>>       (PUTFH, ACCESS(0x3f), GETATTR)
>>>>>>>>>>>               [NFS4_OK, NFS4_OK, rights =3D 0x03, NFS4_OK]
>>>>>>>>>>>       (PUTFH, GETATTR, LOOKUPP, GETFH, GETATTR)
>>>>>>>>>>>               [NFS4_OK, NFS4_OK, NFS4ERR_NOFILEHANDLE]
>>>>>>>>>>>
>>>>>>>>>>> and at this point the [EIO] is returned.
>>>>>>>>>>>
>>>>>>>>>>> It seems that clients always do a LOOKUPP before calling READ=
DIR, and
>>>>>>>>>>> this is failing when the subject file handle is the snapshot.=
  The
>>>>>>>>>>> client is perfectly able to *traverse into* the snapshot: if =
I try to
>>>>>>>>>>> list a subdirectory I know exists in the snapshot, the client=
 is able to
>>>>>>>>>>> LOOKUP(dirname) just fine, but LOOKUPP still fails with
>>>>>>>>>>> NFS4ERR_NOFILEHANDLE *on the subndirectory*.
>>>>>>>>>>>
>>>>>>>>>>> -GAWollman
>>>>>>>>>>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?7C7920E0-B0FC-4255-AD5C-ECBDDC18CF43>