Date:      Thu, 23 Nov 2023 08:26:00 -0800
From:      Rick Macklem <rick.macklem@gmail.com>
To:        Mike Karels <mike@karels.net>
Cc:        FreeBSD CURRENT <freebsd-current@freebsd.org>, Alexander Motin <mav@freebsd.org>,  Martin Matuska <mm@freebsd.org>, Garrett Wollman <wollman@bimajority.org>
Subject:   Re: NFS exports of ZFS snapshots broken
Message-ID:  <CAM5tNy7q2y7d1A44z7=zT_Y56gpFB2ETJGuQW3eBF-jgeb0SJg@mail.gmail.com>
In-Reply-To: <CAM5tNy7b_uWsvRaWOGd-CD=Sr4yZwVmepnqfNR_EUXVm_OjN2g@mail.gmail.com>
References:  <25943.60056.880614.452966@hergotha.csail.mit.edu> <DDC3BBBB-561A-4779-B076-83EC054859C2@karels.net> <CAM5tNy6gcThf0nrcs4iC8r6J8cMFOHdUNcUjfJVDSpTW_tqXKQ@mail.gmail.com> <ACAFC3CD-0024-4076-8C69-CAA3B1550B41@karels.net> <AB4690A9-A8F4-490B-82AA-358CD8118DF0@karels.net> <CAM5tNy5LGWF424i9v_w2u+41wtzmW9kn6RZWwo6Pi0Y4wj92UA@mail.gmail.com> <CAM5tNy6K6psrtem+c3ta0pM-7fhy9LYci0DEeR0O-EF+-iGo6Q@mail.gmail.com> <91988E23-ED50-4379-AA5F-4B069E08D80F@karels.net> <CAM5tNy68tq24C_BA0xNYM-xUbCzpAve0jY-gsr42ixQLHR5gHg@mail.gmail.com> <AD48F344-DDD1-476F-B353-B0F5C05C9808@karels.net> <CAM5tNy6jqH3nRrtAhTaOpAR1RCXKxygSKHtDMKeQStDnkC1skg@mail.gmail.com> <7C7920E0-B0FC-4255-AD5C-ECBDDC18CF43@karels.net> <CAM5tNy7b_uWsvRaWOGd-CD=Sr4yZwVmepnqfNR_EUXVm_OjN2g@mail.gmail.com>

Oh, and thanks to everyone who put up with my lame
ZFS snapshot questions.

rick

On Thu, Nov 23, 2023 at 8:13 AM Rick Macklem <rick.macklem@gmail.com> wrote:
>
> On Sat, Nov 18, 2023 at 8:10 PM Mike Karels <mike@karels.net> wrote:
> >
> > On 18 Nov 2023, at 21:19, Rick Macklem wrote:
> >
> > > On Sat, Nov 18, 2023 at 4:43 PM Mike Karels <mike@karels.net> wrote:
> > >>
> > >> On 18 Nov 2023, at 17:23, Rick Macklem wrote:
> > >>
> > >>> On Sat, Nov 18, 2023 at 2:27 PM Mike Karels <mike@karels.net> wrote:
> > >>>>
> > >>>> On 18 Nov 2023, at 15:58, Rick Macklem wrote:
> > >>>>
> > >>>>> On Sat, Nov 18, 2023 at 8:09 AM Rick Macklem <rick.macklem@gmail.com> wrote:
> > >>>>>>
> > >>>>>> On Fri, Nov 17, 2023 at 8:19 PM Mike Karels <mike@karels.net> wrote:
> > >>>>>>>
> > >>>>>>> On 17 Nov 2023, at 22:14, Mike Karels wrote:
> > >>>>>>>
> > >>>>>>>> On 17 Nov 2023, at 21:24, Rick Macklem wrote:
> > >>>>>>>>
> > >>>>>>>>> Most of the changes in stable/13 that are not in releng/13.2
> > >>>>>>>>> are the "make it work in a jail" stuff. Unfortunately, they are
> > >>>>>>>>> a large # of changes (mostly trivial edits adding vnet macros),
> > >>>>>>>>> but it also includes export check changes.
> > >>>>>>>>>
> > >>>>>>>>> I have attached a trivial patch that I think disables the export
> > >>>>>>>>> checks for jails. If either of you can try it and see if it fixes
> > >>>>>>>>> the problem, that would be great.
> > >>>>>>>>> (Note that this is only for testing, although it probably does not
> > >>>>>>>>>  matter unless you are running nfsd(8) in vnet jails.)
> > >>>>>>>>
> > >>>>>>>> Yes, I can see snapshots with the patch.  This system is just a test
> > >>>>>>>> system that doesn't normally run ZFS or NFS, so no problem messing
> > >>>>>>>> with permissions.  It's a bhyve VM, so I just added a small disk and
> > >>>>>>>> enabled ZFS for testing.
> > >>>>>>>
> > >>>>>>> btw, you might try to get mm@ or maybe mav@ to help out from the ZFS
> > >>>>>>> side.  It must be doing something differently inside a snapshot than
> > >>>>>>> outside, maybe with file handles or something like that.
> > >>>>>> Yes. I've added freebsd-current@ (although Garrett is not on it, he is
> > >>>>>> cc'd) and these guys specifically...
> > >>>>>>
> > >>>>>> So, here's what appears to be the problem...
> > >>>>>> Commit 88175af (in main and stable/13, but not 13.2) added checks for
> > >>>>>> nfsd(8) running in jails by filling in mnt_exjail with a reference to
> > >>>>>> the cred used when the file system is exported.
> > >>>>>> When mnt_exjail is found NULL, the current nfsd code assumes that there
> > >>>>>> is no access allowed for the mount.
> > >>>>>>
> > >>>>>> My vague understanding is that when a ZFS snapshot is accessed, it is
> > >>>>>> "pseudo-mounted" by zfsctl_snapdir_lookup() and I am guessing that
> > >>>>>> mnt_exjail is NULL as a result.
> > >>>>>> Since I do not know the ZFS code and don't even have an easy way to
> > >>>>>> test this (thankfully Mike can test easily), I do not know what to do
> > >>>>>> from here.
> > >>>>>>
> > >>>>>> Is there a "struct mount" constructed for this pseudo mount
> > >>>>>> (or it actually appears to be the lookup of ".." that fails, so it
> > >>>>>> might be the parent of the snapshot subdir?)?
> > >>>>>>
> > >>>>>> One thought is that I can check to see if the mount pointer is in the
> > >>>>>> mountlist (I don't think the snapshot's mount is in the mountlist) and
> > >>>>>> avoid the jail test for this case.  This would assume that snapshots are
> > >>>>>> always within the file system(s) exported via that jail (which includes
> > >>>>>> the case of prison0, of course), so that they do not need a separate
> > >>>>>> jail check.
> > >>>>>>
> > >>>>>> If this doesn't work, there will need to be some sort of messing about
> > >>>>>> in ZFS to set mnt_exjail for these.
> > >>>>> Ok, so now onto the hard part...
> > >>>>> Thanks to Mike and others, I did create a snapshot under .zfs and I can
> > >>>>> see the problem. It is that mnt_exjail == NULL.
> > >>>>> Now, is there a way that this "struct mount" can be recognized as
> > >>>>> "special" for snapshots, so I can avoid the mnt_exjail == NULL test?
> > >>>>> (I had hoped that "mp->mnt_list.tqe_prev" would be NULL, but that is not
> > >>>>>  the case.)
> > >>>>
> > >>>> Dumb question, is the mount point (mp presumably) different between the
> > >>>> snapshot and the main file system?
> > >>> Not a dumb question, and the answer is rather interesting...
> > >>> It is "sometimes" or "usually" according to my printf().
> > >>> It seems that when you first "cd <snapshot-name>" you get a different mp
> > >>> where mnt_exjail == NULL.  Then when you look at directories within the
> > >>> snapshot, you get the mp of the file system that .zfs exists in, which
> > >>> does have mnt_exjail set non-NULL.
> > >>>
> > >>> There is this snippet of code in zfsctl_snapdir_lookup():
> > >>>     /*
> > >>>      * Fix up the root vnode mounted on .zfs/snapshot/<snapname>.
> > >>>      *
> > >>>      * This is where we lie about our v_vfsp in order to
> > >>>      * make .zfs/snapshot/<snapname> accessible over NFS
> > >>>      * without requiring manual mounts of <snapname>.
> > >>>      */
> > >>>     ASSERT3P(VTOZ(*vpp)->z_zfsvfs, !=, zfsvfs);
> > >>>     VTOZ(*vpp)->z_zfsvfs->z_parent = zfsvfs;
> > >>>
> > >>>     /* Clear the root flag (set via VFS_ROOT) as well. */
> > >>>     (*vpp)->v_vflag &= ~VV_ROOT;
> > >>> which seems to set the mp to that of the parent, but it
> > >>> seems this does not happen for the initial lookup of
> > >>> the <snapname>?
> > >>>
> > >>> I'll note that there is code before this in
> > >>> zfsctl_snapdir_lookup() for handling cases
> > >>> like "." and ".." that return without doing this.
> > >>>
> > >>> Now, why does this work without the mnt_exjail
> > >>> check (as in 13.2)?
> > >>> I am not quite sure, but there is this "cheat" in the
> > >>> NFS server (it has been there for years, maybe decades):
> > >>>     /*
> > >>>      * Allow a Lookup, Getattr, GetFH, Secinfo on a
> > >>>      * non-exported directory if
> > >>>      * nfs_rootfhset. Do I need to allow any other Ops?
> > >>>      * (You can only have a non-exported vpnes if
> > >>>      *  nfs_rootfhset is true. See nfsd_fhtovp())
> > >>>      * Allow AUTH_SYS to be used for file systems
> > >>>      * exported GSS only for certain Ops, to allow
> > >>>      * clients to do mounts more easily.
> > >>>      */
> > >>>     if (nfsv4_opflag[op].needscfh && vp) {
> > >>>         if (!NFSVNO_EXPORTED(&vpnes) &&
> > >>>             op != NFSV4OP_LOOKUP &&
> > >>>             op != NFSV4OP_GETATTR &&
> > >>>             op != NFSV4OP_GETFH &&
> > >>>             op != NFSV4OP_ACCESS &&
> > >>>             op != NFSV4OP_READLINK &&
> > >>>             op != NFSV4OP_SECINFO &&
> > >>>             op != NFSV4OP_SECINFONONAME)
> > >>>             nd->nd_repstat = NFSERR_NOFILEHANDLE;
> > >>> This allows certain operations to be done on
> > >>> non-exported file systems and I think that is enough
> > >>> to allow this to work when mnt_exjail is not checked.
> > >>> (Note that NFSV4OP_LOOKUPP is not in the list,
> > >>>  which might explain why it is the one that fails for
> > >>>  Garrett. I don't think it can be added to this list
> > >>>  safely, since that would allow a client to move above
> > >>>  the exported file system into "uncharted territory".)
> > >>>
> > >>>>  Just curious.  Also, what is mnt_exjail
> > >>>> normally set to for file systems not in a jail?
> > >>> mnt_exjail is set to the credentials of the thread/process
> > >>> that exported the file system (usually mountd(8)).
> > >>> When not in a jail, cr_prison for these credentials
> > >>> points to prison0.
> > >>>
> > >>> Btw, I checked and the other mp that has mnt_exjail == NULL
> > >>> is in the mountlist, so the idea of checking "not in mountlist"
> > >>> is a dead end.
> > >>>
> > >>> I am looking for something "unique" about this other mp,
> > >>> but haven't found anything yet.
> > >>> Alternately, it might be necessary to add code to
> > >>> zfsctl_snapdir_lookup() to "cheat and change the mp"
> > >>> in more cases, such as "." and ".." lookups?
> > >>
> > >> It seems to me that if ZFS is creating an additional mount structure,
> > >> it should be responsible for setting it up correctly.  That could
> > >> involve a vfs-level routine to do some of the cloning.  In any case,
> > >> it seems to me that mnt_exjail should be set properly, e.g. by duping
> > >> the one in the original mount structure.  Probably ZFS is the only
> > >> file system type that would need this added.
> > > I've created a patch that I think does this. It seemed to test ok for my case.
> > > It's in D42672 on reviews.freebsd.org.
> > > I have also attached it here (this diff doesn't use -U999999, so it might be
> > > easier to apply).
> >
> > It works for me in 13-stable, and looks plausible.
>
> The patch has now been fixed so that there is no race between
> vfs_exjail_clone() and a jail dying. I believe the current patch fixes the
> problem.
> The patch can be found as an attachment to PR#275200 (that is a FreeBSD
> bugzilla problem report, not an OpenZFS pull request). I will not close the
> PR until the above has all happened.
>
> This patch will be needed for systems running stable/13, stable/14 and 14.0.
>
> The vfs_exjail_clone() part of the patch is now committed to main and will
> be MFC'd in 3 days. The ZFS part of the patch is a pull request on OpenZFS.
> I cannot say how long the ZFS part takes or if/when an errata to 14.0 will
> happen.
>
> Sorry about the breakage and thanks for reporting it (and persevering when
> I said I knew nothing about ZFS, which is true). Thanks go to Mike Karels,
> who was able to determine that the problem was not in 13.2 but was in
> stable/13, which allowed me to conclude it was a "nfsd in jail" related
> breakage.
>
> rick
>
> >
> >                 Mike
> >
> > > Hopefully this fixes the problem. Sorry for the breakage.
> > >
> > > rick
> > >
> > >>
> > >>                 Mike
> > >>
> > >>> rick
> > >>> ps: I added all the cc's back in because I want the
> > >>>       ZFS folk to hopefully chime in.
> > >>>
> > >>>>
> > >>>>                 Mike
> > >>>>
> > >>>>> Do I need to search mountlist for it?
> > >>>>>
> > >>>>> rick
> > >>>>> ps: The hack patch attached should fix the problem, but can only be
> > >>>>>     safely used if mountd/nfsd are not run in any jails.
> > >>>>>
> > >>>>>>
> > >>>>>> I will try and get a test setup going here, which leads me to..
> > >>>>>> how do I create a ZFS snapshot? (I do have a simple ZFS pool running
> > >>>>>> on a test machine, but I've never done a snapshot.)
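[Editor's note: for readers with the same question, a snapshot takes one command. The pool/dataset name tank/test and the snapshot name are placeholders for illustration; substitute your own.]

```shell
# Create a snapshot of dataset tank/test (the part after '@' is arbitrary)
zfs snapshot tank/test@before-patch

# List snapshots; each one is also visible under <mountpoint>/.zfs/snapshot/
zfs list -t snapshot
ls /tank/test/.zfs/snapshot/before-patch
```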
> > >>>>>>
> > >>>>>> Although this problem is not in 13.2, it will have shipped in 14.0.
> > >>>>>>
> > >>>>>> Any help will be appreciated, rick
> > >>>>>>
> > >>>>>>>
> > >>>>>>>                 Mike
> > >>>>>>>>
> > >>>>>>>>> rick
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Nov 17, 2023 at 6:14 PM Mike Karels <mike@karels.net> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Rick, have you been following this thread on freebsd-stable?  I
> > >>>>>>>>>> have been able to reproduce this using a 13-stable server from
> > >>>>>>>>>> Oct 7 and a 15-current system that is up to date, using NFSv3.  I
> > >>>>>>>>>> did not reproduce with a 13.2 server.  The client was running
> > >>>>>>>>>> 13.2.  Any ideas?  A full bisect seems fairly painful, but maybe
> > >>>>>>>>>> you have an idea of points to try.  Fortunately, these are all
> > >>>>>>>>>> test systems that I can reboot at will.
> > >>>>>>>>>>
> > >>>>>>>>>>                 Mike
> > >>>>>>>>>>
> > >>>>>>>>>> Forwarded message:
> > >>>>>>>>>>
> > >>>>>>>>>>> From: Garrett Wollman <wollman@bimajority.org>
> > >>>>>>>>>>> To: Mike Karels <mike@karels.net>
> > >>>>>>>>>>> Cc: freebsd-stable@freebsd.org
> > >>>>>>>>>>> Subject: Re: NFS exports of ZFS snapshots broken
> > >>>>>>>>>>> Date: Fri, 17 Nov 2023 17:35:04 -0500
> > >>>>>>>>>>>
> > >>>>>>>>>>> <<On Fri, 17 Nov 2023 15:57:42 -0600, Mike Karels <mike@karels.net> said:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I have not run into this, so I tried it just now.  I had no problem.
> > >>>>>>>>>>>> The server is 13.2, fully patched, the client is up-to-date -current,
> > >>>>>>>>>>>> and the mount is v4.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On my 13.2 client and 13-stable server, I see:
> > >>>>>>>>>>>
> > >>>>>>>>>>>  25034 ls       CALL  open(0x237d32f9a000,0x120004<O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC>)
> > >>>>>>>>>>>  25034 ls       NAMI  "/mnt/tools/.zfs/snapshot/weekly-2023-45"
> > >>>>>>>>>>>  25034 ls       RET   open 4
> > >>>>>>>>>>>  25034 ls       CALL  fcntl(0x4,F_ISUNIONSTACK,0x0)
> > >>>>>>>>>>>  25034 ls       RET   fcntl 0
> > >>>>>>>>>>>  25034 ls       CALL  getdirentries(0x4,0x237d32faa000,0x1000,0x237d32fa7028)
> > >>>>>>>>>>>  25034 ls       RET   getdirentries -1 errno 5 Input/output error
> > >>>>>>>>>>>  25034 ls       CALL  close(0x4)
> > >>>>>>>>>>>  25034 ls       RET   close 0
> > >>>>>>>>>>>  25034 ls       CALL  exit(0)
> > >>>>>>>>>>>
> > >>>>>>>>>>> Certainly a libc bug here that getdirentries(2) returning [EIO]
> > >>>>>>>>>>> results in ls(1) returning EXIT_SUCCESS, but the [EIO] error is
> > >>>>>>>>>>> consistent across both FreeBSD and Linux clients.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Looking at this from the RPC side:
> > >>>>>>>>>>>
> > >>>>>>>>>>>       (PUTFH, GETATTR, LOOKUP(snapshotname), GETFH, GETATTR)
> > >>>>>>>>>>>               [NFS4_OK for all ops]
> > >>>>>>>>>>>       (PUTFH, GETATTR)
> > >>>>>>>>>>>               [NFS4_OK, NFS4_OK]
> > >>>>>>>>>>>       (PUTFH, ACCESS(0x3f), GETATTR)
> > >>>>>>>>>>>               [NFS4_OK, NFS4_OK, rights = 0x03, NFS4_OK]
> > >>>>>>>>>>>       (PUTFH, GETATTR, LOOKUPP, GETFH, GETATTR)
> > >>>>>>>>>>>               [NFS4_OK, NFS4_OK, NFS4ERR_NOFILEHANDLE]
> > >>>>>>>>>>>
> > >>>>>>>>>>> and at this point the [EIO] is returned.
> > >>>>>>>>>>>
> > >>>>>>>>>>> It seems that clients always do a LOOKUPP before calling READDIR,
> > >>>>>>>>>>> and this is failing when the subject file handle is the snapshot.
> > >>>>>>>>>>> The client is perfectly able to *traverse into* the snapshot: if
> > >>>>>>>>>>> I try to list a subdirectory I know exists in the snapshot, the
> > >>>>>>>>>>> client is able to LOOKUP(dirname) just fine, but LOOKUPP still
> > >>>>>>>>>>> fails with NFS4ERR_NOFILEHANDLE *on the subdirectory*.
> > >>>>>>>>>>>
> > >>>>>>>>>>> -GAWollman
> > >>>>>>>>>>


