From nobody Sun Nov 19 04:10:16 2023 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4SXxxw4wMQz519S2 for ; Sun, 19 Nov 2023 04:10:28 +0000 (UTC) (envelope-from mike@karels.net) Received: from mail2.karels.net (mail2.karels.net [3.19.118.201]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "freebsd", Issuer "freebsd" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 4SXxxv5f1Mz4DX6; Sun, 19 Nov 2023 04:10:27 +0000 (UTC) (envelope-from mike@karels.net) Authentication-Results: mx1.freebsd.org; dkim=pass header.d=karels.net header.s=mail2 header.b=cETMVQeR; spf=pass (mx1.freebsd.org: domain of mike@karels.net designates 3.19.118.201 as permitted sender) smtp.mailfrom=mike@karels.net; dmarc=none Received: from mail2.karels.net (localhost [IPv6:0:0:0:0:0:0:0:1]) by mail2.karels.net (8.17.1/8.17.1) with ESMTP id 3AJ4AHif062239; Sat, 18 Nov 2023 22:10:17 -0600 (CST) (envelope-from mike@karels.net) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=karels.net; s=mail2; t=1700367017; bh=Kj9VXN0ec4NtdXdmaivIfegRSxeTPaWPpPLK4LAt3Fc=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=cETMVQeReoUR2hEB1MBViT8msRhLhX6rE8ym4p7yvUbCTrcIPTp35W8fjOEMuOE5k yx2Ma7J9JCQJ2eCDlKk7q/Pi9wcI7a6DYn8L5Qeqo0GiytK5ulXM5GKbVj0sSPuMgC VeXZ+Shu2ImjBjYnIeSyOv5awAjZSO1GefqGrQyrEn0XjhQza8ygOkxgjsrKck3pyC YtzO4QLX0VKGVFIgAQ+fTaQ5yaePlOR6N4yMRXTf8XAGBIHLAqtRy+ffHnoo9HROIH F2zoD18FIfupoAmR4KG9aeg6k1hfP5OZ4o0XA/3ua8m8x5YuxEzUXsugGCAu3L9j73 Ew95gpJhXuKAw== Received: from [10.0.2.130] ([73.62.165.147]) by mail2.karels.net with ESMTPSA id A77GGKmKWWUd8wAAs/W3XQ (envelope-from ); Sat, 18 Nov 2023 22:10:17 -0600 From: Mike Karels To: Rick Macklem Cc: FreeBSD CURRENT , Alexander Motin , Martin Matuska , Garrett Wollman Subject: Re: NFS exports of ZFS snapshots broken Date: Sat, 18 Nov 2023 22:10:16 -0600 X-Mailer: MailMate (1.14r5964) Message-ID: <7C7920E0-B0FC-4255-AD5C-ECBDDC18CF43@karels.net> In-Reply-To: References: <25943.60056.880614.452966@hergotha.csail.mit.edu> <91988E23-ED50-4379-AA5F-4B069E08D80F@karels.net> List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Spamd-Result: default: False [-2.00 / 15.00]; SUSPICIOUS_RECIPS(1.50)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; R_DKIM_ALLOW(-0.20)[karels.net:s=mail2]; R_SPF_ALLOW(-0.20)[+ip4:3.19.118.201]; MIME_GOOD(-0.10)[text/plain]; MLMMJ_DEST(0.00)[freebsd-current@freebsd.org]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; FREEMAIL_TO(0.00)[gmail.com]; ASN(0.00)[asn:16509, ipnet:3.16.0.0/14, country:US]; RCVD_TLS_LAST(0.00)[]; DKIM_TRACE(0.00)[karels.net:+]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_COUNT_TWO(0.00)[2]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; ARC_NA(0.00)[]; TO_DN_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; DMARC_NA(0.00)[karels.net]; TAGGED_RCPT(0.00)[]; RCPT_COUNT_FIVE(0.00)[5]; RCVD_VIA_SMTP_AUTH(0.00)[] X-Rspamd-Queue-Id: 4SXxxv5f1Mz4DX6 X-Spamd-Bar: - On 18 Nov 2023, at 21:19, Rick Macklem wrote: > On Sat, Nov 18, 2023 at 4:43=E2=80=AFPM Mike Karels w= rote: >> >> On 18 Nov 2023, at 17:23, Rick Macklem wrote: >> >>> On Sat, Nov 18, 2023 at 2:27=E2=80=AFPM Mike Karels = wrote: >>>> >>>> On 18 Nov 2023, at 15:58, Rick Macklem wrote: >>>> >>>>> On Sat, Nov 18, 2023 at 8:09=E2=80=AFAM Rick Macklem wrote: >>>>>> >>>>>> On Fri, Nov 17, 2023 at 8:19=E2=80=AFPM Mike Karels wrote: >>>>>>> >>>>>>> On 17 Nov 2023, at 22:14, Mike Karels wrote: >>>>>>> >>>>>>>> On 17 Nov 2023, at 21:24, Rick Macklem wrote: >>>>>>>> >>>>>>>>> Most of the changes in stable/13 that are not in releng/13.2 >>>>>>>>> are the "make it work in a jail" stuff. Unfortunately, they are= >>>>>>>>> a large # of changes (mostly trivial edits adding vnet macros),= >>>>>>>>> but it also includes export check changes. >>>>>>>>> >>>>>>>>> I have attached a trivial patch that I think disables the expor= t >>>>>>>>> checks for jails. If either of you can try it and see if it fix= es >>>>>>>>> the problem, that would be great. >>>>>>>>> (Note that this is only for testing, although it probably does = not >>>>>>>>> matter unless you are running nfsd(8) in vnet jails.) >>>>>>>> >>>>>>>> Yes, I can see snapshots with the patch. This system is just a = test >>>>>>>> system that doesn't normally run ZFS or NFS, so no problem messi= ng >>>>>>>> with permissions. It's a bhyve VM, so I just added a small disk= and >>>>>>>> enabled ZFS for testing. >>>>>>> >>>>>>> btw, you might try to get mm@ or maybe mav@ to help out from the = ZFS >>>>>>> side. It must be doing something differently inside a snapshot t= han >>>>>>> outside, maybe with file handles or something like that. >>>>>> Yes. I've added freebsd-current@ (although Garrett is not on it, h= e is >>>>>> cc'd) and these guys specifically... >>>>>> >>>>>> So, here's what appears to be the problem... >>>>>> Commit 88175af (in main and stable/13, but not 13.2) added checks = for >>>>>> nfsd(8) running in jails by filling in mnt_exjail with a reference= to the cred >>>>>> used when the file system is exported. >>>>>> When mnt_exjail is found NULL, the current nfsd code assumes that = there >>>>>> is no access allowed for the mount. >>>>>> >>>>>> My vague understanding is that when a ZFS snapshot is accessed, it= is >>>>>> "pseudo-mounted" by zfsctl_snapdir_lookup() and I am guessing that= >>>>>> mnt_exjail is NULL as a result. >>>>>> Since I do not know the ZFS code and don't even have an easy way t= o >>>>>> test this (thankfully Mike can test easily), I do not know what to= do from >>>>>> here? >>>>>> >>>>>> Is there a "struct mount" constructed for this pseudo mount >>>>>> (or it actually appears to be the lookup of ".." that fails, so it= >>>>>> might be the parent of the snapshot subdir?)? >>>>>> >>>>>> One thought is that I can check to see if the mount pointer is in = the >>>>>> mountlist (I don't think the snapshot's mount is in the mountlist)= and >>>>>> avoid the jail test for this case. This would assume that snapsho= ts are >>>>>> always within the file system(s) exported via that jail (which inc= ludes >>>>>> the case of prison0, of course), so that they do not need a separa= te >>>>>> jail check. >>>>>> >>>>>> If this doesn't work, there will need to be some sort of messing a= bout >>>>>> in ZFS to set mnt_exjail for these. >>>>> Ok, so now onto the hard part... >>>>> Thanks to Mike and others, I did create a snapshot under .zfs and I= can >>>>> see the problem. It is that mnt_exjail =3D=3D NULL. >>>>> Now, is there a way that this "struct mount" can be recognized as "= special" >>>>> for snapshots, so I can avoid the mnt_exjail =3D=3D NULL test? >>>>> (I had hoped that "mp->mnt_list.tqe_prev" would be NULL, but that i= s not >>>>> the case.) >>>> >>>> Dumb question, is the mount point (mp presumably) different between = the >>>> snapshot and the main file system? >>> Not a dump question and the answer is rather interesting... >>> It is "sometimes" or "usually" according to my printf(). >>> It seems that when you first "cd >> where mnt_exjail =3D=3D NULL.. Then when you look at directories with= in the >>> snapshot, you get the mp of the file system that .zfs exists in, whic= h does >>> have mnt_exjail set non-NULL. >>> >>> There is this snippet of code in zfsctl_snapdir_lookup(): >>> /* >>> * Fix up the root vnode mounted on .zfs/snapshot/. >>> * >>> * This is where we lie about our v_vfsp in order to >>> * make .zfs/snapshot/ accessible over NFS >>> * without requiring manual mounts of . >>> */ >>> ASSERT3P(VTOZ(*vpp)->z_zfsvfs, !=3D, zfsvfs); >>> VTOZ(*vpp)->z_zfsvfs->z_parent =3D zfsvfs; >>> >>> /* Clear the root flag (set via VFS_ROOT) as well. */ >>> (*vpp)->v_vflag &=3D ~VV_ROOT; >>> which seems to set the mp to that of the parent, but it >>> seems this does not happen for the initial lookup of >>> the ? >>> >>> I'll note that there is code before this in >>> zfsctl_snapdir_lookup() for handling cases >>> like "." and ".." that return without doing this. >>> >>> Now, why does this work without the mnt_exjail >>> check (as in 13.2)? >>> I am not quite sure, but there is this "cheat" in the >>> NFS server (it has been there for years, maybe decades): >>> /* >>> * Allow a Lookup, Getattr, GetFH, Secinfo on an >>> * non-exported directory if >>> * nfs_rootfhset. Do I need to allow any other Ops? >>> * (You can only have a non-exported vpnes if >>> * nfs_rootfhset is true. See nfsd_fhtovp()) >>> * Allow AUTH_SYS to be used for file systems >>> * exported GSS only for certain Ops, to allow >>> * clients to do mounts more easily. >>> */ >>> if (nfsv4_opflag[op].needscfh && vp) { >>> if (!NFSVNO_EXPORTED(&vpnes) && >>> op !=3D NFSV4OP_LOOKUP && >>> op !=3D NFSV4OP_GETATTR && >>> op !=3D NFSV4OP_GETFH && >>> op !=3D NFSV4OP_ACCESS && >>> op !=3D NFSV4OP_READLINK && >>> op !=3D NFSV4OP_SECINFO && >>> op !=3D NFSV4OP_SECINFONONAME) >>> nd->nd_repstat =3D NFSERR_NOFILEHANDLE; >>> This allows certain operations to be done on >>> non-exported file systems and I think that is enough >>> to allow this to work when mnt_exjail is not checked. >>> (Note that NFSV4OP_LOOKUPP is not in the list, >>> which might explain why it is the one that fails for >>> Garrett. I don't think it can be added to this list >>> safely, since that would allow a client to move above >>> the exported file system into "uncharted territory".) >>> >>>> Just curious. Also, what is mnt_exjail >>>> normally set to for file systems not in a jail? >>> mnt_exjail is set to the credentials of the thread/process >>> that exported the file system (usually mountd(8)). >>> When not in a jail, cr_prison for these credentials >>> points to prison0. >>> >>> Btw, I checked and the "other mp that has mnt_exjail =3D=3D NULL >>> is in the mountlist, so the idea of checking "not in mountlist" >>> is a dead end. >>> >>> I am looking for something "unique" about this other mp, >>> but haven't found anything yet. >>> Alternately, it might be necessary to add code to >>> zfsctl_snapdir_lookup() to "cheat and change the mp" >>> in more cases, such as "." and ".." lookups? >> >> It seems to me that if ZFS is creating an additional mount structure, >> it should be responsible for setting it up correctly. That could >> involve a vfs-level routine to do some of the cloning. In any case, >> it seems to me that mnt_exjail should be set properly, e.g. by duping >> the one in the original mount structure. Probably ZFS is the only >> file system type that would need this added. > I've created a patch that I think does this. It seemed to test ok for m= y case. > It's in D42672 on reviews.freebsd.org. > I have also attached it here (this diff doesn't use -U999999, so it mig= ht be > easier to apply). It works for me in 13-stable, and looks plausible. Mike > Hopefully this fixes the problem. Sorry for the breakage. > > rick > >> >> Mike >> >>> rick >>> ps: I added all the cc's back in because I want the >>> ZFS folk to hopefully chime in. >>> >>>> >>>> Mike >>>> >>>>> Do I need to search mountlist for it? >>>>> >>>>> rick >>>>> ps: The hack patch attached should fix the problem, but can only be= >>>>> safely used if mountd/nfsd are not run in any jails. >>>>> >>>>>> >>>>>> I will try and get a test setup going here, which leads me to.. >>>>>> how do I create a ZFS snapshot? (I do have a simple ZFS pool runni= ng >>>>>> on a test machine, but I've never done a snapshot.) >>>>>> >>>>>> Although this problem is not in 13.2, it will have shipped in 14.0= =2E >>>>>> >>>>>> Any help with be appreciated, rick >>>>>> >>>>>>> >>>>>>> Mike >>>>>>>> >>>>>>>>> rick >>>>>>>>> >>>>>>>>> On Fri, Nov 17, 2023 at 6:14=E2=80=AFPM Mike Karels wrote: >>>>>>>>>> >>>>>>>>>> CAUTION: This email originated from outside of the University = of Guelph. Do not click links or open attachments unless you recognize th= e sender and know the content is safe. If in doubt, forward suspicious em= ails to IThelp@uoguelph.ca. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Rick, have you been following this thread on freebsd-stable? = I have been able >>>>>>>>>> to reproduce this using a 13-stable server from Oct 7 and a 15= -current system >>>>>>>>>> that is up to date using NFSv3. I did not reproduce with a 13= =2E2 server. The >>>>>>>>>> client was running 13.2. Any ideas? A full bisect seems fair= ly painful, but >>>>>>>>>> maybe you have an idea of points to try. Fortunately, these a= re all test >>>>>>>>>> systems that I can reboot at will. >>>>>>>>>> >>>>>>>>>> Mike >>>>>>>>>> >>>>>>>>>> Forwarded message: >>>>>>>>>> >>>>>>>>>>> From: Garrett Wollman >>>>>>>>>>> To: Mike Karels >>>>>>>>>>> Cc: freebsd-stable@freebsd.org >>>>>>>>>>> Subject: Re: NFS exports of ZFS snapshots broken >>>>>>>>>>> Date: Fri, 17 Nov 2023 17:35:04 -0500 >>>>>>>>>>> >>>>>>>>>>> < said: >>>>>>>>>>> >>>>>>>>>>>> I have not run into this, so I tried it just now. I had no = problem. >>>>>>>>>>>> The server is 13.2, fully patched, the client is up-to-date = -current, >>>>>>>>>>>> and the mount is v4. >>>>>>>>>>> >>>>>>>>>>> On my 13.2 client and 13-stable server, I see: >>>>>>>>>>> >>>>>>>>>>> 25034 ls CALL open(0x237d32f9a000,0x120004) >>>>>>>>>>> 25034 ls NAMI "/mnt/tools/.zfs/snapshot/weekly-2023-4= 5" >>>>>>>>>>> 25034 ls RET open 4 >>>>>>>>>>> 25034 ls CALL fcntl(0x4,F_ISUNIONSTACK,0x0) >>>>>>>>>>> 25034 ls RET fcntl 0 >>>>>>>>>>> 25034 ls CALL getdirentries(0x4,0x237d32faa000,0x1000= ,0x237d32fa7028) >>>>>>>>>>> 25034 ls RET getdirentries -1 errno 5 Input/output e= rror >>>>>>>>>>> 25034 ls CALL close(0x4) >>>>>>>>>>> 25034 ls RET close 0 >>>>>>>>>>> 25034 ls CALL exit(0) >>>>>>>>>>> >>>>>>>>>>> Certainly a libc bug here that getdirentries(2) returning [EI= O] >>>>>>>>>>> results in ls(1) returning EXIT_SUCCESS, but the [EIO] error = is >>>>>>>>>>> consistent across both FreeBSD and Linux clients. >>>>>>>>>>> >>>>>>>>>>> Looking at this from the RPC side: >>>>>>>>>>> >>>>>>>>>>> (PUTFH, GETATTR, LOOKUP(snapshotname), GETFH, GETATTR) >>>>>>>>>>> [NFS4_OK for all ops] >>>>>>>>>>> (PUTFH, GETATTR) >>>>>>>>>>> [NFS4_OK, NFS4_OK] >>>>>>>>>>> (PUTFH, ACCESS(0x3f), GETATTR) >>>>>>>>>>> [NFS4_OK, NFS4_OK, rights =3D 0x03, NFS4_OK] >>>>>>>>>>> (PUTFH, GETATTR, LOOKUPP, GETFH, GETATTR) >>>>>>>>>>> [NFS4_OK, NFS4_OK, NFS4ERR_NOFILEHANDLE] >>>>>>>>>>> >>>>>>>>>>> and at this point the [EIO] is returned. >>>>>>>>>>> >>>>>>>>>>> It seems that clients always do a LOOKUPP before calling READ= DIR, and >>>>>>>>>>> this is failing when the subject file handle is the snapshot.= The >>>>>>>>>>> client is perfectly able to *traverse into* the snapshot: if = I try to >>>>>>>>>>> list a subdirectory I know exists in the snapshot, the client= is able to >>>>>>>>>>> LOOKUP(dirname) just fine, but LOOKUPP still fails with >>>>>>>>>>> NFS4ERR_NOFILEHANDLE *on the subndirectory*. >>>>>>>>>>> >>>>>>>>>>> -GAWollman >>>>>>>>>>