Date: Sat, 5 Sep 2020 04:14:19 +0000 From: Rick Macklem <rmacklem@uoguelph.ca> To: Alan Somers <asomers@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org> Subject: Re: panic!("docallb") in nfsrv_docallback Message-ID: <YTBPR01MB3966E40F861048C9240A3B89DD2A0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> In-Reply-To: <CAOtMX2h3ZTVaZB8bdMg3QeGXCLtJLAuOCEQQm-WNTVaJz1HEDA@mail.gmail.com> References: <CAOtMX2h3ZTVaZB8bdMg3QeGXCLtJLAuOCEQQm-WNTVaJz1HEDA@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
Alan Somers wrote:=0A= >I just saw this panic on a 12-stable machine. Unfortunately, I don't have= =0A= >a core dump, just a stack trace. It was serving NFS v4.0, with delegation= s=0A= >enabled. The clients were all Debian, with Linux 3.16.0.=0A= >=0A= >The proximal cause of the panic seems to be that the file had a write=0A= >delegation issued to an unconfirmed client. Root cause is harder to=0A= >determine. Did the kernel previously issue a delegation to an unconfirmed= =0A= >client? Or did the client somehow change to an unconfirmed state after th= e=0A= >delegation was issued, perhaps due to a race?=0A= >=0A= >It's hard to tell, but I don't see any checks for lc_flags &=0A= >LCL_NEEDSCONFIRM in nfsrv_openctrl (which issues the delegations), so I'm= =0A= >guessing that that's the problem.=0A= I guess I should have looked at the code before doing the last post.=0A= The check is in nfsrv_getclient(), that is called by nfsrv_opencheck().=0A= nfsrv_opencheck() - Checks to see if an Open is allowed.=0A= nfsrv_openctrl() - Does the Open, assuming nfsrv_opencheck() determined it= =0A= was allowed.=0A= =0A= > If so, then the event trace would look=0A= >like this:=0A= >=0A= >1) Client Alice sends SETCLIENTID. The server creates a client state=0A= >structure=0A= > for her.=0A= >_) Client Alice should've sent SETCLIENTID_CONFIRM, but doesn't. Bad Alic= e!=0A= >2) Client Alice sends OPEN for some file, and is issued a write delegation= .=0A= > The server shouldn't have issued it, because Alice's client ID is=0A= > unconfirmed. Bad server!=0A= I don't think this can happen. From looking at the code, an NFSERR_EXPIRED= =0A= reply to the Open should have happened.=0A= =0A= >3) Client Bob tries to do a GETATTR on that same file.=0A= >4) In nfsrv_checkgetattr, the kernel finds a write delegation for that fil= e,=0A= > owned by client Alice.=0A= I think the server needs to check for LCL_NEEDSCONFIRM in here.=0A= It gets the "clp" from the FH, but it "assumes" a confirmed ClientID.=0A= =0A= I'll code up a patch to add this check to nfsrv_checkgetattr().=0A= =0A= >5) The kernel tries to send a NFSV4OP_CBGETATTR callback to Alice, to see= =0A= >if the=0A= > file's attributes have changed.=0A= >6) But Alice's client ID is unconfirmed. Oh no! Panic!=0A= >=0A= >Does this sound plausible? Should there be a check for LCL_NEEDSCONFIRM= =0A= >somewhere around line 3166 in nfs_nfsdstate.c? Grateful for any help.=0A= Well, it doesn't appear that the Open could occur when the ClientID was=0A= not confirmed.=0A= --> The obvious case you listed above is caught by nfsrv_opencheck().=0A= Now, could a SetClientID happen between nfsrv_opencheck() and=0A= nfsrv_openctrl()?=0A= --> I don't think so. If you look at nfsrv_setclient() which does Set= ClientID,=0A= it grabs the nfsv4rootfs_lock, locking out all other nfsd thre= ads.=0A= It can't acquire the lock while a "shared lock" (I called a re= fcnt) is=0A= held by any other nfsd thread and the thread doing an Open wil= l=0A= hold the shared lock (refcnt).=0A= =0A= So, I think the Open with delegation would have been issued when the=0A= ClientID was confirmed.=0A= --> Then I suspect the client did another SetClientID that put the ClientID= =0A= back to unconfirmed (the obvious one is a client reboot).=0A= --> One quirk of SetClientID/SetClientIDConfirm is that old Open/Loc= k=0A= state cannot be discarded until it is confirmed, so the old De= legations=0A= would remain on the ClientID.=0A= --> Then a client did a Getattr on the file that had the old Delegation sti= ll=0A= allocated to it.=0A= =0A= I'd definitely say that nfsrv_checkgetattr() needs to check for an unconfir= med=0A= ClientID and return without attempting a callback.=0A= --> The exclusive lock on nfsv4rootfs_lock acquired when doing SetClientID= =0A= should also guarantee that, if the ClientID is confirmed when=0A= nfsrv_checkgetattr() tests for it, it will remain that way until aft= er the=0A= callback is completed.=0A= =0A= I'll come up with a patch and stick it on phabricator, rick=0A= =0A= =0A= =0A= -Alan=0A= =0A= P.S.: stack trace=0A= =0A= kdb_backtrace=0A= vpanic=0A= panic=0A= nfsrv_docallback=0A= nfsrv_checkgetattr=0A= nfsrvd_getattr=0A= nfsrvd_dorpc=0A= nfssvc_program=0A= svc_run_internal=0A= svc_thread_start=0A= fork_exit=0A= fork_trampoline=0A= _______________________________________________=0A= freebsd-hackers@freebsd.org mailing list=0A= https://lists.freebsd.org/mailman/listinfo/freebsd-hackers=0A= To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"= =0A= =0A=
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YTBPR01MB3966E40F861048C9240A3B89DD2A0>