Date: Sat, 5 Sep 2020 02:58:51 +0000 From: Rick Macklem <rmacklem@uoguelph.ca> To: Alan Somers <asomers@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org> Subject: Re: panic!("docallb") in nfsrv_docallback Message-ID: <YTBPR01MB396648BCDB31564AE81B59B4DD2A0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> In-Reply-To: <CAOtMX2h3ZTVaZB8bdMg3QeGXCLtJLAuOCEQQm-WNTVaJz1HEDA@mail.gmail.com> References: <CAOtMX2h3ZTVaZB8bdMg3QeGXCLtJLAuOCEQQm-WNTVaJz1HEDA@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
Alan Somers wrote:=0A= >I just saw this panic on a 12-stable machine. Unfortunately, I don't have= =0A= >a core dump, just a stack trace. It was serving NFS v4.0, with delegation= s=0A= >enabled. The clients were all Debian, with Linux 3.16.0.=0A= I will generically note that I believe the Linux NFS client developers most= ly=0A= test NFSv4.1, 4.2, so if the clients support NFSv4.1, it might be worth upg= rading?=0A= =0A= Also, delegations aren't enabled by default for a couple of reasons.=0A= 1 - For a long time, Linux only knew how to use read delegations and I felt= =0A= (and still feel) they are pretty useless.=0A= 2 - They are complex to get right.=0A= 3 - Although they should reduce the number of Open operations against the= =0A= server, I haven't observed dramatic performance improvements because= =0A= of them.=0A= =0A= >The proximal cause of the panic seems to be that the file had a write=0A= >delegation issued to an unconfirmed client. Root cause is harder to=0A= >determine. Did the kernel previously issue a delegation to an unconfirmed= =0A= >client? Or did the client somehow change to an unconfirmed state after th= e=0A= >delegation was issued, perhaps due to a race?=0A= I think the first case is more likely. Since client confirmation happens im= mediately=0A= for NFSv4.1 (the ExchangeID and Createsession must occur before anything=0A= else can happen), I wouldn;t be surprised if the Linux client tries to do a= n Open=0A= before the SetClientIDConfirm has completed for NFSv4.0.=0A= =0A= >It's hard to tell, but I don't see any checks for lc_flags &=0A= >LCL_NEEDSCONFIRM in nfsrv_openctrl (which issues the delegations), so I'm= =0A= >guessing that that's the problem.=0A= The server should definitely check for a confirmed ClientID during Open and= =0A= fail any Open attempt where that is not the case.=0A= --> I'll take a look at the code. I wrote it about 20years ago, but I can p= robably=0A= figure out how it works.;-)=0A= =0A= > If so, then the event trace would look=0A= >like this:=0A= >=0A= >1) Client Alice sends SETCLIENTID. The server creates a client state=0A= >structure=0A= > for her.=0A= >_) Client Alice should've sent SETCLIENTID_CONFIRM, but doesn't. Bad Alic= e!=0A= >2) Client Alice sends OPEN for some file, and is issued a write delegation= .=0A= > The server shouldn't have issued it, because Alice's client ID is=0A= > unconfirmed. Bad server!=0A= >3) Client Bob tries to do a GETATTR on that same file.=0A= >4) In nfsrv_checkgetattr, the kernel finds a write delegation for that fil= e,=0A= > owned by client Alice.=0A= >5) The kernel tries to send a NFSV4OP_CBGETATTR callback to Alice, to see= =0A= >if the=0A= > file's attributes have changed.=0A= >6) But Alice's client ID is unconfirmed. Oh no! Panic!=0A= >=0A= >Does this sound plausible? Should there be a check for LCL_NEEDSCONFIRM= =0A= >somewhere around line 3166 in nfs_nfsdstate.c? Grateful for any help.=0A= Yes, it does. I would have thought that I'd have checked for the unconfirme= d=0A= ClientID, but maybe not.=0A= =0A= It is also possible that the client somehow did a SetClientID after the Ope= n=0A= that issued the delegation, putting it back in "unconfirmed" state.=0A= It that was the case, maybe the panic(), intended to catch corrupted data= =0A= structures, was overkill.=0A= =0A= >-Alan=0A= >=0A= >P.S.: stack trace=0A= >=0A= >kdb_backtrace=0A= >vpanic=0A= >panic=0A= >nfsrv_docallback=0A= >nfsrv_checkgetattr=0A= nfsrv_checkgetattr() should probably check for the case of an unconfirmed= =0A= clientid and then return ignoring any delegations hanging off it instead= =0A= of attempting a callback.=0A= --> This would handle the case where the client did a SetClientID after the= =0A= Open that acquired the delegation, leaving the ClientID unconfirmed.= =0A= - The two RPCs doing SetClientID and SetClientIDConfirm are normally= =0A= done only upon mounting or when the client thinks it has lost the= =0A= ClientID due to a lease expiry, but there is also the case where it= is=0A= changing the callback address. (This could explain the SetClientID= =0A= happening after the Open that acquired the delegation.)=0A= --> Hint. Can you now see why NFSv4.1 chose to do things differently?=0A= =0A= nfsrvd_getattr=0A= nfsrvd_dorpc=0A= nfssvc_program=0A= svc_run_internal=0A= svc_thread_start=0A= fork_exit=0A= fork_trampoline=0A= =0A= Thanks for reporting it. I'll take a look, rick=0A= =0A= _______________________________________________=0A= freebsd-hackers@freebsd.org mailing list=0A= https://lists.freebsd.org/mailman/listinfo/freebsd-hackers=0A= To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"= =0A= =0A=
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YTBPR01MB396648BCDB31564AE81B59B4DD2A0>