Date: Fri, 20 Dec 2019 17:19:37 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Adam McDougall <mcdouga9@egr.msu.edu>, "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject: Re: nfs lockd errors after NetApp software upgrade.
Message-ID: <YQBPR0101MB1427CE52BBA32A888443BFB4DD2D0@YQBPR0101MB1427.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <b1182bbf-fd0b-a23d-1cc4-ddf9513bcb2e@egr.msu.edu>
References: <EBC4AD74-EC62-4C67-AB93-1AA91F662AAC@cs.huji.ac.il>
 <YQBPR0101MB1427411AFE335E869B9CF022DD530@YQBPR0101MB1427.CANPRD01.PROD.OUTLOOK.COM>
 <0121E289-D2AE-44BA-ADAC-4814CAEE676F@cs.huji.ac.il>
 <CAGfybS-3Rvs57=oGFEfii_9a=aWxPr6dEq1Y1LqHbLXK1ZKmXA@mail.gmail.com>
 <YQBPR0101MB1427F9BE658B9A46C7E08335DD520@YQBPR0101MB1427.CANPRD01.PROD.OUTLOOK.COM>
 <854B6E5A-C6BC-44B3-A656-FC9B8EF19881@cs.huji.ac.il>
 <YQBPR0101MB1427F445F1F1EAF382E5131ADD520@YQBPR0101MB1427.CANPRD01.PROD.OUTLOOK.COM>
 <8770BD0D-4B72-431A-B4F5-A29D4DBA03B1@cs.huji.ac.il>
 <b1182bbf-fd0b-a23d-1cc4-ddf9513bcb2e@egr.msu.edu>

Adam McDougall wrote:
>Try changing bool_t do_tcp = FALSE; to TRUE in
>/usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>think this makes it match Linux client behavior. I suspect I ran into
>the same issue as you. I do think I used nolockd as a workaround
>temporarily. I can provide some more details if it works.
If this fixes the problem, please let me know.

I'm not sure I'd want to change the default, since it might break things for
others, but I can definitely make it a tunable, so that people don't need to
recompile a kernel to deal with it.

rick

On 12/19/19 9:21 AM, Daniel Braniss wrote:
>
>
>> On 19 Dec 2019, at 16:09, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>> Daniel Braniss wrote:
>> [stuff snipped]
>>> all mounts are nfsv3/tcp
>> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know when
>> the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
> can the replay cache have any influence here? I tend to remember way back issues
> with it,
>>
>> To me, it looks like a network configuration issue.
> that was/is my gut feelings too, but, as far as we can tell, nothing has changed in the network infrastructure,
> the problems appeared after the NetAPP's software was updated, it was working fine till then.
>
> the problems are also happening on freebsd 12.1
>
>> You could capture packets (maybe when a client first starts rpc.statd and rpc.lockd)
>> and then look at them in wireshark. I'd disable startup of rpc.lockd and rpc.statd
>> at boot for a test client and then run something like:
>> # tcpdump -s 0 -w out.pcap host <netapp-host>
>> - and then start rpc.statd and rpc.lockd
>> Then I'd look at out.pcap in wireshark (much better at decoding this stuff than
>> tcpdump). I'd look for things like different reply IP addresses from the Netapp,
>> which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
>>
> it's going to be an interesting week end :-(
>
>>> the error is also appearing on freebsd-11.2-stable, I'm now checking if it's also
>>> happening on 12.1
>>> btw, the NetApp version is 9.3P17
>> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
>> try to implement it, because I knew the protocol was badly broken) and I avoid
>> fiddling with it. As such, it won't have changed much since around FreeBSD7.
> and we haven't had any issues with it for years, so you must have done something good
>
> cheers,
> danny
>
>>
>> rick
>>
>> cheers,
>> danny
>>
>>> rick
>>>
>>> Cheers
>>>
>>> Richard
>>> (NetApp admin)
>>>
>>> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss <danny@cs.huji.ac.il> wrote:
>>>
>>>
>>>> On 18 Dec 2019, at 16:55, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>>>
>>>> Daniel Braniss wrote:
>>>>
>>>>> Hi,
>>>>> The server with the problems is running FreeBSD 11.1 stable, it was working fine for several months,
>>>>> but after a software upgrade of our NetAPP server it's reporting many lockd errors and becomes catatonic,
>>>>> ...
>>>>> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
>>>>> Dec 18 13:11:45 moo-09 last message repeated 7 times
>>>>> Dec 18 13:12:55 moo-09 last message repeated 8 times
>>>>> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
>>>>> Dec 18 13:13:10 moo-09 last message repeated 8 times
>>>>> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
>>>>> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
>>>>> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance ...
>>>> Seems like their software upgrade didn't improve handling of NLM RPCs?
>>>> Appears to be handling RPCs slowly and/or intermittently. Note that no one
>>>> tests it with IPv6, so at least make sure you are still using IPv4 for the mounts and
>>>> try and make sure IP broadcast works between client and Netapp. I think the NLM
>>>> and NSM (rpc.statd) still use IP broadcast sometimes.
>>>>
>>> we are ipv4 - we have our own class c :-)
>>>> Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
>>>> the NLM is a fundamentally broken protocol which was never published by Sun,
>>>> so I suggest you avoid using it if at all possible.
>>> well, at the moment the ball is on NetAPP court, and switching to NFSv4 at the moment is out of the question, it's
>>> a production server used by several thousand students.
>>>
>>>>
>>>> - If the locks don't need to be seen by other clients, you can just use the "nolockd"
>>>> mount option.
>>>> or
>>>> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
>>>> should support NFSv4.1, which is a much better protocol than NFSv4.0.
>>>>
>>>> Good luck with it, rick
>>> thanks
>>> danny
>>>
>>>> ...
>>>> any ideas?
>>>>
>>>> thanks,
>>>> danny

_______________________________________________
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YQBPR0101MB1427CE52BBA32A888443BFB4DD2D0>
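
For anyone who wants to put a number on "handling RPCs slowly and/or intermittently", here is a minimal sketch (not part of the original thread) that pairs each "lockd not responding" kernel line with the next "lockd is alive again" line for the same server and reports the gap in seconds. The function name lockd_outages and the regular expression are illustrative assumptions. Syslog timestamps carry no year, so 2019 is assumed from the date of this message.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
#   Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
# The pattern is an assumption based on the log excerpt quoted in this thread.
LOCKD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:\s+"
    r"nfs server (?P<server>\S+): lockd (?P<state>not responding|is alive again)"
)


def lockd_outages(lines, year=2019):
    """Return (server, down_at, up_at, seconds) for each outage window."""
    down = {}      # server -> time of the first unanswered "not responding"
    outages = []
    for line in lines:
        m = LOCKD_RE.match(line)
        if not m:
            continue  # "last message repeated" and unrelated lines are skipped
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        server = m["server"]
        if m["state"] == "not responding":
            down.setdefault(server, ts)  # keep the earliest timestamp
        elif server in down:
            start = down.pop(server)
            outages.append((server, start, ts, int((ts - start).total_seconds())))
    return outages


# Sample input copied from the log excerpt quoted earlier in the thread.
SAMPLE = """\
Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
Dec 18 13:11:45 moo-09 last message repeated 7 times
Dec 18 13:12:55 moo-09 last message repeated 8 times
Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
""".splitlines()

for server, start, end, secs in lockd_outages(SAMPLE):
    print(f"{server}: down {start:%H:%M:%S} to {end:%H:%M:%S} ({secs} s)")
```

On the four lines quoted in the thread this reports a single window of 128 seconds for fr-06:/web/www. Because syslog collapses repeats into "last message repeated N times", the result only bounds the outage; it says nothing about how many individual lock requests timed out.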