Subject: Re: nfs lockd errors after NetApp software upgrade.
From: Daniel Braniss
Date: Sun, 22 Dec 2019 08:18:08 +0200
To: Rick Macklem
Cc: Adam McDougall, "freebsd-stable@freebsd.org"

> On 21 Dec 2019, at 19:32, Rick Macklem wrote:
>
> Daniel Braniss wrote:
>>> On 20 Dec 2019, at 19:19, Rick Macklem wrote:
>>>
>>> Adam McDougall wrote:
>>>> Try changing bool_t do_tcp = FALSE; to TRUE in
>>>> /usr/src/sys/nlm/nlm_prot_impl.c, recompile the kernel and try again. I
>>>> think this makes it match Linux client behavior. I suspect I ran into
>>>> the same issue as you. I do think I used nolockd as a workaround
>>>> temporarily. I can provide some more details if it works.
>>> If this fixes the problem, please let me know.
>>>
>>> I'm not sure I'd want to change the default, since it might break things for
>>> others, but I can definitely make it a tunable, so that people don't need to
>>> recompile a kernel to deal with it.
>>>
>>>
>> great! I was just about to see how it can be done (tunable) but need to check if it can be done
>> at any time, or just at boot time.
> I haven't looked at the code, but I suspect changing it on the fly could cause problems,
> so I am inclined to make it a tunable (boot time only).
my feelings too.
>
>> thanks.
>> btw, currently, from several hours of analysing the traffic, it seems that nlm is UDP.
> I assume that means you haven't tried flipping it to TCP yet.
I will soon, but I have my doubts. The problem is caused by multiple events, i.e., it happened once
while I was doing svn checkout, but I have done it several times since, and no issues. So it must be an aggregation
of factors. Other hosts are reporting locks times too.

danny

>
> Please let us know how it goes, rick
>
> danny
>
>
> rick
>
> On 12/19/19 9:21 AM, Daniel Braniss wrote:
>
>
> On 19 Dec 2019, at 16:09, Rick Macklem wrote:
>
> Daniel Braniss wrote:
> [stuff snipped]
> all mounts are nfsv3/tcp
> This doesn't affect what the NLM code (rpc.lockd) uses. I honestly don't know when
> the NLM uses tcp vs udp. I think rpc.statd still uses IP broadcast at times.
> can the replay cache have any influence here? I tend to remember way back issues
> with it,
>
> To me, it looks like a network configuration issue.
> that was/is my gut feeling too, but, as far as we can tell, nothing has changed in the network infrastructure,
> the problems appeared after the NetApp's software was updated, it was working fine till then.
>
> the problems are also happening on freebsd 12.1
>
> You could capture packets (maybe when a client first starts rpc.statd and rpc.lockd)
> and then look at them in wireshark. I'd disable startup of rpc.lockd and rpc.statd
> at boot for a test client and then run something like:
> # tcpdump -s 0 -w out.pcap host
> - and then start rpc.statd and rpc.lockd
> Then I'd look at out.pcap in wireshark (much better at decoding this stuff than
> tcpdump). I'd look for things like different reply IP addresses from the Netapp,
> which might confuse this tired old NLM protocol Sun devised in the mid-1980s.
>
> it's going to be an interesting week end :-(
>
> the error is also appearing on freebsd-11.2-stable, I'm now checking if it's also
> happening on 12.1
> btw, the NetApp version is 9.3P17
> Yes. I wasn't the author of the NSM and NLM code (long ago I refused to even
> try to implement it, because I knew the protocol was badly broken) and I avoid
> fiddling with it. As such, it won't have changed much since around FreeBSD7.
> and we haven't had any issues with it for years, so you must have done something good
>
> cheers,
> danny
>
>
> rick
>
> cheers,
> danny
>
> rick
>
> Cheers
>
> Richard
> (NetApp admin)
>
> On Wed, 18 Dec 2019 at 15:46, Daniel Braniss wrote:
>
>
> On 18 Dec 2019, at 16:55, Rick Macklem wrote:
>
> Daniel Braniss wrote:
>
> Hi,
> The server with the problems is running FreeBSD 11.1 stable, it was working fine for several months,
> but after a software upgrade of our NetApp server it's reporting many lockd errors and becomes catatonic,
> ...
> Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
> Dec 18 13:11:45 moo-09 last message repeated 7 times
> Dec 18 13:12:55 moo-09 last message repeated 8 times
> Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
> Dec 18 13:13:10 moo-09 last message repeated 8 times
> Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
> Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
> Dec 18 13:15:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance …
> Seems like their software upgrade didn't improve handling of NLM RPCs?
> Appears to be handling RPCs slowly and/or intermittently. Note that no one
> tests it with IPv6, so at least make sure you are still using IPv4 for the mounts and
> try and make sure IP broadcast works between client and Netapp. I think the NLM
> and NSM (rpc.statd) still use IP broadcast sometimes.
>
> we are ipv4 - we have our own class c :-)
> Maybe the network guys can suggest more w.r.t. why, but as I've stated before,
> the NLM is a fundamentally broken protocol which was never published by Sun,
> so I suggest you avoid using it if at all possible.
> well, at the moment the ball is in NetApp's court, and switching to NFSv4 at the moment is out of the question, it's
> a production server used by several thousand students.
>
>
> - If the locks don't need to be seen by other clients, you can just use the "nolockd"
> mount option.
> or
> - If locks need to be seen by other clients, try NFSv4 mounts. Netapp filers
> should support NFSv4.1, which is a much better protocol than NFSv4.0.
>
> Good luck with it, rick
> thanks
> danny
>
> …
> any ideas?
>
> thanks,
> danny
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
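
For anyone who wants to quantify how long lockd stays unresponsive and how bad the listen queue overflow gets, the syslog lines quoted earlier in the thread can be summarized with a small script. This is only a rough sketch: the function name `summarize`, the regular expressions, and the assumed year (2019, taken from the thread date, since syslog timestamps carry no year) are all illustrative and not part of any FreeBSD tool. It pairs each "lockd not responding" message with the next "lockd is alive again" for the same server and adds up the sonewconn occurrence counts.

```python
import re
from datetime import datetime

# Sample lines copied from the syslog excerpt quoted in this thread
# (the stray ">" left by mail re-wrapping has been removed).
SAMPLE_LOG = """\
Dec 18 13:11:02 moo-09 kernel: nfs server fr-06:/web/www: lockd not responding
Dec 18 13:11:45 moo-09 last message repeated 7 times
Dec 18 13:12:55 moo-09 last message repeated 8 times
Dec 18 13:13:10 moo-09 kernel: nfs server fr-06:/web/www: lockd is alive again
Dec 18 13:13:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 194 already in queue awaiting acceptance (1 occurrences)
Dec 18 13:14:29 moo-09 kernel: sonewconn: pcb 0xfffff8004cc051d0: Listen queue overflow: 193 already in queue awaiting acceptance (3957 occurrences)
"""

# Illustrative patterns; they only cover the message shapes seen above.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) (.*)$")
LOCKD_RE = re.compile(r"nfs server (\S+): lockd (not responding|is alive again)")
OVERFLOW_RE = re.compile(r"Listen queue overflow: \d+ already in queue.*\((\d+) occurrences\)")


def parse_time(stamp, year=2019):
    """Parse a syslog timestamp; the year is an assumption (syslog omits it)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def summarize(log_text):
    """Return ([(server, outage_seconds), ...], total_overflow_occurrences)."""
    outages = []
    pending = {}  # server -> time of the first "not responding" message
    overflow_total = 0
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        when = parse_time(m.group(1))
        message = m.group(3)

        lockd = LOCKD_RE.search(message)
        if lockd:
            server, state = lockd.groups()
            if state == "not responding":
                # Keep the earliest timestamp if the message repeats.
                pending.setdefault(server, when)
            elif server in pending:
                delta = when - pending.pop(server)
                outages.append((server, int(delta.total_seconds())))

        overflow = OVERFLOW_RE.search(message)
        if overflow:
            overflow_total += int(overflow.group(1))
    return outages, overflow_total


if __name__ == "__main__":
    outages, overflow_total = summarize(SAMPLE_LOG)
    for server, seconds in outages:
        print(f"{server}: lockd unresponsive for {seconds} s")
    print(f"listen queue overflow occurrences: {overflow_total}")
```

Running it over /var/log/messages from several clients (read the file and pass its contents to `summarize`) should show whether the outages line up in time across hosts, which would point at the filer rather than at any one client.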