Date: Thu, 16 Jun 2011 16:14:37 -0700
From: Artem Belevich <art@freebsd.org>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: Rick Macklem <rmacklem@freebsd.org>, FreeBSD Net <freebsd-net@freebsd.org>
Subject: Re: amd + NFS reconnect = ICMP storm + unkillable process.
Message-ID: <BANLkTi=2oCtY-XaqXDrytxxBxtaM6Bru=A@mail.gmail.com>
In-Reply-To: <395930590.686693.1308254494066.JavaMail.root@erie.cs.uoguelph.ca>
References: <BANLkTim=F%2BBemoUMJpnrZhRW99jZZSZk4A@mail.gmail.com> <395930590.686693.1308254494066.JavaMail.root@erie.cs.uoguelph.ca>
On Thu, Jun 16, 2011 at 1:01 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> * We try to reconnect again, and again, and again....
>> * the process in this state is unkillable
>>
> If you use the "intr" mount option, then an nfs reconnect
> should be killable. I know diddly about amd, so I can't
> help beyond that.

I'll give it a try. Amd does not have an option to mount itself with
'intr', so I'll need to hack it in.

I did a bit more digging into the problem, and it is starting to look
like amd may not be the culprit after all.

I captured the traffic while the issue was happening, and the very first
thing that stood out in wireshark was that it reported a lot of
retransmissions and duplicates:
http://pastebin.com/PzcwKu1J

It looks like the way we generate XIDs makes XID collisions
possible/likely in some situations. I suspected that this could be what
causes my problem, so I hacked the RPC code to generate XIDs the way
opensolaris does -- start with a time-based value and then allocate XIDs
sequentially. With this patch ( http://pastebin.com/PzcwKu1J ) the
collisions (and the retransmits/duplicates reported by wireshark) mostly
went away.

Unfortunately, the problem remained. Now the capture during the problem
looks like this:
http://pastebin.com/3M6HZrcq

Port 1022 is on the amd side. The other ports belong to the NFS client in
the kernel. Normally there seems to be only one NFS client per NFS mount.
I wonder how we ended up with many NFS clients actively calling GETATTR
on the same file handle even though there's only *one* process stuck
trying to do a stat() call.

amd does reply to those requests just fine, but for some reason the NFS
client code in the kernel does not seem to see those replies.

I haven't wrapped my head around the RPC code well enough yet to figure
out what could cause the amd replies to be lost and what triggers the
reconnect on the NFS client side. If you could nudge me in the right
direction, I'd appreciate that.

Thanks,
--Artem