FreeBSD Mail Archives

Date:      Thu, 16 Jun 2011 16:14:37 -0700
From:      Artem Belevich <art@freebsd.org>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        Rick Macklem <rmacklem@freebsd.org>, FreeBSD Net <freebsd-net@freebsd.org>
Subject:   Re: amd + NFS reconnect = ICMP storm + unkillable process.
Message-ID:  <BANLkTi=2oCtY-XaqXDrytxxBxtaM6Bru=A@mail.gmail.com>
In-Reply-To: <395930590.686693.1308254494066.JavaMail.root@erie.cs.uoguelph.ca>
References:  <BANLkTim=F%2BBemoUMJpnrZhRW99jZZSZk4A@mail.gmail.com> <395930590.686693.1308254494066.JavaMail.root@erie.cs.uoguelph.ca>


On Thu, Jun 16, 2011 at 1:01 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>> * We try to reconnect again, and again, and again....
>> * the process in this state is unkillable
>>
> If you use the "intr" mount option, then an nfs reconnect
> should be killable. I know diddly about amd, so I can't
> help beyond that.

I'll give it a try. Amd does not have an option to mount itself with
'intr', so I'll need to hack it in.

I did a bit more digging into the problem and it starts to look like
amd may not be the culprit after all.

I've captured the traffic during the issue and the very first thing
that popped up in wireshark was that it reported a lot of
retransmissions and duplicates:
http://pastebin.com/PzcwKu1J

It looks like the way we generate XIDs makes XID collisions
possible/likely in some situations. I suspected that this could be
what causes my problem, so I've hacked the RPC code to generate XIDs the
way opensolaris does -- start with a time-based value and then
allocate XIDs sequentially. With this patch (
http://pastebin.com/PzcwKu1J ) collisions (and retransmits/duplicates
reported by wireshark) mostly went away. Unfortunately, the problem
remained.
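
The XID scheme described above (seed from a time-based value, then
hand out XIDs sequentially) can be sketched roughly as follows. This is
only an illustration under assumptions, not the patch from the pastebin
link and not the actual FreeBSD or opensolaris code: the names
next_xid, xid_seed() and xid_alloc() are hypothetical, and seeding
from time() in whole seconds is an assumed choice of "time-based value".

```c
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical shared counter; not a real kernel symbol. */
static _Atomic uint32_t next_xid;

/*
 * Seed the counter once, e.g. at RPC subsystem start-up.
 * Assumption: wall-clock seconds are a good enough time-based
 * starting point for this illustration.
 */
static void
xid_seed(void)
{
	atomic_store(&next_xid, (uint32_t)time(NULL));
}

/*
 * Allocate the next XID. Every caller gets a distinct value until
 * the 32-bit counter wraps, so two requests in flight at the same
 * time should not end up sharing an XID.
 */
static uint32_t
xid_alloc(void)
{
	return atomic_fetch_add(&next_xid, 1);
}
```

Compared with picking an independent starting value per client handle,
a single sequential counter makes it much less likely that two
outstanding requests carry the same XID, which is presumably why the
retransmissions/duplicates flagged by wireshark mostly disappeared.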

Now the capture during the problem looks like this: http://pastebin.com/3M6HZrcq
Port 1022 is on the amd side. The other ports belong to the NFS client in
the kernel. Normally there seems to be only one NFS client per NFS mount.
I wonder how come we ended up with many NFS clients actively calling
GETATTR on the same file handle even though there's only *one* process
stuck trying to do a stat() call? amd does reply to those requests
just fine, but for some reason the NFS client code in the kernel does not
seem to see those replies. I haven't wrapped my head around the RPC code
enough yet to figure out what could cause amd replies to be lost and
what triggers the reconnect on the NFS client side. If you could nudge me in
the right direction, I'd appreciate that.

Thanks,
--Artem


