Date:      Thu, 25 Aug 2011 21:24:28 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Artem Belevich <art@freebsd.org>
Cc:        freebsd-net@freebsd.org, Martin Birgmeier <la5lbtyi@aon.at>
Subject:   Re: amd + NFS reconnect = ICMP storm + unkillable process.
Message-ID:  <1499650185.371230.1314321868068.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <CAFqOu6hWry%2B_wkx8MJ7ept7v2o0EWBsiwu=%2BSHLJOVH69ToanA@mail.gmail.com>

Artem Belevich wrote:
> On Wed, Jul 6, 2011 at 4:50 AM, Martin Birgmeier <la5lbtyi@aon.at>
> wrote:
> > Hi Artem,
> >
> > I have exactly the same problem as you are describing below, also
> > with quite
> > a number of amd mounts.
> >
> > In addition to the scenario you describe, another way this happens
> > here
> > is when downloading a file via firefox to a directory currently open
> > in
> > dolphin (KDE file manager). This will almost surely trigger the
> > symptoms
> > you describe.
> >
> > I've had 7.4 running on the box before, now with 8.2 this has
> > started to
> > happen.
> >
> > Alas, I don't have a solution.
> 
> I may be on to something. Here's what seems to be happening in my
> case:
> 
> * Process, that's in the middle of a syscall accessing amd mountpoint
> gets interrupted.
> * If the syscall was restartable, msleep at the beginning of the
> get_reply: loop in clnt_dg_call() would return ERESTART.
> * ERESTART will result in clnt_dg_call() returning with RPC_CANTRECV
> * clnt_reconnect_call() then will try to reconnect, and msleep will
> eventually fail with ERESTART in clnt_dg_call() again and the whole
> thing will be repeating for a while.
> 
Btw, I fixed exactly the same issue for the TCP code (clnt_vc.c) in
r221127, so I wouldn't be surprised if the UDP code suffers the same
problem. I'll take a look at your patch tomorrow. You could also try
a TCP mount and see if the problem goes away. (For TCP on a pre-r221127
system, the symptom would be a client thread looping in the kernel in
"R" state.)

I'll look tomorrow, but it sounds like you've figured it out. Looks like
a good catch to me at this point, rick

> I'm not familiar enough with the RPC code, but looking at clnt_vc.c
> and other RPC places, it appears that both EINTR and ERESTART should
> translate into RPC_INTR error. However in clnt_dg.c that's not the
> case and that's what seems to make amd-mounted accesses hang.
> 
> Following patch (against RELENG-8 @ r225118) seems to have fixed the
> issue for me. With the patch I no longer see the hangs or ICMP storms
> on the test case that could reliably reproduce the issue within
> minutes. Let me know if it helps in your case.
> 
> --- a/sys/rpc/clnt_dg.c
> +++ b/sys/rpc/clnt_dg.c
> @@ -636,7 +636,7 @@ get_reply:
>  		 */
>  		if (error != EWOULDBLOCK) {
>  			errp->re_errno = error;
> -			if (error == EINTR)
> +			if (error == EINTR || error == ERESTART)
>  				errp->re_status = stat = RPC_INTR;
>  			else
>  				errp->re_status = stat = RPC_CANTRECV;
> 
> --Artem
> 
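For anyone following along, here is a rough user-space sketch of what the
quoted one-line change does. It is not the kernel code: the sketch_* names
and the SK_RPC_* enum are made up for illustration, and the retry behaviour
of the caller (reconnect on "can't receive", stop on "interrupted") is an
assumption based on the description in this thread.

```c
#include <errno.h>

/* ERESTART is kernel-internal on FreeBSD; define a stand-in if missing. */
#ifndef ERESTART
#define ERESTART (-1)
#endif

/* Hypothetical stand-ins for the RPC status codes named in the thread. */
enum sketch_stat { SK_RPC_CANTRECV, SK_RPC_INTR };

/*
 * Map a non-EWOULDBLOCK sleep error to a status.
 * patched == 0: only EINTR becomes SK_RPC_INTR (behaviour before the patch).
 * patched != 0: ERESTART is treated the same way as EINTR.
 */
static enum sketch_stat
sketch_map_error(int error, int patched)
{
	if (error == EINTR || (patched && error == ERESTART))
		return (SK_RPC_INTR);
	return (SK_RPC_CANTRECV);
}

/*
 * Assumed caller behaviour: reconnect and retry while the status is
 * SK_RPC_CANTRECV, stop as soon as it is SK_RPC_INTR. Returns the number
 * of reconnect attempts, capped at max_tries so the sketch terminates.
 */
static int
sketch_count_reconnects(int error, int patched, int max_tries)
{
	int tries = 0;

	while (tries < max_tries &&
	    sketch_map_error(error, patched) == SK_RPC_CANTRECV)
		tries++;
	return (tries);
}
```

With patched set to 0 and error set to ERESTART, the loop only ends at the
cap, which mirrors the repeated reconnects described in this thread. With
patched set to 1 it stops before the first reconnect.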
> >
> > We should probably file a PR, but I don't even know where to assign
> > it to.
> > Amd does not seem much maintained, it's probably using some
> > old-style
> > mounts (it never mounts anything via IPv6, for example).
> >
> > Regards,
> >
> > Martin
> >
> >> Hi,
> >>
> >> I wonder if someone else ran into this issue before and, maybe,
> >> have a
> >> solution.
> >>
> >> I've been running into a problem where access to filesystems mounted
> >> with amd wedges processes in an unkillable state and produces an ICMP
> >> storm on the loopback interface. I've managed to narrow it down to NFS
> >> reconnect, but that's when I ran out of ideas.
> >>
> >> Usually the problem happens when I abort a parallel build job in an
> >> i386 jail on FreeBSD-8/amd64 (r223055). When the build job is
> >> killed, now and then I end up with one process consuming 100% of
> >> CPU time on one of the cores. At the same time I get a lot of
> >> messages on the console saying "Limiting icmp unreach response from
> >> 49837 to 200 packets/sec" and the loopback traffic goes way up.
> >>
> >> As far as I can tell here's what's happening:
> >>
> >> * My setup uses a lot of filesystems mounted by amd.
> >> * amd itself pretends to be an NFS server running on the localhost
> >> and
> >> serving requests for amd mounts.
> >> * Now and then amd seems to change the ports it uses. Beats me why.
> >> * the problem seems to happen when some process is about to access
> >> amd
> >> mountpoint when amd instance disappears from the port it used to
> >> listen on. In my case it does correlate with interrupted builds,
> >> but I
> >> have no clue why.
> >> * NFS client detects disconnect and tries to reconnect using the
> >> same
> >> destination port.
> >> * That generates an ICMP response as the port is unreachable, and
> >> the reconnect call returns almost immediately.
> >> * We try to reconnect again, and again, and again....
> >> * the process in this state is unkillable
> >>
> >> Here's what the stack of the 'stuck' process looks like in those
> >> rare
> >> moments when it gets to sleep:
> >> 18779 100511 collect2 - mi_switch+0x176
> >> turnstile_wait+0x1cb _mtx_lock_sleep+0xe1
> >> sleepq_catch_signals+0x386
> >> sleepq_timedwait_sig+0x19 _sleep+0x1b1 clnt_dg_call+0x7e6
> >> clnt_reconnect_call+0x12e nfs_request+0x212 nfs_getattr+0x2e4
> >> VOP_GETATTR_APV+0x44 nfs_bioread+0x42a VOP_READLINK_APV+0x4a
> >> namei+0x4f9 kern_statat_vnhook+0x92 kern_statat+0x15
> >> freebsd32_stat+0x2e syscallenter+0x23d
> >>
> >> * Usually some timeout expires in a few minutes, the process dies,
> >> the ICMP storm stops and the system is usable again.
> >> * On occasion the process is stuck forever and I have to reboot the
> >> box.
> >>
> >> I'm not sure who's to blame here.
> >>
> >> Is the automounter at fault for disappearing from the port it was
> >> supposed to listen to?
> >> Is NFS guilty of trying blindly to reconnect on the same port and
> >> not giving up sooner?
> >> Should I flog the operator (AKA myself) for misconfiguring
> >> something (what?) in amd or NFS?
> >>
> >> More importantly -- how do I fix it?
> >> Any suggestions on fixing/debugging this issue?
> >>
> >> --Artem
> >
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"


