From owner-freebsd-net@FreeBSD.ORG Fri Aug 26 19:04:44 2011
Date: Fri, 26 Aug 2011 15:04:43 -0400 (EDT)
From: Rick Macklem
To: Artem Belevich
Cc: freebsd-net@freebsd.org, Martin Birgmeier
Message-ID: <2062808982.416174.1314385483626.JavaMail.root@erie.cs.uoguelph.ca>
Subject: Re: amd + NFS reconnect = ICMP storm + unkillable process.
List-Id: Networking and TCP/IP with FreeBSD

Artem Belevich wrote:
> On Thu, Aug 25, 2011 at 6:24 PM, Rick Macklem wrote:
> > Btw, I fixed exactly the same issue for the TCP code (clnt_vc.c) in
> > r221127, so I wouldn't be surprised if the UDP code suffers the same
>
> The code in clnt_vc.c was exactly what made me wonder about treatment
> of ERESTART.
>
> > problem. I'll take a look at your patch tomorrow. You could also try
> > a TCP mount and see if the problem goes away. (For TCP on a
> > pre-r221127 system, the symptom would be a client thread looping in
> > the kernel in "R" state.)
>
> In my case the process was also stuck in unkillable running state
> because the process never returns from the syscall.
>
> Unfortunately amd itself seems to handle NFS requests for its own
> top-level mountpoints only via UDP. At least I haven't found a way to
> do so without hacking rather convoluted amd code.
>
> > I'll look tomorrow, but it sounds like you've figured it out. Looks
> > like a good catch to me at this point, rick
>
> Let me know if you're OK with the patch and I'll commit to head and
> MFC it to stable/8.
>
The patch looks good to me. The only thing is that *maybe* it should
also do the same for the other msleep() higher up in clnt_dg_call()?
(It seems to me that if this msleep() were to return ERESTART, the same
kernel loop would occur.)

Here's this variant of the patch (I'll let you decide which to commit).

Good work tracking this down, rick

--- rpc/clnt_dg.c.sav	2011-08-26 14:44:27.000000000 -0400
+++ rpc/clnt_dg.c	2011-08-26 14:48:07.000000000 -0400
@@ -467,7 +467,10 @@ send_again:
 			    cu->cu_waitflag, "rpccwnd", 0);
 			if (error) {
 				errp->re_errno = error;
-				errp->re_status = stat = RPC_CANTSEND;
+				if (error == EINTR || error == ERESTART)
+					errp->re_status = stat = RPC_INTR;
+				else
+					errp->re_status = stat = RPC_CANTSEND;
 				goto out;
 			}
 		}
@@ -636,7 +639,7 @@ get_reply:
 			 */
 			if (error != EWOULDBLOCK) {
 				errp->re_errno = error;
-				if (error == EINTR)
+				if (error == EINTR || error == ERESTART)
 					errp->re_status = stat = RPC_INTR;
 				else
 					errp->re_status = stat = RPC_CANTRECV;
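
For anyone reading along without the source handy, here is a rough
userland sketch of the error mapping that both hunks apply. It is not
the kernel code: the enum and the function name map_sleep_error() are
made up for illustration, the enum values are arbitrary, and the
ERESTART fallback is only there because FreeBSD does not expose that
errno to userland.

```c
#include <assert.h>
#include <errno.h>

/*
 * ERESTART is kernel-internal on FreeBSD; define a fallback (assumed
 * value) so this userland sketch compiles where <errno.h> hides it.
 */
#ifndef ERESTART
#define ERESTART (-1)
#endif

/*
 * Hypothetical stand-ins for the RPC status codes named in the patch.
 * The numeric values here are illustrative only.
 */
enum sketch_stat {
	SK_RPC_CANTSEND,
	SK_RPC_CANTRECV,
	SK_RPC_INTR
};

/*
 * Map a nonzero msleep()-style error to an RPC status.  EINTR and
 * ERESTART both indicate the sleep was interrupted, so both are
 * reported as SK_RPC_INTR; any other error yields the status supplied
 * by the caller (CANTSEND on the send path, CANTRECV on the receive
 * path).
 */
enum sketch_stat
map_sleep_error(int error, enum sketch_stat fallback)
{
	assert(error != 0);	/* only called once a sleep has failed */
	if (error == EINTR || error == ERESTART)
		return (SK_RPC_INTR);
	return (fallback);
}
```

With this shape, the send path would call map_sleep_error(error,
SK_RPC_CANTSEND) and the receive path map_sleep_error(error,
SK_RPC_CANTRECV), so an interrupted sleep is reported the same way in
both places.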