From owner-freebsd-questions@FreeBSD.ORG Fri Mar 19 20:29:49 2010
Message-ID: <4BA3DEBC.2000608@comcast.net>
Date: Fri, 19 Mar 2010 16:29:48 -0400
From: Steve Polyack <korvus@comcast.net>
To: John Baldwin
Cc: freebsd-fs@freebsd.org, User Questions, bseklecki@noc.cfi.pgh.pa.us
In-Reply-To: <4BA392B1.4050107@comcast.net>
Subject: Re: FreeBSD NFS client goes into infinite retry loop

On 03/19/10 11:05, Steve Polyack wrote:
> On 03/19/10 09:23, Steve Polyack wrote:
>> On 03/19/10 08:31, John Baldwin wrote:
>>> On Friday 19 March 2010 7:34:23 am Steve Polyack wrote:
>>>> Hi, we use a FreeBSD 8-STABLE (from shortly after release) system as an
>>>> NFS server to provide user home directories, which get mounted across a
>>>> few machines (all 6.3-RELEASE).  For the past few weeks we have been
>>>> running into problems where one particular client goes into an infinite
>>>> loop, repeatedly trying to write the same data, which causes the NFS
>>>> server to return "reply ok 40 write ERROR: Input/output error PRE: POST:".
>>>> This retry loop can generate between 20 Mbps and 500 Mbps of constant
>>>> traffic on our network, depending on the size of the data associated
>>>> with the failed write.
>>>>
>>> Yes, your feeling is correct.  This sort of race is inherent to NFS if
>>> you do not use some sort of locking protocol to resolve it.  The
>>> infinite retries sound like a client-side issue.  Have you been able to
>>> try a newer OS version on a client to see if it still causes the same
>>> behavior?
>>>
>> I can't try a newer FreeBSD version on the client where we are seeing
>> the problems, but I can recreate the problem fairly easily.  Perhaps
>> I'll try it with an 8.0 client.  If I remember correctly, one of the
>> strange things is that it doesn't seem to hit "critical mass" until a
>> few hours after the operation first fails.  I may be wrong, but I'll
>> double-check that when I test against 8.0-RELEASE.
>>
>> I forgot to add this in the first post, but these are all TCP NFSv3
>> mounts.
>>
>> Thanks for the response.
>
> Ok, so I'm still able to trigger what appears to be the same retry loop
> with an 8.0-RELEASE NFSv3 client (going on 1.5 hours now):
>
> client$ cat nfs.sh
> #!/usr/local/bin/bash
> for a in {1..15} ; do
>     sleep 1;
>     echo "$a$a$";
> done
>
> client$ ./nfs.sh > ~/output
>
> Then, on the server, while the above is running:
>
> server$ rm ~/output
>
> What happens is that you will see 3-4 of the same write attempts per
> minute via tcpdump.  Our previous logs show that this is how it starts;
> then, roughly 4 hours later, it begins to spiral out of control, throwing
> out up to 3,000 of the same failed write requests per second.

To anyone who is interested: I did some poking around with DTrace, which
led me to the nfsiod client code.  In src/sys/nfsclient/nfs_nfsiod.c:

	} else {
		if (bp->b_iocmd == BIO_READ)
			(void) nfs_doio(bp->b_vp, bp, bp->b_rcred, NULL);
		else
			(void) nfs_doio(bp->b_vp, bp, bp->b_wcred, NULL);
	}

These two calls to nfs_doio() discard the return codes (which are errors
cascading up from various other NFS write-related functions).  I'm not
entirely familiar with the way nfsiod works, but if nfs_doio() or the
functions it calls are supposed to remove the current async NFS operation
from the queue that nfsiod services when they encounter an error, they are
not doing so.  They simply report the error back to the caller, which in
this case never even looks at the value.

I've tested this by pushing the return code into a new int, errno, and
adding:

	if (errno) {
		NFS_DPF(ASYNCIO,
		    ("nfssvc_iod: iod %d nfs_doio returned errno: %d\n",
		    myiod, errno));
	}

The result is that my repeatable problem case begins logging "nfssvc_iod:
iod 0 nfs_doio returned errno: 5" (errno 5 is EIO, i.e. NFSERR_IO) for each
repetition of the failed write.  The only things triggering this are my
failed writes, and I can also see the nfsiod0 process waking up on each
iteration.

Do we need some kind of "retry x times, then abort" logic within
nfssvc_iod(), or does this belong in the functions it calls, such as
nfs_doio()?  I think it's best to avoid these sorts of infinite loops,
which have the potential to take down the system or overload the network
because of dumb decisions made by unprivileged users.
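
For reference, here is roughly how the instrumentation above fits together
once the return value is actually captured instead of cast to void.  This
is only a sketch; the exact placement of the declaration inside
nfssvc_iod() is my approximation:

	} else {
		int errno;	/* capture the result instead of discarding it */

		if (bp->b_iocmd == BIO_READ)
			errno = nfs_doio(bp->b_vp, bp, bp->b_rcred, NULL);
		else
			errno = nfs_doio(bp->b_vp, bp, bp->b_wcred, NULL);
		if (errno) {
			NFS_DPF(ASYNCIO,
			    ("nfssvc_iod: iod %d nfs_doio returned errno: %d\n",
			    myiod, errno));
		}
	}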
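
As for the "retry x times, then abort" idea, the following is just a plain
userland illustration of the behavior I mean, not kernel code: the names
(MAX_ASYNC_RETRIES, fake_doio) are made up, and where a real retry counter
would live (per buffer, per mount, etc.) is an open question.

	/*
	 * Userland illustration only: retry a failing asynchronous write a
	 * bounded number of times, then give up and surface the error
	 * instead of requeueing the buffer forever.
	 */
	#include <errno.h>
	#include <stdio.h>

	#define MAX_ASYNC_RETRIES 5

	/*
	 * Stand-in for nfs_doio(): always fails with EIO, like a write to a
	 * file the server has already unlinked.
	 */
	static int
	fake_doio(void)
	{
		return (EIO);
	}

	int
	main(void)
	{
		int error = 0;
		int tries;

		for (tries = 1; tries <= MAX_ASYNC_RETRIES; tries++) {
			error = fake_doio();
			if (error == 0)
				break;
			printf("write failed (error %d), attempt %d of %d\n",
			    error, tries, MAX_ASYNC_RETRIES);
		}
		if (error != 0)
			printf("giving up: dropping the dirty buffer with "
			    "error %d instead of retrying forever\n", error);
		return (0);
	}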