From owner-freebsd-questions@FreeBSD.ORG  Fri Mar 19 15:05:23 2010
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id DE1E71065670
	for <freebsd-questions@freebsd.org>;
	Fri, 19 Mar 2010 15:05:22 +0000 (UTC)
	(envelope-from korvus@comcast.net)
Received: from mx04.pub.collaborativefusion.com
	(mx04.pub.collaborativefusion.com [206.210.72.84])
	by mx1.freebsd.org (Postfix) with ESMTP id 987008FC13
	for <freebsd-questions@freebsd.org>;
	Fri, 19 Mar 2010 15:05:22 +0000 (UTC)
Received: from [192.168.2.164] ([206.210.89.202])
	by mx04.pub.collaborativefusion.com (StrongMail Enterprise
	4.1.1.4(4.1.1.4-47689)); Fri, 19 Mar 2010 11:20:56 -0400
X-VirtualServerGroup: Default
X-MailingID: 00000::00000::00000::00000::::599
X-SMHeaderMap: mid="X-MailingID"
X-Destination-ID: freebsd-questions@freebsd.org
X-SMFBL: ZnJlZWJzZC1xdWVzdGlvbnNAZnJlZWJzZC5vcmc=
Message-ID: <4BA392B1.4050107@comcast.net>
Date: Fri, 19 Mar 2010 11:05:21 -0400
From: Steve Polyack <korvus@comcast.net>
User-Agent: Mozilla/5.0 (X11; U; FreeBSD amd64; en-US;
	rv:1.9.1.7) Gecko/20100311 Thunderbird/3.0.1
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
References: <4BA3613F.4070606@comcast.net> <201003190831.00950.jhb@freebsd.org>
	<4BA37AE9.4060806@comcast.net>
In-Reply-To: <4BA37AE9.4060806@comcast.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-fs@freebsd.org, User Questions <freebsd-questions@freebsd.org>,
	bseklecki@noc.cfi.pgh.pa.us
Subject: Re: FreeBSD NFS client goes into infinite retry loop
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
	<mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 19 Mar 2010 15:05:23 -0000

On 03/19/10 09:23, Steve Polyack wrote:
> On 03/19/10 08:31, John Baldwin wrote:
>> On Friday 19 March 2010 7:34:23 am Steve Polyack wrote:
>>> Hi, we use a FreeBSD 8-STABLE (from shortly after release) system as an
>>> NFS server to provide user home directories which get mounted across a
>>> few machines (all 6.3-RELEASE).  For the past few weeks we have been
>>> running into problems where one particular client will go into an
>>> infinite loop where it is repeatedly trying to write data which causes
>>> the NFS server to return "reply ok 40 write ERROR: Input/output error
>>> PRE: POST:".  This retry loop can cause between 20mbps and 500mbps of
>>> constant traffic on our network, depending on the size of the data
>>> associated with the failed write.
>>>
>>> We spent some time on the issue and determined that something on one of
>>> the clients is deleting a file as it is being written to by another NFS
>>> client.  We were able to enable the NFS lockmgr and use lockf(1) to fix
>>> most of these conditions, and the frequency of this problem has dropped
>>> from once a night to once a week.  However, it's still a problem and we
>>> can't necessarily force all of our users to "play nice" and use lockf/flock.
>>>
>>> Has anyone seen this before?  No errors are being logged on the NFS
>>> server itself, but the "Server Ret-Failed" counter begins to increase
>>> rapidly whenever a client gets stuck in this infinite retry loop:
>>> Server Ret-Failed
>>>           224768961
>>>
>>> I have a feeling that using NFS in such a manner may simply be prone to
>>> such problems, but what confuses me is why the NFS client system is
>>> infinitely retrying the write operation and causing itself so much grief.
>> Yes, your feeling is correct.  This sort of race is inherent to NFS if you
>> do not use some sort of locking protocol to resolve the race.  The infinite
>> retries sound like a client-side issue.  Have you been able to try a newer
>> OS version on a client to see if it still causes the same behavior?
>>
> I can't try a newer FBSD version on the client where we are seeing the 
> problems, but I can recreate the problem fairly easily.  Perhaps I'll 
> try it with an 8.0 client.  If I remember correctly, one of the 
> strange things is that it doesn't seem to hit "critical mass" until a 
> few hours after the operation first fails.  I may be wrong, but I'll 
> double check that when I check vs. 8.0-release.
>
> I forgot to add this in the first post, but these are all TCP NFS v3 
> mounts.
>
> Thanks for the response.

Ok, so I'm still able to trigger what appears to be the same retry loop 
with an 8.0-RELEASE nfsv3 client (going on 1.5 hours now):
client$ cat nfs.sh
#!/usr/local/bin/bash
for a in {1..15} ; do
   sleep 1;
   echo "$a$a$";
done
client$ ./nfs.sh >~/output

Then, on the server while the above is running:
server$ rm ~/output

After the file is removed, tcpdump shows the client retrying the same 
failed write 3-4 times per minute.  Our previous logs show that this is 
how it starts, and then ~4 hours later it begins to spiral out of control, 
throwing out up to 3,000 of the same failed write requests per second.
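
For comparison, here is a minimal sketch of the same sequence run entirely 
on a local filesystem, with no NFS involved.  The function name 
local_unlink_demo and the temporary paths are made up for illustration, and 
the loop is shortened to 2 iterations.  On a local filesystem the writer 
should finish cleanly, since the open descriptor keeps the unlinked file 
alive.  As far as I understand it, an NFSv3 server keeps no open-file state 
for a remote writer, which would explain why the write fails once the file 
is removed from another host.

```shell
#!/bin/sh
# Local analogue of the nfs.sh test: remove a file while a writer still has
# it open. local_unlink_demo is a hypothetical name; nothing here uses NFS.
local_unlink_demo() {
    dir=$(mktemp -d) || return 1
    out="$dir/output"

    # Writer: same shape as nfs.sh, shortened to 2 iterations.
    # The redirect opens the file before the loop starts.
    ( for a in 1 2; do sleep 1; echo "$a$a"; done ) > "$out" &
    writer=$!

    # Give the writer time to open the file, then remove it mid-run.
    sleep 1
    rm "$out"

    # Collect the writer's exit status; non-zero would mean a write failed.
    wait "$writer"
    echo "writer exit status: $?"

    if [ -e "$out" ]; then
        echo "output still exists"
    else
        echo "output removed"
    fi
    rmdir "$dir"
}
```

Calling local_unlink_demo from a shell on a local disk gives a baseline to 
compare against the NFS behaviour described above.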