From: Steve Polyack
Date: Fri, 19 Mar 2010 11:05:21 -0400
To: John Baldwin
Cc: freebsd-fs@freebsd.org, User Questions, bseklecki@noc.cfi.pgh.pa.us
Subject: Re: FreeBSD NFS client goes into infinite retry loop
Message-ID: <4BA392B1.4050107@comcast.net>
In-Reply-To: <4BA37AE9.4060806@comcast.net>

On 03/19/10 09:23, Steve Polyack wrote:
> On 03/19/10 08:31, John Baldwin wrote:
>> On Friday 19 March 2010 7:34:23 am Steve Polyack wrote:
>>> Hi, we use a FreeBSD 8-STABLE (from shortly after release) system as an
>>> NFS server to provide
>>> user home directories which get mounted across a
>>> few machines (all 6.3-RELEASE). For the past few weeks we have been
>>> running into problems where one particular client will go into an
>>> infinite loop where it is repeatedly trying to write data which causes
>>> the NFS server to return "reply ok 40 write ERROR: Input/output error
>>> PRE: POST:". This retry loop can cause between 20mbps and 500mbps of
>>> constant traffic on our network, depending on the size of the data
>>> associated with the failed write.
>>>
>>> We spent some time on the issue and determined that something on one of
>>> the clients is deleting a file as it is being written to by another NFS
>>> client. We were able to enable the NFS lockmgr and use lockf(1) to fix
>>> most of these conditions, and the frequency of this problem has dropped
>>> from once a night to once a week. However, it's still a problem and we
>>> can't necessarily force all of our users to "play nice" and use
>>> lockf/flock.
>>>
>>> Has anyone seen this before? No errors are being logged on the NFS
>>> server itself, but the "Server Ret-Failed" counter begins to increase
>>> rapidly whenever a client gets stuck in this infinite retry loop:
>>> Server Ret-Failed
>>> 224768961
>>>
>>> I have a feeling that using NFS in such a manner may simply be prone to
>>> such problems, but what confuses me is why the NFS client system is
>>> infinitely retrying the write operation and causing itself so much
>>> grief.
>> Yes, your feeling is correct. This sort of race is inherent to NFS if
>> you do not use some sort of locking protocol to resolve the race. The
>> infinite retries sound like a client-side issue. Have you been able to
>> try a newer OS version on a client to see if it still causes the same
>> behavior?
>>
> I can't try a newer FBSD version on the client where we are seeing the
> problems, but I can recreate the problem fairly easily. Perhaps I'll
> try it with an 8.0 client.
> If I remember correctly, one of the
> strange things is that it doesn't seem to hit "critical mass" until a
> few hours after the operation first fails. I may be wrong, but I'll
> double check that when I check vs. 8.0-release.
>
> I forgot to add this in the first post, but these are all TCP NFS v3
> mounts.
>
> Thanks for the response.

Ok, so I'm still able to trigger what appears to be the same retry loop
with an 8.0-RELEASE nfsv3 client (going on 1.5 hours now):

    client$ cat nfs.sh
    #!/usr/local/bin/bash
    for a in {1..15} ; do
        sleep 1;
        echo "$a$a$";
    done
    client$ ./nfs.sh >~/output

Then, on the server, while the above is running:

    server$ rm ~/output

After that, tcpdump shows the same write attempt being repeated 3-4 times
per minute. Our previous logs show that this is how it starts, and then
~4 hours later it begins to spiral out of control, throwing out up to
3,000 of the same failed write requests per second.
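
For anyone trying to reproduce this, a rough way to put numbers on the
retry rate is to count the failed write replies per minute in tcpdump's
text output. This is only a sketch: the function name count_failed_writes
is made up for illustration, and it assumes the first field of each line
is an HH:MM:SS.usec timestamp and that failed replies contain the string
"write ERROR", as in the reply quoted earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): count failed NFS write
# replies per minute from tcpdump text output read on stdin.
# Assumptions: field 1 is a timestamp like 11:05:21.123456, and failed
# replies contain the literal string "write ERROR".
count_failed_writes() {
    awk '/write ERROR/ {
             minute = substr($1, 1, 5)   # keep only HH:MM
             count[minute]++
         }
         END {
             for (m in count) print m, count[m]
         }' | sort
}
```

It could be fed from a saved capture, e.g.
`tcpdump -n -r nfs.pcap port 2049 | count_failed_writes`, where nfs.pcap
is a placeholder file name. The counts are only printed once the input
ends, so this suits a saved capture better than a live one.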