FreeBSD Mail Archives

Date:      Wed, 1 Sep 2010 12:05:47 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Steve Polyack <korvus@comcast.net>
Cc:        yanefbsd@gmail.com, freebsd-stable@freebsd.org
Subject:   Re: NFS 75 second stall
Message-ID:  <1767168849.374184.1283357147943.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <4C7E743A.1040506@comcast.net>



[-- Attachment #1 --]
> On 07/01/10 15:23, Garrett Cooper wrote:
> > On Thu, Jul 1, 2010 at 11:51 AM, alan bryan<alan.bryan@yahoo.com>
> > wrote:
> >>
> >> --- On Thu, 7/1/10, Garrett Cooper<yanefbsd@gmail.com> wrote:
> >>
> >>> From: Garrett Cooper<yanefbsd@gmail.com>
> >>> Subject: Re: NFS 75 second stall
> >>> To: "alan bryan"<alan.bryan@yahoo.com>
> >>> Cc: freebsd-stable@freebsd.org
> >>> Date: Thursday, July 1, 2010, 11:13 AM
> >>> On Thu, Jul 1, 2010 at 11:01 AM, alan
> >>> bryan<alan.bryan@yahoo.com>
> >>> wrote:
> >>>> Setup:
> >>>>
> >>>> server - FreeBSD 8-stable from today. 2 UFS dirs
> >>> exported via NFS.
> >>>> client - FreeBSD 8.0-Release. Running a test php
> >>> script that copies around various files to/from 2 separate
> >>> NFS mounts.
> >>>> Situation:
> >>>>
> >>>> script is started (forked to do 20 simultaneous runs)
> >>> and 20 1GB files are copied to the NFS dir which works
> >>> fine. When it then switches to reading those files back
> >>> and simultaneously writing to the other NFS mount I see a
> >>> hang of 75 seconds. If I do an "ls -l" on the NFS mount it
> >>> hangs too. After 75 seconds the client has reported:
> >>>> nfs server 192.168.10.133:/usr/local/export1: not
> >>> responding
> >>>> nfs server 192.168.10.133:/usr/local/export1: is alive
> >>> again
> >>>> nfs server 192.168.10.133:/usr/local/export1: not
> >>> responding
> >>>> nfs server 192.168.10.133:/usr/local/export1: is alive
> >>> again
> >>>> and then things start working again. The server was
> >>> originally FreeBSD 8.0-Release also but was upgraded to the
> >>> latest stable to see if this issue could be avoided.
> >>>> # nfsstat -s -W -w 1
> >>>>   GtAttr  Lookup  Rdlink    Read   Write  Rename  Access   Rddir
> >>>>        0       0       0     222     257       0       0       0
> >>>>        0       0       0     178     135       0       0       0
> >>>>        0       0       0      85     127       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>> ... for 75 rows of all zeros
> >>>>
> >>>>        0       0       0     272     266       0       0       0
> >>>>        0       0       0     167     165       0       0       0
> >>>> I also tried runs with 15 simultaneous processes and
> >>> 25. 15 processes gave only about a 5 second stall but 25
> >>> gave again the same 75 second stall.
> >>>> Further, I tested with 2 mounts to the same server but
> >>> from ZFS filesystems with the exact same stall/timeout
> >>> periods. So, it doesn't appear to matter what the
> >>> underlying filesystem is - it's something in NFS or
> >>> networking code.
> >>>> Any ideas on what's going on here? What's causing
> >>> the complete stall period of zero NFS activity? Any flaws
> >>> with my testing methods?
> >>>> Thanks for any and all help/ideas.
> >>> What network driver are you using? Have you tried
> >>> tcpdumping the packets?
> >>> -Garrett
> >>>
> >> I'm using igb currently but have also used em. I have not tried
> >> tcpdumping the packets yet on this test. Any suggestions on things
> >> to look out for (I'm not that familiar with that whole process).
> >>
> >> Which brings up another point - I'm using TCP connections for NFS,
> >> not UDP.
> >      Is the net.inet.tcp.tso sysctl enabled or not? What about
> >      rxcsum and txcsum?
> > Thanks,
> > -Garrett
> 
> We're occasionally seeing these same types of stalls (+ repeated "is
> not responding" "is alive again" messages in quick succession). We're
> seeing it only on our 8.1-RELEASE systems against a variety of NFS
> servers (6.3-RELEASE, 7.2-RELEASE, and 8-STABLE from before the
> release of 8.1). We also see it happen with a variety of client
> hardware and network adapters (em, bce, bge); the only common
> denominator is 8.1-RELEASE on the clients.
> 
You could try the attached patch. It won't fix anything, but it
should print the errno that is causing a TCP reconnect, which might
give us a hint as to what is going on.
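
For reading the console lines the patch produces ("recon err=<n>"), a
minimal userland sketch along these lines might help turn the number
into text. The function names parse_recon_err and describe_recon_line
are hypothetical, not part of the patch or the kernel, and errno
numbering is OS-specific, so this assumes it is compiled on the same
FreeBSD client that logged the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one console line of the form
 * "recon err=<n>", the format used by the debug printf in the patch.
 * Returns 1 and stores <n> in *err_out on a match, 0 otherwise.
 */
int
parse_recon_err(const char *line, int *err_out)
{
	int err;

	if (sscanf(line, "recon err=%d", &err) != 1)
		return (0);
	*err_out = err;
	return (1);
}

/*
 * Hypothetical helper: format "recon err=<n> (<text>)" into buf.
 * The text comes from this system's strerror() table, so it is only
 * meaningful on the OS that printed the number (assumption: FreeBSD).
 * Returns 1 if the line matched, 0 otherwise.
 */
int
describe_recon_line(const char *line, char *buf, size_t len)
{
	int err;

	assert(buf != NULL && len > 0);
	if (!parse_recon_err(line, &err))
		return (0);
	snprintf(buf, len, "recon err=%d (%s)", err, strerror(err));
	return (1);
}
```

A small main() that reads the console log line by line and prints the
buffer whenever describe_recon_line returns 1 would be enough to use it;
lines such as the "not responding" messages are simply skipped.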

rick


[-- Attachment #2 --]
--- rpc/clnt_rc.c.sav	2010-09-01 10:56:56.000000000 -0400
+++ rpc/clnt_rc.c	2010-09-01 11:00:49.000000000 -0400
@@ -264,6 +264,7 @@
 			mtx_unlock(&rc->rc_lock);
 			stat = clnt_reconnect_connect(cl);
 			if (stat == RPC_SYSTEMERROR) {
+printf("recon err=%d\n", rpc_createerr.cf_error.re_errno);
 				error = tsleep(&fake_wchan,
 				    rc->rc_intr ? PCATCH | PBDRY : 0, "rpccon",
 				    hz);


