Date:      Wed, 1 Sep 2010 12:05:47 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Steve Polyack <korvus@comcast.net>
Cc:        yanefbsd@gmail.com, freebsd-stable@freebsd.org
Subject:   Re: NFS 75 second stall
Message-ID:  <1767168849.374184.1283357147943.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <4C7E743A.1040506@comcast.net>

------=_Part_374183_1556044666.1283357147941
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

> On 07/01/10 15:23, Garrett Cooper wrote:
> > On Thu, Jul 1, 2010 at 11:51 AM, alan bryan<alan.bryan@yahoo.com>
> > wrote:
> >>
> >> --- On Thu, 7/1/10, Garrett Cooper<yanefbsd@gmail.com> wrote:
> >>
> >>> From: Garrett Cooper<yanefbsd@gmail.com>
> >>> Subject: Re: NFS 75 second stall
> >>> To: "alan bryan"<alan.bryan@yahoo.com>
> >>> Cc: freebsd-stable@freebsd.org
> >>> Date: Thursday, July 1, 2010, 11:13 AM
> >>> On Thu, Jul 1, 2010 at 11:01 AM, alan
> >>> bryan<alan.bryan@yahoo.com>
> >>> wrote:
> >>>> Setup:
> >>>>
> >>>> server - FreeBSD 8-stable from today. 2 UFS dirs
> >>> exported via NFS.
> >>>> client - FreeBSD 8.0-Release. Running a test php
> >>> script that copies around various files to/from 2 separate
> >>> NFS mounts.
> >>>> Situation:
> >>>>
> >>>> script is started (forked to do 20 simultaneous runs)
> >>> and 20 1GB files are copied to the NFS dir which works
> >>> fine. When it then switches to reading those files back
> >>> and simultaneously writing to the other NFS mount I see a
> >>> hang of 75 seconds. If I do an "ls -l" on the NFS mount it
> >>> hangs too. After 75 seconds the client has reported:
> >>>> nfs server 192.168.10.133:/usr/local/export1: not
> >>> responding
> >>>> nfs server 192.168.10.133:/usr/local/export1: is alive
> >>> again
> >>>> nfs server 192.168.10.133:/usr/local/export1: not
> >>> responding
> >>>> nfs server 192.168.10.133:/usr/local/export1: is alive
> >>> again
> >>>> and then things start working again. The server was
> >>> originally FreeBSD 8.0-Release also but was upgraded to the
> >>> latest stable to see if this issue could be avoided.
> >>>> # nfsstat -s -W -w 1
> >>>>   GtAttr  Lookup  Rdlink    Read   Write  Rename  Access   Rddir
> >>>>        0       0       0     222     257       0       0       0
> >>>>        0       0       0     178     135       0       0       0
> >>>>        0       0       0      85     127       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>>        0       0       0       0       0       0       0       0
> >>>> ... for 75 rows of all zeros
> >>>>
> >>>>        0       0       0     272     266       0       0       0
> >>>>        0       0       0     167     165       0       0       0
> >>>> I also tried runs with 15 simultaneous processes and
> >>> 25. 15 processes gave only about a 5 second stall but 25
> >>> gave again the same 75 second stall.
> >>>> Further, I tested with 2 mounts to the same server but
> >>> from ZFS filesystems with the exact same stall/timeout
> >>> periods. So, it doesn't appear to matter what the
> >>> underlying filesystem is - it's something in NFS or
> >>> networking code.
> >>>> Any ideas on what's going on here? What's causing
> >>> the complete stall period of zero NFS activity? Any flaws
> >>> with my testing methods?
> >>>> Thanks for any and all help/ideas.
> >>> What network driver are you using? Have you tried
> >>> tcpdumping the packets?
> >>> -Garrett
> >>>
> >> I'm using igb currently but have also used em. I have not tried
> >> tcpdumping the packets yet on this test. Any suggestions on things
> >> to look out for (I'm not that familiar with that whole process).
> >>
> >> Which brings up another point - I'm using TCP connections for NFS,
> >> not UDP.
> >      Is the net.inet.tcp.tso sysctl enabled or not? What about
> >      rxcsum and txcsum?
> > Thanks,
> > -Garrett
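
For anyone wanting to answer the TSO/checksum question quickly, here is a
minimal sketch. It assumes the usual FreeBSD ifconfig output, where enabled
capabilities appear on an "options=...<...>" line; the function name
offload_flags is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Report whether RXCSUM, TXCSUM and TSO4 appear among the enabled
# capabilities, given `ifconfig <ifname>` output on stdin.
# offload_flags is a hypothetical helper name.
offload_flags() {
    # Pull the comma-separated list out of the options=...<A,B,C> line.
    opts=$(sed -n 's/.*options=[0-9a-fA-F]*<\(.*\)>.*/\1/p' | head -n 1)
    for flag in RXCSUM TXCSUM TSO4; do
        case ",$opts," in
            *",$flag,"*) echo "$flag on" ;;
            *)           echo "$flag off" ;;
        esac
    done
}
```

Possible use on the client, with em0 standing in for whatever interface
carries the NFS traffic:

    ifconfig em0 | offload_flags
    sysctl net.inet.tcp.tso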
> 
> We're occasionally seeing these same types of stalls (+ repeated "is
> not responding" "is alive again" messages in quick succession). We're
> seeing it only on our 8.1-RELEASE systems against a variety of NFS
> servers (6.3-RELEASE, 7.2-RELEASE, and 8-STABLE from before the
> release of 8.1). We also see it happen with a variety of client
> hardware and network adapters (em, bce, bge); the only common
> denominator is 8.1-RELEASE on the clients.
> 
You could try the attached patch. It won't fix anything, but it
should print out what the errno is that is causing a TCP reconnect
and might give us a hint w.r.t. what is going on.
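
With the patch applied, each reconnect attempt that fails with
RPC_SYSTEMERROR should log a line of the form "recon err=<errno>". A rough
sketch for tallying those lines follows. It assumes the kernel printf output
shows up in dmesg, and count_recon_errs is just a made-up helper name.

```shell
#!/bin/sh
# Count how often each errno value appears in "recon err=N" lines.
# Reads log text on stdin; count_recon_errs is a hypothetical helper name.
count_recon_errs() {
    # Keep only the numeric errno from matching lines, then tally per value.
    sed -n 's/.*recon err=\([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "errno %s: %s\n", $2, $1 }'
}
```

For example, "dmesg | count_recon_errs" prints one line per distinct errno
with its count. The numbers can be looked up in /usr/include/sys/errno.h on
the client.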

rick


------=_Part_374183_1556044666.1283357147941
Content-Type: text/x-patch; name=clnt_rc.patch
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=clnt_rc.patch

--- rpc/clnt_rc.c.sav	2010-09-01 10:56:56.000000000 -0400
+++ rpc/clnt_rc.c	2010-09-01 11:00:49.000000000 -0400
@@ -264,6 +264,7 @@
 			mtx_unlock(&rc->rc_lock);
 			stat = clnt_reconnect_connect(cl);
 			if (stat == RPC_SYSTEMERROR) {
+printf("recon err=%d\n", rpc_createerr.cf_error.re_errno);
 				error = tsleep(&fake_wchan,
 				    rc->rc_intr ? PCATCH | PBDRY : 0, "rpccon",
 				    hz);
------=_Part_374183_1556044666.1283357147941--


