Date: Wed, 1 Sep 2010 12:05:47 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Steve Polyack <korvus@comcast.net>
Cc: yanefbsd@gmail.com, freebsd-stable@freebsd.org
Subject: Re: NFS 75 second stall
Message-ID: <1767168849.374184.1283357147943.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <4C7E743A.1040506@comcast.net>
------=_Part_374183_1556044666.1283357147941
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

> On 07/01/10 15:23, Garrett Cooper wrote:
> > On Thu, Jul 1, 2010 at 11:51 AM, alan bryan<alan.bryan@yahoo.com>
> > wrote:
> >>
> >> --- On Thu, 7/1/10, Garrett Cooper<yanefbsd@gmail.com> wrote:
> >>
> >>> From: Garrett Cooper<yanefbsd@gmail.com>
> >>> Subject: Re: NFS 75 second stall
> >>> To: "alan bryan"<alan.bryan@yahoo.com>
> >>> Cc: freebsd-stable@freebsd.org
> >>> Date: Thursday, July 1, 2010, 11:13 AM
> >>> On Thu, Jul 1, 2010 at 11:01 AM, alan bryan<alan.bryan@yahoo.com>
> >>> wrote:
> >>>> Setup:
> >>>>
> >>>> server - FreeBSD 8-stable from today. 2 UFS dirs exported via NFS.
> >>>> client - FreeBSD 8.0-Release. Running a test php script that
> >>>> copies around various files to/from 2 separate NFS mounts.
> >>>> Situation:
> >>>>
> >>>> script is started (forked to do 20 simultaneous runs) and 20 1GB
> >>>> files are copied to the NFS dir which works fine. When it then
> >>>> switches to reading those files back and simultaneously writing
> >>>> to the other NFS mount I see a hang of 75 seconds. If I do an
> >>>> "ls -l" on the NFS mount it hangs too. After 75 seconds the
> >>>> client has reported:
> >>>> nfs server 192.168.10.133:/usr/local/export1: not responding
> >>>> nfs server 192.168.10.133:/usr/local/export1: is alive again
> >>>> nfs server 192.168.10.133:/usr/local/export1: not responding
> >>>> nfs server 192.168.10.133:/usr/local/export1: is alive again
> >>>> and then things start working again. The server was originally
> >>>> FreeBSD 8.0-Release also but was upgraded to the latest stable
> >>>> to see if this issue could be avoided.
> >>>> # nfsstat -s -W -w 1
> >>>>  GtAttr  Lookup  Rdlink    Read   Write  Rename  Access   Rddir
> >>>>       0       0       0     222     257       0       0       0
> >>>>       0       0       0     178     135       0       0       0
> >>>>       0       0       0      85     127       0       0       0
> >>>>       0       0       0       0       0       0       0       0
> >>>>       0       0       0       0       0       0       0       0
> >>>>       0       0       0       0       0       0       0       0
> >>>>       0       0       0       0       0       0       0       0
> >>>>       0       0       0       0       0       0       0       0
> >>>> ... for 75 rows of all zeros
> >>>>
> >>>>       0       0       0     272     266       0       0       0
> >>>>       0       0       0     167     165       0       0       0
> >>>> I also tried runs with 15 simultaneous processes and 25. 15
> >>>> processes gave only about a 5 second stall but 25 gave again the
> >>>> same 75 second stall.
> >>>> Further, I tested with 2 mounts to the same server but from ZFS
> >>>> filesystems with the exact same stall/timeout periods. So, it
> >>>> doesn't appear to matter what the underlying filesystem is -
> >>>> it's something in NFS or networking code.
> >>>> Any ideas on what's going on here? What's causing the complete
> >>>> stall period of zero NFS activity? Any flaws with my testing
> >>>> methods?
> >>>> Thanks for any and all help/ideas.
> >>> What network driver are you using? Have you tried
> >>> tcpdumping the packets?
> >>> -Garrett
> >>>
> >> I'm using igb currently but have also used em. I have not tried
> >> tcpdumping the packets yet on this test. Any suggestions on things
> >> to look out for (I'm not that familiar with that whole process).
> >>
> >> Which brings up another point - I'm using TCP connections for NFS,
> >> not UDP.
> > Is the net.inet.tcp.tso sysctl enabled or not? What about
> > rxcsum and txcsum?
> > Thanks,
> > -Garrett
>
> We're occasionally seeing these same types of stalls (+ repeated
> "is not responding" "is alive again" messages in quick succession).
> We're seeing it only on our 8.1-RELEASE systems against a variety of
> NFS servers (6.3-RELEASE, 7.2-RELEASE, and 8-STABLE from before the
> release of 8.1).
> We also see it happen with a variety of client hardware and
> network adapters (em, bce, bge); the only common denominator is
> 8.1-RELEASE on the clients.
>
You could try the attached patch. It won't fix anything, but it
should print out what the errno is that is causing a TCP reconnect
and might give us a hint w.r.t. what is going on.

rick

------=_Part_374183_1556044666.1283357147941
Content-Type: text/x-patch; name=clnt_rc.patch
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=clnt_rc.patch

LS0tIHJwYy9jbG50X3JjLmMuc2F2CTIwMTAtMDktMDEgMTA6NTY6NTYuMDAwMDAwMDAwIC0wNDAw
CisrKyBycGMvY2xudF9yYy5jCTIwMTAtMDktMDEgMTE6MDA6NDkuMDAwMDAwMDAwIC0wNDAwCkBA
IC0yNjQsNiArMjY0LDcgQEAKIAkJCW10eF91bmxvY2soJnJjLT5yY19sb2NrKTsKIAkJCXN0YXQg
PSBjbG50X3JlY29ubmVjdF9jb25uZWN0KGNsKTsKIAkJCWlmIChzdGF0ID09IFJQQ19TWVNURU1F
UlJPUikgeworcHJpbnRmKCJyZWNvbiBlcnI9JWRcbiIsIHJwY19jcmVhdGVlcnIuY2ZfZXJyb3Iu
cmVfZXJybm8pOwogCQkJCWVycm9yID0gdHNsZWVwKCZmYWtlX3djaGFuLAogCQkJCSAgICByYy0+
cmNfaW50ciA/IFBDQVRDSCB8IFBCRFJZIDogMCwgInJwY2NvbiIsCiAJCQkJICAgIGh6KTsK
------=_Part_374183_1556044666.1283357147941--
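
The attachment above is stored base64-encoded, so the patch cannot be read directly from the archive page. The following is a minimal sketch of one way to recover the unified diff with Python's standard library; the names ATTACHMENT_B64 and decode_attachment are chosen here for illustration and are not part of the original message. As far as the decoded text shows, the patch is a one-line change to rpc/clnt_rc.c that adds a printf of rpc_createerr.cf_error.re_errno in the RPC_SYSTEMERROR branch after clnt_reconnect_connect(), just before the tsleep().

```python
import base64

# Base64 body of the clnt_rc.patch attachment, copied verbatim from the
# message. The variable name is an illustrative choice, not from the thread.
ATTACHMENT_B64 = """
LS0tIHJwYy9jbG50X3JjLmMuc2F2CTIwMTAtMDktMDEgMTA6NTY6NTYuMDAwMDAwMDAwIC0wNDAw
CisrKyBycGMvY2xudF9yYy5jCTIwMTAtMDktMDEgMTE6MDA6NDkuMDAwMDAwMDAwIC0wNDAwCkBA
IC0yNjQsNiArMjY0LDcgQEAKIAkJCW10eF91bmxvY2soJnJjLT5yY19sb2NrKTsKIAkJCXN0YXQg
PSBjbG50X3JlY29ubmVjdF9jb25uZWN0KGNsKTsKIAkJCWlmIChzdGF0ID09IFJQQ19TWVNURU1F
UlJPUikgeworcHJpbnRmKCJyZWNvbiBlcnI9JWRcbiIsIHJwY19jcmVhdGVlcnIuY2ZfZXJyb3Iu
cmVfZXJybm8pOwogCQkJCWVycm9yID0gdHNsZWVwKCZmYWtlX3djaGFuLAogCQkJCSAgICByYy0+
cmNfaW50ciA/IFBDQVRDSCB8IFBCRFJZIDogMCwgInJwY2NvbiIsCiAJCQkJICAgIGh6KTsK
"""


def decode_attachment(b64_text: str) -> str:
    """Decode a base64 MIME body to text, ignoring line breaks and padding whitespace."""
    compact = "".join(b64_text.split())  # drop the newlines between MIME lines
    return base64.b64decode(compact).decode("ascii")


if __name__ == "__main__":
    # Prints the unified diff; redirect to a file to apply it with patch(1).
    print(decode_attachment(ATTACHMENT_B64), end="")
```

Saving the output to a file and applying it with patch(1) in the directory that contains rpc/ should match what the attachment intends, but check the paths against your own source tree first.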