From: Rick Macklem
To: Marc Fournier
Cc: Kostik Belousov, freebsd-stable@freebsd.org, John Baldwin
Date: Tue, 12 Feb 2013 20:50:39 -0500 (EST)
Subject: Re: 9-STABLE -> NFS -> NetAPP:
Message-ID: <339364797.2960794.1360720239431.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <61DAA500-EB20-4861-AA7F-402FF1047B81@hub.org>

Marc Fournier wrote:
> Just reset server, so any further details will have to be 'next time'
> … but, just did a csup and am rebuilding … the following three files
> were modified since last build:
>
> grep nfs /tmp/output
> Edit src/sys/fs/nfs/nfs_commonsubs.c
> Edit src/sys/fs/nfsclient/nfs_clrpcops.c
> Edit src/sys/fs/nfsserver/nfs_nfsdserv.c
>
>
> On 2013-02-10, at 4:56 PM, Marc Fournier wrote:
>
> >
> > On 2013-02-10, at 4:31 PM, Rick Macklem
> > wrote:
> >
> >> Marc Fournier wrote:
> >>> Hi John …
> >>>
> >>> Does this help?
> >>>
> >>> root@io:~ # ps auxl | grep du
> >>> root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
> >>> root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
> >>> root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
> >>> root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd
> >> It is probably too late, but all the lines (without the | grep du)
> >> would be more useful. I also include the "H" flag, so it lists
> >> threads as well as processes. The above just says the "du" command
> >> is waiting for a vnode lock.
> >> The interesting process/thread is the one that is holding a vnode
> >> lock while waiting for something else.
> >
> > As requested, 'ps auxlH' attached …
> >
> >
> >
> >
Well, I took a look at the ps output and I didn't see anything that
would identify what the hang is. There are a lot of processes sleeping
on "newnfs" (waiting for a vnode lock) and many sleeping on "vofflock"
(waiting for the f_offset lock). Unfortunately, I can't spot any
process/thread that is blocked on something else, where it would seem
likely to be holding either an nfs vnode lock or f_offset lock that
isn't one of these.

There were changes about 5 months ago which it appears fixed a deadlock
race between vnode locks and offset locks for paging (r236321 and
friends).
I am wondering if there could be other similar races, possibly specific
to paging in over NFS? (I can't see any case where there is a LOR, so I
can't think of what it might be?)

If you just want the hangs to go away, I'd suggest moving the executable
in /usr/local/sbin (httpd maybe) to a local file system on the server,
since it does seem to be related to paging this executable in over NFS.

rick
ps: I've added kib@ to the cc, in case he is aware of other related
    races?

> >>
> >> Are you still getting the:
> >> nfs_getpages: error 13
> >> vm_fault: pager read error, pid 11355 (https)
> >
> > Fairly quiet:
> >
> >
> >
> > And that is it since last reboot ~20 days ago …
> >
> >>
> >> messages logged?
> >>
> >> With John's recent patch, the error# would no longer be 13 if it
> >> was caused by the "intr" flag resulting in a Read RPC terminating
> >> with EINTR.
> >> If you are still getting the above with "error 13", it suggests
> >> that the server is replying EACCES for the Read RPC.
> >> I suggested before that you check to make sure that the executable
> >> had read access for everyone on the file server. Since I didn't
> >> hear back, I'll assume this is the case.
> >
> > Don't understand this question … I have 34 VPSs running off of this
> > server right now … that 'du process' runs against each of those VPSs
> > every night, and this problem started happening on Friday night's
> > run … ~18 days into uptime … so the same process has run repeatedly,
> > with no issues, 18 times before it hung on Friday … also, the hang,
> > once 'triggered', only seems to recur against the same directory …
> > the same directory doesn't necessarily trigger it, but once it
> > starts, it appears to do it for the same directory … I'm not sure if
> > I've ever seen it happening to two different directories at the same
> > time …
> >
> > Also, please note that the du command is run from the physical
> > server, as root …
> >
> >> rick
> >> ps: If it is still up and hasn't been rebooted, you could:
> >> sysctl debug.kdb.break_to_debugger=1
> >> - then type at the console and do the following
> >> from the debugger
> >> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> >> How well this works depends on what options your kernel was built
> >> with.
> >
> > My remote console on that one doesn't work very well … I can view,
> > but I can't type …
> >
> >
>
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to
> "freebsd-stable-unsubscribe@freebsd.org"
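
For anyone repeating the "ps auxlH" inspection described above, here is a
rough sketch (not something used in this thread) of sorting the output by
wait channel, so that threads sleeping on something other than "newnfs"
or "vofflock" stand out. It assumes one thread per line and that the wait
channel is the last column, as in the "ps auxl" lines quoted earlier. The
helper names and the KNOWN_WAITERS set are made up for illustration, and
state "D" is only a first filter: the thread actually holding the lock
could be in another state.

```python
from collections import Counter

# Wait channels already explained in this thread (assumption: extend as needed).
KNOWN_WAITERS = {"newnfs", "vofflock"}

# Sample lines copied from the "ps auxl | grep du" output quoted in the thread.
SAMPLE = [
    "root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs",
    "root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs",
    "root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs",
    "root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd",
]


def parse_ps_line(line):
    """Return (pid, stat, wchan, raw_line), or None for header/blank lines."""
    fields = line.split()
    # Data lines have a numeric PID in the second column; the header does not.
    if len(fields) < 12 or not fields[1].isdigit():
        return None
    # Assumption: STAT is the 8th column and the wait channel is the last one.
    return int(fields[1]), fields[7], fields[-1], line.rstrip()


def summarize(lines):
    """Count threads per wait channel and collect 'D' sleepers on other channels."""
    counts = Counter()
    others = []
    for line in lines:
        parsed = parse_ps_line(line)
        if parsed is None:
            continue
        _pid, stat, wchan, raw = parsed
        counts[wchan] += 1
        if stat.startswith("D") and wchan not in KNOWN_WAITERS:
            others.append(raw)
    return counts, others


if __name__ == "__main__":
    counts, others = summarize(SAMPLE)
    for wchan, n in counts.most_common():
        print(f"{n:5d}  {wchan}")
    print("candidates blocked on something else:")
    for raw in others:
        print("  " + raw)
```

To run it against a saved listing, pass the open file to summarize()
instead of SAMPLE, e.g. summarize(open("ps.txt")), where "ps.txt" is a
hypothetical file holding the unwrapped "ps auxlH" output.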