Date: Wed, 13 Feb 2013 17:50:13 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Marc Fournier <scrappy@hub.org>, Kostik Belousov <kib@freebsd.org>, freebsd-stable@freebsd.org, John Baldwin <jhb@freebsd.org>
Subject: Re: 9-STABLE -> NFS -> NetAPP:
Message-ID: <431606432.2998831.1360795813954.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20130213203042.GW2522@kib.kiev.ua>

Konstantin Belousov wrote:
> On Tue, Feb 12, 2013 at 08:50:39PM -0500, Rick Macklem wrote:
> > Marc Fournier wrote:
> > > Just reset server, so any further details will have to be 'next
> > > time' ... but, just did a csup and am rebuilding ... the following
> > > three files were modified since last build:
> > >
> > > grep nfs /tmp/output
> > > Edit src/sys/fs/nfs/nfs_commonsubs.c
> > > Edit src/sys/fs/nfsclient/nfs_clrpcops.c
> > > Edit src/sys/fs/nfsserver/nfs_nfsdserv.c
> > >
> > >
> > > On 2013-02-10, at 4:56 PM, Marc Fournier <scrappy@hub.org> wrote:
> > >
> > > >
> > > > On 2013-02-10, at 4:31 PM, Rick Macklem <rmacklem@uoguelph.ca>
> > > > wrote:
> > > >
> > > >> Marc Fournier wrote:
> > > >>> Hi John ...
> > > >>>
> > > >>> Does this help?
> > > >>>
> > > >>> root@io:~ # ps auxl | grep du
> > > >>> root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
> > > >>> root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
> > > >>> root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
> > > >>> root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd
> > > >> It is probably too late, but all the lines (without the | grep du)
> > > >> would be more useful. I also include the "H" flag, so it lists
> > > >> threads as well as processes. The above just says the "du" command
> > > >> is waiting for a vnode lock.
> > > >> The interesting process/thread is the one that is holding a vnode
> > > >> lock while waiting for something else.
> > > >
> > > > As requested, 'ps auxlH' attached ...
> > > >
> > > >
> > > > <ps.out.bz2>
> > > >
> > Well, I took a look at the ps output and I didn't see anything that
> > would identify what the hang is. There are a lot of processes
> > sleeping on "newnfs" (waiting for a vnode lock) and many sleeping on
> > "vofflock" (waiting for the f_offset lock).
> I never got any attachments on the thread.
>
I got it resent from him. I've attached it to this post, just in case
you are interested in taking a look at it.

> See
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> for the description of what is needed to start debugging.
I already pointed this out (thanks to your previous email thread), but
apparently he can't run a console, so I don't know if there is another
way to do the same things?

> >
> > Unfortunately, I can't spot any process/thread that is blocked on
> > something else, where it would seem likely to be holding either an
> > nfs vnode lock or f_offset lock that isn't one of these.
> >
> > There were changes about 5 months ago which it appears fixed a
> > deadlock race between vnode locks and offset locks for paging
> > (r236321 and friends).
> No, I do not think that the description of the changes is right.
>
He does get the odd error reported by nfs_getpages() and I don't think
we've isolated why yet. The error is 13 (EACCES), but jhb@ thought it
might be because of the bug he fixed where the krpc reported EACCES for
the EINTR case. I don't think we've heard back from Marc w.r.t. whether
he has gotten any more of these errors logged since applying jhb@'s
patch and whether or not the errno has changed to EINTR?

I'll admit I don't understand when the VOP_GETPAGES() path gets called
vs the vn_io_fault() one. I plan on taking a closer look at the
VOP_GETPAGES() call path to see if I can spot any locking issue.

> >
> > I am wondering if there could be other similar races, possibly
> > specific to paging in over NFS? (I can't see any case where there is
> > a LOR, so I can't think of what it might be?)
> >
> > If you just want the hangs to go away, I'd suggest moving the
> > executable in /usr/local/sbin (httpd maybe) to a local file system
> > on the server, since it does seem to be related to paging this
> > executable in over NFS.
> >
> > rick
> > ps: I've added kib@ to the cc, in case he is aware of other related
> > races?
> >
> > > >>
> > > >> Are you still getting the:
> > > >> nfs_getpages: error 13
> > > >> vm_fault: pager read error, pid 11355 (https)
> > > >
> > > > Fairly quiet:
> > > >
> > > > <Screen Shot 2013-02-10 at 4.43.55 PM.png>
> > > >
> > > > And that is it since last reboot ~20 days ago ...
> > > >
> > > >>
> > > >> messages logged?
> > > >>
> > > >> With John's recent patch, the error# would no longer be 13 if it
> > > >> was caused by the "intr" flag resulting in a Read RPC terminating
> > > >> with EINTR.
> > > >> If you are still getting the above with "error 13", it suggests
> > > >> that the server is replying EACCES for the Read RPC.
> > > >> I suggested before that you check to make sure that the
> > > >> executable had read access for everyone on the file server.
> > > >> Since I didn't hear back, I'll assume this is the case.
> > > >
> > > > Don't understand this question ... I have 34 VPSs running off of
I was just asking if you have seen any of the nfs_getpages errors logged
since applying jhb@'s patch and whether or not the errno in it has
changed from 13 to something else?

> > > > this server right now ... that 'du process' runs against each of
> > > > those VPSs every night, and this problem started happening on
> > > > Friday night's run ... ~18 days into uptime ... so the same
> > > > process has run repeatedly, with no issues, 18 times before it
> > > > hung on Friday ... also, the hang, once 'triggered', only seems
> > > > to recur against the same directory ... the same directory
> > > > doesn't necessarily trigger it, but once it starts, it appears to
> > > > do it for the same directory ... I'm not sure if I've ever seen
> > > > it happening to two different directories at the same time ...
> > > >
> > > > Also, please note that the du command is run from the physical
> > > > server, as root ...
> > > >
> > > >> rick
> > > >> ps: If it is still up and hasn't been rebooted, you could:
> > > >> sysctl debug.kdb.break_to_debugger=1
> > > >> - then type <ctrl><alt><esc> at the console and do the following
> > > >> from the debugger
> > > >> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > >> How well this works depends on what options your kernel was
> > > >> built with.
> > > >
> > > > My remote console on that one doesn't work very well ... I can
> > > > view, but I can't type ...
> > > >
Unfortunately, I don't know how to do this unless you are in the kernel
DB.

rick

> > >
> > > _______________________________________________
> > > freebsd-stable@freebsd.org mailing list
> > > http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> > > To unsubscribe, send any mail to
> > > "freebsd-stable-unsubscribe@freebsd.org"