From owner-freebsd-stable@FreeBSD.ORG  Wed Feb 13 22:50:22 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 2C011D2B;
 Wed, 13 Feb 2013 22:50:22 +0000 (UTC)
 (envelope-from rmacklem@uoguelph.ca)
Received: from esa-jnhn.mail.uoguelph.ca (esa-jnhn.mail.uoguelph.ca
 [131.104.91.44]) by mx1.freebsd.org (Postfix) with ESMTP id 2DC6A3CA;
 Wed, 13 Feb 2013 22:50:20 +0000 (UTC)
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
 ([131.104.91.206])
 by esa-jnhn.mail.uoguelph.ca with ESMTP; 13 Feb 2013 17:50:13 -0500
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
 by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id EBCDDB4054;
 Wed, 13 Feb 2013 17:50:13 -0500 (EST)
Date: Wed, 13 Feb 2013 17:50:13 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Konstantin Belousov <kostikbel@gmail.com>
Message-ID: <431606432.2998831.1360795813954.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20130213203042.GW2522@kib.kiev.ua>
Subject: Re: 9-STABLE -> NFS -> NetAPP:
MIME-Version: 1.0
Content-Type: multipart/mixed; 
 boundary="----=_Part_2998830_1069103783.1360795813952"
X-Originating-IP: [172.17.91.203]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
X-Content-Filtered-By: Mailman/MimeDel 2.1.14
Cc: Marc Fournier <scrappy@hub.org>, Kostik Belousov <kib@freebsd.org>,
 freebsd-stable@freebsd.org, John Baldwin <jhb@freebsd.org>
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Feb 2013 22:50:22 -0000

------=_Part_2998830_1069103783.1360795813952
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Konstantin Belousov wrote:
> On Tue, Feb 12, 2013 at 08:50:39PM -0500, Rick Macklem wrote:
> > Marc Fournier wrote:
> > > Just reset server, so any further details will have to be 'next
> > > time'
> > > ... but, just did a csup and am rebuilding ... the following three
> > > files
> > > were modified since last build:
> > >
> > > grep nfs /tmp/output
> > > Edit src/sys/fs/nfs/nfs_commonsubs.c
> > > Edit src/sys/fs/nfsclient/nfs_clrpcops.c
> > > Edit src/sys/fs/nfsserver/nfs_nfsdserv.c
> > >
> > >
> > > On 2013-02-10, at 4:56 PM, Marc Fournier <scrappy@hub.org> wrote:
> > >
> > > >
> > > > On 2013-02-10, at 4:31 PM, Rick Macklem <rmacklem@uoguelph.ca>
> > > > wrote:
> > > >
> > > >> Marc Fournier wrote:
> > > > >>> Hi John ...
> > > >>>
> > > >>> Does this help?
> > > >>>
> > > >>> root@io:~ # ps auxl | grep du
> > > > >>> root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs
> > > > >>> root 12353 0.0 0.1 16176 5104 ?? D Sat03AM 0:05.41 du -skx /vm/2799 0 91597 0 20 0 newnfs
> > > > >>> root 64529 0.0 0.1 16176 5164 ?? D Fri03AM 0:05.40 du -skx /vm/2799 0 43227 0 20 0 newnfs
> > > > >>> root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd
> > > >> It is probably too late, but all the lines (without the | grep
> > > >> du)
> > > >> would be
> > > >> more useful. I also include the "H" flag, so it lists threads
> > > >> as
> > > >> well as
> > > >> processes. The above just says the "du" command is waiting for
> > > >> a
> > > >> vnode lock.
> > > >> The interesting process/thread is the one that is holding a
> > > >> vnode
> > > >> lock
> > > >> while waiting for something else.
> > > >
> > > > > As requested, 'ps auxlH' attached ...
> > > >
> > > >
> > > > <ps.out.bz2>
> > > >
> > Well, I took a look at the ps output and I didn't see anything that
> > would
> > identify what the hang is. There are a lot of processes sleeping on
> > "newnfs"
> > (waiting for a vnode lock) and many sleeping on "vofflock" (waiting
> > for the
> >  f_offset lock).
> I never got any attachments on the thread.
> 
I got it resent from him. I've attached it to this post, just in case you
are interested in taking a look at it.
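
For anyone who does look at the attachment, here is a rough Python sketch
for tallying the wait channels in "ps auxlH" output and listing the threads
that are not sleeping on "newnfs" or "vofflock". It assumes the wait channel
is the last whitespace-separated column, as in the "ps auxl" lines quoted
further down in this mail. The function names are made up for illustration.

```python
from collections import Counter


def _data_rows(ps_lines):
    """Yield (fields, line) for each non-blank, non-header line."""
    for line in ps_lines:
        fields = line.split()
        # Skip blank lines and the header row (assumed to start with USER).
        if not fields or fields[0] == "USER":
            continue
        yield fields, line


def tally_wait_channels(ps_lines):
    """Count threads per wait channel, taking the last column as MWCHAN."""
    counts = Counter()
    for fields, _ in _data_rows(ps_lines):
        counts[fields[-1]] += 1
    return counts


def unusual_threads(ps_lines, known=("newnfs", "vofflock")):
    """Return the lines whose wait channel is not one of the known lock waits."""
    return [line for fields, line in _data_rows(ps_lines)
            if fields[-1] not in known]


# Two of the lines quoted in this thread, joined back onto one line each.
sample = [
    "root 1054 0.0 0.1 16176 6600 ?? D 3:15AM 0:05.38 du -skx /vm/2799 0 81426 0 20 0 newnfs",
    "root 12855 0.0 0.0 16308 1988 0 S+ 5:26AM 0:00.00 grep du 0 12847 0 20 0 piperd",
]

if __name__ == "__main__":
    print(tally_wait_channels(sample))
    for line in unusual_threads(sample):
        print(line)
```

The second function only narrows down the list to read by hand. It cannot
tell which thread is actually holding a vnode lock.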

> See
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> for the description of what is needed to start debugging.

I already pointed this out (thanks to your previous email thread), but
apparently he can't run a console, so I don't know if there is another
way to do the same things?

> >
> > Unfortunately, I can't spot any process/thread that is blocked on
> > something
> > else, where it would seem likely to be holding either an nfs vnode
> > lock or
> > f_offset lock that isn't one of these.
> >
> > There were changes about 5 months ago which it appears fixed a
> > deadlock race
> > between vnode locks and offset locks for paging (r236321 and
> > friends).
> No, I do not think that the description of the changes is right.
> 
He does get the odd error reported by nfs_getpages() and I don't
think we've isolated why yet. The error is 13 (EACCES), but jhb@
thought it might be because of the bug he fixed where the krpc
reported EACCES for the EINTR case. I don't think we've heard
back from Marc w.r.t. whether he has gotten any more of these
errors logged since applying jhb@'s patch and whether or not
the errno has changed to EINTR?
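
To check the logs for this, a small Python sketch along these lines could
pull the error number out of each "nfs_getpages: error N" line and give its
symbolic name. It assumes the log lines look like the one quoted later in
this mail, and that the script runs on a system whose errno numbering
matches the kernel's (13 is EACCES and 4 is EINTR on FreeBSD).
decode_getpages_errors is a hypothetical helper name.

```python
import errno
import re

# Matches the kernel message quoted in this thread, e.g. "nfs_getpages: error 13".
PATTERN = re.compile(r"nfs_getpages: error (\d+)")


def decode_getpages_errors(log_lines):
    """Return (number, symbolic name) for each nfs_getpages error found."""
    found = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            num = int(match.group(1))
            # errno.errorcode maps numbers to names on the running platform.
            found.append((num, errno.errorcode.get(num, "UNKNOWN")))
    return found


if __name__ == "__main__":
    example = [
        "nfs_getpages: error 13",
        "vm_fault: pager read error, pid 11355 (https)",
    ]
    for num, name in decode_getpages_errors(example):
        print(num, name)
```

If the number reported after the patch maps to EINTR instead of EACCES, that
would fit the krpc explanation. If it is still 13, the server reply is the
more likely source.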

I'll admit I don't understand when the VOP_GETPAGES() path gets
called vs the vn_io_fault() one. I plan on taking a closer look
at the VOP_GETPAGES() call path and see if I can spot any locking
issue.

> >
> > I am wondering if there could be other similar races, possibly
> > specific to
> > paging in over NFS? (I can't see any case where there is a LOR, so I
> > can't
> > think of what it might be?)
> >
> > If you just want the hangs to go away, I'd suggest moving the
> > executable
> > in /usr/local/sbin (httpd maybe) to a local file system on the
> > server,
> > since it does seem to be related to paging this executable in over
> > NFS.
> >
> > rick
> > ps: I've added kib@ to the cc, in case he is aware of other related
> > races?
> >
> > > >>
> > > >> Are you still getting the:
> > > >> nfs_getpages: error 13
> > > >> vm_fault: pager read error, pid 11355 (https)
> > > >
> > > > Fairly quiet:
> > > >
> > > > <Screen Shot 2013-02-10 at 4.43.55 PM.png>
> > > >
> > > > > And that is it since last reboot ~20 days ago ...
> > > >
> > > >>
> > > >> messages logged?
> > > >>
> > > >> With John's recent patch, the error# would no longer be 13 if
> > > >> it
> > > >> was
> > > >> caused by the "intr" flag resulting in a Read RPC terminating
> > > >> with
> > > >> EINTR.
> > > >> If you are still getting the above with "error 13", it suggests
> > > >> that
> > > >> the server is replying EACCES for the Read RPC.
> > > >> I suggested before that you check to make sure that the
> > > >> executable
> > > >> had
> > > > >> read access for everyone on the file server. Since I didn't
> > > >> hear
> > > >> back,
> > > >> I'll assume this is the case.
> > > >
> > > > > Don't understand this question ... I have 34 VPSs running off of
I was just asking if you have seen any of the nfs_getpages errors logged
since applying jhb@'s patch and whether or not the errno in it has changed
from 13 to something else?

> > > > this
> > > > > server right now ... that 'du process' runs against each of
> > > > > those VPSs
> > > > > every night, and this problem started happening on Friday
> > > > > night's
> > > > > run ... ~18 days into uptime ... so the same process has run
> > > > > repeatedly,
> > > > > with no issues, 18 times before it hung on Friday ... also, the
> > > > > hang,
> > > > > once 'triggered', only seems to recur against the same directory
> > > > > ...
> > > > > the same directory doesn't necessarily trigger it, but once it
> > > > > starts, it appears to do it for the same directory ... I'm not
> > > > > sure if
> > > > > I've ever seen it happening to two different directories at the
> > > > > same
> > > > > time ...
> > > > >
> > > > > Also, please note that the du command is run from the physical
> > > > > server, as root ...
> > > >
> > > >> rick
> > > >> ps: If it is still up and hasn't been rebooted, you could:
> > > >>   sysctl debug.kdb.break_to_debugger=1
> > > >>   - then type <ctrl><alt><esc> at the console and do the
> > > >>   following
> > > >>     from the debugger
> > > >>   http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > > >>   How well this works depends on what options your kernel was
> > > >>   built
> > > >>   with.
> > > >
> > > > > My remote console on that one doesn't work very well ... I can
> > > > > view,
> > > > > but I can't type ...
> > > >
Unfortunately, I don't know how to do this unless you are in the kernel debugger.

rick

> > > >
> > >

------=_Part_2998830_1069103783.1360795813952--