From: Eric Crist
Date: Wed, 15 Sep 2010 10:44:11 -0500
To: Steve Polyack
Cc: freebsd-fs@freebsd.org, Thomas Johnson
Subject: Re: NFS nfs_getpages errors

On Sep 15, 2010, at 10:38:53, Steve Polyack wrote:

> On 09/15/10 11:28, Rick Macklem wrote:
>>> Hey folks,
>>>
>>> We've got 4 servers running FreeBSD 8.1-RELEASE which PXE boot with
>>> NFS root. On these machines, we run proftpd and apache 2.2.
>>> Over the past couple weeks, we've seen a ton of errors as follows:
>>>
>>> Sep 14 20:28:59 lion-3 proftpd[31761]: 0.0.0.0
>>> (folsom-1-red.claimlynx.com[216.17.68.130]) - ProFTPD terminating
>>> (signal 11)
>>> Sep 14 20:28:59 lion-3 kernel: nfs_getpages: error 1046353552
>>> Sep 14 20:28:59 lion-3 kernel: vm_fault: pager read error, pid 31761
>>> (proftpd)
>>> Sep 14 20:28:59 lion-3 kernel: Sep 14 20:28:59 lion-3 proftpd[31761]:
>>> 0.0.0.0 (folsom-1-red.claimlynx.com[216.17.68.130]) - ProFTPD
>>> terminating (signal 11)
>>> Sep 14 20:28:59 lion-3 kernel: nfs_getpages: error 1046353552
>>> Sep 14 20:28:59 lion-3 kernel: vm_fault: pager read error, pid 31761
>>> (proftpd)
>>> Sep 14 20:28:59 lion-3 kernel: pid 31761 (proftpd), uid 0: exited on
>>> signal 11
>>>
>>> These, in this case, occurred on three of the four machines until
>>> midnight, after which all three of the machines had proftpd exit on
>>> signal 11. The message above was for child processes. At midnight, the
>>> logfile rotated, and newsyslog sent signal 1 to the parent process,
>>> which I think finally finished it off. The fourth machine remained
>>> running and did not display these messages.
>>>
>>> The number following 'nfs_getpages: error' changes for each cycle and
>>> I'm not certain if any of them repeat.
>>>
>> Well, at a quick glance, those errors seem to be coming from the NFS
>> server in a read reply. Also, the error values seem bogus, since they
>> should be small positive numbers (1<->70 + a few just above 10000).
> We see these errors on some 8.1 clients as well:
> nfs_getpages: error 1110586608
> nfs_getpages: error 1108948624
> vm_fault: pager read error, pid 56216 (php)
> nfs_getpages: error 1114969744
> vm_fault: pager read error, pid 54770 (php)
> nfs_getpages: error 1137006224
> vm_fault: pager read error, pid 50578 (php)
>
> They do not show up often, so we haven't spent much time looking into
> it (no tcpdumps yet).
> Our NFS server is an 8-STABLE system backed by ZFS, so maybe it's
> related to that (again :) ).
>
> Eric, is your NFS server backed by ZFS as well?
>
> The NFS server doesn't seem to be logging any errors, but the
> ret-failed count is always increasing:
> Server Info:
>     Getattr    Setattr     Lookup   Readlink       Read      Write     Create     Remove
>   543523097   14397049 1949982185       6380   17587820   14002952    8980955    8070238
>      Rename       Link    Symlink      Mkdir      Rmdir    Readdir   RdirPlus     Access
>     6966495          9       1668    1117125     904969    5567689      22307  184929325
>       Mknod     Fsstat     Fsinfo   PathConf     Commit
>           0  338500745         57          0    7129262
> Server Ret-Failed
>    29089796
> Server Faults
>           0
> Server Cache Stats:
>      Inprog       Idem   Non-idem     Misses
>           0          0          0          0
> Server Write Gathering:
>    WriteOps   WriteRPC    Opsaved
>    14001235   14002952       1717
>
>> Could you possibly get a packet capture when one of these happens?
>> ("tcpdump -s -0 -w xxx host" would suffice, but you need to
>> have it running when the error occurs. If you can reproduce it by
>> talking to the proftpd server, so the tcpdump doesn't run for too
>> long, that would be best.)
>>
>> You can look in the tcpdump via wireshark and see what is being returned
>> for the Read RPCs at that time. (You can email me the "xxx" packet trace
>> as an attachment and I can look at it, if you get that far.)
>>
>> rick
>> ps: Otherwise, I'd go look at your NFS server and see if it's logging
>> errors or if there are indications of problems.

The NFS server is logging nothing at all related to NFS. It *is*
running 8.1-RC2, so there is potential for an update. If/when we notice
these errors again, we'll try to get a packet capture and forward it to
you.

Our NFS server is backed by ZFS, as well.

Eric
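
For anyone wanting to sift their own logs for the same symptom, here is a rough sketch (not anything posted in the thread) that pulls the `nfs_getpages: error N` values out of syslog lines and flags the ones outside the range Rick describes (1 through 70, plus a few just above 10000). The upper bound of 10100 for the NFS-specific codes is an assumption made only for illustration, and the names `is_plausible` and `bogus_errors` are hypothetical. Printing the value in hex can make it easier to judge whether it looks like an address or uninitialized memory rather than an errno.

```python
import re

# Range taken from the description in the thread: ordinary errno values
# run from 1 to about 70, and NFS-specific codes sit just above 10000.
ERRNO_MAX = 70
# Assumption: the exact window for the NFS-specific codes is a guess.
NFSERR_MIN, NFSERR_MAX = 10001, 10100

PATTERN = re.compile(r"nfs_getpages: error (-?\d+)")


def is_plausible(code: int) -> bool:
    """Return True if code falls in one of the expected error ranges."""
    return 1 <= code <= ERRNO_MAX or NFSERR_MIN <= code <= NFSERR_MAX


def bogus_errors(lines):
    """Yield (code, hex string) for each nfs_getpages error that is
    outside the expected ranges."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            code = int(match.group(1))
            if not is_plausible(code):
                yield code, hex(code)


if __name__ == "__main__":
    sample = [
        "Sep 14 20:28:59 lion-3 kernel: nfs_getpages: error 1046353552",
        "Sep 14 20:28:59 lion-3 kernel: vm_fault: pager read error, pid 31761",
        "nfs_getpages: error 1110586608",
    ]
    for code, as_hex in bogus_errors(sample):
        print(code, as_hex)
```

Feeding it `/var/log/messages` line by line (for example via `open(path)`) should work the same way as the sample list, since any iterable of strings is accepted.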