Message-ID: <4C90E88D.9050608@comcast.net>
Date: Wed, 15 Sep 2010 11:38:53 -0400
From: Steve Polyack
To: Rick Macklem
Cc: Eric Crist, freebsd-fs@freebsd.org, Thomas Johnson
In-Reply-To: <1260697257.960376.1284564539991.JavaMail.root@erie.cs.uoguelph.ca>
Subject: Re: NFS nfs_getpages errors

On 09/15/10 11:28, Rick Macklem wrote:
>> Hey folks,
>>
>> We've got 4 servers running FreeBSD 8.1-RELEASE which PXE boot with
>> NFS root. On these machines, we run proftpd and apache 2.2.
>> Over the past couple weeks, we've seen a ton of errors as follows:
>>
>> Sep 14 20:28:59 lion-3 proftpd[31761]: 0.0.0.0 (folsom-1-red.claimlynx.com[216.17.68.130]) - ProFTPD terminating (signal 11)
>> Sep 14 20:28:59 lion-3 kernel: nfs_getpages: error 1046353552
>> Sep 14 20:28:59 lion-3 kernel: vm_fault: pager read error, pid 31761 (proftpd)
>> Sep 14 20:28:59 lion-3 kernel: Sep 14 20:28:59 lion-3 proftpd[31761]: 0.0.0.0 (folsom-1-red.claimlynx.com[216.17.68.130]) - ProFTPD terminating (signal 11)
>> Sep 14 20:28:59 lion-3 kernel: nfs_getpages: error 1046353552
>> Sep 14 20:28:59 lion-3 kernel: vm_fault: pager read error, pid 31761 (proftpd)
>> Sep 14 20:28:59 lion-3 kernel: pid 31761 (proftpd), uid 0: exited on signal 11
>>
>> These, in this case, occurred on three of the four machines until
>> midnight, after which all three of the machines had proftpd exit on
>> signal 11. The message above was for child processes. At midnight, the
>> logfile rotated, and newsyslog sent signal 1 to the parent process,
>> which I think finally finished it off. The fourth machine remained
>> running and did not display these messages.
>>
>> The number following 'nfs_getpages: error' changes for each cycle and
>> I'm not certain if any of them repeat.
>>
> Well, at a quick glance, those errors seem to be coming from the NFS
> server in a read reply. Also, the error values seem bogus, since they
> should be small positive numbers (1<->70 + a few just above 10000).

We see these errors on some 8.1 clients as well:

nfs_getpages: error 1110586608
nfs_getpages: error 1108948624
vm_fault: pager read error, pid 56216 (php)
nfs_getpages: error 1114969744
vm_fault: pager read error, pid 54770 (php)
nfs_getpages: error 1137006224
vm_fault: pager read error, pid 50578 (php)

They do not show up often, so we haven't spent much time looking into it
(no tcpdumps yet). Our NFS server is an 8-STABLE system backed by ZFS,
so maybe it's related to that (again :) ).
Eric, is your NFS server backed by ZFS as well?

The NFS server doesn't seem to be logging any errors, but the ret-failed
count is always increasing:

Server Info:
    Getattr    Setattr     Lookup   Readlink       Read      Write     Create     Remove
  543523097   14397049 1949982185       6380   17587820   14002952    8980955    8070238
     Rename       Link    Symlink      Mkdir      Rmdir    Readdir   RdirPlus     Access
    6966495          9       1668    1117125     904969    5567689      22307  184929325
      Mknod     Fsstat     Fsinfo   PathConf     Commit
          0  338500745         57          0    7129262
Server Ret-Failed
   29089796
Server Faults
          0
Server Cache Stats:
     Inprog       Idem   Non-idem     Misses
          0          0          0          0
Server Write Gathering:
   WriteOps   WriteRPC    Opsaved
   14001235   14002952       1717

> Could you possibly get a packet capture when one of these happens?
> ("tcpdump -s -0 -w xxx host" would suffice, but you need to
> have it running when the error occurs. If you can reproduce it by
> talking to the proftpd server, so the tcpdump doesn't run for too
> long, that would be best.)
>
> You can look in the tcpdump via wireshark and see what is being returned
> for the Read RPCs at that time. (You can email me the "xxx" packet trace
> as an attachment and I can look at it, if you get that far.)
>
> rick
> ps: Otherwise, I'd go look at your NFS server and see if it's logging
> errors or if there are indications of problems.
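
For anyone who wants to check their own logs against the sanity rule quoted
in this thread (error values should be small positive numbers, roughly 1 to
70 plus a few just above 10000), here is a minimal Python sketch. It only
scans log text for "nfs_getpages: error N" lines and reports values outside
those ranges. The upper bound of 10008 is an assumption based on the NFSv3
status codes in RFC 1813, and the names PLAUSIBLE_RANGES, is_plausible and
bogus_errors are hypothetical helpers, not part of any FreeBSD tool.

```python
import re

# Assumed plausible ranges, following the description in this thread:
# errno-style values 1..70, plus NFSv3 status codes 10001..10008 (RFC 1813).
PLAUSIBLE_RANGES = [(1, 70), (10001, 10008)]

# Matches kernel log lines such as "nfs_getpages: error 1110586608".
ERROR_LINE = re.compile(r"nfs_getpages: error (-?\d+)")


def is_plausible(value):
    """Return True if value falls inside one of the assumed valid ranges."""
    return any(low <= value <= high for low, high in PLAUSIBLE_RANGES)


def bogus_errors(log_text):
    """Return nfs_getpages error values that fall outside the assumed ranges."""
    values = [int(m.group(1)) for m in ERROR_LINE.finditer(log_text)]
    return [v for v in values if not is_plausible(v)]


if __name__ == "__main__":
    # Sample lines copied from the messages in this thread.
    sample = """
    nfs_getpages: error 1110586608
    nfs_getpages: error 1108948624
    vm_fault: pager read error, pid 56216 (php)
    nfs_getpages: error 1114969744
    """
    for value in bogus_errors(sample):
        # The hex form can make it easier to spot pointer-like or garbage values.
        print(f"{value} (0x{value:08x}) is outside the expected ranges")
```

Pointing the same function at the contents of a syslog file would show
whether any of the logged values ever land in the expected ranges, which
may help when deciding whether a packet capture is worth the effort.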