From: Rick Macklem
To: Adam Guimont
Cc: freebsd-fs@freebsd.org
Date: Tue, 14 Apr 2015 20:36:36 -0400 (EDT)
Subject: Re: NFSD high CPU usage
Message-ID: <238081719.19055888.1429058196527.JavaMail.root@uoguelph.ca>
In-Reply-To: <551F072C.1000505@tezzaron.com>

Adam Guimont wrote:
> Rick Macklem wrote:
> > I can think of two explanations for this.
> > 1 - The server nfsd threads get confused when the TCP recv Q fills
> >     and start looping around.
> > OR
> > 2 - The client is sending massive #s of RPCs (or crap that is
> >     incomplete RPCs).
> >
> > To get a better idea w.r.t. what is going on, I'd suggest that
> > you capture packets (for a relatively short period) when the
> > server is 100% CPU busy.
> > # tcpdump -s 0 -w out.pcap host
> > - run on the server should do it.
> > Then look at out.pcap in wireshark and see what the packets
> > look like. (wireshark understands NFS, whereas tcpdump doesn't)
> > If #1, I'd guess very little traffic (maybe TCP layer stuff),
> > if #2, I'd guess you'll see a lot of RPC requests or garbage
> > that isn't a valid request. (This latter case would suggest a
> > CentOS problem.)
> >
> > If you capture the packets but can't look at them in wireshark,
> > you could email me the packet capture as an attachment and I
> > can look at it after Apr. 10, when I get home.
> >
> > rick
> >
>
> Thanks Rick,
>
> I was able to capture this today while it was happening. The capture
> is for about 100 seconds. I took a look at it in wireshark and to me it
> appears like the #2 situation you were describing.
>
> If you would like to confirm that I've uploaded the pcap file here:
>
> https://www.dropbox.com/s/pdhwj5z5tz7iwou/out.pcap.20150403
>
Well, I took a look, but I'll admit I couldn't figure out much from it.
It appears that the TCP connection is in a pretty degraded state.
- FreeBSD is sending a whole bunch of TCP segments with 164 bytes of data
  (that appears to be the same for each one, but I didn't look at them
  closely). Each of them has a Window size == 0 (PUSH + ACK).
  --> Linux responds with an ACK and no data (which makes sense because
      of the 0 length Window)
  Eventually FreeBSD does open up the Window after something like 1200
  of the above TCP segments.
  --> It is possible that all these segments are RPC replies to similar
      requests, but Wireshark just thinks they're all RPC continuations
      and doesn't recognize an RPC message. (I couldn't be bothered to
      try and decode one manually.)

One thing I see is that the Linux window size is 24576. If TSO is enabled
in FreeBSD's net device, you might try disabling TSO, in case it is sending
too much or somehow getting confused.

Other than that, I think it would take a packet capture just when the
trouble starts to try and figure out how things get messed up.

I'm not good enough w.r.t. TCP to have any idea what might be happening.
Maybe someone conversant with TCP can look at the trace?

rick

> I will continue running some tests and trying to gather as much data
> as I can.
>
> Regards,
>
> Adam Guimont
>
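
The zero-window observation in the reply above could also be checked from
the capture without Wireshark. What follows is only a minimal sketch, not
anything used in this thread: it assumes out.pcap is a classic pcap file
(not pcapng) with Ethernet framing and IPv4, and the function name
count_zero_window is hypothetical. It counts every TCP segment that
advertises a window of 0, including any RST segments, so the number may
differ from what Wireshark flags as "zero window".

```python
import struct
import sys


def count_zero_window(pcap_bytes):
    """Return (tcp_segments, zero_window_segments) for a classic pcap blob.

    Assumptions (not from the thread): Ethernet link type, IPv4 only,
    no VLAN tags. Anything else is skipped or rejected.
    """
    magic = pcap_bytes[:4]
    # The magic number tells us the byte order of the file headers.
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")

    # The link type is the last 4 bytes of the 24-byte global header.
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled")

    total = zero = 0
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        # Record header: ts_sec, ts_usec, incl_len, orig_len (4 bytes each).
        incl_len = struct.unpack(
            endian + "I", pcap_bytes[offset + 8:offset + 12])[0]
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len

        # 14-byte Ethernet header; EtherType 0x0800 means IPv4.
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        ihl = (frame[14] & 0x0F) * 4   # IPv4 header length in bytes
        if frame[14 + 9] != 6:         # IP protocol 6 is TCP
            continue
        tcp = frame[14 + ihl:]
        if len(tcp) < 20:
            continue

        # The advertised window is a 16-bit field at TCP header offset 14.
        window = struct.unpack("!H", tcp[14:16])[0]
        total += 1
        if window == 0:
            zero += 1
    return total, zero


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        print(count_zero_window(f.read()))
```

Run as "python3 script.py out.pcap" (the script name is a placeholder); it
prints the total TCP segment count and how many of those advertised a zero
window. A large run of zero-window segments from the server side would be
consistent with the degraded connection described above.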