From owner-freebsd-fs@FreeBSD.ORG  Sat Oct 14 06:07:00 2006
Return-Path:
X-Original-To: fs@freebsd.org
Delivered-To: freebsd-fs@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3583A16A412;
	Sat, 14 Oct 2006 06:07:00 +0000 (UTC)
	(envelope-from bde@zeta.org.au)
Received: from mailout2.pacific.net.au (mailout2-3.pacific.net.au [61.8.2.226])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 7FA7243D55;
	Sat, 14 Oct 2006 06:06:57 +0000 (GMT)
	(envelope-from bde@zeta.org.au)
Received: from mailproxy2.pacific.net.au (mailproxy2.pacific.net.au [61.8.2.163])
	by mailout2.pacific.net.au (Postfix) with ESMTP id E67B110A1BC;
	Sat, 14 Oct 2006 16:06:55 +1000 (EST)
Received: from epsplex.bde.org (katana.zip.com.au [61.8.7.246])
	by mailproxy2.pacific.net.au (Postfix) with ESMTP id 103EA2740C;
	Sat, 14 Oct 2006 16:06:53 +1000 (EST)
Date: Sat, 14 Oct 2006 16:06:53 +1000 (EST)
From: Bruce Evans
X-X-Sender: bde@epsplex.bde.org
To: fs@freebsd.org
In-Reply-To: <20061006050913.Y5250@epsplex.bde.org>
Message-ID: <20061014143825.F1264@epsplex.bde.org>
References: <20061006050913.Y5250@epsplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: mohans@freebsd.org
Subject: Re: lost dotdot caching pessimizes nfs especially
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems
X-List-Received-Date: Sat, 14 Oct 2006 06:07:00 -0000

On Fri, 6 Oct 2006, Bruce Evans wrote:

> This change:
>
> % Index: vfs_cache.c
> % ===================================================================
> % RCS file: /home/ncvs/src/sys/kern/vfs_cache.c,v
> % retrieving revision 1.102
> % retrieving revision 1.103
> % diff -u -2 -r1.102 -r1.103
> % --- vfs_cache.c	13 Jun 2005 05:59:59 -0000	1.102
> % +++ vfs_cache.c	17 Jun 2005 01:05:13 -0000	1.103
> % ...
>
> is responsible for about half of the performance loss since RELENG_4
> for building kernels over nfs (/usr and sys trees on nfs).  The kernel
> build uses "../../" a lot, and the above change apparently results in
> lots of network activity for things that should be cached locally.
>
> Some times for building a RELENG_4 kernel under conditions invariant
> except for the host kernel (after "make clean; sleep 2; make depend;
> make; make clean; sleep 2; make depend" to warm up caches):
>
> kernel:
> RELENG_4              77.51 real  60.62 user  4.36 sys
> current.2004.07.01    ~78.5 (lost details)
> current.2005.01.01    ~79   (lost details)
> current.2005.06.17    82.42 real  62.50 user  4.71 sys
> current.2005.06.19    89.53 real  62.18 user  5.44 sys
> current.2005.06.17+   ~89.5 (lost details)
>     .17+ = .17 plus above change
> current.2005.06.17+*  86.08 real  62.43 user  5.13 sys
>     .17+* = .17+ with ../.. in Makefile avoided using a symlink
>     @ ->
> RELENG_6              91.14 real  62.04 user  5.71 sys
> current               similar to RELENG_6 (lost details)
>
> The total performance loss is about 18%.
>
> The total performance loss for a local sys tree (/usr still on nfs) is
> much smaller (about 4%):
>
> RELENG_4              65.19 real  60.50 user  3.95 sys
> current.2005.06.17    67.49 real  62.13 user  4.27 sys
> RELENG_6              67.83 real  61.84 user  4.71 sys
> current               similar to RELENG_6 (lost details)
>
> The nfs performance for building of things that should be entirely
> cached locally is very dependent on network latency.  Not caching
> things very well causes lots of unnecessary network traffic for
> Getattr and Lookup.  The packets are small, so throughput is
> unimportant and latency dominates.  For building over nfs without -j,
> the dead time (real - user - sys) is almost directly proportional to
> the latency.  My usual local network has fairly low latency (~100uS
> unloaded) and the ~14 seconds dead time in the above is for it.
Switching to a 1 Gbps network with lower quality NICs gives an unloaded
latency of ~160uS and a dead time of ~21 seconds.  Building with -j
helps even for UP, at the cost of extra CPU, by letting some processes
advance using cached stuff while others are waiting for the network.
Building with -j helps even more on FreeBSD cluster machines, more
because they have a much higher network latency than because they are
SMP.

I finished finding almost all the lost performance.  As indicated
above, it was almost all in nfs.  This change:

% Index: nfs_vnops.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
% retrieving revision 1.235
% retrieving revision 1.236
% diff -u -2 -r1.235 -r1.236
% --- nfs_vnops.c	6 Dec 2004 18:52:28 -0000	1.235
% +++ nfs_vnops.c	6 Dec 2004 19:18:00 -0000	1.236
% @@ -418,10 +418,11 @@
% 		if (error)
% 			return (error);
% -		np->n_mtime = vattr.va_mtime.tv_sec;
% +		np->n_mtime = vattr.va_mtime;
% 	} else {
% +		np->n_attrstamp = 0;
		^^^^^^^^^^^^^^^^^^^^
% 		error = VOP_GETATTR(vp, &vattr, ap->a_cred, ap->a_td);
% 		if (error)
% 			return (error);
% -		if (np->n_mtime != vattr.va_mtime.tv_sec) {
% +		if (NFS_TIMESPEC_COMPARE(&np->n_mtime, &vattr.va_mtime)) {
% 			if (vp->v_type == VDIR)
% 				np->n_direofoffset = 0;

and associated changes give silly behaviour that almost doubles the
number of Access RPCs.  One of the associated changes clears
n_attrstamp on close().  Then on open(), since lookup() is called
before the above is reached, nfs_access_otw() has always just been
called, and the above forces another call.

Counting RPCs gives a good metric for the pessimizations.
Removing the above clearing in RELENG_6 gives the following improvement:

Before:
       89.90 real        62.16 user         5.50 sys
 Lookup    Read   Write  Create  Access  Fsstat Setattr   Other   Total
  60010    2410    5353     442   43785    1742    5194       6  118942

After:
       86.46 real        62.22 user         5.21 sys
 Lookup    Read   Write  Create  Access  Fsstat Setattr   Other   Total
  59986    2410    5353     442   20935    1742    5194       6   96068

Note that the RPC delta-counts barely changed except for the Access
one.  About 20000 Access calls were avoided.  Just removing the
clearing is not correct, but it is close.

The pessimization in vfs_cache.c 1.103 is now easy to quantify: it
triples the number of Lookup RPCs.  Removing it in addition to the
above gives a much larger improvement:

       79.24 real        61.87 user         5.04 sys
 Lookup    Read   Write  Create  Access  Fsstat Setattr   Other   Total
  19548    2410    5353     442   20922    1742    5194       6   55617

Note that the RPC delta-counts barely changed except for the Lookup
one.  About 40000 Lookup calls were avoided.  Just removing the change
in vfs_cache.c 1.103 is not close to being correct.

The last major pessimization is another silly one.  The changes to
mark atimes on exec() and mmap() cause a silly null Setattr RPC for
every exec() (more for interpreters?) and every mmap().  This is easy
to fix (almost) correctly.  VOP_SETATTR() is assumed to do nothing for
requests that it doesn't understand, but nfs_setattr() does null RPCs
instead.  The following fix:

% diff -c2 ./nfsclient/nfs_vnops.c~ ./nfsclient/nfs_vnops.c
% *** ./nfsclient/nfs_vnops.c~	Sun Oct  8 23:08:57 2006
% --- ./nfsclient/nfs_vnops.c	Fri Oct 13 09:58:12 2006
% ***************
% *** 669,675 ****
%
%   	/*
% ! 	 * Setting of flags is not supported.
%   	 */
% ! 	if (vap->va_flags != VNOVAL)
%   		return (EOPNOTSUPP);
%
% --- 677,684 ----
%
%   	/*
% ! 	 * Setting of flags and marking of atimes are not supported.
%   	 */
% ! 	if (vap->va_flags != VNOVAL ||
% ! 	    ((bdefix & 4) && (vap->va_vaflags & VA_MARK_ATIME)))
%   		return (EOPNOTSUPP);
%

in addition to the removals gives the following improvement with
bdefix set to 7:

       78.14 real        62.03 user         4.79 sys
 Lookup    Read   Write  Create  Access  Fsstat   Other   Total
  19556    2410    5353     442   19581    1738      14   49094

Bruce