From: Rick Macklem
To: Bruce Evans
Cc: fs@freebsd.org
Date: Fri, 26 Feb 2016 22:51:43 -0500 (EST)
Subject: Re: silly write caching in nfs3
Message-ID: <1403082388.11082060.1456545103011.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <20160226164613.N2180@besplex.bde.org>
List-Id: Filesystems (freebsd-fs@freebsd.org)

Bruce Evans wrote:
> nfs3 is slower than in old versions of FreeBSD. I debugged one of the
> reasons today.
>
> Writes have apparently always done silly caching. Typical behaviour
> is for iozone writing a 512MB file where the file fits in the buffer
> cache/VMIO. The write is cached perfectly. But then when nfs_open()
> reopens the file, it calls vinvalbuf() to discard all of the cached
> data. Thus nfs write caching usually discards useful older data to
> make space for newer data that will never be used (unless the
> file is opened r/w and read using the same fd (and is not accessed
> for a setattr or advlock operation -- these call vinvalbuf() too, if
> NMODIFIED)). The discarding may be delayed for a long time. Then
> keeping the useless data causes even more older data to be discarded.
> Discarding it on close would at least prevent further loss. It would
> have to be committed on close before discarding it of course.
> Committing it on close does some good things even without discarding
> there, and in oldnfs it gives a bug that prevents discarding in open --
> see below.
>
> nfs_open() does the discarding for different reasons in the NMODIFIED
> and !NMODIFIED cases. In the NMODIFIED case, it discards unconditionally.
> This case can be avoided by fsync() before close or setting the sysctl
> to commit in close. iozone does the fsync(). This helps in oldnfs but
> not in newnfs. With it, iozone on newnfs now behaves like it did on
> oldnfs 10-20 years ago. Something (perhaps just the timestamp bugs
> discussed later) "fixed" the discarding on oldnfs 5-10 years ago.
>
> I think not committing in close is supposed to be an optimization, but
> it is actually a pessimization for my kernel build tests (with object
> files on nfs, which I normally avoid). Builds certainly have to reopen
> files after writing them, to link them and perhaps to install them.
> This causes the discarding. My kernel build tests also do a lot of
> utimes() calls which cause the discarding before commit-on-close can
> avoid the above cause for it by clearing NMODIFIED.
> Enabling commit-on-close gives a small optimisation with oldnfs by
> avoiding all of the discarding except for utimes(). It reduces read
> RPCs by about 25% without increasing write RPCs or real time. It
> decreases real time by a few percent.
>
Well, the new NFS client code was cloned from the old one (about FreeBSD7).
I did this so that the new client wouldn't exhibit different caching
behaviour than the old one (avoiding any POLA violations).
If you look in stable/10/sys/nfsclient/nfs_vnops.c and
stable/10/sys/fs/nfsclient/nfs_clvnops.c at the nfs_open() and nfs_close()
functions, the algorithm appears to be identical for NFSv3. (The new one
has a bunch of NFSv4 gunk, but if you scratch out that stuff and ignore
function name differences (nfs_flush() vs ncl_flush()), I think you'll find
them the same. I couldn't spot any differences at a glance.)
--> see r214513 in head/sys/fs/nfsclient/nfs_clvnops.c for example

> The other reason for discarding is because the timestamps changed -- you
> just wrote them, so the timestamps should have changed. Different bugs
> in comparing the timestamps gave different misbehaviours.
>
> In old versions of FreeBSD and/or nfs, the timestamps had seconds
> granularity, so many changes were missed. This explains mysterious
> behaviours by iozone 10-20 years ago: the write caching is seen to
> work perfectly for most small total sizes, since all the writes take
> less than 1 second so the timestamps usually don't change (but sometimes
> the writes lie across a seconds boundary so the timestamps do change).
>
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
> or commit on close. I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc().
> This seems to be wrong, since the file might have been written to by
> other clients and then the change would not be noticed until much later
> if ever (setting the timestamp prevents seeing it change when it is
> checked later, but you might be able to see another metadata change).
>
> newnfs has quite different code for nfs_writerpc(). Most of it was
> moved to another function in another file. I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.
>
I'm pretty sure it does capture the new attributes (including mtime) in
the reply. The function is called something like nfscl_loadattrcache().

In general, close-to-open consistency isn't needed for most mounts.
(The only case where it matters is when multiple clients are concurrently
updating files.)
- There are a couple of options that might help performance when doing
  software builds on an NFS mount:
  nocto (I remember you don't like the name)
    - Actually, I can't remember why the code would still do the cache
      invalidation in nfs_open() when this is set. I wonder if the code in
      nfs_open() should maybe avoid invalidating the buffer cache when this
      is set? (I need to think about this.)
  noncontigwr
    - This one allows the writes to happen for byte aligned chunks when
      they are non-contiguous without pushing the individual writes to the
      server. (Again, this shouldn't cause problems unless multiple clients
      are writing to the file concurrently.)
Both of these are worth trying for mounts where software builds are being
done.

> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order. This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order. I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?).
I actually thought this was mainly caused by the krpc that was introduced
in FreeBSD7 (for both old and new NFS), separating the RPC from NFS.
There are 2 layers in the krpc (sys/rpc/clnt_rc.c and sys/rpc/clnt_vc.c)
that each use acquisition of a mutex to allow an RPC message to be sent.
(Whichever thread happens to acquire the mutex first, sends first.)

I had a couple of patches that tried to keep the RPC messages more ordered.
(They would not have guaranteed exact ordering.) They seemed to help for
the limited testing I could do, but since I wasn't seeing a lot of
"out of order" reads/writes on my single core hardware, I couldn't verify
how well these patches worked. mav@ was working on this at the time, but
didn't get these patches tested either, from what I recall.
--> Unfortunately, I seem to have lost these patches or I would have
    attached them so you could try them. Ouch.
(I've cc'd mav@. Maybe he'll have them lying about. I think one was
related to the nfsiod and the other for either sys/rpc/clnt_rc.c or
sys/rpc/clnt_vc.c.)

The patches were all client side. Maybe I'll try and recreate them.

> I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization. Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.

Shared vnode locks are also a factor, at least for reads.
(Before shared vnode locks, the vnode lock essentially serialized all
reads.)

As you note, a single threaded benchmark test is quite different than a lot
of clients with a lot of threads doing I/O on a lot of files concurrently.

The bandwidth * delay product of your network interconnect is also a
factor. The larger this is, the more bits you need to be in transit to
"fill the data pipe". You can increase the # of bits in transit by either
using larger rsize/wsize or more read-ahead/write-behind.
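
To make the bandwidth * delay point concrete, here is a rough sketch of the
arithmetic. The link speed and round trip time are made-up example values,
not measurements, and the function names are just for illustration. It
assumes each RPC carries exactly one rsize/wsize worth of data and ignores
RPC/TCP header overhead.

```python
import math


def bdp_bytes(bandwidth_bits_per_sec: float, rtt_sec: float) -> float:
    """Bandwidth * delay product, converted from bits to bytes."""
    return bandwidth_bits_per_sec * rtt_sec / 8


def rpcs_in_flight(bandwidth_bits_per_sec: float, rtt_sec: float,
                   io_size_bytes: int) -> int:
    """Estimate how many read/write RPCs of io_size_bytes must be
    outstanding at once to keep the pipe full.

    Assumption: every RPC moves exactly io_size_bytes of payload and
    nothing else; real traffic has header overhead.
    """
    return max(1, math.ceil(bdp_bytes(bandwidth_bits_per_sec, rtt_sec)
                            / io_size_bytes))


if __name__ == "__main__":
    # Assumed example link: 1 Gbit/s with a 0.5 msec round trip time.
    bandwidth = 1e9
    rtt = 0.0005
    print("bandwidth * delay =", bdp_bytes(bandwidth, rtt), "bytes")
    for size_kb in (8, 32, 64):
        n = rpcs_in_flight(bandwidth, rtt, size_kb * 1024)
        print(f"{size_kb}K rsize/wsize -> {n} RPC(s) in flight")
```

With those assumed numbers, 8K transfers need about 8 RPCs outstanding to
fill the pipe and 32K transfers about 2, which is the trade-off between
larger rsize/wsize and more read-ahead/write-behind.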

It would be nice to figure out why your case is performing better on the
old nfs client (and/or server).

If you have a fairly recent FreeBSD10 system, you could try doing mounts
with new vs old client (and no other changes) and see what differences
occur. (That would isolate new vs old from recent "old" and "really old".)

Good luck with it, rick

> Other cases are less clear. For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active. I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K. The server fs block size is 16K, and
>   using a smaller block size usually helps by giving some delayed
>   writes which can be clustered better. (The non-nfs parts of the
>   server could be smarter and do this intentionally. The out-of-order
>   buffers look like random writes to the server.) 16K i/o sizes
>   otherwise work OK, but 32K i/o sizes are much slower for unknown
>   reasons.
>
> Bruce