From: Bruce Evans
Date: Fri, 26 Feb 2016 18:06:53 +1100 (EST)
To: fs@freebsd.org
Subject: silly write caching in nfs3
Message-ID: <20160226164613.N2180@besplex.bde.org>

nfs3 is slower than in old versions of FreeBSD. I debugged one of the reasons today.

Writes have apparently always done silly caching. Typical behaviour is for iozone writing a 512MB file where the file fits in the buffer cache/VMIO. The write is cached perfectly. But then when nfs_open() reopens the file, it calls vinvalbuf() to discard all of the cached data. Thus nfs write caching usually discards useful older data to make space for newer data that will never be used (unless the file is opened r/w and read using the same fd (and is not accessed for a setattr or advlock operation -- these call vinvalbuf() too, if NMODIFIED)). The discarding may be delayed for a long time. Then keeping the useless data causes even more older data to be discarded. Discarding it on close would at least prevent further loss. It would have to be committed on close before discarding it of course. Committing it on close does some good things even without discarding there, and in oldnfs it gives a bug that prevents discarding in open -- see below.

nfs_open() does the discarding for different reasons in the NMODIFIED and !NMODIFIED cases. In the NMODIFIED case, it discards unconditionally. This case can be avoided by fsync() before close or setting the sysctl to commit in close. iozone does the fsync(). This helps in oldnfs but not in newnfs. With it, iozone on newnfs now behaves like it did on oldnfs 10-20 years ago. Something (perhaps just the timestamp bugs discussed later) "fixed" the discarding on oldnfs 5-10 years ago.

I think not committing in close is supposed to be an optimization, but it is actually a pessimization for my kernel build tests (with object files on nfs, which I normally avoid). Builds certainly have to reopen files after writing them, to link them and perhaps to install them. This causes the discarding.
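
The open-time behaviour described above can be sketched as a toy model. This is not the kernel code: ToyNfsNode, its fields and blocks_left_after_reopen() are hypothetical names, the integer mtime is a stand-in for a real timestamp, and the model assumes that every write changes the server's mtime. The refresh_mtime switch stands for the client adopting the server's new timestamp after its own writes, which the message later suggests oldnfs does in nfs_writerpc().

```python
from dataclasses import dataclass, field


@dataclass
class ToyNfsNode:
    """Toy stand-in for a client-side NFS node (hypothetical, not the kernel struct)."""
    cache: dict = field(default_factory=dict)  # block number -> cached data
    modified: bool = False   # plays the role of the NMODIFIED flag
    cached_mtime: int = 0    # plays the role of n_mtime
    server_mtime: int = 0    # what a fresh attribute fetch would report

    def write(self, blkno, data):
        # Cache the block and mark the node as modified.
        self.cache[blkno] = data
        self.modified = True
        # Assumption of this model: every write changes the server's mtime.
        self.server_mtime += 1

    def commit(self, refresh_mtime):
        # Models fsync() before close, or commit-on-close: clears the flag.
        self.modified = False
        if refresh_mtime:
            # Models adopting the server's new timestamp after our own writes.
            self.cached_mtime = self.server_mtime

    def open(self):
        # Discard unconditionally if modified; otherwise discard when the
        # server's mtime differs from the one the client remembers.
        if self.modified or self.cached_mtime != self.server_mtime:
            self.cache.clear()  # stands in for vinvalbuf()
            self.modified = False
            self.cached_mtime = self.server_mtime


def blocks_left_after_reopen(commit_before_close, refresh_mtime):
    """Write 4 blocks, optionally commit, reopen, and count surviving blocks."""
    node = ToyNfsNode()
    for blkno in range(4):
        node.write(blkno, b"x")
    if commit_before_close:
        node.commit(refresh_mtime)
    node.open()
    return len(node.cache)


print(blocks_left_after_reopen(False, False))  # 0: discarded because modified
print(blocks_left_after_reopen(True, False))   # 0: discarded because mtime changed
print(blocks_left_after_reopen(True, True))    # 4: cache survives the reopen
```

In this model the cache survives a reopen only when the data was committed and the cached mtime was refreshed, which matches the oldnfs behaviour reported here. The other two cases end with an empty cache.
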

My kernel build tests also do a lot of utimes() calls which cause the discarding before commit-on-close can avoid the above cause for it by clearing NMODIFIED. Enabling commit-on-close gives a small optimisation with oldnfs by avoiding all of the discarding except for utimes(). It reduces read RPCs by about 25% without increasing write RPCs or real time. It decreases real time by a few percent.

The other reason for discarding is because the timestamps changed -- you just wrote them, so the timestamps should have changed. Different bugs in comparing the timestamps gave different misbehaviours.

In old versions of FreeBSD and/or nfs, the timestamps had seconds granularity, so many changes were missed. This explains mysterious behaviours by iozone 10-20 years ago: the write caching is seen to work perfectly for most small total sizes, since all the writes take less than 1 second so the timestamps usually don't change (but sometimes the writes lie across a seconds boundary so the timestamps do change).

oldnfs was fixed many years ago to use timestamps with nanoseconds resolution, but it doesn't suffer from the discarding in nfs_open() in the !NMODIFIED case which is reached by either fsync() before close or commit on close. I think this is because it updates n_mtime to the server's new timestamp in nfs_writerpc(). This seems to be wrong, since the file might have been written to by other clients and then the change would not be noticed until much later if ever (setting the timestamp prevents seeing it change when it is checked later, but you might be able to see another metadata change).

newnfs has quite different code for nfs_writerpc(). Most of it was moved to another function in another file. I understand this even less, but it doesn't seem to fetch the server's new timestamp or update n_mtime in the v3 case.

There are many other reasons why nfs is slower than in old versions. One is that writes are more often done out of order.
This tends to give a slowness factor of about 2 unless the server can fix up the order. I use an old server which can do the fixup for old clients but not for newer clients starting in about FreeBSD-9 (or 7?). I suspect that this is just because Giant locking in old clients gave accidental serialization. Multiple nfsiod's and/or nfsd's are clearly needed for performance if you have multiple NICs serving multiple mounts. Other cases are less clear. For the iozone benchmark, there is only 1 stream and multiple nfsiod's pessimize it into multiple streams that give buffers which arrive out of order on the server if the multiple nfsiod's are actually active. I use the following configuration to ameliorate this, but the slowness factor is still often about 2 for iozone:

- limit nfsd's to 4
- limit nfsiod's to 4
- limit nfs i/o sizes to 8K. The server fs block size is 16K, and using a smaller block size usually helps by giving some delayed writes which can be clustered better. (The non-nfs parts of the server could be smarter and do this intentionally. The out-of-order buffers look like random writes to the server.) 16K i/o sizes otherwise work OK, but 32K i/o sizes are much slower for unknown reasons.

Bruce