Date: Tue, 5 Jan 2016 16:19:30 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Tom Curry <thomasrcurry@gmail.com>
Cc: "Mikhail T." <mi+thun@aldan.algebra.com>, freebsd-fs@freebsd.org
Subject: Re: NFS reads vs. writes
Message-ID: <20160105143542.X1191@besplex.bde.org>
In-Reply-To: <CAGtEZUD28UZDYyHtHtzXgys+rpv_37u4fotwR+qZLc1+tK0dwA@mail.gmail.com>
References: <8291bb85-bd01-4c8c-80f7-2adcf9947366@email.android.com>
 <5688D3C1.90301@aldan.algebra.com>
 <495055121.147587416.1451871433217.JavaMail.zimbra@uoguelph.ca>
 <568A047B.1010000@aldan.algebra.com>
 <CAGtEZUD28UZDYyHtHtzXgys+rpv_37u4fotwR+qZLc1+tK0dwA@mail.gmail.com>
On Mon, 4 Jan 2016, Tom Curry wrote:

> On Mon, Jan 4, 2016 at 12:34 AM, Mikhail T. <mi+thun@aldan.algebra.com>
> wrote:
>
>> On 03.01.2016 20:37, Rick Macklem wrote:
>> ...
>> I just tried lowering ZFS' recordsize to 64k to match MAXBSIZE, but that
>> didn't help NFS-writing (unless sync is disabled, that is).
>>> If this SSD is dedicated to the ZIL and is one known to have good write
>>> performance, it should help, but in your case the SSD seems to be the
>>> bottleneck.
>> It is a chunk of an older SSD, that also houses the OS. But it is
>> usually idle, because executables and libraries are cached in the
>> abundant RAM. I've seen it do 90+Mb/s (sequential)...

Please be more careful with units (but don't use MiB's; I should killfile
that).  90 Mbits/s is still slow.

>> I just tried removing ZIL from the receiving pool -- to force direct
>> writes -- but it didn't help the case, where the writes go over NFS.
>
> I assume you mean you removed the SLOG from the pool, in which case you
> most definitely still have a ZIL, its now located on the pool itself.
> Assuming you still have sync=standard I would venture a guess that writes
> over NFS would now be measured in KB/s.
>
>> However, the local writes -- with reads from NFS -- went from the 56Mb/s
>> I was seeing earlier to 90Mb/s!..

56 to 90 is not a large difference.  I think you mentioned factor of 10
differences earlier.

>> There is got to be a better way to do this -- preferably, some
>> self-tuning smarts... Thanks again. Yours,
>>
> There is no getting around the performance impact of a synchronous
> operation, whether its NFS or a database log. If you don't believe me hop
> on your favorite Windows box, bring up the device manager and disable the
> write cache on its drive then run some benchmark supporting sync writes.
> One way to lessen the performance impact is to decrease the latency of
> writes, which is why SSD SLOGs help so much. Which brings me to my next
> point..

But nfs doesn't do sync writes.  As pointed out earlier in this thread, it
does cached writes that are not very different from what other file
systems do.  It writes up to wcommitsize bytes per file and then commits
them.  The default value for wcommitsize is undocumented, but according to
the source code it is sqrt(hibufspace) * 256.  This gives about 2.5MB on
i386 with 1GB RAM and 17MB on amd64 with 24GB RAM.  This is not very
large, unless it is actually per-file and there is a backlog of many files
with this much uncommitted data -- then it is too large.  In most file
systems, the corresponding limit is per-fs or per-system.  On freefall,
vfs.zfs.dirty_data_max is 2.5GB and vfs.hidirtybuffers is 26502.  2.5GB
seems too high to me.  It would take 25 seconds to drain if it is for a
single disk that can do 100MB/s.  26502 is also too high.  It is 1.6GB
with the maximum block size of 64K, and it can easily be for a single disk
that is much slower than 100MB/s.  I often see buffer cache delays of
several seconds for backlogs of just a few MB on a slow (DVD) disk.

When nfs commits the data, it has to do a sync write.  Since wcommitsize
is large, this shouldn't be very slow unless the file is small, so that it
never reaches anywhere near size wcommitsize.

(nfs apparently suffers from the same design errors as the buffer cache.
Everything is per-file or per-vnode, so there is no way to combine reads
or writes even if reads are ahead and writes are long delayed.  Caching in
drives makes this problem not as large as it was 20-30 years ago, but it
takes extra i/o's for the small i/o's, and some drives have too low an
i/o rate for their caching to help much.)
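[A back-of-the-envelope sketch of the sqrt(hibufspace) * 256 default quoted
above, reproducing the roughly 2.5MB and 17MB figures.  The function name
and the hibufspace values (about 100MB and 4.5GB of buffer space) are
illustrative assumptions chosen to hit those figures, not measurements from
the machines mentioned; the real value depends on how the buffer cache is
sized on each box.]

/* Hypothetical sketch, not the kernel's code: the default wcommitsize. */
#include <math.h>
#include <stdio.h>

static long long
default_wcommitsize(long long hibufspace)
{
	/* The formula quoted above: sqrt(hibufspace) * 256. */
	return ((long long)sqrt((double)hibufspace) * 256);
}

int
main(void)
{
	/* Assumed (not measured) buffer cache sizes. */
	long long hibuf_i386 = 100LL << 20;	/* ~100MB -> ~2.5MB */
	long long hibuf_amd64 = 4500LL << 20;	/* ~4.5GB -> ~17MB  */

	printf("i386-ish:  wcommitsize ~ %.1f MB\n",
	    default_wcommitsize(hibuf_i386) / 1048576.0);
	printf("amd64-ish: wcommitsize ~ %.1f MB\n",
	    default_wcommitsize(hibuf_amd64) / 1048576.0);
	return (0);
}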
The implementation might still be fairly stupid and wait for the sync
write to complete.  This is what seems to happen with ffs for the server
fs.  With most mistunings, I get about half of the server speed for nfs
(25MB/s).  The timing with wcommitsize = 25MB might be: accumulate 25MB
and send it to the server at line rate.  My network can only do about
70MB/sec so this takes 0.35 seconds.  Then wait for the server to do a
sync write.  My server can only do about 47MB/s so this takes 0.53
seconds.  Stall writes on the client waiting for the server to confirm
the commit.  Total time 0.88 seconds, or 28MB/s.  Observed throughput is
more like 25MB/s.  With everything async, I get 39MB/s today and 44MB/s
with slightly different configurations on other days.

Two interesting points turned up or were confirmed in my tests today:

- async on the server makes little difference for large files.  It was
  slightly slower if anything.  This is because the only i/o that I tested
  today was a case that I am usually not interested in -- large writes to
  a single file.  In this case, almost all of the writes are sync for the
  commit.  The possible reasons for async being slightly slower for
  committing are:
  - a larger backlog
  - bugs in vfs clustering -- some of its async conditions seem to be
    backwards.

- when the server is mounted fully sync, writing on the client is faster
  than on the server, even with the small application buffer size of 512
  on the client and a larger but not maximal buffer size on the server!
  This is because writes on the client are basically cached.  They are
  combined on the server up to a big wcommitsize and done with a big sync
  write, while on the server, if the application writes 512 bytes at a
  time it gets sync writes 512 bytes at a time (plus pre-reads of the fs
  block size at a time, but only 1 of these per multiple 512-writes).

It is easy to have a stupider implementation.  E.g., when nfs commits, on
the server don't give this any priority and get around to it 5-30 seconds
later.  Or give it some priority but put it behind the local backlog of
2.5GB or so.  Give this priority too, but it still takes a long time since
it is so large.  Don't tell the client about the progress you are making
(I think the nfs protocol doesn't have any partially-committed states).
Maybe zfs is too smart about caching and it interacts badly with nfs, and
ffs interacts better because it is not so smart.  (I don't even use ffs
with soft updates because they are too smart.)

It is not so easy to have a better implementation, though protocols like
zmodem and tcp have had one for 30-40 years.  Just stream small writes to
the server as fast as you can and let it nack them as fast as it prefers
to commit them to stable storage (never ack, but negative ack for a
problem).  Then if you want to commit a file, tell the server to give the
blocks for that file priority but don't wait for it to finish before
writing more.  Give some priority hints to minimize backlogs.

Changing wcommitsize between 8K and 200MB for testing with a 128MB file
made surprisingly little difference here.
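[A minimal model of the commit-pipeline arithmetic a few paragraphs up.  It
assumes exactly what the text states -- a 25MB commit, about 70MB/s on the
network, about 47MB/s for the server's sync write, and no overlap between
the transfer and the commit; everything else is just arithmetic, not
FreeBSD code.]

#include <stdio.h>

int
main(void)
{
	double commit_mb = 25.0;	/* wcommitsize used in the example, MB */
	double net_rate = 70.0;		/* client -> server line rate, MB/s */
	double disk_rate = 47.0;	/* server sync-write rate, MB/s */

	double t_net = commit_mb / net_rate;	/* ~0.35 s to transfer */
	double t_disk = commit_mb / disk_rate;	/* ~0.53 s to commit */
	double total = t_net + t_disk;		/* ~0.88 s; phases don't overlap */

	printf("transfer %.2fs + commit %.2fs = %.2fs -> %.1f MB/s effective\n",
	    t_net, t_disk, total, commit_mb / total);
	return (0);
}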
> SSDs are so fast for three main reasons: low latency, large dram buffers,
> and parallel workloads. Only one of these is of any benefit (latency) as a
> SLOG. Unfortunately that particular metric is not usually advertised in
> consumer SSDs where the benchmarks they use to tout 90,000 random write
> iops consist of massively concurrent, highly compressible, short lived
> bursts of data. Add that drive as a SLOG and the onboard dram may as well
> not even exist, and queue depths count for nothing. It will be lucky to
> pull 2,000 IOPS. Once you start adding in ZFS features like checksums and
> compression, or network latency in the case of NFS that 2,000 number
> starts to drop even more.

Latency seems to be unimportant for a big commit.  It is important for
lots of smaller commits if the client (kernel or application) needs to
wait for just one of them.

Bruce