From owner-freebsd-bugs@FreeBSD.ORG Mon Jun 10 01:10:01 2013
Date: Mon, 10 Jun 2013 01:10:00 GMT
Message-Id: <201306100110.r5A1A0FM076378@freefall.freebsd.org>
To: freebsd-bugs@FreeBSD.org
From: Bruce Evans
Subject: Re: kern/178997: Heavy disk I/O may hang system

The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Bruce Evans
To: Klaus Weber
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Mon, 10 Jun 2013 11:00:29 +1000 (EST)

On Mon, 10 Jun 2013, Klaus Weber wrote:

> On Tue, Jun 04, 2013 at 07:09:59AM +1000, Bruce Evans wrote:
>> On Fri, 31 May 2013, Klaus Weber wrote:

This thread is getting very long, so I will only summarize here a couple
of things that I found last week.  Maybe more later.

o Everything seems to be working as well as intended (not very well),
  except in bufdaemon and friends.  Perhaps it is already fixed there.
  I forgot to check which version of FreeBSD you are using.  You may be
  missing some important fixes.  There were some by kib@ a few months
  ago, and some by jeff@ after this thread started.  I don't run any
  version of FreeBSD new enough to have these, and the version that I
  run also doesn't seem to have any serious bugs in bufdaemon.  It just
  works mediocrely.

o Writing in blocks smaller than the fs block size, as bonnie normally
  does, gives much the same rewriting effect as bonnie does explicitly,
  because the system is forced to read each block before doing a
  partial write to it.  This at best doubles the amount of i/o and
  halves the throughput of the writes.

o There are some minor bugs in the read-before-write system code, and
  some differences in the "much the same" that may be important in some
  cases:

  - when bonnie does the read-before-write, the system normally uses
    cluster_read() on the read descriptor and thus uses the system's
    idea of sequentiality on the read descriptor.  For both ffs and
    msdosfs, this normally results in reading a small cluster (up to
    the end of the current read) followed by async read-ahead of a
    larger cluster or 2.  The separate clusters improve latency but
    reduce performance.  scottl@ recently committed a sysctl
    vfs.read_min for avoiding the earlier splitting.  Using it made
    some interesting but ultimately unimportant differences to the
    2-bonnie problem.
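    Something like the following untested userland sketch reads and
    optionally sets that knob from a test program via sysctlbyname(3);
    only the sysctl name comes from this thread, and setting it needs
    root.  sysctl(8) does the same from the command line, of course:

    #include <sys/types.h>
    #include <sys/sysctl.h>

    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
            int newval, oldval;
            size_t oldlen;

            /* Read the current value of the knob. */
            oldlen = sizeof(oldval);
            if (sysctlbyname("vfs.read_min", &oldval, &oldlen, NULL, 0) == -1)
                    err(1, "sysctlbyname(vfs.read_min)");
            printf("vfs.read_min: %d\n", oldval);

            /* Optionally set a new value given on the command line. */
            if (argc > 1) {
                    newval = atoi(argv[1]);
                    if (sysctlbyname("vfs.read_min", NULL, NULL, &newval,
                        sizeof(newval)) == -1)
                            err(1, "set vfs.read_min");
            }
            return (0);
    }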
  - when the system does the read-before-write, ffs normally uses
    cluster_read() on the write descriptor and thus uses the system's
    idea of sequentiality on the write descriptor.  ffs doesn't know
    the correct amount to read in this case, and it always asks for
    MAXBSIZE, which is both too small and too large.  This value is the
    amount that should be read synchronously.  It is too large since
    the normal amount is the application's block size, which is usually
    smaller, and it is too small since MAXBSIZE is only half of the max
    cluster size.  The correct tradeoff of latency vs throughput is
    even less clear than for a user read, and further off from being
    dynamic.  msdosfs doesn't even use cluster_read() for this (my
    bad).  It uses plain bread().  This gives very low performance when
    the block size is small.  So msdosfs worked much better in the
    2-bonnie benchmark than for rewrites generated by dd just writing
    with a small block size and conv=notrunc.  After fixing this,
    msdosfs worked slightly better than ffs in all cases.

  - whoever does the read-before-write, cluster reading tends to
    generate a bad i/o pattern.  I saw patterns like the following (on
    ~5.2, where the max cluster size is only 64K, after arranging to
    mostly use this size):

      file1: read 64K offset 0
      file1: read ahead 64K offset 64K
      file2: read 64K offset 0
      file2: read ahead 64K offset 64K
      file1: write 64K offset 0
      file1: read 64K offset 128K
      file1: read ahead 64K offset 192K
      file2: write 64K offset 0

    The 2 files make the disk seek a lot, and the read-and-read-ahead
    gives even more seeks to get back to the write position.  My drives
    are old and have only about 2MB of cache.  Seeks with patterns like
    the above are apparently just large enough to break the drives'
    caching.  OTOH, if I use your trick of mounting with -noclusterw,
    the seeks are reduced significantly and my throughput increases by
    almost a factor of 2, even though this gives writes of only 16K.
    Apparently the seeks are reduced just enough for the drives' caches
    to work well.  I think the same happens for you.  Your i/o system
    is better, but it only takes a couple of bonnies, and perhaps the
    read pointers getting even further ahead of the write pointers, to
    defeat the drives' caching.  Small timing differences probably
    allow the difference to build up.  Mounting with -noclusterw also
    gives some synchronization that will prevent this buildup.

  - when the system does the read-before-write, the sequential
    heuristic isn't necessarily clobbered, but it turns out that the
    clobbering gives the best possible behaviour, except for
    limitations and bugs in bufdaemon!...

> [... good stuff clipped]
> So it really seems that clustering does provide performance benefits,
> but the RAID controller seems to be able to make up for the lack of
> clustering (either because clustering is disabled, or because it does
> not work effectively due to interspersed reads and seeks on the same
> file descriptor).

Yes, the seek pattern caused by async-but-not-long-delayed writes
(whether done by cluster_write() a bit later or by bawrite() directly),
combined with reading far ahead (whether done explicitly or
implicitly), is very bad even for 1 file, but can often be compensated
for by caching in the drives.  With 2 files or random writes on 1 file
it is much worse, but apparently mounting with -noclusterw limits it
enough for the drives to compensate in the case of 2 bonnies.
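For reference, the rewrite load can be generated without bonnie by
something like this untested sketch (the 16K block size is arbitrary);
run one instance per file, concurrently on the same file system, to get
the 2-file pattern above:

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BSIZE   (16 * 1024)     /* block size; arbitrary */

    /*
     * Rewrite a file in place through one descriptor: read each block,
     * seek back over it, then write it again, like bonnie's rewrite pass.
     */
    static void
    rewrite(const char *path)
    {
            char buf[BSIZE];
            ssize_t n;
            int fd;

            if ((fd = open(path, O_RDWR)) == -1)
                    err(1, "%s", path);
            while ((n = read(fd, buf, sizeof(buf))) > 0) {
                    if (lseek(fd, -n, SEEK_CUR) == -1)
                            err(1, "lseek");
                    if (write(fd, buf, n) != n)
                            err(1, "write");
            }
            if (n == -1)
                    err(1, "read");
            close(fd);
    }

    int
    main(int argc, char **argv)
    {
            if (argc != 2)
                    errx(1, "usage: rewrite file");
            rewrite(argv[1]);
            return (0);
    }

Dropping the read()/lseek() pair and using a block size smaller than
the fs block size gives the implicit read-before-write case instead,
much like dd with a small bs and conv=notrunc.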
I think the best we can do in general is delay writes as long as
possible and then schedule them perfectly.  But scheduling them
perfectly is difficult and only happens accidentally.

>>> I am now looking at vfs_cluster.c to see whether I can find which
>>> part is responsible for letting numdirtybuffers rise without bounds
>>> and why only *re*writing a file causes problems, not the initial
>>> writing.  Any suggestions on where to start looking are very
>>> welcome.
>>
>> It is very complicated, but it was easy to find its comments saying
>> that it tries not to force out the writes for non-sequential
>> accesses.  I am currently trying the following workarounds:
>
> I have decided to start testing with only a single change from the
> list of changes you provided:
>
>> % diff -u2 vfs_cluster.c~ vfs_cluster.c
>> % @@ -726,8 +890,13 @@
>> %          * are operating sequentially, otherwise let the buf or
>> %          * update daemon handle it.
>> % +        *
>> % +        * Algorithm changeback: ignore seqcount here, at least for
>> % +        * now, to work around readers breaking it for writers.  It
>> % +        * is too late to start ignoring after write pressure builds
>> % +        * up, since not writing out here is the main source of the
>> % +        * buildup.
>> %          */
>> %         bdwrite(bp);
>> % -       if (seqcount > 1)
>> % -               cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> % +       cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
>> %         vp->v_clen = 0;
>> %         vp->v_cstart = lbn + 1;
>
> And sure enough, a kernel with only this one line change[*] is able
> to handle reading and re-writing a file through a single file
> descriptor just fine, good performance, no hangs, vfs.numdirtybuffers
> remains low:

After more testing, I found that this was almost perfectly backwards
for my hardware!  I think for your hardware it allows the drives to
compensate, much like with -noclusterw but with slightly improved
throughput due to the larger writes.  But with my drives, it mostly
just gives more seeks.

After changing this back and being more careful with the comparisons,
I found that the best results are obtained (in ~5.2) by letting
numdirtybuffers build up.  The breakage of the sequential heuristic
causes the above to never force out the cluster immediately for the
2-bonnie case.  I get similar behaviour by always using delayed writes
in ffs_write().  This might depend on setting B_CLUSTEROK in more
cases, so that the clustering always gets done later.

Typical throughputs for me:

- my drives can do 55MB/sec max, and get 48 for writing 1 file with
  large blocks using dd
- 48 drops to half of 20-24 (20-24MB/sec of total i/o, so 10-12 of
  useful writes) with read-before-write for 1 file.  That's a 4-fold
  reduction.  One half is for the doubled i/o and the other half is for
  the seeks.
- half of 20-24 drops to half of 10-12 with 2 files and
  read-before-write of each, in the best case.  That's an 8-fold
  reduction.  Another factor of 2 is apparently lost to more seeks.
- half of 10-12 drops to half of 5-6, as in the previous point but in
  the worst case.  That's a 16-fold reduction.  The worst case is with
  my modification above.  It maximizes the seeks.  My original idea for
  a fix (in the above diff) gave this case.  It gave almost perfect
  clustering and almost no buildup of numdirtybuffers, but throughput
  was still the worst.  (My drives can do 16K blocks at full speed
  provided the blocks are contiguous, so they don't benefit much from
  clustering except for its side effect of reducing seeks to other
  blocks in between accessing the contiguous ones.)

Some of this typical behaviour is not very dependent on block sizes.
The drives become seek-bound, and anything that doubles the number of
seeks halves the throughput.

Bruce