From owner-freebsd-bugs@FreeBSD.ORG  Mon Jun  3 21:20:03 2013
Date: Mon, 3 Jun 2013 21:20:02 GMT
Message-Id: <201306032120.r53LK2qJ040677@freefall.freebsd.org>
From: Bruce Evans
To: freebsd-bugs@FreeBSD.org
Subject: Re: kern/178997: Heavy disk I/O may hang system

The following reply was made to PR kern/178997; it has been noted by GNATS.

From: Bruce Evans
To: Klaus Weber
Cc: freebsd-gnats-submit@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: kern/178997: Heavy disk I/O may hang system
Date: Tue, 4 Jun 2013 07:09:59 +1000 (EST)

On Fri, 31 May 2013, Klaus Weber wrote:

> Sorry for the late reply, testing took longer than expected.

This is very interesting.  It will take me a long time to complete
testing and replying too.  Maybe more later.

> (I have combined your replies from separate mails into one, and
> reordered some of the text.)

Very readable now.

> ...
> I have now repeated the tests with several variations of block- and
> fragment sizes.  In all cases, I did two tests:
>
> 1) dd if=/dev/zero of=/mnt/t1/100GB-1.bin bs=100m count=1000
> 2) bonnie++ -s 64g -n 0 -f -D -d /mnt/t1
>    bonnie++ -s 64g -n 0 -f -D -d /mnt/t2
>
> The dd is simply to give a rough idea of the performance impact of
> the fs parameters; with the two bonnie++ processes, I was mainly
> interested in performance and hangs when both bonnie++ processes are
> in their "Rewriting" phase.  I have also tested variations where the
> block:fragment ratio does not follow the 8:1 recommendation.
>
> 64/8k, kernel unpatched:
> dd: 1218155712 bytes/sec
> bonnie++: around 300 MB/sec, then drops to 0 and system hangs

My hardware can barely do 40 MB/sec, but under FreeBSD-5.2 (modified),
there is at most a 50% drop with 2 bonnies and no tuning, and no
hangs.  Except it hung in my debugging code for understanding a
related problem.  I had this code enabled for a month or 2 and had
forgotten about it.  Only 2 bonnies caused enough load to reach it.

> 32/4k, kernel unpatched:
> dd: 844188881 bytes/sec
> bonnie++: jumps between 25 and 900 MB/sec, no hang
>
> 16/2k, kernel unpatched:
> dd: 517996142 bytes/sec
> bonnie++: mostly 20-50 MB/sec, with 3-10 second "bursts" of
> 400-650 MB/sec, no hang

The non-hangs might be due to BKVASIZE not being a problem.

> So while a file system created with the current (32/4k) or old
> (16/2k) defaults does prevent the hangs, it also reduces the
> sequential write performance to 70% and 43% of a 64/8k fs.
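(For reference, I assume the fs variants were created by varying
newfs's block and fragment sizes, e.g., something like

    newfs -b 65536 -f 8192 /dev/<disk>    # the 64/8k case

and -b 32768 -f 4096, -b 16384 -f 2048 for the others.)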
I think bufdaemon and other write-pressure handling code just don't
work as well as the upper layers for clustering and ordering the
writes.  They are currently broken, as shown by the hangs.  Clustering
shouldn't be generating so much pressure.  I can explain why it does
(see below).  But random i/o can easily generate the same pressure, so
bufdaemon etc. need to handle it.  Also, pressure should be the usual
case for async mounts, and bufdaemon etc. should be able to do
clustering and ordering _better_ than the upper layers, since they
have more blocks to work with.  cluster_write() intentionally leaves
as much as possible to bufdaemon in order to potentially optimize it.

> In all cases, the vfs.numdirtybuffers count remained fairly small as
> long as the bonnie++ processes were writing the testfiles.  It rose
> to vfs.hidirtybuffers (slower with only one process in "Rewriting",
> much faster when both processes are rewriting).

The slow buildup is caused by fragmentation.  If the files happen to
be laid out contiguously, then cluster_write() acts like bawrite(),
but better.  With your 64K-blocks and FreeBSD's 128K max cluster size,
it normally combines just 2 64K-blocks to create a 128K-cluster, and
does 1 write instead of 2.  After this write, both methods leave the
same number of dirty buffers (none for data, but a couple for
metadata).  In more detail, the write pattern with bonnie's 8K-blocks
is:

    write; bdwrite
    ...
    write; bdwrite                   (7 bdwrites to the same buffer)
    write; cluster_write or bawrite  (buffer is now full, and dirty
                                      iff clusterw)
    write; bdwrite
    ...
    write; bdwrite                   (7 bdwrites to the next buffer)
    write; cluster_write or bawrite  (2 buffers are now full and both
                                      are clean)

However, with fragmentation, cluster_write() "leaks" a dirty buffer
although it is full, while bawrite() just writes all full buffers.
The leakage is because cluster_write() wants physical contiguity.
This is too much to ask for.  The leak may be fast if the file system
is very fragmented.

I happened to test on a file system which was almost full and thus
very fragmented.  Two test files of size 256M each had 452 extents
(counting indirect blocks as an extent although they might be
contiguous).  The buildup with my test program (which tries to produce
bonnie's i/o pattern) was often to the same magic number (something
like 768, a little higher than the number of extents).  This much
leakage was enough to make clustering a pessimization, although it
worked to combine 4 blocks in most cases (my fs block size is 16K and
my max cluster size is 64K).  The disks are old, so 16K is as good an
i/o size as 64K for them.  But smaller sizes wouldn't be.

>>> 00-04-57.log:vfs.numdirtybuffers: 52098
>>> 00-05-00.log:vfs.numdirtybuffers: 52096
>>> [ etc. ]
>>
>> This is a rather large buildup and may indicate a problem.  Try
>> reducing the dirty buffer watermarks.
>
> I have tried to tune system parameters as per your advice, in an
> attempt to get a 64/8k fs running stable and with reasonable write
> performance.  (dd: results omitted for brevity, all normal for
> 64/8k fs)

I got mostly worse behaviour by tuning things related to dirtybuffers.

> [... You managed to fix the hangs, at a cost of too much performance. ]
> By testing with 3, 4 and 5 bonnie++ processes running simultaneously,
> I found that (vfs.dirtybufthresh) * (number of bonnie++ processes)
> must be slightly less than vfs.hidirtybuffers for reasonable
> performance without hangs.
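To make that arithmetic concrete (using the default watermarks you
report below, where vfs.hidirtybuffers is 52139): with 2 writers,
vfs.dirtybufthresh would have to stay a bit under 52139 / 2 = 26069;
with 3 writers, a bit under 17379.  E.g. (illustrative values only,
not recommendations):

    sysctl vfs.dirtybufthresh=25000    # 2 concurrent writers
    sysctl vfs.dirtybufthresh=17000    # 3 concurrent writers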
> vfs.numdirtybuffers rises to
>     (vfs.dirtybufthresh) * (number of bonnie++ processes)
> and as long as this is below vfs.hidirtybuffers, the system will not
> hang.

I wonder why it is linear in the number of processes.

I saw indications of similar behaviour (didn't test extensively).
1 bonnie created about 2/3 of vfs.hidirtybuffers and 2 bonnies
saturated at vfs.hidirtybuffers.  This is with vfs.hidirtybuffers much
smaller than yours.

> ...
> I was not sure whether to disable the "bawrite(bp);" in the else
> part as well.  Here is what I used for the next test (in ffs_write):
>
>                 } else if (vm_page_count_severe() ||
>                     buf_dirty_count_severe() ||
>                     (ioflag & IO_ASYNC)) {
>                         bp->b_flags |= B_CLUSTEROK;
>                         bawrite(bp);

Note that this bawrite() is used when hidirtybuffers is reached.  This
is OK for this buffer if this buffer is large, but it doesn't stop the
flood -- the system stays saturated at hidirtybuffers until bonnie
stops writing.  This has a bad effect on other parts of the system,
e.g., if they are writing small buffers.

> /* KWKWKW       } else if (xfersize + blkoffset == fs->fs_bsize) {
>                         if ((vp->v_mount->mnt_flag & MNT_NOCLUSTERW) == 0) {
>                                 bp->b_flags |= B_CLUSTEROK;
>                                 cluster_write(vp, bp, ip->i_size, seqcount);
>                         } else {
>                                 bawrite(bp);
>                         }
>    KWKWKW */

OK.  I tried a bdwrite() for the async mount case.  Async mounts
should give delayed writes always, but it is a bugfeature that they
give delayed writes for critical metadata but normally async writes
for data (via cluster_write()).  Using bdwrite() here let me control
the behaviour using a combination of mount flags.  It became clearer
that the bugfeature is not just a bug.  Without cluster_write()
forcing out writes for async mounts, the write pressure is much the
same as from bonnie's rewrite.

> During the dd tests, iostat shows a weird, sawtooth-like behavior:
>    64.00 26730 1670.61   0  0 12  3 85
> ...

Some bad interaction of bufdaemon with the writers.  I believe there
are locking problems, but didn't know anything about them until
recently.  FreeBSD-5.2 is simpler and seems to have fewer of them.  I
got a deadlock in it by putting a sleep waiting for the write pressure
to clear in the above; the deadlock arises since bufdaemon is
apparently prevented from doing anything while we hold a lock.

> ...
> After reverting the source change, I have decided to try mounting
> the file system with "-o noclusterr,noclusterw", and re-test.  This
> is equivalent to disabling only the if-part of the expression in the
> source snippet above.

Good idea.

> dd: 1206199580 bytes/sec
> bonnie++: 550-700 MB/sec, no hang
>
> During the tests, vfs.numdirtybuffers remains low, lo/hidirtybuffers
> and dirtybufthresh are at their defaults:
> vfs.dirtybufthresh: 46925
> vfs.hidirtybuffers: 52139
> vfs.lodirtybuffers: 26069
> vfs.numdirtybuffers: 15
>
> So it looks like you were spot-on by suspecting cluster_write().

Not for the reasons that I suspected it!

> Further tests confirmed that "-o noclusterw" is sufficient to
> prevent the hangs and provide good performance.  "-o noclusterr" on
> its own makes no difference; the system will hang.

I thought -noclusterw would be worse until now...

> I have also tested with BKVASIZE set to 65536.  As you explained,
> this reduced the number of buffers:
> vfs.dirtybufthresh: 11753
> vfs.hidirtybuffers: 13059
> vfs.lodirtybuffers: 6529
>
> dd results remain unchanged from a BKVASIZE of 16k.  bonnie++'s
> iostats with 2 processes in "Rewriting..."
> jump between 70 and 800 MB/sec and numdirtybuffers reaches max:
>
> vfs.numdirtybuffers: 13059
>
> Even though numdirtybuffers reaches hidirtybuffers, the system does
> not hang, but performance is not very good. ...

I hoped BKVASIZE would be the problem, but it is at most a secondary
problem.

> Here is the relevant part from bonnie++'s source (in C++):
> ...
> So you are correct: bonnie++ re-reads the file that was created
> previously in the "Writing intelligently..." phase in blocks,
> modifies one byte in the block, and writes the block back.
>
> Something in this specific workload is triggering the huge buildup
> of numdirtybuffers when write-clustering is enabled.

I can explain this.  Readers and writers share the offset for the
sequential heuristic (it is per-open-file), so cluster_write() cannot
tell that bonnie's writes are sequential.  The sequential_heuristic()
function sees the offset moving back and forth, which normally means
random access, so it tells cluster_write() that the access is random,
although bonnie seeks back after each read() so that the writes are
actually sequential.  cluster_write() rarely if ever pushes out random
writes, so the write pressure builds up to a saturation value almost
instantly.  Then all subsequent writes work even more poorly due to
the pressure.  Truly random writes would give a similar buildup of
pressure.

> I am now looking at vfs_cluster.c to see whether I can find which
> part is responsible for letting numdirtybuffers rise without bounds
> and why only *re* writing a file causes problems, not the initial
> writing.  Any suggestions on where to start looking are very welcome.

It is very complicated, but it was easy to find its comments saying
that it tries not to force out the writes for non-sequential accesses.
I am currently trying the following workarounds:

% diff -u2 vfs_cluster.c~ vfs_cluster.c
% --- vfs_cluster.c~      Tue Apr  6 20:11:55 2004
% +++ vfs_cluster.c       Tue Jun  4 04:53:52 2013
% @@ -593,4 +714,6 @@
%  }
%
% +int buf_dirty_count_large(void);
% +
%  /*
%   * Do clustered write for FFS.

This is like buf_dirty_count_severe() except it has a lower threshold.
It doesn't really help.

% @@ -630,4 +753,44 @@
%       vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;
%
% +     /*
% +      * XXX the 13 year old changes in the algorithm to use seqcount
% +      * are too aggressive.  Avoid them by faking sequential access
% +      * if we are under write pressure.  This causes us to write out
% +      * clusters in more cases (in particular, when clusters are
% +      * complete).

This doesn't really help.  It's too late to start faking after
pressure has built up.

% +      *
% +      * XXX we should probably use a more conservative condition for
% +      * write pressure than is currently available.  Try not to
% +      * leave anything to bufdaemon, since it can too easily fail to
% +      * clear write pressure because the processes creating the
% +      * write pressure have the vnode locked.  E.g., use
% +      * lodirtybuffers instead of hidirtybuffers in
% +      * buf_dirty_count_severe(), and also check dirtybufthresh.

I use lodirtybuffers as a default.  This is still too high, and almost
any value reduces performance.

% +      *
% +      * XXX seqcount itself is broken.  It gets cleared by reads of
% +      * the same open file, even if the application seeks back after
% +      * reading so that writes are perfectly sequential.  This case
% +      * is probably unimportant, but bonnie does it.

As described differently above.
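For reference, the heuristic being fooled looks roughly like this (a
simplified paraphrase of sequential_heuristic() in
sys/kern/vfs_vnops.c, from memory rather than the exact source):

%%%
/*
 * Simplified sketch of sequential_heuristic().  f_seqcount and
 * f_nextoff live in the open-file struct, so they are shared by
 * reads and writes on the same open file.
 */
static int
sequential_heuristic(struct uio *uio, struct file *fp)
{
	if ((uio->uio_offset == 0 && fp->f_seqcount > 0) ||
	    uio->uio_offset == fp->f_nextoff) {
		/* Looks sequential; credit it, capped at IO_SEQMAX. */
		if (fp->f_seqcount < IO_SEQMAX)
			fp->f_seqcount++;
		return (fp->f_seqcount << IO_SEQSHIFT);
	}
	/* Offset jumped; draw down sequentiality quickly. */
	if (fp->f_seqcount > 1)
		fp->f_seqcount = 1;
	else
		fp->f_seqcount = 0;
	return (0);
}
%%%

f_nextoff is advanced by both reads and writes, so in bonnie's
read/seek-back/write cycle every write() starts 8K behind the
f_nextoff left by the preceding read() and is classified as random,
while the reads still look sequential.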
% +      *
% +      * XXX I think that if bonnie didn't seek back then the write
% +      * pressure would be the same (provided the bonnie write size
% +      * is smaller than the fs block size).  Then the writes
% +      * wouldn't be sequential, so a fixed seqcount wouldn't help
% +      * force them out.  But they need to be forced out under write
% +      * pressure, and I don't see anything to do it.

Check this.

% +      *
% +      * XXX the writing system shouldn't collapse when we get this
% +      * slightly wrong.  Why doesn't bdwrite() force things when
% +      * we don't?  It has some code to do so.  It is still better
% +      * for us to force out full clusters than to leave it to
% +      * the writing system to possibly force out sub-buffers.
% +      *
% +      * XXX try killing async under the same conditions.

This is too late to reduce pressure from async too.

% +      */
% +     if (vm_page_count_severe() || buf_dirty_count_large()) {
% +             seqcount = 2;
% +             async = 0;
% +     }
% +
%       if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
%           (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
% @@ -719,4 +882,5 @@
%       } else {        /* Wait for rest of cluster */
%               vp->v_cstart = lbn;
% +             /* XXX problem if we're async and orig maxclen = 0. */
%               bdwrite(bp);
%       }

This case for async breaks the bugfeature of not honoring it.  The
bugfeature often prevents buildup of excessive pressure, but some
cases arrive here and then it doesn't.  This is the only old change I
have in cluster_write().

% @@ -726,8 +890,13 @@
%        * are operating sequentially, otherwise let the buf or
%        * update daemon handle it.
% +      *
% +      * Algorithm changeback: ignore seqcount here, at least for
% +      * now, to work around readers breaking it for writers.  It
% +      * is too late to start ignoring after write pressure builds
% +      * up, since not writing out here is the main source of the
% +      * buildup.
%        */
%       bdwrite(bp);
% -     if (seqcount > 1)
% -             cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
% +     cluster_wbuild_wb(vp, lblocksize, vp->v_cstart, vp->v_clen + 1);
%       vp->v_clen = 0;
%       vp->v_cstart = lbn + 1;

This works fairly well.  Changing the algorithm back in all cases
reduces performance, but here we have usually finished with the
buffer.  Even if seqcount is fixed so that it isn't clobbered by
reads, it might be right to ignore it here, so as to reduce write
pressure, until write pressure is handled better.

I used the following program to generate i/o like bonnie's:

%%%
#include <sys/types.h>

#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define	IOSIZE	8192

static char buf[IOSIZE] __aligned(4096);

int
main(int argc, char **argv)
{
	ssize_t n;
	int ifd, ofd;

	if (argc != 2 && argc != 3)
		errx(1, "usage: ./prog file1 [file2]");
	ifd = open(argv[1], O_RDWR);
	if (ifd < 0)
		err(1, "open");
	if (argc == 3)
		ofd = open(argv[2], O_RDWR);
	else
		ofd = ifd;
	if (ofd < 0)
		err(1, "open");
	for (;;) {
		n = read(ifd, buf, IOSIZE);
		if (n == 0) {
			printf("finished: ioffset %#jx; ooffset = %#jx\n",
			    (intmax_t)lseek(ifd, 0, SEEK_CUR),
			    (intmax_t)lseek(ofd, 0, SEEK_CUR));
			exit(0);
		}
		if (n < 0)
			err(1, "read error");
		if (n != IOSIZE)
			errx(1, "short read %d", (int)n);
		if (ofd == ifd)
			lseek(ofd, -IOSIZE, SEEK_CUR);
		n = write(ofd, buf, IOSIZE);
		if (n < 0)
			err(1, "write error");
		if (n != IOSIZE)
			errx(1, "short write %d", (int)n);
	}
}
%%%

Usage: create a test file zz of a suitable size (the program normally
doesn't extend it); then "./prog zz" does read/write like bonnie;
"./prog zz zz" does the same with independent file positions.
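For example (illustrative commands, assuming the program above is
saved as prog.c; any file size big enough to generate write pressure
will do):

    dd if=/dev/zero of=zz bs=1m count=256    # create the test file
    cc -o prog prog.c
    ./prog zz       # bonnie-like rewrite (shared offset)
    ./prog zz zz    # same, but independent file positions

Bruce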