Date: Wed, 22 Mar 2000 02:45:16 +0000
From: Paul Richards <paul@originative.co.uk>
To: Richard Wendland <richard@netcraft.com>
Cc: Alfred Perlstein <bright@wintelcom.net>,
    Poul-Henning Kamp <phk@critter.freebsd.dk>,
    Matthew Dillon <dillon@apollo.backplane.com>,
    current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject: Re: FreeBSD random I/O performance issues
Message-ID: <38D833BC.A082DF09@originative.co.uk>
References: <200003220022.AAA28786@ns0.netcraft.com>

Richard Wendland wrote:
>

I spent a bit of time analysing these results when I first saw them. I
don't think it has anything to do with the cache, it has to do with how
we write out blocks.

> One interesting observation is that for non sync, async or noclusterw
> mounts ~8750 I/O operations are done, which is 7/8ths of the 10,000
> writes. If I change the program to use 16 blocks there are ~9375
> I/O operations which is 15/16ths of the 10,000 writes. Guessing,
> this is as if writes are forced for all blocks but one.

This is due to a quirk of the clustering algorithm. See below or my
previous email.

> With async filesystem mounts very little I/O occurs, and with
> noclusterw there are ~10,000 operations matching the number of
> writes.
>
> With sync it's ~20,000 operations matching the total of reads &
> writes. This demonstrates another aspect of the bug, sync behaviour
> should cause 10,000 operations; the reads aren't being cached.

This isn't quite true. It's 20,000 *write* operations. I put this down
to the mtime update for each write doubling the number of actual write
operations. No read operations take place, the data *does* come out of
the cache. There's nothing wrong with reading as far as I can tell.

> Another aspect of this issue is the effect of changing the seek
> blocksize, and write blocksize, by 1 byte each way from 8192, thus
> doing block unaligned I/O. In some cases this changes the amount
> of I/O recorded by getrusage to zero, and drops elapse time from
> half a minute or so to less than 1 second.
>
> Thanks to Paul Richard for noticing this. I've not spent much time
> researching this, so can only present my small set of measurements.
> To do these tests you have to recompile my test program each time eg
>
>     gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

This is because of the fact that if the filesystem block is full it is
written immediately, or rather the clustering code is called
immediately.
The rationale is that a full block isn't likely to be written to again,
so it might as well be pushed out to disk. Richard's program
deliberately writes full blocks, which is apparently what db does, so
it always forces a write to take place. Given the behaviour of db it
might be more sensible to remove this feature and just mark full blocks
dirty the same as other blocks, since it's likely that they will be
written to again shortly if the db record is written to frequently.

The clustering code has a bug in that an old cluster is not pushed out
if the block number is 0, because the code that would do so never gets
reached.

    if (lbn == 0)
        vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;

    if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
        (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
        maxclen = vp->v_mount->mnt_iosize_max / lblocksize - 1;
        if (vp->v_clen != 0) {
            /*
             * Next block is not sequential.
             *
             * If we are not writing at end of file, the process
             * seeked to another point in the file since its last
             * write, or we have reached our maximum cluster size,
             * then push the previous cluster. Otherwise try
             * reallocating to make it sequential.
             */
            ............

In Richard's program the next block is never sequential, so the
previous cluster is always pushed *except* that when the program seeks
back to block zero the "if (vp->v_clen != 0)" fails and a new cluster
is started without pushing out the previously started one. That dirty
block in the previous cluster then hangs around until it is flushed as
dirty blocks normally would be.

It is the combination of this clustering behaviour and the fact that
the program always writes full blocks that causes the 8750 writes
below.
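
As a rough check on the 7/8 figure, here is a toy user-space model of the behaviour described above. It is only a sketch, not the kernel code: the function name count_cluster_pushes and the cluster_open flag (standing in for "vp->v_clen != 0") are made up for illustration, and a fixed stride of 3 replaces the test program's seeks so that the count is reproducible. It assumes every write is a full block and that each push costs exactly one I/O.

```c
#include <assert.h>

/*
 * Toy model of the clustering quirk (NOT the real cluster_write()).
 * Each write of a full block normally pushes the previously started
 * cluster, costing one I/O.  A write to block 0 resets the cluster
 * state first, so the previous cluster is forgotten and not pushed.
 */
long
count_cluster_pushes(long nblocks, long nwrites)
{
    int cluster_open = 0;   /* stands in for vp->v_clen != 0 */
    long last = -1;         /* last block written, like vp->v_lastw */
    long pushes = 0;
    long i;

    assert(nblocks > 0);

    for (i = 0; i < nwrites; i++) {
        /* Stride 3: visits all blocks evenly, never sequentially. */
        long lbn = (3 * i) % nblocks;

        if (lbn == 0)
            cluster_open = 0;       /* state reset, nothing pushed */

        if (cluster_open && lbn != last + 1)
            pushes++;               /* previous cluster goes to disk */

        cluster_open = 1;           /* this block starts a new cluster */
        last = lbn;
    }
    return pushes;
}
```

With this access pattern count_cluster_pushes(8, 10000) comes to 8750 and count_cluster_pushes(16, 10000) to 9375, in line with the 7/8 and 15/16 figures Richard measured.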

Since the blocks are full filesystem blocks, rather than being marked
dirty they are passed to the clustering code immediately. Because they
are never in sequence, the clustering code always starts a new cluster
and flushes the previous one. The exception is 1 write in every 8: when
block 0 is written the previous cluster is not pushed out but hangs
around. The end result is that 7/8 of the blocks get written
immediately, which is 8750/10000 writes.

When the write size drops below the filesystem block size the
clustering code never gets called, because the buffers are just marked
dirty and cached.

I think if we fixed the issue of writing out full blocks this behaviour
would stop, but I also think the clustering code could do with a fix.
It should at least check to see if there is a cluster being built when
the block number is 0 and push it out. Possibly, though, it'd be better
not to push out clusters of only one block and just leave them in the
cache.

>
> Sorry it's that crude. These results are from a FreeBSD
> 2.2.7-RELEASE+CAM, ccd stripe=2 (PII 400MHz, 512MB) system,
> though exactly the same pattern is apparent with 3.4-STABLE.
> "****" indicate sub-second "zero I/O" results.
>
> BLOCKSIZE  WRITESIZE  csh 'time' output
>
>   8191       8191     0.0u 1.5s 0:34.10   4.6% 5+186k 0+7500io 0pf+0w
>   8191       8192     0.0u 1.3s 0:31.52   4.5% 5+178k 0+7500io 0pf+0w
>   8191       8193     0.0u 1.4s 0:32.63   4.4% 5+189k 0+7500io 0pf+0w
>
>   8192       8191     0.0u 0.7s 0:01.97  37.5% 8+178k    0+0io 0pf+0w ****
>   8192       8192     0.0u 1.3s 0:39.30   3.4% 7+196k 0+8750io 0pf+0w
>   8192       8193     0.0u 1.3s 0:40.09   3.4% 5+187k 0+8750io 0pf+0w
>
>   8193       8191     0.0u 1.4s 0:46.22   3.2% 5+192k 0+8750io 0pf+0w
>   8193       8192     0.0u 1.6s 0:40.48   4.0% 5+182k 0+8750io 0pf+0w
>   8193       8193     0.0u 1.5s 0:40.57   3.8% 5+175k 0+8750io 0pf+0w
>
>   8191       4095     0.0u 1.2s 0:33.79   3.6% 5+193k 0+7500io 0pf+0w
>   8191       4096     0.0u 1.2s 0:34.00   3.8% 5+190k 0+7500io 0pf+0w
>   8191       4097     0.0u 1.1s 0:33.58   3.6% 4+165k 0+7500io 0pf+0w
>
>   8192       4095     0.0u 0.5s 0:00.76  75.0% 5+189k    0+0io 0pf+0w ****
>   8192       4096     0.0u 0.5s 0:00.58 100.0% 5+183k    0+0io 0pf+0w ****
>   8192       4097     0.0u 0.5s 0:00.74  78.3% 5+181k    0+0io 0pf+0w ****
>
>   8193       4095     0.0u 0.6s 0:01.00  67.0% 5+177k    0+0io 0pf+0w ****
>   8193       4096     0.0u 0.6s 0:01.05  63.8% 5+179k    0+0io 0pf+0w ****
>   8193       4097     0.0u 0.6s 0:01.02  66.6% 5+183k    0+0io 0pf+0w ****
>
> Any views gratefully received. A fix would be much better :-)
>
> Test program source, including compile & run instructions, is
> available at:
>
> http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c
>
> Detailed notes on the test system configurations are at:
>
> http://www.netcraft.com/freebsd/random-IO/results-notes.txt
>
> Thanks,
> Richard
> -
> Richard Wendland richard@netcraft.com

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message