Date:      Wed, 22 Mar 2000 02:45:16 +0000
From:      Paul Richards <paul@originative.co.uk>
To:        Richard Wendland <richard@netcraft.com>
Cc:        Alfred Perlstein <bright@wintelcom.net>, Poul-Henning Kamp <phk@critter.freebsd.dk>, Matthew Dillon <dillon@apollo.backplane.com>, current@FreeBSD.ORG, fs@FreeBSD.ORG
Subject:   Re: FreeBSD random I/O performance issues
Message-ID:  <38D833BC.A082DF09@originative.co.uk>
References:  <200003220022.AAA28786@ns0.netcraft.com>

Richard Wendland wrote:
> 

I spent a bit of time analysing these results when I first saw them. I
don't think it has anything to do with the cache; it has to do with how
we write out blocks.

> One interesting observation is that for non sync, async or noclusterw
> mounts ~8750 I/O operations are done, which is 7/8ths of the 10,000
> writes.  If I change the program to use 16 blocks there are ~9375
> I/O operations which is 15/16ths of the 10,000 writes.  Guessing,
> this is as if writes are forced for all blocks but one.

This is due to a quirk of the clustering algorithm. See below or my
previous email.

> With async filesystem mounts very little I/O occurs, and with
> noclusterw there are ~10,000 operations matching the number of
> writes.
> 
> With sync it's ~20,000 operations matching the total of reads &
> writes.  This demonstrates another aspect of the bug, sync behaviour
> should cause 10,000 operations; the reads aren't being cached.

This isn't quite true. It's 20,000 *write* operations. I put this down
to the mtime update for each write, which doubles the number of actual
write operations. No read operations take place; the data *does* come
out of the cache. There's nothing wrong with reading as far as I can tell.
  
> Another aspect of this issue is the effect of changing the seek
> blocksize, and write blocksize, by 1 byte each way from 8192, thus
> doing block unaligned I/O.  In some cases this changes the amount
> of I/O recorded by getrusage to zero, and drops elapsed time from
> half a minute or so to less than 1 second.
> 
> Thanks to Paul Richards for noticing this.  I've not spent much time
> researching this, so can only present my small set of measurements.
> To do these tests you have to recompile my test program each time eg
> 
>         gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c

This happens because a full filesystem block is written immediately, or
rather the clustering code is called immediately. The rationale is that
a full block isn't likely to be written to again, so it might as well be
pushed out to disk. Richard's program deliberately writes full blocks,
which is apparently what db does, so it always forces a write to take
place. Given the behaviour of db, it might be more sensible to remove
this feature and just mark full blocks dirty, the same as other blocks,
since they are likely to be written to again shortly if the db record is
written to frequently.
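
As a rough illustration of the full-block rule described above, here is
a small classifier. It is a sketch, not the filesystem code: the enum
and the function name classify_write are made up for this example, and
it assumes "full" means that the write ends exactly at the end of the
filesystem block. The sync and noclusterw cases follow the mount
options discussed in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the possible outcomes; not kernel identifiers. */
enum write_action {
    WRITE_SYNC,      /* sync mount: written synchronously */
    WRITE_CLUSTER,   /* full block: handed to the clustering code */
    WRITE_ASYNC,     /* full block with noclusterw: written at once */
    WRITE_DELAYED    /* partial block: marked dirty, left in the cache */
};

/*
 * blkoffset: offset of the write within its filesystem block
 * xfersize:  bytes of the write that land in this block
 * bsize:     filesystem block size
 *
 * Assumption: a block counts as "full" when the write reaches the end
 * of the block.
 */
enum write_action
classify_write(size_t blkoffset, size_t xfersize, size_t bsize,
    bool sync_mount, bool noclusterw)
{
    if (sync_mount)
        return WRITE_SYNC;
    if (blkoffset + xfersize == bsize)
        return noclusterw ? WRITE_ASYNC : WRITE_CLUSTER;
    return WRITE_DELAYED;
}
```

Under these assumptions, classify_write(0, 8192, 8192, false, false)
gives WRITE_CLUSTER, while a 4096 byte write at offset 0 gives
WRITE_DELAYED, which is consistent with the "zero I/O" rows in the
table below.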

The clustering code has a bug: an old cluster is not pushed out when
the block number is 0, because the code that would do so is never
reached.

if (lbn == 0)
        vp->v_lasta = vp->v_clen = vp->v_cstart = vp->v_lastw = 0;


if (vp->v_clen == 0 || lbn != vp->v_lastw + 1 ||
        (bp->b_blkno != vp->v_lasta + btodb(lblocksize))) {
        maxclen = vp->v_mount->mnt_iosize_max / lblocksize - 1;
        if (vp->v_clen != 0) {
            /*
             * Next block is not sequential.
             *
             * If we are not writing at end of file, the process
             * seeked to another point in the file since its last
             * write, or we have reached our maximum cluster size,
             * then push the previous cluster. Otherwise try
             * reallocating to make it sequential.
             */

         ............

In Richard's program the next block is never sequential, so the previous
cluster is always pushed, *except* that when the program seeks back to
block zero the "if (vp->v_clen != 0)" test fails and a new cluster is
started without pushing out the previously started one. The dirty block
in that previous cluster then hangs around until it is flushed as dirty
blocks normally would be.

It is the combination of this clustering behaviour and the fact that the
program always writes full blocks that causes the 8750 writes below.
Because the blocks are full filesystem blocks, they are passed to the
clustering code immediately rather than just being marked dirty. Because
they are never in sequence, the clustering code always starts a new
cluster and flushes the previous one. The exception is 1 write in every
8: when block 0 is written, the previous cluster is not pushed out but
hangs around. The end result is that 7/8 of the blocks get written
immediately, which is 8750 of the 10,000 writes.
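
The 7/8 arithmetic can be checked with a toy model of the push decision.
This is only a sketch: struct toy_vnode, toy_cluster_write and
count_pushes are invented names, the model ignores the physical
contiguity check and the maximum cluster length, and it replaces the
program's seek pattern with a fixed stride of 3 blocks so that no write
is ever sequential with the one before it.

```c
/*
 * Toy model of the cluster push decision, not the kernel code.
 * clen is nonzero while a cluster is being built.
 */
struct toy_vnode {
    long lastw;   /* last logical block written */
    int  clen;    /* nonzero if a cluster is pending */
};

/* Returns 1 if this write pushes the previous cluster, else 0. */
static int
toy_cluster_write(struct toy_vnode *vp, long lbn)
{
    int pushed = 0;

    if (lbn == 0) {
        /* The pending cluster is forgotten without being pushed. */
        vp->lastw = 0;
        vp->clen = 0;
    }

    if (vp->clen == 0 || lbn != vp->lastw + 1) {
        if (vp->clen != 0)
            pushed = 1;     /* not sequential: push previous cluster */
        vp->clen = 1;       /* start a new cluster at lbn */
    }
    vp->lastw = lbn;
    return pushed;
}

/* Count pushes for nwrites writes cycling over nblocks with stride 3. */
long
count_pushes(long nblocks, long nwrites)
{
    struct toy_vnode vn = { 0, 0 };
    long i, pushes = 0;

    for (i = 0; i < nwrites; i++)
        pushes += toy_cluster_write(&vn, (3 * i) % nblocks);
    return pushes;
}
```

With this access pattern, count_pushes(8, 10000) returns 8750 and
count_pushes(16, 10000) returns 9375, matching the 7/8 and 15/16
figures Richard reported.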

When the write size drops below the filesystem block size, the
clustering code never gets called because the buffers are just marked
dirty and cached.

I think if we fixed the issue of writing out full blocks this behaviour
would stop, but I also think the clustering code could do with a fix. It
should at least check whether a cluster is being built when the block
number is 0 and push it out. Possibly, though, it would be better not to
push out clusters of only one block and just leave them in the cache.
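
To compare the two ideas, here is a toy model with a selectable policy.
It is a sketch only, not a patch: enum toy_policy and
count_pushes_with_policy are invented for illustration, the model
ignores physical contiguity and the maximum cluster length, and the
writes use a fixed stride of 3 blocks so that they are never sequential.

```c
/* Hypothetical policies for the toy model; not kernel options. */
enum toy_policy {
    POLICY_CURRENT,       /* block 0 forgets the pending cluster */
    POLICY_PUSH_AT_ZERO,  /* block 0 pushes the pending cluster first */
    POLICY_KEEP_SINGLE    /* never push a cluster of only one block */
};

long
count_pushes_with_policy(enum toy_policy policy, long nblocks,
    long nwrites)
{
    long lastw = 0, pushes = 0, i;
    int clen = 0;   /* blocks in the cluster being built */

    for (i = 0; i < nwrites; i++) {
        long lbn = (3 * i) % nblocks;

        if (lbn == 0) {
            if (policy == POLICY_PUSH_AT_ZERO && clen != 0)
                pushes++;           /* flush before resetting state */
            lastw = 0;
            clen = 0;
        }
        if (clen == 0 || lbn != lastw + 1) {
            /* Not sequential: decide whether to push the old cluster. */
            if (clen > 1 || (clen == 1 && policy != POLICY_KEEP_SINGLE))
                pushes++;
            clen = 1;               /* start a new cluster at lbn */
        } else {
            clen++;                 /* sequential: extend the cluster */
        }
        lastw = lbn;
    }
    return pushes;
}
```

For 8 blocks and 10,000 writes this model gives 8750 pushes with
POLICY_CURRENT, 9999 with POLICY_PUSH_AT_ZERO (every write after the
first pushes its predecessor), and 0 with POLICY_KEEP_SINGLE, where the
blocks would simply stay dirty in the cache.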

> 
> Sorry it's that crude.  These results are from a FreeBSD
> 2.2.7-RELEASE+CAM, ccd stripe=2 (PII 400MHz, 512MB) system,
> though exactly the same pattern is apparent with 3.4-STABLE.
> "****" indicate sub-second "zero I/O" results.
> 
> BLOCKSIZE   WRITESIZE   csh 'time' output
> 
> 8191        8191        0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
> 8191        8192        0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
> 8191        8193        0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w
> 
> 8192        8191        0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w    ****
> 8192        8192        0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
> 8192        8193        0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w
> 
> 8193        8191        0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
> 8193        8192        0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
> 8193        8193        0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w
> 
> 8191        4095        0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
> 8191        4096        0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
> 8191        4097        0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w
> 
> 8192        4095        0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w    ****
> 8192        4096        0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w   ****
> 8192        4097        0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w    ****
> 
> 8193        4095        0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w    ****
> 8193        4096        0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w    ****
> 8193        4097        0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w    ****
> 
> Any views gratefully received.  A fix would be much better :-)
> 
> Test program source, including compile & run instructions, is
> available at:
> 
>         http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c
> 
> Detailed notes on the test system configurations are at:
> 
>         http://www.netcraft.com/freebsd/random-IO/results-notes.txt
> 
> Thanks,
>         Richard
> -
> Richard Wendland                                richard@netcraft.com

