Date:      Fri, 18 Jan 2013 09:47:11 +0100
From:      Stefan Esser <se@freebsd.org>
To:        freebsd-hackers@freebsd.org
Subject:   Re: stupid UFS behaviour on random writes
Message-ID:  <50F90C0F.5010604@freebsd.org>
In-Reply-To: <103826787.2103620.1358463687244.JavaMail.root@erie.cs.uoguelph.ca>
References:  <103826787.2103620.1358463687244.JavaMail.root@erie.cs.uoguelph.ca>

On 18.01.2013 00:01, Rick Macklem wrote:
> Wojciech Puchar wrote:
>> create a 10GB file (on a 2GB RAM machine, with some swap in use to make
>> sure little cache would be available for the filesystem):
>>
>> dd if=/dev/zero of=file bs=1m count=10k
>>
>> block size is 32KB, fragment size 4k
>>
>>
>> now test random read access to it (10 threads)
>>
>> randomio test 10 0 0 4096
>>
>> a normal result for the not-so-fast disk in my laptop:
>>
>> 118.5 | 118.5 5.8 82.3 383.2 85.6 | 0.0 inf nan 0.0 nan
>> 138.4 | 138.4 3.9 72.2 499.7 76.1 | 0.0 inf nan 0.0 nan
>> 142.9 | 142.9 5.4 69.9 297.7 60.9 | 0.0 inf nan 0.0 nan
>> 133.9 | 133.9 4.3 74.1 480.1 75.1 | 0.0 inf nan 0.0 nan
>> 138.4 | 138.4 5.1 72.1 380.0 71.3 | 0.0 inf nan 0.0 nan
>> 145.9 | 145.9 4.7 68.8 419.3 69.6 | 0.0 inf nan 0.0 nan
>>
>>
>> systat shows 4kB I/O size. all is fine.
>>
>> BUT random 4kB writes
>>
>> randomio test 10 1 0 4096
>>
>> total | read: latency (ms) | write: latency (ms)
>> iops | iops min avg max sdev | iops min avg max sdev
>> --------+-----------------------------------+----------------------------------
>> 38.5 | 0.0 inf nan 0.0 nan | 38.5 9.0 166.5 1156.8 261.5
>> 44.0 | 0.0 inf nan 0.0 nan | 44.0 0.1 251.2 2616.7 492.7
>> 44.0 | 0.0 inf nan 0.0 nan | 44.0 7.6 178.3 1895.4 330.0
>> 45.0 | 0.0 inf nan 0.0 nan | 45.0 0.0 239.8 3457.4 522.3
>> 45.5 | 0.0 inf nan 0.0 nan | 45.5 0.1 249.8 5126.7 621.0
>>
>>
>>
>> results are horrific. systat shows 32kB I/O, gstat shows half are
>> reads, half are writes.
>>
>> Why does UFS need to read the full block, change one 4kB part and then
>> write it back, instead of just writing the 4kB part?
> 
> Because that's the way the buffer cache works. It writes an entire buffer
> cache block (unless at the end of file), so it must first read the rest of the
> block into the buffer so that it doesn't write garbage (the rest of the block) out.

Without having looked at the code or testing:

I assume using O_DIRECT when opening the file should help for that
particular test (on kernels compiled with "options DIRECTIO").
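
For illustration, something like the following (just a sketch, not the
actual randomio code; the file name and size are made up here) would
issue 4kB writes with O_DIRECT and should avoid the read-modify-write
in the buffer cache:

/*
 * Sketch only: open the pre-created 10GB test file with O_DIRECT and
 * issue one 4kB write at a 4kB-aligned random offset.  Assumes a
 * kernel built with "options DIRECTIO".
 */
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IOSIZE   4096
#define FILESIZE (10ULL << 30)          /* 10GB test file */

int
main(void)
{
        char *buf;
        off_t offset;
        int fd;

        fd = open("file", O_RDWR | O_DIRECT);
        if (fd == -1)
                err(1, "open");

        /* O_DIRECT wants aligned buffers; IOSIZE alignment is enough. */
        buf = aligned_alloc(IOSIZE, IOSIZE);
        if (buf == NULL)
                err(1, "aligned_alloc");
        memset(buf, 'x', IOSIZE);

        /* Pick a 4kB-aligned random offset, as randomio does. */
        offset = (off_t)arc4random_uniform(FILESIZE / IOSIZE) * IOSIZE;

        if (pwrite(fd, buf, IOSIZE, offset) != IOSIZE)
                err(1, "pwrite");

        close(fd);
        return (0);
}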

> I'd argue that using an I/O size smaller than the file system block size is
> simply sub-optimal and that most apps. don't do random I/O of blocks.
> OR
> If you had an app. that does random I/O of 4K blocks (at 4K byte offsets),
> then using a 4K/1K file system would be better.

A 4k/1k file system has higher overhead (more indirect blocks) and
is clearly sub-optimal for most general uses today.
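
(For reference, such a 4k/1k file system could be created with something
like "newfs -b 4096 -f 1024 /dev/ada0p2"; the device name is only an
example, see newfs(8) for the exact options.)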

> NFS is the exception, in that it keeps track of a dirty byte range within
> a buffer cache block and writes that byte range. (NFS writes are byte granular,
> unlike a disk.)

It should be easy to add support for a fragment mask to the buffer
cache, which would allow valid fragments to be identified. Such a mask
would be set to 0xff for all current uses of the buffer cache (meaning
the full block is valid), but a special case could then be added for
writes of exactly one or multiple fragments, where only the
corresponding valid bits would be set. In addition, a possible later
read from disk must obviously skip fragments for which the valid mask
bits are already set.
This bit mask could then be used to update the affected fragments
only, without a read-modify-write of the containing block.
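
Roughly like this (purely illustrative; the structure, field and
function names are invented here and are not the actual buffer cache
code):

#include <stdint.h>
#include <string.h>

#define BLKSIZE         32768
#define FRAGSIZE        4096
#define FRAGS_PER_BLK   (BLKSIZE / FRAGSIZE)    /* 8, fits a uint8_t mask */

struct fragbuf {                        /* stand-in for struct buf */
        uint8_t fb_validmask;           /* bit i set => fragment i is valid */
        char    fb_data[BLKSIZE];
};

/*
 * A write of one or more whole fragments only sets the corresponding
 * valid bits; no read of the rest of the block is needed.
 */
static void
fragbuf_write(struct fragbuf *fb, int firstfrag, int nfrags, const char *src)
{
        memcpy(fb->fb_data + firstfrag * FRAGSIZE, src, nfrags * FRAGSIZE);
        fb->fb_validmask |= ((1 << nfrags) - 1) << firstfrag;
}

/*
 * A later read from disk skips fragments whose valid bits are already
 * set, so the newly written data is not overwritten with stale data.
 */
static void
fragbuf_fill_from_disk(struct fragbuf *fb, const char *diskblk)
{
        int i;

        for (i = 0; i < FRAGS_PER_BLK; i++) {
                if ((fb->fb_validmask & (1 << i)) == 0) {
                        memcpy(fb->fb_data + i * FRAGSIZE,
                            diskblk + i * FRAGSIZE, FRAGSIZE);
                        fb->fb_validmask |= 1 << i;
                }
        }
}

When flushing, only the fragments with their valid bit set would have
to be written, so a random 4kB update never needs to touch the other
28kB of the block.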

But I doubt that such a change would improve performance in the
general case, only in random update scenarios (which might still
be relevant, e.g. for a DBMS that knows the fragment size and uses
it for its DB files).

Regards, Stefan


