Date:      Fri, 18 Jan 2013 10:45:48 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Stefan Esser <se@freebsd.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: stupid UFS behaviour on random writes
Message-ID:  <2006598795.2120288.1358523948712.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <50F90C0F.5010604@freebsd.org>

Stefan Esser wrote:
> Am 18.01.2013 00:01, schrieb Rick Macklem:
> > Wojciech Puchar wrote:
> >> Create a 10GB file (on a 2GB RAM machine, with some swap used to make
> >> sure little cache would be available for the filesystem):
> >>
> >> dd if=/dev/zero of=file bs=1m count=10k
> >>
> >> block size is 32KB, fragment size 4k
> >>
> >>
> >> now test random read access to it (10 threads)
> >>
> >> randomio test 10 0 0 4096
> >>
> >> A normal result for the not-so-fast disk in my laptop:
> >>
> >> 118.5 | 118.5 5.8 82.3 383.2 85.6 | 0.0 inf nan 0.0 nan
> >> 138.4 | 138.4 3.9 72.2 499.7 76.1 | 0.0 inf nan 0.0 nan
> >> 142.9 | 142.9 5.4 69.9 297.7 60.9 | 0.0 inf nan 0.0 nan
> >> 133.9 | 133.9 4.3 74.1 480.1 75.1 | 0.0 inf nan 0.0 nan
> >> 138.4 | 138.4 5.1 72.1 380.0 71.3 | 0.0 inf nan 0.0 nan
> >> 145.9 | 145.9 4.7 68.8 419.3 69.6 | 0.0 inf nan 0.0 nan
> >>
> >>
> >> systat shows 4kB I/O size. all is fine.
> >>
> >> BUT random 4kB writes
> >>
> >> randomio test 10 1 0 4096
> >>
> >>   total |  read:         latency (ms)       |  write:        latency (ms)
> >>    iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev
> >> --------+-----------------------------------+----------------------------------
> >> 38.5 | 0.0 inf nan 0.0 nan | 38.5 9.0 166.5 1156.8 261.5
> >> 44.0 | 0.0 inf nan 0.0 nan | 44.0 0.1 251.2 2616.7 492.7
> >> 44.0 | 0.0 inf nan 0.0 nan | 44.0 7.6 178.3 1895.4 330.0
> >> 45.0 | 0.0 inf nan 0.0 nan | 45.0 0.0 239.8 3457.4 522.3
> >> 45.5 | 0.0 inf nan 0.0 nan | 45.5 0.1 249.8 5126.7 621.0
> >>
> >>
> >>
> >> The results are horrific. systat shows 32kB I/O; gstat shows half are
> >> reads, half are writes.
> >>
> >> Why does UFS need to read the full block, change one 4kB part and then
> >> write it back, instead of just writing the 4kB part?
> >
> > Because that's the way the buffer cache works. It writes an entire
> > buffer cache block (unless at the end of file), so it must first read
> > the rest of the block into the buffer; otherwise it would write out
> > garbage (the rest of the block).
> 
> Without having looked at the code or testing:
> 
> I assume using O_DIRECT when opening the file should help for that
> particular test (on kernels compiled with "options DIRECTIO").
> 
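A minimal sketch of the kind of test being suggested here, assuming a kernel
built with "options DIRECTIO"; the file name, offset and sizes are
placeholders, and O_DIRECT requires suitably aligned buffers and offsets:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <err.h>

int
main(void)
{
	void *buf;
	int fd;

	/* O_DIRECT asks for the I/O to bypass the buffer cache. */
	if ((fd = open("file", O_RDWR | O_DIRECT)) == -1)
		err(1, "open");
	/* Direct I/O buffers must be aligned; 4k matches the fragment size. */
	if (posix_memalign(&buf, 4096, 4096) != 0)
		errx(1, "posix_memalign");
	memset(buf, 'x', 4096);
	/* Write a single 4k fragment at a 4k-aligned offset. */
	if (pwrite(fd, buf, 4096, 8192) != 4096)
		err(1, "pwrite");
	close(fd);
	return (0);
}

Whether this actually avoids reading the rest of the 32k block depends on the
kernel configuration mentioned above.
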
> > I'd argue that using an I/O size smaller than the file system block
> > size is
> > simply sub-optimal and that most apps. don't do random I/O of
> > blocks.
> > OR
> > If you had an app. that does random I/O of 4K blocks (at 4K byte
> > offsets),
> > then using a 4K/1K file system would be better.
> 
> A 4k/1k file system has higher overhead (more indirect blocks) and
> is clearly sub-optimal for most general uses, today.
> 
Yes, but if the sysadmin knows that most of the I/O is random 4K blocks,
that's his specific case, not a general use. Sorry, I didn't mean to
imply that a 4K file system was a good choice, in general.
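For that specific case, a UFS file system with 4k blocks and 1k fragments
could be created with something like the following (the device name is just
an example):

newfs -b 4096 -f 1024 /dev/ada0p1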

> > NFS is the exception, in that it keeps track of a dirty byte range
> > within
> > a buffer cache block and writes that byte range. (NFS writes are
> > byte granular,
> > unlike a disk.)
> 
> It should be easy to add support for a fragment mask in the buffer
> cache, which would allow identifying which fragments of a block are
> valid. Such a mask would be set to 0xff for all current uses of the
> buffer cache (meaning the full block is valid), but a special case
> could then be added for writes of exactly one or more fragments,
> where only the corresponding valid flag bits are set. In addition, a
> possible later read from disk must obviously skip fragments for which
> the valid mask bits are already set.
> This bit mask could then be used to update only the affected
> fragments, without a read-modify-write of the containing block.
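A rough sketch of the bookkeeping described above, with hypothetical names
(struct buf has no such field today); it only illustrates the mask handling,
not the locking or the read/write paths:

#include <stdint.h>

/* 32k block / 4k fragment => 8 fragments, so an 8-bit mask per buffer. */
#define	FRAGS_PER_BLOCK	8
#define	ALL_FRAGS_VALID	0xff	/* current behaviour: whole block valid */

struct frag_state {
	uint8_t	fs_valid;	/* bit i set => fragment i holds valid data */
};

/* A write of one or more whole fragments marks just those bits valid. */
static void
frag_mark_written(struct frag_state *fs, int first_frag, int nfrags)
{
	fs->fs_valid |= ((1 << nfrags) - 1) << first_frag;
}

/* A later read from disk may skip any fragment whose bit is already set. */
static int
frag_needs_read(const struct frag_state *fs, int frag)
{
	return ((fs->fs_valid & (1 << frag)) == 0);
}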
> 
> But I doubt that such a change would improve performance in the
> general case, just in random update scenarios (which might still
> be relevant in the case of a DBMS that knows the fragment size and
> uses it for its DB files).
> 
> Regards, STefan
Yes. And for some I/O patterns the fragment change would degrade performance.
You mentioned that a later read might have to skip fragments with the valid
bit set. In practice, I think this would translate to doing multiple reads
for the other fragments. Also, when an app. goes to write a partial fragment,
that fragment would have to be read in first, and this could result in
several reads of individual fragments instead of one read for the entire
block. (For example, if two non-adjacent 4K fragments of a 32K block are
already valid, filling in the rest takes several separate reads where a
single 32K read would do today.) It's the old "OS doesn't have a crystal
ball that predicts future I/O activity" problem.

Btw, although I did a "dirty byte range" for NFS in the buffer cache ages
ago (late 1980s), it is also a performance hit for certain cases. The
linkers/loaders love to write random sized chunks to files. For the NFS
code, if the new write isn't contiguous with the old dirty range, a
synchronous write of the old dirty byte range is forced to the server. I
have a patch that replaces the single byte range with a list in order to
avoid this synchronous write, but it has not made it into head. (I hope to
do so someday, after more testing and when I figure out all the
implications of changing "struct buf" for the rest of the system.)
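A schematic of the single-dirty-range behaviour described above, not the
actual NFS client code; flush_dirty_range_sync() is a hypothetical stand-in
for the forced synchronous write to the server:

#include <sys/types.h>

struct dirty_range {
	off_t	start;
	off_t	end;	/* exclusive; start == end means nothing dirty */
};

/* Hypothetical stand-in for pushing the old dirty range to the server. */
static void
flush_dirty_range_sync(struct dirty_range *dr)
{
	(void)dr;	/* the real code issues a synchronous write */
}

static void
record_write(struct dirty_range *dr, off_t off, off_t len)
{
	if (dr->start != dr->end &&
	    (off > dr->end || off + len < dr->start)) {
		/*
		 * The new write is not contiguous with the existing dirty
		 * range, so the old range has to be written out first.
		 * With a list of ranges it could simply be appended instead.
		 */
		flush_dirty_range_sync(dr);
		dr->start = dr->end = 0;
	}
	if (dr->start == dr->end) {
		/* No dirty data yet: start a fresh range. */
		dr->start = off;
		dr->end = off + len;
	} else {
		/* Contiguous or overlapping: extend the existing range. */
		if (off < dr->start)
			dr->start = off;
		if (off + len > dr->end)
			dr->end = off + len;
	}
}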

Of course, if someone codes up a patch for the "fragment bits" for the buffer
cache code, it might be an interesting enhancement. (One trick here is that
the code will need to know the sector size for the underlying storage system,
so it can size the fragment. I'm not sure if that can reliably be acquired
from the drivers these days?)
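For what it's worth, the sector size of a disk device can be queried from
userland with the DIOCGSECTORSIZE ioctl from <sys/disk.h> (the device path
below is only an example); whether the equivalent information is reliably
plumbed through to where the buffer cache would need it is the open question:

#include <sys/types.h>
#include <sys/disk.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <err.h>

int
main(void)
{
	u_int secsize;
	int fd;

	if ((fd = open("/dev/ada0", O_RDONLY)) == -1)	/* example device */
		err(1, "open");
	if (ioctl(fd, DIOCGSECTORSIZE, &secsize) == -1)
		err(1, "DIOCGSECTORSIZE");
	printf("sector size: %u bytes\n", secsize);
	close(fd);
	return (0);
}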

rick




