Date:        Fri, 18 Jan 2013 09:47:11 +0100
From:        Stefan Esser <se@freebsd.org>
To:          freebsd-hackers@freebsd.org
Subject:     Re: stupid UFS behaviour on random writes
Message-ID:  <50F90C0F.5010604@freebsd.org>
In-Reply-To: <103826787.2103620.1358463687244.JavaMail.root@erie.cs.uoguelph.ca>
References:  <103826787.2103620.1358463687244.JavaMail.root@erie.cs.uoguelph.ca>
Am 18.01.2013 00:01, schrieb Rick Macklem:
> Wojciech Puchar wrote:
>> create 10GB file (on 2GB RAM machine, with some swap used to make sure
>> little cache would be available for the filesystem).
>>
>> dd if=/dev/zero of=file bs=1m count=10k
>>
>> block size is 32KB, fragment size 4k
>>
>> now test random read access to it (10 threads)
>>
>> randomio test 10 0 0 4096
>>
>> normal result on such not so fast disk in my laptop.
>>
>>   118.5 | 118.5  5.8  82.3  383.2  85.6 |  0.0  inf  nan  0.0  nan
>>   138.4 | 138.4  3.9  72.2  499.7  76.1 |  0.0  inf  nan  0.0  nan
>>   142.9 | 142.9  5.4  69.9  297.7  60.9 |  0.0  inf  nan  0.0  nan
>>   133.9 | 133.9  4.3  74.1  480.1  75.1 |  0.0  inf  nan  0.0  nan
>>   138.4 | 138.4  5.1  72.1  380.0  71.3 |  0.0  inf  nan  0.0  nan
>>   145.9 | 145.9  4.7  68.8  419.3  69.6 |  0.0  inf  nan  0.0  nan
>>
>> systat shows 4kB I/O size. all is fine.
>>
>> BUT random 4kB writes
>>
>> randomio test 10 1 0 4096
>>
>>   total |  read:        latency (ms)       | write:        latency (ms)
>>    iops |  iops   min    avg    max   sdev |  iops   min    avg    max   sdev
>> --------+----------------------------------+----------------------------------
>>    38.5 |   0.0   inf    nan    0.0    nan |  38.5   9.0  166.5 1156.8  261.5
>>    44.0 |   0.0   inf    nan    0.0    nan |  44.0   0.1  251.2 2616.7  492.7
>>    44.0 |   0.0   inf    nan    0.0    nan |  44.0   7.6  178.3 1895.4  330.0
>>    45.0 |   0.0   inf    nan    0.0    nan |  45.0   0.0  239.8 3457.4  522.3
>>    45.5 |   0.0   inf    nan    0.0    nan |  45.5   0.1  249.8 5126.7  621.0
>>
>> results are horrific. systat shows 32kB I/O, gstat shows half are reads,
>> half are writes.
>>
>> Why does UFS need to read the full block, change one 4kB part and then
>> write it back, instead of just writing the 4kB part?
>
> Because that's the way the buffer cache works. It writes an entire buffer
> cache block (unless at the end of file), so it must read the rest of the
> block into the buffer, so it doesn't write garbage (the rest of the block)
> out.

Without having looked at the code or testing: I assume using O_DIRECT when
opening the file should help for that particular test (on kernels compiled
with "options DIRECTIO").

> I'd argue that using an I/O size smaller than the file system block size is
> simply sub-optimal and that most apps. don't do random I/O of blocks.
> OR
> If you had an app. that does random I/O of 4K blocks (at 4K byte offsets),
> then using a 4K/1K file system would be better.

A 4k/1k file system has higher overhead (more indirect blocks) and is
clearly sub-optimal for most general uses today.

> NFS is the exception, in that it keeps track of a dirty byte range within
> a buffer cache block and writes that byte range. (NFS writes are byte
> granular, unlike a disk.)

It should be easy to add support for a fragment mask to the buffer cache,
which allows valid fragments to be identified. Such a mask would be set to
0xff for all current uses of the buffer cache (meaning the full block is
valid), but a special case could then be added for writes of exactly one or
more fragments, where only the corresponding valid bits are set. In
addition, a possible later read from disk must obviously skip fragments for
which the valid mask bits are already set.

This bit mask could then be used to update only the affected fragments,
without a read-modify-write of the containing block. But I doubt that such
a change would improve performance in the general case, just in random
update scenarios (which might still be relevant, e.g. for a DBMS that knows
the fragment size and uses it for its DB files).

Regards, STefan
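As a rough illustration of the O_DIRECT suggestion above: something like the
sketch below opens the test file with O_DIRECT so that aligned 4 KB writes
can bypass the buffer cache (which on FreeBSD only takes effect on a kernel
built with "options DIRECTIO"). Whether this actually avoids the 32 KB
read-modify-write for writes is exactly what Stefan hedges on; the file name
and offset here are made up for the example.

/*
 * Hypothetical sketch: 4 KB direct write into "file" at a 4 KB-aligned
 * offset, bypassing the buffer cache via O_DIRECT (requires a kernel
 * with "options DIRECTIO" for the flag to have an effect).
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("file", O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* O_DIRECT transfers must be suitably aligned; use a 4 KB buffer. */
	void *buf;
	if (posix_memalign(&buf, 4096, 4096) != 0) {
		fprintf(stderr, "posix_memalign failed\n");
		return (1);
	}
	memset(buf, 0xa5, 4096);

	/* Write one 4 KB fragment at a 4 KB-aligned offset. */
	if (pwrite(fd, buf, 4096, (off_t)4096 * 1000) != 4096)
		perror("pwrite");

	free(buf);
	close(fd);
	return (0);
}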
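And a minimal userspace sketch of the proposed per-fragment valid mask; the
struct and helper names are invented for illustration and do not reflect the
actual FreeBSD buffer cache code. With a 32 KB block and 4 KB fragments
there are 8 fragments per block, so a single byte (0xff == whole block
valid) is enough.

/*
 * Sketch of a buffer with a per-fragment valid mask: a fragment-sized
 * write only marks its own fragment valid, and a later fill from disk
 * skips fragments that are already valid, so no read-modify-write of
 * the whole block is needed for fragment-aligned updates.
 */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE	32768
#define FRAG_SIZE	4096
#define FRAGS_PER_BLOCK	(BLOCK_SIZE / FRAG_SIZE)	/* 8 */

struct frag_buf {
	uint8_t	valid_mask;		/* bit i set => fragment i holds valid data */
	char	data[BLOCK_SIZE];
};

/* A write of exactly one fragment only sets that fragment's valid bit. */
static void
frag_write(struct frag_buf *bp, int frag, const char *src)
{
	memcpy(bp->data + frag * FRAG_SIZE, src, FRAG_SIZE);
	bp->valid_mask |= 1u << frag;
}

/*
 * When the block is later read from disk, copy in only the fragments whose
 * valid bit is NOT yet set, so freshly written data is not overwritten.
 */
static void
frag_fill_from_disk(struct frag_buf *bp, const char *disk_block)
{
	for (int i = 0; i < FRAGS_PER_BLOCK; i++)
		if ((bp->valid_mask & (1u << i)) == 0)
			memcpy(bp->data + i * FRAG_SIZE,
			    disk_block + i * FRAG_SIZE, FRAG_SIZE);
	bp->valid_mask = 0xff;		/* whole block is valid now */
}

int
main(void)
{
	static struct frag_buf b;
	static char disk[BLOCK_SIZE], newdata[FRAG_SIZE];

	memset(disk, 'D', sizeof(disk));
	memset(newdata, 'N', sizeof(newdata));

	frag_write(&b, 3, newdata);		/* 4 KB write: only bit 3 becomes valid */
	frag_fill_from_disk(&b, disk);		/* later read skips fragment 3 */
	return (b.data[3 * FRAG_SIZE] == 'N' ? 0 : 1);
}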