Date:      Wed, 2 Feb 2005 15:04:45 -0800 (PST)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Mike Tancsa <mike@sentex.net>
Cc:        freebsd-performance@freebsd.org
Subject:   Re: FreeBSD 5.3 I/O Performance / Linux 2.6.10 and dragonfly
Message-ID: <200502022304.j12N4jNu003211@apollo.backplane.com>
References: <20050130120437.93214.qmail@web26810.mail.ukl.yahoo.com>
            <6.2.1.2.0.20050201113451.0313ee20@64.7.153.2>
            <6.2.1.2.0.20050201193210.0489e6f8@64.7.153.2>
            <6.2.1.2.0.20050202170217.02d509e0@64.7.153.2>
:> I can figure some things out.  Clearly the BSD write numbers are dropping
:> at a block size of 2048 due to vfs.write_behind being set to 1.
:
:Interesting, I didn't know of this. I really should re-read tuning(8). What
:are the dangers of setting it to zero?

    There are three issues here.  First is how much of the buffer cache you
    want to allow a single application to monopolize.  Second is our
    historically terrible filesystem syncer and buffer cache dirty page
    management.  Third is the fact that we even *HAVE* a buffer cache for
    reads that the system should be extracting directly out of the VM object.

    If you turn off write_behind, a single application (the benchmark) can
    monopolize the buffer cache and greatly reduce the cache performance of
    other applications.  So e.g. on a large system doing lots of things you
    would want to leave this on (in its current incarnation).

    The idea behind the write-behind code is to flush out data blocks when
    enough data is present to be reasonably efficient to the disk.  Right now
    that is approximately 64KB of data, but 'small writes' do not trigger the
    clustering code, hence the 2K transition you are seeing.  The
    write-behind code also depresses the priority of the underlying VM pages,
    allowing them to be reused more quickly relative to other applications
    running in the system, the idea being that data written in large blocks
    is unlikely to be read again any time soon.
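    For anyone who wants to experiment with that tradeoff, here is a minimal
    userland sketch that reads, and optionally sets, the vfs.write_behind
    tunable through sysctlbyname(3).  Setting it requires root, and
    sysctl(8) from the shell does the same job; the program below is only
    an illustration.

/*
 * Minimal sketch: read (and optionally set) vfs.write_behind through
 * sysctlbyname(3).  Setting the value requires root.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    int cur, val;
    size_t len = sizeof(cur);

    if (sysctlbyname("vfs.write_behind", &cur, &len, NULL, 0) != 0) {
        perror("sysctlbyname(vfs.write_behind)");
        return (1);
    }
    printf("vfs.write_behind is currently %d\n", cur);

    if (argc > 1) {
        val = atoi(argv[1]);            /* e.g. 0 to disable write-behind */
        if (sysctlbyname("vfs.write_behind", NULL, NULL,
            &val, sizeof(val)) != 0) {
            perror("setting vfs.write_behind");
            return (1);
        }
        printf("vfs.write_behind set to %d\n", val);
    }
    return (0);
}

    Compile it with cc and run it with no argument to see the current
    setting, or with '0' or '1' to flip it before re-running a benchmark.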
    The second issue is our historically terrible filesystem syncer.  The
    write_behind greatly reduces the burden on the buffer cache and makes it
    work better.  If you turn it off, applications other than the benchmark
    trying to use the system will probably get pretty sludgy due to blockages
    in the buffer cache created by the benchmark.

    In FreeBSD-5 the vnode dirty/clean buffer list is now a splay tree, which
    is an improvement over what we had before, but the real issue with the
    filesystem syncer is the fact that it tries to write out every single
    dirty buffer associated with a file all at once.  What it really needs to
    do (and OpenBSD or NetBSD does this) is only write out up to X (1)
    megabytes of data, remember where it left off, and then proceed to the
    next dirty file.

    The write_behind code really needs to be replaced with something
    integrated into a filesystem syncer (as described above).  That is, it
    should detect the existence of a large amount of sequential dirty data
    and it should kick another thread to flush it out synchronously, but it
    should not try to do it itself asynchronously.  The big problem with
    trying to buffer that much data asynchronously is that you wind up
    blocking on the disk device when the file is removed, because so much
    I/O is marked 'in progress'.  The data set size should be increased from
    64KB to 1MB as well.  If the flushing can be done correctly it should be
    possible to have a good implementation of write_behind WITHOUT impacting
    cache performance.

    The third issue is the fact that we even have a buffer cache for things
    like read() that would be better served going directly to the VM object.
    I suspect that cache performance could be increased by a huge amount by
    having the file->read go directly to the VM object instead of recursing
    through 8 subroutine levels, instantiating, and garbage collecting the
    buffer cache.

:> clearly, Linux is not bothering to write out ANY data, and then able to
:> take advantage of the fact that the test file is being destroyed by
:> iozone (so it can throw away the data rather than write it out).  This
:> skews the numbers to the point where the benchmark doesn't even come close
:> to reflecting reality, though I do believe it points to an issue with
:> the BSDs ... the write_behind heuristic is completely out of date now
:> and needs to be reworked.
:
:http://www.iozone.org is what I was using to test with. Although right
:now, the box I am trying to put together is a Samba and NFS server for
:mostly static web content.
:
:In the not too distant future, a file server for IMAP/POP3 front ends. I
:think postmark does a good job at simulating that.
:
:Are there better benchmarks / methods of testing that would give a more
:fair comparison that you know of? I know all benchmarks have many caveats,
:but I am trying to approach this somewhat methodically. I am just about to
:start another round of testing with NFS using multiple machines pounding
:the one server. I was just going to run postmark on the 3 client machines
:(starting out at the same time).

    Boy, I just don't know.  Benchmarks have their uses, but the ones that
    simulate more than one process accessing the disk are almost certainly
    more realistic than the ones like iozone, which just run a single
    process and do best when they are allowed to monopolize the entire
    system.  Bonnie is probably more accurate than iozone; it at least
    tries a lot harder to avoid side effects from prior tests.

:Ultimately I don't give a toss if one is 10% or even 20% better than the
:other. For that money, a few hundred dollars in RAM and CPU would change
:that. We are mostly a BSD shop so I don't want to deploy a LINUX box for
:25% faster disk I/O. But if the differences are far more acute, I need to
:perhaps take a bit more notice.
:
:> The read tests are less clear.  iozone runs its read tests just after
:> it runs its write tests, so filesystem syncing and write flushing is
:> going to have a huge effect on the read numbers.  I suspect that this
:> is skewing the results across the spectrum.  In particular, I don't
:> see anywhere near the difference in cache-read performance between
:> FreeBSD-5 and DragonFly.  But I guess I'll have to load up a few test
:> boxes myself and do my own comparisons to figure out what is going on.
:>

    Well, the 4-way explains the cache performance on the read tests at
    least.  What you are seeing is the BGL removal in FreeBSD-5 versus DFly.
    Try it on a UP machine, though, and I'll bet the numbers will be
    reversed.

    In any case, the #1 issue that should be on both our plates is fixing up
    the filesystem syncer and modernizing the write_behind code.
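    The shape of that syncer change might look something like the sketch
    below.  It is a simplified userland model of the round-robin idea, not
    actual VFS code; struct dirty_file, the flush helper, and the 1MB
    per-file quota are assumptions made for the illustration.

/*
 * Illustrative sketch only: a simplified userland model of a round-robin
 * syncer pass, not actual FreeBSD VFS code.  The point is the shape of
 * the algorithm: flush at most ~1MB of a file's dirty data per pass,
 * remember where the pass stopped, and move on to the next dirty file.
 */
#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>

#define SYNCER_QUOTA    (1024 * 1024)   /* ~1MB of dirty data per file, per pass */

struct dirty_file {
    const char *name;
    off_t       dirty_len;              /* bytes of dirty data in the file */
    off_t       resume_off;             /* where the previous pass left off */
};

/* Stand-in for issuing the real I/O: flush up to 'len' bytes at 'off'. */
static size_t
flush_dirty_range(struct dirty_file *fp, off_t off, size_t len)
{
    size_t n = (size_t)(fp->dirty_len - off);

    if (n > len)
        n = len;
    printf("  %-10s flush %7zu bytes at offset %jd\n",
        fp->name, n, (intmax_t)off);
    return (n);
}

static void
syncer_pass(struct dirty_file *files, int nfiles)
{
    int i;

    for (i = 0; i < nfiles; i++) {
        struct dirty_file *fp = &files[i];

        if (fp->resume_off >= fp->dirty_len)
            continue;                   /* nothing left to flush here */
        /*
         * Flush at most SYNCER_QUOTA bytes, then move on to the next
         * file so one huge file cannot starve the others; resume_off
         * lets the next pass pick up where this one stopped.
         */
        fp->resume_off += flush_dirty_range(fp, fp->resume_off,
            SYNCER_QUOTA);
    }
}

int
main(void)
{
    struct dirty_file files[] = {
        { "bigfile",   8 * 1024 * 1024, 0 },
        { "smallfile",       64 * 1024, 0 },
    };
    int pass;

    for (pass = 1; pass <= 3; pass++) {
        printf("syncer pass %d:\n", pass);
        syncer_pass(files, 2);
    }
    return (0);
}

    Run for a few passes, the large file is drained a megabyte at a time
    while the small file is finished off on the first pass, which is the
    behaviour the per-file quota is meant to guarantee: no single large
    file can monopolize the syncer or the disk.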
						-Matt
						Matthew Dillon
						<dillon@backplane.com>