Date: Fri, 31 May 1996 18:11:18 +1700 (MST)
From: Terry Lambert <terry@lambert.org>
To: jmb@freefall.freebsd.org (Jonathan M. Bresler)
Cc: terry@lambert.org, davidg@Root.COM, rashid@rk.ios.com, jgreco@solaria.sol.net, hackers@FreeBSD.org
Subject: Re: Breaking ffs - speed enhancement?
Message-ID: <199606010111.SAA19301@phaeton.artisoft.com>
In-Reply-To: <199605312249.PAA15899@freefall.freebsd.org> from "Jonathan M. Bresler" at May 31, 96 03:49:43 pm
> > That they are written out during sync instead of being LRU'ed out is
> > *the* problem.  The sync does not have a sufficiently long cycle
> > time for decent write-gathering.
>
> [caveat: i am getting deeper into filesystems but have a long
> way to go yet.  please add a grain of salt to these comments.]
>
> how is our current file system different from that used in
> 4.2BSD?  i assume that we are now using extent based clustering
> similar to that described by McVoy and Kleiman in "Extent-like
> Performance of a Unix File System".  if so then increasing the
> update interval should have significant benefits in reducing
> the I/O soaked up by inode (metadata) updates.  would this give
> us a quasi-async filesystem?  one where we could delay updates
> to fit our tolerance for letting the metadata "hang out in
> the wind"?

The difference is that the VM cache has been unified.  One of the
consequences of this is that the cache locality model has changed.

Instead of a buffer cache that caches blocks from a given device,
the cached pages are hung off the vnode -- if the pages are in fact
data pages from an on-disk file.

The difference is pretty subtle.  The pages for FS data which aren't
actually data pages from a file end up in one of three states:

	o  dirty, pending write by sync
	o  clean, pending discard by sync
	o  locked for use (wired/pinned/whatever) and not subject to sync

Only pages actually representing file data have any real cache
persistence...  At least that's the way I'm reading the code.  We
*really* need to have design documentation as to what is affected by
whom, and when.

Really, for a depth-first traversal, I'm going to be hitting the
access time on directory inodes with a geometrically decreasing
frequency relative to where the traversal starts as we go higher in
the actual FS tree.  The likelihood that I won't be able to combine
several "mark for update + update" or "update" operations on a given
on-disk inode into a single write increases inversely with depth in
the tree.

Probably, I want to move the now-"dirty" inode data to the front of
the LRU each time it is touched, to delay the actual write until the
operation I'm doing that causes the multiple updates is done.
Unfortunately, the sync is going to write dirty pages, not simply
reduce the LRU contents down to a floating age marker in the LRU
("all items in the LRU older than this amount must be written").

There's also the problem of write clustering.  It's likely that
several inodes in the same cylinder group will want to be written as
one write to save on actual I/O, so clustering requirements may want
to override LRU requirements in some cases.  I don't know if this
should take the form of lame insertion of ready-to-be-written LRU
items in cluster order (delaying them), or if it should be time
demotion in the LRU for an adjacent item.

I guess it's a question of whether you take LRU position as meaning
"these blocks *want* to be written", or as "these blocks *don't* want
to be written".  Clearly, data integrity *wants* them written and
cache utility *doesn't* want them written.

I haven't really done any experimentation on which policy is best; I
suppose it would depend on the ratio of cache to disk, and on the
frequency of access compared to the proposed LRU latency.  John Dyson
is surely more qualified to comment on this than I am.


					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
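
The three states described above for non-file-data FS pages, written
down as a hypothetical enum purely to make the distinction explicit
(the names are made up for illustration; the real VM/buffer code does
not use them):

	enum fs_meta_page_state {
		FSPG_DIRTY,	/* dirty: will be written out by the next sync */
		FSPG_CLEAN,	/* clean: a candidate for discard at sync time */
		FSPG_LOCKED	/* wired/pinned for use; not subject to sync */
	};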
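
To make the "floating age marker" idea concrete, here is a rough
sketch, assuming a hypothetical LRU of cache entries with a per-entry
dirty timestamp; none of these types or names come from the actual
kernel code.  The periodic sync only pushes out entries that have
been dirty longer than some threshold, so an inode that is still
being touched (e.g. by a traversal in progress) keeps soaking up
updates instead of hitting the disk every sync interval:

	#include <stddef.h>
	#include <time.h>

	struct lru_entry {
		struct lru_entry *next;		/* toward the hot end of the LRU */
		time_t		dirtied_at;	/* when the entry went dirty */
		int		dirty;
		/* ... cached inode/metadata block would hang off here ... */
	};

	/* Stand-in for starting I/O on one dirty entry. */
	static void
	write_entry(struct lru_entry *e)
	{
		e->dirty = 0;
	}

	/*
	 * Walk from the cold end of the LRU; write only entries dirty
	 * for longer than max_age, leave younger dirty entries cached.
	 */
	void
	age_marker_sync(struct lru_entry *cold_end, time_t now, time_t max_age)
	{
		struct lru_entry *e;

		for (e = cold_end; e != NULL; e = e->next) {
			if (e->dirty && now - e->dirtied_at >= max_age)
				write_entry(e);
		}
	}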
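
Similarly, a rough sketch of clustering dirty inodes by cylinder
group before writing, so that several inodes sharing a group (and
hence an inode block) go out as one I/O.  Again, these are made-up
types, names, and layout constants rather than the FFS code; it only
illustrates the sort-then-coalesce idea:

	#include <stdlib.h>

	struct dirty_inode {
		unsigned long	ino;	/* inode number */
		/* ... in-core inode data ... */
	};

	#define INODES_PER_CG	2048	/* made-up layout constant */

	/* Sort key: which cylinder group an inode lives in. */
	static int
	cmp_by_cg(const void *a, const void *b)
	{
		unsigned long cga = ((const struct dirty_inode *)a)->ino / INODES_PER_CG;
		unsigned long cgb = ((const struct dirty_inode *)b)->ino / INODES_PER_CG;

		return (cga < cgb) ? -1 : (cga > cgb);
	}

	/* Stand-in for pushing a run of same-group inodes out as one write. */
	static void
	write_cluster(struct dirty_inode *run, size_t n)
	{
		(void)run;
		(void)n;
	}

	void
	flush_dirty_inodes(struct dirty_inode *v, size_t n)
	{
		size_t i, start;

		/* Make inodes from the same cylinder group adjacent. */
		qsort(v, n, sizeof(*v), cmp_by_cg);

		/* Emit one write per run of same-group inodes. */
		for (start = 0, i = 1; i <= n; i++) {
			if (i == n ||
			    v[i].ino / INODES_PER_CG != v[start].ino / INODES_PER_CG) {
				write_cluster(&v[start], i - start);
				start = i;
			}
		}
	}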