From owner-freebsd-hackers Wed Oct 25 15:19:54 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (Postfix) with ESMTP id B1CF537B479;
	Wed, 25 Oct 2000 15:19:48 -0700 (PDT)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.9.3/8.9.3) id OAA01187;
	Wed, 25 Oct 2000 14:57:58 -0700 (MST)
Received: from usr01.primenet.com(206.165.6.201) via SMTP
	by smtp04.primenet.com, id smtpdAAA1taOK2;
	Wed Oct 25 14:51:44 2000
Received: (from tlambert@localhost)
	by usr01.primenet.com (8.8.5/8.8.5) id OAA05765;
	Wed, 25 Oct 2000 14:54:42 -0700 (MST)
From: Terry Lambert
Message-Id: <200010252154.OAA05765@usr01.primenet.com>
Subject: Re: vm_pageout_scan badness
To: dillon@earth.backplane.com (Matt Dillon)
Date: Wed, 25 Oct 2000 21:54:42 +0000 (GMT)
Cc: tlambert@primenet.com (Terry Lambert), bright@wintelcom.net (Alfred Perlstein),
	ps@FreeBSD.ORG, hackers@FreeBSD.ORG
In-Reply-To: <200010251642.e9PGguj26737@earth.backplane.com>
	from "Matt Dillon" at Oct 25, 2000 09:42:56 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
>
> This isn't exactly what I was talking about.  The issue in regards
> to the filesystem syncer is that it fsync()'s an entire file.  If
> you have a big file (e.g. a USENET news history file) the
> filesystem syncer can come along and exclusively lock it for
> *seconds* while it is fsync()ing it, stalling all activity on
> the file every 30 seconds.

This seems like a broken (non)use of the _SYNC parameters, but I do
now remember the FreeBSD breakage in the dirty page sync case: not
knowing which pages actually need to be sync'ed, so that in the mmap
region case an msync() degrades to an fsync() of the whole file.  I
guess fixing O_WRITESYNC or msync() is not an option?

> The current VM system already does a good job in allowing files
> to steal pages from themselves.  The sequential I/O detection
> heuristic depresses the priority of pages as they are read, making
> it more likely for them to be reused.  Since sequential I/O tends
> to be the biggest abuser of the file cache, the current FreeBSD
> algorithms work well in real-life situations.  We also have a few
> other optimizations in there to reuse pages, which I added a year
> or so ago (or fixed up, in the case of the sequential detection
> heuristic).

The biggest abuser of this that I have seen is actually not
sequential.  It is a linker that mmap()'s the object files, and then
seeks all over creation to do the link, forcing all other pages out
of core.  I think the assumption that this is a sequential access
problem, instead of a more general problem, is a bad one (FWIW,
building per-vnode working set quotas fixed the problem with the
linker being antagonistic).

> One of the reasons why Yahoo uses MAP_NOSYNC so much (causing
> the problem that Alfred has been talking about) is because the
> filesystem syncer is 'broken' in regards to generating
> unnecessarily long stalls.

It doesn't stall when it should?  8-) 8-).

I think this is a case of needing to eventually pay the piper for
the music being played.
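(To make the usage pattern being discussed concrete: a MAP_NOSYNC
mapping in application code looks roughly like the sketch below.
This is only an illustration -- the file name and sizes are invented,
not anyone's actual code -- but it shows where the piper does get
paid: nothing flushes the dirty pages until the application itself
calls msync().)

	/*
	 * Sketch of a MAP_NOSYNC file mapping.  The filesystem syncer
	 * will not flush these pages on its 30 second pass; they stay
	 * dirty until the application syncs them explicitly.
	 */
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <err.h>

	int
	main(void)
	{
		size_t len = 16 * 1024 * 1024;	/* example size */
		char *p;
		int fd;

		if ((fd = open("history.dat", O_RDWR)) < 0)
			err(1, "open");
		if (ftruncate(fd, (off_t)len) < 0)
			err(1, "ftruncate");

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_NOSYNC, fd, 0);
		if (p == MAP_FAILED)
			err(1, "mmap");

		p[0] = 1;			/* dirty a page */

		/* Pay the piper when the application chooses to. */
		if (msync(p, len, MS_SYNC) < 0)
			err(1, "msync");

		munmap(p, len);
		close(fd);
		return (0);
	}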
If the pages are truly anonymous, then they don't need to be
sync'ed; if they aren't, then they do.  It sounds to me that if they
are seeing long stalls, it's the msync() bug of not being able to
tell what's dirty and what's clean...

> Personally speaking, I would much rather use MAP_NOSYNC anyway,
> even with a fixed filesystem syncer.  MAP_NOSYNC pages are not
> restricted by the size of the filesystem buffer cache,

I see this as a bug in the non-MAP_NOSYNC case, in FreeBSD's use of
vnodes as synonyms for vm_object_t's.  I really doubt, though, that
they are exceeding the maximum file size with a mapping; if not,
then the issue is tuning.  The limits on the size of the FS buffer
cache are arbitrary; it should be possible to relax them.

Again, I think the biggest problem here is historical, and it
derives from the ability to dissociate a vnode, with pages still
hung off it, from the backing inode (a cache bust).  I suspect that
if they increased the size of the ihash cache, they would see much
better characteristics.

My personal preference would be to not dissociate valid but clean
pages from the referencing object until absolutely necessary.  An
easy fix for this would be to allow the FS to own the vnodes,
instead of drawing them from a fixed-size pool, and have a struct
like:

	struct ufs_vnode {
		struct vnode			uv_vnode;	/* must be first */
		struct ufs_in_core_inode	uv_inode;
	};

You would pass that around as if it were just a vnode, and reclaim
it by giving it back to the VFS that owned it, instead of using a
system reclaim method.  Then if an ihash reclaim was wanted, it
would have to free up the vnode resources to get it.

Using high and low watermarks instead of a fixed pool would complete
the picture (the use of a fixed per-FS ihash pool in combination
with a high/low-watermarked per-system vnode pool is part of what
causes the problem in the first place; an analytical mechanic or
electronics buff would call this a classic case of "impedance
mismatch").

> so you can have a whole lot more dirty pages in the system
> than you would normally be able to have.

E.g. they are working around an arbitrary, and wrong-for-them,
administrative limit, instead of changing it.  Bletch.

> This 'feature' has had the unfortunate side effect of screwing
> up the pageout daemon's algorithms, but that's fixable.

I think the idea of a fixed limit on the FS buffer cache is probably
wrong in the first place; certainly, there must be high and low
reserves, but:

|----------------------------------------------|   all of memory
             |---------------------------------|   FS allowed use
|-------------------------------------|            non-FS allowed use
|------------|                                     non-FS reserve
                                      |--------|   FS reserve

...in other words, a reserve-based system, rather than a limit-based
system.  It would result in the same effect, without damaging the
natural limits for any given loading characteristics that are
arrived at through hysteresis effects, right?

NB: The reserve sizes in the diagram are vastly exaggerated, to keep
them from being just vertical bars.

It has always amazed me that the system limits natural load as if
all systems were to be used as general purpose systems by
interactive users; you could achieve that effect by making larger
reserves for interactive use, but you can't achieve the opposite
effect by diddling administrative limits that aren't already
predicated on a reserve model.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.