From owner-freebsd-hackers Wed Oct 25 15:19:54 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (Postfix) with ESMTP id B1CF537B479;
	Wed, 25 Oct 2000 15:19:48 -0700 (PDT)
Received: (from daemon@localhost)
	by smtp04.primenet.com (8.9.3/8.9.3) id OAA01187;
	Wed, 25 Oct 2000 14:57:58 -0700 (MST)
Received: from usr01.primenet.com(206.165.6.201) via SMTP
	by smtp04.primenet.com, id smtpdAAA1taOK2;
	Wed Oct 25 14:51:44 2000
Received: (from tlambert@localhost)
	by usr01.primenet.com (8.8.5/8.8.5) id OAA05765;
	Wed, 25 Oct 2000 14:54:42 -0700 (MST)
From: Terry Lambert
Message-Id: <200010252154.OAA05765@usr01.primenet.com>
Subject: Re: vm_pageout_scan badness
To: dillon@earth.backplane.com (Matt Dillon)
Date: Wed, 25 Oct 2000 21:54:42 +0000 (GMT)
Cc: tlambert@primenet.com (Terry Lambert), bright@wintelcom.net (Alfred Perlstein),
	ps@FreeBSD.ORG, hackers@FreeBSD.ORG
In-Reply-To: <200010251642.e9PGguj26737@earth.backplane.com>
	from "Matt Dillon" at Oct 25, 2000 09:42:56 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
>
> This isn't exactly what I was talking about.  The issue in regards
> to the filesystem syncer is that it fsync()'s an entire file.  If
> you have a big file (e.g. a USENET news history file) the
> filesystem syncer can come along and exclusively lock it for
> *seconds* while it is fsync()ing it, stalling all activity on
> the file every 30 seconds.

This seems like a broken (non)use of the _SYNC parameters, but I do
now remember the FreeBSD breakage in the dirty page sync case: not
knowing which pages actually need to be sync'ed, so that in the mmap
region case an msync() degrades to an fsync() of the whole file.  I
guess fixing O_WRITESYNC or msync() is not an option?

> The current VM system already does a good job in allowing files
> to steal pages from themselves.  The sequential I/O detection
> heuristic depresses the priority of pages as they are read, making
> it more likely for them to be reused.  Since sequential I/O tends
> to be the biggest abuser of the file cache, the current FreeBSD
> algorithms work well in real-life situations.  We also have a few
> other optimizations in there to reuse pages, which I added a year
> or so ago (or fixed up, in the case of the sequential detection
> heuristic).

The biggest abuser of this that I have seen is actually not
sequential.  It is a linker that mmap()'s the object files, and then
seeks all over creation to do the link, forcing all other pages out
of core.  I think the assumption that this is a sequential access
problem, instead of a more general problem, is a bad one (FWIW,
building per-vnode working set quotas fixed the problem with the
linker being antagonistic).

> One of the reasons why Yahoo uses MAP_NOSYNC so much (causing
> the problem that Alfred has been talking about) is because the
> filesystem syncer is 'broken' in regards to generating
> unnecessarily long stalls.

It doesn't stall when it should?  8-) 8-).

I think this is a case of needing to eventually pay the piper for
the music being played.
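(To make the usage pattern being discussed concrete: a MAP_NOSYNC
mapping in application code looks roughly like the sketch below.
This is only an illustration -- the file name and sizes are invented,
not anyone's actual code -- but it shows where the piper does get
paid: nothing flushes the dirty pages until the application itself
calls msync().)

	/*
	 * Sketch of a MAP_NOSYNC file mapping.  The filesystem syncer
	 * will not flush these pages on its 30 second pass; they stay
	 * dirty until the application syncs them explicitly.
	 */
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <err.h>

	int
	main(void)
	{
		size_t len = 16 * 1024 * 1024;	/* example size */
		char *p;
		int fd;

		if ((fd = open("history.dat", O_RDWR)) < 0)
			err(1, "open");
		if (ftruncate(fd, (off_t)len) < 0)
			err(1, "ftruncate");

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_NOSYNC, fd, 0);
		if (p == MAP_FAILED)
			err(1, "mmap");

		p[0] = 1;			/* dirty a page */

		/* Pay the piper when the application chooses to. */
		if (msync(p, len, MS_SYNC) < 0)
			err(1, "msync");

		munmap(p, len);
		close(fd);
		return (0);
	}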
If the pages are truly anonymous, then they don't need to be
sync'ed; if they aren't, then they do.  It sounds to me that if they
are seeing long stalls, it's the msync() bug of not being able to
tell what's dirty and what's clean...

> Personally speaking, I would much rather use MAP_NOSYNC anyway,
> even with a fixed filesystem syncer.  MAP_NOSYNC pages are not
> restricted by the size of the filesystem buffer cache,

I see this as a bug in the non-MAP_NOSYNC case, in FreeBSD's use of
vnodes as synonyms for vm_object_t's.  I really doubt, though, that
they are exceeding the maximum file size with a mapping; if not,
then the issue is tuning.  The limits on the size of the FS buffer
cache are arbitrary; it should be possible to relax them.

Again, I think the biggest problem here is historical, and it
derives from the ability to dissociate a vnode, with pages still
hung off it, from the backing inode (a cache bust).  I suspect that
if they increased the size of the ihash cache, they would see much
better characteristics.

My personal preference would be to not dissociate valid but clean
pages from the referencing object until absolutely necessary.  An
easy fix for this would be to allow the FS to own the vnodes,
instead of drawing them from a fixed-size pool, and have a struct
like:

	struct ufs_vnode {
		struct vnode			uv_vnode;	/* must be first */
		struct ufs_in_core_inode	uv_inode;
	};

You would pass that around as if it were just a vnode, and reclaim
it by giving it back to the VFS that owned it, instead of using a
system reclaim method.  Then if an ihash reclaim was wanted, it
would have to free up the vnode resources to get it.

Using high and low watermarks instead of a fixed pool would complete
the picture (the use of a fixed per-FS ihash pool in combination
with a high/low-watermarked per-system vnode pool is part of what
causes the problem in the first place; an analytical mechanic or
electronics buff would call this a classic case of "impedance
mismatch").

> so you can have a whole lot more dirty pages in the system
> than you would normally be able to have.

E.g. they are working around an arbitrary, and wrong-for-them,
administrative limit, instead of changing it.  Bletch.

> This 'feature' has had the unfortunate side effect of screwing
> up the pageout daemon's algorithms, but that's fixable.

I think the idea of a fixed limit on the FS buffer cache is probably
wrong in the first place; certainly, there must be high and low
reserves, but:

|----------------------------------------------|   all of memory
             |---------------------------------|   FS allowed use
|-------------------------------------|            non-FS allowed use
|------------|                                     non-FS reserve
                                      |--------|   FS reserve

...in other words, a reserve-based system, rather than a limit-based
system.  It would result in the same effect, without damaging the
natural limits for any given loading characteristics that are
arrived at through hysteresis effects, right?

NB: The reserve sizes in the diagram are vastly exaggerated, to keep
them from being just vertical bars.

It has always amazed me that the system limits natural load as if
all systems were to be used as general purpose systems by
interactive users; you could achieve that effect by making larger
reserves for interactive use, but you can't achieve the opposite
effect by diddling administrative limits that aren't already
predicated on a reserve model.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.