Date:      Fri, 1 Dec 2000 11:44:04 +0100 (CET)
From:      News History File User <newsuser@free-pr0n.netscum.dk>
To:        hackers@freebsd.org, dillon@earth.backplane.com
Cc:        usenet@tdk.net, soren@wasabisystems.com
Subject:   Re: vm_pageout_scan badness
Message-ID:  <200012011044.eB1Ai4353062@newsmangler.inet.tele.dk>

Long ago, it was written here on 25 Oct 2000 by Matt Dillon:

> :Consider that a file with a huge number of pages outstanding
> :should probably be stealing pages from its own LRU list, and
> :not the system, to satisfy new requests.  This is particularly
> :true of files that are demanding resources on a resource-bound
> :system.
> :...
> :                                       Terry Lambert
> :                                       terry@lambert.org
> 
>     This isn't exactly what I was talking about.  The issue in regards to
>     the filesystem syncer is that it fsync()'s an entire file.  If
>     you have a big file (e.g. a USENET news history file) the 
>     filesystem syncer can come along and exclusively lock it for
>     *seconds* while it is fsync()ing it, stalling all activity on
>     the file every 30 seconds.
[...]
>     One of the reasons why Yahoo uses MAP_NOSYNC so much (causing the problem
>     that Alfred has been talking about) is because the filesystem
>     syncer is 'broken' in regards to generating unnecessarily long stalls.
> 
>     Personally speaking, I would much rather use MAP_NOSYNC anyway, even with
>     a fixed filesystem syncer.   MAP_NOSYNC pages are not restricted by
>     the size of the filesystem buffer cache, so you can have a whole
>     lot more dirty pages in the system than you would normally be able to
>     have.  This 'feature' has had the unfortunate side effect of screwing
>     up *THWACK*

Yeah, no kidding -- here's what I see it screwing up.  First, some
background:

I've built three news machines, two transit boxen and one reader box,
with recent INN k0dez, and 4.2-STABLE of a few days ago (having tested
NetBSD, more on that later), and a brief detour into 5-current.

The two transit boxes have somewhere on the order of ~400MB memory
or less; the amount I've put in the reader box has increased up to a
Gig as I try to figure out what's happening.  I'm using MAP_NOSYNC on
the history database files on all three to try to get the NetBSD behaviour
of not hitting the history disk, and I've made a couple other minor tweaks
to use mmap where the INN history code probably should, but doesn't.
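
For reference, a minimal sketch of the kind of mapping I mean.  This is
not the INN code: map_history() and unmap_history() are made-up names,
and the #ifndef fallback is only there so the sketch also builds on
systems without the FreeBSD-specific MAP_NOSYNC flag.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* MAP_NOSYNC is FreeBSD-specific; elsewhere fall back to plain MAP_SHARED. */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/*
 * Hypothetical helper: map an existing, non-empty file read/write and
 * shared, asking the kernel not to flush its dirty pages from the
 * periodic syncer.  Returns MAP_FAILED on error; on success *lenp holds
 * the mapped length.
 */
static void *
map_history(const char *path, size_t *lenp)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && lenp != NULL);
    fd = open(path, O_RDWR);
    if (fd == -1)
        return (MAP_FAILED);
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return (MAP_FAILED);
    }
    *lenp = (size_t)st.st_size;
    p = mmap(NULL, *lenp, PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_NOSYNC, fd, 0);
    close(fd);    /* the mapping stays valid after close */
    return (p);
}

/* Write dirty pages back explicitly, then drop the mapping. */
static int
unmap_history(void *p, size_t len)
{
    int rc = msync(p, len, MS_SYNC);

    if (munmap(p, len) == -1)
        rc = -1;
    return (rc);
}
```

The explicit msync() before munmap() is a choice made for the sketch,
so that the data reaches the file at a time the program picks.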

Everything starts out well, where the history disk is beaten at startup
but as time passes, the time taken to do lookups and writes drops down
to near-zero levels, and the disk gets quiet.  And actually, the transit
machines stay that way, while the reader machine gives me problems after
some time.

What I notice is that the amount of memory used keeps increasing, until
it's all used, and the Free amount shown by `top' drops to a meg or so.
Cache and Buf get a bit, but most of it is Active.  Far more than is
accounted for by the processes.

Now, what happens on the reader machine is that after some time of the
Active memory increasing, it runs out and starts to swap out processes,
and the timestamps on the history database files (.index and .hash, this
is the md5-based history) get updated, rather than remaining at the
time INN is started.  Then the previously rapid history times skyrocket
until history takes more than 1/4 of the time.  I don't see this on the
transit boxen
even after days of operation.

Now, what happens when I stop INN and everything news-related is that
some memory is freed up, but still, there can be, say, 400MB still
reported as Active.  More when I had a full gig in this machine to
try to keep it from swapping, all of which got used...

Then, when I reboot the machine, it gives the kernel messages about
syncing disks; done, and then suddenly the history drive light goes
on and it starts grinding for five minutes or so, before the actual
reboot happens.

No history activity happens when I shut down INN normally, which should
free the MAP_NOSYNC'ed pages and make them available to be written to
disk before rebooting, maybe.


I'm also running BerkeleyDB for the reader overview on this machine,
and I just discovered that I had applied MAP_NOSYNC to an earlier
release, but the library linked in had not had this -- I just fixed
that and am running that way now (and see a noticeable improvement)
so now when I reboot, I may see both the overview database disk and
the history disk get some pre-reboot activity, if what I think is
happening really is happening.

What I think is happening, based on these observations, is that the
data from the history hash files (less than 100MB) gets read into
memory, but the updates to it are not written over the data to be
replaced -- they're simply appended, up to the limit of the available
memory.  When this limit is reached on the transit machines, then
things stabilize and old pages get recycled (but still, more memory
overall is used than the size of the actual file).

I'm guessing that additional activity of the reader machine causes
jumps in memory usage not seen on the transit machines, that is enough
to force some of the unwritten dirty pages to be written to the
history file, as a few megs of swap get used, which is why it does
not stabilize as `nicely' as the transit machines.


Now, something I contemplated -- it seems that Bad Undesirable Things
happen as soon as I start to actually swap, that I'd prefer to avoid.
What I'm wondering is if I can avoid this by adjusting some of the
values I see in `top' for Cache, Buf, and most importantly, Free.
May I ask where (in which source file) these ratios or limits or
whatever are set?  I'm hoping I can up the `Free' limit to a few
dozen megs to give headroom before actual swapping happens, since
now the Free value is a meg or two out of, oh, a gig available...


Anyway, once this happens, performance sucks rocks, the history
drive light is enough to read by (or should I say, it keeps me from
getting much-needed sleep), and apparently only a reboot can free
up memory for better purposes.

I've also only a small margin of memory headroom on the transit
machines, but much more on the reader machine, that can benefit
from cache far more, in case this makes any difference.  But I think
I also saw this steady increase when I first started with something
like 256M.



I just now noticed that you made a patch available just over a month
ago; I'm not sure if it would affect what I'm seeing here at all, or
if it's already in the recent source I've built.


And, in an earlier message in this thread, concerning something
related but different as far as I can make out:

>     Ouch.  The original VM code assumed that pages would not often be
>     ripped out from under the pagedaemon, so it felt free to restart
>     whenever.  I think you are absolutely correct in regards to the
>     clustering code causing nearby-page ripouts.
> 
>     I don't have much time available, but let me take a crack at the
>     problem tonight.  I don't think we want to add another workaround to
>     code that already has too many of them.  The solution may be
>     to create a dummy placemarker vm_page_t and to insert it into the pagelist
>     just after the current page after we've locked it and decided we have
>     to do something significant to it.  We would then be able to pick the
>     scan up where we left off using the placemarker.
> 
>     This would allow us to get rid of the restart code entirely, or at least
>     devolve it back into its original design (i.e. something that would not
>     happen very often).  Since we already have cache locality of reference for
>     the list node, the placemarker idea ought to be quite fast.
> 
>     I'll take a crack at implementing the openbsd (or was it netbsd?) partial
>     fsync() code as well, to prevent the update daemon from locking up large
>     files that have lots of dirty pages for long periods of time.
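
The placemarker idea in the quoted text can be sketched in userland
with the <sys/queue.h> TAILQ macros.  This is only an illustration
under my own assumptions: struct page, scan_with_marker() and
clean_and_rip_neighbour() are made-up names, not the kernel's vm_page_t
code and not the eventual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

struct page {
    TAILQ_ENTRY(page) link;
    int is_marker;    /* set only on a dummy placemarker */
    int dirty;
};
TAILQ_HEAD(pagelist, page);

/* Stand-in for "something significant" done to a page; it may remove
 * other pages from the list while it runs. */
typedef void (*page_work_fn)(struct pagelist *, struct page *);

/* Scan the list once; returns the number of dirty pages handled. */
static int
scan_with_marker(struct pagelist *q, page_work_fn work)
{
    struct page marker = { .is_marker = 1 };
    struct page *p, *next;
    int handled = 0;

    assert(q != NULL && work != NULL);
    for (p = TAILQ_FIRST(q); p != NULL; p = next) {
        if (p->is_marker || !p->dirty) {
            next = TAILQ_NEXT(p, link);
            continue;
        }
        /* Park the marker right after p before doing the work. */
        TAILQ_INSERT_AFTER(q, p, &marker, link);
        work(q, p);
        handled++;
        /* Resume from the marker rather than a possibly stale pointer. */
        next = TAILQ_NEXT(&marker, link);
        TAILQ_REMOVE(q, &marker, link);
    }
    return (handled);
}

/* Example work: clean p and rip out the next real page, to mimic the
 * nearby-page ripouts blamed on the clustering code. */
static void
clean_and_rip_neighbour(struct pagelist *q, struct page *p)
{
    struct page *n = TAILQ_NEXT(p, link);

    p->dirty = 0;
    while (n != NULL && n->is_marker)
        n = TAILQ_NEXT(n, link);
    if (n != NULL)
        TAILQ_REMOVE(q, n, link);
}
```

Because the scan continues from the marker, it never has to restart
from the head of the list when a neighbouring page disappears.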

My experience has been with NetBSD.  Whether or not OpenBSD has this
as well, I cannot say -- no experience.

Up until a day or so ago, NetBSD hasn't had a Unified Buffer Cache,
so you only had a fixed percentage of memory available for cache and
anything above that was free.  This is how I saw the amount of free
memory decrease, first at a rapid rate after I started news and the
history drive light got a workout, then slower with time, until it
would stabilize after some hours (I could speed this up by forcing
history lookups with the last few thousand message IDs).  Only when
you gave a `sync' command, or otherwise closed the history database,
would the data get flushed out to disk.  It didn't have the problem
I'm seeing with FreeBSD of using more memory than the size of the
file.  (The flush to disk is still somewhat random, so it seems, as
is the shutdown-time FreeBSD draining, from the time it takes and the
grinding noise I hear).  In other words, at least this much of the
code works the way I'd like to see MAP_NOSYNC work by keeping track
of the dirty pages.  (or something, it's not like I have any idea
what I'm talking about here, I'm just a clueless beginner)


The big drawback, a fatal problem even, with the NetBSD k0deZ is
that apparently new data isn't immediately available for a read.
I hadn't noticed any real problem with this on our transit machine
when it was NetBSD, but I wasn't looking as closely as when I used
it for a reader and started to see decidedly wrong data.

This failure happened in both the random-access history database,
where a lookup that used the hash file would fail for a few minutes,
and then after some number of seconds to minutes it would succeed,
and also with the sequentially-written article largefiles, where a
seek to the offset would return old data for some minutes, later
returning the correct article.  This wasn't always consistent, as
some articles would be written and delivered without problems, but
the number of error messages I was seeing was disturbingly high,
with irregular bursts.

(The INN k0de could be missing some call needed for the particular
NetBSD flavour of mmap, #defines that are revealed by the config.h
file but which don't seem to do much in the source apart from updating
the active file.  Again, I have no idea what I'm talking about, or who
is at fault.  Whatever, it was definitely b0rken and Not Much Fun.)

However, I can easily add enough memory to a NetBSD machine to hold
two copies of the history hash tables (for expire) and a bit more
needed by the system, and never come closer to touching swap than I
would expect, so this much it's definitely doing right.  Unfortunately,
the FreeBSD way of dealing with MAP_NOSYNC'ed history files (and,
when I reboot, I'll see if the BerkeleyDB file causes disk activity
too) doesn't appear to be much of an improvement over the every-30-
second freezes (though the lockups may be shorter if more frequent)
on one of the three machines, so I demand a refund.  Or something.


disclaimer:  i really don't know what I'm talking about, so be gentle
when flaming me, thanks
(reply-to header is valid)
barry bouwsma, thwarted in all my attempts to build a good readerbox






