Date: Sat, 2 Dec 2000 06:25:25 +0100 (CET)
From: News History File User <newsuser@free-pr0n.netscum.dk>
To: hackers@freebsd.org, dillon@earth.backplane.com
Cc: usenet@tdk.net
Subject: Re: vm_pageout_scan badness
Message-ID: <200012020525.eB25PPQ92768@newsmangler.inet.tele.dk>
In-Reply-To: <200012011918.eB1JIol53670@earth.backplane.com>
References: <200012011918.eB1JIol53670@earth.backplane.com>

> :> Personally speaking, I would much rather use MAP_NOSYNC anyway,
> even with
> :...
> :Everything starts out well, where the history disk is beaten at startup
> :but as time passes, the time taken to do lookups and writes drops down
> :to near-zero levels, and the disk gets quiet. And actually, the transit
> :...
> :What I notice is that the amount of memory used keeps increasing, until
> :it's all used, and the Free amount shown by `top' drops to a meg or so.
> :Cache and Buf get a bit, but most of it is Active. Far more than is
> :accounted for by the processes.
>
> This is to be expected, because the dirty MAP_NOSYNC pages will not
> be written out until they are forced out, or by msync().

I just discovered the user command `fsync' which has revealed a few
things to me, clearing up some mysteries. Also, I've watched more
closely the pattern of what happens to the available memory following
a fresh boot...

At the moment, this (reader) machine has been up for half a day, with
performance barely able to keep up with a full feed (but starting to
slip as the overnight burst of binaries is starting), but at last look,
history lookups and writes are accounting for more than half (!) of the
INN news process time, with available idle time being essentially zero.
So...

> :Now, what happens on the reader machine is that after some time of the
> :Active memory increasing, it runs out and starts to swap out processes,
> :and the timestamps on the history database files (.index and .hash, this
> :is the md5-based history) get updated, rather than remaining at the
> :time INN is started. Then the rapid history times skyrocket until it
> :takes more than 1/4 of the time. I don't see this on the transit boxen
> :even after days of operation.
>
> Hmm. That doesn't sound right. Free memory should drop to near zero,
> but then what should happen is the pageout daemon should come along
> and deactivate a big chunk of the 'active' pages... so you should
> see a situation where you have, say, 200MB worth of active pages
> and 200MB worth of inactive pages. After that the pageout daemon
> should start paging out the inactive pages and increasing the 'cache'.
> The number of 'free' pages will always be near zero, which is to be
> expected. But it should not be swapping out any process.

Here is what I noticed while watching the `top' values for Active,
Inactive, and Free following this last boot (I didn't pay any attention
to the other fields to notice any wild fluctuations there, next time
maybe), on this machine with 512MB of RAM, if it reveals anything:

Following the boot, things start out with plenty of memory Free, and
something like 4MB Active, which seems reasonable to me. Then I start
things.

As is to be expected, INN increases in size as it does history lookups
and updates, and the amount of memory shown as Active tracks this, more
or less. But what's happening to the Free value! It's going down at
as much as 4MB per `top' interval. Or should I say, what is happening
to the Inactive value -- it's constantly increasing, and I observe a
rapid migration of all the Free memory to Inactive, until the value of
Inactive peaks out at the time that Free drops to about 996k, beyond
which it changes little. None of the swap space has been touched yet.

As soon as the value for Free hits bottom and that of Inactive has
reached a max, now the migration happens from Inactive to Active --
until this point, the value of Active has been roughly what I would
expect to see, given the size of the history hash/index files, and the
BerkeleyDB file I'm now using MAP_NOSYNC as well for a definite
improvement in overview access times.

Anyway, I don't remember what values exactly I was seeing for Free and
Inactive or Active, since I was just watching for general trends, but
I seem to recall Active being ~100MB, and Inactive somewhat more.

(Are you saying above that this Inactive value should be migrating to
Cache, which I'm not seeing, rather than to Active, which I do see?
If so, then hmmm.)

Now memory is drifting at a fairly rapid pace from Inactive (the meaning
of which I'm not exactly clear about, although there's some explanation
in the `top' man page that hasn't quite clicked into understanding yet),
over to the Active field, at something like 2MB or so per `top' interval.
Free remains close to 1MB, but Active is constantly growing, although
no processes are clearly taking up any of this, apart from INN which
only accounts for around 100MB at this time, and isn't increasing at
the rate of increase of Active memory.

Anyway, the Active field continues to increase as Inactive decreases
until finally Inactive bottoms out, down from several hundred MB to a
one or two digit MB value (I don't remember exactly), while Active has
increased to almost 400MB. This is something like 20 minutes after the
reboot, and now the first bit of swap gets hit.

However, the value of Active has hit its peak and resting value, ~400MB
with 512MB RAM, and I recall it being about 800-something with a full
GB; Inactive varies some number of MB either side of 10MB, Free stays
near 1MB, Cache seems to be between 10 and 20MB, Buf is about 60, and
wired is around 76MB.

The amount of swap has increased over time, from half a meg where it
was for some time after being hit, up to 18MB now. It periodically
sees some activity.

The RES size of innd is ~120MB, it has a 108MB .hash and 72MB .index
file, both NOSYNC.

I'm considering recompiling INN and BerkeleyDB without the MAP_NOSYNC
to see what the reference level of history and overview access time
and lockups during updates are, as well as the values of memory usage
in that case, to see if the suspicions I have are correct or if I need
to look elsewhere.

> The actual amount of 'free' memory in the system is actually 'free+cache'
> pages.
>
> :Now, what happens when I stop INN and everything news-related is that
> :some memory is freed up, but still, there can be, say, 400MB still
> :reported as Active. More when I had a full gig in this machine to
> :...
> :
> :Then, when I reboot the machine, it gives the kernel messages about
> :syncing disks; done, and then suddenly the history drive light goes
> :on and it starts grinding for five minutes or so, before the actual
> :reboot happens.
>
> Right. This is to be expected. You have a lot of dirty pages
> in the system due to the use of MAP_NOSYNC that have to be flushed
> out.

Yep, doing research and reading man pages and watching closely *after*
I sent the last mail, not before, has opened my eyes to a few things.
And as I noted, manually `fsync /news/db/history.index/hash'-ing has
helped too. Still doesn't mean I know what I'm doing, but...

> :No history activity happens when I shut down INN normally, which should
> :free the MAP_NOSYNC'ed pages and make them available to be written to
> :disk before rebooting, maybe.
>
> MAP_NOSYNC pages are not flushed when the referencing program exits.
> They stick around until they are forced out. You can flush them
> manually by using a mmap()/msync() combination. i.e. an msync() prior
> to munmap()ing (from INND only) ought to do it.

Oh well, somehow I had the idea (from the mmap manpage?) that when all
programs referencing them exited, they would, or might, be flushed.
Apparently the problem I noted with NetBSD requires use of msync(), now
that the multiple-article files are getting mmap'ed, and same with the
history. Which will probably kill performance there, until their
mmap/caches gets fixed. So forget I mentioned that in the last mail.

Now, I haven't tried the userland `fsync' while the Active memory was
at sane levels, but what I saw on a quiet system where I had shut down
innd was that of about 300MB Active, the fsync of both history files
would free only about 100MB.

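For reference, the mmap()/msync() combination suggested in the quoted
text could look roughly like the sketch below. It is not taken from the
INN source: update_and_flush() is a hypothetical helper, and the
MAP_NOSYNC fallback to 0 is an assumption added only so the sketch
builds on systems that lack the FreeBSD-specific flag.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* MAP_NOSYNC is FreeBSD-specific; elsewhere fall back to a plain
 * shared mapping (assumption made so the sketch compiles anywhere). */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/*
 * Hypothetical helper: map an existing file shared + NOSYNC, dirty one
 * byte through the mapping, then msync() before munmap() so the dirty
 * pages are written back instead of lingering after the process exits.
 * Returns 0 on success, -1 on any failure.
 */
int update_and_flush(const char *path, off_t offset, char value)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0 ||
        offset < 0 || offset >= st.st_size) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_NOSYNC, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[offset] = value;               /* dirties one page of the mapping */

    int rc = msync(map, len, MS_SYNC); /* force the dirty pages to disk */
    if (munmap(map, len) < 0)
        rc = -1;
    close(fd);
    return rc;
}
```

A long-running daemon would presumably keep the mapping open and call
msync() only at shutdown (or at chosen checkpoints), since syncing after
every update would give back most of what MAP_NOSYNC buys.
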
> :What I think is happening, based on these observations, is that the
> :data from the history hash files (less than 100MB) gets read into
> :memory, but the updates to it are not written over the data to be
> :replaced -- it's simply appended to, up to the limit of the available
> :memory. When this limit is reached on the transit machines, then
> :things stabilize and old pages get recycled (but still, more memory
> :overall is used than the size of the actual file).
>
> It doesn't append... the pages are reused. The set of 'active'
> pages in the VM system is effectively the set of all files accessed
> for the entire system, not just MAP_NOSYNC pages. If you are only
> MAP_NOSYNC'ing 100MB worth of pages, then only 100MB worth of pages
> will be left unflushed.

Yeah, that's what I learned. So my idea about how the amount of RAM
that would appear as Active increases to fill as much of the available
memory turned out to be wrong. I still can't explain that. I wish I
hadn't returned the rest of the GB of RAM I had, just to see how the
numbers are on a system with far more headroom.

The sizes of the two NOSYNC'ed history files are as above (~180MB
total), and I believe the particular BerkeleyDB file is <18MB. Meaning
a total of maximum <200MB unflushed data now. Which is half of what is
shown as Active. Hmmm.

> Is it possible that history file rewriting is creating an issue? Doesn't
> INN rewrite the history file every once in a while to clear out old
> garbage? I'm not up on the latest INN.

In normal operation, no -- the text file is append-only (the text file
isn't used for lookups with the MD5-based hashing), and expire, which
I'm running manually, rewrites the hash files -- leading to a mysterious
lack of space today when I attempted to run both expire and makedbz (a
variant of makehistory), and apparently some reader processes or some
daemons still had the old inodes open, until suddenly in one swell foop,
some 750MB was freed up -- far more than I expected to see, so I should
probably look into this space usage sometime...

This shouldn't be a problem the way I'm running things now. I haven't
run an expire process since the last reboot to observe things closely.

> :I'm guessing that additional activity of the reader machine causes
> :jumps in memory usage not seen on the transit machines, that is enough
> :to force some of the unwritten dirty pages to be written to the
> :history file, as a few megs of swap get used, which is why it does
> :not stabilize as `nicely' as the transit machines.
>
> This makes sense... the amount of swap that gets used is critical.
> If we are talking about only a few megabytes, then your system is
> *not* swapping significantly, it is simply swapping out completely
> idle pages from things like idle getty's and such. This is a good
> thing. The disk activity would thus be mostly due to MAP_NOSYNC pages
> being written out.

Yeah, but that disk activity seems to be the cause for the history
timings in INN taking from (now) 20% up to (earlier) 50% of the
available time, which isn't nice. I mean, if I have a GB of silicon
to toss at the thing, you think I'd be able to bribe the VM system
into keeping the NOSYNC pages where I want 'em, like the way they
are during the first umpteen minutes following a reboot...

> :Anyway, once this happens, performance sucks rocks, the history
> :drive light is enough to read by (or should I say, it keeps me from
> :getting much-needed sleep), and apparently only a reboot can free
> :up memory for better purposes.
>
> Well, the performance sucking part means something is not working
> as designed. The question is what.

Tell ya what -- I'll bring things down after I send this (I'm writing
this from the console), use `fsync' on the three files and observe
disk activity and `top', I'll put an INN without the MAP_NOSYNC in
place (but leave BerkeleyDB as is for now), and I'll see just how low
I can get Active on a quiet system that had been running to be, before
I reboot. Then I'll try un-NOSYNC'ing BerkeleyDB as well, and using
that as reference, to see just which end I'm talking out of.

If I see a big difference without NOSYNC, I'll scream bloody murder.
If I don't, I'll also be screaming but you won't hear it since I'll
be eating my old socks at the same time. Maybe then I'll figure out
what else I can do...

> Here is what I would recommend. First, I would use 'systat -vm 1'
> and carefully examine the pageout/swapout activity. If the SWAP PAGER
> has no significant activity then we can discard it as a possible problem.
> If the VN PAGER has significant activity, then this is what we need
> to focus on.

Swap pager mostly idle. VN pager constantly from ~10 to ~30 `in'
(count), pages in varying wildly from ~20 to ~150. Or more. Column
`out' sees some activity, usually a single digit once every few seconds,
with bursts of two- or three-digit numbers maybe once or twice every
~10-odd seconds.

> I would try changing the pageout and VM cache parameters. Do NOT mess
> with the VM free parameters! Try changing the vm.v_cache_min and
> vm.v_cache_max parameters. For example, increase vm.v_cache_max to
> widen the hysteresis. You can also try changing vm.pageout_algorithm
> from 0 to 1 (this is not likely to have much of an effect), and you
> can also try increasing vm.max_page_launder, e.g. from 32 to
> 100 (much larger would not have any effect). Finally, you can
> try increasing the vm.v_inactive_target. Do not increase the
> vm.v_free_target.

Thanks, I may not do anything with these right now, unless you tell
me that everything else I'm observing is peachy-keen-groovy...

> Last thing: Using MAP_NOSYNC has a well known problem when used to
> fill 'holes' in files. That is, if the history file is being appended
> to by calling ftruncate(), but the new space is not write()n to and
> instead is dirtied via the mmap, you will have a serious fragmentation
> problem with the file. In order to avoid this problem any file appends
> should occur using write() if possible, or the newly allocated space in
> the file should be filled with zero's using write() prior to being
> random-accessed by mmap() (which might be easier to implement).

Hmmm. If I'm reading the k0de right, the new files are in fact created
with ftruncate() when they are initialized, but then, in theory, you
should be moving these files into place, and starting INN, which would
read from (and mmap) these (nominally) fixed-length files. After this,
apart from overflow conditions (which I'll have to look at), the
hash/index files should not be appended, but new data inserted, by,
hmmm, memcpy or something...

I don't think this is the case here -- the append-only text file is not
NO_SYNC'ed (I don't even think it's mmap'ed, but ICBW) -- the hash and
index files are fixed size hash files (well, nominally fixed, until they
overflow, which I tried to avoid by doing the makedbz), and the files,
once initialized, are read off the disk -- not sure about that step of
initialization (makedbz/makehistory)... Must wake up and read source.

thanks!

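The zero-fill approach recommended in the quoted text (write() real
zeros into the new space before it is touched through mmap(), rather
than leaving an ftruncate() hole) might be sketched as follows. This is
only an illustration under assumptions: extend_with_zeros() and
create_prefilled() are hypothetical names, not functions from INN,
makedbz or makehistory, and the 8192-byte chunk size is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: grow the file behind fd to new_size by appending
 * zero-filled blocks with write(), so the blocks are allocated in order
 * on disk before anything dirties them at random through mmap().
 * Does nothing if the file is already at least new_size bytes long.
 * Returns 0 on success, -1 on failure.
 */
int extend_with_zeros(int fd, off_t new_size)
{
    static const char zeros[8192];      /* static storage: all zero bytes */

    assert(new_size >= 0);

    off_t cur = lseek(fd, 0, SEEK_END); /* position at the current end */
    if (cur < 0)
        return -1;

    while (cur < new_size) {
        size_t chunk = sizeof zeros;
        if ((off_t)chunk > new_size - cur)
            chunk = (size_t)(new_size - cur);

        ssize_t n = write(fd, zeros, chunk);
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted: retry the write */
        if (n <= 0)
            return -1;
        cur += n;                       /* short writes are simply resumed */
    }
    return 0;
}

/*
 * Hypothetical helper: create (or open) a fixed-size file and make sure
 * it is fully backed by written zeros. Returns an open descriptor that
 * is ready to be mmap()ed, or -1 on failure.
 */
int create_prefilled(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (extend_with_zeros(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The trade-off is an up-front sequential write of the whole file at
creation time, in exchange for a layout that later random updates
through the mapping cannot fragment.
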
barry bouwsma

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message