Date:      Sun, 3 Dec 2000 08:00:33 +0100 (CET)
From:      News History File User <newsuser@free-pr0n.netscum.dk>
To:        hackers@freebsd.org, dillon@earth.backplane.com
Cc:        usenet@tdk.net
Subject:   Re: vm_pageout_scan badness
Message-ID:  <200012030700.eB370XJ22476@newsmangler.inet.tele.dk>
In-Reply-To: <200012021904.eB2J4An63970@earth.backplane.com>
References:  <200012011918.eB1JIol53670@earth.backplane.com> <200012020525.eB25PPQ92768@newsmangler.inet.tele.dk> <200012021904.eB2J4An63970@earth.backplane.com>

> :but at last look, history lookups and writes are accounting for more
> :than half (!) of the INN news process time, with available idle time
> :being essentially zero.  So...
> 
>     No idle time?  That doesn't sound like blocked I/O to me, it sounds
>     like the machine has run out of cpu.

Um, I knew I'd be unclear somehow.   The machine itself (with 2 CPUs)
has plenty of idle time -- `top' typically reports 70-80% idle, and
INN takes 20-40% of a CPU (this being SMP, a process like `perl'
locked to one CPU shows up around 98%, unlike a certain other OS that
reports that percentage against the system total rather than against
a particular CPU).

What I mean is that the INN process timer -- basically Joe Greco's
timer, which wraps the key functions with start/stop timer calls to
show where INN spends its time -- is reporting little to no idle time,
meaning INN couldn't take in more articles no matter how hard I push
them.  Let me show you the timer stats from shortly after I started
things on this reader machine, while it's taking in backlogs (a rough
sketch of how the timer wrapping works follows the stats below):


Dec  3 04:33:47 crotchety innd: ME time 300449 idle 376(4577)
 all times in milliseconds: elapsed time^^^^^^=5min ^^^idle time (numbers
in parentheses are number of calls; only significant in calls like
artwrite to show how many articles were actually written to spool,
hiswrite to show how many unique articles were received over this
time period, and hishave to show how many history lookups were done)
 artwrite 52601(6077) artlink 0(0) hiswrite 40200(7035) hissync 11(14)
          ^^^ 53 seconds writing articles   ^^ 40 seconds updating history
 sitesend 647(12154) artctrl 2297(308) artcncl 2288(308) hishave 38857(26474)
                                39 seconds doing history lookups ^^
 hisgrep 70(111) artclean 12264(6930) perl 13819(6838) overv 112176(6077)
 python 0(0) ncread 13818(21287) ncproc 284413(21287) 

Dec  3 04:38:48 crotchety innd: ME time 301584 idle 406(5926) artwrite 55774(6402) artlink 0(0) hiswrite 25483(7474) hissync 15(15) sitesend 733(12805) artctrl 1257(322) artcncl 1245(321) hishave 22114(28196) hisgrep 90(38) artclean 12757(7295) perl 14696(7191) overv 136855(6402) python 0(0) ncread 14446(23235) ncproc 284767(23235) 

(as time passes and more of the MAP_NOSYNC file is in memory, the time
needed for history writes/lookups drops)
[...]
Dec  3 04:58:49 crotchety innd: ME time 300047 idle 566(6272) artwrite 59850(6071) artlink 0(0) hiswrite 11630(6894) hissync 33(14) sitesend 692(12142) artctrl 324(244) artcncl 320(244) hishave 13614(24312) hisgrep 0(77) artclean 13232(6800) perl 14531(6727) overv 156723(6071) python 0(0) ncread 15116(23838) ncproc 281745(23838) 
Dec  3 05:03:49 crotchety innd: ME time 300018 idle 366(5936) artwrite 56956(6620) artlink 0(0) hiswrite 8850(7749) hissync 7(15) sitesend 760(13240) artctrl 255(160) artcncl 255(160) hishave 9944(25198) hisgrep 0(31) artclean 13441(7753) perl 15605(7620) overv 164223(6620) python 0(0) ncread 14783(24123) ncproc 282791(24123) 
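
Speaking of that wrapping -- for anyone who hasn't seen it, it boils
down to something like this.  A minimal sketch in C of the idea, not
INN's actual code; the timer slots and function names here are made
up for illustration:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/time.h>

  enum { TMR_HISHAVE, TMR_HISWRITE, TMR_MAX };   /* made-up timer slots */

  static struct timeval tmr_begin[TMR_MAX];      /* start time per timer */
  static double tmr_msec[TMR_MAX];               /* accumulated milliseconds */
  static unsigned long tmr_count[TMR_MAX];       /* number of calls */

  static void timer_start(int t) { gettimeofday(&tmr_begin[t], NULL); }

  static void timer_stop(int t)
  {
      struct timeval now;
      gettimeofday(&now, NULL);
      tmr_msec[t] += (now.tv_sec  - tmr_begin[t].tv_sec)  * 1000.0
                   + (now.tv_usec - tmr_begin[t].tv_usec) / 1000.0;
      tmr_count[t]++;
  }

  int main(void)
  {
      /* pretend this sleep is one history lookup being bracketed */
      timer_start(TMR_HISHAVE);
      usleep(1500);
      timer_stop(TMR_HISHAVE);

      /* every five minutes innd logs and resets the totals, which is
       * where the "hishave 38857(26474)" style entries above come from */
      printf("hishave %.0f(%lu)\n", tmr_msec[TMR_HISHAVE], tmr_count[TMR_HISHAVE]);
      return 0;
  }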

Most of the time is spent on the BerkeleyDB overview now.  This is
probably because some reader is giving repeated commands pounding
the overview database.  <clickety-click>  That reader's IP now has a
different gateway address, and won't be bothering me for a while.

Now, for a reference, here are the timings on a transit-only machine
with no readers, after it's been running for a while:


Dec  3 05:22:09 news-feed69 innd: ME time 300000 idle 91045(91733)
                     a reasonable amount of idle time ^^
 artwrite 48083(2096) artlink 0(0) hiswrite 1639(2096) hissync 33(11)
                                            ^^^^
 sitesend 4291(12510) artctrl 0(0) artcncl 0(0) hishave 1600(30129)
                                                        ^^^^
 hisgrep 0(0) artclean 25591(2121) perl 79(2096) overv 0(0) python 0(0)
 ncread 69798(147925) ncproc 108624(147919) 

A total of just over 3 seconds out of every 300 seconds spent on
history activity.  That's reflected by the timestamps on the NOSYNC'ed
history database (index/hash) files you see here:

-rw-rw-r--  1 news  news  436206889 Dec  3 05:22 history
-rw-rw-r--  1 news  news         67 Dec  3 05:22 history.dir
-rw-rw-r--  1 news  news   81000000 Dec  1 01:55 history.hash
-rw-rw-r--  1 news  news   54000000 Nov 30 22:49 history.index
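
(That `3 seconds' is just the history buckets from that report summed:
hiswrite 1639 ms + hishave 1600 ms + hissync 33 ms = 3272 ms, call it
3.3 seconds, out of a 300000 ms reporting interval.)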

However, `top' here shows only 10 to 20% idle CPU time, even though
INN itself has capacity to do more work.


The problem is that I'm not seeing this on the reader box -- or if I
do see it, it doesn't last long.  There, the timestamps on the
corresponding files are pretty much current, in spite of the files
being NOSYNC'ed.


> :As is to be expected, INN increases in size as it does history lookups
> :and updates, and the amount of memory shown as Active tracks this,
> :more or less.  But what's happening to the Free value!  It's going
> :down at as much as 4MB per `top' interval.  Or should I say, what is
> :happening to the Inactive value -- it's constantly increasing, and I
> :observe a rapid migration of all the Free memory to Inactive, until
> :the value of Inactive peaks out at the time that Free drops to about
> :996k, beyond which it changes little.  None of the swap space has
> :been touched yet.
> :
> :As soon as the value for Free hits bottom and that of Inactive has
> :reached a max, now the migration happens from Inactive to Active --
> :until this point, the value of Active has been roughly what I would
> :expect to see, given the size of the history hash/index files, and
> :the BerkeleyDB file I'm now using MAP_NOSYNC as well for a definite
> :improvement in overview access times.
> 
>     Hmm.  An increasing 'inactive' most often occurs when a program
>     is reading a file sequentially.  It sounds like most of the inactive
>     pages are probably due to reader requests from the spool.

Either that, or the access to the BerkeleyDB overview data -- I think
I see the same poor performance even when I'm not allowing readers.
Or maybe not.  Even if the performance doesn't suck, the timestamps
on the files get updated periodically.


I took a look at all three machines with NOSYNC, and noted that the
one doing best (with the `ls' of history above) has vm source from
26.Nov, while the reader machine doing poorly had source files from
only 25.Nov -- a third machine with 256MB RAM that is doing pretty
well, but could do better, has source from 21.Nov.  So the machines
were not as closely in sync as I thought.

Just in case, I rebuilt the reader machine so that the vm source files
are current as of 02.Dec, to pick up your latest changes.  I still see
the same less-than-ideal performance, although perhaps it's not quite
as bad as earlier.  But in general, all the memory migrates to Inactive
and from there to Active within 15 minutes after startup.  And after
shutting INN down and `fsync'ing the three MAP_NOSYNC'ed files, about
300MB still remains Active, compared with just under 400MB while it
was running.

More observations that may or may not mean anything -- before rebooting,
I timed `fsync' on the 108MB and 72MB history files and on the 18MB
BerkeleyDB file, while watching how much Active memory each sync
released.  It looked as if, for a given file, about 1/3 of its size was
released from Active.  In other words, on an idle system, `fsync'ing
the 108MB file -- which should have been almost completely dirty,
except that much of the dirt had probably already been written to disk,
given that things aren't working as well as I'd like -- caused the
Active value in `top' to drop by about 30MB.

The time taken to do the `fsync' was around one minute for the two
history files.  And around 1 second for the BerkeleyDB file...
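
For what it's worth, the timing boils down to something like this --
a quick throwaway sketch, path made up; it just times an fsync() on a
descriptor open on the file:

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/time.h>

  int main(void)
  {
      struct timeval t0, t1;
      int fd = open("/news/db/history.hash", O_RDONLY);  /* hypothetical path */
      if (fd < 0) { perror("open"); return 1; }

      gettimeofday(&t0, NULL);
      if (fsync(fd) < 0)                 /* flushes the file's dirty pages */
          perror("fsync");
      gettimeofday(&t1, NULL);

      printf("fsync took %.3f seconds\n",
             (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
      close(fd);
      return 0;
  }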

Thing is, when I rebooted (and saw the disk activity), I'm sure it took
much longer -- something like 5 minutes total.  Which could well be
true, if I hadn't been flushing data during operation and there were
more dirty pages needing to be written.

Another confusing thing (and one time, before I knew what was going on,
I thought the machine had hung and powered it off) is that dumping the
data to disk at reboot time happens after the `done' for the `syncing
filesystems' message.  Perhaps, if there is still data to be written,
you could make note of this, writing something like `flushing whatever
cache' to the console, just so people know something is in fact
happening and It is now^Ht safe to turn off your computer.  I had no
disk activity light at the time, so...  well, I was young and foolish.


If, as you say, data being read is responsible for the NOSYNC data
getting flushed to disk, then it seems like someone's priorities are
a bit, well, wrong.  The way I see it, by giving the MAP_NOSYNC flag
I'm asking for preferential treatment, kinda like mlock, even though
mlock isn't available to me as the `news' user.

It seems like this MAP_NOSYNC'ed data should not be flushed as long as
it's possible to reclaim some other data instead, most likely the
recently-read Inactive pages.  Or something.  Anything.  Dump the
NOSYNC data only if you absolutely can't squeeze anything out of the
Inactive (even Active?  why else does it take nearly 400MB on this
machine?) or cached data.
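
What I'm wishing for is effectively something like the following, if
the `news' user were allowed it.  Just a sketch with a made-up path:
map the file NOSYNC the way dbz does now, then wire it down so the
pageout daemon keeps its hands off; the mlock() is exactly the part
that fails for an unprivileged user.

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <sys/stat.h>

  int main(void)
  {
      const char *path = "/news/db/history.hash";     /* hypothetical path */
      struct stat st;
      int fd = open(path, O_RDWR);
      if (fd < 0 || fstat(fd, &st) < 0) { perror(path); return 1; }

      /* shared, NOSYNC mapping, as dbz sets up now */
      void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_NOSYNC, fd, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      /* the "preferential treatment": wire the pages so the pageout
       * scan never picks them -- needs privileges news doesn't have */
      if (mlock(p, st.st_size) < 0)
          perror("mlock");
      return 0;
  }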


So, I only have a small number of readers now (it's a party night in
the Real World and at least some people have lives) and the timer stats
have dropped to the numbers I expect to see, and the history drive is
relatively quiet too:


Dec  3 05:38:50 crotchety innd: ME time 300000 idle 30790(280652)
 artwrite 101349(3788) artlink 0(0) hiswrite 2089(3970) hissync 31(8)
      2 seconds to update the history data... ^^^^
 sitesend 459(7576) artctrl 131(51) artcncl 128(51) hishave 2777(50732)
        50 thousand history lookups in less than 3 seconds. ^^^^
 hisgrep 0(39) artclean 29821(4017) perl 16754(3958) overv 24979(3788)
 python 0(0) ncread 51580(326712) ncproc 198077(326712) 

Dec  3 05:53:50 crotchety innd: ME time 300006 idle 34595(293538)
 artwrite 98618(3534) artlink 0(0) hiswrite 2240(3775) hissync 68(8)
                                            ^^^^
 sitesend 426(7068) artctrl 6(46) artcncl 5(46) hishave 596(11155)
    11 thousand lookups in half a second?  Fine with me ^^^
 hisgrep 0(15) artclean 29846(3783) perl 16481(3744) overv 26877(3534)
 python 0(0) ncread 52778(332500) ncproc 192343(332500) 



Now, to see why I was complaining earlier, look at the timings I was
seeing when I had a lot more readers:

Dec  2 19:29:42 crotchety innd: ME time 301953 idle 12098(86714)
 artwrite 59150(2740) artlink 0(0) hiswrite 67383(3165) hissync 55(6)
                                            ^^^^^
 sitesend 1832(5480) artctrl 335(34) artcncl 335(34) hishave 40609(8383)
                                                             ^^^^^
 hisgrep 64(19) artclean 20805(3224) perl 12700(3150) overv 38256(2740)
 python 0(0) ncread 30335(113433) ncproc 252639(113433) 
Dec  2 19:44:42 crotchety innd: ME time 300038 idle 9646(54390)
 artwrite 48827(2738) artlink 0(0) hiswrite 75731(3058) hissync 56(6)
                                            ^^^^^
 sitesend 1885(5476) artctrl 3203(85) artcncl 3201(85) hishave 46630(7721)
                                                               ^^^^^
 hisgrep 17(11) artclean 16708(3065) perl 11078(2984) overv 48362(2738)
 python 0(0) ncread 24070(73708) ncproc 261521(73708) 
Dec  2 19:54:46 crotchety innd: ME time 304435 idle 6508(48280)
 artwrite 51075(2854) artlink 0(0) hiswrite 63032(3265) hissync 55(7)
              A minute for history updates? ^^
 sitesend 3460(5708) artctrl 5587(154) artcncl 5587(154) hishave 53840(9316)
               And a minute for history lookups too?  No thanks  ^^
 hisgrep 760(71) artclean 17761(3205) perl 12072(3182) overv 51383(2854)
 python 0(0) ncread 24904(69630) ncproc 268419(69630) 



> :lack of space today when I attempted to run both expire and makedbz (a
> :variant of makehistory), and apparently some reader processes or some
> :daemons still had the old inodes open, until suddenly in one swell foop,
> :some 750MB was freed up -- far more than I expected to see, so I should
> 
>     Woa.  750MB?  There are only two things that can cause that:
>     * A large (500+ MB) file is deleted after having previously been
>       mmap()'d.  (or the process holding the last open descriptor to
>       the file, after deletion, now exits).
> 
>     If I remember INN right, there is a situation that can occur here... the
>     reader processes open up the history file in order to implement a certain
>     NNTP command.  I'm trying to remember which one... I think it's one of the
>     search commands.  Fubar... anyone remember which NNTP command opens
>     up the history file?  In any case, I remember at BEST I had to completely
>     disable that command when running INN because it caused long-running
>     reader processes to keep a descriptor open on now-deleted history files.
>     When you do an expire run which replaces the history file, the original
>     (now deleted) history file may still be open by those reader processes.

Yeah, I realized yesterday morning, after a bit of chocolate, that
that was probably what I was seeing.  I'm in desperate need of sleep
to think straight -- I'm having to reread your messages about five
times after answering them and still finding new things.  Urk.

There were two readers at the time -- myself, and a cow-orker of mine.
I fetch messages by `ARTICLE <msg-ID>' which of course hits the
history file.  I don't remember offhand if nnrpd initially grabs
ahold of the file and doesn't let go.  There have been a number of
changes since you had your hands dirtied by INN, so there may well be
cases now where the history file is held open by processes where it
wasn't back then.  But this is almost certainly unrelated to the
absence of available memory I was seeing.
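
(For the record, a fetch by message-ID goes roughly like this on the
wire -- message-ID made up -- and nnrpd can only locate the article by
looking that ID up in the history database, hence the history hit on
every such fetch:)

  ARTICLE <some-made-up-id@example.invalid>
  220 0 <some-made-up-id@example.invalid> article retrieved - head and body follow
  ...headers, a blank line, the body, then a line with a single dot...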


Now, back to the sometimes-awful MAP_NOSYNC performance -- there are
a few differences between the way it's running now, where it behaves
itself nicely, and the way I had run it before:

* The kernel has your commits from 26.Nov up through 02.Dec.
* I have had only five readers since starting this time, not 50.
* I didn't start INN immediately after boot this time.  Instead,
  INN had been running for a while, then I shut it down, ran
  expire by hand, and then started INN after other processes had
  had a chance to fill up the cache with whatever, if this would
  make any difference.

Time will tell how much the second factor has to do with the stellar
performance I see now.



>     abusers, by the way, which try to download feeds via their reader 
>     access) are enabled.  I would immediately research this... look for 
>     reader processes that have hung around too long and try killing them,
>     then see if that clears out some memory.

As far as system memory is concerned, I killed all nnrpd processes
when I shut things down, so all news-related processes have stopped.
It hasn't made much difference in the `top' Active value.


>     There will also be a serious file fragmentation issue using MAP_NOSYNC
>     in the expire process.  You can probably use MAP_NOSYNC safely in the
>     INND core, but don't use it to rebuild the history file in the expire
>     process.

Ah, okay, thanks.  I'll check into this.  Years ago there would be
problems if you defined MMAP for INN proper, because that define was
also used when building expire, so one had to manually fudge makefiles
or something to build expire without mmap.  I think this is much
simpler now and mmap isn't used in expire by default, in which case
this might be okay.  However, I've added MAP_NOSYNC to the MAP__ARGS
in lib/dbz.c, so I really need to check expire.
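
(The change in dbz.c amounts to nothing more than adding the flag to
the mmap() of the hash/index files -- paraphrasing from memory here,
these are not the literal dbz.c lines:

  /* before: the syncer periodically writes the dirty pages back */
  p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  /* after: dirty pages stay in memory and only hit the disk on
   * fsync()/msync(), or when the pageout daemon decides to launder
   * them -- which is exactly the behavior I'm complaining about */
  p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NOSYNC, fd, 0);

and it's that flag I need to make sure expire doesn't pick up when it
rebuilds the files.)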

The places where one isn't dealing with a fixed history database
size are --
* creating the initial database files, or rebuilding them (makedbz,
  which used to be the function of makehistory)
* overflows in the hash table, where the fixed database files grow
  in size as additional tables are added -- these additional tables
  are not incore, though, last I looked
* expire, as you note, creating new databases a la makedbz

(Weird, I just ran news.daily, which invoked expireover and should have
invoked expire, but it didn't seem to happen.  So the good timings I'm
seeing now are not on a potentially fragmented file.  Hmmm.)


Thanks for your help, if performance stays Good, or if it starts to
fall through the floor again, you'll hear about it...

barry bouwsma, will watch news swervers like a hawk for food


--

HI!  ME AGA1N!!!
As Murphy would have it, shortly after I finished writing the above,
history performance started to suck, even with six readers, and the
history drive light is getting a beating.  I've killed off readers
except for myself, of course, to see if things improve, but here are
recent timings so you can see why my eyes well with tears:

Dec  3 07:28:51 crotchety innd: ME time 300047 idle 16801(186367)
 artwrite 85200(3320) artlink 0(0) hiswrite 31568(3581) hissync 44(8)
                                            ^^^^^
 sitesend 399(6640) artctrl 5935(340) artcncl 5928(340) hishave 31975(11904)
                                                                ^^^^^
 hisgrep 4259(289) artclean 25148(3567) perl 14955(3530) overv 21593(3320)
 python 0(0) ncread 39087(226379) ncproc 229931(226379) 

Oh well.  Up from a couple of seconds of history activity to more than
a minute out of every five...  *sigh*  (gack, now up to 90 seconds)

I would say that something needs to be done to avoid losing those
pages that have been MAP_NOSYNC'ed, assuming this isn't already being
done.  Right now they appear to be treated like any other page, which
isn't such a good idea.  Perhaps a kind of weak mlock, if I'm not
completely mistaken.

Anyway, now the two history times are taking 140+ out of 300 seconds
and getting worse.  Without readers.  So I'm going to send this off
before things get worse and do something, like sleep, maybe...


