From owner-freebsd-hackers Tue Dec  5 21:19:53 2000
From owner-freebsd-hackers@FreeBSD.ORG Tue Dec  5 21:19:48 2000
Return-Path:
Delivered-To: freebsd-hackers@freebsd.org
Received: from crotchety.newsbastards.org (netcop.newsbastards.org
	[193.162.153.124]) by hub.freebsd.org (Postfix) with ESMTP id
	49C7D37B400 for ; Tue, 5 Dec 2000 21:19:47 -0800 (PST)
Received: (from news@localhost) by crotchety.newsbastards.org
	(8.11.1/8.11.1) id eB65JS910042; Wed, 6 Dec 2000 06:19:28 +0100 (CET)
	(envelope-from newsluser@free-pr0n.netscum.dk)
Date: Wed, 6 Dec 2000 06:19:28 +0100 (CET)
Message-Id: <200012060519.eB65JS910042@crotchety.newsbastards.org>
X-Authentication-Warning: crotchety.newsbastards.org: news set sender to
	newsluser@free-pr0n.netscum.dk using -f
Reply-To: freebsd-user@netscum.dk
From: News History File User
To: Matt Dillon
Cc: hackers@freebsd.org, usenet@tdk.net
Subject: Re: vm_pageout_scan badness
References: <200012011918.eB1JIol53670@earth.backplane.com>
	<200012020525.eB25PPQ92768@newsmangler.inet.tele.dk>
	<200012021904.eB2J4An63970@earth.backplane.com>
	<200012030700.eB370XJ22476@newsmangler.inet.tele.dk>
	<200012040053.eB40rnm69425@earth.backplane.com>
	<200012050545.eB55jL453889@crotchety.newsbastards.org>
In-Reply-To: <200012050545.eB55jL453889@crotchety.newsbastards.org>
Sender: newsluser@free-pr0n.netscum.dk
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Howdy,

I'm going to breach all sorts of ethics in the worst way by following
up to my own message, just to throw out some new info... 'kay?

Matt wrote, and I quote --

: > However, I noticed something interesting!

Of course I clipped away the interesting Thing, but note the following
that I saw...

: INN after adding the memory, I did a `cp -p' on both the history.hash
: and history.index files, just to start fresh and clean.  It didn't seem
[...]
: > There is an easy way to test file fragmentation.  Kill off everything
: > and do a 'dd if=history of=/dev/null bs=32k'.  Do the same for
: > history.hash and history.index.  Look at the iostat on the history
: > drive.  Specifically, do an 'iostat 1' and look at the KB/t (kilobytes
: > per transfer).  You should see 32-64KB/t.  If you see 8K/t the file
: > is severely fragmented.  Go through the entire history file(s) w/ dd...
:
: Okay, I'm doing this: The two hash-type files give me between 9 and
: 10K/t; the history text file gives me more like 60KB/t.  Hmmm.  It's

Now, remember what Matt wrote, that partially-cached data played havoc
with read-ahead.  That is apparently what I was seeing here: pulling
some bit of data off the disk proper, but then pulling a chunk of data
that was cached, and so on.  I figured that out as I attempted to copy
one of the files to create an unfragmented copy to test transfer size
and saw the expected 64K (well DUH, that was the write size), and then
attempted to `dd' these to /dev/null and saw ... no disk activity.
The file was in cache.  Bummer.

Oh well, I had to reboot anyway for some reason, and did so.
Immediately after reboot I `dd'ed the two database files and got the
expected 64K/t of an unfragmented file.  I also made copies of them
just to push their contents into memory, because...

: The actual history lookups and updates that matter are all done within
: the memory taken up by the .index and .hash files.  So, by keeping
: them in memory, one doesn't need to do any disk activity at all for
: lookups, and updates, well, so long as you commit them to the disk at
: shutdown, all should be okay.  That's what I'm attempting to achieve.
: These lookups and updates are bleedin' expensive when disk activity
: rears its ugly head.
:
: Not to worry, I'm going to keep plugging to see if there is a way for
: me to lock these two files into memory so that they *stay* there, just
: to prove whether or not that's a significant performance improvement.
: I may have to break something, but hey...
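For the terminally curious, the idea in code form.  This is only a
sketch of the sort of thing I mean, NOT the actual INN source; the
function name and the error handling are mine:

    /*
     * Sketch only: map a dbz file (history.index or history.hash)
     * shared and writable, tell the VM system the access pattern is
     * random, and ask it to fault the whole thing in anyway.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <err.h>

    static void *
    map_dbz(const char *path, size_t *lenp)
    {
            struct stat sb;
            void *p;
            int fd;

            if ((fd = open(path, O_RDWR)) < 0)
                    err(1, "open %s", path);
            if (fstat(fd, &sb) < 0)
                    err(1, "fstat %s", path);
            p = mmap(NULL, (size_t)sb.st_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, (off_t)0);
            if (p == MAP_FAILED)
                    err(1, "mmap %s", path);
            /* Lookups hit pages scattered all over the table... */
            if (madvise(p, (size_t)sb.st_size, MADV_RANDOM) < 0)
                    warn("madvise MADV_RANDOM");
            /* ...but we'd still like every one of them resident. */
            if (madvise(p, (size_t)sb.st_size, MADV_WILLNEED) < 0)
                    warn("madvise MADV_WILLNEED");
            *lenp = (size_t)sb.st_size;
            return (p);
    }

Once both files are mapped like that, a lookup or update is nothing
but memory access -- no read() or write() calls -- so the only disk
activity left is whatever the VM system decides to do behind my back.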
I b0rked something.  I `fixed' the mlock operation to allow a lowly
user such as myself to use it, just as proof of concept.  (I still
need to do a bit of tuning, I can see, but hey, I got results.)

So I attempt to pass all the madvise suggestions I can for both the
history.index and .hash files, and then I attempt to mlock both of
them.  I don't get a failure, although the history.hash file (108MB)
doesn't quite achieve the desired results -- I do see Good Things
with the smaller history.index (72MB, and don't remind me that 1MB
really isn't 1000000 bytes).
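The cheat itself boils down to no more than this (remember, mlock
normally wants you to be root, which is exactly the bit I hacked
around in the kernel, so don't try this at home):

    /*
     * Proof-of-concept: wire the mapped file so the pageout daemon can
     * never evict its pages.  Normally only the superuser may do this,
     * and the total is clamped by the kernel's wired-page limit.
     */
    #include <sys/mman.h>
    #include <err.h>

    static void
    wire_dbz(void *base, size_t len, const char *name)
    {
            if (mlock(base, len) < 0)
                    warn("mlock %s (%lu bytes)", name, (unsigned long)len);
    }

That's all it takes, called once for history.index and once for
history.hash right after the mapping; the interesting part was making
the kernel let the news user do it at all.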
Anyway, the number of `Wired' Megs in `top' is up from 71MB to 200+,
and after some hours of operation, look at the timestamps of the two
database files (the .n.* files are those I copied after reboot, and
serve as a nice reference for when I started things):

-rw-rw-r--  1 news  news  755280213 Dec  5 19:05 history
-rw-rw-r--  1 news  news         57 Dec  5 19:05 history.dir
-rw-rw-r--  1 news  news  108000000 Dec  5 19:05 history.hash
-rw-rw-r--  1 news  news   72000000 Dec  5 08:44 history.index
-rw-rw-r--  1 news  news  108000000 Dec  5 08:43 history.n.hash
-rw-rw-r--  1 news  news   72000000 Dec  5 08:44 history.n.index

So, okay, history.hash still sees disk activity, but look at a handful
of INN timer stats following the boot.  The last two stats with the
default vm k0deZ before restart:

Dec  5 08:30:40 crotchety innd: ME time 301532 idle 28002(120753)
artwrite 70033(2853) artlink 0(0) hiswrite 49396(3097) hissync 28(6)
                                           ^^^^^
sitesend 460(5706) artctrl 296(25) artcncl 295(25) hishave 32016(8923)
                                                           ^^^^^
hisgrep 45(10) artclean 20816(3150) perl 12536(3082) overv 29927(2853)
python 0(0) ncread 33729(152735) ncproc 227796(152735)

80 seconds of 300 spent on history activity... urk... on a
steady-state system with a few readers that had been running for some
hours.

Dec  5 08:35:37 crotchety innd: ME time 300052 idle 16425(136209)
artwrite 77811(2726) artlink 0(0) hiswrite 35676(2941) hissync 28(6)
sitesend 571(5450) artctrl 454(41) artcncl 451(41) hishave 33311(7392)
hisgrep 55(14) artclean 22778(3000) perl 14137(2914) overv 28516(2726)
python 0(0) ncread 38832(172145) ncproc 226513(172145)

[REB00T]

Dec  5 08:59:32 crotchety innd: ME time 300059 idle 62840(189385)
artwrite 68361(5580) artlink 0(0) hiswrite 8782(6567) hissync 104(12)
                                           ^^^^
sitesend 784(11160) artctrl 40(112) artcncl 38(112) hishave 5371(24250)
                                                            ^^^^
hisgrep 0(35) artclean 20274(6510) perl 16160(6490) overv 59032(5580)
python 0(0) ncread 31477(214268) ncproc 192853(214268)

Hmm, 15 seconds to do twice as many lookups.  Normally the times
start out after a restart pretty high and drop as the accessed pages
are pulled into memory, so long as they don't get dumped back to
disk.

Dec  5 09:04:32 crotchety innd: ME time 300069 idle 33289(253926)
artwrite 94561(3510) artlink 0(0) hiswrite 1758(3897) hissync 1(8)
sitesend 472(7020) artctrl 26(42) artcncl 26(42) hishave 23394(39303)
                                                         ^^^^^
hisgrep 0(25) artclean 28223(3892) perl 16704(3880) overv 19391(3510)
python 0(0) ncread 44259(300338) ncproc 203424(300338)

I'll go out on a limb and claim that the much higher time for the
history lookups here is because the on-disk copy of history.hash is
getting updated, and this activity causes needed pages to be written
out and read back in.

Dec  5 09:09:32 crotchety innd: ME time 300031 idle 31795(270764)
artwrite 99006(3474) artlink 0(0) hiswrite 1653(3914) hissync 23(8)
sitesend 512(6948) artctrl 57(136) artcncl 56(136) hishave 13301(34866)
hisgrep 1(54) artclean 29608(3863) perl 16995(3832) overv 20565(3474)
python 0(0) ncread 47222(319176) ncproc 201221(319176)

Dec  5 09:14:32 crotchety innd: ME time 300020 idle 31741(275290)
artwrite 101160(3596) artlink 0(0) hiswrite 1608(3941) hissync 0(8)
sitesend 512(7192) artctrl 58(121) artcncl 57(121) hishave 8062(29428)
hisgrep 1(75) artclean 29083(3905) perl 17264(3895) overv 22364(3596)
python 0(0) ncread 47022(322682) ncproc 201300(322682)

Dec  5 09:19:32 crotchety innd: ME time 300000 idle 36150(258630)
artwrite 95591(3410) artlink 0(0) hiswrite 4945(3799) hissync 1(8)
sitesend 468(6820) artctrl 47(89) artcncl 46(89) hishave 6242(25413)
hisgrep 0(47) artclean 28568(3831) perl 16320(3757) overv 22914(3410)
python 0(0) ncread 47191(302273) ncproc 195326(302273)

Still, as the history times fall, I'm starting to chew on the
backlogs that I hadn't been able to take in before.  Skipping a bit
of time...

Dec  5 10:24:33 crotchety innd: ME time 300017 idle 15118(112875)
artwrite 74383(3305) artlink 0(0) hiswrite 11005(3837) hissync 42(8)
sitesend 595(6610) artctrl 1751(359) artcncl 1713(359) hishave 13512(10514)
hisgrep 6(50) artclean 25837(3573) perl 15504(3528) overv 69173(3305)
python 0(0) ncread 44935(151577) ncproc 227542(151577)

Dec  5 10:29:33 crotchety innd: ME time 300118 idle 24125(179647)
artwrite 85188(3535) artlink 0(0) hiswrite 9523(4021) hissync 11(8)
sitesend 543(7070) artctrl 672(398) artcncl 663(398) hishave 15373(11343)
hisgrep 3(188) artclean 28574(3819) perl 16549(3811) overv 36233(3535)
python 0(0) ncread 48899(223058) ncproc 210626(223058)

Not optimal timings, but decent -- compare 25 seconds devoted to
history activity with as much as two minutes that I had been seeing
previously, when the smaller history.index file was being scribbled
to on disk too.

Another thing: there is still a noticeable freeze of history access
every 30-some seconds, when the system dumps data to the history
drive.  Previously this had taken from, oh, four to eight seconds;
now it's one to two seconds in duration.  Still not as perfect as if
I could convince the history.hash file not to be flushed to disk as
well, but it certainly shows that this history update activity is a
performance killer when one does have to do it.

Dec  5 12:24:33 crotchety innd: ME time 300075 idle 7271(44156)
artwrite 66376(2715) artlink 0(0) hiswrite 1607(3018) hissync 104(7)
sitesend 594(5430) artctrl 17(29) artcncl 17(29) hishave 4170(16315)
hisgrep 0(8) artclean 22926(3002) perl 13387(2997) overv 117031(2715)
python 0(0) ncread 41060(75619) ncproc 243805(75619)

Whoa, how did that happen.  I can live with it...

Dec  5 17:59:37 crotchety innd: ME time 300001 idle 39108(186389)
artwrite 78561(3271) artlink 0(0) hiswrite 15946(4020) hissync 11(9)
sitesend 700(6542) artctrl 307(88) artcncl 303(88) hishave 18674(13309)
hisgrep 14(24) artclean 23868(3966) perl 15008(3956) overv 35097(3271)
python 0(0) ncread 41240(217018) ncproc 204095(217018)

And an extreme in the other direction.

To recap: the difference here is that by cheating, I was able to
mlock one of the two files (the behaviour I was hoping to achieve
through first MAP_NOSYNC alone, then in combination with
MADV_WILLNEED to keep all the pages in memory so much as possible),
and achieve a much improved level of performance -- I'm able to catch
up on backlogs from a full feed that had built up during the time I
wasn't cheating -- by using memory for the history database files
rather than for general filesystem caching.  I even have spare
capacity!  Woo.
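For reference, the no-kernel-hack version I was (and still am) hoping
for would look about like this -- again just a sketch of mine,
assuming MAP_NOSYNC does what mmap(2) says it does:

    /*
     * What I'd rather be doing: MAP_NOSYNC asks the kernel not to
     * flush dirty pages out during the regular 30-some-second syncer
     * sweep; they should only go to disk under memory pressure or at
     * an explicit msync().  Sketch only, assuming the documented
     * semantics; function names are made up.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <err.h>

    static void *
    map_dbz_nosync(int fd, size_t len)
    {
            void *p;

            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NOSYNC, fd, (off_t)0);
            if (p == MAP_FAILED)
                    err(1, "mmap MAP_NOSYNC");
            if (madvise(p, len, MADV_WILLNEED) < 0)
                    warn("madvise MADV_WILLNEED");
            return (p);
    }

    /* ...and at shutdown, commit the lot in one sequential pass: */
    static void
    commit_dbz(void *p, size_t len)
    {
            if (msync(p, len, MS_SYNC) < 0)
                    warn("msync");
            if (munmap(p, len) < 0)
                    warn("munmap");
    }

No root privileges, no hacked kernel, and the syncer would leave the
dirty pages alone until expiry or shutdown.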
The mlock man page refers to some system limit on wired pages; I get
no error when mlock()'ing the hash file, and I'm reasonably sure I
tweaked the INN source to treat both files identically (and on the
other machines I have running, the timestamps of both files remain
pretty much unchanged).  I'm not sure why I'm not seeing the desired
results here with both files (maybe some call hidden somewhere I
haven't located yet), but I hope you can see the improvements so far.
I even let abusive readers pound on me.  Well, for a while, 'til I
got tired of 'em.

I still don't know for certain whether the disk updates I am seeing
are slow because they aren't sorted well, or because they're random
pages rather than a sequential set, given that I hope I've ruled out
fragmentation of the database files.  I still maintain that for a
true MADV_RANDOM madvise'd file, any attempt to clean out `unused'
pages is ill-advised; or, if such cleaning is needed, then -- due to
the nature of random access -- anything other than freeing sequential
runs of pages results in excess disk activity that gains nothing (if
that's not already how it's done).

Yeah, hacking the vm source to allow me to mlock() isn't kosher, but
I wanted to test a theory.  Doing so probably requires a few more
tweaks in the INN source to handle expiry, so it seems, so I'd rather
the vm subsystem do this for me automagically with the right
invocation of the suitable mmap/madvise operations, if this is
reasonable.

ThaNk U 4 uR atTenT1oN aNd pLz haVe a NICE DaY

bArRy b0uWsmA, tElE dAnmArk IntErnEt nEws k0dEhaX0r


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message