Date: Sun, 27 Dec 2015 02:09:59 +1100 (EST)
From: Ian Smith <smithi@nimnet.asn.au>
To: Tom Evans
cc: FreeBSD Stable, Alexander Motin, Jeremy Chadwick,
    Miroslav Lachman <000.fbsd@quip.cz>
Subject: Re: stable/10: high load average when box is idle
Message-ID: <20151106212735.U12989@sola.nimnet.asn.au>
References: <20151027050508.GA7612@icarus.home.lan>
    <20151104025748.F10372@sola.nimnet.asn.au>

On Wed, 4 Nov 2015 14:31:35 +0000, Tom Evans wrote:
 > On Tue, Nov 3, 2015 at 5:47 PM, Ian Smith wrote:
 > > Like yourself, I think this is far from 'cosmetic' as is oft
 > > suggested, especially in some fairly ill-informed forum posts but
 > > also various list posts.  I've been watching load averages since
 > > OS/2 through FreeBSD 2.2 till present and some Linux systems, and
 > > have never before seen anything like this, especially on virtually
 > > entirely idle systems.  While the LAs reported during (say) make -j4
 > > buildworld appear more sane, who knows?
 > >
 > > I'm not suggesting that I think there's any sort of performance hit
 > > from this; so far I don't, but 'cosmetic' suggests that it doesn't
 > > matter ..

I'm afraid I can't resist being undoubtedly too thorough with this post,
but it does summarise quite a lot of time I've spent this year, and I'm
feeling a little in need of celebrating an apparent win as a result ..

 > Have you read mav's explanation on the PR? It certainly seems a valid
 > explanation of why it is probably a cosmetic issue - that is, we could
 > wake up more to get a truly perfect load average, but why bother?

I've read it over and again.  I think it may well explain some relative
'fuzziness' in LAs over perhaps a 0.1 to 0.3 range that I still see,
since finding - more truthfully, stumbling upon - and implementing a fix
for the major issue I've had and been working on in bursts over many
months.

The figures below are from recent stable/9 on a 2.4GHz Core2Duo, ~99.99%
idle after boot, running just sshd, powerd, ntpd and hald.  The columns
show the worst-seen extremes, the commonly seen range, and the more or
less steady value for the 1, 5 and 15 minute load averages respectively:

  eventtimer mode     1min            5min        15min
  HPET   one-shot     0.4-0.85        0.5-0.75    0.6-0.7
  HPET   periodic     0.99-1.00 (!)   0.99-0.99   0.99
  LAPIC  one-shot     0.0-0.05        0.0-0.02    0.0-0.0
  LAPIC  periodic     0.0-0.05        0.0-0.02    0.0-0.0
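For anyone wanting to reproduce those rows: switching the eventtimer and
its mode is just runtime sysctl twiddling, then waiting for the averages
to settle.  Roughly - assuming both HPET and LAPIC show up in
kern.eventtimer.choice on the box in question, and with the wait only a
rough guess at a settling period:

  # sysctl kern.eventtimer.choice        # list available eventtimers
  # sysctl kern.eventtimer.timer=HPET    # or LAPIC
  # sysctl kern.eventtimer.periodic=1    # 1 = periodic, 0 = one-shot
  # sleep 900                            # let the 5 and 15 min LAs settle
  # sysctl vm.loadavg                    # or just 'uptime'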
I run a task that's 100% cpu-bound (well, about 7 I/O reads in the first
few milliseconds), configured to run for 15.5 minutes, in the background
so I can watch and take measurements throughout.  It uses 100% of one CPU
(so scheduled as ~50% per cpu) for that long.  With the LAPIC eventtimer,
the 1min LA soon reaches 0.99-1.00, with the 5min LA getting there in
~15 minutes.  If I run 2 instances, each completes in 15.5 minutes (plus
~5 seconds), both cpus show 100% for those tasks, and the 1min LA soon
reaches 1.99-2.00.  If I run 4 instances on its 2 cpus, the 1min LA soon
reaches 3.99-4.00, and the 5min LA reaches 3.99 after 15-20 minutes.
Since 4 cpu-bound tasks want to run on 2 cpus, each gets ~15.5m of cpu
time, so total runtime for each is ~31 minutes, by which time the 15min
LA is heading for 4 as well.

In other words, the load averages shown with LAPIC are pretty much
exactly what one would (and should) expect for 1, 2 or 4 long-running
cpu-bound tasks, and that holds even in periodic mode, which has almost
as few interrupts and context switches as HPET (but without HPET's
valuable C3 usage).

I should add that with HPET in one-shot mode, 2 such tasks report ~2.4
LA, while 4 tasks report ~4.4 LA, so there's definitely an increment
there.  Switching back to HPET soon reverts to the 'normal' behaviour,
which over long idles (hours) shows 0.66 or 0.67 as the 'average average'
15min LA.  HPET in periodic mode - which has helped some people running
10.x with LAPIC - here only rapidly pins the LAs to 0.99-1.00 at idle,
before and after any real load.  Both of these HPET-reported LAs just
smelled like bad maths.  I spent way too much time down the (upper parts
of the) rabbit hole/s with little to show; I should have asked more
questions, but wanted to suss it out ..

 > ISTM, the problem reports are of the form "My server is idle and it is
 > reporting a non-idle load average below 1", and not "My server is
 > supposedly idle, but is doing actual work causing the load average to
 > be higher than it should be". Is there any evidence that we have the
 > latter and not the former?

Er, yes, it is the former, not the latter.  And it's not so much that
it's shown as below 1, as that it's shown above 0.00 - ie, something
above 0.005 - when the box is genuinely idle below that level: top
showing 100.00% idle on both cpu idle tasks, and say 100.0% and 99.8%
idle per cpu from cp_times (top -SCHP).

So I was wondering if I really had to post my hundreds of kB of logs and
running reports of the above to properly demonstrate what was happening
here, when I thought I'd try cpuset(1) to pin the abovementioned task to
one cpu, rather than letting the scheduler share it out over both.  First
up I tried one of the examples from cpuset(1), except using just 1 of the
2 cpus:

 # cpuset -l 0 -s 1

And lo! the previously 0.66-ish HPET LAs headed towards 0.00, taking
around 5 minutes to get there.  Wow.  So I ran some more tests using the
abovementioned tasks, and sure enough, the LAs were as good as with
LAPIC.  Then after a while I reenabled CPU1 ..

 # cpuset -l 0-1 -s 1
 # cpuset -g
 pid -1 mask: 0, 1

.. fully expecting the LAs to climb back to 'normal' HPET behaviour ..
but they did not!  They stayed where they were, basically 0.00 0.00 0.00
in such idleness, and appropriately more like 0.1-0.2 with X fired up,
idle.

Reboot.  Straight away, log in as root and run:

 # cpuset -l 0 -s 1
 # cpuset -l 0-1 -s 1    # ie: -l "0-`sysctl -n kern.smp.maxid`"

The LAs headed back toward 0.00, again taking around 5 minutes.  Wow
again ..

Adding those two cpuset commands to (a new) /etc/rc.local - without even
the sleep between them that I thought might be needed - then rebooting:
0.00 LAs.  Whee!
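For anyone else wanting to try the same workaround, the whole of that
rc.local amounts to just those two lines; a minimal sketch (my
paraphrase - only the two cpuset lines matter, and the commented-out
sleep is the one that turned out not to be needed):

  # /etc/rc.local: work around bogus idle LAs seen with HPET by pinning
  # everything to cpu0 at boot, then re-enabling all cpus.
  cpuset -l 0 -s 1
  # sleep 1
  cpuset -l "0-`sysctl -n kern.smp.maxid`" -s 1

Afterwards 'cpuset -g' should again report 'pid -1 mask: 0, 1' (on this
2-cpu box), as above.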
So, is this going to be a fix that Affects Only Me?  I'm keen to hear.
Yes, I will update the PR, once I'm quite sure I'm not hallucinating :)

Importantly, will doing this make any difference for those using the
LAPIC eventtimer who have had unexpectedly high LAs?  Perhaps so, if it
is the first eventtimer configured at boot that has - in some cases -
such a problem?  Or perhaps this only works on 9.x?  I have no 10.x
system currently, and could easily have missed some commit that picked
up on $whatever.

Current hypothesis: some variable/s are getting improperly initialised
at boot, but are (somehow?) getting properly re-initialised on changing
cpuset to 1 and then back to 2 cpus - though I've no idea how, or by
what.

Again, apologies for a short story writ too long.

cheers, Ian