Date: Sun, 27 Dec 2015 02:09:59 +1100 (EST)
From: Ian Smith <smithi@nimnet.asn.au>
To: Tom Evans
cc: FreeBSD Stable, Alexander Motin, Jeremy Chadwick,
    Miroslav Lachman <000.fbsd@quip.cz>
Subject: Re: stable/10: high load average when box is idle
Message-ID: <20151106212735.U12989@sola.nimnet.asn.au>
References: <20151027050508.GA7612@icarus.home.lan>
    <20151104025748.F10372@sola.nimnet.asn.au>

On Wed, 4 Nov 2015 14:31:35 +0000, Tom Evans wrote:
 > On Tue, Nov 3, 2015 at 5:47 PM, Ian Smith wrote:
 > > Like yourself, I think this is far from 'cosmetic' as is oft
 > > suggested, especially in some fairly ill-informed forum posts but
 > > also various list posts.  I've been watching load averages since
 > > OS/2 through FreeBSD 2.2 till present and some Linux systems, and
 > > have never before seen anything like this, especially on virtually
 > > entirely idle systems.  While the LAs reported during (say) make -j4
 > > buildworld appear more sane, who knows?
 > >
 > > I'm not suggesting that I think there's any sort of performance hit
 > > from this; so far I don't, but 'cosmetic' suggests that it doesn't
 > > matter ..

I'm afraid I can't resist being undoubtedly too thorough with this post,
but it does summarise quite a lot of time I've spent this year, and I'm
feeling a little in need of celebrating an apparent win as a result ..

 > Have you read mav's explanation on the PR? It certainly seems a valid
 > explanation of why it is probably a cosmetic issue - that is, we could
 > wake up more to get a truly perfect load average, but why bother?

I've read it over and again.  I think it may well explain some relative
'fuzziness' in LAs over perhaps a 0.1 to 0.3 range that I still see,
since finding - more truthfully, stumbling upon - and implementing a fix
for the major issue I've had and been working on in bursts over many
months.

The figures below are from recent stable/9 on a 2.4GHz Core2Duo, ~99.99%
idle after boot, running just sshd, powerd, ntpd and hald.  The columns
show the worst-seen extremes, the commonly seen range, and the more or
less steady value for the 1, 5 and 15 minute load averages respectively:

  eventtimer mode     1min            5min        15min
  HPET   one-shot     0.4-0.85        0.5-0.75    0.6-0.7
  HPET   periodic     0.99-1.00 (!)   0.99-0.99   0.99
  LAPIC  one-shot     0.0-0.05        0.0-0.02    0.0-0.0
  LAPIC  periodic     0.0-0.05        0.0-0.02    0.0-0.0
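For anyone wanting to reproduce those rows: switching the eventtimer and
its mode is just runtime sysctl twiddling, then waiting for the averages
to settle.  Roughly - assuming both HPET and LAPIC show up in
kern.eventtimer.choice on the box in question, and with the wait only a
rough guess at a settling period:

  # sysctl kern.eventtimer.choice        # list available eventtimers
  # sysctl kern.eventtimer.timer=HPET    # or LAPIC
  # sysctl kern.eventtimer.periodic=1    # 1 = periodic, 0 = one-shot
  # sleep 900                            # let the 5 and 15 min LAs settle
  # sysctl vm.loadavg                    # or just 'uptime'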
I run a task that's 100% cpu-bound (well, about 7 I/O reads in the first
few milliseconds), configured to run for 15.5 minutes, in the background
so I can watch and take measurements throughout.  It uses 100% of one CPU
(so scheduled as ~50% per cpu) for that long.  With the LAPIC eventtimer,
the 1min LA soon reaches 0.99-1.00, with the 5min LA getting there in
~15 minutes.  If I run 2 instances, each completes in 15.5 minutes (plus
~5 seconds), both cpus show 100% for those tasks, and the 1min LA soon
reaches 1.99-2.00.  If I run 4 instances on its 2 cpus, the 1min LA soon
reaches 3.99-4.00, and the 5min LA reaches 3.99 after 15-20 minutes.
Since 4 cpu-bound tasks want to run on 2 cpus, each gets ~15.5m of cpu
time, so total runtime for each is ~31 minutes, by which time the 15min
LA is heading for 4 as well.

In other words, the load averages shown with LAPIC are pretty much
exactly what one would (and should) expect for 1, 2 or 4 long-running
cpu-bound tasks, and that holds even in periodic mode, which has almost
as few interrupts and context switches as HPET (but without HPET's
valuable C3 usage).

I should add that with HPET in one-shot mode, 2 such tasks report ~2.4
LA, while 4 tasks report ~4.4 LA, so there's definitely an increment
there.  Switching back to HPET soon reverts to the 'normal' behaviour,
which over long idles (hours) shows 0.66 or 0.67 as the 'average average'
15min LA.  HPET in periodic mode - which has helped some people running
10.x with LAPIC - here only rapidly pins the LAs to 0.99-1.00 at idle,
before and after any real load.  Both of these HPET-reported LAs just
smelled like bad maths.  I spent way too much time down the (upper parts
of the) rabbit hole/s with little to show; I should have asked more
questions, but wanted to suss it out ..

 > ISTM, the problem reports are of the form "My server is idle and it is
 > reporting a non-idle load average below 1", and not "My server is
 > supposedly idle, but is doing actual work causing the load average to
 > be higher than it should be". Is there any evidence that we have the
 > latter and not the former?

Er, yes, it is the former, not the latter.  And it's not so much that
it's shown as below 1, as that it's shown above 0.00 - ie, something
above 0.005 - when the box is genuinely idle below that level: top
showing 100.00% idle on both cpu idle tasks, and say 100.0% and 99.8%
idle per cpu from cp_times (top -SCHP).

So I was wondering if I really had to post my hundreds of kB of logs and
running reports of the above to properly demonstrate what was happening
here, when I thought I'd try cpuset(1) to pin the abovementioned task to
one cpu, rather than letting the scheduler share it out over both.  First
up I tried one of the examples from cpuset(1), except using just 1 of the
2 cpus:

 # cpuset -l 0 -s 1

And lo! the previously 0.66-ish HPET LAs headed towards 0.00, taking
around 5 minutes to get there.  Wow.  So I ran some more tests using the
abovementioned tasks, and sure enough, the LAs were as good as with
LAPIC.  Then after a while I reenabled CPU1 ..

 # cpuset -l 0-1 -s 1
 # cpuset -g
 pid -1 mask: 0, 1

.. fully expecting the LAs to climb back to 'normal' HPET behaviour ..
but they did not!  They stayed where they were, basically 0.00 0.00 0.00
in such idleness, and appropriately more like 0.1-0.2 with X fired up,
idle.

Reboot.  Straight away, log in as root and run:

 # cpuset -l 0 -s 1
 # cpuset -l 0-1 -s 1    # ie: -l "0-`sysctl -n kern.smp.maxid`"

The LAs headed back toward 0.00, again taking around 5 minutes.  Wow
again ..

Adding those two cpuset commands to (a new) /etc/rc.local - without even
the sleep between them that I thought might be needed - then rebooting:
0.00 LAs.  Whee!
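For anyone else wanting to try the same workaround, the whole of that
rc.local amounts to just those two lines; a minimal sketch (my
paraphrase - only the two cpuset lines matter, and the commented-out
sleep is the one that turned out not to be needed):

  # /etc/rc.local: work around bogus idle LAs seen with HPET by pinning
  # everything to cpu0 at boot, then re-enabling all cpus.
  cpuset -l 0 -s 1
  # sleep 1
  cpuset -l "0-`sysctl -n kern.smp.maxid`" -s 1

Afterwards 'cpuset -g' should again report 'pid -1 mask: 0, 1' (on this
2-cpu box), as above.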
So, is this going to be a fix that Affects Only Me?  I'm keen to hear.
Yes, I will update the PR, once I'm quite sure I'm not hallucinating :)

Importantly, will doing this make any difference for those using the
LAPIC eventtimer who have had unexpectedly high LAs?  Perhaps so, if it
is the first eventtimer configured at boot that has - in some cases -
such a problem?  Or perhaps this only works on 9.x?  I have no 10.x
system currently, and could easily have missed some commit that picked
up on $whatever.

Current hypothesis: some variable/s are getting improperly initialised
at boot, but are (somehow?) getting properly re-initialised on changing
cpuset to 1 and then back to 2 cpus - though I've no idea how, or by
what.

Again, apologies for a short story writ too long.

cheers, Ian