From: John Baldwin
To: freebsd-hackers@freebsd.org
Cc: Ian Lepore, Allan Jude
Subject: Re: stopped processes using cpu?
Date: Wed, 03 Sep 2014 16:05:49 -0400
Message-ID: <1567020.dfiLmFunn8@ralph.baldwin.cx>
In-Reply-To: <201408201138.40228.jhb@freebsd.org>

On Wednesday, August 20, 2014 11:38:40 AM John Baldwin wrote:
> On Wednesday, August 20, 2014 9:17:06 am Ian Lepore wrote:
> > On Tue, 2014-08-19 at 18:45 -0700, Tim Kientzle wrote:
> > > On Aug 19, 2014, at 12:28 PM, Allan Jude wrote:
> > > > On 2014-08-19 15:21, Dieter BSD wrote:
> > > >> 8.2 on amd64
> > > >> Top(1) with no arguments reports that some firefox processes are
> > > >> using cpu despite being stopped (via kill -stop pid) for at least
> > > >> several hours.
> > > >> Adding -C doesn't change the numbers.  Ps(1) reports the same.
> > > >> Interestingly, a firefox that isn't stopped is (correctly?) reported
> > > >> as using 0 cpu.  The 100% idle should be correct, but who knows.
> > > >>
> > > >> last pid: 51932;  load averages:  0.07,  0.99,  1.42  up 14+19:02:56  08:48:28
> > > >> 267 processes: 1 running, 138 sleeping, 128 stopped
> > > >> CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
> > > >> Mem: 1665M Active, 653M Inact, 240M Wired, 95M Cache, 372M Buf, 815M Free
> > > >> Swap: 8965M Total, 560K Used, 8965M Free
> > > >>
> > > >>   PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
> > > >> 44188 a           9  44    0   303M   187M STOP   113:19 13.43% firefox-bin
> > > >> 92986 b          11  44    0   164M 62848K STOP     0:18  5.03% firefox-bin
> > > >> 16507 c          11  44    0   189M 88976K STOP     0:13  0.24% firefox-bin
> > > >>  2265 root        1  44    0   248M   193M select 625:38  0.00% Xorg
> > > >> 51271 d          10  44    0   233M   128M ucond   12:12  0.00% firefox-bin
> > > >
> > > > I wonder if jhb@'s new top code solves this. He adjusted the way CPU
> > > > usage is tracked to be more responsive, and not based on averages
> > >
> > > I wonder if jhb@’s new top code fixes the whacky WCPU values we’ve been
> > > seeing on FreeBSD/ARM.
> > > (1713% CPU is a little hard to believe on a single-core board ;-).
> > >
> > > Tim
> >
> > *Fixes* it?  I've been under the impression those changes caused it.  I
> > certainly never saw 1000%+ numbers in top until very recently.
>
> Yes, if it's a recent change then mine are to blame.  In both cases the
> numbers are imprecise.  The older code still in stable@ (as in the OP),
> takes a long time to ramp up and down.  So in this case the processes are
> stopped (no, there's no rootkit), but the scheduler takes a long time to
> factor that into its decayed %CPU computation.
>
> In the "new" code, the problem is that fetching the kinfo_proc and the
> current timestamp for that kinfo_proc is not atomic.  I have thought
> about "fixing" that by embedding a new timeval in kinfo_proc that is
> stamped with the time the individual kinfo_proc is generated.  This would
> (I believe) alleviate the noise in the new code as the delta in walltime
> at the "bottom" of the ratio would then correspond to the delta in runtime
> on the "top".
>
> However, trying to store a timeval in kinfo_proc is quite tricky as all the
> available fields are things like ints and longs.  I could perhaps split it
> up into two longs which is kind of fugly.  Another option would be to just
> generate a single long that holds raw nanoseconds uptime and store that
> (wrapping would be ok since I would only care about deltas).

So I tried this and the results aren't a lot better.  I think the problem now
is that rufetch() doesn't force an update of the target thread's stats to
"now" (the way getrusage() does for curthread).  Because the idle thread runs
constantly when idle, it is especially prone to this imprecision.  I'm not
sure of a good way to fix this.  Having a per-thread timestamp that was
updated each time the runtime was updated would help for a currently-running
thread perhaps.
Another option would be to use an IPI (ewww) to force currently
running threads to update their runtime when the sysctl runs.  That seems a
bit expensive though.  (I might at least try it to see if it does resolve it
to verify my understanding of the issue.)

-- 
John Baldwin
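
For readers following the thread, the ratio discussed above (delta in
runtime on the "top", delta in walltime on the "bottom") and the effect of a
stale runtime can be sketched in plain C.  This is only an illustration under
stated assumptions, not the code in top(1) or the kernel: struct cpu_sample,
its field names, and the helper functions are hypothetical, and uint64_t
stands in for the "single long" of raw nanoseconds mentioned in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-process sample; these are NOT real kinfo_proc fields. */
struct cpu_sample {
	uint64_t runtime_ns;	/* CPU time charged to the process so far */
	uint64_t stamp_ns;	/* raw uptime when the sample was generated */
};

/* Unsigned subtraction is modulo 2^64, so a delta survives one wrap. */
uint64_t
delta_ns(uint64_t now, uint64_t then)
{
	return (now - then);
}

/* %CPU over the interval between two samples of the same process. */
double
pct_cpu(const struct cpu_sample *prev, const struct cpu_sample *cur)
{
	uint64_t run = delta_ns(cur->runtime_ns, prev->runtime_ns);
	uint64_t wall = delta_ns(cur->stamp_ns, prev->stamp_ns);

	if (wall == 0)
		return (0.0);	/* no elapsed time, nothing to report */
	return (100.0 * (double)run / (double)wall);
}

/*
 * Assumed scenario: a thread that is 100% busy on one CPU, sampled once
 * per second.  Sample b is stamped at 2s, but its runtime was last
 * updated 0.1s earlier, so it reads low; sample c is fully up to date.
 */
void
stale_runtime_example(void)
{
	struct cpu_sample a = { UINT64_C(1000000000), UINT64_C(1000000000) };
	struct cpu_sample b = { UINT64_C(1900000000), UINT64_C(2000000000) };
	struct cpu_sample c = { UINT64_C(3000000000), UINT64_C(3000000000) };

	assert(pct_cpu(&a, &b) == 90.0);	/* under-reported */
	assert(pct_cpu(&b, &c) == 110.0);	/* over-reported, > 100% */
}
```

In this sketch the per-sample timestamp makes the two deltas cover the same
interval, but a runtime that lags behind "now" still shifts CPU time from one
interval into the next, which would be consistent with the results not
improving much.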