FreeBSD Mail Archives

Date:      Wed, 03 Sep 2014 16:05:49 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-hackers@freebsd.org
Cc:        Ian Lepore <ian@freebsd.org>, Allan Jude <allanjude@freebsd.org>
Subject:   Re: stopped processes using cpu?
Message-ID:  <1567020.dfiLmFunn8@ralph.baldwin.cx>
In-Reply-To: <201408201138.40228.jhb@freebsd.org>
References:  <CAA3ZYrAzpxpFNST5ZT-zHvk4Gg38w-yH1dTQj53Fp_rM-hohaA@mail.gmail.com> <1408540626.1150.1.camel@revolution.hippie.lan> <201408201138.40228.jhb@freebsd.org>

On Wednesday, August 20, 2014 11:38:40 AM John Baldwin wrote:
> On Wednesday, August 20, 2014 9:17:06 am Ian Lepore wrote:
> > On Tue, 2014-08-19 at 18:45 -0700, Tim Kientzle wrote:
> > > On Aug 19, 2014, at 12:28 PM, Allan Jude <allanjude@freebsd.org> wrote:
> > > > On 2014-08-19 15:21, Dieter BSD wrote:
> > > >> 8.2 on amd64
> > > >> Top(1) with no arguments reports that some firefox processes are
> > > >> using cpu despite being stopped (via kill -stop pid) for at least
> > > >> several hours.
> > > >> Adding -C doesn't change the numbers.  Ps(1) reports the same.
> > > >> Interestingly, a firefox that isn't stopped is (correctly?) reported
> > > >> as using 0 cpu.  The 100% idle should be correct, but who knows.
> > > >>
> > > >> last pid: 51932;  load averages:  0.07, 0.99, 1.42 up 14+19:02:56 08:48:28
> > > >> 267 processes: 1 running, 138 sleeping, 128 stopped
> > > >> CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
> > > >> Mem: 1665M Active, 653M Inact, 240M Wired, 95M Cache, 372M Buf, 815M Free
> > > >> Swap: 8965M Total, 560K Used, 8965M Free
> > > >>
> > > >>   PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
> > > >> 44188 a           9  44    0   303M   187M STOP   113:19 13.43% firefox-bin
> > > >> 92986 b          11  44    0   164M 62848K STOP     0:18  5.03% firefox-bin
> > > >> 16507 c          11  44    0   189M 88976K STOP     0:13  0.24% firefox-bin
> > > >>  2265 root        1  44    0   248M   193M select 625:38  0.00% Xorg
> > > >> 51271 d          10  44    0   233M   128M ucond   12:12  0.00% firefox-bin
> > > >> _______________________________________________
> > > >> freebsd-hackers@freebsd.org mailing list
> > > >> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> > > >> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
> > > >>
> > > > I wonder if jhb@'s new top code solves this. He adjusted the way CPU
> > > > usage is tracked to be more responsive, and not based on averages
> > >
> > > I wonder if jhb@’s new top code fixes the whacky WCPU values we’ve been
> > > seeing on FreeBSD/ARM.  (1713% CPU is a little hard to believe on a
> > > single-core board ;-).
> > >
> > > Tim
> >
> > *Fixes* it?  I've been under the impression those changes caused it.  I
> > certainly never saw 1000%+ numbers in top until very recently.
>
> Yes, if it's a recent change then mine are to blame.  In both cases the
> numbers are imprecise.  The older code still in stable@ (as in the OP),
> takes a long time to ramp up and down.  So in this case the processes are
> stopped (no, there's no rootkit), but the scheduler takes a long time to
> factor that into its decayed %CPU computation.
>
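To make the slow ramp concrete, here is a minimal sketch of a decayed
average in Python.  It is a toy model only: the function name and the 0.95
per-tick decay factor are made up for illustration and are not the
scheduler's actual code or constants.

```python
def decayed_cpu(history, decay=0.95):
    """Return the decayed CPU estimate after each tick.

    history: iterable of 0/1 flags, 1 if the process ran during that tick.
    decay:   hypothetical per-tick decay factor, not a real scheduler constant.
    """
    estimate = 0.0
    out = []
    for ran in history:
        # Exponentially weighted average: old usage fades by `decay` each tick.
        estimate = decay * estimate + (1.0 - decay) * ran
        out.append(estimate)
    return out


# 200 ticks fully busy, then 60 ticks stopped.
trace = decayed_cpu([1] * 200 + [0] * 60)
busy = trace[199]        # estimate at the moment the process is stopped
after_stop = trace[-1]   # estimate 60 ticks later, still above zero
```

In this model the estimate only approaches zero geometrically after the
process stops, so a reader sampling it shortly afterwards still sees usage.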
> In the "new" code, the problem is that fetching the kinfo_proc and the
> current timestamp for that kinfo_proc is not atomic.  I have thought
> about "fixing" that by embedding a new timeval in kinfo_proc that is
> stamped with the time the individual kinfo_proc is generated.  This would
> (I believe) alleviate the noise in the new code as the delta in walltime
> at the "bottom" of the ratio would then correspond to the delta in runtime
> on the "top".
>
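As a rough illustration of that ratio, the sketch below compares a wall-clock
delta read separately from the fetch against one stamped into each record.
All names and numbers are hypothetical; nothing here is the real kinfo_proc
layout or top(1) code.

```python
def pct_cpu(runtime_delta, walltime_delta):
    """Percent CPU over an interval: runtime delta over wall-clock delta."""
    return 100.0 * runtime_delta / walltime_delta


# A hypothetical thread that is 100% busy: runtime advances 1:1 with wall time.
# Each snapshot is (time the record was generated, runtime at that instant).
prev = (10.0, 10.0)
curr = (12.0, 12.0)

# The caller reads the clock separately, a little after each fetch, and the
# lag differs between the two samples (0.25 s, then 0.05 s).
prev_clock = prev[0] + 0.25
curr_clock = curr[0] + 0.05

# Non-atomic: runtime delta is 2.0 s but the measured wall delta is 1.8 s.
skewed = pct_cpu(curr[1] - prev[1], curr_clock - prev_clock)

# Stamped: both deltas cover exactly the same interval.
stamped = pct_cpu(curr[1] - prev[1], curr[0] - prev[0])
```

With these made-up numbers the separately read clock reports about 111%
for a thread that is exactly 100% busy, while the stamped ratio is exact.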
> However, trying to store a timeval in kinfo_proc is quite tricky as all the
> available fields are things like ints and longs.  I could perhaps split it
> up into two longs which is kind of fugly.  Another option would be to just
> generate a single long that holds raw nanoseconds uptime and store that
> (wrapping would be ok since I would only care about deltas).
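
The reason wrapping is harmless for deltas can be sketched with modular
arithmetic.  The 32-bit width and the helper name below are assumptions for
illustration; Python integers do not wrap, so the mask emulates a
fixed-width unsigned counter.

```python
COUNTER_BITS = 32                  # assumed width; a long is 32 or 64 bits
MASK = (1 << COUNTER_BITS) - 1


def wrapping_delta(prev, curr, mask=MASK):
    """Elapsed count between two samples of a counter that wraps at mask + 1."""
    return (curr - prev) & mask


before = MASK - 999   # sampled 1000 ns before the counter wraps to 0
after = 500           # sampled 500 ns after the wrap
elapsed = wrapping_delta(before, after)
```

The delta is only meaningful while the true interval is shorter than one
full wrap of the counter, which for 32 bits of nanoseconds is about 4.29
seconds.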

So I tried this and the results aren't a lot better.  I think the problem now
is that rufetch() doesn't force an update of the target thread's stats to
"now" (the way getrusage() does for curthread).  Because the idle thread runs
constantly when the system is idle, it is especially prone to this
imprecision.  I'm not sure of a good way to fix this.  Having a per-thread
timestamp that was updated each time the runtime was updated would help for a
currently-running thread perhaps.  Another option would be to use an IPI
(ewww) to force currently running threads to update their runtime when the
sysctl runs.  That seems a bit expensive though.  (I might at least try it to
see if it does resolve it to verify my understanding of the issue.)
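
A toy model of what I think is happening, in Python.  It assumes a thread
that runs nonstop while its stored runtime only catches up at a few update
instants; the function names, update times, and sampling interval are all
invented for illustration and do not correspond to kernel code.

```python
import bisect


def recorded_runtime(update_times, t):
    """Runtime visible to a reader at time t for a thread that runs nonstop.

    The thread has really run for t seconds, but in this model the stored
    counter only catches up at the instants in update_times (sorted).
    """
    i = bisect.bisect_right(update_times, t)
    return update_times[i - 1] if i else 0.0


def sampled_pct(update_times, sample_times):
    """Percent CPU per sampling interval, computed from the stored counter."""
    out = []
    for t0, t1 in zip(sample_times, sample_times[1:]):
        ran = recorded_runtime(update_times, t1) - recorded_runtime(update_times, t0)
        out.append(100.0 * ran / (t1 - t0))
    return out


# Hypothetical numbers: the counter is brought up to date at t = 2.5 and
# t = 5.5 only, while a top-like reader samples once per second.
updates = [2.5, 5.5]
samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
result = sampled_pct(updates, samples)
```

The true value is 100% in every interval, yet the reader sees runs of 0%
broken up by spikes of 250% and 300%, which is one way readings well above
100% could show up for a constantly running thread.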

-- 
John Baldwin

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1567020.dfiLmFunn8>
