Date: Mon, 21 Jun 2004 01:01:20 -0700 (PDT) From: Don Lewis <truckman@FreeBSD.org> To: bde@zeta.org.au Cc: current@FreeBSD.org Subject: Re: calcru: negative time ... followed by freeze Message-ID: <200406210801.i5L81K5u047553@gw.catspoiler.org> In-Reply-To: <20040621132119.Q8596@gamplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On 21 Jun, Bruce Evans wrote: > On Sun, 20 Jun 2004, Don Lewis wrote: > >> On 20 Jun, Robert Watson wrote: >> > FYI, this is a Xeon box with two physical processors, each with two >> > logical processors, and the problem could well be SMP-related. >> >> The hang might be SMP-related, but I just started getting the calcru >> messages on my UP Athlon XP box. I just upgraded to today's >> -CURRENT from a week-old version, and my console got spammed by a bunch >> of these messages while I was running portupgrade. >> >> calcru: negative time of 0 usec for pid 21261 (sh) >> calcru: negative time of 0 usec for pid 21261 (sh) >> ... >> calcru: negative time of 3917 usec for pid 49504 (sh) >> calcru: negative time of 3917 usec for pid 49504 (sh) > > Hmm, these are nonnegative times. > > The extra messages in -current might be caused by fill_kinfo() now calling > calcru(). > > Misreporting of negative times is by the following too-simple dianostic: > > % tu = (u_int64_t)tv.tv_sec * 1000000 + tv.tv_usec; > % ptu = p->p_uu + p->p_su + p->p_iu; > % if (tu < ptu || (int64_t)tu < 0) { > % printf("calcru: negative time of %jd usec for pid %d (%s)\n", > % (intmax_t)tu, p->p_pid, p->p_comm); > % tu = ptu; > % } > > The message is also printed for the tu < ptu case, which is what you are > getting. > > I fixed the messages when I got a lot of them due to a local bug. The local > bug was from double rounding for calcru() on child times (which -current > doesn't do). I can't see how this could be the problem in -current, since > the components of ptu are rounded down and there is a KASSERT that they > added up to less than tu in the previous call. > > Ah, here is a likely cause of the bug in -current: > > % if (p == curthread->td_proc) { > % /* > % * Adjust for the current time slice. This is actually fairly > % * important since the error here is on the order of a time > % * quantum, which is much greater than the sampling error. > % * XXXKSE use a different test due to threads on other > % * processors also being 'current'. > % */ > % binuptime(&bt); > % bintime_sub(&bt, PCPU_PTR(switchtime)); > % bintime_add(&bt, &p->p_runtime); > % } else > % bt = p->p_runtime; > > The XXXKSE comment is correct that this might be broken. If the (p > != curthread->td_proc) case happens at all for a running process, then > it gives a wrong (out of date) timestamp in bt. This wrongness will > be detected if calcru() is was called called earlier in the current > timeslice and took the other path here. > > The recent change to fill_kinfo() is quite likely to trigger detection > of this bug. fill_kinfo() is often used to iterate over all processes > for ps, so it will call calcru() with (p != curthread->td_proc) for > all processes other than the one running it, and give a bt that is out > of date for all such processes that are actually running. Since there > can be at most one running process per CPU, this bug only affects SMP. > > The call to calcru() from ttyinfo() may be the only other trigger. > ttyinfo() picks a process and should rarely or never pick the ithread > running it, so it will almost always take the (p != curthread->td_proc) > path. Again, this is only a problem for the SMP case since in the !SMP > case the picked process must have been switched away from to run the > ithread, so it cannot be running. There must be some !SMP trigger for this as well. I just checked and I was able to trigger this on my Pentium-M laptop as well by leaning on the ^T key while I was logged on via ssh and running 'portupgrade -aP'. Jun 21 00:41:31 hairball kernel: calcru: negative time of 23169 usec for pid 44653 (sh) Jun 21 00:41:32 hairball kernel: calcru: negative time of 21990 usec for pid 44665 (sh) I didn't use ^T on my Athlon box. I might have had top running, though. It's interesting that this bug only seems to get triggered on /bin/sh. Maybe it is fork()/exit()/wait() related?
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200406210801.i5L81K5u047553>