Date: Mon, 21 Jun 2004 15:01:38 +1000 (EST) From: Bruce Evans <bde@zeta.org.au> To: Don Lewis <truckman@freebsd.org> Cc: current@freebsd.org Subject: Re: calcru: negative time ... followed by freeze Message-ID: <20040621132119.Q8596@gamplex.bde.org> In-Reply-To: <200406210258.i5L2wTKF047144@gw.catspoiler.org> References: <200406210258.i5L2wTKF047144@gw.catspoiler.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 20 Jun 2004, Don Lewis wrote: > On 20 Jun, Robert Watson wrote: > > FYI, this is a Xeon box with two physical processors, each with two > > logical processors, and the problem could well be SMP-related. > > The hang might be SMP-related, but I just started getting the calcru > messages on my UP Athlon XP box. I just upgraded to today's > -CURRENT from a week-old version, and my console got spammed by a bunch > of these messages while I was running portupgrade. > > calcru: negative time of 0 usec for pid 21261 (sh) > calcru: negative time of 0 usec for pid 21261 (sh) > ... > calcru: negative time of 3917 usec for pid 49504 (sh) > calcru: negative time of 3917 usec for pid 49504 (sh) Hmm, these are nonnegative times. The extra messages in -current might be caused by fill_kinfo() now calling calcru(). Misreporting of negative times is by the following too-simple dianostic: % tu = (u_int64_t)tv.tv_sec * 1000000 + tv.tv_usec; % ptu = p->p_uu + p->p_su + p->p_iu; % if (tu < ptu || (int64_t)tu < 0) { % printf("calcru: negative time of %jd usec for pid %d (%s)\n", % (intmax_t)tu, p->p_pid, p->p_comm); % tu = ptu; % } The message is also printed for the tu < ptu case, which is what you are getting. I fixed the messages when I got a lot of them due to a local bug. The local bug was from double rounding for calcru() on child times (which -current doesn't do). I can't see how this could be the problem in -current, since the components of ptu are rounded down and there is a KASSERT that they added up to less than tu in the previous call. Ah, here is a likely cause of the bug in -current: % if (p == curthread->td_proc) { % /* % * Adjust for the current time slice. This is actually fairly % * important since the error here is on the order of a time % * quantum, which is much greater than the sampling error. % * XXXKSE use a different test due to threads on other % * processors also being 'current'. % */ % binuptime(&bt); % bintime_sub(&bt, PCPU_PTR(switchtime)); % bintime_add(&bt, &p->p_runtime); % } else % bt = p->p_runtime; The XXXKSE comment is correct that this might be broken. If the (p != curthread->td_proc) case happens at all for a running process, then it gives a wrong (out of date) timestamp in bt. This wrongness will be detected if calcru() is was called called earlier in the current timeslice and took the other path here. The recent change to fill_kinfo() is quite likely to trigger detection of this bug. fill_kinfo() is often used to iterate over all processes for ps, so it will call calcru() with (p != curthread->td_proc) for all processes other than the one running it, and give a bt that is out of date for all such processes that are actually running. Since there can be at most one running process per CPU, this bug only affects SMP. The call to calcru() from ttyinfo() may be the only other trigger. ttyinfo() picks a process and should rarely or never pick the ithread running it, so it will almost always take the (p != curthread->td_proc) path. Again, this is only a problem for the SMP case since in the !SMP case the picked process must have been switched away from to run the ithread, so it cannot be running. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20040621132119.Q8596>