Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 10 Sep 2009 12:46:45 -0400
From:      Linda Messerschmidt <linda.messerschmidt@gmail.com>
To:        freebsd-hackers@freebsd.org
Subject:   Re: Intermittent system hangs on 7.2-RELEASE-p1
Message-ID:  <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com>
In-Reply-To: <200908271729.55213.jhb@freebsd.org>
References:  <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, Aug 27, 2009 at 5:29 PM, John Baldwin<jhb@freebsd.org> wrote:
> Ah, cool, what you want to do is use KTR with KTR_SCHED and then use
> schedgraph.py (src/tools/sched) to get a visual picture of what the box d=
oes
> during a hang. =A0The timestamps in KTR are TSC cycle counts rather than =
an
> actual wall time which is why they look off. =A0If you have a lot of even=
ts you
> may want to use a larger KTR_ENTRIES size btw (I use 1048576 (2 ^ 20) her=
e at
> work to get large (multiple-second) traces).

I'm still working on this.

I enabled KTR and set it up to log KTR_SCHED events.  Then, I wrote a
script to exercise the HTTP server that actually ran on that machine,
and set it to issue "sysctl debug.ktr.cpumask=3D0" and abort if a
request took over 2 seconds.  28,613 requests later, it tripped over
one that took 2007ms.

(Just a refresher: this is a static file being served by an Apache
process that has nothing else to do but serve this file on a
relatively unloaded machine.)

I don't have access to any machines that can run X, so I did the best
I could to examine it from the shell.

First, this machine has two CPU's so I split up the KTR results on a
per-CPU basis so I could look at each individually.

With KTR_ENTRIES set to 1048576, I got about 53 seconds of data with
just KTR_SCHED enabled.  Since I was interested in a 2.007 second
period of time right at the end, I hacked it down to the last 3.795
seconds.

In the 3.795 seconds captured in the trace period on CPU 0 that
includes the entire 2.007 second stall, CPU 0 was idle for 3.175
seconds.

In the same period, CPU 1 was idle for 3.2589 seconds.

I did the best I could to manually page through all the scheduling
activity on both CPUs during that 3.7 second time, and I didn't see
anything really disruptive.  Mainly idle, with jumps into the clock
and ethernet kernel threads, as well as httpd.

If I understand that correctly and have done everything right, that
means that whatever happened, it wasn't related to CPU contention or
scheduling issues of any sort.

So, a couple of follow-up questions:

First, what else should I be looking at?  I built the kernel with kind
of a lot of KTR flags
(KTR_LOCK|KTR_SCHED|KTR_PROC|KTR_INTR|KTR_CALLOUT|KTR_UMA|KTR_SYSC)
but enabling them all produces enough output that even 1048576 entries
doesn't always go back two seconds; the volume of data is all but
unmanageable.

Second, is there any way to correlate the process address reported by
the KTR scheduler entries back to a PID?  It'd be nice to be able to
view the scheduler activity just for the process I'm interested in,
but I can't figure out which one it is. :)

Thanks!



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?237c27100909100946q3d186af3h66757e0efff307a5>