Date: Thu, 10 Sep 2009 12:46:45 -0400 From: Linda Messerschmidt <linda.messerschmidt@gmail.com> To: freebsd-hackers@freebsd.org Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 Message-ID: <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> In-Reply-To: <200908271729.55213.jhb@freebsd.org> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, Aug 27, 2009 at 5:29 PM, John Baldwin<jhb@freebsd.org> wrote: > Ah, cool, what you want to do is use KTR with KTR_SCHED and then use > schedgraph.py (src/tools/sched) to get a visual picture of what the box d= oes > during a hang. =A0The timestamps in KTR are TSC cycle counts rather than = an > actual wall time which is why they look off. =A0If you have a lot of even= ts you > may want to use a larger KTR_ENTRIES size btw (I use 1048576 (2 ^ 20) her= e at > work to get large (multiple-second) traces). I'm still working on this. I enabled KTR and set it up to log KTR_SCHED events. Then, I wrote a script to exercise the HTTP server that actually ran on that machine, and set it to issue "sysctl debug.ktr.cpumask=3D0" and abort if a request took over 2 seconds. 28,613 requests later, it tripped over one that took 2007ms. (Just a refresher: this is a static file being served by an Apache process that has nothing else to do but serve this file on a relatively unloaded machine.) I don't have access to any machines that can run X, so I did the best I could to examine it from the shell. First, this machine has two CPU's so I split up the KTR results on a per-CPU basis so I could look at each individually. With KTR_ENTRIES set to 1048576, I got about 53 seconds of data with just KTR_SCHED enabled. Since I was interested in a 2.007 second period of time right at the end, I hacked it down to the last 3.795 seconds. In the 3.795 seconds captured in the trace period on CPU 0 that includes the entire 2.007 second stall, CPU 0 was idle for 3.175 seconds. In the same period, CPU 1 was idle for 3.2589 seconds. I did the best I could to manually page through all the scheduling activity on both CPUs during that 3.7 second time, and I didn't see anything really disruptive. Mainly idle, with jumps into the clock and ethernet kernel threads, as well as httpd. If I understand that correctly and have done everything right, that means that whatever happened, it wasn't related to CPU contention or scheduling issues of any sort. So, a couple of follow-up questions: First, what else should I be looking at? I built the kernel with kind of a lot of KTR flags (KTR_LOCK|KTR_SCHED|KTR_PROC|KTR_INTR|KTR_CALLOUT|KTR_UMA|KTR_SYSC) but enabling them all produces enough output that even 1048576 entries doesn't always go back two seconds; the volume of data is all but unmanageable. Second, is there any way to correlate the process address reported by the KTR scheduler entries back to a PID? It'd be nice to be able to view the scheduler activity just for the process I'm interested in, but I can't figure out which one it is. :) Thanks!
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?237c27100909100946q3d186af3h66757e0efff307a5>