From owner-freebsd-hackers@FreeBSD.ORG Thu Sep 10 16:46:46 2009 Return-Path: Delivered-To: freebsd-hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BBC7E106566C for ; Thu, 10 Sep 2009 16:46:46 +0000 (UTC) (envelope-from linda.messerschmidt@gmail.com) Received: from mail-qy0-f200.google.com (mail-qy0-f200.google.com [209.85.221.200]) by mx1.freebsd.org (Postfix) with ESMTP id 72E278FC0A for ; Thu, 10 Sep 2009 16:46:46 +0000 (UTC) Received: by qyk38 with SMTP id 38so243758qyk.27 for ; Thu, 10 Sep 2009 09:46:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=qNNycoBbxJH+ZU6thx3OtaGneU72NfUzfypTEP61b7M=; b=oRtvXL05FO1Dr8oUhpD7S6gq/x8zP2zwXR9zSSxKIfyDM8+5K+ka0XdmLodX68r2wc x+ieA+OS9ZP6j2vT3ksLghif121SDK6JALfyQx1QLlPRbl9L6rvrGUoGFsfmWSpTwHB4 pc9V3kpjpV6xVjvCSB1hp45yQugqSQEJdTKd0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=a5bl9pVhf/TFuUu8EynmSRq//JYFyjkMh23gjhd4zfpx2zjrjx+ZAfRXbA0qs59jjq RSrGrMHA5MgyYSjLSKIW0j/ME8uI3etXF+1NRn4wzZZpwgP55UzqkngafeLgJwEdEFU7 rGy+4SLKbDgtSLo+pQbLQ/JTbAW6VZeJMeuS8= MIME-Version: 1.0 Received: by 10.229.106.83 with SMTP id w19mr973707qco.72.1252601205724; Thu, 10 Sep 2009 09:46:45 -0700 (PDT) In-Reply-To: <200908271729.55213.jhb@freebsd.org> References: <237c27100908261203g7e771400o2d9603220d1f1e0b@mail.gmail.com> <200908261642.59419.jhb@freebsd.org> <237c27100908271237y66219ef4o4b1b8a6e13ab2f6c@mail.gmail.com> <200908271729.55213.jhb@freebsd.org> Date: Thu, 10 Sep 2009 12:46:45 -0400 Message-ID: <237c27100909100946q3d186af3h66757e0efff307a5@mail.gmail.com> From: Linda Messerschmidt To: freebsd-hackers@freebsd.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Sep 2009 16:46:46 -0000 On Thu, Aug 27, 2009 at 5:29 PM, John Baldwin wrote: > Ah, cool, what you want to do is use KTR with KTR_SCHED and then use > schedgraph.py (src/tools/sched) to get a visual picture of what the box d= oes > during a hang. =A0The timestamps in KTR are TSC cycle counts rather than = an > actual wall time which is why they look off. =A0If you have a lot of even= ts you > may want to use a larger KTR_ENTRIES size btw (I use 1048576 (2 ^ 20) her= e at > work to get large (multiple-second) traces). I'm still working on this. I enabled KTR and set it up to log KTR_SCHED events. Then, I wrote a script to exercise the HTTP server that actually ran on that machine, and set it to issue "sysctl debug.ktr.cpumask=3D0" and abort if a request took over 2 seconds. 28,613 requests later, it tripped over one that took 2007ms. (Just a refresher: this is a static file being served by an Apache process that has nothing else to do but serve this file on a relatively unloaded machine.) I don't have access to any machines that can run X, so I did the best I could to examine it from the shell. First, this machine has two CPU's so I split up the KTR results on a per-CPU basis so I could look at each individually. With KTR_ENTRIES set to 1048576, I got about 53 seconds of data with just KTR_SCHED enabled. Since I was interested in a 2.007 second period of time right at the end, I hacked it down to the last 3.795 seconds. In the 3.795 seconds captured in the trace period on CPU 0 that includes the entire 2.007 second stall, CPU 0 was idle for 3.175 seconds. In the same period, CPU 1 was idle for 3.2589 seconds. I did the best I could to manually page through all the scheduling activity on both CPUs during that 3.7 second time, and I didn't see anything really disruptive. Mainly idle, with jumps into the clock and ethernet kernel threads, as well as httpd. If I understand that correctly and have done everything right, that means that whatever happened, it wasn't related to CPU contention or scheduling issues of any sort. So, a couple of follow-up questions: First, what else should I be looking at? I built the kernel with kind of a lot of KTR flags (KTR_LOCK|KTR_SCHED|KTR_PROC|KTR_INTR|KTR_CALLOUT|KTR_UMA|KTR_SYSC) but enabling them all produces enough output that even 1048576 entries doesn't always go back two seconds; the volume of data is all but unmanageable. Second, is there any way to correlate the process address reported by the KTR scheduler entries back to a PID? It'd be nice to be able to view the scheduler activity just for the process I'm interested in, but I can't figure out which one it is. :) Thanks!