From owner-freebsd-stable@FreeBSD.ORG Mon Mar 9 19:46:20 2009
Message-ID: <49B57208.4000601@palisadesys.com>
Date: Mon, 09 Mar 2009 14:46:16 -0500
From: Guy Helmer
To: John Baldwin
Cc: freebsd-stable@freebsd.org
Subject: Re: 7.1 hangs in cache_lookup mutex?
In-Reply-To: <49ABE8FB.3060202@palisadesys.com>
References: <49A46AB4.3080003@palisadesys.com> <200902261648.32845.jhb@freebsd.org> <49A7173B.4030608@palisadesys.com> <200902261753.29607.jhb@freebsd.org> <49A80A55.5070004@palisadesys.com> <49ABE8FB.3060202@palisadesys.com>

Guy Helmer wrote:
> Guy Helmer wrote:
>> John Baldwin wrote:
>>> On Thursday 26 February 2009 5:27:07 pm Guy Helmer wrote:
>>>
>>>> John Baldwin wrote:
>>>>
>>>>> On Thursday 26 February 2009 4:22:15 pm Guy Helmer wrote:
>>>>>
>>>>>> db> show sleepchain 23110
>>>>>> thread 100181 (pid 23110, vmstat) blocked on sx "user map" XLOCK
>>>>>> thread 100208 (pid 23092, kvoop) is on a run queue
>>>>>> db> show sleepchain 23092
>>>>>> thread 100208 (pid 23092, kvoop) is on a run queue
>>>>>>
>>>>> Ah, so this is normal (well, mostly) in that kvoop is simply on
>>>>> the run queue waiting for a CPU. Can you find the thread pointer
>>>>> for kvoop and check on things such as if it is pinned and if so to
>>>>> which CPU (td_pinned will tell you the first, and td_sched->ts_cpu
>>>>> will tell you the second with ULE).
>>>>>
>>>> (kgdb) print td->td_pinned
>>>> $2 = 0
>>>>
>>>
>>> Ok, not pinned.
>>>
>>>
>>>> From my captured ddb run:
>>>> cpuid = 3
>>>> curthread = 0xc5e2f000: pid 23090 "filter"
>>>> curpcb = 0xe6f90d90
>>>> fpcurthread = none
>>>> idlethread = 0xc442daf0: pid 11 "idle: cpu3"
>>>> APIC ID = 7
>>>> currentldt = 0x50
>>>> spin locks held:
>>>>
>>>
>>> At http://www.freebsd.org/~jhb/gdb/ you can find my kgdb scripts.
>>> If you source gdb6 you can run 'runtds' which will show you what
>>> each CPU is doing (more or less) in ps-style output.
>>>
>>>
>>>> I sure wish I could find the root cause of the hangs. On a hunch,
>>>> I tried setting "machdep.cpu_idle_hlt=0" on the amd64 machine, and
>>>> it has run 32 hours without a hang.
>>>> It could just be coincidence, though...
>>>>
>>>
>>> Ahhh, that actually could explain it perhaps. Do your CPUs support
>>> C2 or higher sleep states for idle? You can try limiting it to only
>>> C1 (or disable C1E in your BIOS if it has an option for that) to see
>>> if that fixes it.
>>>
>>>
>> I don't think the CPUs support anything lower than C1: there is no
>> hw.acpi.cpu.cx_supported sysctl node, and hw.acpi.cpu.cx_lowest is
>> C1. C1-Enhanced was already disabled in the BIOS, at least on the
>> machine running amd64. 48 hours of runtime, and no hangs seen yet.
>> I did reboot it this morning to check the sleep settings in the BIOS.
> Despite having machdep.cpu_idle_hlt=0, the machine wedged for 40 hours
> over the weekend but came back to life by itself. Could this be lost
> IPIs, or a bug in the scheduler?

To finish off this thread: after I disabled hyperthreading in the BIOS
on this machine (dual Nocona Xeons in a Supermicro X6DHR-8G), it was
stable for 96 hours. I then applied r189023 (which makes
machdep.hyperthreading_allowed=0 disable the HT cores at boot) to
7.1-RELEASE, set machdep.hyperthreading_allowed=0 in /boot/loader.conf,
and re-enabled hyperthreading in the BIOS to verify the effect of
r189023. The machine has now been stable for 92 hours.

Guy
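The workaround described in the last paragraph comes down to a single loader tunable. A minimal sketch of the /boot/loader.conf entry, assuming a kernel that includes r189023 (the message credits that revision with making this setting disable the HT cores at boot; on a kernel without it, this line should not be expected to have that effect):

```
# /boot/loader.conf
# Assumes r189023 is applied to the kernel; with it, this tunable
# disables the hyperthreading (HT) cores at boot.
machdep.hyperthreading_allowed="0"
```

Loader tunables are read at boot, so the entry takes effect only after a reboot; with it in place, hyperthreading can stay enabled in the BIOS, as in the 92-hour test above.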