Message-ID: <472A7EAE.6050608@nokia.com>
Date: Fri, 02 Nov 2007 11:34:38 +1000
From: Glen
To: ext Nate Lawson
Cc: ACPI mailing list
Subject: Re: SMP system shutdown hang (acpi_cpu_shutdown - smp_rendezvous)
References: <472A53B2.6030901@nokia.com> <472A72AB.4000809@root.org>
In-Reply-To: <472A72AB.4000809@root.org>

ext Nate Lawson wrote:
> Glen wrote:
>
>> Hi,
>>
>> I have been seeing intermittent hangs in the acpi shutdown code on an
>> Intel 2.4GHz 8 CPU system. I am running with a FreeBSD 6.1 code base
>> but cannot see a reason why this can't happen in other FreeBSD versions.
>> The hang is very irregular. I am recreating it using an expect script
>> that repeatedly reboots the system. Sometimes I can do up to 200
>> reboots before observing the hang; sometimes it happens after 5-20
>> reboots.
>>
>> It has been difficult to pin down the hang as the system is not
>> responding to NMI events, but using breakpoints I believe the hang is in
>> acpi_cpu.c:acpi_cpu_shutdown with the call to smp_rendezvous.
>>
>
> First, thank you for your careful debugging help. This is wonderful.
>
>
>> My theory is that one of the CPUs does not respond to ipi_all_but_self
>> and that all the other CPUs are waiting for it in smp_rendezvous_action.
>> The smp_rv_waiters[0] < mp_ncpus condition never gets met and the system
>> hangs. This may happen due to other activity (or a deadlock?) on that
>> CPU.
>>
>> I noticed a few threads relating to this and have already tried stuff
>> like changing kern.sched.ipiwakeup.enabled & machdep.cpu_idle_hlt.
>> Neither had any effect.
>>
>
> Very interesting. I didn't think anything could cause an IPI not to get
> delivered eventually but during shutdown interrupts may be disabled at
> some point.
>
>

It was only a theory; I couldn't think of any other reasons why one of
the CPUs doesn't rendezvous. Interrupts being disabled is a good reason,
though!

>> 1) I tried removing the call to smp_rendezvous in acpi_cpu_shutdown and
>> this stops the hang from happening. Does anyone know the purpose of this
>> call in the shutdown code or if I might suffer some consequence by
>> removing it?
>>
>
> Yes, I put it in to break all APs out of their potential C1-3 sleep.
> This way they are not halted when shutdown needs to synchronize and stop
> them. But that code sends its own IPI so there is no reason to do it
> again here. I will remove smp_rendezvous() now.
>

It sounds like removing smp_rendezvous is a safe thing to do. Thanks for
your insight.

>
>> 2) Has anyone got any suggestions for debugging this further given that
>> I can't break into the debugger? I thought I could maybe instrument some
>> counters in i386/i386/local_apic.c & kern_smp.c with the aim of
>> identifying a root cause.
>>
>
> Sounds reasonable. Thanks again for a detailed problem report.
>
>

I will notify the list if I find anything further regarding this
problem. Thanks for your response.
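
For anyone following the thread, here is a minimal userspace sketch of the
failure mode described above. It is not the FreeBSD kernel code. It only
models the "waiters < ncpus" spin condition and the idea of instrumenting
the rendezvous so the missing CPU can be identified. The names rv_model,
rv_init, rv_arrive and rv_wait_bounded are hypothetical, and the bounded
spin is an assumption added for illustration; the wait discussed in this
thread spins with no such limit, which is why a single absent CPU hangs
the whole system.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of a rendezvous point (supports up to 32 CPUs). */
struct rv_model {
    int         ncpus;        /* stands in for mp_ncpus */
    atomic_int  waiters;      /* stands in for smp_rv_waiters[0] */
    atomic_uint arrived_mask; /* instrumentation: bit N set when CPU N arrives */
};

static void
rv_init(struct rv_model *rv, int ncpus)
{
    assert(ncpus > 0 && ncpus <= 32);
    rv->ncpus = ncpus;
    atomic_init(&rv->waiters, 0);
    atomic_init(&rv->arrived_mask, 0);
}

/* Called once by each CPU when it reaches the rendezvous. */
static void
rv_arrive(struct rv_model *rv, int cpu)
{
    assert(cpu >= 0 && cpu < rv->ncpus);
    atomic_fetch_or(&rv->arrived_mask, 1u << cpu);
    atomic_fetch_add(&rv->waiters, 1);
}

/*
 * Spin while waiters < ncpus, but give up after max_spins iterations.
 * Returns 0 if every CPU arrived, otherwise a bitmask of the CPUs that
 * never showed up.
 */
static uint32_t
rv_wait_bounded(struct rv_model *rv, long max_spins)
{
    uint32_t all = (rv->ncpus == 32) ? UINT32_MAX : ((1u << rv->ncpus) - 1u);

    for (long i = 0; i < max_spins; i++) {
        if (atomic_load(&rv->waiters) >= rv->ncpus)
            return 0;
    }
    return all & ~atomic_load(&rv->arrived_mask);
}
```

With 8 modelled CPUs where CPU 3 never calls rv_arrive(), rv_wait_bounded()
gives up and returns a mask with only bit 3 set. That is the kind of
information counters in the IPI and rendezvous paths would be meant to
surface on a machine that cannot drop into the debugger.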