From owner-freebsd-sparc64@FreeBSD.ORG Fri Feb 24 13:05:23 2006 Return-Path: X-Original-To: sparc64@freebsd.org Delivered-To: freebsd-sparc64@FreeBSD.ORG Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 18DD216A420; Fri, 24 Feb 2006 13:05:23 +0000 (GMT) (envelope-from jhb@freebsd.org) Received: from server.baldwin.cx (66-23-211-162.clients.speedfactory.net [66.23.211.162]) by mx1.FreeBSD.org (Postfix) with ESMTP id 27B8943D46; Fri, 24 Feb 2006 13:05:22 +0000 (GMT) (envelope-from jhb@freebsd.org) Received: from zion.baldwin.cx (zion.baldwin.cx [192.168.0.7]) (authenticated bits=0) by server.baldwin.cx (8.13.4/8.13.4) with ESMTP id k1OD5Kt0061857; Fri, 24 Feb 2006 08:05:21 -0500 (EST) (envelope-from jhb@freebsd.org) From: John Baldwin To: freebsd-sparc64@freebsd.org Date: Fri, 24 Feb 2006 07:43:21 -0500 User-Agent: KMail/1.8.3 References: <20060223204716.GA90985@xor.obsecurity.org> <20060223230734.GA93088@xor.obsecurity.org> In-Reply-To: <20060223230734.GA93088@xor.obsecurity.org> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Message-Id: <200602240743.22979.jhb@freebsd.org> X-Virus-Scanned: ClamAV 0.87.1/1301/Fri Feb 24 05:04:11 2006 on server.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-1.4 required=4.2 tests=ALL_TRUSTED autolearn=ham version=3.1.0 X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on server.baldwin.cx Cc: sparc64@freebsd.org, Kris Kennaway Subject: Re: "sched_lock held too long" panic + trace X-BeenThere: freebsd-sparc64@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Porting FreeBSD to the Sparc List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 24 Feb 2006 13:05:23 -0000 On Thursday 23 February 2006 06:07 pm, Kris Kennaway wrote: > On Thu, Feb 23, 2006 at 03:47:16PM -0500, Kris Kennaway wrote: > > One of my e4500s has started panicking regularly under load because > > sched_lock was held for > 5 seconds. Since on sparc64 it always > > deadlocks after this panic instead of entering DDB, I wasn't able to > > track down the cause. Instead, I changed the panic to first > > DELAY(1000000*PCPU_GET(cpuid)) (so that different CPUs don't overlap > > the printfs) and then kdb_backtrace(). > > > > Doing so I obtained the following trace (still a bit corrupted, but > > hopefully more useful). > > > > KDB: stack backtrace: > > hardclock_cpu() at hardclock_cpu+0x6c > > tick_hardclock() at tick_hardclock+0xc4 > > -- interrupt level=3D0xe pil=3D0 %o7=3D0xc0190a98 -- > > _mtx_lock_spin() at _mtx_lock_spin+0xf4 > > tlb_page_demap() at tlb_page_demap+0xa0 > > pmap_zero_page_idle() at pmap_zero_page_idle+0xdc > > vm_page_zero_idle() at vm_page_zero_idle+0x108 > > vm_pagezero() at vm_pagezero+0x4c > > fork_exit() at fork_exit+0x94 > > fork_trampoline() at fork_trampoline+0x8 > > Witness seems to have caught this: > > panic: blockable sleep lock (sleep mutex) system map @ vm/vm_map.c:2995 > db> wh > Tracing pid 1267 tid 100248 td 0xfffff800612a0540 > panic() at panic+0x164 > witness_checkorder() at witness_checkorder+0xc8 > _mtx_lock_flags() at _mtx_lock_flags+0x80 > _vm_map_lock_read() at _vm_map_lock_read+0x3c > vm_map_lookup() at vm_map_lookup+0x1c > vm_fault() at vm_fault+0x68 > trap_pfault() at trap_pfault+0x1a8 > trap() at trap+0x2b0 > -- fast data access mmu miss tar=3D0xe819c000 %o7=3D0xc031d204 -- > cpu_ipi_selected() at cpu_ipi_selected+0x2c > tlb_page_demap() at tlb_page_demap+0x74 > pmap_copy_page() at pmap_copy_page+0x39c > vm_fault() at vm_fault+0xe5c > trap_pfault() at trap_pfault+0x134 > trap() at trap+0xa0 > -- data access protection tar=3D0x4065c524 sfar=3D0x4065d314 sfsr=3D0x800= 005 > %o7=3D0x40350c94 -- This is just a bug. You shouldn't get a pagefault in cpu_ipi_selected(). =2D-=20 John Baldwin =A0<>< =A0http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve" =A0=3D =A0http://www.FreeBSD.org