From owner-freebsd-stable@FreeBSD.ORG Mon Jan 8 15:52:55 2007
From: Sven Willenberger
To: Bruce Evans
Cc: stable@freebsd.org, freebsd-amd64@freebsd.org
Subject: Re: Panic in 6.2-PRERELEASE with bge on amd64
Date: Mon, 08 Jan 2007 10:58:55 -0500
Message-Id: <1168271935.23549.10.camel@lanshark.dmv.com>
In-Reply-To: <20070108154433.C75042@delplex.bde.org>
References: <1168211205.22629.6.camel@lanshark.dmv.com> <20070108154433.C75042@delplex.bde.org>

On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:
> On Sun, 7 Jan 2007, Sven Willenberger wrote:
>
> > I am starting a new thread on this as what I had assumed was a panic in
> > nfsd turns out to be an issue with the bge driver. This is an amd64 box,
> > dual processor (SMP kernel) that happens to be running nfsd. About every
> > 3-5 days the kernel panics and I have finally managed to get a core
> > dump.
> > The system: FreeBSD 6.2-PRERELEASE #8: Tue Jan 2 10:57:39 EST 2007
>
> Like most NIC drivers, bge unlocks and re-locks around its call to
> ether_input() in its interrupt handler. This isn't very safe, and it
> certainly causes panics for bge. I often see it panic when bringing
> the interface down and up while input is arriving, on a non-SMP non-amd64
> (actually i386) non-6.x (actually -current) system. Bringing the
> interface down is probably the worst case. It creates a null pointer
> for bge_intr() to follow.
>
> > The short and dirty of the dump:
> > ...
> > --- trap 0xc, rip = 0xffffffff801d5f17, rsp = 0xffffffffb371ab50, rbp = 0xffffffffb371aba0 ---
> > bge_rxeof() at bge_rxeof+0x3b7
>
> What is the instruction here?

I will do my best to ferret out the information you need. For the
bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:

0xffffffff801d5f17 :    mov    %r15,0x28(%r14)

For the bge_intr() at bge_intr+0x1c8 line, the instruction is:

0xffffffff801db818 :    mov    %rbx,%rdi

> > bge_intr() at bge_intr+0x1c8
> > ithread_loop() at ithread_loop+0x14c
> > fork_exit() at fork_exit+0xbb
> > fork_trampoline() at fork_trampoline+0xe
> > --- trap 0, rip = 0, rsp = 0xffffffffb371ad00, rbp = 0 ---
> >
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 1; apic id = 01
> > fault virtual address = 0x28
>
> Looks like a null pointer panic anyway. I guess the instruction is
> movl to/from 0x28(%reg) where %reg is a null pointer.
>

From the above lines, apparently %r14 is null then.

> > ...
> > #8 0xffffffff801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707
>
> What is the statement here? It presumably follows a null pointer and only
> the expression for the pointer is interesting. xsc is already null but
> that is probably a bug in gdb, or the result of excessive optimization.
> Compiling kernels with -O2 has little effect except to break debugging.
>

The block of code from if_bge.c:

2705         if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
2706                 /* Check RX return ring producer/consumer. */
2707                 bge_rxeof(sc);
2708
2709                 /* Check TX ring producer/consumer. */
2710                 bge_txeof(sc);
2711         }

By default -O2 is passed to CC (I don't use any custom make flags; I
only define CPUTYPE in my /etc/make.conf).

> I rarely use gdb on kernels and haven't looked closely enough using ddb
> to see where the null pointer for the panic on down/up came from.
>
> BTW, the sbdrop panic in -current isn't bge-only or SMP-only. I saw
> it once for sk on a non-SMP system. It rarely happens for non-SMP
> (much more rarely than the panic in bge_intr()). Under -current, on
> an SMP amd64 system with bge, it happens almost every time on close
> of the socket for a ttcp server if input is arriving at the time of
> the close. I haven't seen it for 6.x.
>
> Bruce

The short of it is that this interface sees pretty much non-stop traffic,
as this is a mailserver (final destination): mail is constantly being
delivered to it (direct disk access) and retrieved from it (remote
machine(s) with NFS-mounted mail spools). If a momentary down of the
interface is enough to completely panic the driver and then the kernel,
this hardly seems "robust" if, in fact, this is what is happening. So the
question arises as to what would be causing the down/up of the interface;
I could start looking at the cable, the switch it's connected to and ...
any other ideas? (I don't have watchdog enabled or anything like that,
for example.)

Sven
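
A footnote on the fault-address reasoning in this thread: the faulting
instruction stores to 0x28(%r14) and the reported fault virtual address is
0x28, which is why a null %r14 is the natural reading. Below is a minimal
sketch of that arithmetic in userland C. The struct and both helper
functions are hypothetical, invented only for illustration; they are not
taken from if_bge.c, and the real structure behind %r14 is not identified
in the dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for whatever object %r14 was supposed to point at.
 * The layout is made up so that one member sits at byte offset 0x28; it
 * does not mirror any real bge or mbuf structure.
 */
struct fake_rx_state {
    uint64_t pad[5];   /* 5 * 8 = 40 = 0x28 bytes of leading members */
    uint64_t target;   /* the member a store like "mov %r15,0x28(%r14)" would hit */
};

/* Address touched by a base+displacement operand such as 0x28(%r14). */
static uintptr_t effective_address(uintptr_t base, uintptr_t disp)
{
    return base + disp;
}

/* Work backwards from a fault address to the base register's value. */
static uintptr_t base_from_fault(uintptr_t fault_addr, uintptr_t disp)
{
    assert(fault_addr >= disp);   /* otherwise the displacement cannot be right */
    return fault_addr - disp;
}
```

With these helpers, base_from_fault(0x28, 0x28) yields 0, i.e. a null base,
and effective_address(0, offsetof(struct fake_rx_state, target)) yields
0x28 again. This only confirms that the numbers in the trap message are
consistent with a null pointer; it says nothing about which pointer in
bge_rxeof() was null or why.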