From: Erik Klavon
To: Pyun YongHyeon
Cc: freebsd-stable@freebsd.org
Date: Wed, 20 Jan 2010 15:12:51 -0800
Subject: Re: bge panic in 8.0
Message-ID: <20100120231251.GA85328@malcolm.berkeley.edu>
In-Reply-To: <20100114232618.GA27380@malcolm.berkeley.edu>

On Thu, Jan 14, 2010 at
03:26:18PM -0800, Erik Klavon wrote:
> On Wed, Jan 13, 2010 at 06:06:40PM -0800, Pyun YongHyeon wrote:
> > On Wed, Jan 13, 2010 at 05:47:19PM -0800, Erik Klavon wrote:
> > > One of my amd64 machines running 8.0p1 acting as a NAT system for many
> > > network clients dropped into kdb today. tr indicates a problem in
> > > bge.
> > >
> > > Tracing pid 12 tid 100033 td 0xffffff0001687000
> > > pmap_kextract() at pmap_kextract+0x4e
> > > bus_dmamap_load() at bus_dmamap_load+0xab
> > > bge_newbuf_std() at bge_newbuf_std+0xcc
> > > bge_rxeof() at bge_rxeof+0x36a
> > > bge_intr() at bge_intr+0x1c0
> > > intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
> > > ithread_loop() at ithread_loop+0x8e
> > > fork_exit() at fork_exit+0x118
> > > fork_trampoline() at fork_trampoline+0xe
> > > --- trap 0, rip = 0, rsp = 0xffffff8074c01d30, rbp = 0 ---
> > >
> > > I haven't been able to find a PR that matches this particular trace.
> > >
> > > Pyun recently MFCd to stable (hence my post to this list) some changes
> > > to bge that involve functions in the above trace and according to the
> > > commit log (r201685) may address a kernel panic. Is there any
> > > indication in the above trace that this is the type of panic the
> > > commit attempts to address? I don't have a core dump for this
> > > panic. This machine has been unstable on 8, so I may be able to get a
> > > core dump in the future. If there is other information you'd like me
> > > to gather, please let me know.
> >
> > Yes, that part of code in trace above were rewritten to address
> > bus_dma(9) issues. So it would be great if you can try latest
> > bge(4) in stable/8 and let me know how it goes on your box. I guess
> > you can just download if_bge.c and if_bgereg.h from stable/8 and
> > rebuild bge(4) would be enough to run it on 8.0-RELEASE.
>
> Great, I will try this out on a test machine today. If it holds up
> under testing, I will put it into production.
> These crashes can happen
> weeks after a machine boots, so I won't know if the problem is solved
> for some time. Thanks for your help,

I didn't run into any problems while testing. I started running bge(4)
from stable in production this morning. I had three kernel panics within
a couple of hours; here's an example:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x18
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff805ccf17
stack pointer           = 0x28:0xffffff800004f830
frame pointer           = 0x28:0xffffff800004f890
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0 pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13 (ng_queue0)
[thread pid 13 tid 100009 ]
Stopped at m_copym+0x37: movl 0x18(%r12),%eax
db> tr
Tracing pid 13 tid 100009 td 0xffffff000189aab0
m_copym() at m_copym+0x37
ip_fragment() at ip_fragment+0x131
ip_output() at ip_output+0xeec
ip_forward() at ip_forward+0x16a
ip_input() at ip_input+0x57d
ng_ipfw_rcvdata() at ng_ipfw_rcvdata+0xb9
ng_apply_item() at ng_apply_item+0x220
ngthread() at ngthread+0x16b
fork_exit() at fork_exit+0x118
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff800004fd30, rbp = 0 ---

I tried the kdb command 'panic' to dump core, but this command only
produced further faults. After the third panic related to m_copym, I
reverted to the previous version of bge(4) from 8.0p1. A couple of hours
have passed without these panics repeating while running the previous
version of bge(4).

There is a long-open PR, 89070, that looks to be related to the above
panic. I don't have any proof that these panics resulted from the newer
version of bge(4). I haven't seen kernel panics such as these on any of
the other machines with this same configuration.
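As a rough reading of the m_copym fault above, the fault address and the faulting instruction's memory operand can be combined to guess what the base register held. The sketch below is illustrative only: base_register_value is a hypothetical helper, not part of any FreeBSD tool, and it assumes the faulting access is the simple displacement(base) operand that ddb printed, with no index register.

```python
import re


def base_register_value(fault_addr, operand):
    """Infer the base register value for an AT&T operand such as '0x18(%r12)'.

    Assumes effective address = displacement + base register, so the
    register value is the fault address minus the displacement.
    Returns a (register name, inferred value) pair.
    """
    match = re.fullmatch(r"(-?0x[0-9a-fA-F]+)?\(%(\w+)\)", operand)
    if match is None:
        raise ValueError(f"unsupported operand: {operand!r}")
    displacement = int(match.group(1), 16) if match.group(1) else 0
    return match.group(2), fault_addr - displacement


# Values copied from the trap output: fault virtual address 0x18,
# instruction "movl 0x18(%r12),%eax".
reg, value = base_register_value(0x18, "0x18(%r12)")
print(f"%{reg} = {value:#x}")  # prints "%r12 = 0x0"
```

Under those assumptions %r12 was zero, which would be consistent with m_copym reading a field through a NULL pointer. That is only an inference from the trace, not something confirmed with a core dump.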
I have seen a kernel panic on systems running 8.0p1 with a different
stack trace than the one I posted previously that also appears to be
related to bge(4).

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x28
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0xffffffff802cdf0e
stack pointer           = 0x28:0xffffff8074c1ab10
frame pointer           = 0x28:0xffffff8074c1ab70
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq25: bge1)
[thread pid 12 tid 100034 ]
Stopped at bge_rxeof+0x1be: movq %r15,0x28(%r14)
db> trace
Tracing pid 12 tid 100034 td 0xffffff0001680ab0
bge_rxeof() at bge_rxeof+0x1be
bge_intr() at bge_intr+0x1c0
intr_event_execute_handlers() at intr_event_execute_handlers+0xfd
ithread_loop() at ithread_loop+0x8e
fork_exit() at fork_exit+0x118
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff8074c1ad30, rbp = 0 ---

Erik