From: Scott Long
Date: Fri, 13 Jul 2007 14:19:13 -0600
To: John Baldwin
Cc: freebsd-current@freebsd.org, Matt Reimer
Subject: Re: arcmsr crash
Message-ID: <4697DE41.7090100@samsco.org>
In-Reply-To: <200707131528.51396.jhb@freebsd.org>

John Baldwin wrote:
> On Tuesday 05 June 2007 05:22:38 pm Matt Reimer wrote:
>> Once a week or so we're seeing a panic with a -current kernel built
>> just before the gcc 4.2 import (maybe three weeks ago). The box has a
>> Supermicro X7DBE/X7DBE+ motherboard with two Xeon 5160s, 16G RAM, and
>> an Areca 1220 controller with eight 500G disks connected.
>>
>> Does this indicate that the arcmsr driver is at fault:
>>
>> Tracing command irq16: arcmsr0 pid 26 tid 100018 td 0xffffff040fc5b000
>> cpustop_handler() at cpustop_handler+0x35
>> ipi_nmi_handler() at ipi_nmi_handler+0x2e
>> trap() at trap+0x365
>> nmi_calltrap() at nmi_calltrap+0x8
>> --- trap 0x13, rip = 0xffffffff8041ab11, rsp = 0xffffffffab59eff0, rbp = 0xffffffffac0a37d0 ---
>> siocnclose() at siocnclose+0x21
>> sio_cnputc() at sio_cnputc+0x89
>> cnputc() at cnputc+0x6a
>> putchar() at putchar+0x5f
>> kvprintf() at kvprintf+0xd45
>> printf() at printf+0xe1
>> panic() at panic+0x145
>> xpt_done() at xpt_done+0x14a
>> arcmsr_interrupt() at arcmsr_interrupt+0x2df
>> ithread_loop() at ithread_loop+0x108
>> fork_exit() at fork_exit+0xaa
>> fork_trampoline() at fork_trampoline+0xe
>> --- trap 0, rip = 0, rsp = 0xffffffffac0a3d30, rbp = 0 ---
>
> Looks like it has panic'd here:
>
>     switch (done_ccb->ccb_h.path->periph->type) {
>     case CAM_PERIPH_BIO:
>         mtx_lock(&cam_bioq_lock);
>         TAILQ_INSERT_TAIL(&cam_bioq, &done_ccb->ccb_h,
>             sim_links.tqe);
>         done_ccb->ccb_h.pinfo.index = CAM_DONEQ_INDEX;
>         mtx_unlock(&cam_bioq_lock);
>         swi_sched(cambio_ih, 0);
>         break;
>     default:
>         panic("unknown periph type %d",
>             done_ccb->ccb_h.path->periph->type);
>     }
>
> which should seem to indicate that, yes, it is a driver bug.
>

The doneq has gotten corrupted somehow. The only real way that this
could happen is if xpt_done() was called twice on the same ccb. Whether
this is a hardware bug (hardware completing the same command twice) or a
driver bug is unknown. I'll try to add some seatbelts to CAM to detect
this kind of condition. But yes, it's ultimately something in the arcmsr
subsystem that is at fault.

Scott
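
For illustration only, here is a minimal userland sketch of the kind of "seatbelt" mentioned above: refuse to queue a ccb that is already marked as sitting on the done queue. This is not the actual CAM change. The names demo_ccb, demo_doneq, demo_done(), demo_queue_len(), DEMO_DONEQ_INDEX and DEMO_UNQUEUED are all hypothetical, and using the index field as the "already completed" marker is an assumption modeled on the pinfo.index assignment in the quoted code. The point is that inserting the same element into a tail queue twice leaves its link pointers inconsistent, so the check has to happen before TAILQ_INSERT_TAIL.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/queue.h>

/* Hypothetical marker values; stand-ins, not the real CAM constants. */
#define DEMO_UNQUEUED     (-1)  /* ccb is not on any queue */
#define DEMO_DONEQ_INDEX  (-2)  /* ccb is already on the done queue */

/* Hypothetical stand-in for a ccb header; not the real struct ccb_hdr. */
struct demo_ccb {
    int index;                      /* plays the role of pinfo.index */
    TAILQ_ENTRY(demo_ccb) links;    /* plays the role of sim_links.tqe */
};

TAILQ_HEAD(demo_doneq, demo_ccb);

/*
 * Queue a completed ccb.  Returns 0 on success, or -1 if the ccb is
 * already marked as being on the done queue (a double completion).
 * The queue is left untouched in the double-completion case.
 */
static int
demo_done(struct demo_doneq *q, struct demo_ccb *ccb)
{
    assert(q != NULL && ccb != NULL);

    if (ccb->index == DEMO_DONEQ_INDEX) {
        fprintf(stderr, "demo_done: ccb %p completed twice\n",
            (void *)ccb);
        return (-1);
    }
    TAILQ_INSERT_TAIL(q, ccb, links);
    ccb->index = DEMO_DONEQ_INDEX;
    return (0);
}

/* Count the entries on the queue by walking it from head to tail. */
static int
demo_queue_len(struct demo_doneq *q)
{
    struct demo_ccb *ccb;
    int n = 0;

    TAILQ_FOREACH(ccb, q, links)
        n++;
    return (n);
}
```

A caller would initialize the queue with TAILQ_INIT() and each ccb's index to DEMO_UNQUEUED. A second demo_done() on the same ccb then returns -1 instead of relinking it. A real kernel version would need to do the check under the queue lock and would presumably panic or log with more context; that part is left out here.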