From owner-freebsd-current@FreeBSD.ORG  Mon Jan 19 03:39:24 2004
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AA7F616A4CE
	for <freebsd-current@FreeBSD.org>;
	Mon, 19 Jan 2004 03:39:24 -0800 (PST)
Received: from gw.catspoiler.org (217-ip-163.nccn.net [209.79.217.163])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1E59643D1F
	for <freebsd-current@FreeBSD.org>;
	Mon, 19 Jan 2004 03:39:23 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2])
	by gw.catspoiler.org (8.12.9p2/8.12.9) with ESMTP id i0JBdD7E055679;
	Mon, 19 Jan 2004 03:39:17 -0800 (PST)
	(envelope-from truckman@FreeBSD.org)
Message-Id: <200401191139.i0JBdD7E055679@gw.catspoiler.org>
Date: Mon, 19 Jan 2004 03:39:13 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
To: mjs@cc.tut.fi
In-Reply-To: <qzusmicp6n9.fsf@butler.cc.tut.fi>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
cc: freebsd-current@FreeBSD.org
Subject: Re: 5.2R: panic (syncer) on IBM x345 (SMP and Vinum)
X-List-Received-Date: Mon, 19 Jan 2004 11:39:24 -0000

On 19 Jan, Matti Saarinen wrote:
> 
> I've been able to crash a server (usenet news server) running 5.2R.
> The crash happens with and without ACPI. The attached info is with
> ACPI enabled. I would be very pleased if someone could tell me why the
> box crashed and how to prevent it from happening. I tried searching
> the list archives and googling without any positive result.
> 
> The hardware is an IBM x345 with two CPUs (Pentium4), an internal LSI
> SCSI/RAID controller and an external IBM SCSI controller (which is
> really an Adaptec SCSI Card 29320LP). There is an IBM ESX400 disk array
> connected to the Adaptec controller. All the disks are U320 disks.
> 
> The root filesystem is mirrored with the LSI adapter (which only
> supports mirroring of two drives). There are three other mirrored
> filesystems created with vinum. On all file systems except root, I've
> enabled soft updates. I've tested all the filesystems (mirrored root, 
> vinum mirrors and filesystems created on single disks) with bonnie++
> and iozone and the server has behaved well. 

> (da0:ahd0:0:0:0): Retrying Command
> (da0:ahd0:0:0:0): Queue Full
> (da0:ahd0:0:0:0): tagged openings now 128
> (da0:ahd0:0:0:0): Retrying Command


Try using the camcontrol modepage command to turn off write caching on
each of the drives (set the WCE bit to 0).  This should eliminate the
need for the driver to crank down the number of tagged openings. Less
stress on the error recovery code may keep the bug from being triggered.
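
For reference, WCE lives in the SCSI caching mode page (page code 0x08),
byte 2, bit 2, so something like "camcontrol modepage da0 -m 8 -e" should
bring the page up for editing. Below is a rough C sketch of what clearing
the bit amounts to on a raw copy of that page. The function names are made
up for illustration, and this is not how camcontrol itself is written.

```c
#include <stddef.h>
#include <stdint.h>

/* SCSI caching mode page: page code 0x08, WCE is bit 2 of byte 2. */
#define CACHING_PAGE_CODE 0x08
#define CACHING_WCE_BIT   0x04

/*
 * Hypothetical helper: report the WCE state of a raw caching mode page.
 * Returns 1 if write caching is enabled, 0 if it is disabled, and -1 if
 * the buffer does not look like a caching mode page.
 */
int caching_page_wce(const uint8_t *page, size_t len)
{
    /* The low 6 bits of byte 0 hold the page code. */
    if (page == NULL || len < 3 || (page[0] & 0x3f) != CACHING_PAGE_CODE)
        return -1;
    return (page[2] & CACHING_WCE_BIT) ? 1 : 0;
}

/*
 * Hypothetical helper: clear WCE in place, leaving the other bits of
 * byte 2 alone. Returns 0 on success, -1 if the page is not recognised.
 */
int caching_page_clear_wce(uint8_t *page, size_t len)
{
    if (caching_page_wce(page, len) < 0)
        return -1;
    page[2] &= (uint8_t)~CACHING_WCE_BIT;
    return 0;
}
```

The modified page would still have to be sent back to the drive with a
MODE SELECT, which is the part camcontrol takes care of.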


> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x0
> fault code              = supervisor write, page not present
> instruction pointer     = 0x8:0xc07bcafe
> stack pointer           = 0x10:0xe7b96784
> frame pointer           = 0x10:0xe7b967c0
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 79 (syncer)
> 
> 
> 
> Attached below are the verbose boot logs from the server and the
> kernel debugger output.

> trap_fatal(e7b96744,0,c0837ed0,2cd,cafe9500) at trap_fatal+0x326
> trap_pfault(e7b96744,0,0,1ea30e7,0) at trap_pfault+0x1c2
> trap(e7b90018,10,e7b90010,0,d9a46000) at trap+0x2fd
> calltrap() at calltrap+0x5
> --- trap 0xc, eip = 0xc07bcafe, esp = 0xe7b96784, ebp = 0xe7b967c0 ---
> generic_bcopy(d78de930,0,d78de930,e7b967e4,c06590e1) at generic_bcopy+0x1a
> vinumstrategy(d78de930,cafe9500,e7b9680c,c05da937,d78de930) at vinumstrategy+0xa6
> dev_strategy(d78de930,0,2ee,1,c077dc95) at dev_strategy+0x41
> spec_xstrategy(cb6d071c,d78de930,e7b96828,c05d9c38,e7b96854) at spec_xstrategy+0x1d7

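The fault is a supervisor write at virtual address 0x0 inside
generic_bcopy(), called from vinumstrategy(), so it looks like vinum is
passing a NULL destination pointer to bcopy.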
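
To illustrate the kind of check that would turn this panic into an error
return, here is a small userland sketch. checked_copy() is a hypothetical
helper, not anything in the vinum source, and the real fix would be to
find out why the pointer is NULL in the first place.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy len bytes from src to dst using the
 * bcopy(src, dst, len) argument order, but refuse NULL pointers
 * instead of faulting. Returns 0 on success or EFAULT.
 */
int checked_copy(const void *src, void *dst, size_t len)
{
    if (len == 0)
        return 0;               /* nothing to copy */
    if (src == NULL || dst == NULL)
        return EFAULT;          /* would have faulted at address 0x0 */
    memmove(dst, src, len);     /* like bcopy, tolerates overlap */
    return 0;
}
```

A caller would check the return value and fail the I/O request rather
than let the copy write through a NULL pointer.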