From owner-freebsd-hardware@FreeBSD.ORG  Wed Jun 18 03:52:44 2003
Message-Id: <200306181052.h5IAqTu2008960@dungeon.home>
To: joshuah@synology.com
References: <200306171554.h5HFs2DQ041575@mail.synology.com>
In-Reply-To: <200306171554.h5HFs2DQ041575@mail.synology.com>
    from Jaw-Shiang Joshua Huang at "Tue, 17 Jun 2003 23:54:02 +0800"
Date: Wed, 18 Jun 2003 20:52:29 +1000
From: Stephen McKay <smckay@internode.on.net>
cc: freebsd-hardware@freebsd.org
cc: Stephen McKay <smckay@internode.on.net>
Subject: Re: ATA READ command timeout (and worse) 

On Tuesday, 17th June 2003, Jaw-Shiang Joshua Huang wrote:

>Because your machine will reboot automatically when the disk driver operation
>is abnormal, it makes me want to know more.
>
>Is your kernel compiled with DDB?  If not, it will reboot 15 seconds after
>hitting a panic.  If it's reproducible, would you mind compiling a new
>kernel and trying to find out where it panics or page faults?  I just want to
>know whether this bug makes the FreeBSD kernel reboot, or just panic or page fault.

I recompiled the kernel with DDB.  A few test runs and I got this:

Jun 18 19:19:44 peon /kernel: ad4: no status, reselecting device
Jun 18 19:19:44 peon /kernel: ad4: timeout sending command=c8 s=ff e=00
Jun 18 19:19:44 peon /kernel: ad4: error executing command - resetting
Jun 18 19:19:44 peon /kernel: ata2: resetting devices .. 
Jun 18 19:19:44 peon /kernel: ad4: removed from configuration
Jun 18 19:19:44 peon /kernel: ad5: removed from configuration
Jun 18 19:19:44 peon /kernel: done

Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x63657865
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc0164cd9
stack pointer           = 0x10:0xc02bd438
frame pointer           = 0x10:0xc02bd4c0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = Idle
interrupt mask          =
kernel: type 12 trap, code=0
Stopped at      kvprintf+0x545: repne scasb     (%esi)
db> trace
kvprintf(c028add6,c016456c,c02bd4e0,a,c02bd4fc) at kvprintf+0x545
printf(c028add4,63657865,c1246800,c02bd528,c012d908) at printf+0x44
ata_prtdev(c139a400,c028d280,c028d271,5b512a0,0,0) at ata_prtdev+0x1a
ad_timeout(c13bb200,400000,0,0,ffffffff) at ad_timeout+0x40
softclock(0,10,10,10,ffffffff) at softclock+0xd1
doreti_swi(e,665,2,183f9ff,756e6547) at doreti_swi+0xf
idle_loop() at idle_loop+0x1d
db>

Obviously 0x63657865 is suspicious: read as little-endian bytes, it is the
ASCII text "exec".  On further investigation, the
ata_device structure at 0xc139a400 has been corrupted.  The unit and
subsequent fields have been replaced by the text string "/libexec/ld-elf.so.1"
which is odd, to say the least.
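
The fault address can be checked by treating it as four bytes of text.  The
following is a minimal Python sketch, assuming a little-endian i386 machine;
the helper name address_as_ascii is hypothetical and used only for
illustration.  It is not part of the kernel or of DDB.

```python
import struct


def address_as_ascii(addr: int) -> str:
    """Render a 32-bit value as the 4 bytes it occupies in little-endian memory.

    Non-printable bytes are shown as '.' so that stray binary data stays readable.
    """
    raw = struct.pack("<I", addr)  # "<I" = little-endian unsigned 32-bit
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Fault virtual address reported by the trap message.
fault_address = 0x63657865
text = address_as_ascii(fault_address)

print(f"0x{fault_address:08x} -> {text!r}")
# Check whether those bytes occur in the string found in the ata_device structure.
print(text in "/libexec/ld-elf.so.1")
```

If the decoded bytes appear inside the string that overwrote the structure,
the faulting pointer was most likely loaded from the corrupted fields rather
than being a random wild address.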

Now I don't know what I'm chasing: a random VM bug, bad memory, PCI bus
errors, sagging power, bugs in the ata driver, cosmic rays, space aliens.

It's been a long time since I've had to do any kernel debugging, but I
suppose I'll have to set up a serial console and get to it.

Stephen.