Message-Id: <200306181052.h5IAqTu2008960@dungeon.home>
To: joshuah@synology.com
References: <200306171554.h5HFs2DQ041575@mail.synology.com>
In-Reply-To: <200306171554.h5HFs2DQ041575@mail.synology.com> from Jaw-Shiang Joshua Huang at "Tue, 17 Jun 2003 23:54:02 +0800"
Date: Wed, 18 Jun 2003 20:52:29 +1000
From: Stephen McKay
cc: freebsd-hardware@freebsd.org
cc: Stephen McKay
Subject: Re: ATA READ command timeout (and worse)

On Tuesday, 17th June 2003, Jaw-Shiang Joshua Huang wrote:

>Because your machine will reboot automatically when the disk driver operation
>is abnormal, it makes me want to know more.
>
>Is your kernel compiled with DDB? If not, it will reboot after 15 seconds
>when hitting a panic. If it's reproducible, would you mind compiling a new
>kernel and trying to find out where it panics or page faults?
>I just want to know whether this bug makes the FreeBSD kernel reboot, or just
>hit a panic or page fault.

I recompiled the kernel with DDB. A few test runs and I got this:

Jun 18 19:19:44 peon /kernel: ad4: no status, reselecting device
Jun 18 19:19:44 peon /kernel: ad4: timeout sending command=c8 s=ff e=00
Jun 18 19:19:44 peon /kernel: ad4: error executing command - resetting
Jun 18 19:19:44 peon /kernel: ata2: resetting devices ..
Jun 18 19:19:44 peon /kernel: ad4: removed from configuration
Jun 18 19:19:44 peon /kernel: ad5: removed from configuration
Jun 18 19:19:44 peon /kernel: done

Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x63657865
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc0164cd9
stack pointer           = 0x10:0xc02bd438
frame pointer           = 0x10:0xc02bd4c0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = Idle
interrupt mask          =
kernel: type 12 trap, code=0
Stopped at kvprintf+0x545: repne scasb (%esi)
db> trace
kvprintf(c028add6,c016456c,c02bd4e0,a,c02bd4fc) at kvprintf+0x545
printf(c028add4,63657865,c1246800,c02bd528,c012d908) at printf+0x44
ata_prtdev(c139a400,c028d280,c028d271,5b512a0,0,0) at ata_prtdev+0x1a
ad_timeout(c13bb200,400000,0,0,ffffffff) at ad_timeout+0x40
softclock(0,10,10,10,ffffffff) at softclock+0xd1
doreti_swi(e,665,2,183f9ff,756e6547) at doreti_swi+0xf
idle_loop() at idle_loop+0x1d
db>

Obviously 0x63657865 is suspicious. On further investigation, the
ata_device structure at 0xc139a400 has been corrupted. The unit and
subsequent fields have been replaced by the text string
"/libexec/ld-elf.so.1", which is odd, to say the least.

Now I don't know what I'm chasing: a random VM bug, bad memory, PCI bus
errors, sagging power, bugs in the ata driver, cosmic rays, space aliens.
It's been a long time since I've had to do any kernel debugging, but I
suppose I'll have to set up a serial console and get to it.

Stephen.
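
For anyone wondering why 0x63657865 stands out: a fault address made up entirely of printable ASCII bytes usually means text has been read as a pointer. A minimal sketch of the check, in Python rather than anything run on the affected machine, assuming the i386 little-endian byte order; the helper name word_to_ascii is made up for illustration and is not part of any FreeBSD tool:

```python
import struct


def word_to_ascii(word: int) -> str:
    """Show a 32-bit word as the 4 bytes it occupies in little-endian memory.

    Bytes outside the printable ASCII range are rendered as '.'.
    """
    raw = struct.pack("<I", word)  # lowest-order byte first, as stored on i386
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# "fault virtual address" taken from the trap message above
fault_address = 0x63657865
print(word_to_ascii(fault_address))
```

This prints exec, which is consistent with the "/libexec/ld-elf.so.1" string found in the ata_device structure: printf() appears to have been handed four bytes of that text where it expected a string pointer.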