Date: Sun, 18 May 2008 05:42:17 -0700
From: Jeremy Chadwick
To: Andrew Hill
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS lockup in "zfs" state
Message-ID: <20080518124217.GA16222@eos.sc1.parodius.com>
In-Reply-To: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net>

On Sun, May 18, 2008 at 05:11:37PM +1000, Andrew Hill wrote:
> right now, i'm getting uptimes in the order of days before everything locks
> up, i assume its related to this bug, though i'm also getting the following
> output when it locks up
>
> ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631
> ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=234920650
> ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=443427007
> ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350174938
> ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350494631
> ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=234920650
> ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=443427007
> ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350174938
> ad2: FAILURE - WRITE_DMA48 timed out LBA=350494631
> ad0: FAILURE - WRITE_DMA timed out LBA=234920650
> ad2: FAILURE - WRITE_DMA48 timed out LBA=443427007
> ad0: FAILURE - WRITE_DMA48 timed out LBA=350174938

I've documented this fairly well, although I suppose I could write up a
diagnosis method as an addendum.  Anyway:

http://wiki.freebsd.org/JeremyChadwick/Commonly_reported_issues

One thing: are the timeouts always on ad0 and ad2?

> typically repeated for a number of different LBA values before the system
> panics. I don't know if this is more likely to be related to the cause of
> the lockups (e.g. faulty hardware/driver) or if its an effect of the lockup
> (e.g. waiting on a deadlocked thread)... from what i've found searching
> mailing lists, this kind of error seems to turn up with faulty
> hardware/drivers so i guess it could just be that zfs exposes the faults
> because its using the hardware differently to my previous ufs setup...

It is possible you have some bad hardware, but many of us have seen the
above (with or without ZFS) on perfectly good hardware.  For some,
changing cables fixed the problem, while for others absolutely nothing
fixed it (changed cables, changed controller brands, changed to new
disks).

If the DMA timeouts are easily reproducible, please get in touch with
Scott Long, who is in the process of researching why these happen.
Serial console access might be required.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
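
To help answer whether the timeouts are always on ad0 and ad2, the
console output can be tallied per device.  The following is only a
minimal sketch, not part of the original message: it assumes the log
lines look exactly like the ones quoted above, and the names LINE_RE,
tally_by_device and SAMPLE are hypothetical, chosen for illustration.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the quoted console output:
#   ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631
#   ad0: FAILURE - WRITE_DMA timed out LBA=234920650
# The pattern is an illustration, not an official log grammar.
LINE_RE = re.compile(
    r"(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>\w+) .*?LBA=(?P<lba>\d+)"
)


def tally_by_device(log_text):
    """Count TIMEOUT and FAILURE lines per (device, kind) pair."""
    counts = Counter()
    for match in LINE_RE.finditer(log_text):
        counts[(match.group("dev"), match.group("kind"))] += 1
    return counts


# The twelve lines quoted in the message, used here as sample input.
SAMPLE = """\
ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350494631
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=234920650
ad2: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=443427007
ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=350174938
ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350494631
ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=234920650
ad2: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=443427007
ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=350174938
ad2: FAILURE - WRITE_DMA48 timed out LBA=350494631
ad0: FAILURE - WRITE_DMA timed out LBA=234920650
ad2: FAILURE - WRITE_DMA48 timed out LBA=443427007
ad0: FAILURE - WRITE_DMA48 timed out LBA=350174938
"""

if __name__ == "__main__":
    for (dev, kind), n in sorted(tally_by_device(SAMPLE).items()):
        print(f"{dev} {kind}: {n}")
```

Feeding it a longer capture (for example, saved serial console output)
would show whether devices other than ad0 and ad2 ever appear.  The
pattern also matches lines that still carry a leading "> " quote
prefix, since it searches within each line rather than anchoring at
the start.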