From owner-freebsd-current@FreeBSD.ORG  Mon Oct 29 18:24:10 2007
Date: Mon, 29 Oct 2007 09:59:11 +0000 (GMT)
From: "Mark Powell" <M.S.Powell@salford.ac.uk>
To: freebsd-current@freebsd.org
Message-ID: <20071029092531.J99722@rust.salford.ac.uk>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Subject: WRITE_DMA48 error causing loss of ZFS array

   I've experienced several DMA errors over the past few months with 
various incarnations of 7.0, all of which were fixed.
   Seems I have a new one. I don't know if there is a connection, but this 
only occurred after updating to 7.0-BETA1 last weekend.
   I have a small UFS mirror for /boot and everything else on one ZFS pool.
   I scrub my zpool in the early hours every Monday morning. Last Monday 
when I got to the console I saw DMA_ERRORs slowly scrolling up the screen. 
I could type 'root' at the login prompt on a virtual terminal, but it then 
just hung. There was nothing I could do apart from reset.
   When it came back it was fine AFAICT. Later that day I got the problem 
again. Reset and all ok. I then, confusingly, managed to successfully 
scrub the whole pool with no problems.
   However, this morning I had the same symptoms again. Here are a couple 
of screenshots, as nothing got logged; the pool seemed to be effectively 
unavailable:

http://webhost.salford.ac.uk/aix502/29102007(001).thb.jpg
http://webhost.salford.ac.uk/aix502/29102007(004).thb.jpg

The errors all seem to be on one drive. AFAICT it had probably been going 
on for hours when I got to it, and it seems like it would have continued 
this way forever.
   I've looked in the smartctl output for the drive (I do a short offline 
test every day and a long offline test every Sunday) but there is nothing 
there. I ran the Hitachi Drive Fitness Test on the drive and no problems 
were reported.
   This is one of two drives on a JMB363 controller, which is in IDE mode, 
if that makes a difference. I've seen posts referring to problems with 
that controller, but I think they might've been dealing with AHCI mode only?
   Is this a known problem? I've seen mention of known problems with ata, 
but it's hard to get a clear picture of what is currently outstanding from 
searching the last few months of -current.
   Also, why do I lose my zpool and have to reset? This one drive failing 
should not cause a problem for the zpool, as it has redundancy. So why am 
I effectively losing the whole pool due to this error?
   I'll be glad to provide any more info.
   Many thanks in advance.

-- 
Mark Powell - UNIX System Administrator - The University of Salford
Information Services Division, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 4837  Fax: +44 161 295 5888  www.pgp.com for PGP key