Date: Tue, 26 Apr 2011 06:49:03 -0700
From: Jeremy Chadwick
To: Conall O'Brien
Cc: freebsd-fs@freebsd.org
Subject: Re: Problems Terminating zpool scrub...
Message-ID: <20110426134903.GA62578@icarus.home.lan>

On Tue, Apr 26, 2011 at 02:25:00PM +0100, Conall O'Brien wrote:
> On 26 April 2011 13:15, ambrosehuang ambrose wrote:
> > Could you post your PR number?I was curious about the driver used by
> > West Digital Disk, cause I use
> > the WR10EARS?
>
> http://www.freebsd.org/cgi/query-pr.cgi?pr=156647
>
> I chalked it up to the SATA controller, since only 2 of my 5 identical
> WD20EARS disks were reporting DMA issues.
>
> >
> > 2011/4/25 Conall O'Brien
> >>
> >> On 15 April 2011 15:59, Conall O'Brien wrote:
> >> > Hello,
> >> >
> >> >
> >> > I've got a NAS box running 8-STABLEW [1] which I'm running with 5x
> >> > Western Digital 2TB disks.
> >> >
> >> >
> >> > One of the disks was having DMA issues as reported in dmesg, so I
> >> > began the usual zfs workflow of "zpool offline pool dev", physically
> >> > removing it and tried to "zpool replace pool dev" but my attempts to
> >> > do so fail, actually the zpool command keeps ending up in
> >> > uninterruptable wait (the D state). Before resorting to replacing the
> >> > disk, a zpool scrub was in progress. Now, I can't kill it using "zpool
> >> > scrub -s pool", it too ends up in the D state.
> >> >
> >> >
> >> > Is there another way than "zpool scrub -s pool" to terminate a scrub
> >> > process, so I can proceed with the disk replacement. I care more about
> >> > resilvering my pool before getting around to scrubbing it.
> >> >
> >> >
> >> > Thanks!
> >> >
> >> >
> >> > [1] For completeness, uname -a reports FreeBSD galvatron.taku.ie
> >> > 8.2-STABLE FreeBSD 8.2-STABLE #1: Sat Mar 19 13:18:46 UTC 2011
> >> > root@galvatron.taku.ie:/usr/src/obj/usr/src/sys/GALVATRON  amd64
> >>
> >> I worked out the problem. There's a regression in one of the drivers
> >> between the kernel I was running and my previous kernel:
> >>
> >> FreeBSD galvatron.taku.ie 8.2-PRERELEASE FreeBSD 8.2-PRERELEASE #0:
> >> Wed Dec 29 04:00:27 UTC 2010
> >> root@galvatron.taku.ie:/usr/src/obj/usr/src/sys/GALVATRON  amd64
> >>
> >>
> >> I'll file a PR to get it fixed.

The PR is extremely terse/sub-par quality. There isn't actual evidence
of the problem being a driver regression.

What needs to be provided in the PR:

- Relevant dmesg output (pertaining to ataX and adX devices and anything
  else seen around that time; stuff from /var/log/messages might be more
  useful since it contains timestamps)
- Full dmesg seen during a fresh reboot
- vmstat -i
- atacontrol cap ataX (for each ataX channel; you can XXX out the serial
  number if desired)
- smartctl -a /dev/adX (for each disk; be sure to label which disk is
  associated with what data. You can XXX out the serial number if
  desired)

What really needs to be shown are the actual errors themselves, in
sequential order and with timestamps. "DMA errors" is too vague; I want
to assume READ_DMA48 but I cannot assume that.

Next:

I'm not sure if your system supports it, but can you run the controller
in AHCI mode (BIOS setting) and load ahci.ko instead (ahci_load="yes" in
/boot/loader.conf; your disks will change to /dev/adaX)? If so, this
would allow you to narrow down whether or not the issue is truly a
driver problem. You should try this *before* attempting the below.

Next:

Try updating your source to something newer than March 19th. There have
been ata(4) changes since then that might pertain to your issue.

If the same issue happens on a present-day build of RELENG_8, then we can
start by trying to narrow it down to commits made between, roughly, late
December 2010 and mid-March 2011. Since you follow RELENG_8, you will
need to follow commits. src/sys/dev/ata is what's relevant here, as well
as the chipsets/ directory under that.

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/ata/chipsets/

Let's get this figured out before other users start correlating their
problems with whatever this is.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |
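
For what it's worth, the items requested for the PR above could be
gathered in one pass with something like the following. This is only a
rough sketch: the channel names (ata2, ata3), the disk names (ad4, ad6)
and the output file name are placeholders made up for illustration, not
taken from the reporter's system, and the "atacontrol cap" argument is
written the way the list above gives it, so check atacontrol(8) before
relying on it. Log excerpts with timestamps still need to be pulled
from the system log by hand.

```shell
#!/bin/sh
# Sketch only: collect the diagnostics listed in the mail into one file.
# CHANNELS, DISKS and OUT are hypothetical defaults; override them to
# match the actual system (e.g. CHANNELS="ata0" DISKS="ad0" sh script run).

CHANNELS="${CHANNELS:-ata2 ata3}"
DISKS="${DISKS:-ad4 ad6}"
OUT="${OUT:-pr-diagnostics.txt}"

# Print the commands to run, one per line, in the order the mail lists them.
list_commands() {
    echo "dmesg"
    echo "vmstat -i"
    for ch in $CHANNELS; do
        # Argument form follows the mail as written; see atacontrol(8).
        echo "atacontrol cap $ch"
    done
    for d in $DISKS; do
        echo "smartctl -a /dev/$d"
    done
}

# Run each command and label its output, so it is clear which channel
# or disk each chunk of data belongs to.
collect() {
    list_commands | while read -r cmd; do
        echo "===== $cmd ====="
        $cmd 2>&1 || echo "(command failed or is not available)"
        echo
    done > "$OUT"
}

# Only collect when asked to, so the file can be sourced without side effects.
if [ "${1:-}" = "run" ]; then
    collect
    echo "wrote $OUT"
fi
```

Invoked as "sh collect.sh run" (the script name is arbitrary), it writes
one labelled section per command. Serial numbers can be XXX'd out of the
resulting file before attaching it to the PR.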