Date: Sun, 9 Jan 2011 05:27:04 -0800
From: Jeremy Chadwick
To: Attila Nagy
Cc: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org, Martin Matuska
Subject: Re: New ZFSv28 patchset for 8-STABLE
Message-ID: <20110109132704.GA38278@icarus.home.lan>
In-Reply-To: <4D29AD25.2030501@fsn.hu>
References: <4D0A09AF.3040005@FreeBSD.org> <4D297943.1040507@fsn.hu>
 <4D29A0C7.8050002@fsn.hu> <20110109121800.GA37231@icarus.home.lan>
 <4D29AD25.2030501@fsn.hu>

On Sun, Jan 09, 2011 at 01:42:13PM +0100, Attila Nagy wrote:
> On 01/09/2011 01:18 PM, Jeremy Chadwick wrote:
> >On Sun, Jan 09, 2011 at 12:49:27PM +0100, Attila Nagy wrote:
> >> On 01/09/2011 10:00 AM, Attila Nagy wrote:
> >>>On 12/16/2010 01:44 PM, Martin Matuska wrote:
> >>>>Hi everyone,
> >>>>
> >>>>following the announcement of Pawel Jakub Dawidek (pjd@FreeBSD.org) I am
> >>>>providing a ZFSv28 testing patch for 8-STABLE.
> >>>>
> >>>>Link to the patch:
> >>>>
> >>>>http://people.freebsd.org/~mm/patches/zfs/v28/stable-8-zfsv28-20101215.patch.xz
> >>>>
> >>>I've got an IO hang with dedup enabled (not sure it's related,
> >>>I've started to rewrite all data on pool, which makes a heavy
> >>>load):
> >>>
> >>>The processes are in various states:
> >>>65747 1001      1  54   10 28620K 24360K tx->tx   0   6:58  0.00% cvsup
> >>>80383 1001      1  54   10 40616K 30196K select   1   5:38  0.00% rsync
> >>> 1501 www       1  44    0  7304K  2504K zio->i   0   2:09  0.00% nginx
> >>> 1479 www       1  44    0  7304K  2416K zio->i   1   2:03  0.00% nginx
> >>> 1477 www       1  44    0  7304K  2664K zio->i   0   2:02  0.00% nginx
> >>> 1487 www       1  44    0  7304K  2376K zio->i   0   1:40  0.00% nginx
> >>> 1490 www       1  44    0  7304K  1852K zfs      0   1:30  0.00% nginx
> >>> 1486 www       1  44    0  7304K  2400K zfsvfs   1   1:05  0.00% nginx
> >>>
> >>>And everything which wants to touch the pool is/becomes dead.
> >>>
> >>>Procstat says about one process:
> >>># procstat -k 1497
> >>>  PID    TID COMM             TDNAME           KSTACK
> >>> 1497 100257 nginx            -                mi_switch
> >>>sleepq_wait __lockmgr_args vop_stdlock VOP_LOCK1_APV null_lock
> >>>VOP_LOCK1_APV _vn_lock nullfs_root lookup namei vn_open_cred
> >>>kern_openat syscallenter syscall Xfast_syscall
> >>No, it's not related. One of the disks in the RAIDZ2 pool went bad:
> >>(da4:arcmsr0:0:4:0): READ(6). CDB: 8 0 2 10 10 0
> >>(da4:arcmsr0:0:4:0): CAM status: SCSI Status Error
> >>(da4:arcmsr0:0:4:0): SCSI status: Check Condition
> >>(da4:arcmsr0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered
> >>read error)
> >>and it seems it froze the whole zpool. Removing the disk by hand
> >>solved the problem.
> >>I've seen this previously on other machines with ciss.
> >>I wonder why ZFS didn't throw it out of the pool.
> >Hold on a minute.  An unrecoverable read error does not necessarily mean
> >the drive is bad; it could mean that the individual LBA that was
> >attempted to be read resulted in ASC 0x11 (MEDIUM ERROR) (e.g. a bad
> >block was encountered).  I would check SMART stats on the disk (since
> >these are probably SATA given use of arcmsr(4)) and provide those.
> >*That* will tell you if the disk is bad.  I'll help you decode the
> >attribute values if you provide them.
> You are right, and I gave incorrect information. There are a lot
> more errors for that disk in the logs, and the zpool was frozen.
> I tried to offline the given disk. That helped in the ciss case,
> where the symptom is the same or similar: there is no IO for ages,
> then a small burst and then nothing again for long seconds/minutes,
> and no errors are logged. zpool status reported no errors, and the
> dmesg was clear too.
> There I could find the bad disk by watching gstat output: when the
> very small amount of IO was done, there was one disk with response
> times well above a second, while the others responded quickly.
> There the zpool offline helped. Here it didn't; the command just
> hung, like everything else.
> So what I did then: I got into the areca-cli and searched for
> errors. One disk was marked as failed, and that seemed to be the
> cause. I removed it (and did a camcontrol rescan, though I'm not
> sure whether that was necessary), and suddenly the zpool offline
> finished and everything went back to normal.
> But there are two controllers in the system, and now I see that the
> disk above is on ctrl 1, while the one I removed is on ctrl 2. I was
> misled by their identical positions. So now I have an offlined disk
> (which produces read errors, but I couldn't see them in the zpool
> output) and another which is shown as failed in the RAID controller
> and was removed by hand (which resolved the situation):
>        NAME                 STATE     READ WRITE CKSUM
>        data                 DEGRADED     0     0     0
>          raidz2-0           DEGRADED     0     0     0
>            label/disk20-01  ONLINE       0     0     0
>            label/disk20-02  ONLINE       0     0     0
>            label/disk20-03  ONLINE       0     0     0
>            label/disk20-04  ONLINE       0     0     0
>            label/disk20-05  OFFLINE      0     0     0
>            label/disk20-06  ONLINE       0     0     0
>            label/disk20-07  ONLINE       0     0     0
>            label/disk20-08  ONLINE       0     0     0
>            label/disk20-09  ONLINE       0     0     0
>            label/disk20-10  ONLINE       0     0     0
>            label/disk20-11  ONLINE       0     0     0
>            label/disk20-12  ONLINE       0     0     0
>          raidz2-1           DEGRADED     0     0     0
>            label/disk21-01  ONLINE       0     0     0
>            label/disk21-02  ONLINE       0     0     0
>            label/disk21-03  ONLINE       0     0     0
>            label/disk21-04  ONLINE       0     0     0
>            label/disk21-05  REMOVED      0     0     0
>            label/disk21-06  ONLINE       0     0     0
>            label/disk21-07  ONLINE       0     0     0
>            label/disk21-08  ONLINE       0     0     0
>            label/disk21-09  ONLINE       0     0     0
>            label/disk21-10  ONLINE       0     0     0
>            label/disk21-11  ONLINE       0     0     0
>            label/disk21-12  ONLINE       0     0     0
>        cache
>          ad4s2              ONLINE       0     0     0
>          ad6s2              ONLINE       0     0     0
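
As an aside, for anyone else chasing a hang like this: the slow-disk
hunt Attila describes can be done with stock tools.  This is only a
sketch -- the pool name and label come from the zpool status above,
and the gstat filter is just an example, so adjust for your own setup:

    # Watch per-disk service times; during one of the small bursts of
    # I/O, a healthy vdev member answers in a few ms while the sick one
    # sits at hundreds of ms (or seconds).
    gstat -f 'label/disk2'

    # Take the suspect out of the pool.  If this command hangs too (as
    # it did here), the problem is likely below ZFS -- controller or
    # driver -- and not just a slow disk.
    zpool offline data label/disk20-05
    zpool status data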

With regards to arcmsr(4) and this behaviour: I received a mail from
Rich (which doesn't appear to have hit the freebsd-stable list yet,
probably due to greylisting, so I'm not sure whether you've received a
copy), indicating that this wedging, when the disks are behind
arcmsr(4), is a bug in arcmsr(4) itself.

One more thing:

> I don't think it's cabling, it's the disks. I could repair all these
> errors by replacing the disks with new ones, and these are hot-swap
> drives, so I wouldn't suspect physical contact problems.

You're probably right, but I'll share a story that hopefully you won't
have the "pleasure" of experiencing.  :-)

Newer model servers at my workplace are SATA and backed by hot-swap
enclosures.  On occasion, a box will report disk I/O errors for LBAs
which are scattered all across the platters (meaning they aren't even
within 1,000,000 LBAs of one another).  This causes our NOC to open a
ticket stating the problem is a "bad disk".  That diagnosis is
completely wrong (see below), and what ends up happening is our
datacenter guys replace the disk only to find the situation recurring
within a few days.  Meaning: even more customer impact and wasted
effort/time, because everyone assumes an I/O error means a disk error.

So why was the diagnosis wrong?  In these situations I found that SMART
Attribute 199 (UDMA_CRC_Error_Count) had incremented numerous times
(hundreds) during the issue.  This attribute counts the number of times
ATA command blocks passed between the disk and the controller didn't
have matching CRCs, which is usually caused by a hardware problem.  A
few CRC errors over a system's lifetime are acceptable, but numbers in
the hundreds or thousands are not.
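
If you want to look at that attribute yourself, smartctl from
sysutils/smartmontools prints the attribute table.  A sketch only: the
device names and the slot number below are examples, and the exact -d
syntax for disks behind an Areca controller depends on your
smartmontools version, so check smartctl(8) and areca-cli to see which
controller slot maps to which da device:

    # Disk attached directly to an AHCI/ATA port
    smartctl -A /dev/ada0

    # Disk in controller slot 5 behind an Areca card driven by arcmsr(4)
    smartctl -A -d areca,5 /dev/arcmsr0

In the output, a rising RAW_VALUE for ID 199 (UDMA_CRC_Error_Count)
points at the link -- cable, backplane, or controller -- while IDs 5,
197 and 198 (reallocated, pending and uncorrectable sectors) point at
the media itself.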

The problem turned out to be either a SATA backplane that had gone bad
(there are some basic ASICs on those things) or faulty SATA cabling
between the backplane and the motherboard SATA ports.  More than a few
times we found that the raw copper on the SATA cables had been exposed
in a couple of spots, a result of the vendor running the cable through
the chassis and scraping it against sharp metal.  "Oops".

Disk troubleshooting is somewhat of an art, or at least that's what I'm
starting to feel after doing it for so long.  I like having concise
explanations for why a storage device reports an error; blindly
assuming it's the disk is a bad habit to get into.  There are many
pieces of hardware, and just as much firmware and software, between a
disk and an operating system.  Familiarising yourself with it all takes
a lot of time (months/years), and *every situation is different*.

Can I diagnose every situation that comes across my desk?  Hell no.
There are many where I simply can't determine the cause, and in those
situations I recommend everything get replaced (sans RAM and CPU).
Have I recommended that, only to have the same situation happen again?
Yes, and I'm led to believe chassis temperature, humidity, or
PSU/voltage problems are responsible.

Don't let SSDs fool you either; in the past year alone I've seen
literally 14 or 15 cases of people having their SSDs go bad, including
4 or 5 of our X25-Ms at work.  When they die, it's pretty catastrophic;
I haven't seen an "oh, my SSD is acting wonky" situation yet -- it's
all or nothing.  YMMV.  I do love using X25-Ms for OS disks on FreeBSD,
though.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |