Date: Sun, 9 Jan 2011 05:27:04 -0800
From: Jeremy Chadwick
To: Attila Nagy
Cc: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org, Martin Matuska
Subject: Re: New ZFSv28 patchset for 8-STABLE
Message-ID: <20110109132704.GA38278@icarus.home.lan>
In-Reply-To: <4D29AD25.2030501@fsn.hu>
References: <4D0A09AF.3040005@FreeBSD.org> <4D297943.1040507@fsn.hu>
 <4D29A0C7.8050002@fsn.hu> <20110109121800.GA37231@icarus.home.lan>
 <4D29AD25.2030501@fsn.hu>

On Sun, Jan 09, 2011 at 01:42:13PM +0100, Attila Nagy wrote:
> On 01/09/2011 01:18 PM, Jeremy Chadwick wrote:
> >On Sun, Jan 09, 2011 at 12:49:27PM +0100, Attila Nagy wrote:
> >> On 01/09/2011 10:00 AM, Attila Nagy wrote:
> >>>On 12/16/2010 01:44 PM, Martin Matuska wrote:
> >>>>Hi everyone,
> >>>>
> >>>>following the announcement of Pawel Jakub Dawidek (pjd@FreeBSD.org) I am
> >>>>providing a ZFSv28 testing patch for 8-STABLE.
> >>>>
> >>>>Link to the patch:
> >>>>
> >>>>http://people.freebsd.org/~mm/patches/zfs/v28/stable-8-zfsv28-20101215.patch.xz
> >>>>
> >>>I've got an IO hang with dedup enabled (not sure it's related,
> >>>I've started to rewrite all data on pool, which makes a heavy
> >>>load):
> >>>
> >>>The processes are in various states:
> >>>65747 1001      1  54   10 28620K 24360K tx->tx   0   6:58  0.00% cvsup
> >>>80383 1001      1  54   10 40616K 30196K select   1   5:38  0.00% rsync
> >>> 1501 www       1  44    0  7304K  2504K zio->i   0   2:09  0.00% nginx
> >>> 1479 www       1  44    0  7304K  2416K zio->i   1   2:03  0.00% nginx
> >>> 1477 www       1  44    0  7304K  2664K zio->i   0   2:02  0.00% nginx
> >>> 1487 www       1  44    0  7304K  2376K zio->i   0   1:40  0.00% nginx
> >>> 1490 www       1  44    0  7304K  1852K zfs      0   1:30  0.00% nginx
> >>> 1486 www       1  44    0  7304K  2400K zfsvfs   1   1:05  0.00% nginx
> >>>
> >>>And everything which wants to touch the pool is/becomes dead.
> >>>
> >>>Procstat says about one process:
> >>># procstat -k 1497
> >>>  PID    TID COMM             TDNAME           KSTACK
> >>> 1497 100257 nginx            -                mi_switch
> >>>sleepq_wait __lockmgr_args vop_stdlock VOP_LOCK1_APV null_lock
> >>>VOP_LOCK1_APV _vn_lock nullfs_root lookup namei vn_open_cred
> >>>kern_openat syscallenter syscall Xfast_syscall
> >>No, it's not related. One of the disks in the RAIDZ2 pool went bad:
> >>(da4:arcmsr0:0:4:0): READ(6). CDB: 8 0 2 10 10 0
> >>(da4:arcmsr0:0:4:0): CAM status: SCSI Status Error
> >>(da4:arcmsr0:0:4:0): SCSI status: Check Condition
> >>(da4:arcmsr0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered
> >>read error)
> >>and it seems it froze the whole zpool. Removing the disk by hand
> >>solved the problem.
> >>I've seen this previously on other machines with ciss.
> >>I wonder why ZFS didn't throw it out of the pool.
> >Hold on a minute.  An unrecoverable read error does not necessarily mean
> >the drive is bad; it could mean that the individual LBA that was
> >attempted to be read resulted in ASC 0x11 (MEDIUM ERROR) (e.g. a bad
> >block was encountered).  I would check SMART stats on the disk (since
> >these are probably SATA given use of arcmsr(4)) and provide those.
> >*That* will tell you if the disk is bad.  I'll help you decode the
> >attribute values if you provide them.
> You are right, and I gave incorrect information. There are a lot
> more errors for that disk in the logs, and the zpool was frozen.
> I tried to offline the given disk. That helped in the ciss case,
> where the symptom is the same or similar: there is no IO for ages,
> then a small burst and then nothing again for long seconds/minutes,
> and no errors are logged. zpool status reported no errors, and the
> dmesg was clear too.
> There I could find the bad disk by watching gstat output: when the
> very small amount of IO was done, there was one disk with response
> times well above a second, while the others responded quickly.
> There the zpool offline helped. Here it didn't; the command just
> hung, like everything else.
> So what I did then: I got into the areca-cli and searched for
> errors. One disk was marked as failed, and that seemed to be the
> cause. I removed it (and did a camcontrol rescan, though I'm not
> sure whether that was necessary), and suddenly the zpool offline
> finished and everything went back to normal.
> But there are two controllers in the system, and now I see that the
> disk above is on ctrl 1, while the one I removed is on ctrl 2. I was
> misled by their identical positions. So now I have an offlined disk
> (which produces read errors, but I couldn't see them in the zpool
> output) and another which is shown as failed in the RAID controller
> and was removed by hand (which resolved the situation):
>        NAME                 STATE     READ WRITE CKSUM
>        data                 DEGRADED     0     0     0
>          raidz2-0           DEGRADED     0     0     0
>            label/disk20-01  ONLINE       0     0     0
>            label/disk20-02  ONLINE       0     0     0
>            label/disk20-03  ONLINE       0     0     0
>            label/disk20-04  ONLINE       0     0     0
>            label/disk20-05  OFFLINE      0     0     0
>            label/disk20-06  ONLINE       0     0     0
>            label/disk20-07  ONLINE       0     0     0
>            label/disk20-08  ONLINE       0     0     0
>            label/disk20-09  ONLINE       0     0     0
>            label/disk20-10  ONLINE       0     0     0
>            label/disk20-11  ONLINE       0     0     0
>            label/disk20-12  ONLINE       0     0     0
>          raidz2-1           DEGRADED     0     0     0
>            label/disk21-01  ONLINE       0     0     0
>            label/disk21-02  ONLINE       0     0     0
>            label/disk21-03  ONLINE       0     0     0
>            label/disk21-04  ONLINE       0     0     0
>            label/disk21-05  REMOVED      0     0     0
>            label/disk21-06  ONLINE       0     0     0
>            label/disk21-07  ONLINE       0     0     0
>            label/disk21-08  ONLINE       0     0     0
>            label/disk21-09  ONLINE       0     0     0
>            label/disk21-10  ONLINE       0     0     0
>            label/disk21-11  ONLINE       0     0     0
>            label/disk21-12  ONLINE       0     0     0
>        cache
>          ad4s2              ONLINE       0     0     0
>          ad6s2              ONLINE       0     0     0
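
As an aside, for anyone else chasing a hang like this: the slow-disk
hunt Attila describes can be done with stock tools.  This is only a
sketch -- the pool name and label come from the zpool status above,
and the gstat filter is just an example, so adjust for your own setup:

    # Watch per-disk service times; during one of the small bursts of
    # I/O, a healthy vdev member answers in a few ms while the sick one
    # sits at hundreds of ms (or seconds).
    gstat -f 'label/disk2'

    # Take the suspect out of the pool.  If this command hangs too (as
    # it did here), the problem is likely below ZFS -- controller or
    # driver -- and not just a slow disk.
    zpool offline data label/disk20-05
    zpool status data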

With regards to arcmsr(4) and this behaviour: I received a mail from
Rich (which doesn't appear to have hit the freebsd-stable list yet,
probably due to greylisting, so I'm not sure whether you've received a
copy), indicating that this wedging, when the disks are behind
arcmsr(4), is a bug in arcmsr(4) itself.

One more thing:

> I don't think it's cabling, it's the disks. I could repair all these
> errors by replacing the disks with new ones, and these are hot-swap
> drives, so I wouldn't suspect physical contact problems.

You're probably right, but I'll share a story that hopefully you won't
have the "pleasure" of experiencing.  :-)

Newer model servers at my workplace are SATA and backed by hot-swap
enclosures.  On occasion, a box will report disk I/O errors for LBAs
which are scattered all across the platters (meaning they aren't even
within 1,000,000 LBAs of one another).  This causes our NOC to open a
ticket stating the problem is a "bad disk".  That diagnosis is
completely wrong (see below), and what ends up happening is our
datacenter guys replace the disk only to find the situation recurring
within a few days.  Meaning: even more customer impact and wasted
effort/time, because everyone assumes an I/O error means a disk error.

So why was the diagnosis wrong?  In these situations I found that SMART
Attribute 199 (UDMA_CRC_Error_Count) had incremented numerous times
(hundreds) during the issue.  This attribute counts the number of times
ATA command blocks passed between the disk and the controller didn't
have matching CRCs, which is usually caused by a hardware problem.  A
few CRC errors over a system's lifetime are acceptable, but numbers in
the hundreds or thousands are not.
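
If you want to look at that attribute yourself, smartctl from
sysutils/smartmontools prints the attribute table.  A sketch only: the
device names and the slot number below are examples, and the exact -d
syntax for disks behind an Areca controller depends on your
smartmontools version, so check smartctl(8) and areca-cli to see which
controller slot maps to which da device:

    # Disk attached directly to an AHCI/ATA port
    smartctl -A /dev/ada0

    # Disk in controller slot 5 behind an Areca card driven by arcmsr(4)
    smartctl -A -d areca,5 /dev/arcmsr0

In the output, a rising RAW_VALUE for ID 199 (UDMA_CRC_Error_Count)
points at the link -- cable, backplane, or controller -- while IDs 5,
197 and 198 (reallocated, pending and uncorrectable sectors) point at
the media itself.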

The problem turned out to be either a SATA backplane that had gone bad
(there are some basic ASICs on those things) or faulty SATA cabling
between the backplane and the motherboard SATA ports.  More than a few
times we found that the raw copper on the SATA cables had been exposed
in a couple of spots, a result of the vendor running the cable through
the chassis and scraping it against sharp metal.  "Oops".

Disk troubleshooting is somewhat of an art, or at least that's what I'm
starting to feel after doing it for so long.  I like having concise
explanations for why a storage device reports an error; blindly
assuming it's the disk is a bad habit to get into.  There are many
pieces of hardware, and just as much firmware and software, between a
disk and an operating system.  Familiarising yourself with it all takes
a lot of time (months/years), and *every situation is different*.

Can I diagnose every situation that comes across my desk?  Hell no.
There are many where I simply can't determine the cause, and in those
situations I recommend everything get replaced (sans RAM and CPU).
Have I recommended that, only to have the same situation happen again?
Yes, and I'm led to believe chassis temperature, humidity, or
PSU/voltage problems are responsible.

Don't let SSDs fool you either; in the past year alone I've seen
literally 14 or 15 cases of people having their SSDs go bad, including
4 or 5 of our X25-Ms at work.  When they die, it's pretty catastrophic;
I haven't seen an "oh, my SSD is acting wonky" situation yet -- it's
all or nothing.  YMMV.  I do love using X25-Ms for OS disks on FreeBSD,
though.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |