From: Attila Nagy <bra@fsn.hu>
Date: Sun, 09 Jan 2011 13:42:13 +0100
To: Jeremy Chadwick
Cc: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org, Martin Matuska
Subject: Re: New ZFSv28 patchset for 8-STABLE

On 01/09/2011 01:18 PM, Jeremy Chadwick wrote:
> On Sun, Jan 09, 2011 at 12:49:27PM +0100, Attila Nagy wrote:
>> On 01/09/2011 10:00 AM, Attila Nagy wrote:
>>> On 12/16/2010 01:44 PM, Martin Matuska wrote:
>>>> Hi everyone,
>>>>
>>>> following the announcement of Pawel Jakub Dawidek (pjd@FreeBSD.org) I am
>>>> providing a ZFSv28 testing patch for 8-STABLE.
>>>>
>>>> Link to the patch:
>>>> http://people.freebsd.org/~mm/patches/zfs/v28/stable-8-zfsv28-20101215.patch.xz
>>>>
>>> I've got an IO hang with dedup enabled (not sure it's related; I've
>>> started to rewrite all data on the pool, which makes a heavy load):
>>>
>>> The processes are in various states:
>>> 65747 1001   1  54  10 28620K 24360K tx->tx  0   6:58  0.00% cvsup
>>> 80383 1001   1  54  10 40616K 30196K select  1   5:38  0.00% rsync
>>>  1501 www    1  44   0  7304K  2504K zio->i  0   2:09  0.00% nginx
>>>  1479 www    1  44   0  7304K  2416K zio->i  1   2:03  0.00% nginx
>>>  1477 www    1  44   0  7304K  2664K zio->i  0   2:02  0.00% nginx
>>>  1487 www    1  44   0  7304K  2376K zio->i  0   1:40  0.00% nginx
>>>  1490 www    1  44   0  7304K  1852K zfs     0   1:30  0.00% nginx
>>>  1486 www    1  44   0  7304K  2400K zfsvfs  1   1:05  0.00% nginx
>>>
>>> And everything which wants to touch the pool is/becomes dead.
>>>
>>> Procstat says about one process:
>>> # procstat -k 1497
>>>   PID    TID COMM             TDNAME           KSTACK
>>>  1497 100257 nginx            -                mi_switch sleepq_wait
>>> __lockmgr_args vop_stdlock VOP_LOCK1_APV null_lock VOP_LOCK1_APV _vn_lock
>>> nullfs_root lookup namei vn_open_cred kern_openat syscallenter syscall
>>> Xfast_syscall
>> No, it's not related. One of the disks in the RAIDZ2 pool went bad:
>> (da4:arcmsr0:0:4:0): READ(6). CDB: 8 0 2 10 10 0
>> (da4:arcmsr0:0:4:0): CAM status: SCSI Status Error
>> (da4:arcmsr0:0:4:0): SCSI status: Check Condition
>> (da4:arcmsr0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
>> and it seems it froze the whole zpool. Removing the disk by hand
>> solved the problem.
>> I've seen this previously on other machines with ciss.
>> I wonder why ZFS didn't throw it out of the pool.
> Hold on a minute. An unrecoverable read error does not necessarily mean
> the drive is bad, it could mean that the individual LBA that was
> attempted to be read resulted in ASC 0x11 (MEDIUM ERROR) (e.g. a bad
> block was encountered). I would check SMART stats on the disk (since
> these are probably SATA given use of arcmsr(4)) and provide those.
> *That* will tell you if the disk is bad. I'll help you decode the
> attribute values if you provide them.
You are right, and I gave incorrect information. There are a lot more errors
for that disk in the logs, and the zpool was frozen.

I tried to offline the disk in question. That had helped in the ciss case,
where the symptom was the same or very similar: no IO for ages, then a tiny
burst and nothing again for long seconds or minutes, with no errors logged.
zpool status reported no errors, and dmesg was clean too. There I could find
the bad disk by watching gstat output: during the brief moments when the small
amount of IO was done, one disk showed response times well above a second,
while the others responded quickly. There, zpool offline helped. Here it did
not; the command just hung, like everything else.

So what I did then: I went into areca-cli and searched for errors. One disk
was marked as failed, and it seemed to be the cause. I removed it (and did a
camcontrol rescan, although I'm not sure whether that was necessary), and
suddenly the zpool offline finished and everything went back to normal.

But there are two controllers in the system, and I see now that the disk with
the read errors above is on controller 1, while the one I removed is on
controller 2; I was misled by their identical slot positions. So now I have an
offlined disk (which produces read errors, although I couldn't see them in the
zpool output) and another one, which the RAID controller shows as failed and
which I removed by hand (and that resolved the situation):

        NAME                 STATE     READ WRITE CKSUM
        data                 DEGRADED     0     0     0
          raidz2-0           DEGRADED     0     0     0
            label/disk20-01  ONLINE       0     0     0
            label/disk20-02  ONLINE       0     0     0
            label/disk20-03  ONLINE       0     0     0
            label/disk20-04  ONLINE       0     0     0
            label/disk20-05  OFFLINE      0     0     0
            label/disk20-06  ONLINE       0     0     0
            label/disk20-07  ONLINE       0     0     0
            label/disk20-08  ONLINE       0     0     0
            label/disk20-09  ONLINE       0     0     0
            label/disk20-10  ONLINE       0     0     0
            label/disk20-11  ONLINE       0     0     0
            label/disk20-12  ONLINE       0     0     0
          raidz2-1           DEGRADED     0     0     0
            label/disk21-01  ONLINE       0     0     0
            label/disk21-02  ONLINE       0     0     0
            label/disk21-03  ONLINE       0     0     0
            label/disk21-04  ONLINE       0     0     0
            label/disk21-05  REMOVED      0     0     0
            label/disk21-06  ONLINE       0     0     0
            label/disk21-07  ONLINE       0     0     0
            label/disk21-08  ONLINE       0     0     0
            label/disk21-09  ONLINE       0     0     0
            label/disk21-10  ONLINE       0     0     0
            label/disk21-11  ONLINE       0     0     0
            label/disk21-12  ONLINE       0     0     0
        cache
          ad4s2              ONLINE       0     0     0
          ad6s2              ONLINE       0     0     0
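For completeness, the rough sequence I ran was something like the following
(reconstructed from memory, so treat the exact arguments as illustrative
rather than a transcript; the pool and label names are the ones shown above):

# gstat                               (look for a member with service times in the seconds)
# zpool offline data label/disk20-05  (this is the command that hung here, although it had worked in the ciss case)
# camcontrol rescan all               (after pulling the disk the Areca had marked as failed; possibly unnecessary)
# zpool status data                   (gives the listing above)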
> My understanding is that a single LBA read failure should not warrant
> ZFS marking the disk UNAVAIL in the pool. It should have incremented
> the READ error counter and that's it. Did you receive a *single* error
> for the disk and then things went catatonic?
No, there are more entries in the logs, but only for disk 20-05, the one I
tried to zpool offline without success. There is nothing about disk 21-05,
the one I removed by hand and which the controller marked as failed. As you
can see, all the counters are zero, and I haven't cleared them.
> If the entire system got wedged (a soft wedge, e.g. kernel is still
> alive but nothing's happening in userland), that could be a different
> problem -- either with ZFS or arcmsr(4). Does ZFS have some sort of
> timeout value internal to itself where it will literally mark a disk
> UNAVAIL in the case that repeated I/O transactions take "too long"?
> What is its error recovery methodology?
No, everything else was fine, because the system itself is on UFS; only the
zpool was dead, and I could still log into the machine.

That's what I'm asking too. I have seen similar errors (see above) on other
machines, and from that I would guess there is no forced timeout, which is
what lets the pool livelock like this. I use ZFS mostly on RAID controllers,
but if I remember correctly I saw exactly the same problem on an X4540 as
well (which has a plain SAS controller); gstat helped there too, by showing
which disk had a very long response time.
> Speaking strictly about Solaris 10 and ZFS: I have seen many, many times
> a system "soft wedge" after repeated I/O errors (read or write) are
> spewed out on the console for a single SATA disk (via AHCI), but only
> when the disk is used as a sole root filesystem disk (no mirror/raidz).
> My impression is that ZFS isn't the problem in this scenario. In most
> cases, post-mortem debugging on my part shows that disks encountered
> some CRC errors (indicating cabling issues, etc.), sometimes as few as
> 2, but "something else" went crazy -- or possibly ZFS couldn't mark the
> disk UNAVAIL (if it has that logic) because it's a single disk
> associated with root. Hardware in this scenario are Hitachi SATA disks
> with an ICH ESB2 controller, software is Solaris 10 (Generic_142901-06)
> with ZFS v15.
>
I don't think it's cabling; it's the disks. I could fix all of these errors
by replacing the affected disk with a new one, and these are hot-swap drives,
so I wouldn't suspect physical contact problems.
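As for the SMART stats: my understanding is that a reasonably recent
smartmontools can query member disks directly through arcmsr(4), with
something along these lines (the slot number here is only an example and has
to match the disk in question, so please check the smartctl man page before
relying on the exact syntax):

# smartctl -a -d areca,5 /dev/arcmsr0

That should print the attribute table for a single disk behind the Areca
controller.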