From: Attila Nagy <bra@fsn.hu>
Date: Mon, 10 Jan 2011 18:23:44 +0100
To: Pawel Jakub Dawidek
Cc: freebsd-fs@FreeBSD.org, freebsd-stable@FreeBSD.org, Martin Matuska
Subject: Re: New ZFSv28 patchset for 8-STABLE

On 01/10/2011 10:02 AM, Pawel Jakub Dawidek wrote:
> On Sun, Jan 09, 2011 at 12:49:27PM +0100, Attila Nagy wrote:
>> No, it's not related. One of the disks in the RAIDZ2 pool went bad:
>> (da4:arcmsr0:0:4:0): READ(6). CDB: 8 0 2 10 10 0
>> (da4:arcmsr0:0:4:0): CAM status: SCSI Status Error
>> (da4:arcmsr0:0:4:0): SCSI status: Check Condition
>> (da4:arcmsr0:0:4:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
>> and it seems it froze the whole zpool. Removing the disk by hand solved
>> the problem.
>> I've seen this previously on other machines with ciss.
>> I wonder why ZFS didn't throw it out of the pool.
> Such hangs happen when I/O never returns. ZFS doesn't time out I/O
> requests on its own; that is the driver's responsibility. It is still
> strange that the driver didn't pass the I/O error up to ZFS. It might
> also be a ZFS bug, but I don't think so.

Indeed, it may be a controller/driver bug. The release notes for the newly
released (last December) firmware mention a similar problem. I've upgraded;
we'll see whether it helps the next time a drive goes awry.

I've only seen these errors in dmesg, not in zpool status, where everything
was clean (all zeroes).

BTW, I've swapped those bad drives (da4, which reported the above errors,
and da16, which didn't report anything to the OS; it was just plain bad
according to the controller firmware, and after its removal I could offline
da4, so it seems to be the real cause, see my previous e-mail). I then ran
zpool replace for da4 first, but after a few seconds of activity all I/O on
all disks ceased. After waiting some minutes it was still the same, so I
rebooted. Then I noticed that a scrub was running, so I stopped it. After
that, the zpool replace for da4 went fine and it started to resilver the
disk (roughly the sequence sketched below).
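For reference, the recovery sequence was roughly the following; this is only
a minimal sketch, and "tank" stands in for the actual pool name on this box:

    # stop the scrub that resumed after the reboot
    zpool scrub -s tank
    # start the replacement of the first bad disk; resilvering begins
    zpool replace tank da4
    # watch the resilver progress
    zpool status -v tank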
But another zpool replace (for da16) causes the same hang: a few seconds of
I/O, then nothing, and it stays stuck there. Has anybody tried replacing two
drives simultaneously with the ZFS v28 patch? (This pool is a stripe of two
raidz2 vdevs, and da4 and da16 are in different raidz2s.)
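Next time it wedges I plan to capture some state before rebooting; a minimal
sketch with stock FreeBSD tools, again assuming the pool is called "tank":

    # does zpool still think the second replace is in progress?
    zpool status -v tank
    # per-provider I/O statistics: check whether the disks are really idle
    gstat
    # kernel stacks of the stuck zpool process, to see where it sleeps
    procstat -kk $(pgrep zpool)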