Date: Fri, 30 Oct 2015 13:36:14 +0300
From: Dmitry Marakasov
To: freebsd-fs@freebsd.org
Subject: ZFS resilver from disk with bad sectors constantly restarts
Message-ID: <20151030103614.GL57666@hades.panopticon>

Hi!

I've just hit a case where resilvering a new replacement disk in
raidz2 never finished.

The problem: one disk in the raidz is failing with a large number of
unreadable sectors. It was replaced with a spare. The resilver, though,
is constantly restarted, and the log is full of read errors from the
bad disk. It looks like this:

---
  pool: spool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Oct 28 05:26:28 2015
        369G scanned out of 9,87T at 123M/s, 22h29m to go
        41,4G resilvered, 3,65% done
config:

        NAME                        STATE     READ WRITE CKSUM
        spool                       ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            ada0                    ONLINE       0     0     0
            ada1                    ONLINE       0     0     0
            spare-2                 ONLINE       0     0   733
              ada11                 ONLINE       0     0     0
              ada2                  ONLINE       0     0     0  (resilvering)
          raidz1-1                  ONLINE       0     0     0
            ada3                    ONLINE       0     0     0
            ada4                    ONLINE       0     0     0
            ada5                    ONLINE       0     0     0
          raidz1-2                  ONLINE       0     0     0
            ada6                    ONLINE       0     0     0
            ada7                    ONLINE       0     0     0
            ada10                   ONLINE       0     0     0
        spares
          588540573008830286        INUSE     was /dev/ada2

errors: No known data errors
---

The `resilver in progress since' date is constantly reset, so resilver
progress cannot get beyond 5% or so. My guess is that this happens on
read errors on ada11. I think I've seen (resilvering) on the ada11 line
a couple of times.

In the end I had to offline ada11, and after that the resilver
completed in under 16 hours.

However, the situation doesn't seem normal: I'd prefer not to lose
redundancy by offlining the dying disk, and still be able to use it for
resilvering (imagine there were bad sectors on ada0/1 as well, but not
intersecting with the bad sectors on ada11), or at least to get some
more verbose indication of why the resilver is constantly restarted.

I should also note that this is an outdated FreeBSD 9.1, so maybe the
problem has been fixed already.

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
amdmi3@amdmi3.ru  ..:  jabber: amdmi3@jabber.ru    http://amdmi3.ru
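
P.S. For anyone who wants to confirm the same symptom on their own
pool, here is a minimal sketch of how the restarts could be detected
by watching the `resilver in progress since' line. It is not a fix and
not part of ZFS; the helper names (resilver_start, count_restarts,
watch) are made up for illustration, and it assumes the usual
multi-line `zpool status' output and Python 3.7 or later.

```python
import re
import subprocess
import time

# Matches the scan line printed while a resilver is running, e.g.
#   scan: resilver in progress since Wed Oct 28 05:26:28 2015
SINCE_RE = re.compile(r"resilver in progress since (.+)")


def resilver_start(status_text):
    """Return the 'since' timestamp string, or None if no resilver is running."""
    m = SINCE_RE.search(status_text)
    return m.group(1).strip() if m else None


def count_restarts(samples):
    """Count how many times the 'since' timestamp changed across samples."""
    restarts = 0
    previous = None
    for text in samples:
        current = resilver_start(text)
        if current is not None and previous is not None and current != previous:
            restarts += 1
        if current is not None:
            previous = current
    return restarts


def watch(pool, interval=60):
    """Poll `zpool status <pool>` and report each change of the start time."""
    previous = None
    while True:
        out = subprocess.run(["zpool", "status", pool],
                             capture_output=True, text=True).stdout
        current = resilver_start(out)
        if current and previous and current != previous:
            print(f"resilver restarted: {previous} -> {current}")
        previous = current or previous
        time.sleep(interval)
```

Running watch("spool") in a terminal would print one line per
restart. It only shows when a restart happened, not why, so it would
still have to be correlated by hand with the read errors in the system
log.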