From: Marek Salwerowicz <marek.salwerowicz@misal.pl>
Subject: zpool raidz2 stopped working after failure of one drive
To: freebsd-fs@freebsd.org
Date: Sat, 19 Nov 2016 21:15:54 +0100

Hi all,

I run the following server:

- Supermicro 6047R-E1R36L
- 96 GB RAM
- 1x Intel CPU E5-2640 v2 @ 2.00GHz
- FreeBSD 10.3-RELEASE-p11

Drive for OS:

- HW RAID1: 2x KINGSTON SV300S37A120G

zpool:

- 18x WD RED 4TB @ raidz2
- log: mirrored Intel 730 SSD
- cache: single Intel 730 SSD
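For reference, the layout corresponds to a pool created roughly like this (a sketch only; the pool name "tank" and the daXX device names are illustrative, not the real ones):

    # single raidz2 vdev of 18 disks
    zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7 da8 \
        da9 da10 da11 da12 da13 da14 da15 da16 da17
    # mirrored SLOG on the two Intel 730 SSDs
    zpool add tank log mirror da18 da19
    # single L2ARC device
    zpool add tank cache da20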
Today, after one drive's failure, the whole vdev was removed from the zpool (basically the zpool was down; zpool / zfs commands were not responding):

Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): READ(10). CDB: 28 00 29 e7 b5 79 00 00 10 00
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): CAM status: SCSI Status Error
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI status: Check Condition
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): Info: 0x29e7b579
Nov 19 12:19:51 storage2 kernel: (da14:
Nov 19 12:19:52 storage2 kernel: mps0:0:22:0): Error 5, Unretryable error
Nov 19 12:20:03 storage2 kernel: mps0: mpssas_prepare_remove: Sending reset for target ID 22
Nov 19 12:20:03 storage2 kernel: da14 at mps0 bus 0 scbus0 target 22 lun 0
Nov 19 12:20:04 storage2 kernel: da14: s/n WD-WCC131430652 detached
Nov 19 12:20:04 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 547 terminated ioc 804b scsi 0 st
Nov 19 12:20:13 storage2 kernel: ate c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(6). CDB: 08 00 02 10 10 00 length 8192 SMID 292 terminated ioc 804b scsi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00 length 8192 SMID 248 terminated ioc 804b s(da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:13 storage2 kernel: csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00 length 8192 SMID 905 terminated ioc 804b smps0:0:csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: 22:mps0: 0): IOCStatus = 0x4b while resetting device 0x18
Nov 19 12:20:13 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:13 storage2 kernel: mps0: (da14:mps0:0:22:0): READ(6). CDB: 08 00 02 10 10 00
Nov 19 12:20:13 storage2 kernel: Unfreezing devq for target ID 22
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0):
Nov 19 12:20:17 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:
Nov 19 12:20:17 storage2 devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=15598571108475493154 vdev_guid=2747493726448938619''
Nov 19 12:20:17 storage2 ZFS: vdev is removed, pool_guid=15598571108475493154 vdev_guid=2747493726448938619
Nov 19 12:20:17 storage2 kernel: mps0:0:22:
Nov 19 12:20:17 storage2 kernel: 0): Periph destroyed

There was no option other than hard-rebooting the server. The SMART value "Raw_Read_Error_Rate" for the failed drive has increased from 0 to 1. I am about to replace the drive - it is still under warranty.
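For the replacement I plan to follow the usual procedure, roughly like this (the pool name is illustrative; the device name is taken from the log above):

    # take the failed disk out of service before pulling it
    zpool offline tank da14
    # after swapping in the new disk at the same location,
    # resilver onto it (with no new_device argument, zpool
    # replace reuses the old device's path)
    zpool replace tank da14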
I have now disabled the failing drive in the zpool and it works fine (in DEGRADED state, of course, until I replace the drive). However, I am concerned that a single drive's failure completely blocked the zpool. Is this normal behaviour for zpools?

Also, is automatic hot-spare handling available in ZFS yet? If I had a hot-spare drive in my zpool, would the failed drive have been replaced with it automatically? A sketch of what I have in mind follows below.
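My (possibly wrong) understanding is that a spare is attached like this, and that the autoreplace pool property plus a fault-management daemon such as zfsd - which I believe only appeared after 10.3, in FreeBSD 11 - are needed for it to kick in without manual intervention (pool and device names illustrative):

    # add a dedicated hot spare to the pool
    zpool add tank spare da21
    # allow automatic replacement of removed/faulted devices
    zpool set autoreplace=on tank

Cheers,
Marek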