Date: Sat, 19 Nov 2016 21:15:54 +0100
From: Marek Salwerowicz <marek.salwerowicz@misal.pl>
To: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: zpool raidz2 stopped working after failure of one drive
Message-ID: <aa638ae8-4664-c45f-25af-f9e9337498de@misal.pl>
Hi all,

I run the following server:

- Supermicro 6047R-E1R36L
- 96 GB RAM
- 1x Intel CPU E5-2640 v2 @ 2.00GHz
- FreeBSD 10.3-RELEASE-p11

Drive for OS:
- HW RAID1: 2x KINGSTON SV300S37A120G

zpool:
- 18x WD RED 4TB @ raidz2
- log: mirrored Intel 730 SSD
- cache: single Intel 730 SSD

Today, after one drive's failure, the whole vdev was removed from the zpool (basically the zpool was down; zpool / zfs commands were not responding):

Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): READ(10). CDB: 28 00 29 e7 b5 79 00 00 10 00
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): CAM status: SCSI Status Error
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI status: Check Condition
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): Info: 0x29e7b579
Nov 19 12:19:51 storage2 kernel: (da14:
Nov 19 12:19:52 storage2 kernel: mps0:0:22:0): Error 5, Unretryable error
Nov 19 12:20:03 storage2 kernel: mps0: mpssas_prepare_remove: Sending reset for target ID 22
Nov 19 12:20:03 storage2 kernel: da14 at mps0 bus 0 scbus0 target 22 lun 0
Nov 19 12:20:04 storage2 kernel: da14: <ATA WDC WD4000FYYZ-0 1K02> s/n WD-WCC131430652 detached
Nov 19 12:20:04 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 547 terminated ioc 804b scsi 0 st
Nov 19 12:20:13 storage2 kernel: ate c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(6). CDB: 08 00 02 10 10 00 length 8192 SMID 292 terminated ioc 804b scsi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00 length 8192 SMID 248 terminated ioc 804b s(da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:13 storage2 kernel: csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00 length 8192 SMID 905 terminated ioc 804b smps0:0:csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: 22:mps0: 0): IOCStatus = 0x4b while resetting device 0x18
Nov 19 12:20:13 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:13 storage2 kernel: mps0: (da14:mps0:0:22:0): READ(6). CDB: 08 00 02 10 10 00
Nov 19 12:20:13 storage2 kernel: Unfreezing devq for target ID 22
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0):
Nov 19 12:20:17 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:
Nov 19 12:20:17 storage2 devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=15598571108475493154 vdev_guid=2747493726448938619''
Nov 19 12:20:17 storage2 ZFS: vdev is removed, pool_guid=15598571108475493154 vdev_guid=2747493726448938619
Nov 19 12:20:17 storage2 kernel: mps0:0:22:
Nov 19 12:20:17 storage2 kernel: 0): Periph destroyed

There was no other option than hard-rebooting the server.
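I wonder whether the pool's failmode property is relevant to the hang. If I read zpool(8) correctly, failmode=wait (the default) blocks all I/O when ZFS loses access to a device until it returns or the admin intervenes, which would look exactly like the unresponsive zpool / zfs commands I saw. A quick check (the pool name "tank" below is a placeholder for my real pool):

```shell
# Show how the pool reacts when a device becomes unavailable.
# "tank" is a placeholder pool name.
# failmode=wait   -> block all I/O until the device recovers (default)
# failmode=continue -> return EIO on new writes, keep serving reads
# failmode=panic  -> panic the machine
zpool get failmode tank
```

That said, with raidz2 a single failed disk should leave the pool merely DEGRADED, so I am not sure failmode alone explains it; maybe the mps driver retries were the real problem.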
The SMART attribute "Raw_Read_Error_Rate" for the failed drive has increased from 0 to 1. I am about to replace it; it is still under warranty.

I have now taken the failing drive offline in the zpool, and the pool works fine (in DEGRADED state, of course, until I replace the drive).

However, I am concerned that one drive's failure completely blocked the zpool. Is that normal behaviour for zpools?

Also, does ZFS already support automatic hot spares? If I had a hot-spare drive in my zpool, would the failed drive be replaced automatically?

Cheers,
Marek
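P.S. My understanding (please correct me if wrong) is that a spare can be attached to an existing pool like this; "tank" and "da20" are placeholders for my pool and a free disk:

```shell
# Add a dedicated hot spare to the pool
# ("tank" and "da20" are placeholder names).
zpool add tank spare da20

# Let ZFS automatically use a new disk found in the same physical
# slot as a removed one. As far as I can tell, fully automatic
# spare activation on FreeBSD needs zfsd(8), which I believe only
# ships with FreeBSD 11, not 10.3.
zpool set autoreplace=on tank

# Manual replacement of the failed disk once the RMA drive arrives
# (same device name reused here, hence da14 twice):
zpool replace tank da14 da14
```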