Date:      Sat, 19 Nov 2016 21:15:54 +0100
From:      Marek Salwerowicz <marek.salwerowicz@misal.pl>
To:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   zpool raidz2 stopped working after failure of one drive
Message-ID:  <aa638ae8-4664-c45f-25af-f9e9337498de@misal.pl>

Hi all,

I run the following server:

- Supermicro 6047R-E1R36L
- 96 GB RAM
- 1x INTEL CPU E5-2640 v2 @ 2.00GHz
- FreeBSD 10.3-RELEASE-p11

Drive for OS:
- HW RAID1: 2x KINGSTON SV300S37A120G

zpool:
- 18x WD RED 4TB @ raidz2
- log: mirrored Intel 730 SSD
- cache: single Intel 730 SSD
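
(For clarity, the layout corresponds roughly to a create command like the one below; the pool name and device names are only placeholders, not the real ones:

   zpool create <poolname> raidz2 da0 da1 ... da17 \
       log mirror da18 da19 \
       cache da20
)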


Today, after one drive failed, the whole vdev was removed from the
zpool (basically the zpool was down; zpool / zfs commands were not
responding):

Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): READ(10). CDB: 28 00 29 e7 b5 79 00 00 10 00
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): CAM status: SCSI Status Error
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI status: Check Condition
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
Nov 19 12:19:51 storage2 kernel: (da14:mps0:0:22:0): Info: 0x29e7b579
Nov 19 12:19:51 storage2 kernel: (da14:
Nov 19 12:19:52 storage2 kernel: mps0:0:22:0): Error 5, Unretryable error
Nov 19 12:20:03 storage2 kernel: mps0: mpssas_prepare_remove: Sending reset for target ID 22
Nov 19 12:20:03 storage2 kernel: da14 at mps0 bus 0 scbus0 target 22 lun 0
Nov 19 12:20:04 storage2 kernel: da14: <ATA WDC WD4000FYYZ-0 1K02> s/n WD-WCC131430652 detached
Nov 19 12:20:04 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 547 terminated ioc 804b scsi 0 st
Nov 19 12:20:13 storage2 kernel: ate c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(6). CDB: 08 00 02 10 10 00 length 8192 SMID 292 terminated ioc 804b scsi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00 length 8192 SMID 248 terminated ioc 804b s(da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:13 storage2 kernel: csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: (da14: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00 length 8192 SMID 905 terminated ioc 804b smps0:0:csi 0 state c xfer 0
Nov 19 12:20:13 storage2 kernel: 22:mps0: 0): IOCStatus = 0x4b while resetting device 0x18
Nov 19 12:20:13 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:13 storage2 kernel: mps0: (da14:mps0:0:22:0): READ(6). CDB: 08 00 02 10 10 00
Nov 19 12:20:13 storage2 kernel: Unfreezing devq for target ID 22
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:13 storage2 kernel: (da14:mps0:0:22:0):
Nov 19 12:20:17 storage2 kernel: Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 bc 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 ba 10 00 00 00 10 00 00
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): CAM status: Unconditionally Re-queue Request
Nov 19 12:20:17 storage2 kernel: (da14:mps0:0:22:0): Error 5, Periph was invalidated
Nov 19 12:20:17 storage2 kernel: (da14:
Nov 19 12:20:17 storage2 devd: Executing 'logger -p kern.notice -t ZFS 'vdev is removed, pool_guid=15598571108475493154 vdev_guid=2747493726448938619''
Nov 19 12:20:17 storage2 ZFS: vdev is removed, pool_guid=15598571108475493154 vdev_guid=2747493726448938619
Nov 19 12:20:17 storage2 kernel: mps0:0:22:
Nov 19 12:20:17 storage2 kernel: 0): Periph destroyed


There was no other option than hard-rebooting the server.
The SMART value "Raw_Read_Error_Rate" for the failed drive has increased
from 0 to 1. I am about to replace it - it is still under warranty.
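
(I read the attribute with smartmontools, with something like:

   smartctl -A /dev/da14

though I don't remember the exact options I needed for drives behind the
mps controller.)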

I have now disabled the failing drive in the zpool and it works fine (of
course, in DEGRADED state until I replace the drive).
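
(Roughly, what I did was:

   zpool offline <poolname> da14

and once the new disk is in place I expect to run:

   zpool replace <poolname> da14

with the actual pool name substituted, of course.)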

However, I am concerned that a single drive's failure blocked the zpool
completely.
Is that normal behaviour for zpools?

Also, does ZFS already support automatic hot spares? If I had a hot-spare
drive in my zpool, would the failed drive have been replaced automatically?
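
(If they are supported, I assume the setup would look something like this,
with a spare device added and autoreplace turned on; the device name is
only an example:

   zpool add <poolname> spare da21
   zpool set autoreplace=on <poolname>

but I do not know whether the spare actually kicks in by itself on 10.3.)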


Cheers,

Marek


