Date: Wed, 29 Dec 2021 17:35:31 +0200 From: Andriy Gapon <avg@FreeBSD.org> To: current@FreeBSD.org, geom@freebsd.org Subject: gmirror: read failed when one disk (of two) failed Message-ID: <4dfbdcce-43d8-3bfc-6858-939ab18d637d@FreeBSD.org>
next in thread | raw e-mail | index | archive | help
I have a gmirror-ed swap partition: # swapinfo -h Device Size Used Avail Capacity /dev/mirror/swap 8.0G 2.6G 5.4G 33% # gmirror list Geom name: swap State: COMPLETE Components: 2 Balance: prefer Slice: 4096 Flags: NONE GenID: 4 SyncID: 20 ID: 1722474567 Type: AUTOMATIC Providers: 1. Name: mirror/swap Mediasize: 8589934080 (8.0G) Sectorsize: 512 Stripesize: 4096 Stripeoffset: 0 Mode: r1w1e0 Consumers: 1. Name: ada0p2 Mediasize: 8589934592 (8.0G) Sectorsize: 512 Stripesize: 4096 Stripeoffset: 0 Mode: r1w1e1 State: ACTIVE Priority: 0 Flags: (null) GenID: 4 SyncID: 20 ID: 1654937889 2. Name: ada1p2 Mediasize: 8589934592 (8.0G) Sectorsize: 512 Stripesize: 0 Stripeoffset: 655360 Mode: r1w1e1 State: ACTIVE Priority: 0 Flags: (null) GenID: 4 SyncID: 20 ID: 1289215755 Recently I had an accident where one of the disks got lost, with some "struggle". I see two things in the logs that concern me. One is mildly concerning and the other seems to be serious. The disk was known as ada1, it was attached to ahcich5. So, while the failing disk was struggling, I got messages like: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080) swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1175296, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 268223, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1175296, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 268223, size: 4096 ahcich5: Timeout on slot 18 port 0 ahcich5: is 00000000 cs 00040000 ss 00000000 rs 00040000 tfd 80 serr 00000000 cmd 0020f217 (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 (aprobe0:ahcich5:0:0:0): CAM status: Command timeout (aprobe0:ahcich5:0:0:0): Retrying command, 0 more tries remain swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1175296, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 268223, size: 4096 ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080) swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1175296, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 268223, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1175296, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 268223, size: 4096 ahcich5: Timeout on slot 19 port 0 ahcich5: is 00000000 cs 00080000 ss 00000000 rs 00080000 tfd 80 serr 00000000 cmd 0020f317 (aprobe0:ahcich5:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00 (aprobe0:ahcich5:0:0:0): CAM status: Command timeout (aprobe0:ahcich5:0:0:0): Error 5, Retries exhausted swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1175296, size: 4096 swap_pager: indefinite wait buffer: bufobj: 0, blkno: 268223, size: 4096 Ideally, it would be nice if gmirror could instruct a chosen member disk to "fail fast", so that a different disk could be tried ASAP in case of any trouble. Only if all members cannot provide the requested data the mirror should switch to "try hard" mode. The more serious issue happened after the struggle was over: ada1 at ahcich5 bus 0 scbus6 target 0 lun 0 ada1: <SSDPR-CX400-256-G2 H190117D> s/n GY3049963 detached GEOM_MIRROR: Request failed (error=6). ada1p2[READ(offset=1098641408, length=4096)] pass4 at ahcich5 bus 0 scbus6 target 0 lun 0 pass4: <SSDPR-CX400-256-G2 H190117D> s/n GY3049963 detached GEOM_MIRROR: Device swap: provider ada1p2 disconnected. swap_pager: I/O error - pagein failed; blkno 1175296,size 4096, error 6 vm_fault: pager read error, pid 1310 (devd) pid 1310 (devd), jid 0, uid 0: exited on signal 10 (core dumped) It looks that the mirror did not do its job properly? It got ENXIO from one disk and it passed that up to a consumer (swap). I suspect that there wa a race between an I/O request getting ENXIO from a failed disk and an orphan event from the same disk. With only one member left the mirror may not realize that the ENXIO came from the detached disk, so it may think that there is no device left to re-try the request. -- Andriy Gapon
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4dfbdcce-43d8-3bfc-6858-939ab18d637d>