Date:      Fri, 28 Feb 2025 13:14:40 +0300 (MSK)
From:      Dmitry Morozovsky <woozle@woozle.net>
To:        Ronald Klop <ronald-lists@klop.ws>
Cc:        freebsd-fs@FreeBSD.org, mm@FreeBSD.org
Subject:   Re: zfs: non-redundant zpool suspended till hard boot on any transient error
Message-ID:  <alpine.BSF.2.00.2502281257030.11003@woozle.rinet.ru>
In-Reply-To: <431556233.2998.1740664261847@localhost>
References:  <431556233.2998.1740664261847@localhost>

On Thu, 27 Feb 2025, Ronald Klop wrote:

> See man zpoolprops and look for failmode.
> Default is "wait".
> 
> Does that help?

unfortunately no, actually, it doesn't seem to have any effect:

(the bext pool is on an external USB SATA disk, which I power-cycled; the same
effect occurs on USB cable remove/insert, i.e., in any device reset situation):

- in the default 'wait' mode, neither `zpool online' nor `zpool clear' completes
after pool suspension:

        NAME        STATE     READ WRITE CKSUM
        bext        UNAVAIL      0     0     0  insufficient replicas
          gpt/bext  REMOVED      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
# zpool online bext /dev/gpt/bext
cannot online /dev/gpt/bext: pool I/O is currently suspended
# zpool clear bext
cannot clear errors for bext: I/O error

- setting mode to continue:

# zpool set failmode=continue bext
# zpool get failmode bext
NAME  PROPERTY  VALUE     SOURCE
bext  failmode  continue  local
[power cycle the drive, kernlog excerpt:

Feb 28 13:05:12 <kern.crit> bat kernel: ugen1.2: <JMicron USB to ATA/ATAPI Bridge> at usbus1 (disconnected)
...
Feb 28 13:05:14 <kern.crit> bat kernel: (da0:umass-sim0:0:0:0): Periph destroyed
...
Feb 28 13:05:18 <kern.crit> bat kernel: ugen1.2: <JMicron USB to ATA/ATAPI Bridge> at usbus1
...
Feb 28 13:05:19 <kern.crit> bat kernel: Solaris: WARNING:
Feb 28 13:05:19 <kern.crit> bat kernel: Pool 'bext' has encountered an uncorrectable I/O failure and has been suspended.
...
Feb 28 13:05:21 <kern.crit> bat kernel: da0: <JMicron Generic 0508> Fixed Direct Access SPC-4 SCSI device
...
Feb 28 13:05:21 <kern.crit> bat kernel: da0: 3815447MB (7814037168 512 byte sectors)
 ]
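
The ordering in the excerpt above is easier to see once the log is reduced to a timeline. What follows is only a rough sketch: the function name suspend_timeline is hypothetical, and the patterns assume the exact syslog layout shown here (timestamp in the third whitespace-separated field), which may differ on other systems.

```shell
#!/bin/sh
# Sketch only: reduce kernel log lines on stdin to "HH:MM:SS event" pairs.
# Assumes the layout "Mon DD HH:MM:SS <facility> host kernel: ..." seen in
# the excerpt, so the timestamp is field 3. The event labels are made up
# for illustration.
suspend_timeline() {
    awk '
        /\(disconnected\)/            { print $3, "usb-disconnect";   next }
        /Periph destroyed/            { print $3, "periph-destroyed"; next }
        /has been suspended/          { print $3, "pool-suspended";   next }
        /ugen[0-9.]+: <.*> at usbus/  { print $3, "usb-attach";       next }
        /Fixed Direct Access/         { print $3, "disk-probed";      next }
    '
}
```

Fed with the lines above, this would list the pool suspension (13:05:19) after the USB re-attach (13:05:18) but before da0 is probed again (13:05:21). It can be run against a saved log, e.g. `suspend_timeline < kernlog.txt`, where the file name is just a placeholder.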

# zpool status -v bext
  pool: bext
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: scrub repaired 0B in 01:31:09 with 0 errors on Wed Feb 12 20:21:49 2025
config:

        NAME        STATE     READ WRITE CKSUM
        bext        UNAVAIL      0     0     0  insufficient replicas
          gpt/bext  REMOVED      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
# zpool online bext dev/gpt/bext
# zpool status -v bext
  pool: bext
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: scrub repaired 0B in 01:31:09 with 0 errors on Wed Feb 12 20:21:49 2025
config:

        NAME        STATE     READ WRITE CKSUM
        bext        UNAVAIL      0     0     0  insufficient replicas
          gpt/bext  REMOVED      0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
# zpool clear bext
cannot clear errors for bext: I/O error
# zpool export bext
load: 0.03  cmd: zpool 2507 [tx->tx_sync_done_cv] 1.95r 0.00u 0.00s 0% 8808k
load: 0.04  cmd: zpool 2507 [tx->tx_sync_done_cv] 86.99r 0.00u 0.00s 0% 8808k

(another hard lock till power cycle)

- failmode=panic does not seem to be a viable option to me
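
Since `zpool export' on the suspended pool is what led to the second hard lock above, a script touching such a pool might check the reported state first. This is a minimal sketch, not a fix for the underlying problem: the helper names pool_state, pool_is_suspended and guarded_export are hypothetical, and it assumes `zpool status' itself still returns on a suspended pool, as it did in the transcript above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ZFS.

# Print the value of the "state:" line from `zpool status` output on stdin.
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Succeed (exit 0) if the status text on stdin reports a suspended pool.
pool_is_suspended() {
    [ "$(pool_state)" = "SUSPENDED" ]
}

# Refuse to export a pool that reports SUSPENDED, instead of blocking
# in the kernel as in the transcript above.
guarded_export() {
    pool=$1
    if zpool status "$pool" | pool_is_suspended; then
        echo "pool $pool is suspended; not running zpool export" >&2
        return 1
    fi
    zpool export "$pool"
}
```

With this, `guarded_export bext' would print a message and return 1 in the situation shown above, rather than hanging in tx->tx_sync_done_cv. It does not make the pool recoverable; it only avoids adding another stuck process.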

> 
> Regards,
> Ronald.
> 
> From: Dmitry Morozovsky <woozle@woozle.net>
> Date: 27 February 2025 09:33
> To: freebsd-fs@freebsd.org
> CC: mm@freebsd.org
> Subject: zfs: non-redundant zpool suspended till hard boot on any transient
> error
> 
> > 
> > 
> > Colleagues,
> > 
> > regarding situations like
> > https://forums.freebsd.org/threads/external-drive-zfs-power-loss-insufficient-replicas-pool-suspended-cannot-online-dev-da0-pool-i-o-is-currently-suspended.94141/
> > (non redundant zpool on external drive)
> > 
> > as noted, even a subsecond disconnect on an otherwise idle pool leads to
> > instant and irreversible pool suspension.  a hard reset (as the kernel hangs
> > forever on I/O requests to the suspended pool) is the only way to recover
> > 
> > also, found this patch, but can't evaluate it myself:
> > https://github.com/openzfs/zfs/pull/11082
> > 
> > any thoughts?  thanks in advance!
> > 
> > -- 
> > Sincerely,
> > D.Marck                                                          [MCK-RIPE]
> > [ FreeBSD committer:                                    marck@FreeBSD.org ]
> > ---------------------------------------------------------------------------
> > *** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- woozle@woozle.net ***
> > ---------------------------------------------------------------------------
> > 
> > 
> > 
> > 
> > 

-- 
Sincerely,
D.Marck                                                          [MCK-RIPE]
[ FreeBSD committer:                                    marck@FreeBSD.org ]
---------------------------------------------------------------------------
*** Dmitry Morozovsky --- D.Marck --- Wild Woozle --- woozle@woozle.net ***
---------------------------------------------------------------------------


