Date:      Fri, 10 Sep 2021 12:10:05 -0400
From:      Michael Proto <mike@jellydonut.org>
To:        FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>, freebsd-hardware@freebsd.org
Subject:   Re: Constant state-changes in a ZFS array
Message-ID:  <CAGAnWo1KXeAivyxk0UjuFdwR0Bd8-8prJ9QOH49GNvZ4jnYfSQ@mail.gmail.com>
In-Reply-To: <CAGAnWo0J0gHL3sPAM05B8XWTXqYANoBsg6RLd1Ffu6=q4CLzCw@mail.gmail.com>
References:  <CAGAnWo0J0gHL3sPAM05B8XWTXqYANoBsg6RLd1Ffu6=q4CLzCw@mail.gmail.com>

Just realized I neglected version info. This is on FreeBSD 11.4-RELEASE-p3

On Fri, Sep 10, 2021 at 11:45 AM Michael Proto <mike@jellydonut.org> wrote:
>
> Hey all,
>
> I had a server go dark earlier this week and after several hardware
> swaps I'm left scratching my head. The server is an HP DL380p Gen8 with
> a D3600 shelf attached, using two HP EF0600FARNA 600GB disks in a ZFS
> mirror (da0 and da1) and another 22 8TB Ultrastar disks in a ZFS
> RAID10 for data (da3 through da23, though da23 has been removed in
> this situation). They're all attached to an LSI SAS2308 operating in
> HBA mode.
>
> The large array threw a disk shortly before the outage, which we
> would normally handle online like we've done dozens of times before.
> In this case there's a bigger problem I'm struggling with: in addition
> to the thrown disk, I'm now unable to bring the larger ZFS array
> online. Commands issued to check array status or bring it online
> during boot are stalling. The 2-disk zroot mirror is recognized on
> boot and loads, so I can get into the OS as normal, with the larger
> tank array failing to come online.
>
> Looking at syslog, I'm seeing a regular stream of media and
> state-change events from devd, covering ZFS, GEOM, and DEVFS. Sample
> below:
>
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS
> subsystem=ZFS type=resource.fs.zfs.statechange  version=0
> class=resource.fs.zfs.statechange pool_guid=9328454021323814501
> vdev_guid=8915574321583737794'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS
> subsystem=ZFS type=resource.fs.zfs.statechange  version=0
> class=resource.fs.zfs.statechange pool_guid=9328454021323814501
> vdev_guid=8915574321583737794'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS
> subsystem=ZFS type=resource.fs.zfs.statechange  version=0
> class=resource.fs.zfs.statechange pool_guid=9328454021323814501
> vdev_guid=7024987654522270730'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da6p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da6p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da9'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da9'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS
> subsystem=ZFS type=resource.fs.zfs.statechange  version=0
> class=resource.fs.zfs.statechange pool_guid=9328454021323814501
> vdev_guid=4207599288564790488'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS
> subsystem=CDEV type=MEDIACHANGE cdev=da9p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM
> subsystem=DEV type=MEDIACHANGE cdev=da9p1'
>
>
>
> The disk devices appearing in these messages are all disks in the
> RAID10 array. They appear as a group every 5 seconds. Furthermore, the
> state changes seem to be happening evenly across all affected devices,
> with the exception of da15, which logs exactly half as many events.
> Here's a count from /var/log/messages piped to sort and uniq (count,
> then message):
>
> 32100 cdev=da10'
> 32100 cdev=da10p1'
> 32100 cdev=da11'
> 32100 cdev=da11p1'
> 32100 cdev=da12'
> 32100 cdev=da12p1'
> 32100 cdev=da13'
> 32100 cdev=da13p1'
> 32100 cdev=da14'
> 32100 cdev=da14p1'
> 16050 cdev=da15'
> 16050 cdev=da15p1'
> 32100 cdev=da16'
> 32100 cdev=da16p1'
> 32100 cdev=da17'
> 32100 cdev=da17p1'
> 32100 cdev=da18'
> 32100 cdev=da18p1'
> 32100 cdev=da19'
> 32100 cdev=da19p1'
> 32100 cdev=da2'
> 32100 cdev=da20'
> 32100 cdev=da20p1'
> 32100 cdev=da21'
> 32100 cdev=da21p1'
> 32100 cdev=da22'
> 32100 cdev=da22p1'
> 32100 cdev=da2p1'
> 32100 cdev=da3'
> 32100 cdev=da3p1'
> 32100 cdev=da4'
> 32100 cdev=da4p1'
> 32100 cdev=da5'
> 32100 cdev=da5p1'
> 32100 cdev=da6'
> 32100 cdev=da6p1'
> 32100 cdev=da7'
> 32100 cdev=da7p1'
> 32100 cdev=da8'
> 32100 cdev=da8p1'
> 32100 cdev=da9'
> 32100 cdev=da9p1'
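> For reference, the tally above came from a pipeline along these lines
> (a reconstruction; the exact grep pattern here is my assumption, not
> necessarily the command as I originally typed it):

```shell
# Hypothetical reconstruction of the tally: pull the trailing cdev=...
# field out of each devd MEDIACHANGE line and count occurrences per
# device node. In practice the input was /var/log/messages; a short
# sample stands in for it here.
count_mediachange() {
  grep 'type=MEDIACHANGE' \
    | grep -Eo "cdev=da[0-9]+(p[0-9]+)?'" \
    | sort | uniq -c | sort -b -k2
}

# e.g.:  count_mediachange < /var/log/messages
printf "%s\n" \
  "devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'" \
  "devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'" \
  | count_mediachange
```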
>
>
> I can run diskinfo against all the listed disks with no problem, and
> I see them via camcontrol. I can also issue a reset via camcontrol to
> both the chassis and the D3600 with no issues. sesutil map sees the
> chassis, shelf, and all disk devices.
>
> So far I've swapped the LSI controller, the D3600 shelf (twice), and
> the cabling; same behavior. Previously, when a collection of disks
> went problematic like this, we'd swap the D3600 shelf or occasionally
> just reseat the external cabling and everything would come back to
> normal. Not this time. I'm scheduling a chassis swap for next week but
> figured I'd throw this out here to see if anyone has seen this before.
>
>
>
> Thanks!
> Mike Proto
