Date: Fri, 10 Sep 2021 12:10:05 -0400
From: Michael Proto <mike@jellydonut.org>
To: FreeBSD-STABLE Mailing List <freebsd-stable@freebsd.org>, freebsd-hardware@freebsd.org
Subject: Re: Constant state-changes in a ZFS array
Message-ID: <CAGAnWo1KXeAivyxk0UjuFdwR0Bd8-8prJ9QOH49GNvZ4jnYfSQ@mail.gmail.com>
In-Reply-To: <CAGAnWo0J0gHL3sPAM05B8XWTXqYANoBsg6RLd1Ffu6=q4CLzCw@mail.gmail.com>
References: <CAGAnWo0J0gHL3sPAM05B8XWTXqYANoBsg6RLd1Ffu6=q4CLzCw@mail.gmail.com>
Just realized I neglected version info. This is on FreeBSD 11.4-RELEASE-p3.

On Fri, Sep 10, 2021 at 11:45 AM Michael Proto <mike@jellydonut.org> wrote:
>
> Hey all,
>
> I had a server go dark earlier this week, and after several hardware
> swaps I'm left scratching my head. The server is an HP DL380p Gen8 with
> a D3600 shelf attached, using 2 EF0600FARNA HP 600G disks in a ZFS
> mirror (da0 and da1) and another 22 8TB Ultrastar disks in a ZFS
> RAID10 for data (da3 through da23, though da23 has been removed in
> this situation). They're all attached to an LSI SAS2308 operating in
> HBA mode.
>
> The large array threw a disk shortly before the outage, which we would
> normally handle online as we've done dozens of times before. In this
> case there's a bigger problem I'm struggling with. In addition to the
> thrown disk, I'm now unable to bring the larger ZFS array online.
> Commands issued to check array status or bring it online during boot
> are stalling. The 2-disk zroot mirror is recognized on boot and loads,
> so I can get into the OS as normal while the larger tank array fails
> to come online.
>
> Looking at syslog I'm seeing a regular stream of messages coming from
> devd regarding media and state-change events from ZFS, GEOM, and
> DEVFS.
> Sample below:
>
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=8915574321583737794'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=8915574321583737794'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da6'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=7024987654522270730'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da6p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da9'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da9'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=ZFS subsystem=ZFS type=resource.fs.zfs.statechange version=0 class=resource.fs.zfs.statechange pool_guid=9328454021323814501 vdev_guid=4207599288564790488'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da9p1'
> Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da9p1'
>
> The disk devices appearing in these messages are all disks in the
> RAID10 array. They appear as a group, every 5 seconds. Furthermore,
> the state changes seem to be happening evenly across all affected
> devices, with the exception of da15, which shows precisely half the
> message volume.
> Here's a count from /var/log/messages piped to sort and uniq (count,
> then message):
>
> 32100 cdev=da10'
> 32100 cdev=da10p1'
> 32100 cdev=da11'
> 32100 cdev=da11p1'
> 32100 cdev=da12'
> 32100 cdev=da12p1'
> 32100 cdev=da13'
> 32100 cdev=da13p1'
> 32100 cdev=da14'
> 32100 cdev=da14p1'
> 16050 cdev=da15'
> 16050 cdev=da15p1'
> 32100 cdev=da16'
> 32100 cdev=da16p1'
> 32100 cdev=da17'
> 32100 cdev=da17p1'
> 32100 cdev=da18'
> 32100 cdev=da18p1'
> 32100 cdev=da19'
> 32100 cdev=da19p1'
> 32100 cdev=da2'
> 32100 cdev=da20'
> 32100 cdev=da20p1'
> 32100 cdev=da21'
> 32100 cdev=da21p1'
> 32100 cdev=da22'
> 32100 cdev=da22p1'
> 32100 cdev=da2p1'
> 32100 cdev=da3'
> 32100 cdev=da3p1'
> 32100 cdev=da4'
> 32100 cdev=da4p1'
> 32100 cdev=da5'
> 32100 cdev=da5p1'
> 32100 cdev=da6'
> 32100 cdev=da6p1'
> 32100 cdev=da7'
> 32100 cdev=da7p1'
> 32100 cdev=da8'
> 32100 cdev=da8p1'
> 32100 cdev=da9'
> 32100 cdev=da9p1'
>
> I can run diskinfo against all the listed disks with no problem, and I
> see them via camcontrol. I can also issue a reset via camcontrol to
> both the chassis and the D3600 with no issues. sesutil map sees the
> chassis, shelf, and all disk devices.
>
> So far I've swapped the LSI controller, the D3600 shelf (twice), and
> the cabling; same behavior. Previously, when a collection of disks
> went problematic like this, we swapped the D3600 shelf or occasionally
> just reseated the external cabling and everything came back to normal.
> Not this time. I'm scheduling a chassis swap for next week, but I
> figured I'd throw this out here to see if anyone has seen this before.
>
> Thanks!
> Mike Proto
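[For reference, the sort/uniq pipeline described above can be reproduced roughly as follows. This is a sketch, not the exact command from the message; a small embedded sample stands in for /var/log/messages.]

```shell
# Tally devd state-change events per device node, as described above.
# Sketch only: the original pipeline wasn't quoted, and this embedded
# sample replaces /var/log/messages so the snippet is self-contained.
sample="Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=GEOM subsystem=DEV type=MEDIACHANGE cdev=da2'
Sep 10 04:28:33 backup11 devd: Processing event '!system=DEVFS subsystem=CDEV type=MEDIACHANGE cdev=da6'"

# Extract the trailing cdev=... token, then count occurrences per device.
# Against the real log the first stage would read /var/log/messages instead.
counts=$(printf '%s\n' "$sample" | grep -o "cdev=[^']*'" | sort | uniq -c)
printf '%s\n' "$counts"   # prints counts per device: 2 for da2, 1 for da6
```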