Date: Fri, 22 Mar 2019 10:12:02 +0100
From: "Aurelien \"beorn\" ROUGEMONT" <beorn@binaries.fr>
To: freebsd-current@freebsd.org
Subject: Re: lsi
Message-ID: <27f18d66-d3f2-3e33-56d0-e9a1ddb37e1c@binaries.fr>
In-Reply-To: <b78c6384-607f-6742-1be6-5c0dfa801320@binaries.fr>
References: <b78c6384-607f-6742-1be6-5c0dfa801320@binaries.fr>

On 3/22/19 10:06 AM, Aurelien "beorn" ROUGEMONT wrote:
> Hi the list,
>
> I have been using FreeBSD at home and in production for years, and today
> I stumbled upon a question I could not answer.
>
> Context
> -----------------------------------------
>
> I'm building a backup server on a machine with this HBA:
>
> 3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
>    Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
>    Flags: bus master, fast devsel, latency 0, IRQ 34
>    I/O ports at e000
>    Memory at fb160000 (64-bit, non-prefetchable)
>    Memory at fb100000 (64-bit, non-prefetchable)
>    Expansion ROM at fb140000 [disabled]
>    Capabilities: [50] Power Management version 3
>    Capabilities: [68] Express Endpoint, MSI 00
>    Capabilities: [d0] Vital Product Data
>    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
>    Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
>    Capabilities: [100] Advanced Error Reporting
>    Capabilities: [1e0] Secondary PCI Express <?>
>    Capabilities: [1c0] Power Budgeting <?>
>    Capabilities: [190] Dynamic Power Allocation <?>
>    Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
>
> After pushing the server's I/O to its limits, the server had a very nasty
> crash.
>
> This happens very seldom: in roughly 10 years, across the petabytes of
> storage servers I have kept running, it has always been hardware or
> driver/firmware related.
>
> Shortening read at 4292967280 from 16 to 15
> ZFS: i/o error - all block copies unavailable
> ZFS: can't read object set for dataset 52
> ZFS: can't open root filesystem
> gptzfsboot: failed to mount default pool zroot
>
> After simply reinstalling the bootloaders (to no effect) and checking the
> partition tables, I went digging deep into the FreeBSD codebase. I found
> that it was a ZFS problem.
>
> The nasty crash was indeed due to ZFS data corruption.
> Hence the checksum errors reported while scrubbing the zpool from a
> rescue network boot image:
>
>   pool: zroot
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error. An
>         attempt was made to correct the error. Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub in progress since Fri Mar 15 15:15:25 2019
>         49.6G scanned out of 1.65T at 109M/s, 4h15m to go
>         677M repaired, 2.94% done
> config:
>
>         NAME              STATE   READ WRITE CKSUM
>         zroot             ONLINE     0     0     0
>           raidz2-0        ONLINE     0     0     0
>             mfisyspd0p3   ONLINE     0     0 5.44K  (repairing)
>             mfisyspd1p3   ONLINE     0     0 4.76K  (repairing)
>             mfisyspd10p3  ONLINE     0     0 4.35K  (repairing)
>             mfisyspd11p3  ONLINE     0     0 5.17K  (repairing)
>             mfisyspd2p3   ONLINE     0     0 4.76K  (repairing)
>             mfisyspd3p3   ONLINE     0     0 4.24K  (repairing)
>             mfisyspd4p3   ONLINE     0     0 4.75K  (repairing)
>             mfisyspd5p3   ONLINE     0     0 5.20K  (repairing)
>             mfisyspd6p3   ONLINE     0     0 4.51K  (repairing)
>             mfisyspd7p3   ONLINE     0     0 4.65K  (repairing)
>             mfisyspd8p3   ONLINE     0     0 4.70K  (repairing)
>             mfisyspd9p3   ONLINE     0     0 3.81K  (repairing)
>
> At this point the server was still unable to reboot. I had to force a
> re-copy of the boot data with a dumb:
>
> mv /boot{,.dist}
> cp -pr /boot{.dist,}
>
> Which turned out to be fine.
>
> Going further, I finally killed the zpool for good. It took me some time,
> and I stumbled upon the mfi(4) and mrsas(4) man pages and code.
>
>     The mfi driver supports the following hardware:
>
>     o  LSI MegaRAID SAS 1078
>     o  LSI MegaRAID SAS 8408E
>     o  LSI MegaRAID SAS 8480E
>     o  LSI MegaRAID SAS 9240
>     o  LSI MegaRAID SAS 9260
>     o  Dell PERC5
>     o  Dell PERC6
>     o  IBM ServeRAID M1015 SAS/SATA
>     o  IBM ServeRAID M1115 SAS/SATA
>     o  IBM ServeRAID M5015 SAS/SATA
>     o  IBM ServeRAID M5110 SAS/SATA
>     o  IBM ServeRAID-MR10i
>     o  Intel RAID Controller SRCSAS18E
>     o  Intel RAID Controller SROMBSAS18E
>
>     The mrsas driver supports the following hardware:
>
>     [ Thunderbolt 6Gb/s MR controller ]
>     o  LSI MegaRAID SAS 9265
>     o  LSI MegaRAID SAS 9266
>     o  LSI MegaRAID SAS 9267
>     o  LSI MegaRAID SAS 9270
>     o  LSI MegaRAID SAS 9271
>     o  LSI MegaRAID SAS 9272
>     o  LSI MegaRAID SAS 9285
>     o  LSI MegaRAID SAS 9286
>     o  DELL PERC H810
>     o  DELL PERC H710/P
>

There was a detection priority problem: mfi(4) wins the probe and attaches
to an HBA (the 9271) that is on the mrsas(4) list instead. The fix was to add

    hw.mfi.mrsas_enable=1

in /boot/loader.conf. After this the server behaved correctly.

Should it be fixed for everyone?

NB: sorry, my last email was mistakenly sent unfinished.
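
For anyone wanting to script the loader.conf change described above, here is a minimal sketch of an idempotent helper. The function name set_tunable and its arguments are hypothetical (this is not a FreeBSD tool), and it writes the variable="value" form that loader.conf(5) documents rather than the bare form quoted above. It only defines the function, so nothing is modified until you call it.

```shell
#!/bin/sh
# set_tunable FILE NAME VALUE
# Hypothetical helper: make sure FILE contains exactly one
# NAME="VALUE" assignment, replacing any earlier uncommented one.
set_tunable() {
    file=$1
    name=$2
    value=$3
    tmp="${file}.tmp.$$"

    # Create the file if it does not exist yet.
    [ -e "$file" ] || : > "$file"

    # Copy every line that does not start with the literal text NAME=.
    # index() does a plain string match, so the dots in the tunable
    # name are not treated as regex wildcards.
    awk -v n="$name" 'index($0, n "=") != 1 { print }' "$file" > "$tmp" || return 1

    # Append the new assignment.
    printf '%s="%s"\n' "$name" "$value" >> "$tmp" || return 1

    # Write back in place so the original file keeps its ownership and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

With that defined, the change from this message would be `set_tunable /boot/loader.conf hw.mfi.mrsas_enable 1` as root, followed by a reboot. Running it twice leaves a single entry, and commented-out lines are left untouched.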