Date: Fri, 22 Mar 2019 10:06:11 +0100
From: "Aurelien \"beorn\" ROUGEMONT" <beorn@binaries.fr>
To: freebsd-current@freebsd.org
Subject: lsi
Message-ID: <b78c6384-607f-6742-1be6-5c0dfa801320@binaries.fr>

Hi list,

I have been using FreeBSD at home and in production for years, and today
I stumbled upon a question I could not answer.

Context
-----------------------------------------

I'm building a backup server on a machine with this HBA:

3:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)
    Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9271-8i
    Flags: bus master, fast devsel, latency 0, IRQ 34
    I/O ports at e000
    Memory at fb160000 (64-bit, non-prefetchable)
    Memory at fb100000 (64-bit, non-prefetchable)
    Expansion ROM at fb140000 [disabled]
    Capabilities: [50] Power Management version 3
    Capabilities: [68] Express Endpoint, MSI 00
    Capabilities: [d0] Vital Product Data
    Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
    Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [1e0] Secondary PCI Express <?>
    Capabilities: [1c0] Power Budgeting <?>
    Capabilities: [190] Dynamic Power Allocation <?>
    Capabilities: [148] Alternative Routing-ID Interpretation (ARI)

After pushing the server's I/O to its limits, the server had a very nasty
crash. That happens very rarely: in roughly 10 years, across the petabytes
of storage servers I have kept running, the cause was always hardware or
driver/firmware related.

    Shortening read at 4292967280 from 16 to 15
    ZFS: i/o error - all block copies unavailable
    ZFS: can't read object set for dataset 52
    ZFS: can't open root filesystem
    gptzfsboot: failed to mount default pool zroot

After reinstalling the bootloaders (to no effect) and checking the
partition tables, I went digging through the FreeBSD codebase. I found
that it was a ZFS problem: the nasty crash was indeed due to ZFS data
corruption.

Hence the checksum errors while scrubbing the zpool from a rescue network
boot image:

  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Mar 15 15:15:25 2019
        49.6G scanned out of 1.65T at 109M/s, 4h15m to go
        677M repaired, 2.94% done
config:

        NAME              STATE     READ WRITE CKSUM
        zroot             ONLINE       0     0     0
          raidz2-0        ONLINE       0     0     0
            mfisyspd0p3   ONLINE       0     0 5.44K  (repairing)
            mfisyspd1p3   ONLINE       0     0 4.76K  (repairing)
            mfisyspd10p3  ONLINE       0     0 4.35K  (repairing)
            mfisyspd11p3  ONLINE       0     0 5.17K  (repairing)
            mfisyspd2p3   ONLINE       0     0 4.76K  (repairing)
            mfisyspd3p3   ONLINE       0     0 4.24K  (repairing)
            mfisyspd4p3   ONLINE       0     0 4.75K  (repairing)
            mfisyspd5p3   ONLINE       0     0 5.20K  (repairing)
            mfisyspd6p3   ONLINE       0     0 4.51K  (repairing)
            mfisyspd7p3   ONLINE       0     0 4.65K  (repairing)
            mfisyspd8p3   ONLINE       0     0 4.70K  (repairing)
            mfisyspd9p3   ONLINE       0     0 3.81K  (repairing)

At this point the server was still unable to reboot. I had to force a
re-copy of the boot data with a dumb:

    mv /boot{,.dist}
    cp -pr /boot{.dist,}

That turned out to be fine. Going further, I finally killed the zpool for
good. It took me some time, and I stumbled upon the mfi(4) and mrsas(4)
man pages and code.

    The mfi driver supports the following hardware:

    o  LSI MegaRAID SAS 1078
    o  LSI MegaRAID SAS 8408E
    o  LSI MegaRAID SAS 8480E
    o  LSI MegaRAID SAS 9240
    o  LSI MegaRAID SAS 9260
    o  Dell PERC5
    o  Dell PERC6
    o  IBM ServeRAID M1015 SAS/SATA
    o  IBM ServeRAID M1115 SAS/SATA
    o  IBM ServeRAID M5015 SAS/SATA
    o  IBM ServeRAID M5110 SAS/SATA
    o  IBM ServeRAID-MR10i
    o  Intel RAID Controller SRCSAS18E
    o  Intel RAID Controller SROMBSAS18E

    The mrsas driver supports the following hardware:

    [ Thunderbolt 6Gb/s MR controller ]
    o  LSI MegaRAID SAS 9265
    o  LSI MegaRAID SAS 9266
    o  LSI MegaRAID SAS 9267
    o  LSI MegaRAID SAS 9270
    o  LSI MegaRAID SAS 9271
    o  LSI MegaRAID SAS 9272
    o  LSI MegaRAID SAS 9285
    o  LSI MegaRAID SAS 9286
    o  DELL PERC H810
    o  DELL PERC H710/P

There was a detection priority problem:

    hw.mfi.mrsas_enable=1
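
As far as I can tell from mrsas(4), the tunable above is set as a line in /boot/device.hints, so that mrsas rather than mfi attaches to Thunderbolt-class controllers such as the 9271-8i. A minimal sketch of adding it without duplicating the line follows. The helper name enable_mrsas_hint is hypothetical and not part of FreeBSD; the quoted form of the value is the one I believe the man page uses.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): append the mrsas preference
# hint to a hints file unless that exact line is already present.
enable_mrsas_hint() {
    hints_file="$1"
    hint='hw.mfi.mrsas_enable="1"'
    # grep: -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF "$hint" "$hints_file" 2>/dev/null; then
        printf '%s\n' "$hint" >> "$hints_file"
    fi
}

# Usage on a real system (as root), followed by a reboot:
#   enable_mrsas_hint /boot/device.hints
```

After a reboot, dmesg should show which of the two drivers attached to the controller. Note that with mrsas the disks will most likely show up as da* devices instead of mfisyspd*, so anything referring to the old device names needs checking first.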