Date: Mon, 26 Oct 2020 22:50:18 +0100
From: Juraj Lutter <juraj@lutter.sk>
To: freebsd-stable@freebsd.org
Subject: Interrupt problems(?) on Dell R740xd
Message-ID: <9FD07762-5744-480C-A289-DDB09730A74D@lutter.sk>

Hi,

on a Dell R740xd with:

- 22x nvm0: Dell Express Flash PM1725b 1.6TB SFF
- 2x ATA SSDSC2KG240G8R
- 2 package(s) x 8 core(s) x 2 hardware threads
- 256GB RAM

running 12.2-STABLE r367058, I've run into a problem where, after some time, the machine locks up in certain operations (mkdir, for example; not always the same one).

In top output, entries similar to these can be seen:

12 root -80 - 0B 7936K WAIT  0 0:05 0.00% intr{irq48: pcib12+++}
12 root -88 - 0B 7936K WAIT  6 0:05 0.00% intr{irq16: ahci0 xhci0*}
12 root -80 - 0B 7936K WAIT  8 0:05 0.00% intr{irq53: pcib16++}
12 root -80 - 0B 7936K WAIT 12 0:05 0.00% intr{irq54: pcib17++}

For example, running poudriere:

4124 1 I+ 0:00.21 /usr/local/libexec/poudriere/sh -e /usr/local/share/poudriere/bulk.sh
4217 1 D+ 0:00.00 cap_mkdb /poudriere/build/data/.m/12sgx64-default/ref/etc/login.conf

And then even the root pool gets checksum errors, with a subsequent scrub needed:

Oct 26 11:55:42 bnts-nvs-n1 ZFS[4117]: pool I/O failure, zpool=$zroot error=$97
Oct 26 11:55:42 bnts-nvs-n1 ZFS[4118]: checksum mismatch, zpool=$zroot path=$/dev/da0p3 offset=$30089228288 size=$53248
Oct 26 11:55:42 bnts-nvs-n1 ZFS[4119]: checksum mismatch, zpool=$zroot path=$/dev/da1p3 offset=$30089228288 size=$53248
Oct 26 11:55:49 bnts-nvs-n1 ZFS[4121]: pool I/O failure, zpool=$zroot error=$97
Oct 26 11:56:26 bnts-nvs-n1 ZFS[4239]: pool I/O failure, zpool=$zroot error=$97

This all happens when "increased" I/O is going via mrsas-attached disks:

AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd
mrsas0: <AVAGO Invader SAS Controller> port 0x4000-0x40ff mem 0x9db00000-0x9db0ffff,0x9da00000-0x9dafffff irq 32 at device 0.0 numa-domain 0 on pci4
mrsas0: FW now in Ready state
mrsas0: Using MSI-X with 32 number of vectors
mrsas0: FW supports <96> MSIX vector,Online CPU 32 Current MSIX <32>
mrsas0: max sge: 0x46, max chain frame size: 0x400, max fw cmd: 0x39f
mrsas0: Issuing IOC INIT command to FW.
mrsas0: IOC INIT response received from FW.
mrsas0: System PD created target ID: 0x0
mrsas0: System PD created target ID: 0x1
mrsas0: FW supports: UnevenSpanSupport=1
mrsas0: max_fw_cmds: 927 max_scsi_cmds: 911
mrsas0: MSI-x interrupts setup success
mrsas0: mrsas_ocr_thread

Internal disks are:

<ATA SSDSC2KG240G8R DL67> at scbus17 target 0 lun 0 (pass2,da0)
<ATA SSDSC2KG240G8R DL67> at scbus17 target 1 lun 0 (pass3,da1)

Example:

da0 at mrsas0 bus 1 scbus17 target 0 lun 0
da0: <ATA SSDSC2KG240G8R DL67> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number BTYG01730DP5240AGN
da0: 150.000MB/s transfers
da0: 228936MB (468862128 512 byte sectors)

Internal AHCI is:

pci0: <ACPI PCI bus> numa-domain 0 on pcib0
pci0: <dasp, performance counters> at device 8.1 (no driver attached)
pci0: <unknown> at device 17.0 (no driver attached)
ahci0: <Intel Lewisburg AHCI SATA controller>
ahci0: AHCI v1.31 with 6 6Gbps ports, Port Multiplier not supported
ahci1: <Intel Lewisburg AHCI SATA controller>
ahci1: AHCI v1.31 with 8 6Gbps ports, Port Multiplier not supported

sesutil map excerpt:

ses0:
    Enclosure Name: AHCI SGPIO Enclosure 2.00
    Enclosure ID: 3061686369656d30
    Element 0, Type: Array Device Slot
        Status: Unsupported (0x00 0x00 0x00 0x00)
        Description: Drive Slots

NVMe disks are:

nda0 at nvme0 bus 0 scbus19 target 0 lun 1
nda0: <Dell Express Flash PM1725b 1.6TB SFF 1.1.0 S5CUNA0N201038>
nda0: Serial Number S5CUNA0N201038
nda0: nvme version 1.2 x4 (max x4) lanes PCIe Gen3 (max Gen3) link
nda0: 1526185MB (3125627568 512 byte sectors)

The machine also has 4x bge and 4x bnxt.

With hw.pci.enable_msi="0" set, it's slightly better; with hw.pci.enable_msi="1", it happens more often and under even lower load than with enable_msi=0. enable_msix is set to 1.

Once the machine locks up, one or more of the following also appears:

bge2: Interface stopped DISTRIBUTING, possible flapping - this might be caused by stuck interrupt(?)
nvme0: Missing interrupt

The only way out is to reboot.
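
One rough way to test the stuck-interrupt suspicion would be to compare two `vmstat -i` snapshots taken while the I/O load is running and list the interrupt sources whose counters did not advance. The sketch below is only an illustration: the helper names (`parse_vmstat_i`, `stalled_sources`, `snapshot`) are made up for this example, and it assumes the usual three-column `vmstat -i` layout (source name, cumulative total, rate).

```python
import subprocess


def parse_vmstat_i(text):
    """Map interrupt source name -> cumulative total from `vmstat -i` text.

    Assumes each data line ends with two integer fields (total, rate);
    the source name itself may contain spaces, e.g. "irq16: ahci0 xhci0*".
    """
    totals = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 2)  # split off the last two fields only
        if len(parts) != 3:
            continue
        name, total, rate = parts
        if not (total.isdigit() and rate.isdigit()):
            continue  # skips the "interrupt total rate" header line
        name = name.strip()
        if name == "Total":
            continue  # summary row, not an interrupt source
        totals[name] = int(total)
    return totals


def stalled_sources(before, after):
    """Names present in both snapshots whose counter did not change."""
    return sorted(n for n in before if n in after and after[n] == before[n])


def snapshot():
    """Take one snapshot; assumes FreeBSD's vmstat(8) is on PATH."""
    return parse_vmstat_i(subprocess.check_output(["vmstat", "-i"], text=True))


# Hypothetical usage on the affected machine:
#   first = snapshot(); time.sleep(10); second = snapshot()
#   print(stalled_sources(first, second))
```

A counter that stays flat is only suspicious for a device that should be busy at the time; idle devices will show up in the list as well, so the result needs to be read against the running workload.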

And I wonder, what steps could I take to narrow down the source of the problem?

The machine is not yet in production; I can even try -CURRENT on it as a last resort.

The one thing I'm also considering is disabling USB so that it does not share interrupt(s) with ahci.

The weird thing is that it can survive a full buildworld with 1 make job, but not with 32 or even 16.

Has anyone come across something like this?

Any hints are welcome.

Thanks.

--
Juraj Lutter
XMPP: juraj (at) lutter.sk
GSM: +421907986576