From: Juraj Lutter
Date: Mon, 26 Oct 2020 22:50:18 +0100
To: freebsd-stable@freebsd.org
Subject: Interrupt problems(?) on Dell R740xd
Message-Id: <9FD07762-5744-480C-A289-DDB09730A74D@lutter.sk>

Hi,

on a Dell R740xd with:

- 22x nvm0: Dell Express Flash PM1725b 1.6TB SFF
- 2x ATA SSDSC2KG240G8R
- 2 package(s) x 8 core(s) x 2 hardware threads
- 256GB RAM

running 12.2-STABLE r367058, I've run into a problem where, after some time, the machine locks up in certain operations (mkdir, for example; it is not always the same one).

In top output, entries similar to these can be seen:

12 root  -80  -  0B  7936K WAIT   0  0:05  0.00% intr{irq48: pcib12+++}
12 root  -88  -  0B  7936K WAIT   6  0:05  0.00% intr{irq16: ahci0 xhci0*}
12 root  -80  -  0B  7936K WAIT   8  0:05  0.00% intr{irq53: pcib16++}
12 root  -80  -  0B  7936K WAIT  12  0:05  0.00% intr{irq54: pcib17++}

For example, when running poudriere:

4124  1  I+  0:00.21 /usr/local/libexec/poudriere/sh -e /usr/local/share/poudriere/bulk.sh
4217  1  D+  0:00.00 cap_mkdb /poudriere/build/data/.m/12sgx64-default/ref/etc/login.conf

And then even the root pool gets checksum errors, with a subsequent scrub needed:

Oct 26 11:55:42 bnts-nvs-n1 ZFS[4117]: pool I/O failure, zpool=$zroot error=$97
Oct 26 11:55:42 bnts-nvs-n1 ZFS[4118]: checksum mismatch, zpool=$zroot path=$/dev/da0p3 offset=$30089228288 size=$53248
Oct 26 11:55:42 bnts-nvs-n1 ZFS[4119]: checksum mismatch, zpool=$zroot path=$/dev/da1p3 offset=$30089228288 size=$53248
Oct 26 11:55:49 bnts-nvs-n1 ZFS[4121]: pool I/O failure, zpool=$zroot error=$97
Oct 26 11:56:26 bnts-nvs-n1 ZFS[4239]: pool I/O failure, zpool=$zroot error=$97

This all happens when "increased" I/O is going via mrsas-attached disks:

AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd
mrsas0: port 0x4000-0x40ff mem 0x9db00000-0x9db0ffff,0x9da00000-0x9dafffff irq 32 at device 0.0 numa-domain 0 on pci4
mrsas0: FW now in Ready state
mrsas0: Using MSI-X with 32 number of vectors
mrsas0: FW supports <96> MSIX vector,Online CPU 32 Current MSIX <32>
mrsas0: max sge: 0x46, max chain frame size: 0x400, max fw cmd: 0x39f
mrsas0: Issuing IOC INIT command to FW.
mrsas0: IOC INIT response received from FW.
mrsas0: System PD created target ID: 0x0
mrsas0: System PD created target ID: 0x1
mrsas0: FW supports: UnevenSpanSupport=1
mrsas0: max_fw_cmds: 927 max_scsi_cmds: 911
mrsas0: MSI-x interrupts setup success
mrsas0: mrsas_ocr_thread

Internal disks are:

at scbus17 target 0 lun 0 (pass2,da0)
at scbus17 target 1 lun 0 (pass3,da1)

Example:

da0 at mrsas0 bus 1 scbus17 target 0 lun 0
da0: Fixed Direct Access SPC-4 SCSI device
da0: Serial Number BTYG01730DP5240AGN
da0: 150.000MB/s transfers
da0: 228936MB (468862128 512 byte sectors)

Internal AHCI is:

pci0: numa-domain 0 on pcib0
pci0: at device 8.1 (no driver attached)
pci0: at device 17.0 (no driver attached)
ahci0:
ahci0: AHCI v1.31 with 6 6Gbps ports, Port Multiplier not supported
ahci1:
ahci1: AHCI v1.31 with 8 6Gbps ports, Port Multiplier not supported

sesutil map excerpt:

ses0:
    Enclosure Name: AHCI SGPIO Enclosure 2.00
    Enclosure ID: 3061686369656d30
    Element 0, Type: Array Device Slot
        Status: Unsupported (0x00 0x00 0x00 0x00)
        Description: Drive Slots

NVMe disks are:

nda0 at nvme0 bus 0 scbus19 target 0 lun 1
nda0:
nda0: Serial Number S5CUNA0N201038
nda0: nvme version 1.2 x4 (max x4) lanes PCIe Gen3 (max Gen3) link
nda0: 1526185MB (3125627568 512 byte sectors)

The machine also has 4x bge and 4x bnxt.

With hw.pci.enable_msi="0" set, it's slightly better; with hw.pci.enable_msi="1", it happens more often and under even lower load than with enable_msi=0. enable_msix is set to 1.

Once the machine locks up, one or more of the following also appear:

bge2: Interface stopped DISTRIBUTING, possible flapping - this might be caused by a stuck interrupt(?)
nvme0: Missing interrupt

The only way out is to reboot.

So I wonder: what steps could I take to narrow down the source of the problem?

The machine is not yet in production, so I can even try -CURRENT on it as a last resort.

One thing I'm also considering is disabling USB so that it does not share interrupt(s) with ahci.
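One possible first step for narrowing this down might be to compare per-source interrupt counters over a short interval while the load is running, to see which sources stop advancing (or storm) just before a lockup. Below is a minimal sketch in Python, assuming the usual three-column `vmstat -i` layout (interrupt name, total, rate); the function names are made up for illustration and this is not a tested diagnostic tool.

```python
import subprocess
import time


def parse_vmstat_i(text):
    """Parse `vmstat -i` style output into {interrupt name: total count}."""
    totals = {}
    for line in text.splitlines():
        # The name may contain spaces (e.g. "irq16: ahci0 xhci0*"),
        # so split the two numeric columns off from the right.
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue
        name, total, _rate = parts
        name = name.strip()
        if not total.isdigit() or name == "Total":
            continue  # skip the header line and the summary row
        totals[name] = int(total)
    return totals


def diff_counters(before, after):
    """Return how much each counter advanced between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def sample(interval=5.0):
    """Take two `vmstat -i` snapshots `interval` seconds apart and diff them."""
    first = parse_vmstat_i(subprocess.check_output(["vmstat", "-i"], text=True))
    time.sleep(interval)
    second = parse_vmstat_i(subprocess.check_output(["vmstat", "-i"], text=True))
    return diff_counters(first, second)


def report(deltas):
    """Print interrupt sources ordered by how much they advanced."""
    for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
        print(f"{delta:>10}  {name}")
```

Calling `report(sample())` in a loop during a poudriere or buildworld run would show, for each interval, which sources are busy and which sit at zero; a source that serves an active device yet stays at zero would be a candidate for the "Missing interrupt" symptom.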
The weird thing is that it can survive a full buildworld with 1 make job, but not with 32 or even 16.

Has anyone come across something like this?

Any hints are welcome.

Thanks.

--
Juraj Lutter
XMPP: juraj (at) lutter.sk
GSM: +421907986576