Date: Sun, 24 Nov 2024 17:23:06 -0600
From: "Jason A. Harmening"
To: freebsd-stable@freebsd.org
Subject: possible NVMe DMA buffer management issue in 14-stable
List-Archive: https://lists.freebsd.org/archives/freebsd-stable

Hi,

After updating from 13.4-stable to 14.2-stable earlier today, I've started seeing a few batches of entries like the following in my syslog:

Nov 24 16:17:52 corona kernel: DMAR4: Fault Overflow
Nov 24 16:17:52 corona kernel: nvme0: WRITE sqid:15 cid:121 nsid:1 lba:1615751416 len:256
Nov 24 16:17:52 corona kernel: DMAR4: nvme0: pci7:0:0 sid 700 fault acc 1 adt 0x0 reason 0x6 addr 42d000
Nov 24 16:17:52 corona kernel: nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:1 dnr:1 p:1 sqid:15 cid:121 cdw0:0
Nov 24 16:17:52 corona kernel: (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=604e68f8 0 ff 0 0 0
Nov 24 16:17:52 corona kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Nov 24 16:17:52 corona kernel: (nda0:nvme0:0:0:1): Error 5, Retries exhausted
Nov 24 16:17:52 corona ZFS[11614]: vdev I/O failure, zpool=zroot path=/dev/nda0p4 offset=824843563008 size=131072 error=5

I've had Intel DMAR enabled on this machine for a long time and haven't seen anything like this before.

The sequence of events here (with the DMAR fault first, followed by the NVMe transfer error), combined with the fact that I haven't yet seen DMAR faults for anything besides NVMe, plus the fact that I just upgraded from 13 to 14 a few hours ago, makes me suspect some nvme change between 13 and 14 introduced a subtle DMA buffer management bug that's being caught by the IOMMU.

Has anyone else seen anything similar?

Thanks,
Jason
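
For anyone comparing against their own logs, the following is a rough sketch (not part of the original report) that cross-checks the nvme0, nda0 and ZFS lines above against each other. It assumes the six values after "cdw=" are command dwords 10 through 15, that cdw10/cdw11 hold the starting LBA and the low 16 bits of cdw12 hold a zero-based block count (as I understand the NVMe Write command layout), and that the namespace uses 512-byte logical blocks. The function name decode_ncb and the constant ASSUMED_LBA_BYTES are made up for illustration.

```python
import re

# Strings copied from the syslog entries quoted above.
NVME_LINE = "nvme0: WRITE sqid:15 cid:121 nsid:1 lba:1615751416 len:256"
NCB_LINE = "WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=604e68f8 0 ff 0 0 0"
ZFS_LINE = ("vdev I/O failure, zpool=zroot path=/dev/nda0p4 "
            "offset=824843563008 size=131072 error=5")

# ASSUMPTION: 512-byte logical blocks; the logs do not state the LBA format.
ASSUMED_LBA_BYTES = 512


def decode_ncb(line):
    """Return (starting LBA, block count) from the hex dwords after 'cdw='.

    ASSUMPTION: the dump lists cdw10..cdw15 in order.
    """
    cdws = [int(word, 16) for word in line.split("cdw=")[1].split()]
    cdw10, cdw11, cdw12 = cdws[0], cdws[1], cdws[2]
    start_lba = (cdw11 << 32) | cdw10      # 64-bit LBA split over two dwords
    block_count = (cdw12 & 0xFFFF) + 1     # field is zero-based
    return start_lba, block_count


# Values as printed by the nvme driver and by ZFS.
logged_lba = int(re.search(r"lba:(\d+)", NVME_LINE).group(1))
logged_len = int(re.search(r"len:(\d+)", NVME_LINE).group(1))
zfs_size = int(re.search(r"size=(\d+)", ZFS_LINE).group(1))

start_lba, block_count = decode_ncb(NCB_LINE)
transfer_bytes = block_count * ASSUMED_LBA_BYTES

print("NCB matches nvme0 line:", (start_lba, block_count) == (logged_lba, logged_len))
print("transfer size matches ZFS size:", transfer_bytes == zfs_size)
```

If both checks come out true, the three log entries describe the same 128 KiB write, which fits the reading that a single failed DMA produced the whole batch. It says nothing about which buffer address the IOMMU rejected; that still comes from the DMAR4 line.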