Date:      Sun, 24 Nov 2024 17:23:06 -0600
From:      "Jason A. Harmening" <jah@freebsd.org>
To:        freebsd-stable@freebsd.org
Subject:   possible NVMe DMA buffer management issue in 14-stable
Message-ID:  <Z0O1WjrNXVcsITfd@corona>

Hi,

After updating from 13.4-stable to 14.2-stable earlier today, I've
started seeing a few batches of entries like the following in my syslog:

Nov 24 16:17:52 corona kernel: DMAR4: Fault Overflow
Nov 24 16:17:52 corona kernel: nvme0: WRITE sqid:15 cid:121 nsid:1 lba:1615751416 len:256
Nov 24 16:17:52 corona kernel: DMAR4: nvme0: pci7:0:0 sid 700 fault acc 1 adt 0x0 reason 0x6 addr 42d000
Nov 24 16:17:52 corona kernel: nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:1 dnr:1 p:1 sqid:15 cid:121 cdw0:0
Nov 24 16:17:52 corona kernel: (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=604e68f8 0 ff 0 0 0
Nov 24 16:17:52 corona kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Nov 24 16:17:52 corona kernel: (nda0:nvme0:0:0:1): Error 5, Retries exhausted
Nov 24 16:17:52 corona ZFS[11614]: vdev I/O failure, zpool=zroot path=/dev/nda0p4 offset=824843563008 size=131072 error=5
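
The fields in these lines can be cross-checked against one another. Below is a minimal Python sketch of that check, not a diagnostic tool. It assumes that the six values after `cdw=` in the nda0 line are CDW10 through CDW15 of the NVMe WRITE command, and that the namespace uses 512-byte logical blocks; neither is stated in the log itself. The names `decode_write` and `BLOCK_SIZE` are made up for illustration.

```python
# Values copied verbatim from the log lines above.
LBA = 1615751416        # "lba:" in the nvme0 WRITE line
LEN_BLOCKS = 256        # "len:" in the nvme0 WRITE line
CDW = [0x604e68f8, 0x0, 0xff, 0x0, 0x0, 0x0]  # "cdw=" in the nda0 line
ZFS_SIZE = 131072       # "size=" in the ZFS vdev I/O failure line

# Assumption: 512-byte logical blocks (the log does not state the LBA format).
BLOCK_SIZE = 512


def decode_write(cdw):
    """Decode starting LBA and block count, treating cdw as CDW10..CDW15.

    For an NVMe WRITE, CDW10/CDW11 hold the low/high halves of the starting
    LBA, and the low 16 bits of CDW12 hold the block count minus one.
    """
    start_lba = (cdw[1] << 32) | cdw[0]
    nblocks = (cdw[2] & 0xFFFF) + 1
    return start_lba, nblocks


start_lba, nblocks = decode_write(CDW)
print(start_lba == LBA)                   # command dwords vs. "lba:"
print(nblocks == LEN_BLOCKS)              # command dwords vs. "len:"
print(nblocks * BLOCK_SIZE == ZFS_SIZE)   # transfer size vs. ZFS "size="
```

Under these assumptions all three comparisons hold, so the nvme0, nda0, and ZFS lines appear to describe the same 128 KiB write.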

I've had Intel DMAR enabled on this machine for a long time and haven't
seen anything like this before.  Three things make me suspect that some
nvme change between 13 and 14 introduced a subtle DMA buffer management
bug that the IOMMU is now catching: the DMAR fault is logged first,
followed by the NVMe transfer error; I haven't yet seen DMAR faults for
any device besides NVMe; and I upgraded from 13 to 14 only a few hours
ago.
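
For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that pairs each NVMe error line with an earlier DMAR fault logged for the same device in the same second. The regular expressions are assumptions modeled only on the lines quoted in this message, so other kernel versions may format these messages differently. The function name `correlate` is made up for illustration.

```python
import re

# Patterns modeled on the syslog lines quoted in this message (an assumption;
# not an exhaustive description of the kernel's message formats).
DMAR_FAULT = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s\S+\skernel:\s(?P<unit>DMAR\d+):\s"
    r"(?P<dev>\w+):\s.*\bfault\b.*\breason\s(?P<reason>0x[0-9a-fA-F]+)\s"
    r"addr\s(?P<addr>[0-9a-fA-F]+)"
)
NVME_ERROR = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s\S+\skernel:\s(?P<dev>nvme\d+):\s"
    r"(?P<status>[A-Z ]+ERROR)\s\((?P<code>[0-9a-f]{2}/[0-9a-f]{2})\)"
)


def correlate(lines):
    """Return (timestamp, device, status code, fault address) tuples for
    NVMe errors preceded by a DMAR fault for the same device and second."""
    faults = {}  # (timestamp, device) -> address of the latest DMAR fault
    pairs = []
    for line in lines:
        m = DMAR_FAULT.match(line)
        if m:
            faults[(m["ts"], m["dev"])] = int(m["addr"], 16)
            continue
        m = NVME_ERROR.match(line)
        if m and (m["ts"], m["dev"]) in faults:
            key = (m["ts"], m["dev"])
            pairs.append((m["ts"], m["dev"], m["code"], faults[key]))
    return pairs


sample = [
    "Nov 24 16:17:52 corona kernel: DMAR4: Fault Overflow",
    "Nov 24 16:17:52 corona kernel: DMAR4: nvme0: pci7:0:0 sid 700 fault "
    "acc 1 adt 0x0 reason 0x6 addr 42d000",
    "Nov 24 16:17:52 corona kernel: nvme0: DATA TRANSFER ERROR (00/04) "
    "crd:0 m:1 dnr:1 p:1 sqid:15 cid:121 cdw0:0",
]
for ts, dev, code, addr in correlate(sample):
    print(f"{ts} {dev}: status {code} after DMAR fault at {addr:#x}")
```

Matching on the one-second syslog timestamp is coarse, so a pairing only shows that the two messages were logged together, not that one caused the other.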

Has anyone else seen anything similar?

Thanks,
Jason