Date:      Mon, 06 Apr 2026 10:03:43 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 294280] mrsas: crash dump collection fails on RAID1 VD behind controller; firmware reports invalidSgl=1 on dump-time WRITE(10) (64KB)
Message-ID:  <bug-294280-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=294280

            Bug ID: 294280
           Summary: mrsas: crash dump collection fails on RAID1 VD behind
                    controller; firmware reports invalidSgl=1 on dump-time
                    WRITE(10) (64KB)
           Product: Base System
           Version: 15.0-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: chandrakanth.patil@broadcom.com

Crash dump collection fails on a system where the OS is installed on a RAID1
virtual disk exposed by the mrsas controller. During a manual panic, dump I/O
fails after repeated write errors. Controller / firmware-side logs report
invalidSgl=1 for the failing WRITE(10) request.

The failure is reproducible only during panic dump collection so far. Normal
runtime I/O to the same VD has not shown a visible failure in our current
testing.

When a manual crash is triggered, the kernel starts writing the dump and then
aborts with an I/O error:

panic: vm_fault_lookup: fault on nofault entry, addr: 0xffffffff8216d000
cpuid = 8
time = 1771836706
KDB: stack backtrace:
#0 0xffffffff80bbe1ed at kdb_backtrace+0x5d
#1 0xffffffff80b71576 at vpanic+0x136
#2 0xffffffff80b71433 at panic+0x43
#3 0xffffffff80f07a9b at vm_fault+0x17cb
#4 0xffffffff80f061e1 at vm_fault_trap+0x81
#5 0xffffffff81079d99 at trap_pfault+0x1f9
#6 0xffffffff8104ff18 at calltrap+0x8
#7 0xffffffff80bbf59e at kobj_init+0xe
#8 0xffffffff80babe63 at device_set_driver+0xa3
#9 0xffffffff80babb24 at device_probe_child+0xc4
#10 0xffffffff80baccd1 at device_probe+0x71
#11 0xffffffff80bace8e at device_probe_and_attach+0xe
#12 0xffffffff8083d362 at pci_driver_added+0xf2
#13 0xffffffff80baa8c9 at devclass_driver_added+0x29
#14 0xffffffff80baa85e at devclass_add_driver+0x11e
#15 0xffffffff80b4b575 at module_register_init+0x85
#16 0xffffffff80b3c0df at linker_load_module+0xc0f
#17 0xffffffff80b3dcd5 at kern_kldload+0x165
Uptime: 6m33s
Dumping 4071 out of 130287 MB:
mrsas0: FW cmd complete status 3c
(da2:mrsas0:0:3:0): WRITE(10). CDB: 2a 00 2b 49 49 d7 00 00 80 00
(da2:mrsas0:0:3:0): CAM status: CCB request completed with an error
(da2:mrsas0:0:3:0): Error 5, Retries exhausted
Aborting dump due to I/O error.

** DUMP FAILED (ERROR 5) **

Controller / firmware log

Controller / firmware-side analysis reports the following for the same failing
command:

12/23/25  9:08:48.929: C0:LdCmdValidateLdIo: ld:0 Data length 10000 invalidSgl 1 for Read/write IO with CDB 2a


The failing CDB is:

2a 00 2b 49 49 d7 00 00 80 00

This is a WRITE(10) request with transfer length 0x0080 blocks. For a 512-byte
block device, that is 0x80 * 512 = 0x10000 bytes, i.e. 64KB. The firmware log
reports "Data length 10000", which matches if that value is hexadecimal, so the
transfer size itself appears consistent with the CDB. The failure being
reported by firmware is specifically that the SGL attached to this host command
is invalid, not that the byte count itself is unexpected.
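
For reference, the decoding above can be reproduced with a short script. This
is only an illustrative sketch, not driver code: decode_write10 is a
hypothetical helper name, and the 512-byte block size is the assumption already
stated above.

```python
import struct


def decode_write10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    decode_write10 is a hypothetical helper; block_size=512 is an assumption.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("WRITE(10) CDB must be exactly 10 bytes")
    # Layout: opcode, flags, 4-byte big-endian LBA, group number,
    # 2-byte big-endian transfer length in blocks, control byte.
    opcode, _flags, lba, _group, nblocks, _control = struct.unpack(">BBIBHB", cdb)
    if opcode != 0x2A:
        raise ValueError(f"not a WRITE(10) opcode: {opcode:#04x}")
    return {"lba": lba, "blocks": nblocks, "bytes": nblocks * block_size}


info = decode_write10("2a 00 2b 49 49 d7 00 00 80 00")
print(hex(info["lba"]), info["blocks"], hex(info["bytes"]))
```

For the CDB in this report, the script yields LBA 0x2b4949d7, 128 blocks, and
0x10000 bytes.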

In upstream FreeBSD mrsas, CAM I/O requests are submitted through the SIM
action path, and the SIM is registered with mrsas_cam_poll as the CAM poll
callback, so polled I/O for this driver goes through the CAM poll path.
Upstream also sets ccb->cpi.maxio based on sc->max_sectors_per_req * 512, so
the driver advertises its maximum transfer size to CAM in bytes.
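
As a rough illustration of the maxio arithmetic, the sketch below checks
whether the 64KB dump write would fit under the advertised limit. The
max_sectors_per_req values are hypothetical placeholders, since this report
does not state the value on the affected controller, and the helper names are
ours.

```python
SECTOR_SIZE = 512  # bytes; the multiplier used in the maxio expression above


def maxio_bytes(max_sectors_per_req: int) -> int:
    """Byte limit derived the same way as max_sectors_per_req * 512."""
    return max_sectors_per_req * SECTOR_SIZE


def fits_in_maxio(transfer_bytes: int, max_sectors_per_req: int) -> bool:
    """True when a single request of transfer_bytes is within the limit."""
    return transfer_bytes <= maxio_bytes(max_sectors_per_req)


DUMP_WRITE_BYTES = 0x80 * SECTOR_SIZE  # the failing WRITE(10): 0x10000 bytes

# Hypothetical controller limits, NOT values read from real hardware.
for sectors in (64, 128, 256):
    verdict = "fits" if fits_in_maxio(DUMP_WRITE_BYTES, sectors) else "too large"
    print(f"max_sectors_per_req={sectors}: maxio={maxio_bytes(sectors):#x} -> {verdict}")
```

If the real limit is at least 128 sectors, the 64KB dump write is within maxio
and the size alone would not explain the rejection.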

Request for review

Since this issue is observed specifically during manual panic dump collection,
we would like help reviewing whether the crashed-kernel / panic-dump /
polled-I/O path can result in a malformed or incomplete SGL being attached to a
host-issued write request.

In particular, could the panic-time environment, nofault state, or polled CAM
path cause the host command to carry an SGL that does not fully or correctly
describe the requested 64KB transfer, even though normal runtime I/O may not
visibly fail?
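
To make "fully or correctly describe" concrete, here is a minimal sketch of the
kind of consistency check we have in mind. It is not the firmware's actual
validation logic, which we have not reproduced here; sgl_problems, the segment
tuples, the 4096-byte segment size, the 32-segment limit, and the addresses are
all hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (bus address, length in bytes), both hypothetical


def sgl_problems(sgl: List[Segment], expected_bytes: int, max_segments: int) -> List[str]:
    """Return a list of reasons why an SGL does not describe the transfer."""
    problems = []
    if not sgl:
        problems.append("SGL is empty")
    if len(sgl) > max_segments:
        problems.append(f"{len(sgl)} segments exceed limit of {max_segments}")
    for i, (_addr, length) in enumerate(sgl):
        if length <= 0:
            problems.append(f"segment {i} has non-positive length")
    total = sum(length for _addr, length in sgl)
    if total != expected_bytes:
        problems.append(f"SGL covers {total:#x} bytes, expected {expected_bytes:#x}")
    return problems


PAGE = 4096  # assumed segment size for this illustration
BASE = 0x10000000  # made-up bus address

complete = [(BASE + i * PAGE, PAGE) for i in range(16)]  # 16 * 4096 = 0x10000
truncated = complete[:15]  # one segment missing

print(sgl_problems(complete, 0x10000, max_segments=32))
print(sgl_problems(truncated, 0x10000, max_segments=32))
```

A dump-time SGL that failed a check of this kind while the CDB still requested
0x10000 bytes would be consistent with the invalidSgl report.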

The firmware team’s position is that the write failure is due to a malformed
host SGL, and their validation log points to the host-issued command itself.

Steps to reproduce:

1. Configure a RAID1 virtual disk on an mrsas controller.
2. Install FreeBSD on the RAID1 VD.
3. Configure crash dumps.
4. Trigger a manual panic.
5. Observe that dump collection starts and then fails with WRITE(10) I/O
errors.
6. Firmware logs report invalidSgl=1 for the failing WRITE(10).

-- 
You are receiving this mail because:
You are the assignee for the bug.
