Date:      Sun, 22 May 2022 05:27:29 +0000
From:      bugzilla-noreply@freebsd.org
To:        fs@FreeBSD.org
Subject:   [Bug 264141] nvme(4): Heavy load to SSD wedges 13.1 system: Controller in fatal status, resetting ... Resetting controller due to a timeout and possible hot unplug.
Message-ID:  <bug-264141-3630-bcBeelI2go@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-264141-3630@https.bugs.freebsd.org/bugzilla/>
References:  <bug-264141-3630@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264141

--- Comment #7 from Warner Losh <imp@FreeBSD.org> ---
nda is an alternative to nvd that uses CAM. Unless you need really high IOPS,
nda generally is better than nvd.

In loader.conf, add 'hw.nvme.use_nvd=0' and reboot.

We provide a compatible /dev/nvd* that points to /dev/nda* so almost all uses
of /dev/nvd* should work. But with zfs, chances are you won't notice.

I wrote this code, but had trouble driving the nvme drives I have access to
off the cliff to test all pathological behaviors. This is one I tested in
simulation.

However, looking at the code, I fear that this workaround likely won't help
you. The message happens when we fail the controller, and that seems to be
happening when reset fails (which we should report directly, but apparently
don't).

Do you have issues with the machines being too hot or having poor airflow over
the nvme cards so they get too hot? In general, FreeBSD (or any OS) shouldn't
be able to schedule so much I/O that the card's SoC controller fails... At
least not in a repeatable way across multiple drive types. The 'possible
hotplug' means we read all 'f's before trying to do a reset. If the card isn't
there at all, we'll time out and fail the controller (which may be what's
really going on). That suggests power and/or cabling issues if it isn't thermal
somehow. It would be good to eliminate these possibilities if at all possible.
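
To illustrate the "all 'f's" check described above, here is a minimal sketch,
not the actual nvme(4) driver code. It assumes the value being inspected is the
32-bit NVMe Controller Status (CSTS) register, where bit 1 is the Controller
Fatal Status flag defined by the NVMe specification, and that a read from a
device that has dropped off the PCIe bus returns all ones. The names
CSTS_ALL_ONES, CSTS_CFS, ctrlr_state and classify_csts are hypothetical and
chosen for illustration only.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not identifiers from nvme(4). */
#define CSTS_ALL_ONES 0xffffffffu /* a read from an absent device returns all 1s */
#define CSTS_CFS      (1u << 1)   /* Controller Fatal Status bit (NVMe spec) */

enum ctrlr_state {
    CTRLR_OK,    /* register looks sane, no fatal status reported */
    CTRLR_FATAL, /* controller is present and reports fatal status */
    CTRLR_GONE   /* all 'f's: device is probably no longer on the bus */
};

/*
 * Classify a raw CSTS value. The all-ones test must come first, because
 * 0xffffffff also has the fatal-status bit set and would otherwise be
 * mistaken for a controller that is present but has failed.
 */
static enum ctrlr_state
classify_csts(uint32_t csts)
{
    if (csts == CSTS_ALL_ONES)
        return CTRLR_GONE;
    if (csts & CSTS_CFS)
        return CTRLR_FATAL;
    return CTRLR_OK;
}
```

The ordering is the point of the sketch: a register that reads back as all
'f's says nothing about the controller's own state, only that the read never
reached it, which is why it points at power, cabling, or the card dropping off
the bus rather than at an overloaded controller.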

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.


