Date:      Wed, 15 Mar 2017 06:05:30 +0000
From:      bugzilla-noreply@freebsd.org
To:        freebsd-bugs@FreeBSD.org
Subject:   [Bug 211713] NVME controller failure: resetting
Message-ID:  <bug-211713-8-NOXcGi9cgd@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-211713-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-211713-8@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713

--- Comment #19 from Warner Losh <imp@FreeBSD.org> ---
My Samsung 960 PRO works great. We have hundreds of other drives at work that
are doing close to 3.8GB/s steady for hours... So it can work... Let's dig
down a level...

So the 'reset' messages that Terry is seeing in the two screen shots he just
posted are either the result of some prankster doing an nvmecontrol reset
(quite unlikely), or the result of the driver calling reset internally. It does
this only when it gets a timeout for a command. Assuming for the moment that
the timeout code is good, there's a command that's coming back bad and we wind
up here:

nvme_timeout(void *arg)
...
        /* Read csts to get value of cfs - controller fatal status. */
        csts.raw = nvme_mmio_read_4(ctrlr, csts);

        if (ctrlr->enable_aborts && csts.bits.cfs == 0) {
                /*
                 * If aborts are enabled, only use them if the controller is
                 *  not reporting fatal status.
                 */
                nvme_ctrlr_cmd_abort(ctrlr, tr->cid, qpair->id,
                    nvme_abort_complete, tr);
        } else
                nvme_ctrlr_reset(ctrlr);

So we read CSTS (the controller status). If aborts are enabled (which you can
do by setting the tunable hw.nvme.enable_aborts=1) and the controller is not
reporting fatal status, we abort the timed-out command. The tunable defaults
to 0, though, so unless you've already found and set it, we're probably taking
the else path and doing a reset.
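
To make the branch explicit, here is a minimal standalone sketch of the decision
in the excerpt above. It is not driver code: the enum and the function name
timeout_action_for() are hypothetical, invented for illustration. Only the
condition itself (aborts enabled and CFS clear) comes from the excerpt.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; these are not the driver's types. */
enum timeout_action {
        TIMEOUT_ABORT_CMD,      /* corresponds to nvme_ctrlr_cmd_abort() */
        TIMEOUT_RESET_CTRLR     /* corresponds to nvme_ctrlr_reset() */
};

/*
 * Mirrors the condition in the nvme_timeout() excerpt: abort the command
 * only if aborts are enabled AND the controller reports no fatal status
 * (CSTS.CFS == 0). Every other combination falls through to a reset.
 */
enum timeout_action
timeout_action_for(bool enable_aborts, unsigned int cfs)
{
        if (enable_aborts && cfs == 0)
                return (TIMEOUT_ABORT_CMD);
        return (TIMEOUT_RESET_CTRLR);
}
```

With the default of hw.nvme.enable_aborts=0, the first argument is false, so
every timeout ends in a reset no matter what CFS says.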

The reset turns out to be unsuccessful, and we drive off the road into the
ditch with the follow-on errors.

So, maybe try to set the tunable and try again. I'd normally ask about all the
stupid issues: is power good, are the connections good, are you seeing PCIe
errors (pciconf -lbace nvmeX), etc. here, but I kinda assume with so many
reports that's unlikely to be fruitful to everybody.

Maybe I'll try to find a Samsung 950 Pro 512GB (which form factor do you have?)
and try as well, but that process will take about a week or two since I have an
offsite soon and I don't think I can get one here before then.

-- 
You are receiving this mail because:
You are the assignee for the bug.


