Date: Wed, 15 Mar 2017 06:05:30 +0000
From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 211713] NVME controller failure: resetting
Message-ID: <bug-211713-8-NOXcGi9cgd@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-211713-8@https.bugs.freebsd.org/bugzilla/>
References: <bug-211713-8@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713

--- Comment #19 from Warner Losh <imp@FreeBSD.org> ---
My Samsung 960 PRO works great. We have other (hundreds) drives at work that
are doing close to 3.8GB/s steady for hours.... So it can work...

Let's dig down a level...

So the 'reset' messages that Terry is seeing in the two screen shots he just
posted are either the result of some prankster doing an nvmecontrol reset
(quite unlikely), or the result of the driver calling reset internally. It
does this only when it gets a timeout for a command. Assuming for the moment
that the timeout code is good, there's a command that's coming back bad and we
wind up here:

nvme_timeout(void *arg)
...
        /* Read csts to get value of cfs - controller fatal status. */
        csts.raw = nvme_mmio_read_4(ctrlr, csts);

        if (ctrlr->enable_aborts && csts.bits.cfs == 0) {
                /*
                 * If aborts are enabled, only use them if the controller is
                 * not reporting fatal status.
                 */
                nvme_ctrlr_cmd_abort(ctrlr, tr->cid, qpair->id,
                    nvme_abort_complete, tr);
        } else
                nvme_ctrlr_reset(ctrlr);

So we read CSTS (the controller status) and, if we've enabled aborts and the
controller isn't reporting fatal status, we abort the command. Otherwise we do
a reset. You can enable aborts by setting the tunable
hw.nvme.enable_aborts=1. It defaults to 0, so the reset is the path we may be
taking unless you've found this already. The reset turns out to be
unsuccessful, and we drive off the road into the ditch with the follow-on
errors. So, maybe try to set the tunable and try again.
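
To make the branch in nvme_timeout() easier to reason about, here is a minimal standalone sketch of just the decision it makes. It is not driver code: struct ctrlr_model, enum timeout_action and timeout_action_for() are hypothetical names made up for illustration, and the two booleans simply stand in for ctrlr->enable_aborts and csts.bits.cfs from the excerpt.

```c
#include <stdbool.h>

/* Hypothetical model of the two inputs nvme_timeout() looks at. */
struct ctrlr_model {
	bool enable_aborts;	/* stands in for hw.nvme.enable_aborts (default 0) */
	bool cfs;		/* stands in for CSTS.CFS, controller fatal status */
};

/* Hypothetical result type: which recovery path the timeout handler takes. */
enum timeout_action {
	TIMEOUT_ABORT_COMMAND,		/* the nvme_ctrlr_cmd_abort() branch */
	TIMEOUT_RESET_CONTROLLER	/* the nvme_ctrlr_reset() branch */
};

/*
 * Mirror of the if/else in the excerpt: abort only when aborts are enabled
 * and the controller is not reporting fatal status; otherwise reset.
 */
static enum timeout_action
timeout_action_for(const struct ctrlr_model *c)
{
	if (c->enable_aborts && !c->cfs)
		return (TIMEOUT_ABORT_COMMAND);
	return (TIMEOUT_RESET_CONTROLLER);
}
```

With both fields false (the default tunable, healthy controller) the sketch picks the reset path, which matches the behaviour described above. Only enable_aborts set and cfs clear selects the abort path.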

I'd normally ask about all the stupid issues: is power good, are the
connections good, are you seeing PCIe errors (pciconf -lbace nvmeX), etc.
here, but I kinda assume with so many reports that's unlikely to be fruitful
to everybody.

Maybe I'll try to find a Samsung 950 Pro 512GB (which form factor do you
have?) and try as well, but that process will take about a week or two since I
have an offsite soon and I don't think I can get one here before then.

-- 
You are receiving this mail because:
You are the assignee for the bug.