Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 22 May 2021 23:16:02 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Mark Millard <marklmi@yahoo.com>
Cc:        freebsd-current <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot
Message-ID:  <CANCZdfqcrkC=VAuvB_i=sR=7rp_e8KeVX=y2apXj7E5De9eG8g@mail.gmail.com>
In-Reply-To: <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com>
References:  <063D5E36-126F-497C-97AF-827BADC1ED2F.ref@yahoo.com> <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com>

next in thread | previous in thread | raw e-mail | index | archive | help
--00000000000042ff8605c2f864e8
Content-Type: text/plain; charset="UTF-8"

On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm <
freebsd-arm@freebsd.org> wrote:

> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd
>

cid 0x7e has no currently active command. The cid is used by the driver
to map completions back to requests.

So, there's usually 3 possibilities that I've seen this with.

(1) There's a missing cache flush so you get a bogus cpl back because
something stale
was read. It's unlikely to be this one because the rest of this look like a
successful
command completed: sc = 0 is successful completion and sct is a generic
command queued.

(2) We're looking at the completion record twice because we failed to
properly update the
head pointer and we've already completed the command. I've only ever seen
this in a
panic situation where we interrupt the completion routine because something
else
paniced.

(3) There's something that's corrupting the act_tr array in the qpair. I've
not seen this,
but if something else smashes that area (zeroing it in this case), then
that could cause
an error like this.

Or it could be something new I've not seen nor thought about before.


> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at      kdb_enter+0x48: undefined       f904411f
> db>
>
> Based on the "nvme" references, I expect this is tied to
> handling the Optane 480 GiByte that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
>
> "db> dump" seems to have worked.
>
> After the reboot, zpool scrub found no errors.
>
> For reference:
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1
> main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021
>  root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72
> arm64 aarch64 1400013 1400013
>

If you have the dump, I suggest starting to make sure that the act_tr array
looks sane. Make
sure all the live pointers point to a sane looking tr. Make sure that tr is
on the active list, etc

It will take a fair amount of driver reading, though, to see how we got
here. I'd also check to
make sure that qpair->num_enttries > cpl.cid (0x3e in this case).

Warner

--00000000000042ff8605c2f864e8
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote">=
<div dir=3D"ltr" class=3D"gmail_attr">On Sat, May 22, 2021 at 10:44 PM Mark=
 Millard via freebsd-arm &lt;<a href=3D"mailto:freebsd-arm@freebsd.org">fre=
ebsd-arm@freebsd.org</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quo=
te" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204=
);padding-left:1ex"># mount -onoatime 192.168.1.187:/usr/ports/ /mnt/<br>
# diff -r /usr/ports/ /mnt/ | more<br>
nvme0: cpl does not map to outstanding cmd<br>
cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0<br>
panic: received completion for unknown cmd<br></blockquote><div><br></div><=
div>cid 0x7e has no currently active command. The cid is used by the driver=
</div><div>to map completions=C2=A0back to requests.</div><div><br></div><d=
iv>So, there&#39;s usually 3 possibilities that I&#39;ve seen this with.</d=
iv><div><br></div><div>(1) There&#39;s a missing cache flush so you get a b=
ogus cpl back because something=C2=A0stale</div><div>was read. It&#39;s unl=
ikely=C2=A0to be this one because the rest of this look like a successful</=
div><div>command completed: sc =3D 0 is successful completion and sct is a =
generic command queued.</div><div><br></div><div>(2) We&#39;re looking at t=
he completion record twice because=C2=A0we failed to properly update the</d=
iv><div>head pointer and we&#39;ve already completed the command. I&#39;ve =
only ever seen this in a</div><div>panic situation where we interrupt the c=
ompletion routine because something else</div><div>paniced.</div><div><br><=
/div><div>(3) There&#39;s something that&#39;s corrupting the act_tr array =
in the qpair. I&#39;ve not seen this,</div><div>but if something else smash=
es that area (zeroing it in this case), then that could cause</div><div>an =
error like this.</div><div><br></div><div>Or it could be something new I&#3=
9;ve not seen nor thought about before.</div><div>=C2=A0<br></div><blockquo=
te class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px =
solid rgb(204,204,204);padding-left:1ex">
cpuid =3D 3<br>
time =3D 1621743752<br>
KDB: stack backtrace:<br>
db_trace_self() at db_trace_self<br>
db_trace_self_wrapper() at db_trace_self_wrapper+0x30<br>
vpanic() at vpanic+0x188<br>
panic() at panic+0x44<br>
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc<br=
>
nvme_timeout() at nvme_timeout+0x3c<br>
softclock_call_cc() at softclock_call_cc+0x124<br>
softclock() at softclock+0x60<br>
ithread_loop() at ithread_loop+0x2a8<br>
fork_exit() at fork_exit+0x74<br>
fork_trampoline() at fork_trampoline+0x14<br>
KDB: enter: panic<br>
[ thread pid 12 tid 100028 ]<br>
Stopped at=C2=A0 =C2=A0 =C2=A0 kdb_enter+0x48: undefined=C2=A0 =C2=A0 =C2=
=A0 =C2=A0f904411f<br>
db&gt; <br>
<br>
Based on the &quot;nvme&quot; references, I expect this is tied to<br>
handling the Optane 480 GiByte that is in the PCIe slot<br>
and is the boot/only media for the machine doing the diff.<br>
<br>
&quot;db&gt; dump&quot; seems to have worked.<br>
<br>
After the reboot, zpool scrub found no errors.<br>
<br>
For reference:<br>
<br>
# uname -apKU<br>
FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b=
0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021=C2=A0 =C2=A0 =C2=A0root@CA72_=
16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys=
/GENERIC-DBG-CA72=C2=A0 arm64 aarch64 1400013 1400013<br></blockquote><div>=
<br></div><div>If you have the dump, I suggest starting to make sure that t=
he act_tr array looks sane. Make</div><div>sure all the live pointers point=
 to a sane looking tr. Make sure that tr is on the active list, etc</div><d=
iv><br></div><div>It will take a fair amount of driver reading, though, to =
see how we got here. I&#39;d also check to</div><div>make sure that qpair-&=
gt;num_enttries &gt; cpl.cid (0x3e in this case).</div><div><br></div><div>=
Warner</div></div></div>

--00000000000042ff8605c2f864e8--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfqcrkC=VAuvB_i=sR=7rp_e8KeVX=y2apXj7E5De9eG8g>