Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 23 May 2021 00:46:28 -0700
From:      Mark Millard via freebsd-current <freebsd-current@freebsd.org>
To:        Warner Losh <imp@bsdimp.com>
Cc:        freebsd-current <freebsd-current@freebsd.org>, freebsd-arm <freebsd-arm@freebsd.org>
Subject:   Re: I got a panic for "nvme0: cpl does not map to outstanding cmd" on a MACHIATObin Double Shot
Message-ID:  <90A23579-61B7-4619-AFD7-68439FCC16F3@yahoo.com>
In-Reply-To: <E5776EDA-8C2E-4309-9B90-9F402194F5BF@yahoo.com>
References:  <063D5E36-126F-497C-97AF-827BADC1ED2F.ref@yahoo.com> <063D5E36-126F-497C-97AF-827BADC1ED2F@yahoo.com> <CANCZdfqcrkC=VAuvB_i=sR=7rp_e8KeVX=y2apXj7E5De9eG8g@mail.gmail.com> <E5776EDA-8C2E-4309-9B90-9F402194F5BF@yahoo.com>

next in thread | previous in thread | raw e-mail | index | archive | help

On 2021-May-23, at 00:08, Mark Millard via freebsd-current =
<freebsd-current at freebsd.org> wrote:

> On 2021-May-22, at 22:16, Warner Losh <imp at bsdimp.com> wrote:
>=20
>> On Sat, May 22, 2021 at 10:44 PM Mark Millard via freebsd-arm =
<freebsd-arm@freebsd.org> wrote:
>> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
>> # diff -r /usr/ports/ /mnt/ | more
>> nvme0: cpl does not map to outstanding cmd
>> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
>> panic: received completion for unknown cmd
>>=20
>> cid 0x7e has no currently active command. The cid is used by the =
driver
>> to map completions back to requests.
>>=20
>> So, there's usually 3 possibilities that I've seen this with.
>>=20
>> (1) There's a missing cache flush so you get a bogus cpl back because =
something stale
>> was read. It's unlikely to be this one because the rest of this look =
like a successful
>> command completed: sc =3D 0 is successful completion and sct is a =
generic command queued.
>>=20
>> (2) We're looking at the completion record twice because we failed to =
properly update the
>> head pointer and we've already completed the command. I've only ever =
seen this in a
>> panic situation where we interrupt the completion routine because =
something else
>> paniced.
>>=20
>> (3) There's something that's corrupting the act_tr array in the =
qpair. I've not seen this,
>> but if something else smashes that area (zeroing it in this case), =
then that could cause
>> an error like this.
>=20
> Of note may be that I buildworld and buildkernel with extra
> tuning enabled, targeting the cortex-a72. In one past example
> this lead to finding a missing synchronization related to XHCI
> handling that was fixed. (The fix was not aarch64 specific at
> all.) For that: A cortex-a53 did not show the problem with or
> without that tuning. A cortex-a72 showed the problem only with
> the cortex-a72 tuning, not with targeting a cortex-a53 tuning
> or generic armv7, for example.
>=20
> Not that I've any evidence specifically suggesting such would
> be involved here. But it might be good to keep in mind as a
> possaibility.
>=20
>> Or it could be something new I've not seen nor thought about before.
>>=20
>> cpuid =3D 3
>> time =3D 1621743752
>> KDB: stack backtrace:
>> db_trace_self() at db_trace_self
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>> vpanic() at vpanic+0x188
>> panic() at panic+0x44
>> nvme_qpair_process_completions() at =
nvme_qpair_process_completions+0x1fc
>> nvme_timeout() at nvme_timeout+0x3c
>> softclock_call_cc() at softclock_call_cc+0x124
>> softclock() at softclock+0x60
>> ithread_loop() at ithread_loop+0x2a8
>> fork_exit() at fork_exit+0x74
>> fork_trampoline() at fork_trampoline+0x14
>> KDB: enter: panic
>> [ thread pid 12 tid 100028 ]
>> Stopped at      kdb_enter+0x48: undefined       f904411f
>> db>=20
>>=20
>> Based on the "nvme" references, I expect this is tied to
>> handling the Optane 480 GiByte that is in the PCIe slot
>> and is the boot/only media for the machine doing the diff.
>>=20
>> "db> dump" seems to have worked.
>>=20
>> After the reboot, zpool scrub found no errors.
>>=20
>> For reference:
>>=20
>> # uname -apKU
>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 =
main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021     =
root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.=
aarch64/sys/GENERIC-DBG-CA72  arm64 aarch64 1400013 1400013
>>=20
>> If you have the dump, I suggest starting to make sure that the act_tr =
array looks sane. Make
>> sure all the live pointers point to a sane looking tr. Make sure that =
tr is on the active list, etc
>>=20
>> It will take a fair amount of driver reading, though, to see how we =
got here. I'd also check to
>> make sure that qpair->num_enttries > cpl.cid (0x3e in this case).
>>=20
>=20
> Okay. I got this while trying to test an odd diff -r over NFS
> issue with the more recent software. So the two will potentially
> compete for time.
>=20
> As investigation will be exploratory for me, not familiar, I'll
> probably publish periodic notes on things as I go along looking
> at stuff.
>=20
> My first is that the /var/crash/core.txt.0 has a gdb backtrace:
>=20
> . . .
> #10 0xffff00000047900c in panic (
>    fmt=3D0x12 <error: Cannot access memory at address 0x12>)
>    at /usr/main-src/sys/kern/kern_shutdown.c:843
> #11 0xffff0000002226b4 in nvme_qpair_process_completions (
>    qpair=3Dqpair@entry=3D0xffffa00008724300)
>    at /usr/main-src/sys/dev/nvme/nvme_qpair.c:617
> #12 0xffff000000223354 in nvme_timeout =
(arg=3Darg@entry=3D0xffffa0000b053980)
>    at /usr/main-src/sys/dev/nvme/nvme_qpair.c:938
> #13 0xffff000000495bf8 in softclock_call_cc (c=3D0xffffa0000b0539a0,=20=

>    cc=3Dcc@entry=3D0xffff000000de3500 <cc_cpu+768>, direct=3D0)
>    at /usr/main-src/sys/kern/kern_timeout.c:696
> #14 0xffff000000495fb0 in softclock (arg=3D0xffff000000de3500 =
<cc_cpu+768>)
>    at /usr/main-src/sys/kern/kern_timeout.c:816
> #15 0xffff0000004356dc in intr_event_execute_handlers (p=3D<optimized =
out>,=20
>    ie=3D0xffffa000058bc700) at /usr/main-src/sys/kern/kern_intr.c:1168
> #16 ithread_execute_handlers (p=3D<optimized out>, =
ie=3D0xffffa000058bc700)
>    at /usr/main-src/sys/kern/kern_intr.c:1181
> #17 ithread_loop (arg=3D<optimized out>, arg@entry=3D0xffffa000058aef60)=

>    at /usr/main-src/sys/kern/kern_intr.c:1269
> #18 0xffff000000431f6c in fork_exit (
>    callout=3D0xffff000000435430 <ithread_loop>, =
arg=3D0xffffa000058aef60,=20
>    frame=3D0xffff0000eb7cc990) at =
/usr/main-src/sys/kern/kern_fork.c:1083
> #19 <signal handler called>
>=20
> So via kgdb . . .
>=20
> (kgdb) up 11
> #11 0xffff0000002226b4 in nvme_qpair_process_completions =
(qpair=3Dqpair@entry=3D0xffffa00008724300) at =
/usr/main-src/sys/dev/nvme/nvme_qpair.c:617
> 617				KASSERT(0, ("received completion for =
unknown cmd"));
>=20
> (kgdb) print/x cpl.cid
> $4 =3D 0x7e
> (kgdb) print/x qpair->num_entries
> $5 =3D 0x100
>=20
> Based on also seeing the code:
>=20
>        qpair->act_tr =3D malloc_domainset(sizeof(struct nvme_tracker =
*) *
>            qpair->num_entries, M_NVME, DOMAINSET_PREF(qpair->domain),
>            M_ZERO | M_WAITOK);
>=20
> (kgdb) print qpair->act_tr
> $6 =3D (struct nvme_tracker **) 0xffffa00008725800
> (kgdb) x/256g 0xffffa00008725800
> 0xffffa00008725800:	0x0000000000000000	0x0000000000000000
> 0xffffa00008725810:	0x0000000000000000	0x0000000000000000
> . . .
> 0xffffa00008725fe0:	0x0000000000000000	0x0000000000000000
> 0xffffa00008725ff0:	0x0000000000000000	0x0000000000000000
>=20
> It was all zeros (null pointers). No "live" pointers and, so,
> no tr's to inspect.
>=20
> As none of this is familiar context beyond general programming
> concepts, it may be some time before I find anything else
> potentially of interest to report. If you have other specific
> things you would like me to look at, let me know.
>=20

A fairly obvious thing I should have provided:

(kgdb) print/x *qpair
$15 =3D {ctrlr =3D 0xffff0000fe154000, id =3D 0x3, domain =3D 0x0, cpu =3D=
 0x2, vector =3D 0x3, rid =3D 0x4, res =3D 0xffffa000086ded80, tag =3D =
0xffffa0000877b780, num_entries =3D 0x100, num_trackers =3D 0x80,=20
  sq_tdbl_off =3D 0x1018, cq_hdbl_off =3D 0x101c, phase =3D 0x1, sq_head =
=3D 0x1f, sq_tail =3D 0x20, cq_head =3D 0x20, num_cmds =3D 0x420, =
num_intr_handler_calls =3D 0xe66c, num_retries =3D 0x0, num_failures =3D =
0x0,=20
  cmd =3D 0xffff000100ebb000, cpl =3D 0xffff000100ebf000, dma_tag =3D =
0xffffa0000b093e00, dma_tag_payload =3D 0xffffa000059ef000, queuemem_map =
=3D 0xffffa00005a07700, cmd_bus_addr =3D 0xacbb000,=20
  cpl_bus_addr =3D 0xacbf000, free_tr =3D {tqh_first =3D =
0xffffa0000b053a80, tqh_last =3D 0xffffa0000869da80}, outstanding_tr =3D =
{tqh_first =3D 0xffffa0000b053980, tqh_last =3D 0xffffa0000b053980}, =
queued_req =3D {
    stqh_first =3D 0x0, stqh_last =3D 0xffffa000087243c8}, act_tr =3D =
0xffffa00008725800, is_enabled =3D 0x1, lock =3D {lock_object =3D =
{lo_name =3D 0xffff00000090321f, lo_flags =3D 0x1030000, lo_data =3D =
0x0,=20
      lo_witness =3D 0xffffa0043fd96080}, mtx_lock =3D 0x0}}

Looks like I need to boot into the non-debug builds for the
other problem I'm testing for repeatability after a commit.


=3D=3D=3D
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?90A23579-61B7-4619-AFD7-68439FCC16F3>