Date: Mon, 15 Jul 2024 20:40:05 +0200 From: Mateusz Guzik <mjguzik@gmail.com> To: John Baldwin <jhb@freebsd.org> Cc: Mateusz Guzik <mjg@freebsd.org>, src-committers@freebsd.org, dev-commits-src-all@freebsd.org, dev-commits-src-main@freebsd.org Subject: Re: git: 87ee63bac69d - main - locks: add a runtime check for missing turnstile Message-ID: <CAGudoHEV1HLxQw1KPx4W0vaNLs4EpvvrT9Gq5YFaVNidnXjVpw@mail.gmail.com> In-Reply-To: <CAGudoHGbjgq9k%2B4VOGoyNSCiOxi69y5CZDmz75BK-85dfknkgA@mail.gmail.com> References: <202407111107.46BB7uSb007077@gitrepo.freebsd.org> <81cfe7ad-cbee-4122-abef-e47ce2b34f05@FreeBSD.org> <CAGudoHGuCtvKE_OxRAbZCnuO7zJUo6vMwrUopU681YQQMkBxkg@mail.gmail.com> <f7b90d7f-2115-435f-a62c-7186b800913b@FreeBSD.org> <CAGudoHGbjgq9k%2B4VOGoyNSCiOxi69y5CZDmz75BK-85dfknkgA@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, Jul 15, 2024 at 8:33=E2=80=AFPM Mateusz Guzik <mjguzik@gmail.com> w= rote: > > On Mon, Jul 15, 2024 at 8:21=E2=80=AFPM John Baldwin <jhb@freebsd.org> wr= ote: > > > > On 7/15/24 13:59, Mateusz Guzik wrote: > > > On Mon, Jul 15, 2024 at 6:22=E2=80=AFPM John Baldwin <jhb@freebsd.org= > wrote: > > >> > > >> On 7/11/24 07:07, Mateusz Guzik wrote: > > >>> The branch main has been updated by mjg: > > >>> > > >>> URL: https://cgit.FreeBSD.org/src/commit/?id=3D87ee63bac69dc49291f5= 5590b8baa57cad6c7d85 > > >>> > > >>> commit 87ee63bac69dc49291f55590b8baa57cad6c7d85 > > >>> Author: Mateusz Guzik <mjg@FreeBSD.org> > > >>> AuthorDate: 2024-07-11 00:17:27 +0000 > > >>> Commit: Mateusz Guzik <mjg@FreeBSD.org> > > >>> CommitDate: 2024-07-11 11:06:52 +0000 > > >>> > > >>> locks: add a runtime check for missing turnstile > > >>> > > >>> There are sometimes bugs which result in the unlock fast path= failing, > > >>> which in turns causes a not-helpful crash report when derefer= encing a > > >>> NULL turnstile. Help debugging such cases by pointing out wha= t happened > > >>> along with some debug. > > >>> > > >>> Sponsored by: Rubicon Communications, LLC ("Netgate") > > >>> --- > > >>> sys/kern/kern_mutex.c | 4 +++- > > >>> sys/kern/kern_rwlock.c | 16 ++++++++++++---- > > >>> 2 files changed, 15 insertions(+), 5 deletions(-) > > >>> > > >>> diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c > > >>> index 90361b23c09a..0fa624cc4bb1 100644 > > >>> --- a/sys/kern/kern_mutex.c > > >>> +++ b/sys/kern/kern_mutex.c > > >>> @@ -1053,7 +1053,9 @@ __mtx_unlock_sleep(volatile uintptr_t *c, uin= tptr_t v) > > >>> turnstile_chain_lock(&m->lock_object); > > >>> _mtx_release_lock_quick(m); > > >>> ts =3D turnstile_lookup(&m->lock_object); > > >>> - MPASS(ts !=3D NULL); > > >>> + if (__predict_false(ts =3D=3D NULL)) { > > >>> + panic("got NULL turnstile on mutex %p v %zx", m, v); > > >>> + } > > >> > > >> Hmm, this is just an expanded KASSERT() but always on rather than co= nditional on INVARIANTS? > > >> > > >> Do you have examples of the type of bugs that cause this? (Is it un= locking a freed mutex > > >> or the like?) We generally hide all these types of checks under INV= ARIANTS rather than > > >> shipping them in release kernels. > > >> > > > > > > Use-after-free, overflow, underflow, bitflip or what have you all can > > > fail the fast path. > > > > > > Once that happens and the kernel crashes with a null pointer deref, > > > here is a crash at netgate which prodded this: > > > calltrap() at calltrap+0x8/frame 0xfffffe0106720920 > > > --- trap 0xc, rip =3D 0xffffffff80d5ab70, rsp =3D 0xfffffe01067209f0,= rbp > > > =3D 0xfffffe0106720a00 --- > > > turnstile_broadcast() at turnstile_broadcast+0x40/frame 0xfffffe01067= 20a00 > > > __rw_wunlock_hard() at __rw_wunlock_hard+0x9e/frame 0xfffffe0106720a3= 0 > > > nd6_resolve_slow() at nd6_resolve_slow+0x2d7/frame 0xfffffe0106720aa0 > > > nd6_resolve() at nd6_resolve+0x125/frame 0xfffffe0106720b10 > > > ether_output() at ether_output+0x4e7/frame 0xfffffe0106720ba0 > > > ip_output_send() at ip_output_send+0xdc/frame 0xfffffe0106720be0 > > > ip_output() at ip_output+0x1295/frame 0xfffffe0106720ce0 > > > ip_forward() at ip_forward+0x3c2/frame 0xfffffe0106720d90 > > > ip_input() at ip_input+0x705/frame 0xfffffe0106720df0 > > > swi_net() at swi_net+0x138/frame 0xfffffe0106720e60 > > > ithread_loop() at ithread_loop+0x257/frame 0xfffffe0106720ef0 > > > fork_exit() at fork_exit+0x7f/frame 0xfffffe0106720f30 > > > fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0106720f30 > > > --- trap 0, rip =3D 0, rsp =3D 0, rbp =3D 0 --- > > > > > > Neither the register dump nor anything in the backtrace indicate what= happened. > > > > > > Since the kernel is going down anyway, one may as well get some debug= from it. > > > > If you don't mind the extra branches for sanity checks, why not just ru= n with > > INVARIANTS? That is, what makes these particular assertions different = from > > other assertions such that they should be on unconditionally? The last= line below > > applies to pretty much every other assertion in the tree. > > > > This adds a branch in the slowpath, a spot which should relatively > rarely execute compared to the fast path. On top of that the branch at > hand does not do any extra memory accesses or complex arithmetic. > > So no, I don't think I may as well run with INVARIANTS. How about this: if you strongly about this branch, feel free to revert the commit, I'm just going to keep the change at Netgate. But should you do it, make sure to not add avoidable branches in your stuff= . Cheers. --=20 Mateusz Guzik <mjguzik gmail.com>
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAGudoHEV1HLxQw1KPx4W0vaNLs4EpvvrT9Gq5YFaVNidnXjVpw>