Date:      Tue, 14 Jun 2016 22:26:14 -0500
From:      Eric Badger <eric@badgerio.us>
To:        FreeBSD Current <freebsd-current@freebsd.org>
Subject:   Kqueue races causing crashes
Message-ID:  <34035bf6-8b3c-d15c-765b-94bcc919ea2e@badgerio.us>

Hi there,

There seems to be some racy code in kern_event.c which is causing me to
run into some crashes. I've attached the test program used to generate
these crashes (build it and run the "go" script). They were produced in
a VM with 4 cores on 11 Alpha 3 (and originally 10.3). The crashes I've
seen come in a few varieties:

1. "userret: returning with the following locks held". This one is the
easiest to hit (assuming witness is enabled).

userret: returning with the following locks held:
exclusive sleep mutex process lock (process lock) r = 0 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125
panic: witness_warn
cpuid = 2
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000039d8e0
vpanic() at vpanic+0x182/frame 0xfffffe000039d960
kassert_panic() at kassert_panic+0x126/frame 0xfffffe000039d9d0
witness_warn() at witness_warn+0x3c6/frame 0xfffffe000039daa0
userret() at userret+0x9d/frame 0xfffffe000039dae0
amd64_syscall() at amd64_syscall+0x406/frame 0xfffffe000039dbf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe000039dbf0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x800b8a0ba, rsp = 0x7fffffffea98, rbp = 0x7fffffffeae0 ---
KDB: enter: panic
[ thread pid 64855 tid 100106 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db> show all locks
Process 64855 (watch) thread 0xfffff800066c3000 (100106)
exclusive sleep mutex process lock (process lock) r = 0 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125
Process 64855 (watch) thread 0xfffff8000696a500 (100244)
exclusive sleep mutex pmap (pmap) r = 0 (0xfffff800068c3138) locked @ /usr/src/sys/amd64/amd64/pmap.c:4067
exclusive sx vm map (user) (vm map (user)) r = 0 (0xfffff800068f6080) locked @ /usr/src/sys/vm/vm_map.c:3315
exclusive sx vm map (user) (vm map (user)) r = 0 (0xfffff800068c3080) locked @ /usr/src/sys/vm/vm_map.c:3311
db> ps
   pid  ppid  pgrp   uid   state   wmesg         wchan        cmd
64855   690   690     0  R+      (threaded)                  watch
100106                   Run     CPU 2                       main
100244                   Run     CPU 1                       procmaker
100245                   Run     CPU 3                       reaper

2. "Sleeping thread owns a non-sleepable lock". This one first drew my
attention by showing up in a real-world application at work.

Sleeping thread (tid 100101, pid 76857) owns a non-sleepable lock
KDB: stack backtrace of thread 100101:
sched_switch() at sched_switch+0x2a5/frame 0xfffffe0000257690
mi_switch() at mi_switch+0xe1/frame 0xfffffe00002576d0
sleepq_catch_signals() at sleepq_catch_signals+0x16c/frame 0xfffffe0000257730
sleepq_timedwait_sig() at sleepq_timedwait_sig+0xf/frame 0xfffffe0000257760
_sleep() at _sleep+0x234/frame 0xfffffe00002577e0
kern_kevent_fp() at kern_kevent_fp+0x38a/frame 0xfffffe00002579d0
kern_kevent() at kern_kevent+0x9f/frame 0xfffffe0000257a30
sys_kevent() at sys_kevent+0x12a/frame 0xfffffe0000257ae0
amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000257bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000257bf0
--- syscall (363, FreeBSD ELF64, sys_kevent), rip = 0x800b6afea, rsp = 0x7fffffffea88, rbp = 0x7fffffffead0 ---
panic: sleeping thread
cpuid = 3
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0000225590
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0000225640
vpanic() at vpanic+0x126/frame 0xfffffe0000225680
panic() at panic+0x43/frame 0xfffffe00002256e0
propagate_priority() at propagate_priority+0x166/frame 0xfffffe0000225710
turnstile_wait() at turnstile_wait+0x282/frame 0xfffffe0000225750
__mtx_lock_sleep() at __mtx_lock_sleep+0x26b/frame 0xfffffe00002257d0
__mtx_lock_flags() at __mtx_lock_flags+0x5e/frame 0xfffffe00002257f0
proc_to_reap() at proc_to_reap+0x46/frame 0xfffffe0000225840
kern_wait6() at kern_wait6+0x202/frame 0xfffffe00002258f0
sys_wait4() at sys_wait4+0x72/frame 0xfffffe0000225ae0
amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000225bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000225bf0
--- syscall (7, FreeBSD ELF64, sys_wait4), rip = 0x800b209ba, rsp = 0x7fffdfdfcf48, rbp = 0x7fffdfdfcf80 ---
KDB: enter: panic
[ thread pid 76857 tid 100225 ]
Stopped at      kdb_enter+0x3e: movq    $0,kdb_why
db> show allchains
chain 1:
  thread 100225 (pid 76857, reaper) blocked on lock 0xfffff800413105f0 (sleep mutex) "process lock"
  thread 100101 (pid 76857, main) inhibited

3./4. There are a few others that I hit less frequently ("page fault
while in kernel mode", "Kernel page fault with the following
non-sleepable locks held"). I don't have a backtrace handy for these.

I believe they all have more or less the same cause. The crashes occur
because we acquire a knlist lock via the KN_LIST_LOCK macro, but by
the time we call KN_LIST_UNLOCK, the knote's knlist reference
(kn->kn_knlist) has been cleared by another thread. Thus we are unable
to unlock the previously acquired lock, and we hold it until something
causes us to crash (such as the witness code noticing that we're
returning to userland with the lock still held).
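
For reference, the two macros look roughly like this (paraphrased from
memory of sys/kern/kern_event.c; the exact text may differ slightly):

#define KN_LIST_LOCK(kn) do {                                         \
        if (kn->kn_knlist != NULL)      /* racy check, see below */   \
                kn->kn_knlist->kl_lock(kn->kn_knlist->kl_lockarg);    \
} while (0)

#define KN_LIST_UNLOCK(kn) do {                                       \
        if (kn->kn_knlist != NULL)      /* NULL now: lock is leaked */\
                kn->kn_knlist->kl_unlock(kn->kn_knlist->kl_lockarg);  \
} while (0)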

A walkthrough of what happens in the test program:

There are 3 threads: thread 1 forks off short-lived child processes,
thread 2 reaps the child processes, and thread 3 tracks the child
processes via a kqueue (NOTE_EXIT | NOTE_EXEC | NOTE_FORK |
NOTE_TRACK). I believe a crash generally looks like this (a condensed
sketch of the program follows the list):

1. The forker thread creates a short-lived child. That child dies and
triggers a NOTE_EXIT event.
2. The kqueue thread is somewhere in kqueue_scan(), probably blocked
at a KN_LIST_LOCK call.
3. The dying process calls into filt_proc() and notices that the
KN_DETACHED flag is not set. It therefore decides to call
knlist_remove_inevent() to take the knote out of the knlist.
Importantly, this sets kn->kn_knlist to NULL, meaning we can no longer
access the knlist lock from the knote.
4. The kqueue thread, still in kqueue_scan(), is able to acquire the
lock via KN_LIST_LOCK. It does some work and then calls the
KN_LIST_UNLOCK macro. This macro checks and finds that the knote does
not have a reference to a knlist, and thus takes no action, leaving
the lock in the locked state.
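
In case the attachment doesn't survive the archive, the test program
boils down to the following (error handling and pacing details
trimmed; build with something like cc -o watch watch.c -lpthread):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <sys/wait.h>

#include <err.h>
#include <pthread.h>
#include <unistd.h>

static void *
procmaker(void *arg __unused)
{
        for (;;) {
                pid_t pid = fork();

                if (pid == 0)
                        _exit(0);       /* short-lived child: NOTE_EXIT */
                if (pid == -1)
                        usleep(1000);   /* out of processes; back off */
        }
}

static void *
reaper(void *arg __unused)
{
        for (;;)
                (void)wait(NULL);       /* races the dying children */
}

int
main(void)
{
        pthread_t t1, t2;
        struct kevent kev;
        int kq;

        if ((kq = kqueue()) == -1)
                err(1, "kqueue");

        /* Watch ourselves; NOTE_TRACK extends the watch to children. */
        EV_SET(&kev, getpid(), EVFILT_PROC, EV_ADD,
            NOTE_EXIT | NOTE_EXEC | NOTE_FORK | NOTE_TRACK, 0, NULL);
        if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
                err(1, "kevent register");

        pthread_create(&t1, NULL, procmaker, NULL);
        pthread_create(&t2, NULL, reaper, NULL);

        /* Thread 3: spin in kqueue_scan() collecting events. */
        for (;;)
                if (kevent(kq, NULL, 0, &kev, 1, NULL) == -1)
                        err(1, "kevent scan");
}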

I believe there's also a small window where the KN_LIST_LOCK macro
checks kn->kn_knlist and finds it to be non-NULL, but by the time it
actually dereferences it, it has become NULL. This would produce the
"page fault while in kernel mode" crash.
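
Spelled out with the macro expanded, the interleaving I have in mind
is (my reading of the code, not an observed trace):

/* Thread A: kqueue_scan()              Thread B: exiting child      */
if (kn->kn_knlist != NULL) {            /* A: sees a live knlist     */
                                        /* B: knlist_remove_inevent()*/
                                        /*    sets kn_knlist = NULL  */
        kn->kn_knlist->kl_lock(         /* A: dereferences NULL      */
            kn->kn_knlist->kl_lockarg); /*    -> kernel page fault   */
}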

If someone familiar with this code sees an obvious fix, I'll be happy
to test it. Otherwise, I'd appreciate any advice on fixing this. My
first thought is that a 'struct knote' ought to have its own mutex for
controlling access to the flag fields and ideally the kn_knlist field.
I.e., you would first acquire a knote's lock and then the knlist lock,
thus ensuring that no one could clear the kn_knlist field while you
hold the knlist lock. The knlist lock, however, usually comes from
whichever event-producing entity the knote tracks, so getting the lock
ordering right between the per-knote mutex and this other lock seems
potentially hard. (Sometimes we call into functions in kern_event.c
with the knlist lock already held, having been acquired in code
outside of kern_event.c. Consider, for example, calling KNOTE_LOCKED
from kern_exit.c: the PROC_LOCK macro has already been used to acquire
the process lock, which also serves as the knlist lock.)
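
Roughly what I have in mind, as a sketch only (the names are invented;
there is no kn_mtx today, and the ordering problem above is exactly
where this breaks down):

struct knote {
        struct mtx       kn_mtx;        /* hypothetical: protects
                                           kn_status and kn_knlist */
        struct knlist   *kn_knlist;
        /* ... existing fields ... */
};

/* kqueue_scan() side: pin the knlist before taking its lock. */
static struct knlist *
kn_list_lock(struct knote *kn)
{
        struct knlist *kl;

        mtx_lock(&kn->kn_mtx);
        kl = kn->kn_knlist;     /* can't be cleared while kn_mtx held */
        if (kl != NULL)
                kl->kl_lock(kl->kl_lockarg);
        mtx_unlock(&kn->kn_mtx);
        return (kl);            /* caller unlocks kl itself, so a later
                                   kn_knlist = NULL is harmless */
}

The catch: filt_proc() is entered with the knlist lock (the process
lock) already held, so there kn_mtx would have to come after the
knlist lock, the reverse of the order above.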

Apropos of the knlist lock and its provenance: why is a lock from the
event-producing entity used to control access to the knlist and knote?
Is it generally desirable to, for example, hold the process lock while
operating on a knlist attached to that process? It's not obvious to me
that this is required or even desirable. This might suggest that a
knlist should have its own lock rather than using a lock from the
event-producing entity, which might make addressing this problem more
straightforward.
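
Concretely: if I read kern_proc.c correctly, the process knlist is
today initialized with knlist_init_mtx(&p->p_klist, &p->p_mtx), i.e.
the proc mutex doubles as the knlist lock. The alternative would be
something like this (invented field name, sketch only):

struct knlist {
        struct mtx      kl_mtx;         /* hypothetical: private,
                                           leaf-ordered lock */
        /* ... existing kl_list, kl_lock, kl_unlock, kl_lockarg ... */
};

/* Event producers would then lock in a fixed order, e.g. kern_exit.c: */
PROC_LOCK(p);
mtx_lock(&p->p_klist.kl_mtx);
KNOTE_LOCKED(&p->p_klist, NOTE_EXIT);
mtx_unlock(&p->p_klist.kl_mtx);
PROC_UNLOCK(p);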

Many thanks for any help, and please let me know if I've failed to
make anything clear.

Cheers,
Eric