Date: Tue, 14 Jun 2016 22:26:14 -0500 From: Eric Badger <eric@badgerio.us> To: FreeBSD Current <freebsd-current@freebsd.org> Subject: Kqueue races causing crashes Message-ID: <34035bf6-8b3c-d15c-765b-94bcc919ea2e@badgerio.us>
next in thread | raw e-mail | index | archive | help
Hi there, There seems to be some racy code in kern_event.c which is causing me to=20 run into some crashes. I=E2=80=99ve attached the test program used to gen= erate=20 these crashes (build it and run the =E2=80=9Cgo=E2=80=9D script). They we= re produced in=20 a VM with 4 cores on 11 Alpha 3 (and originally 10.3). The crashes I=E2=80= =99ve=20 seen come in a few varieties: 1. =E2=80=9Cuserret: returning with the following locks held=E2=80=9D. Th= is one is the=20 easiest to hit (assuming witness is enabled). userret: returning with the following locks held: exclusive sleep mutex process lock (process lock) r =3D 0=20 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125 panic: witness_warn cpuid =3D 2 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame=20 0xfffffe000039d8e0 vpanic() at vpanic+0x182/frame 0xfffffe000039d960 kassert_panic() at kassert_panic+0x126/frame 0xfffffe000039d9d0 witness_warn() at witness_warn+0x3c6/frame 0xfffffe000039daa0 userret() at userret+0x9d/frame 0xfffffe000039dae0 amd64_syscall() at amd64_syscall+0x406/frame 0xfffffe000039dbf0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe000039dbf0 --- syscall (1, FreeBSD ELF64, sys_sys_exit), rip =3D 0x800b8a0ba, rsp =3D= =20 0x7fffffffea98, rbp =3D 0x7fffffffeae0 --- KDB: enter: panic [ thread pid 64855 tid 100106 ] Stopped at kdb_enter+0x3b: movq $0,kdb_why db> show all locks Process 64855 (watch) thread 0xfffff800066c3000 (100106) exclusive sleep mutex process lock (process lock) r =3D 0=20 (0xfffff80006956120) locked @ /usr/src/sys/kern/kern_event.c:2125 Process 64855 (watch) thread 0xfffff8000696a500 (100244) exclusive sleep mutex pmap (pmap) r =3D 0 (0xfffff800068c3138) locked @=20 /usr/src/sys/amd64/amd64/pmap.c:4067 exclusive sx vm map (user) (vm map (user)) r =3D 0 (0xfffff800068f6080)=20 locked @ /usr/src/sys/vm/vm_map.c:3315 exclusive sx vm map (user) (vm map (user)) r =3D 0 (0xfffff800068c3080)=20 locked @ /usr/src/sys/vm/vm_map.c:3311 db> ps pid ppid pgrp uid state wmesg wchan cmd 64855 690 690 0 R+ (threaded) watch 100106 Run CPU 2 main 100244 Run CPU 1 procmaker 100245 Run CPU 3 reaper 2. =E2=80=9CSleeping thread owns a non-sleepable lock=E2=80=9D. This one= first drew my=20 attention by showing up in a real world application at work. Sleeping thread (tid 100101, pid 76857) owns a non-sleepable lock KDB: stack backtrace of thread 100101: sched_switch() at sched_switch+0x2a5/frame 0xfffffe0000257690 mi_switch() at mi_switch+0xe1/frame 0xfffffe00002576d0 sleepq_catch_signals() at sleepq_catch_signals+0x16c/frame=20 0xfffffe0000257730 sleepq_timedwait_sig() at sleepq_timedwait_sig+0xf/frame 0xfffffe00002577= 60 _sleep() at _sleep+0x234/frame 0xfffffe00002577e0 kern_kevent_fp() at kern_kevent_fp+0x38a/frame 0xfffffe00002579d0 kern_kevent() at kern_kevent+0x9f/frame 0xfffffe0000257a30 sys_kevent() at sys_kevent+0x12a/frame 0xfffffe0000257ae0 amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000257bf0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000257bf0 --- syscall (363, FreeBSD ELF64, sys_kevent), rip =3D 0x800b6afea, rsp =3D= =20 0x7fffffffea88, rbp =3D 0x7fffffffead0 --- panic: sleeping thread cpuid =3D 3 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame=20 0xfffffe0000225590 kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0000225640 vpanic() at vpanic+0x126/frame 0xfffffe0000225680 panic() at panic+0x43/frame 0xfffffe00002256e0 propagate_priority() at propagate_priority+0x166/frame 0xfffffe0000225710 turnstile_wait() at turnstile_wait+0x282/frame 0xfffffe0000225750 __mtx_lock_sleep() at __mtx_lock_sleep+0x26b/frame 0xfffffe00002257d0 __mtx_lock_flags() at __mtx_lock_flags+0x5e/frame 0xfffffe00002257f0 proc_to_reap() at proc_to_reap+0x46/frame 0xfffffe0000225840 kern_wait6() at kern_wait6+0x202/frame 0xfffffe00002258f0 sys_wait4() at sys_wait4+0x72/frame 0xfffffe0000225ae0 amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe0000225bf0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe0000225bf0 --- syscall (7, FreeBSD ELF64, sys_wait4), rip =3D 0x800b209ba, rsp =3D=20 0x7fffdfdfcf48, rbp =3D 0x7fffdfdfcf80 --- KDB: enter: panic [ thread pid 76857 tid 100225 ] Stopped at kdb_enter+0x3e: movq $0,kdb_why db> show allchains chain 1: thread 100225 (pid 76857, reaper) blocked on lock 0xfffff800413105f0=20 (sleep mutex) "process lock" thread 100101 (pid 76857, main) inhibited (3./4.) There are a few others that I hit less frequently (=E2=80=9Cpage = fault=20 while in kernel mode=E2=80=9D, "Kernel page fault with the following=20 non-sleepable locks held=E2=80=9D. I don=E2=80=99t have a backtrace handy= for these. I believe they all have more or less the same cause. The crashes occur=20 because we acquire a knlist lock via the KN_LIST_LOCK macro, but when we=20 call KN_LIST_UNLOCK, the knote=E2=80=99s knlist reference (kn->kn_knlist)= has=20 been cleared by another thread. Thus we are unable to unlock the=20 previously acquired lock and hold it until something causes us to crash=20 (such as the witness code noticing that we=E2=80=99re returning to userla= nd with=20 the lock still held). A walkthrough of what happens in the test program: There are 3 threads: 1 forks off short-lived child processes, 2 reaps=20 the child processes, and 3 tracks the child processes via a kqueue=20 (NOTE_EXIT | NOTE_EXEC | NOTE_FORK | NOTE_TRACK). I believe a crash=20 generally looks like this: 1. Forker thread creates a short lived child. That child dies and=20 triggers a NOTE_EXIT event. 2. Kqueue thread is somewhere in kqueue_scan(), probably blocked at a=20 KN_LIST_LOCK call. 3. The dying process calls into filt_proc() and notices that the=20 KN_DETACHED flag is not set. It therefore decides to call=20 knlist_remove_inevent() to take the knote out of the knlist.=20 Importantly, this sets kn->kn_knlist to NULL, meaning we can no longer=20 access the knlist lock from the knote. 4. Kqueue thread, still in kqueue_scan(), is able to acquire the lock=20 via KN_LIST_LOCK. It does some work and then calls the KN_LIST_UNLOCK=20 macro. This macro checks and finds that the knote does not have a=20 reference to a knlist, and thus takes no action, leaving the lock in the=20 locked state. I believe there=E2=80=99s also a small window where the KN_LIST_LOCK macr= o=20 checks kn->kn_knlist and finds it to be non-NULL, but by the time it=20 actually dereferences it, it has become NULL. This would produce the=20 =E2=80=9Cpage fault while in kernel mode=E2=80=9D crash. If someone familiar with this code sees an obvious fix, I=E2=80=99ll be h= appy to=20 test it. Otherwise, I=E2=80=99d appreciate any advice on fixing this. My = first=20 thought is that a =E2=80=98struct knote=E2=80=99 ought to have its own mu= tex for=20 controlling access to the flag fields and ideally the =E2=80=9Ckn_knlist=E2= =80=9D field.=20 I.e., you would first acquire a knote=E2=80=99s lock and then the knlist = lock,=20 thus ensuring that no one could clear the kn_knlist variable while you=20 hold the knlist lock. The knlist lock, however, usually comes from=20 whichever event producing entity the knote tracks, so getting lock=20 ordering right between the per-knote mutex and this other lock seems=20 potentially hard. (Sometimes we call into functions in kern_event.c with=20 the knlist lock already held, having been acquired in code outside of=20 kern_event.c. Consider, for example, calling KNOTE_LOCKED from=20 kern_exit.c; the PROC_LOCK macro has already been used to acquire the=20 process lock, also serving as the knlist lock). Apropos of the knlist lock and its provenance: why is a lock from the=20 event producing entity used to control access to the knlist and knote?=20 Is it generally desirable to, for example, hold the process lock while=20 operating on a knlist attached to that process? It=E2=80=99s not obvious = to me=20 that this is required or even desirable. This might suggest that a=20 knlist should have its own lock rather than using a lock from the event=20 producing entity, which might make addressing this problem more=20 straightforward. Many thanks for any help, and please let me know if I=E2=80=99ve failed t= o make=20 anything clear. Cheers, Eric
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?34035bf6-8b3c-d15c-765b-94bcc919ea2e>
