From owner-freebsd-bugs@freebsd.org Tue Feb 28 00:09:24 2017 Return-Path: Delivered-To: freebsd-bugs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CCC42CF0EDD for ; Tue, 28 Feb 2017 00:09:24 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id A40BDD6A for ; Tue, 28 Feb 2017 00:09:24 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v1S09OFX054763 for ; Tue, 28 Feb 2017 00:09:24 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-bugs@FreeBSD.org Subject: [Bug 217408] Crash on boot due to race between vt_upgrade() and vt_resume_flush_timer() Date: Tue, 28 Feb 2017 00:09:24 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: new X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: CURRENT X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: jtl@freebsd.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_id short_desc product version rep_platform op_sys bug_status bug_severity priority component assigned_to reporter attachments.created Message-ID: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-bugs@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Bug reports List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Feb 2017 00:09:24 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D217408 Bug ID: 217408 Summary: Crash on boot due to race between vt_upgrade() and vt_resume_flush_timer() Product: Base System Version: CURRENT Hardware: amd64 OS: Any Status: New Severity: Affects Some People Priority: --- Component: kern Assignee: freebsd-bugs@FreeBSD.org Reporter: jtl@freebsd.org Created attachment 180352 --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=3D180352&action= =3Dedit This patch makes it more likely we will hit the race condition on bootup. It can be used for seeing the bug and testing fixes. It appears this is more fallout in the vt driver from EARLY_AP_STARTUP allo= wing SMP prior to vt_upgrade() running. Crash at boot: Fatal trap 12: page fault while in kernel mode cpuid =3D 0; apic id =3D 00 fault virtual address =3D 0x0 fault code =3D supervisor read instruction, page not prese= nt instruction pointer =3D 0x20:0x0 stack pointer =3D 0x28:0xfffffe3ec2392908 frame pointer =3D 0x28:0xfffffe3ec23929b0 code segment =3D base 0x0, limit 0xfffff, type 0x1b =3D DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags =3D interrupt enabled, resume, IOPL =3D 0 current process =3D 12 (swi4: clock (0)) [ thread pid 12 tid 100028 ] Stopped at 0 db> bt Tracing pid 12 tid 100028 td 0xfffff8018e3a5a00 ??() at 0 softclock_call_cc() at softclock_call_cc+0x14a/frame 0xfffffe3ec23929b0 softclock() at softclock+0x94/frame 0xfffffe3ec23929e0 intr_event_execute_handlers() at intr_event_execute_handlers+0x20f/frame 0xfffffe3ec2392a20 ithread_loop() at ithread_loop+0xc6/frame 0xfffffe3ec2392a70 fork_exit() at fork_exit+0x85/frame 0xfffffe3ec2392ab0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe3ec2392ab0 --- trap 0, rip =3D 0, rsp =3D 0, rbp =3D 0 --- db> show registers cs 0x20 ds 0x3b es 0x3b fs 0x13 gs 0x28 ss 0x28 rax 0x4 rcx 0x1 rdx 0 rbx 0xfffff8018e3a5a00 rsp 0xfffffe3ec2392908 rbp 0xfffffe3ec23929b0 rsi 0xfffff8018e3a5a00 rdi 0 r8 0xffffffff8029baa0 cam_iosched_ticker r9 0x40002 r10 0xffffffff80d637f8 vt_consdev+0xd8 r11 0x1e r12 0xffffffff8102d7b8 cc_cpu+0x78 r13 0x84 rbp 0xfffffe3ec23929b0 rsi 0xfffff8018e3a5a00 rdi 0 r8 0xffffffff8029baa0 cam_iosched_ticker r9 0x40002 r10 0xffffffff80d637f8 vt_consdev+0xd8 r11 0x1e r12 0xffffffff8102d7b8 cc_cpu+0x78 r13 0x84 r14 0 r15 0xffffffff8102d740 cc_cpu rip 0 rflags 0x10246 _binary_t4fw_cfg_uwire_fw_size+0xae81 It is crashing on the line in softclock_call_cc that calls the function. It= is 0 in this case. The current callout struct is stored at -0x40(%rbp). Examining that locatio= n, we see that this is stored in the vt console driver. (Thanks, DDB!) db> x/gx 0xfffffe3ec2392970 0xfffffe3ec2392970: ffffffff80d63828 db> x/gx 0xffffffff80d63828,8 vt_consdev+0x108: 0 ffffffff8102d818 vt_consdev+0x118: 1f62b6ace8 a3d709 vt_consdev+0x128: 0 0 vt_consdev+0x138: ffffffff80d637f8 800002 db>=20 I suspect we are seeing a race between callout_schedule and callout_reset_o= n. CPU1: Running vt_upgrade() CPU2: Trying to print to the VT device CPU1 (running vt_upgrade()) is running this code: 2613 if (!(vd->vd_flags & VDF_ASYNC)) { 2614 /* Attach keyboard. */ 2615 vt_allocate_keyboard(vd); 2616 2617 /* Init 25 Hz timer. */ 2618 callout_init_mtx(&vd->vd_timer, &vd->vd_lock, 0); 2619 2620 /* Start timer when everything ready. */ 2621 vd->vd_flags |=3D VDF_ASYNC; 2622 callout_reset(&vd->vd_timer, hz / VT_TIMERFREQ, vt_timer, vd); 2623 vd->vd_timer_armed =3D 1; 2624 register_handlers =3D 1; 2625 } In particular, I suspect it is running line 2622 when CPU2 tries to print a '\n'. CPU2 follows this callchain: {trimmed} -> termcn_cnputc -> teken_input -> teken_input_byte -> teken_input_char -> teken_subr_newline -> teken_funcs_cursor -> vt_resume_flush_timer vt_resume_flush_timer is pretty simple (but you also need to see vt_schedule_flush() to fully understand the code: 276static void 277vt_schedule_flush(struct vt_device *vd, int ms) 278{ 279 280 if (ms <=3D 0) 281 /* Default to initial value. */ 282 ms =3D 1000 / VT_TIMERFREQ; 283 284 callout_schedule(&vd->vd_timer, hz / (1000 / ms)); 285} 286 287void 288vt_resume_flush_timer(struct vt_device *vd, int ms) 289{ 290 291 if (!(vd->vd_flags & VDF_ASYNC) || 292 !atomic_cmpset_int(&vd->vd_timer_armed, 0, 1)) 293 return; 294 295 vt_schedule_flush(vd, ms); 296} Note that this thread does not hold the vd lock at this point. (And, it probably can't hold the vd lock, since the vd lock is a normal mutex and termcn_cnputc() may hold a spin lock.) However, also note that the code told the callout subsystem that this callout structure was protected by the vd l= ock. Yet, this code path touches the callout structure without holding the vd lo= ck. That means the callout subsystem will make erroneous assumptions about how = to handle callout events. (This is a larger problem. And, it is difficult to f= ix because of the spin lock issue.) One more relevant piece of code: 1170int 1171callout_schedule(struct callout *c, int to_ticks) 1172{ 1173 return callout_reset_on(c, to_ticks, c->c_func, c->c_arg, c->c_cpu); 1174} So, I suspect that what is happening is: CPU1: Sets (vd->vd_flags |=3D VDF_ASYNC) and calls callout_reset(). CPU2: Checks (vd->vd_flags & VDF_ASYNC) and finds it is set (line 291). CPU2: Does atomic_cmpset_int(&vd->vd_timer_armed, 0, 1) and finds it succee= ds (line 292). CPU2: Runs callout_schedule. callout_schedule pulls out the current c->c_fu= nc and c->c_arg. CPU2: Calls callout_reset_on with the c->c_func and c->c_arg it just acquir= ed. (Because CPU1 has not initialized these yet, they are NULL.) CPU1: callout_reset sets c->c_func and c->c_arg to the correct values and completes. CPU2: callout_reset_on overwrites c->c_func and c->c_arg with the NULL valu= es it acquired. Eventually, the callout fires and things blow up. The easy fix for this problem is to reorder the way the variables are assig= ned in vt_upgrade(). However, this will not fix the larger error in using the callout system APIs with respect to locking. I have a reliable replication of this where I lengthen the window to hit the race condition. (I'll attach the reproduction patch to this bug.) It seems = to reliably produce a crash without a fix. With a patch to vt_upgrade(), the c= rash goes away. I've opened review D9828 with the fix. --=20 You are receiving this mail because: You are the assignee for the bug.=