Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 20 Jan 2013 12:29:08 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Ryan Stone <rysto32@gmail.com>
Cc:        FreeBSD Current <freebsd-current@freebsd.org>
Subject:   Re: ULE can leak TDQ_LOCK() if statclock() called outside of critical_enter()
Message-ID:  <20130120102908.GC2522@kib.kiev.ua>
In-Reply-To: <CAFMmRNz7AQMRr2%2B-59702k_mr=3=DsFP9c7ejPEBC7d=4w0bTA@mail.gmail.com>
References:  <CAFMmRNz7AQMRr2%2B-59702k_mr=3=DsFP9c7ejPEBC7d=4w0bTA@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help

--9uNR01GrImESOVvg
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 18, 2013 at 11:56:27AM -0500, Ryan Stone wrote:
> I have been experiencing occasional deadlocks on FreeBSD 8.2 systems using
> the ULE scheduler.  The root cause in every case has been that ULE's
> TDQ_LOCK for cpu 0 is owned by a thread that is not running.  I have been
> investigating the issue, and I believe that I see the issue.  The problem
> occurs if the interrupt that drives statclock does not call
> critical_enter() upon calling into statclock().  The lapic timer does use
> critical_enter(), so default configurations would not see this.  I have
> local patches to use the RTC to drive statclock, and from a quick reading
> of the eventtimer code in -CURRENT the same thing is possible there.  The
> RTC code does not call statclock within a critical section.  So here's the
> bug:
>=20
> 1) A thread with interrupts enabled, running on CPU 0, with td_owepreempt=
=3D1
> and td_critnest=3D1 calls critical_exit():
>=20
> void
> critical_exit(void)
> {
>    // ...
>     if (td->td_critnest =3D=3D 1) {
>         td->td_critnest =3D 0;
>         if (td->td_owepreempt && !kdb_active) {
> // Irrelevant bits snipped
>=20
> 2) td_critnest is set to 0, and then the RTC interrupt fires.
>=20
> 3) rtcintr calls into statclock (8.2) or statclock_cnt(head) with
> td_critnest still 0 (on head it goes through the eventtimer code, but it
> ends up in statclock eventually).
>=20
> 4) statclock takes the thread_lock on curthread, then calls sched_clock().
> sched_clock calls sched_balance();
>=20
> static void
> sched_balance(void)
> {
>     // snip...
>     tdq =3D TDQ_SELF();
>     TDQ_UNLOCK(tdq);
>     sched_balance_group(cpu_top);
>     TDQ_LOCK(tdq);
> }
I think that current code deserves at least CRITICAL_ASSERT() on
the entry to sched_balance(). The same assert should be added to
kern_clocksource.c::timercb().

>=20
> TDQ_UNLOCK does a spinlock_exit which does a critical_exit.  td_critnest
> will be decremented back to 0 and td_owepreempt is still 1, so this
> triggers a preemption.  Note that this TDQ_UNLOCK is (intentionally)
> unlocking the thread_lock done by statclock.
>=20
> 5) thread migrates to any other CPU, call it CPU n.  tdq is now stale.
> TDQ_LOCK takes the lock for CPU 0 (but really it's intending to re-take t=
he
> thread_lock, although a thread_lock() here would be equally incorrect --
> sched_balance's caller is going to be mucking around with the TDQ when
> sched_balance returns).
>=20
> 6) The thread returns to statclock.  statclock does a thread_unlock(). The
> td_lock is TDQ_LOCK(n), which we don't hold.  We mangle the stat of
> TDQ_LOCK(n) and leave TDQ_LOCK(0) held.
>=20
>=20
> The simplest solution would be to do a critical_enter() in sched_balance,
> although that would be superfluous in the normal case where the lapic tim=
er
> is driving statclock.  I'm not sure if there's other code in the
> eventtimers infrastructure that's assuming that a preemption or migration
> is impossible while handling an event.  A quick look at kern_clocksource.c
> turns up worrying comments like "Handle all events for specified time on
> this CPU" and uses of curcpu, so there may well be other issues lurking
> here.
>=20
> It looks to me that the safest thing to do would be to push the
> critical_enter() into the eventtimer code or even all the way back to the
> interrupt handlers (mirroring what the lapic code already does).

Both atrtc and hpet register the interrupt handler as the filter.
The filters call loop enters critical section around handlers, see
kern_intr.c:intr_event_handle(). At least on HEAD it is so, and I see
the same code in the 8.

--9uNR01GrImESOVvg
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (FreeBSD)

iQIcBAEBAgAGBQJQ+8bzAAoJEJDCuSvBvK1Bzf0P/i6Pn20zXYAF3J/4pOjfxs8n
IXCy9Oww+1tKBOYnfZoWjvcllJqmGg7xI8rwNWrmJE9ctPRLEkBJK4nEh1fg6ySu
9hxvNwEXofOHGzifHX61OzKkr3on0OeRvru9V8p8Rcd5p97tKaQJK1nLyHGD8df6
WIUKkQJDAjrGImQ4T1JzIvWOA/M60cYGMJEYhXeUeqTJsY0O1nYQAyxv11Nk7day
wtc3cO2AtvVKETyAO1ghqR7Uc42b0+IYbYqc/CnyLCfFn9+ibZMCT+eL/Jfkgeur
62kt6L0qxGLy/fKTlK2cddWFfPyBXOLT6PxYlcpGshGaZD0srHIN+HiJ921Heuqa
bWWd3dxMAsJq/u3z+IUw0+gzUbw/rlLt++zpVQN6bGW6R+FPMyHeaD4ZrWryPmOO
GikeQG4IzkpcQXjTr+V23fvbojHHosTVZocrhS8yKUxWshOGOEZNVntjejD0RwnQ
O9nKdt3H5xPBWTmn4pKHQp212gzNAbA5z+nRMQsDtyMhk15VbqQw+Akdp/I+Px6k
99lKoGiUi0Ft2QECac+nyAGdnQpBxe0h9wrmgyUkXr20f60N9G9hazC6mQqpTOPv
zPvVOtDpBarVIrSs0mx9kbI/zwq2CzEQBO+4ZBSALI07QccrtD9OxGR9elojMUk/
HogeXqujI8vFr6Z/L/mJ
=Orve
-----END PGP SIGNATURE-----

--9uNR01GrImESOVvg--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20130120102908.GC2522>