Date:      Fri, 30 Jul 2010 10:31:34 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-hackers@freebsd.org
Cc:        mdf@freebsd.org
Subject:   Re: sched_pin() versus PCPU_GET
Message-ID:  <201007301031.34266.jhb@freebsd.org>
In-Reply-To: <201007301008.22501.jhb@freebsd.org>
References:  <AANLkTikY20TxyeyqO5zP3zC-azb748kV-MdevPfm%2B8cq@mail.gmail.com> <201007301008.22501.jhb@freebsd.org>

On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
> On Thursday, July 29, 2010 7:39:02 pm mdf@freebsd.org wrote:
> > We've seen a few instances at work where witness_warn() in ast()
> > indicates the sched lock is still held, but the place it claims it was
> > held by is in fact sometimes not possible to keep the lock, like:
> >
> > 	thread_lock(td);
> > 	td->td_flags &= ~TDF_SELECT;
> > 	thread_unlock(td);
> >
> > What I was wondering is, even though the assembly I see in objdump -S
> > for witness_warn has the increment of td_pinned before the PCPU_GET:
> >
> > ffffffff802db210:	65 48 8b 1c 25 00 00 	mov    %gs:0x0,%rbx
> > ffffffff802db217:	00 00
> > ffffffff802db219:	ff 83 04 01 00 00    	incl   0x104(%rbx)
> > 	 * Pin the thread in order to avoid problems with thread migration.
> > 	 * Once that all verifies are passed about spinlocks ownership,
> > 	 * the thread is in a safe path and it can be unpinned.
> > 	 */
> > 	sched_pin();
> > 	lock_list = PCPU_GET(spinlocks);
> > ffffffff802db21f:	65 48 8b 04 25 48 00 	mov    %gs:0x48,%rax
> > ffffffff802db226:	00 00
> > 	if (lock_list != NULL && lock_list->ll_count != 0) {
> > ffffffff802db228:	48 85 c0             	test   %rax,%rax
> > 	 * Pin the thread in order to avoid problems with thread migration.
> > 	 * Once that all verifies are passed about spinlocks ownership,
> > 	 * the thread is in a safe path and it can be unpinned.
> > 	 */
> > 	sched_pin();
> > 	lock_list = PCPU_GET(spinlocks);
> > ffffffff802db22b:	48 89 85 f0 fe ff ff 	mov    %rax,-0x110(%rbp)
> > ffffffff802db232:	48 89 85 f8 fe ff ff 	mov    %rax,-0x108(%rbp)
> > 	if (lock_list != NULL && lock_list->ll_count != 0) {
> > ffffffff802db239:	0f 84 ff 00 00 00    	je     ffffffff802db33e <witness_warn+0x30e>
> > ffffffff802db23f:	44 8b 60 50          	mov    0x50(%rax),%r12d
> >
> > is it possible for the hardware to do any re-ordering here?
> >
> > The reason I'm suspicious is not just that the code doesn't have a
> > lock leak at the indicated point, but in one instance I can see in the
> > dump that the lock_list local from witness_warn is from the pcpu
> > structure for CPU 0 (and I was warned about sched lock 0), but the
> > thread id in panic_cpu is 2.  So clearly the thread was being migrated
> > right around panic time.
> >
> > This is the amd64 kernel on stable/7.  I'm not sure exactly what kind
> > of hardware; it's a 4-way Intel chip from about 3 or 4 years ago IIRC.
> >
> > So... do we need some kind of barrier in the code for sched_pin() for
> > it to really do what it claims?  Could the hardware have re-ordered
> > the "mov    %gs:0x48,%rax" PCPU_GET to before the sched_pin()
> > increment?
>
> Hmmm, I think it might be able to because they refer to different locations.
>
> Note this rule in section 8.2.2 of Volume 3A:
>
>   • Reads may be reordered with older writes to different locations but not
>     with older writes to the same location.
>
> It is certainly true that sparc64 could reorder with RMO.  I believe ia64
> could reorder as well.  Since sched_pin/unpin are frequently used to provide
> this sort of synchronization, we could use memory barriers in pin/unpin
> like so:
>
> sched_pin()
> {
> 	td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
> }
>
> sched_unpin()
> {
> 	atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
> }
>
> We could also just use atomic_add_acq_int() and atomic_sub_rel_int(), but they
> are slightly more heavyweight, though it would be more clear what is happening
> I think.
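
For illustration, here is a rough userland analogue of that second variant,
written with C11 <stdatomic.h> instead of the kernel's atomic(9) routines.
The struct below is a hypothetical stand-in that only carries the pin count,
and the acquire/release orderings are my assumption about what
atomic_add_acq_int() and atomic_sub_rel_int() would be asked to provide here;
it is a sketch, not the actual sched_pin()/sched_unpin() code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's struct thread: pin count only. */
struct thread {
	atomic_int td_pinned;
};

/*
 * Pin: acquire ordering means later loads (e.g. a per-CPU read) may not be
 * reordered before the load half of this read-modify-write.
 */
static inline void
sched_pin(struct thread *td)
{
	atomic_fetch_add_explicit(&td->td_pinned, 1, memory_order_acquire);
}

/*
 * Unpin: release ordering means earlier loads and stores may not be
 * reordered after the store half of this read-modify-write.
 */
static inline void
sched_unpin(struct thread *td)
{
	int old;

	old = atomic_fetch_sub_explicit(&td->td_pinned, 1,
	    memory_order_release);
	assert(old > 0);	/* unpin without a matching pin is a bug */
}
```

On x86 both operations would be expected to compile to a locked instruction,
which is where the "slightly more heavyweight" cost comes from.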

However, to actually get a race you'd have to have an interrupt fire and
migrate you so that the speculative read was from the other CPU.  I don't
think the speculative read would be preserved in that case, though.  The CPU
has to return to a specific PC when it returns from the interrupt, and it has
no way of storing the state for whatever speculative reordering it might be
doing, so presumably that state is thrown away.  I suppose it is possible
that it actually retires both instructions (but reordered) and then returns
to the PC value after the read of lock_list after the interrupt.  However, in
that case the scheduler would not migrate the thread, as it would see
td_pinned != 0.  To get the race, the interrupt has to take effect prior to
modifying td_pinned, so I think the processor would have to discard the
reordered read of lock_list so it could safely resume execution at the 'incl'
instruction.
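
To make the cases above concrete, here is a toy single-threaded model in
plain C.  Everything in it (the toy_* names, the two-CPU array, the boolean
knobs) is hypothetical and only mimics the argument; it is not kernel code
and says nothing about what real hardware does.

```c
#include <stdbool.h>

#define TOY_NCPU 2

/* Each toy per-CPU slot just records which CPU it belongs to. */
static const int toy_pcpu[TOY_NCPU] = { 0, 1 };

struct toy_thread {
	int cpu;	/* CPU the thread is currently running on */
	int td_pinned;	/* pin count, as in the real struct thread */
};

/* Scheduler side: only an unpinned thread may be moved. */
static bool
toy_migrate(struct toy_thread *td, int newcpu)
{
	if (td->td_pinned != 0)
		return false;
	td->cpu = newcpu;
	return true;
}

/*
 * Thread side: pin, then read the per-CPU slot.
 *   irq_before_pin:   an interrupt migrates the thread before the increment.
 *   keep_speculative: pretend a load issued ahead of the increment survives
 *                     the interrupt (the case argued to be impossible).
 * Returns true if the value read belongs to the CPU the thread is pinned to.
 */
static bool
toy_pin_and_read(struct toy_thread *td, bool irq_before_pin,
    bool keep_speculative)
{
	int early, seen;
	bool moved_while_pinned;

	early = toy_pcpu[td->cpu];	/* load executed ahead of the increment */
	if (irq_before_pin)
		toy_migrate(td, (td->cpu + 1) % TOY_NCPU);
	td->td_pinned++;
	/* An interrupt after the increment: migration must be refused. */
	moved_while_pinned = toy_migrate(td, (td->cpu + 1) % TOY_NCPU);
	seen = keep_speculative ? early : toy_pcpu[td->cpu];
	td->td_pinned--;
	return !moved_while_pinned && seen == td->cpu;
}
```

In this model the only failing combination is an interrupt before the
increment together with a preserved early load, which is exactly the
combination the paragraph above argues the processor has to discard.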

The other nit, on x86 at least, is that the incl instruction does both a
read and a write, and another rule in section 8.2.2 is this:

  • Reads are not reordered with other reads.

That would seem to prevent the read of lock_list from passing the read of
td_pinned in the incl instruction on x86.

-- 
John Baldwin


