From: John Baldwin <jhb@FreeBSD.org>
Date: Tue, 17 May 2011 08:16:32 -0400
To: Max Laier
Cc: neel@freebsd.org, Andriy Gapon, Attilio Rao, FreeBSD current,
    Stephan Uphoff, Peter Grehan
Subject: Re: proposed smp_rendezvous change
Message-ID: <4DD26720.3000001@FreeBSD.org>
In-Reply-To: <201105161805.51188.max@love2party.net>

On 5/16/11 6:05 PM, Max Laier wrote:
> On Monday 16 May 2011 17:54:54 Attilio Rao wrote:
>> 2011/5/16 Max Laier:
>>> On Monday 16 May 2011 16:46:03 John Baldwin wrote:
>>>> On Monday, May 16, 2011 4:30:44 pm Max Laier wrote:
>>>>> On Monday 16 May 2011 14:21:27 John Baldwin wrote:
>>>>>> Yes, we need to fix that.  Humm, it doesn't preempt when you do a
>>>>>> critical_exit() though?  Or do you use a hand-rolled critical exit
>>>>>> that doesn't do a deferred preemption?
>>>>>
>>>>> Right now I just did a manual td_critnest++/--, but I guess ...
>>>>
>>>> Ah, ok, so you would "lose" a preemption.  That's not really ideal.
>>>>
>>>>>> Actually, I'm curious how the spin unlock inside the IPI could yield
>>>>>> the CPU.  Oh, is rmlock doing a wakeup inside the IPI handler?  I
>>>>>> guess that is ok as long as the critical_exit() just defers the
>>>>>> preemption to the end of the IPI handler.
>>>>>
>>>>> ... the earliest point where it is safe to preempt is after doing the
>>>>>
>>>>>         atomic_add_int(&smp_rv_waiters[2], 1);
>>>>>
>>>>> so that we can start other IPIs again.  However, since we don't accept
>>>>> new IPIs until we signal EOI in the MD code (on amd64), this might
>>>>> still not be a good place to do the yield?!?
>>>>
>>>> Hmm, yeah, you would want to do the EOI before you yield.  However, we
>>>> could actually move the EOI up before calling the MI code so long as we
>>>> leave interrupts disabled for the duration of the handler (which we do).
>>>>
>>>>> The spin unlock boils down to a critical_exit() and unless we did a
>>>>> critical_enter() at some point during the rendezvous setup, we will
>>>>> yield() if we owepreempt.
>>>>> I'm not quite sure how that can happen, but it seems like there is a
>>>>> path that allows the scheduler to set it from a foreign CPU.
>>>>
>>>> No, it is only set on curthread by curthread.  This is actually my main
>>>> question.  I've no idea how this could happen unless the rmlock code is
>>>> actually triggering a wakeup or sched_add() in its rendezvous handler.
>>>>
>>>> I don't see anything in rm_cleanIPI() that would do that however.
>>>>
>>>> I wonder if your original issue was really fixed just by the first
>>>> patch you had which fixed the race in smp_rendezvous()?
>>>
>>> I found the stack that led me to this patch in the first place:
>>>
>>> #0  sched_switch (td=0xffffff011a970000, newtd=0xffffff006e6784b0,
>>>     flags=4) at src/sys/kern/sched_ule.c:1939
>>> #1  0xffffffff80285c7f in mi_switch (flags=6, newtd=0x0) at
>>>     src/sys/kern/kern_synch.c:475
>>> #2  0xffffffff802a2cb3 in critical_exit () at
>>>     src/sys/kern/kern_switch.c:185
>>> #3  0xffffffff80465807 in spinlock_exit () at
>>>     src/sys/amd64/amd64/machdep.c:1458
>>> #4  0xffffffff8027adea in rm_cleanIPI (arg=) at
>>>     src/sys/kern/kern_rmlock.c:180
>>> #5  0xffffffff802b9887 in smp_rendezvous_action () at
>>>     src/sys/kern/subr_smp.c:402
>>> #6  0xffffffff8045e2a4 in Xrendezvous () at
>>>     src/sys/amd64/amd64/apic_vector.S:235
>>> #7  0xffffffff802a2c6e in critical_exit () at
>>>     src/sys/kern/kern_switch.c:179
>>> #8  0xffffffff804365ba in uma_zfree_arg (zone=0xffffff009ff4b5a0,
>>>     item=0xffffff000f34cd40, udata=0xffffff000f34ce08) at
>>>     src/sys/vm/uma_core.c:2370
>>> .
>>> .
>>> .
>>>
>>> and now that I look at it again, it is clear that critical_exit() just
>>> isn't interrupt safe.  I'm not sure how to fix that yet ... but this:
>>>
>>>         if (td->td_critnest == 1) {
>>>                 td->td_critnest = 0;
>>>                 if (td->td_owepreempt) {
>>>                         td->td_critnest = 1;
>>>
>>> clearly doesn't work.
>>
>> I'm sorry if I didn't reply to the whole rendezvous thread, but as
>> long as there are so many people taking care of it, I'll stay hidden in
>> my corner.
>>
>> I just wanted to say that I think you are misunderstanding what a
>> critical section is supposed to do.
>>
>> When an interrupt fires, it runs on the old "interrupt/kernel context",
>> which means it has no context of its own.  That is the reason why we
>> disable interrupts on spinlocks (or use a similar workaround for !x86
>> architectures), and this is why spinlocks are the only protection
>> usable in code that runs in interrupt context.
>>
>> Preempting just means another thread will be scheduled in the middle
>> of another thread's execution path.
>>
>> This code is perfectly fine if you consider that curthread won't be
>> descheduled while it is executing.
>
> Well, no - it is not.  With this you can end up with a curthread that has
> td_critnest=0 and td_owepreempt=1 in interrupt context.  If you use a
> spinlock on such a thread, it will do the preemption at the point where
> you drop the spinlock, which is bad in some circumstances.  One example
> is the smp_rendezvous case we are discussing here.

Yes, the 'owepreempt' flag can leak because we allow it to be set while
td_critnest is 0.  That is the problem, and I think we added this as an
optimization in the last round of changes to this routine to avoid
excessive context switches.  However, I think we will just have to revert
that optimization and prevent that state from occurring.  Hmm, the log
message for the change said it was to avoid races.
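To spell out the window, here is a simplified sketch of the existing
critical_exit() logic (illustration only, not the proposed change; the exact
flags passed to mi_switch() in the real code are simplified here):

        /*
         * Sketch of the current critical_exit() ordering: td_critnest is
         * dropped to 0 before td_owepreempt is examined.
         */
        if (td->td_critnest == 1) {
                td->td_critnest = 0;
                /*
                 * An interrupt taken here runs with td_critnest == 0 while
                 * td_owepreempt may still be 1.  If the handler acquires and
                 * releases a spin lock, spinlock_exit() -> critical_exit()
                 * sees td_owepreempt set and calls mi_switch() from interrupt
                 * context -- exactly the rm_cleanIPI backtrace above.
                 */
                if (td->td_owepreempt) {
                        td->td_critnest = 1;
                        thread_lock(td);
                        td->td_critnest--;
                        mi_switch(SW_INVOL | SW_PREEMPT, NULL);
                        thread_unlock(td);
                }
        } else
                td->td_critnest--;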
http://svnweb.freebsd.org/base?view=revision&revision=144777 is the change
that introduced the bug of mixing those two states.

Hmmm, the reason for moving the td_critnest = 0 assignment up, according to
my old e-mails, was to avoid a problem with losing preemptions.  I think the
best way to fix this is to clear owepreempt earlier, while td_critnest is
still 1, and cache it locally in critical_exit().  So something like this:

Index: kern/sched_ule.c
===================================================================
--- kern/sched_ule.c    (revision 221536)
+++ kern/sched_ule.c    (working copy)
@@ -1785,7 +1785,7 @@
         td->td_oncpu = NOCPU;
         if (!(flags & SW_PREEMPT))
                 td->td_flags &= ~TDF_NEEDRESCHED;
-        td->td_owepreempt = 0;
+        MPASS(td->td_owepreempt == 0);
         tdq->tdq_switchcnt++;
         /*
          * The lock pointer in an idle thread should never change.  Reset it
Index: kern/kern_switch.c
===================================================================
--- kern/kern_switch.c  (revision 221536)
+++ kern/kern_switch.c  (working copy)
@@ -192,15 +192,22 @@
 critical_exit(void)
 {
         struct thread *td;
-        int flags;
+        int flags, owepreempt;
 
         td = curthread;
         KASSERT(td->td_critnest != 0,
             ("critical_exit: td_critnest == 0"));
 
         if (td->td_critnest == 1) {
+                owepreempt = td->td_owepreempt;
+                td->td_owepreempt = 0;
+                /*
+                 * XXX: Should move compiler_memory_barrier() from
+                 * rmlock to a header.
+                 */
+                __asm __volatile("":::"memory");
                 td->td_critnest = 0;
-                if (td->td_owepreempt) {
+                if (owepreempt) {
                         td->td_critnest = 1;
                         thread_lock(td);
                         td->td_critnest--;
Index: kern/kern_synch.c
===================================================================
--- kern/kern_synch.c   (revision 221536)
+++ kern/kern_synch.c   (working copy)
@@ -400,9 +400,7 @@
         if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
                 mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-        KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-            (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-            newtd == NULL) || panicstr,
+        KASSERT(td->td_critnest == 1 || panicstr,
             ("mi_switch: switch in a critical section"));
         KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
             ("mi_switch: switch must be voluntary or involuntary"));
Index: kern/sched_4bsd.c
===================================================================
--- kern/sched_4bsd.c   (revision 221536)
+++ kern/sched_4bsd.c   (working copy)
@@ -947,7 +947,7 @@
         td->td_lastcpu = td->td_oncpu;
         if (!(flags & SW_PREEMPT))
                 td->td_flags &= ~TDF_NEEDRESCHED;
-        td->td_owepreempt = 0;
+        MPASS(td->td_owepreempt == 0);
         td->td_oncpu = NOCPU;
         /*

Hmm, I'm also not sure if I really need the compiler barrier since
td_owepreempt and td_critnest are marked volatile?

-- 
John Baldwin
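P.S.: the inline asm statement in the patch above is the same construct as
the compiler_memory_barrier() mentioned in the XXX comment; a sketch of what
hoisting it into a header could look like (name and placement are only
illustrative):

        /*
         * Pure compiler barrier: it emits no instruction, it only prevents
         * the compiler from moving memory accesses across it.  Note that
         * the compiler already may not reorder the two volatile stores to
         * td_owepreempt and td_critnest relative to each other, so the
         * explicit barrier mostly documents the intended ordering (and also
         * constrains surrounding non-volatile accesses).
         */
        #define compiler_memory_barrier()   __asm __volatile("" : : : "memory")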