From owner-freebsd-current@FreeBSD.ORG  Tue May 17 20:35:19 2011
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 92D3D1065672;
	Tue, 17 May 2011 20:35:19 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 3AC588FC0C;
	Tue, 17 May 2011 20:35:19 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id B978A46B2C;
	Tue, 17 May 2011 16:35:18 -0400 (EDT)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id 5703C8A04F;
	Tue, 17 May 2011 16:35:18 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Max Laier <max@love2party.net>
Date: Tue, 17 May 2011 16:35:17 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110325; KDE/4.5.5; amd64; ; )
References: <4DCD357D.6000109@FreeBSD.org> <201105171256.41091.jhb@freebsd.org>
	<4DD2C99D.50203@love2party.net>
In-Reply-To: <4DD2C99D.50203@love2party.net>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <201105171635.17704.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6
	(bigwig.baldwin.cx); Tue, 17 May 2011 16:35:18 -0400 (EDT)
Cc: neel@freebsd.org, Andriy Gapon <avg@freebsd.org>,
	Attilio Rao <attilio@freebsd.org>,
	FreeBSD current <freebsd-current@freebsd.org>,
	Stephan Uphoff <ups@freebsd.org>, Peter Grehan <grehan@freebsd.org>
Subject: Re: proposed smp_rendezvous change
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 17 May 2011 20:35:19 -0000

On Tuesday, May 17, 2011 3:16:45 pm Max Laier wrote:
> On 05/17/2011 09:56 AM, John Baldwin wrote:
> > On Tuesday, May 17, 2011 12:20:40 pm Max Laier wrote:
> >> On 05/17/2011 05:16 AM, John Baldwin wrote:
> >> ...
> >>> Index: kern/kern_switch.c
> >>> ===================================================================
> >>> --- kern/kern_switch.c (revision 221536)
> >>> +++ kern/kern_switch.c (working copy)
> >>> @@ -192,15 +192,22 @@
> >>> critical_exit(void)
> >>> {
> >>> struct thread *td;
> >>> - int flags;
> >>> + int flags, owepreempt;
> >>>
> >>> td = curthread;
> >>> KASSERT(td->td_critnest != 0,
> >>> ("critical_exit: td_critnest == 0"));
> >>>
> >>> if (td->td_critnest == 1) {
> >>> + owepreempt = td->td_owepreempt;
> >>> + td->td_owepreempt = 0;
> >>> + /*
> >>> + * XXX: Should move compiler_memory_barrier() from
> >>> + * rmlock to a header.
> >>> + */
> >>
> >> XXX: If we get an interrupt at this point and td_owepreempt was zero,
> >> the new interrupt will re-set it, because td_critnest is still non-zero.
> >>
> >> So we still end up with a thread that is leaking an owepreempt *and*
> >> lose a preemption.
> >
> > I don't see how this can still leak owepreempt.  The nested interrupt 
should
> > do nothing (except for possibly set owepreempt) until td_critnest is 0.
> 
> Exactly.  The interrupt sets owepreempt and after we return here, we set 
> td_critnest to 0 and exit without clearing owepreempt.  Hence we leak 
> the owepreempt.
> 
> > However, we can certainly lose preemptions.
> >
> > I wonder if we can abuse the high bit of td_critnest for the owepreempt 
flag
> > so it is all stored in one cookie.  We only set owepreempt while holding
> > thread_lock() (so interrupts are disabled), so I think we would be ok and 
not
> > need atomic ops.
> >
> > Hmm, actually, the top-half code would have to use atomic ops.  Nuts.  Let 
me
> > think some more.
> 
> I think these two really belong into one single variable.  Setting the 
> owepreempt flag can be a normal RMW.  Increasing and decreasing critnest 
> must be atomic (otherwise we could lose the flag) and dropping the final 
> reference would work like this:
> 
>    if ((curthread->td_critnest & TD_CRITNEST_MASK) == 1) {
>      unsigned int owe;
>      owe = atomic_readandclear(&curthread->td_critnest);
>      if (owe & TD_OWEPREEMPT_FLAG) {
>        /* do the switch */
>    }
> 
> That should do it ... I can put that into a patch, if we agree that's 
> the right thing to do.

Yeah, I already have a patch to do that, but hadn't added atomic ops to 
critical_enter() and critical_exit().  But it also wasn't as fancy in the 
critical_exit() case.  Here is what I have and I think it might actually
be ok (it doesn't use an atomic read and clear, but I think it is safe).
Hmm, actually, it will need to use the read and clear:

Index: kern/sched_ule.c
===================================================================
--- kern/sched_ule.c	(revision 222024)
+++ kern/sched_ule.c	(working copy)
@@ -1785,7 +1785,6 @@
 	td->td_oncpu = NOCPU;
 	if (!(flags & SW_PREEMPT))
 		td->td_flags &= ~TDF_NEEDRESCHED;
-	td->td_owepreempt = 0;
 	tdq->tdq_switchcnt++;
 	/*
 	 * The lock pointer in an idle thread should never change.  Reset it
@@ -2066,7 +2065,7 @@
 
 		flags = SW_INVOL | SW_PREEMPT;
 		if (td->td_critnest > 1)
-			td->td_owepreempt = 1;
+			td->td_critnest |= TD_OWEPREEMPT;
 		else if (TD_IS_IDLETHREAD(td))
 			mi_switch(flags | SWT_REMOTEWAKEIDLE, NULL);
 		else
@@ -2261,7 +2260,7 @@
 		return;
 	if (!sched_shouldpreempt(pri, cpri, 0))
 		return;
-	ctd->td_owepreempt = 1;
+	ctd->td_critnest |= TD_OWEPREEMPT;
 }
 
 /*
Index: kern/kern_switch.c
===================================================================
--- kern/kern_switch.c	(revision 222024)
+++ kern/kern_switch.c	(working copy)
@@ -183,7 +183,7 @@
 	struct thread *td;
 
 	td = curthread;
-	td->td_critnest++;
+	atomic_add_int(&td->td_critnest, 1);
 	CTR4(KTR_CRITICAL, "critical_enter by thread %p (%ld, %s) to %d", td,
 	    (long)td->td_proc->p_pid, td->td_name, td->td_critnest);
 }
@@ -192,18 +192,16 @@
 critical_exit(void)
 {
 	struct thread *td;
-	int flags;
+	int flags, owepreempt;
 
 	td = curthread;
 	KASSERT(td->td_critnest != 0,
 	    ("critical_exit: td_critnest == 0"));
 
-	if (td->td_critnest == 1) {
-		td->td_critnest = 0;
-		if (td->td_owepreempt) {
-			td->td_critnest = 1;
+	if (td->td_critnest && ~TD_OWEPREEMPT == 1) {
+		owepreempt = atomic_readandclear_int(&td->td_owepreempt);
+		if (owepreempt & TD_OWEPREEMPT) {
 			thread_lock(td);
-			td->td_critnest--;
 			flags = SW_INVOL | SW_PREEMPT;
 			if (TD_IS_IDLETHREAD(td))
 				flags |= SWT_IDLE;
@@ -213,7 +211,7 @@
 			thread_unlock(td);
 		}
 	} else
-		td->td_critnest--;
+		atomic_subtract_int(&td->td_critnest, 1);
 
 	CTR4(KTR_CRITICAL, "critical_exit by thread %p (%ld, %s) to %d", td,
 	    (long)td->td_proc->p_pid, td->td_name, td->td_critnest);
Index: kern/kern_synch.c
===================================================================
--- kern/kern_synch.c	(revision 222024)
+++ kern/kern_synch.c	(working copy)
@@ -400,9 +400,7 @@
 	if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
 		mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-	KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-	    (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-	    newtd == NULL) || panicstr,
+	KASSERT(td->td_critnest == 1 || panicstr,
 	    ("mi_switch: switch in a critical section"));
 	KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
 	    ("mi_switch: switch must be voluntary or involuntary"));
Index: kern/sched_4bsd.c
===================================================================
--- kern/sched_4bsd.c	(revision 222024)
+++ kern/sched_4bsd.c	(working copy)
@@ -335,7 +335,7 @@
 	if (ctd->td_critnest > 1) {
 		CTR1(KTR_PROC, "maybe_preempt: in critical section %d",
 		    ctd->td_critnest);
-		ctd->td_owepreempt = 1;
+		ctd->td_critnest |= TD_OWEPREEMPT;
 		return (0);
 	}
 	/*
@@ -949,7 +949,6 @@
 	td->td_lastcpu = td->td_oncpu;
 	if (!(flags & SW_PREEMPT))
 		td->td_flags &= ~TDF_NEEDRESCHED;
-	td->td_owepreempt = 0;
 	td->td_oncpu = NOCPU;
 
 	/*
@@ -1437,7 +1436,7 @@
 {
 	thread_lock(td);
 	if (td->td_critnest > 1)
-		td->td_owepreempt = 1;
+		td->td_critnest |= TD_OWEPREEMPT;
 	else
 		mi_switch(SW_INVOL | SW_PREEMPT | SWT_PREEMPT, NULL);
 	thread_unlock(td);
Index: kern/kern_rmlock.c
===================================================================
--- kern/kern_rmlock.c	(revision 222024)
+++ kern/kern_rmlock.c	(working copy)
@@ -366,7 +366,8 @@
 	 * Fast path to combine two common conditions into a single
 	 * conditional jump.
 	 */
-	if (0 == (td->td_owepreempt | (rm->rm_writecpus & pc->pc_cpumask)))
+	if (0 == ((td->td_critnest & TD_OWEPREEMPT) |
+	    (rm->rm_writecpus & pc->pc_cpumask)))
 		return (1);
 
 	/* We do not have a read token and need to acquire one. */
@@ -377,7 +378,7 @@
 _rm_unlock_hard(struct thread *td,struct rm_priotracker *tracker)
 {
 
-	if (td->td_owepreempt) {
+	if (td->td_critnest & TD_OWEPREEMPT) {
 		td->td_critnest++;
 		critical_exit();
 	}
@@ -418,7 +419,7 @@
 	td->td_critnest--;
 	sched_unpin();
 
-	if (0 == (td->td_owepreempt | tracker->rmp_flags))
+	if (0 == ((td->td_critnest & TD_OWEPREEMPT) | tracker->rmp_flags))
 		return;
 
 	_rm_unlock_hard(td, tracker);
Index: sys/proc.h
===================================================================
--- sys/proc.h	(revision 222024)
+++ sys/proc.h	(working copy)
@@ -228,7 +228,6 @@
 	const char	*td_wmesg;	/* (t) Reason for sleep. */
 	u_char		td_lastcpu;	/* (t) Last cpu we were on. */
 	u_char		td_oncpu;	/* (t) Which cpu we are on. */
-	volatile u_char td_owepreempt;  /* (k*) Preempt on last critical_exit */
 	u_char		td_tsqueue;	/* (t) Turnstile queue blocked on. */
 	short		td_locks;	/* (k) Count of non-spin locks. */
 	short		td_rw_rlocks;	/* (k) Count of rwlock read locks. */
@@ -465,6 +464,12 @@
 #define	TD_SET_CAN_RUN(td)	(td)->td_state = TDS_CAN_RUN
 
 /*
+ * The high bit of td_critnest is used to determine if a preemption is
+ * pending.
+ */
+#define	TD_OWEPREEMPT		((u_int)INT_MAX + 1)
+
+/*
  * Process structure.
  */
 struct proc {

-- 
John Baldwin