From: John Baldwin <jhb@FreeBSD.org>
Date: Tue, 17 May 2011 08:16:32 -0400
To: Max Laier
Cc: neel@freebsd.org, Andriy Gapon, Attilio Rao, FreeBSD current,
    Stephan Uphoff, Peter Grehan
Subject: Re: proposed smp_rendezvous change
Message-ID: <4DD26720.3000001@FreeBSD.org>
In-Reply-To: <201105161805.51188.max@love2party.net>

On 5/16/11 6:05 PM, Max Laier wrote:
> On Monday 16 May 2011 17:54:54 Attilio Rao wrote:
>> 2011/5/16 Max Laier:
>>> On Monday 16 May 2011 16:46:03 John Baldwin wrote:
>>>> On Monday, May 16, 2011 4:30:44 pm Max Laier wrote:
>>>>> On Monday 16 May 2011 14:21:27 John Baldwin wrote:
>>>>>> Yes, we need to fix that.  Humm, it doesn't preempt when you do a
>>>>>> critical_exit() though?  Or do you use a hand-rolled critical exit
>>>>>> that doesn't do a deferred preemption?
>>>>>
>>>>> Right now I just did a manual td_critnest++/--, but I guess ...
>>>>
>>>> Ah, ok, so you would "lose" a preemption.  That's not really ideal.
>>>>
>>>>>> Actually, I'm curious how the spin unlock inside the IPI could yield
>>>>>> the CPU.  Oh, is rmlock doing a wakeup inside the IPI handler?  I
>>>>>> guess that is ok as long as the critical_exit() just defers the
>>>>>> preemption to the end of the IPI handler.
>>>>>
>>>>> ... the earliest point where it is safe to preempt is after doing the
>>>>>
>>>>>         atomic_add_int(&smp_rv_waiters[2], 1);
>>>>>
>>>>> so that we can start other IPIs again.  However, since we don't accept
>>>>> new IPIs until we signal EOI in the MD code (on amd64), this might
>>>>> still not be a good place to do the yield?!?
>>>>
>>>> Hmm, yeah, you would want to do the EOI before you yield.  However, we
>>>> could actually move the EOI up before calling the MI code so long as we
>>>> leave interrupts disabled for the duration of the handler (which we do).
>>>>
>>>>> The spin unlock boils down to a critical_exit() and unless we did a
>>>>> critical_enter() at some point during the rendezvous setup, we will
>>>>> yield() if we owepreempt.
>>>>> I'm not quite sure how that can happen, but it seems like there is a
>>>>> path that allows the scheduler to set it from a foreign CPU.
>>>>
>>>> No, it is only set on curthread by curthread.  This is actually my main
>>>> question.  I've no idea how this could happen unless the rmlock code is
>>>> actually triggering a wakeup or sched_add() in its rendezvous handler.
>>>>
>>>> I don't see anything in rm_cleanIPI() that would do that however.
>>>>
>>>> I wonder if your original issue was really fixed just by the first
>>>> patch you had which fixed the race in smp_rendezvous()?
>>>
>>> I found the stack that led me to this patch in the first place:
>>>
>>> #0  sched_switch (td=0xffffff011a970000, newtd=0xffffff006e6784b0,
>>>     flags=4) at src/sys/kern/sched_ule.c:1939
>>> #1  0xffffffff80285c7f in mi_switch (flags=6, newtd=0x0) at
>>>     src/sys/kern/kern_synch.c:475
>>> #2  0xffffffff802a2cb3 in critical_exit () at
>>>     src/sys/kern/kern_switch.c:185
>>> #3  0xffffffff80465807 in spinlock_exit () at
>>>     src/sys/amd64/amd64/machdep.c:1458
>>> #4  0xffffffff8027adea in rm_cleanIPI (arg=) at
>>>     src/sys/kern/kern_rmlock.c:180
>>> #5  0xffffffff802b9887 in smp_rendezvous_action () at
>>>     src/sys/kern/subr_smp.c:402
>>> #6  0xffffffff8045e2a4 in Xrendezvous () at
>>>     src/sys/amd64/amd64/apic_vector.S:235
>>> #7  0xffffffff802a2c6e in critical_exit () at
>>>     src/sys/kern/kern_switch.c:179
>>> #8  0xffffffff804365ba in uma_zfree_arg (zone=0xffffff009ff4b5a0,
>>>     item=0xffffff000f34cd40, udata=0xffffff000f34ce08) at
>>>     src/sys/vm/uma_core.c:2370
>>> .
>>> .
>>> .
>>>
>>> and now that I look at it again, it is clear that critical_exit() just
>>> isn't interrupt safe.  I'm not sure how to fix that yet ... but this:
>>>
>>>         if (td->td_critnest == 1) {
>>>                 td->td_critnest = 0;
>>>                 if (td->td_owepreempt) {
>>>                         td->td_critnest = 1;
>>>
>>> clearly doesn't work.
>>
>> I'm sorry if I didn't reply to the whole rendezvous thread, but as
>> long as there are so many people taking care of it, I'll stay hidden in
>> my corner.
>>
>> I just wanted to say that I think you are misunderstanding what a
>> critical section is supposed to do.
>>
>> When an interrupt fires, it runs on the old "interrupt/kernel context",
>> which means it has no context of its own.  That is the reason why we
>> disable interrupts on spinlocks (or use a similar workaround for !x86
>> architectures), and this is why spinlocks are the only protection
>> usable in code that runs in interrupt context.
>>
>> Preempting just means another thread will be scheduled in the middle
>> of another thread's execution path.
>>
>> This code is perfectly fine if you consider that curthread won't be
>> descheduled while it is executing.
>
> Well, no - it is not.  With this you can end up with a curthread that has
> td_critnest=0 and td_owepreempt=1 in interrupt context.  If you use a
> spinlock on such a thread, it will do the preemption at the point where
> you drop the spinlock, which is bad in some circumstances.  One example
> is the smp_rendezvous case we are discussing here.

Yes, the 'owepreempt' flag can leak because we allow it to be set while
td_critnest is 0.  That is the problem, and I think we added this as an
optimization in the last round of changes to this routine to avoid
excessive context switches.  However, I think we will just have to revert
that optimization and prevent that state from occurring.  Hmm, the log
message for the change said it was to avoid races.
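To spell out the window, here is a simplified sketch of the existing
critical_exit() logic (illustration only, not the proposed change; the exact
flags passed to mi_switch() in the real code are simplified here):

        /*
         * Sketch of the current critical_exit() ordering: td_critnest is
         * dropped to 0 before td_owepreempt is examined.
         */
        if (td->td_critnest == 1) {
                td->td_critnest = 0;
                /*
                 * An interrupt taken here runs with td_critnest == 0 while
                 * td_owepreempt may still be 1.  If the handler acquires and
                 * releases a spin lock, spinlock_exit() -> critical_exit()
                 * sees td_owepreempt set and calls mi_switch() from interrupt
                 * context -- exactly the rm_cleanIPI backtrace above.
                 */
                if (td->td_owepreempt) {
                        td->td_critnest = 1;
                        thread_lock(td);
                        td->td_critnest--;
                        mi_switch(SW_INVOL | SW_PREEMPT, NULL);
                        thread_unlock(td);
                }
        } else
                td->td_critnest--;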
http://svnweb.freebsd.org/base?view=revision&revision=144777 is the change
that introduced the bug of mixing those two states.

Hmmm, the reason for moving the td_critnest = 0 assignment up, according to
my old e-mails, was to avoid a problem with losing preemptions.  I think the
best way to fix this is to clear owepreempt earlier, while td_critnest is
still 1, and cache it locally in critical_exit().  So something like this:

Index: kern/sched_ule.c
===================================================================
--- kern/sched_ule.c    (revision 221536)
+++ kern/sched_ule.c    (working copy)
@@ -1785,7 +1785,7 @@
         td->td_oncpu = NOCPU;
         if (!(flags & SW_PREEMPT))
                 td->td_flags &= ~TDF_NEEDRESCHED;
-        td->td_owepreempt = 0;
+        MPASS(td->td_owepreempt == 0);
         tdq->tdq_switchcnt++;
         /*
          * The lock pointer in an idle thread should never change.  Reset it
Index: kern/kern_switch.c
===================================================================
--- kern/kern_switch.c  (revision 221536)
+++ kern/kern_switch.c  (working copy)
@@ -192,15 +192,22 @@
 critical_exit(void)
 {
         struct thread *td;
-        int flags;
+        int flags, owepreempt;
 
         td = curthread;
         KASSERT(td->td_critnest != 0,
             ("critical_exit: td_critnest == 0"));
 
         if (td->td_critnest == 1) {
+                owepreempt = td->td_owepreempt;
+                td->td_owepreempt = 0;
+                /*
+                 * XXX: Should move compiler_memory_barrier() from
+                 * rmlock to a header.
+                 */
+                __asm __volatile("":::"memory");
                 td->td_critnest = 0;
-                if (td->td_owepreempt) {
+                if (owepreempt) {
                         td->td_critnest = 1;
                         thread_lock(td);
                         td->td_critnest--;
Index: kern/kern_synch.c
===================================================================
--- kern/kern_synch.c   (revision 221536)
+++ kern/kern_synch.c   (working copy)
@@ -400,9 +400,7 @@
         if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
                 mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-        KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-            (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-            newtd == NULL) || panicstr,
+        KASSERT(td->td_critnest == 1 || panicstr,
             ("mi_switch: switch in a critical section"));
         KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
             ("mi_switch: switch must be voluntary or involuntary"));
Index: kern/sched_4bsd.c
===================================================================
--- kern/sched_4bsd.c   (revision 221536)
+++ kern/sched_4bsd.c   (working copy)
@@ -947,7 +947,7 @@
         td->td_lastcpu = td->td_oncpu;
         if (!(flags & SW_PREEMPT))
                 td->td_flags &= ~TDF_NEEDRESCHED;
-        td->td_owepreempt = 0;
+        MPASS(td->td_owepreempt == 0);
         td->td_oncpu = NOCPU;
         /*

Hmm, I'm also not sure if I really need the compiler barrier since
td_owepreempt and td_critnest are marked volatile?

-- 
John Baldwin
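P.S.: the inline asm statement in the patch above is the same construct as
the compiler_memory_barrier() mentioned in the XXX comment; a sketch of what
hoisting it into a header could look like (name and placement are only
illustrative):

        /*
         * Pure compiler barrier: it emits no instruction, it only prevents
         * the compiler from moving memory accesses across it.  Note that
         * the compiler already may not reorder the two volatile stores to
         * td_owepreempt and td_critnest relative to each other, so the
         * explicit barrier mostly documents the intended ordering (and also
         * constrains surrounding non-volatile accesses).
         */
        #define compiler_memory_barrier()   __asm __volatile("" : : : "memory")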