From owner-cvs-all@FreeBSD.ORG Fri Jun 15 20:46:45 2007 Return-Path: X-Original-To: cvs-all@FreeBSD.org Delivered-To: cvs-all@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9F9ED16A469; Fri, 15 Jun 2007 20:46:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 16A7413C4AE; Fri, 15 Jun 2007 20:46:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au [220.239.235.248]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l5FKkUHX023606 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 16 Jun 2007 06:46:32 +1000 Date: Sat, 16 Jun 2007 06:46:30 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John Baldwin In-Reply-To: <200706111154.31357.jhb@freebsd.org> Message-ID: <20070616054050.Q2037@besplex.bde.org> References: <200706051420.l55EKEih018925@repoman.freebsd.org> <200706081210.24393.jhb@freebsd.org> <20070609112753.C15075@besplex.bde.org> <200706111154.31357.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: src-committers@FreeBSD.org, Kip Macy , cvs-all@FreeBSD.org, Attilio Rao , Bruce Evans , cvs-src@FreeBSD.org, Kostik Belousov , Jeff Roberson Subject: Re: cvs commit: src/sys/kern kern_mutex.c X-BeenThere: cvs-all@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: CVS commit messages for the entire tree List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 15 Jun 2007 20:46:45 -0000 On Mon, 11 Jun 2007, John Baldwin wrote: > On Friday 08 June 2007 10:20:19 pm Bruce Evans wrote: >> On Fri, 8 Jun 2007, John Baldwin wrote: >> >>> On Friday 08 June 2007 10:50:15 am Bruce Evans wrote: >>>> On Thu, 7 Jun 2007, John Baldwin wrote: ["this" is in vm_page_zero_idle() where the vm mutex is dropped] >> I think this is as good a place to preempt as any. In fact, the code >> can be simplified by preempting here and not in the main loop. The >> ... > > Preemption happens when a thread is made runnable, i.e. usually when an > interrupt comes in, but also when releasing locks or doing > wakeup/cv_broadcast, etc. The only thing the idle thread does other than > interrupts is release the lock. I'm not sure what you mean in the last sentence. This is a special idle thread, and it doesn't do interrupts. > One side effect of using a critical section > here is that we are potentially adding interrupt latency, so this may be a > rather bad thing. :-/ Probably you are seeing the number of context switches > go down because we are "batching" up switches. Say you get two interrupts > during an iteration of the main loop for this process, previously you'd get 4 > context switches out of that: zeroidle -> first ithread -> zeroidle, and then > later zeroidle -> second ithread -> zeroidle. Now you block those switches > until the critical exit, at which point both interrupt threads are pending, > so you do: zeroidle -> first ithread -> second ithread -> zeroidle, i.e. 3 > context switches. However, the cost is increased interrupt latency and that > may explain why the "real" time went up. Lots of context switches is fairly > lame, but if the box is idle, then I'm less worried about wasting time in a > context switch (well, I'm concerned because it means we can't put the CPU to > sleep since we are executing something still, but extra overhead in an idle > thread doing a background task is not near as worrisome as overhead > in "important" work like ithreads). I think I want to batch up switches a little in general, and and only a little batching occurs here. pmap_zero_page_idle() takes about 1uS on my main test system (Turion X2 with relatively slow DDR2 memory which can neverthless be zeroed at 4 or 5GB/S). An extra 1uS of interrupt latency here and there won't make much difference. It's less than the extra latency for 1 ATPIC access if not using the APIC. Also, for the makeworld benchmark, the interrupt handler runs for about 2% of the time, and pagezero runs for about 1% of the time. These threads just don't run long enough to have much contention. I think the main cause of extra context switches that I saw in some misconfigurations is switching while holding the vm mutex. The thread switched to is quite likely to block on this mutex and switch back. > As to why preemption doesn't work for SMP, a thread only knows to preempt if > it makes a higher priority thread runnable. This happens in mtx_unlock when > we wakeup a thread waiting on the lock, in wakeup, or when an interrupt > thread is scheduled (the interrupted thread "sees" the ithread being > scheduled). If another thread on another CPU makes a thread runnable, the > thread on the first CPU has no idea unless the second CPU explicitly sends a > message (i.e. IPI) to the first CPU asking it to yield instead. I believe SCHED_ULE does the IPI. > Specifically, suppose you have threads A, B, C and with priorities A < B < C. > Suppose A is running on CPU 0, and C is running on CPU 1. If C does a wakeup > that awakens B, C isn't going to preempt to B because C is more important > (assume > means more important, even though priority values are opposite > that, which is annoying). If A had awakened B it would have preempted > though, so in theory C should look at the other CPUs, notice that A < B, and > send an IPI to CPU 0 to ask it to preempt from A to B. One thing is that > IPI_PREEMPT should set td_owepreempt if the target A is in a critical > section, I haven't checked to see if we do that currently. Would it be worth checking a preemption flag in mtx_unlock()? This would bloat the macro a bit. However, we already have the check and the bloat for spinlocks, at least on i386's, by checking in critical_exit() via spinlock_exit(). Using critical sections should have the side effect of getting the flag checked in critical_exit(). This doesn't seem to work (for SCHED_4BSD), so there seems to be a problem setting the flag. >> Next I will try moving the PREEMPTION code to where the vm mutex is dropped >> naturally. I will try the following order: > > I like this idea a lot. > Actually, I would keep the sched_runnable(), etc. as #ifndef PREEMPTION, the > critical_exit() already does that check (modulo concerns above). Also, I Testing shows critical_exit() doesn't seem to be doing the preemption. On UP, depending on PREEMPTION makes little difference, but on 2-way SMP with no voluntary yielding, pagezero is too active. The critical sections don't seem to be doing much. > originally wanted to not hold the critical sectino while zeroing the page to > not impeded interrupts during that operation. I was actually trying to just > avoid preempting while holding the lock. However, given my comments about > how this harms interrupt latency, maybe this is a bad idea and we should just > let priority propagation handle that for us. Moving the context switch is > probably a good idea though. The 1 uS extra latency on my main test machine wouldn't matter, but go back to a 486 with 10MB/S memory and the latency would be a problem -- the 1uS becomes 400uS, which is a lot even for a 486. > the reason I wanted to avoid preempting while holding the lock is you can get > this case: > > zeroidle -> some ithread -> some top-half thread in kernel which needs the > vm page queue mtx -> zeroidle (with top-half thread's priority; until > mtx_unlock) -> top-half thread in kernel -> zeroidle > > which can be many context switches. By not switching while holding the lock, > one can reduce this to: > > zeroidle -> some ithread -> some top-half thread -> zeroidle > > but at the cost of adding latency to "some ithread" and "some top-half thread" Maybe preemption should be inhibited a bit when any mutex is held. Here is my current version. I got tired of using a dynamic enable for the PREEMPTION ifdef code and removed all the conditionals after the most recent test showed that the voluntary switch is still needed. % Index: vm_zeroidle.c % =================================================================== % RCS file: /home/ncvs/src/sys/vm/vm_zeroidle.c,v % retrieving revision 1.47 % diff -u -2 -r1.47 vm_zeroidle.c % --- vm_zeroidle.c 5 Jun 2007 00:00:57 -0000 1.47 % +++ vm_zeroidle.c 15 Jun 2007 19:30:13 -0000 % @@ -111,5 +111,13 @@ % mtx_unlock(&vm_page_queue_free_mtx); % pmap_zero_page_idle(m); % + if (sched_runnable()) { % + thread_lock(curthread); % + critical_exit(); % + mi_switch(SW_VOL, NULL); % + thread_unlock(curthread); % + } else % + critical_exit(); % mtx_lock(&vm_page_queue_free_mtx); % + critical_enter(); % m->flags |= PG_ZERO; % vm_pageq_enqueue(PQ_FREE + m->pc, m); % @@ -141,18 +149,14 @@ % % mtx_lock(&vm_page_queue_free_mtx); % + critical_enter(); % for (;;) { % if (vm_page_zero_check()) { % vm_page_zero_idle(); % -#ifndef PREEMPTION % - if (sched_runnable()) { % - thread_lock(curthread); % - mi_switch(SW_VOL, NULL); % - thread_unlock(curthread); % - } % -#endif % } else { % wakeup_needed = TRUE; % + critical_exit(); % msleep(&zero_state, &vm_page_queue_free_mtx, 0, % "pgzero", hz * 300); % + critical_enter(); % } % } Bruce