From owner-svn-src-projects@FreeBSD.ORG Mon Oct 29 15:34:27 2012 Return-Path: Delivered-To: svn-src-projects@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9F76DBC7; Mon, 29 Oct 2012 15:34:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id 242388FC0C; Mon, 29 Oct 2012 15:34:26 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q9TFYNju014450 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 30 Oct 2012 02:34:24 +1100 Date: Tue, 30 Oct 2012 02:34:23 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Attilio Rao Subject: Re: svn commit: r238907 - projects/calloutng/sys/kern In-Reply-To: Message-ID: <20121030014250.D5191@besplex.bde.org> References: <201207301350.q6UDobCI099069@svn.freebsd.org> <201207301732.33474.jhb@freebsd.org> <20121029155136.O943@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-Cloudmark-Score: 0 X-Optus-Cloudmark-Analysis: v=2.0 cv=EbGQ24aC c=1 sm=1 a=zSpkIMvUouMA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=Y4pc3zAY36oA:10 a=6I5d2MoRAAAA:8 a=bszOWdyLQ-ZoIfDFDpYA:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: src-committers@FreeBSD.org, John Baldwin , Jeff Roberson , Florian Smeets , Bruce Evans , Bruce Evans , svn-src-projects@FreeBSD.org X-BeenThere: svn-src-projects@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: "SVN commit messages for the src " projects" tree" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 29 Oct 2012 15:34:27 -0000 On Mon, 29 Oct 2012, Attilio Rao wrote: > On 10/29/12, Bruce Evans wrote: >> On Mon, 29 Oct 2012, Attilio Rao wrote: >> >>> Now that sched_pin()/sched_unpin() are fixed I would like to introduce >>> this new patch, making critical_enter()/critical_exit() inline: >>> http://www.freebsd.org/~attilio/inline_critical.patch >>> >>> The concept is pretty simple: simple add/dec for critical_enter, exit >>> are inlined, the rest is in an "hard path". Debugging enables the hard >>> paths by default (really I think that only KTR may be due here, but I >>> thought that in case of INVARIANTS this was also wanted, so I added >>> the check also for that case). >> ... >> Inlining of mtx_lock_spin() is bogus unless critical_enter() is inlined. >> Similarly for mtx_unlock_spin() and critical_exit(). It saves 1 function >> call. but critical_enter() does a function call anyway. critical_exit*( >> also has a branch in branch in it that might cost more than the function >> call just for mispredicting it. > > Correct, that is a further argument for having inlined > critical_enter(), And for inlining neither, ot the opposite one like I do. > even if the actual calling cames from > spinlock_enter(), not critical_enter(), which must be MD (that's on > FreeBSD, not sure what happens in your OS). I forgot that I don't have the slow functions spinlock_enter() and spinlock_exit() in mtx_[un]lock_spin(). (My mutexes don't block interrupts, as required for fast interrupt handling that is actually fast (really low-latency). My spinlocks just use critical*(), and critical*() doesn't block fast interrupt handling.) The spinlock_enter() calls mean that inlining mutex calls is even more bogus. Instead of just 1 function call which does not much more than increment or decrement a counter, there is a nested call to the critical*() one and another call to spinlock_enter(). spinlock_enter() is MD and might need to do lots of slow hardware things. critical_enter() does the following on i386: % void % spinlock_enter(void) % { % struct thread *td; % register_t flags; % % td = curthread; % if (td->td_md.md_spinlock_count == 0) { % flags = intr_disable(); This is a CPU control instruction and thus tends to be slow. It was very slow on Pentium4. It might involve some serialization although it is not a full serialization instruction. % td->td_md.md_spinlock_count = 1; % flags &= ~PSL_T; The previous line is from my version. It fixes spurious trace traps when the flags are popped in critical_exit(). Similar fixes are needed for the pushfl/popfl sequences in swtch.s. The spurious trace traps were and might still be more harmful than they should be since they exercise deadlock bugs in syscons and/or printf. Simply trace through a large amount of code in ddb, going through here a few times to set up spurious trace traps for several td's. It may also be necessary to have syscons and/or printf doing non-ddb i/o. Eventually the trace traps bite and demonstrate the deadlock. % td->td_md.md_saved_flags = flags; % } else % td->td_md.md_spinlock_count++; % critical_enter(); % } Everything else uses simple non-control instructions so it is quite fast. However, if this is not serialized, then it can run in parallel with mtx_lock_spin() and vice versa since there are no inter-dependencies. It is unclear whether the parallelism is helped or harmed by not inlining mtx_lock_spin(). >> My version goes the other way and uninlines mtx_lock_spin() and >> mtx_unlock_spin(). Then it inlines (open codes) critical_enter() and >> critical_exit() in them. This saves a lot of text space and thus >> hopefully saves time too. I couldn't find any cases where it either >> ... >> >> OTOH, I couldn't get uninlining of mtx_lock() and mtx_unlock() to work. >> .. > > I don't think that uninling mtx_lock()/unlock() (btw, on which hw are > you testing them if you are still able to catch performance penalties > by branch misprediction?!) is a good idea, likely what would be a > better one is to both inline critical_enter() and spinlock_enter(). Er, it is a good idea, as explained above. Whether it is better in practice is very MD. The mtx non-calls are already quite large, and adding critical*() and spinlock*() to them would make them larger. Above a certain MD size, inlining is just slower because it busts caches. spinlock*() is especially hard to inline because it does MD magic that might be even larger than the i386 version. Bruce