Date: Thu, 03 Jan 2002 09:28:09 -0800 (PST) From: John Baldwin <jhb@FreeBSD.org> To: Matthew Dillon <dillon@apollo.backplane.com> Cc: Peter Jeremy <peter.jeremy@alcatel.com.au>, Michal Mertl <mime@traveller.cz>, Bruce Evans <bde@zeta.org.au>, Mike Smith <msmith@FreeBSD.ORG>, Bernd Walter <ticso@cicely8.cicely.de>, arch@FreeBSD.ORG Subject: Re: When to use atomic_ functions? (was: 64 bit counters) Message-ID: <XFMail.020103092809.jhb@FreeBSD.org> In-Reply-To: <200201030410.g034AkP61865@apollo.backplane.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On 03-Jan-02 Matthew Dillon wrote: >:> non-cacheable in terms of code generation. Every time you get it you >:> have to re-obtain it. So a great deal of our code does this in order >:> to work around the lack of optimization: >: >:No, this is incorrect. curthread and curpcb are a bit special here. They >:are >:valid even across migration. > > Yes they are... but the code that retrieves them can't be optimized by > GCC to deal with multiple references. That's one of the problems. To be honest, I find 'td' more readable then 'curthread' so for me this isn't that big of a problem. Are you saying you want the compiler to automatically cache the value in a temporary variable on its own? >:> ... use td multiple times ... >: >:This is done cause PCPU_GET() is slow. Perhaps that is what you are trying >:to >:say in that our PCPU_GET() macros are slow? On x86 they are one instruction >:for most cases, and on other archs they are 2 or 3 at most. > > No, what I am saying is that they are designed in a manner that > makes it pretty much impossible for GCC or a human to optimize. The > fact that you cannot get a stable pointer to the per-cpu area (without > being in a critical section) and then access the variables via standard > structural indirection has led to all manner of bloat in the API. > It turns what ought to be a simple matter of a cpu-relative access > into a mess of macros and makes it difficult to take advantage of > the per-cpu data area for anything general purpose. Err, PCPU_PTR returns a pointer. The fact that you have to be in a critical section is a separate issue due to the fact that we allow threads preempted by an interrupt to migrate. Well, there are two choices as far as allowing migration as I see it, either 'critical_exit()' is a migration point as well as any setrunqueue()'s, or you just make setrunqueue()'s migration points and pin a thread when it is preempted by critical_exit(). > If we had conceptually 'instant' critical section handling it would not > be as big a deal, but we have a bad combination here... critical sections > are expensive (almost as expensive as getting a mutex), and it is not > possible to optimize general per-cpu structural references. If we had cold fusion life would be peachy keen as well. :-P > The function/macroness is not the biggest problem here. It's > the cli/sti junk and the general overhead in the code that is the > problem. > What should be an extremely light weight function that nobody should have > a second thought about using when they need it isn't even close to > being light weight. Our critical section code could be two > (unsynchronized) instructions on nearly all architectures with > only a moderate (and very well known since it would be similar to > the SPL mechanism) amount of assembly. > > We have the same problem with these functions that we had with the spl*() > code before we optimized it. People were uncomfortable with the routines > and the API and often did all sorts of weird things to try to avoid > them due to their perceived overhead. We had (and still have) > unnecessarily long spl'd loops in -stable because of that. The spl code was only optimized on i386. It wasn't on alpha. However, you are blatantly ignoring one issue that I've brought up several times here: if you don't disable interrupts in a critical section, then you can't use locks in the bottom half of the kernel. However, we _need_ to use the icu_lock to disable an interrupt source. Also, if you want to allow other CPU's to snoop interrupts from other CPU's then you need a lock for that. Which must be acquired when writing the data, not just when the other CPU reads it. >:Fair enough. I suppose we could use a per-thread bitmask of pending >:interrupts. >:Hmm, except that that assumes a dense region of vectors which alpha doesn't >:have. Note that on the alpha we did spl's via the x86 equivalent of cli/sti >:for this reason. On the alpha vector numbers are rather sparse and won't >:fit into a bitmask. You can't use an embedded list in struct ithd in case >:multiple CPU's get the same interrupt in a critical section. Also, now we >:have >:the problem that we can't use the spin lock to protect the ICU on SMP >:systems, >:meaning that we will corrupt the ICU masks now when disabling interrupts >:(which >:we _have_ to do if interrupts are level-triggered as they are for PCI >:interrupts are on alpha). > > interrupt code == interrupts disabled on entry usually, which means > you can create a queue in the per-cpu data area of pending interrupts > which are to be executed when the critical section is left. And, I will > also add, that this also greatly improves the flexibility we have in > dealing with interrupts. It would even be possible to have an idle cpu > check an active cpu's pending interrupt queue and steal one. But when I exit the interrupt and return back to the kernel interrupts are enabled again. Combine this with a level-triggered interrupt and you get a deadlock. One idea buzzing around my head is to allow one interrupt in a critical section (save's an ithread pointer in the thread) and then disable interrupts in the trapframe so when we return we run with interrupts disabled until we exit the outermost critical_exit(). This could work as it would allow us to defer the ICU lock as well. This does mean that mulitple CPU's might schedule the same ithread (or run the same fast interrupt handler) but I think we will be fine with that. Humm, except that fast interrupt handlers want a trapframe to work with (at least clock interrupts do) so this would break clock interrupts. Eww. This we could work around by special saving a mini-clockframe (that just has the bits of a trapframe that a clock interrupt wants to actually use) along with the the ithread pointer and passing it to fast interrup handlers. Basically a intrframe which is much smaller than a trapframe. OTOH, we could just save a full trapframe though that's a bit large IMO but it would work for the first round without overly complicating things. >:I'm afraid the issues here are a bit more complex than it first seems. >:However, the idea of deferring interrupts is a nice one if we can do it. It >:is >:actually very similar to the way preemption will/would defer switches to >:real-time threads until the outermost critical_exit(). > > Of course we can do it. It's how we do SPLs in -stable after all. It > is a very well known and well proven mechanism. This is how you do SPL on _i386_ in stable. >:Err, I was going to use separate procedures precisely to avoid having to >:complicate critical_enter/exit. critical_enter/exit_spinlock or some such >:which would only be used in kern_mutex.c and sys/mutex.h in one place in each >:file. > > Yuch. It is a simple matter to standardize the meaning of > td->td_critnest but the full implementation of 'critical_enter' and > 'critical_exit' should not be split up into MD and MI parts. It just > makes the code less readable. They should simply be MD. Incrementing a counter is MD? :) They are split up like this because critical_exit() is going to grow some more MI code for preemption, and it is going to grow some more MI code for this optimization as well. >:> I don't understand what you are saying here. The algorithm I described >:> is platform independant (just as our spl mechanism was). What more can >:> one ask? It's great.. an API that everyone can depend on performing >:> well, exactly as advertised, across a multitude of platforms. >: >:No, I think your algorithm is very i386 specific to be honest. The spl >:mechanisms differend widely on i386 and alpha. It may be that this code >:really >:needs to be MD as the spl code was instead of MI to optimize it, but we don't >:need it right now. Your argument is that people will not use critical_enter >:because of its implementation details, and I'm saying that people should use >:the API because of the service it presents and should not be worried about >:how >:it is implemented. > > The basic _cpl concept was and still is platform independant. The > concept of 'defering an interrupt' is certainly platform independant. > I don't see how this could be any less independant. Err, the optimization of deferring interrupts is very i386 specific. The original spl's when spl's were first used disabled lower-priority interrupts on the PDP. We do the same on the Alpha. It's equivalent to cli/sti except _much_ more expensive. Each spl on the alpha involved 1 or 2 PAL calls. The way you are approaching this is very i386 specific. >:the icu lock, and the problem of figuring a way to store the per-thread state >:of triggered interrupts for each arch. If there is a MI way, that would be >:best. Well, the embedded list of ithreads might work actually. But that >:doesn't fix icu_lock. > > These are all problems we faced before. I had to deal with all of > these issues when I did that first pass on the 386 code when we first > started to scrap the _cpl stuff, incorporate Giant, and implement > the idle thread. Uh. The i386 code on SMP used a mplock that it _disabled interrupts_ with in order work around the problems I've brought up. Using cli/sti or their equivalents is how we handled these problems before. You weren't crying foul then. -- John Baldwin <jhb@FreeBSD.org> <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve!" - http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?XFMail.020103092809.jhb>