Date: Wed, 2 Jan 2002 20:10:46 -0800 (PST) From: Matthew Dillon <dillon@apollo.backplane.com> To: John Baldwin <jhb@FreeBSD.ORG> Cc: arch@FreeBSD.ORG, Bernd Walter <ticso@cicely8.cicely.de>, Mike Smith <msmith@FreeBSD.ORG>, Bruce Evans <bde@zeta.org.au>, Michal Mertl <mime@traveller.cz>, Peter Jeremy <peter.jeremy@alcatel.com.au> Subject: Re: When to use atomic_ functions? (was: 64 bit counters) Message-ID: <200201030410.g034AkP61865@apollo.backplane.com> References: <XFMail.020102183313.jhb@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
:> non-cacheable in terms of code generation. Every time you get it you :> have to re-obtain it. So a great deal of our code does this in order :> to work around the lack of optimization: : :No, this is incorrect. curthread and curpcb are a bit special here. They are :valid even across migration. Yes they are... but the code that retrieves them can't be optimized by GCC to deal with multiple references. That's one of the problems. :> ... use td multiple times ... : :This is done cause PCPU_GET() is slow. Perhaps that is what you are trying to :say in that our PCPU_GET() macros are slow? On x86 they are one instruction :for most cases, and on other archs they are 2 or 3 at most. No, what I am saying is that they are designed in a manner that makes it pretty much impossible for GCC or a human to optimize. The fact that you cannot get a stable pointer to the per-cpu area (without being in a critical section) and then access the variables via standard structural indirection has led to all manner of bloat in the API. It turns what ought to be a simple matter of a cpu-relative access into a mess of macros and makes it difficult to take advantage of the per-cpu data area for anything general purpose. If we had conceptually 'instant' critical section handling it would not be as big a deal, but we have a bad combination here... critical sections are expensive (almost as expensive as getting a mutex), and it is not possible to optimize general per-cpu structural references. :> * critical_enter() is a procedure, not a macro. So for something that :> should be as simple as '++curthread->td_critnest' which could be :> two instructions we instead have a 36 byte procedure which does :> pushes, moves, pops, comparisons, and a nasty cli instruction. : :It can be made a macro trivially if we want do it later. However, during :earlier work when one may want to put extra stuff in it such as assertions it :is ok for it to be a function for the time being. Microptimizing one function :is well and good and all, but in -current there are really bigger fish to fry. The function/macroness is not the biggest problem here. It's the cli/sti junk and the general overhead in the code that is the problem. What should be an extremely light weight function that nobody should have a second thought about using when they need it isn't even close to being light weight. Our critical section code could be two (unsynchronized) instructions on nearly all architectures with only a moderate (and very well known since it would be similar to the SPL mechanism) amount of assembly. We have the same problem with these functions that we had with the spl*() code before we optimized it. People were uncomfortable with the routines and the API and often did all sorts of weird things to try to avoid them due to their perceived overhead. We had (and still have) unnecessarily long spl'd loops in -stable because of that. It is also the same problem we have with mutexes. Look at what mutex (current) and simplelock (stable) overhead did to the filesystem update code? (SYNC/FSYNC). System with large amounts of memory and a moderate amount of dirty data wound up stalling long enough to be noticeable every 30 seconds. Just having a documented API isn't good enough. There is a lot more to it then that and it effects the way people code. The APIs in -current are nasty enough that I have a strong desire to throw the whole damn thing away because I'm afraid of the code people will wind up writing to get around their inefficiencies. :>:This requires the bottom-half code to not use any locks at all which might be :>:possible. However it means if I have an idle CPU, it can't pick up the :>:interrupt handler for the interrupt that came in until the CPU that gets its :>:interrupt leaves the handler. If you want to prevent that, you need to allow :> :> Yes, but so what? The cpu holding the critical section isn't going to :> be in that section for more then a few clock cycles most likely. : :Fair enough. I suppose we could use a per-thread bitmask of pending interrupts. :Hmm, except that that assumes a dense region of vectors which alpha doesn't :have. Note that on the alpha we did spl's via the x86 equivalent of cli/sti :for this reason. On the alpha vector numbers are rather sparse and won't :fit into a bitmask. You can't use an embedded list in struct ithd in case :multiple CPU's get the same interrupt in a critical section. Also, now we have :the problem that we can't use the spin lock to protect the ICU on SMP systems, :meaning that we will corrupt the ICU masks now when disabling interrupts (which :we _have_ to do if interrupts are level-triggered as they are for PCI :interrupts are on alpha). interrupt code == interrupts disabled on entry usually, which means you can create a queue in the per-cpu data area of pending interrupts which are to be executed when the critical section is left. And, I will also add, that this also greatly improves the flexibility we have in dealing with interrupts. It would even be possible to have an idle cpu check an active cpu's pending interrupt queue and steal one. :I'm afraid the issues here are a bit more complex than it first seems. :However, the idea of deferring interrupts is a nice one if we can do it. It is :actually very similar to the way preemption will/would defer switches to :real-time threads until the outermost critical_exit(). Of course we can do it. It's how we do SPLs in -stable after all. It is a very well known and well proven mechanism. :> in all the supposedly uncomplicated procedures you wind up with something :> that is unnecessarily complex and definitely overcomplicated. : :Err, I was going to use separate procedures precisely to avoid having to :complicate critical_enter/exit. critical_enter/exit_spinlock or some such :which would only be used in kern_mutex.c and sys/mutex.h in one place in each :file. Yuch. It is a simple matter to standardize the meaning of td->td_critnest but the full implementation of 'critical_enter' and 'critical_exit' should not be split up into MD and MI parts. It just makes the code less readable. They should simply be MD. :> I don't understand what you are saying here. The algorithm I described :> is platform independant (just as our spl mechanism was). What more can :> one ask? It's great.. an API that everyone can depend on performing :> well, exactly as advertised, across a multitude of platforms. : :No, I think your algorithm is very i386 specific to be honest. The spl :mechanisms differend widely on i386 and alpha. It may be that this code really :needs to be MD as the spl code was instead of MI to optimize it, but we don't :need it right now. Your argument is that people will not use critical_enter :because of its implementation details, and I'm saying that people should use :the API because of the service it presents and should not be worried about how :it is implemented. The basic _cpl concept was and still is platform independant. The concept of 'defering an interrupt' is certainly platform independant. I don't see how this could be any less independant. :No, I do not have half the code locked up by any means. I do have a change I :need to test related to the critical_enter/exit stuff that I want to commit in :a day or so though. As far as changing critical_enter/exit to defer, I'm not :going to work on that right now except to think about it. Specifically, not :only must critical_enter/exit change, but the various MD code that calls :ithread_schedule() needs to defer those calls. Also, there is the problem of :the icu lock, and the problem of figuring a way to store the per-thread state :of triggered interrupts for each arch. If there is a MI way, that would be :best. Well, the embedded list of ithreads might work actually. But that :doesn't fix icu_lock. These are all problems we faced before. I had to deal with all of these issues when I did that first pass on the 386 code when we first started to scrap the _cpl stuff, incorporate Giant, and implement the idle thread. As I said, I can do this. It would take me about a week for I386. The other architectures would be able to stick with the sti/cli equivalent until their assembly gurus clean them up (if they even care).... what I am proposing is completely compatible. -- In regards to mutex spins, this actually simplifies matters. The scheduler can schedule any pending interrupts while it has the scheduler mutex held, even if both the previous thread and what would be the next thread have a non-zero td_critnest. You can avoid nearly *ALL* hard interrupt disablements -- just one place in the core of the scheduler (that we already have) is all you would need. -Matt Matthew Dillon <dillon@backplane.com> To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200201030410.g034AkP61865>