Date: Thu, 03 Jan 2002 12:43:40 -0800
From: Terry Lambert
To: Bernd Walter
Cc: Matthew Dillon, Alfred Perlstein, John Baldwin, arch@FreeBSD.ORG,
    Bernd Walter, Mike Smith, Bruce Evans, Michal Mertl, Peter Jeremy
Subject: Re: When to use atomic_ functions? (was: 64 bit counters)

Bernd Walter wrote:
> > In any case, multiple reader, single write doesn't need locks
> > unless it has to be done atomically.
>
> Single write means always with the same CPU or thread.

Yes, of course.

> If you always write with the same CPU, the other might keep an
> old value in its cache for an unknown length of time.
> You simply trust that caches get synchronised automatically by
> regular events like context switching.
> I'm not sure this is valid.

Actually, I'd suggest putting it in a non-cacheable page, if multiple CPUs need to get at it and it must be 100% accurate at all times; if it were in a cacheable page, there would be an Invalidate (MESI, remember?), and the cache line in the other CPUs that had it cached would be flushed.

Personally, I'd suggest making the algorithms insensitive to staleness in the data, out to something in the neighborhood of 2 times the update latency.  This is what I suggested to Alfred when I suggested per-CPU run queues (and three years ago, when we had the threads design meeting at the Whistle facility FreeBSD user group).

--- Example application and use of variable accuracy counts in a
    non-locking scheduler with CPU migration, negaffinity, and
    variable scheduling policy ---

Basically, you would not need extremely accurate load numbers in order to make CPU load balancing decisions, if you did the load balancing "correctly" (a rough sketch in C follows the list):

1)	Per CPU run queues are created.

2)	Processes do not migrate themselves between run queues.

3)	A single "figure of merit" is kept on a per CPU basis, where other CPUs can see it; another CPU's view of this figure may lag behind its true value.

4)	An infrequent (relative to the scheduling quantum) timer averages the figures for the "CPUs of interest" (keeping statistics per CPU set this way would permit NUMA and cluster based migration in the future, if desired), and stores the result, also as a "figure of merit", where it is accessible to the other CPUs.  It might also keep a list of M CPUs, sorted from lowest to highest value, to make migration CPU target selection easier.

5)	When a CPU is ready to pull a task off its run queue, perhaps every N iterations of this procedure, it takes its "figure of merit" and compares it against the average "figure of merit"; if the per CPU number is not some configurable "high water mark" over the average, then nothing happens: it simply schedules the task.

6)	If it decides action needs to be taken, then it polls the figures of merit for its interest group.  The CPU with the lowest "figure of merit" is selected; this may be mediated by negaffinity; for example, if the task is a thread in a thread group (process), it might have a 32 bit value in the process structure indicating which CPUs currently have threads from that group on them.  In that case, only a CPU without its bit set in the 32 bit value could be selected, based on relative figure of merit.

7)	The "scheduler insertion queue" for the selected target CPU (a simple linked list) has the task added to it (the task is migrated).  This requires a lock on the queue; note that only one other CPU is tied up in this process, in an N CPU system, and the statistical likelihood of a collision is very low.

8)	Each CPU going into its scheduler looks first at its per CPU scheduler insertion queue; this look does not require a lock, in most instances, since all it is doing is checking whether a pointer is NULL.  If the pointer is not NULL, then (and only then) is a lock on the queue asserted by the CPU that "owns" it, and the element is dequeued, the lock released, and the task run.
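Here is the rough sketch in C that the list refers to.  Every name in it (struct pcpu_sched, struct task, spin_lock()/spin_unlock(), NCPU, MERIT_HIWAT) is invented for illustration; this is not the existing FreeBSD scheduler or locking API, just the figure-of-merit check, the negaffinity-constrained target selection, the locked insertion (the only lock in the design), and the lock-free NULL check from steps 5 through 8:

#include <stddef.h>

/*
 * Hypothetical sketch only: none of these names are real FreeBSD
 * interfaces.  The spin primitives are assumed to exist elsewhere.
 */
#define NCPU            8
#define MERIT_HIWAT     2       /* configurable "high water mark" over average */

struct spinlock { volatile int sl_locked; };
void spin_lock(struct spinlock *);      /* assumed locking primitives */
void spin_unlock(struct spinlock *);

struct task {
        struct task     *t_next;        /* insertion queue linkage */
        unsigned int     t_cpumask;     /* negaffinity: CPUs running siblings */
};

struct pcpu_sched {
        volatile int     ps_merit;      /* per CPU "figure of merit" */
        volatile int     ps_avg_merit;  /* infrequently averaged figure (step 4) */
        struct task     *ps_insertq;    /* scheduler insertion queue (step 7) */
        struct spinlock  ps_insertq_lock;
};

struct pcpu_sched sched[NCPU];

/* Steps 5-7: decide whether to hand a task to a less loaded CPU. */
int
maybe_migrate(int mycpu, struct task *t)
{
        struct pcpu_sched *me = &sched[mycpu];
        int cpu, best = -1, best_merit = 0;

        /* Step 5: not far enough over the average?  Just run it locally. */
        if (me->ps_merit <= me->ps_avg_merit + MERIT_HIWAT)
                return (0);

        /* Step 6: lowest figure of merit wins, subject to negaffinity. */
        for (cpu = 0; cpu < NCPU; cpu++) {
                if (cpu == mycpu || (t->t_cpumask & (1U << cpu)))
                        continue;
                if (best == -1 || sched[cpu].ps_merit < best_merit) {
                        best = cpu;
                        best_merit = sched[cpu].ps_merit;
                }
        }
        if (best == -1)
                return (0);

        /* Step 7: the only locked operation, and it ties up only 2 CPUs. */
        spin_lock(&sched[best].ps_insertq_lock);
        t->t_next = sched[best].ps_insertq;
        sched[best].ps_insertq = t;
        spin_unlock(&sched[best].ps_insertq_lock);
        return (1);
}

/* Step 8: each CPU checks its own insertion queue without a lock. */
struct task *
pull_migrated(int mycpu)
{
        struct pcpu_sched *me = &sched[mycpu];
        struct task *t;

        if (me->ps_insertq == NULL)     /* common case: no lock taken */
                return (NULL);

        spin_lock(&me->ps_insertq_lock);
        t = me->ps_insertq;
        if (t != NULL)
                me->ps_insertq = t->t_next;
        spin_unlock(&me->ps_insertq_lock);
        return (t);
}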
This is only an example algorithm; but note that it operates completely without locks, except in the migration case, and the migration case is a relatively rare event.

This approach gets rid of all the *mess* in Linux and other OSs, where they try to ensure CPU affinity for processes by scheduler selection: the scheduler algorithm need not be nearly as complex in order to achieve the same benefits of negaffinity of CPUs for threads in a single process, etc. (the more you look at it, the more benefits you will see).

---

The point of this example is that the issue isn't about how much or how little locks cost, but about designs that don't require locks in the common case.  So it really doesn't *matter* what the costs are up front, as long as the costs are incurred so infrequently that their amortized value is lower than that of the lowest cost method being hit much more frequently.

In other words: Matt was right when he said "Maybe we are looking at this all wrong ... maybe we don't need locks ..." (paraphrasing).

> > An easy way to do this atomically, even for fairly large items,
> > is to toggle between an active and inactive structure of data,
> > where the pointer to the structure is assigned atomically.
>
> Only if you use rel/acq behaviour - atomic alone isn't strong enough.

It depends on your cycle time on the objects.  So long as the update/read cost is very small relative to the interval between updates/reads, you're OK.  Basically, you need N copies, where the interval between updates/reads divided by the update/read cost is greater than N*2.

For my example of a zero-system-call gettimeofday() and time(), using a page with PG_U set on it to contain a reflection of the timecounter data, N == 2 (the minimum N permissible).
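As a minimal sketch of that N == 2 case (again, the names -- struct timedata, td_current, td_update(), td_read() -- are made up for illustration, and this is not the actual timecounter reflection code): the single writer fills in the copy no reader is using, then publishes it with one atomic pointer store, so readers always see a complete snapshot without a lock or a system call.

/*
 * Hypothetical sketch of the two-copy "current"/"previous" toggle.
 * As noted above, on a weakly ordered SMP machine the pointer store
 * needs release semantics and the reader's load needs acquire
 * semantics (or the page must be mapped non-cacheable), or a reader
 * may see a partially visible copy.
 */
struct timedata {
        long    td_sec;         /* whatever data is being reflected */
        long    td_usec;
};

static struct timedata            td_slots[2];  /* the N == 2 copies */
static struct timedata * volatile td_current  = &td_slots[0];
static struct timedata *          td_previous = &td_slots[1];

/* Writer: always the same CPU/thread (single writer, multiple readers). */
void
td_update(long sec, long usec)
{
        struct timedata *p = td_previous;   /* copy no reader should be using */

        p->td_sec = sec;
        p->td_usec = usec;
        /* release barrier would go here */
        td_previous = td_current;           /* old data becomes the spare */
        td_current = p;                     /* atomic pointer store publishes it */
}

/* Reader: no lock, no system call; just follow the current pointer. */
void
td_read(long *secp, long *usecp)
{
        struct timedata *snap = td_current; /* acquire barrier would go here */

        *secp = snap->td_sec;
        *usecp = snap->td_usec;
}

This is only safe under the constraint above: a reader has to finish with its snapshot in much less time than the interval between updates, or the writer can come back around and scribble on the copy a stalled reader is still looking at -- which is exactly why N has to grow if that assumption breaks down.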
For something else, N might be a larger value, and then instead of two pointers for "current" and "previous", you are talking about two pointers forming a two-handed clock through a circularly linked list of structures containing the reflected data (the hands are both moved on update-forward, so (1) the data is never incorrect, and (2) the data is always one-behind during the update process itself).

If N gets larger than 2, in any case, you probably ought to be reconsidering your choice of algorithms anyway.

--
Terry