Date: Fri, 20 Sep 2002 14:08:05 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Rik van Riel <riel@conectiva.com.br>
Cc: Julian Elischer <julian@elischer.org>, Bill Huey <billh@gnuppy.monkey.org>, freebsd-arch@freebsd.org
Subject: Re: New Linux threading model
Message-ID: <3D8B8E35.EDAF4450@mindspring.com>
References: <Pine.LNX.4.44L.0209201652330.1857-100000@imladris.surriel.com>
Rik van Riel wrote:
> On Fri, 20 Sep 2002, Julian Elischer wrote:
> > On Fri, 20 Sep 2002, Rik van Riel wrote:
> > > There don't seem to be any O(n) loops left in or near this
> > > scheduler, meaning that 1:1 threading with lots of threads
> > > becomes possible.
> >
> > The FreeBSD scheduler is moving towards a big rewrite but we want
> > to change "one thing at a time" :-) in that area..
>
> This is doable in a smallish number of steps, which don't
> even need to be done in this order:
>
> 1) per-cpu runqueues instead of a global one, which wants ...

Yes.

> 2) ... load balancer between these per-cpu queues

No.

This is the problem with the Linux model.  With an active load
balancing algorithm, you end up with global contention for scheduler
queue locks.

In reality, you can get away from having *any* scheduler queue locks
*at all* for the normal case, and then contend only at the inter-CPU
level, between the two CPUs involved, by using a push migration
model.  This basically depends on being able to do a lockless "list
is empty" test, which you can do with a simple pointer compare.

The design is that the per-CPU scheduler checks the push list for
processes which it needs to account for in its scheduler, but which
are not in the per-CPU scheduler queue yet.  If the pointer is
non-NULL, then it locks access to the pointer and drains the list
into the local scheduler queue.

For the push case, the process(es) to be migrated off the current CPU
are selected, and then a "target" CPU is identified, based on a
figure of merit which is only ever written by the CPU to which it
applies, and which is atomically readable by all other CPUs, such
that no locking is required.  The pointer lock for that CPU is
acquired (notice that this is not in the global lock contention
domain; it only involves the two CPUs in question), and the
preassembled list is added to the list of the target CPU.  In the
normal course of events, this changes a value from NULL to a list of
processes to move into the target scheduler queue; worst case, the
pushing CPU has to assign the current contents of the protected
pointer to the "next" pointer of the last entry in the pushed list,
and the head of the pushed list to the protected pointer.

In practice, the number of processes being migrated will never be
more than 1, unless you write a replacement scheduler which is
migration-happy.  So the migrate-away is one behind the head of the
scheduling queue, and the migrate-to is one ahead.

On migration, an artificial inflation of the figure of merit can be
applied -- or the value of the pointer can be examined, and assumed
to add a constant weighting to the target CPU's figure of merit, if
the pointer value is non-NULL.

Thus, it is only ever required to lock when you are actively doing
process migration, and process migration is rare (as it should be,
particularly if one of your target architectures is NUMA, but in the
general case on small-CPU-count shared memory multiprocessors as
well).

This also permits preference weighting based on locality, for
affinity on hyper-threaded CPUs, and negaffinity in CPU sets under
the same circumstances.
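For concreteness, here is a minimal userland sketch of the scheme.
All names and constants are illustrative, and C11 atomics plus a
pthread mutex stand in for the per-CPU pointer lock; this is not
actual kernel code:

#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCPU		4
#define PUSH_WEIGHT	8	/* assumed weighting for a non-NULL push list */

struct task {
	struct task *next;		/* push-list / run-queue link */
	/* ... scheduling state ... */
};

struct cpu_sched {
	struct task *runq;		/* private to owner; no lock needed */
	struct task *_Atomic push_head;	/* incoming migrated tasks */
	pthread_mutex_t push_lock;	/* init at boot; taken only to migrate */
	_Atomic long merit;		/* written only by the owning CPU */
};

static struct cpu_sched cpu[NCPU];

/*
 * Normal case: a single pointer compare, no lock.  Only when some
 * other CPU has actually pushed work do we take our own push_lock
 * and drain the list into the local run queue.
 */
static void
sched_drain(struct cpu_sched *me)
{
	struct task *list, *t;

	if (atomic_load_explicit(&me->push_head,
	    memory_order_acquire) == NULL)
		return;

	pthread_mutex_lock(&me->push_lock);
	list = atomic_exchange(&me->push_head, NULL);
	pthread_mutex_unlock(&me->push_lock);

	while (list != NULL) {		/* fold into local run queue */
		t = list;
		list = t->next;
		t->next = me->runq;
		me->runq = t;
	}
}

/*
 * Figures of merit are written only by their owning CPU, so reading
 * them needs no locks; a non-NULL push pointer is assumed to add a
 * constant weighting for work already in flight to that CPU.
 */
static int
sched_pick_target(int self)
{
	long m, best = LONG_MAX;
	int i, target = -1;

	for (i = 0; i < NCPU; i++) {
		if (i == self)
			continue;
		m = atomic_load(&cpu[i].merit);
		if (atomic_load(&cpu[i].push_head) != NULL)
			m += PUSH_WEIGHT;
		if (m < best) {
			best = m;
			target = i;
		}
	}
	return (target);
}

/*
 * Push a preassembled list.  Only the two CPUs involved can contend
 * on this lock; worst case the target's pointer is already non-NULL,
 * and we splice our list ahead of the existing one.
 */
static void
sched_push(int self, struct task *head, struct task *tail)
{
	struct cpu_sched *dst = &cpu[sched_pick_target(self)];

	pthread_mutex_lock(&dst->push_lock);
	tail->next = atomic_load(&dst->push_head);
	atomic_store_explicit(&dst->push_head, head,
	    memory_order_release);
	pthread_mutex_unlock(&dst->push_lock);
}

Note that sched_drain() costs one pointer compare in the normal case,
and the only lock sched_push() ever takes belongs to the target CPU,
so contention is confined to the two CPUs doing the migration.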
> 3) two runqueue arrays (current and expired) instead of
> just one, which enables ...

Not required.

> 4) ... event-driven priority recalculation, instead of
> recalculating the priority of each task separately

This actually doesn't work.  The worst case failure is under
overload, which is exactly where you don't want it to be.

The BSD scheduler, as was pointed out, factors time not run into the
priority weighting.  A granularity of 3 seconds before the
distinction between the two queues for enqueueing delayed jobs is
realized is really gross.  8-(.
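To illustrate the weighting involved, here is a rough sketch of the
4.4BSD-style recalculation; the constants are approximate and the
names are illustrative, not the actual kern_synch.c code:

#define PUSER	50	/* base user priority */
#define MAXPRI	127

/*
 * estcpu decays once per second by a load-dependent filter, so time
 * spent *not* running shrinks a process's estcpu and improves
 * (lowers) its priority.  This runs over every process -- which is
 * the periodic work event-driven recalculation tries to avoid, and
 * the accounting that goes stale under overload if you recalculate
 * only on events.
 */
static double
decay_estcpu(double estcpu, double loadav, int nice)
{
	/* filter: estcpu *= 2*load / (2*load + 1), plus nice */
	return (((2.0 * loadav) / (2.0 * loadav + 1.0)) * estcpu + nice);
}

static int
resetpriority(double estcpu, int nice)
{
	int pri = PUSER + (int)(estcpu / 4.0) + 2 * nice;

	return (pri > MAXPRI ? MAXPRI : pri);
}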
> These changes are probably small enough that they can be done
> without the risk of destabilising anything.

That's certainly true, regardless of the implementation.

-- Terry