Date: Fri, 20 Sep 2002 17:40:29 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: "Bill Huey (Hui)" <billh@gnuppy.monkey.org>
Cc: Rik van Riel <riel@conectiva.com.br>, Julian Elischer <julian@elischer.org>, freebsd-arch@freebsd.org
Subject: Re: New Linux threading model
Message-ID: <3D8BBFFD.E1CDEAD5@mindspring.com>
References: <Pine.LNX.4.44L.0209201652330.1857-100000@imladris.surriel.com> <3D8B8E35.EDAF4450@mindspring.com> <20020920213629.GA1527@gnuppy.monkey.org>
"Bill Huey (Hui)" wrote: > > No. > > > > This is the problem with the Linux model. With an active load > > balancing algorithm, you end up with global contention for > > scheduler queue locks. > > That's not true, they're all per CPU run queue locks. The locks are in the global contention domain; basically, they have to be shared among all processors, unless they are in uncacheable pages. Any contended resources are going to be effectively global, at least if they are contended by multiple CPUs. > > In reality, you can get away from having *any* scheduler queue > > locks *at all*, for the normal case, and then only contend at > > the inter-CPU level for the CPUs using a push migration model. > > That's what the Mingo's scheduler does. It only does a double lock > when the load balance computation makes it do a CPU migrate, otherwise > there's no contention and no per CPU locks are aquired. This isn't true, if a double-lock is involved. The issue is one of whether or not you need to acquire a lock on your local scheduler queue in order to use it, or not. You're saying that you have to acquire the lock, which means that you effectively have to do it in a globally contended space. In the design I'm talking about, there are is only a single lock, and that lock is on a migration queue, into which you *push* entries. This is different from both the Linux scheduler, and the current FreeBSD scheduler, and the current FreeBSD scheduler with Alfred's old (last year) affinity patches applied. Bu only scheduling out of your own queue, and only pushing from your own queue to a per CPU migration queue, you guarantee *one* lock on the migration, and you guarantee not needing *any* locks for your own queue access. 
Even if you place the scheduler locks in uncontended pages, and eat a
factor-of-5 access penalty for the barrier, having to get your own
scheduler lock -- or some other CPU's -- leaves you in the position
of having to eat a huge amount of overhead in the common
(non-migration) case.

> I don't normally ask this, but what the hell are you saying here ?

Here's pseudo-code:

	/* handle migration *to* us */
	IF !empty(my_migration_queue)
		LOCK(my_migration_queue)
		WHILE !empty(my_migration_queue)
			tmp = my_migration_queue
			my_migration_queue = tmp->next
			insert(my_scheduler_queue, tmp)
			/* recalculate my load figure of merit */
		/* update our figure of merit */
		UNLOCK(my_migration_queue)
	ENDIF

	/* get next scheduled process to run for us */
	next_run = head(my_scheduler_queue)
	remove(my_scheduler_queue, next_run)

	/*
	 * Compare our load figure of merit vs. all other CPUs to
	 * decide whether to migrate a process from us to another CPU:
	 * ...find lowest figure of merit of all neighbors...
	 * ...compare our figure of merit minus a low watermark to that...
	 */
	IF should_migrate
		migrate = head(my_scheduler_queue)
		remove(my_scheduler_queue, migrate)
		LOCK(target_migration_queue)
		migrate->next = target_migration_queue
		target_migration_queue = migrate
		UNLOCK(target_migration_queue)
		/* update our figure of merit */
	ENDIF

	/* ...run "next_run"... */

...the local CPU scheduler queue is never a contended resource.
Further, you are permitted any amount of complexity in your choice of
target.  This includes preferring "hyperthreaded" CPUs on the same
die for affinity's sake (to avoid cache shootdown), and preferring
other physical CPUs for negaffinity's sake (to avoid stall barriers
that affect more than one thread simultaneously, as would be the case
on a hyperthreaded CPU running a process with two threads in a box
with 2 real CPUs with 2 hyperthreaded "CPUs" each).

> > > 3) two runqueue arrays (current and expired) instead of
> > > just one, which enables ...
> >
> > Not required.
> That's part of mingo's algorithm to avoid recalculation, if I
> understand it correctly.  Not exactly sure; I'm stretching my
> knowledge of his algorithm here.

Yes, it's his recalculation avoidance, in order to make the scheduler
"O(1)".  The real effect, though, is to make starvation possible
under heavy load, at which point things which would have been
scheduled preferentially are instead scheduled round-robin out of the
other queue.  That this only triggers when the other queue's head
(what I would call "the full quantum queue", as opposed to "the
partial quantum queue") has been waiting "too long" is actually part
of the problem, as I see things.

> > > 4) ... event-driven priority recalculation, instead of
> > > recalculating the priority of each task separately
> >
> > This actually doesn't work.  The worst case failure is under
> > overload, which is exactly where you don't want it to be.
>
> What kind of overload ?  He does a number of things to make sure
> that all processes behave properly by demoting priorities.

And by requeueing them into the "full quantum queue" rather than the
"partial quantum queue", if the number of things which need to run
exceeds an arbitrary pool retention time on the "full quantum queue".
This basically produces mathematical artifacts, from treating a step
function as if it were, in the limit, equivalent to a continuous
function.  8-).

The main thing this will do is that, when the load gets high enough
that this happens every time, the system will degrade to the same
behaviour as if a "partial quantum queue" didn't exist.  Or rather,
that behaviour, plus an additional lock, plus an empty queue
examination, plus an unlock.

The other thing it will do is cause artifacts, which will result in
unexpected behaviour near the boundary cases.  The worst case for
this is ping-ponging between the queues for an interactive process,
e.g.
a game, an MP3 player, an audio encoding process, a CD or DVD burner,
etc., while it's sitting right at the boundary.  This will not hurt
it on average, but it will toggle between getting full vs. partial
quanta (for example), which could make things, uh... "bursty".  ;^).
This degraded behaviour will be exactly equal to a complete lack of
thread group affinity.

Actually, the "partial quantum" case can only ever be 66% effective
in any event, as a statistical probability of the best case that
still exhibits contention: its intent is really to ensure
preferential scheduling of processes which are ready-to-run and which
have partial quantum remaining.  This actually fails when the
processes have partial quantum remaining... but aren't ready to run.
The implicit assumption is that the average blocking operation will
complete in the background in no more than 50% of the duration of a
quantum.  Anything longer than that, and the scheduling degrades,
again, to round-robin.

This is actually the primary reasoning behind my design of a test
case for a benchmark, vs. a microbenchmark: the algorithm that's
there will work *very well* for an uncontended system running a
single multithreaded process, but degrade significantly in all other
situations (IMO).

I'm pretty sure that Peter and Matt will not let FreeBSD blindly
implement the same model without at least running the numbers on it,
or on the model I've suggested, or on *any* model, for that matter.
FreeBSD is pretty technically conservative, which, at times like
this, is usually a good thing.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message