From: Terry Lambert <tlambert2@mindspring.com>
Date: Sat, 01 Feb 2003 18:30:17 -0800
To: Matthew Dillon
Cc: Bosko Milekic, "Daniel C. Sobral", Trish Lynch,
    freebsd-current@FreeBSD.ORG
Subject: Re: Hyperthreading and machdep.cpu_idle_hlt

Matthew Dillon wrote:
> The HLT/clock interrupt issue is precisely what I describe in the
> idle_hlt comments in i386/i386/machdep.c (last July).  I wish we had
> a better mechanism than the stupid IPI stuff, like a simple per-cpu
> latch/acknowledge level interrupt (softint), but we don't.  Thanks,
> Intel. 8-).
>
> I don't think we want to over-engineer per-cpu scheduling.  The
> system really doesn't know what cpu a task is going to wind up
> running on until a cpu scheduler (sched_choose()) comes along and
> needs to locate the next task to run.  Too many things can happen in
> between the initiation of the wait, the wakeup, and the task
> actually getting cpu.  Introducing a complex per-cpu wait queue, or
> trying to do something complex at wakeup time instead of at
> sched_choose() time, is just going to be a waste of time.  I think
> it is best to wakeup a task by placing it on the same cpu run queue
> as it was previously on (which is what Jeff's code does for the most
> part),

The approach I've *always* advocated involves per-CPU run queues.
While it's possible to implement affinity and negaffinity with a
global queue, as Linux did, it's not really a good idea.  I know that
the current Linux code works, because Ingo Molnar put a huge amount
of effort into it.  However, the problem of affinity is not
NP-complete; it shouldn't take that kind of effort, and it's not
really possible to deal with it cleanly with a single queue.

Putting an awakened process back on the ready-to-run queue of the
CPU it last ran on is just the obvious way of dealing with CPU
affinity.
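To make that shape concrete, here is a minimal sketch of what I mean.
It's user-space C with pthreads standing in for kernel locking, and
every name in it is made up for illustration; it is not the actual
FreeBSD (or Linux) code:

    #include <pthread.h>
    #include <stddef.h>

    #define NCPU 4                      /* hypothetical CPU count */

    struct task {
        struct task *t_next;
        int t_lastcpu;                  /* CPU this task last ran on */
    };

    struct runq {
        pthread_mutex_t rq_lock;        /* per-CPU lock, not global */
        struct task *rq_head;
    };

    static struct runq runqs[NCPU];

    void
    runq_init(void)
    {
        int i;

        for (i = 0; i < NCPU; i++) {
            pthread_mutex_init(&runqs[i].rq_lock, NULL);
            runqs[i].rq_head = NULL;
        }
    }

    /*
     * Wakeup with the "obvious" affinity: the task goes back on the
     * queue of the CPU it last ran on, where its cache and TLB state
     * are most likely to still be warm.
     */
    void
    sched_wakeup(struct task *t)
    {
        struct runq *rq = &runqs[t->t_lastcpu];

        pthread_mutex_lock(&rq->rq_lock);
        t->t_next = rq->rq_head;
        rq->rq_head = t;
        pthread_mutex_unlock(&rq->rq_lock);
    }

That much buys you affinity, and only affinity.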
It fails to deal with negaffinity, which is necessary for parallel
execution, and it fails to deal with the issue of eliminating locks:
you still get the TLB shootdowns, cache line invalidations, and
inter-APIC arbitration traffic, even if you don't issue an explicit
IPI.

> and deal with task stealing in
> sched_choose().

A "task stealing" or "pull" approach to balancing is just plain
wrong.  It can't deal with transient load spikes, which should be
expected in any heterogeneous load situation.  This is the reason you
never see anyone running benchmarks on such a system unless the
system is otherwise idle and the test loads are homogeneous:
heterogeneous test loads would fail to show "improvement", due to
excessive bouncing.

The other problem with a "pull" model is that it requires the per-CPU
scheduler queue to be locked on every access, so that a different CPU
can lock it and traverse it to "pull" new tasks for itself.

Finally, the "pull" model makes it very hard to deal with SMT, in
terms of making negaffinity decisions (you have to know too much), or
of applying an intentional affinity bias (e.g. stay on the same
physical CPU, but go to "the other side" to avoid cache-busting).

By going to a push model, you limit locking to the migration code
path: a check for a non-NULL per-CPU migration list head can be done
without acquiring the lock that must be held to pull tasks off the
migration queue, or to push tasks off to another CPU (locking the
target's queue for the operation).  What that means is no locking in
the common case.  And no cache line invalidation, no TLB shootdown,
and no inter-APIC arbitration (a sketch of this follows below).

Actually, it's very easy to compare these models implicitly: run a
single CPU with SMT enabled, and compare it to two identical CPUs
with SMT disabled, so that the only real change is the shared parts,
and the lack of the cache, TLB, and APIC crap in the first case.
With the inherent stalls from shared resources in the SMT case, you'd
expect a "true" SMP to have significantly higher performance than
SMT.  But that's not what you see, even with Jeff's new scheduler.

> The scheduler, when it comes time to actually switch
> in the next runnable task, then deals with complexities associated
> with misbalancing (i.e. cpu A is idle and ready to accept a new
> task, and cpu B's run-queue has a task ready to be run).

Yes, currently, and this is wrong.  Or rather, the underutilized CPU
should not be doing the balancing, if it has to stall the most
heavily loaded CPU in order to do it... and therefore "pull" is
wrong.

What *should* happen is that, at the time of the next context switch,
the heavily loaded CPU should decide to "shed load".  Proper
hysteresis really wants that decision to be made periodically, with a
periodicity of once per (N+1) entries into the per-CPU scheduler,
where N is the number of CPUs participating in the SMP region within
which tasks may be migrated (this is part of the strategy for
avoiding excessive migration).

If this decision can be made on a single figure of merit, which we'll
call "load", and the figure is stored in an atomically readable
(e.g. integer) data area, then it can be read without locks.  Since
it's only written by the CPU whose figure of merit it is, no locks
are required to arbitrate the process of determining a load
imbalance, either.
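For concreteness, here is what that push path might look like,
continuing the made-up user-space sketch from above.  C11 atomics
stand in for whatever primitives the kernel would really use, and the
names are still hypothetical:

    #include <stdatomic.h>      /* continues the sketch above; NCPU
                                   and struct task are reused */

    struct cpu_sched {
        atomic_int load;                /* written only by its own CPU */
        pthread_mutex_t mig_lock;       /* guards the migration queue */
        struct task *_Atomic mig_head;  /* incoming migrated tasks */
    };

    static struct cpu_sched cpus[NCPU];

    void
    cpus_init(void)
    {
        int i;

        for (i = 0; i < NCPU; i++) {
            atomic_init(&cpus[i].load, 0);
            pthread_mutex_init(&cpus[i].mig_lock, NULL);
            atomic_init(&cpus[i].mig_head, NULL);
        }
    }

    /*
     * Push path: a loaded CPU sheds one task to the least loaded
     * peer.  Reading the peers' load figures takes no locks (each is
     * written only by its own CPU); the only lock taken is the
     * target's migration queue lock, held just for the enqueue.
     */
    int
    sched_push(int self, struct task *t)
    {
        int i, target = self;
        int min = atomic_load(&cpus[self].load);

        for (i = 0; i < NCPU; i++) {
            if (i != self && atomic_load(&cpus[i].load) < min) {
                min = atomic_load(&cpus[i].load);
                target = i;
            }
        }
        if (target == self)
            return 0;                   /* no one less loaded; keep it */

        pthread_mutex_lock(&cpus[target].mig_lock);
        t->t_next = atomic_load(&cpus[target].mig_head);
        atomic_store(&cpus[target].mig_head, t);
        pthread_mutex_unlock(&cpus[target].mig_lock);
        return 1;
    }

    /*
     * Receive path, run from the CPU's own scheduler.  The common
     * case is one unlocked load of the list head; a task pushed
     * between checks is simply picked up on the next pass.
     */
    void
    sched_check_migrations(int self)
    {
        struct cpu_sched *cs = &cpus[self];
        struct task *t;

        if (atomic_load(&cs->mig_head) == NULL)
            return;                     /* common case: no lock taken */

        pthread_mutex_lock(&cs->mig_lock);
        while ((t = atomic_load(&cs->mig_head)) != NULL) {
            atomic_store(&cs->mig_head, t->t_next);
            /* ... put t on this CPU's own run queue ... */
        }
        pthread_mutex_unlock(&cs->mig_lock);
    }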
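And the shed-load decision with the once-per-(N+1)-entries hysteresis
might be grafted onto the same made-up cpus[] array like this (again
only a sketch; the imbalance threshold and the choice of which task
to shed are placeholders):

    /*
     * Also continues the sketch above.  Each CPU keeps a private
     * entry counter; only every (NCPU + 1)th entry into its own
     * scheduler does it even consider shedding load, which provides
     * the hysteresis against excessive migration.
     */
    static int sched_entries[NCPU];     /* each slot touched by 1 CPU */

    void
    sched_enter(int self, struct task *candidate)
    {
        int i;

        if (++sched_entries[self] % (NCPU + 1) != 0)
            return;

        /*
         * Lock-free imbalance test: every load figure is written
         * only by the CPU it describes, so these reads need no
         * arbitration at all.
         */
        for (i = 0; i < NCPU; i++) {
            if (i != self && atomic_load(&cpus[self].load) >
                atomic_load(&cpus[i].load) + 1) {
                (void)sched_push(self, candidate);  /* shed one task */
                return;
            }
        }
    }

The "+ 1" slop is arbitrary; the point is only that the decision is
periodic, made by the loaded CPU itself, and driven entirely by
locklessly readable figures.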
> While it is true that we would like a cpu to predominantly use the
> per-cpu run-queue that it owns, we don't really lose anything in
> the way of performance by allowing cpu A to add a task to cpu B's
> run queue or for cpu A to steal a task from cpu B's run queue.
> Sure we have the overhead of a per-cpu mutex, but the reason we
> don't lose anything is that this sort of mechanism will *STILL*
> scale linearly with the number of cpus in the system

The problem is in the arbitration of the regions of memory in which
these locks live.  In fact, the scaling is *not* linear with the
number of CPUs, since the arbitration overhead goes up exponentially
with the number of CPUs when they share a contention domain.

> (whereas the global run queue in
> sched_4bsd.c constricts at a single sched_mtx and does not scale).

Preaching to the choir... we all know that a global queue, and the
global locks, suck.  But in truth, any locks suck, and replacing one
suckage with another is not really a useful exercise. 8-).

> The overhead of a per-cpu run-queue with a per-cpu mutex is *STILL*
> effectively O(1), and the more complex overheads involved with
> locating a new task to schedule from some other cpu's run queue
> when the current cpu's run-queue is empty are irrelevant, because
> you are only eating into cycles which would otherwise be idle
> anyway.

It's not actually O(1); the problem is that the lock arbitration
overhead grows arbitrarily large with the number of CPUs.  Also, the
migration cannot be between nodes in a cluster, or between non-local
CPUs in a NUMA system, because the scheduler must be able to access
the scheduling queue of the CPU from which it wishes to "pull" tasks,
and it must hold a lock on that queue for the duration.

Given a "push" model, this isn't a problem, since pushing a task is,
effectively, a messaging operation: it could go so far as to suspend
the task, move it to memory on another node entirely, and resume it
by placing it in the scheduler queue on that node.

As an old Amiga guy (;^)), you should be all for messaging, right?
8-).

-- Terry