Date: Fri, 19 Jan 2007 14:07:21 -0800 (PST)
From: Jeff Roberson
To: current@freebsd.org
Subject: Improved ULE load balancing.
Message-ID: <20070119135849.D558@10.0.0.1>
List-Id: Discussions about the use of FreeBSD-current

I'd like those of you that reported relatively poor SMP performance on
ULE to update to revision 1.179. This improved performance on my dual
xeon to about 10% better than 4BSD running supersmack. It is also highly
tunable.

Some options of interest, all under kern.sched.:

pick_pri - The default is on. Turning this off will revert to the older
    algorithm which is now called pickidle. pick_pri tries to always run
    the highest priority threads. pickidle really just tries to balance
    cpu load and doesn't take priority into consideration.
pick_pri_affinity - Number of ticks a thread has slept for before we
    stop considering it as having affinity for a given cpu.
busy_thresh - Length of run queue allowed before idle cpus will try to
    steal some of our work. This defaults to 4, but on some workloads I
    see improvement with values as low as 2.
ipi_thresh - Priorities below this generate IPIs to preempt the target
    cpu. This can decrease latency for some workloads, but at the
    expense of extra context switches and interrupt overhead.

The default configuration was fastest on most workloads on my 8-way
Opteron and 2x Xeon (+2xHTT). I tested parallel compiles and super-smack
with select-key.smack doing different workloads on both machines, and
with different numbers of processors enabled on the 8-way Opteron.

The Opteron in 8-way mode shows about a 300% speedup compared to 4BSD on
super-smack. Compile times are nearly identical across all schedulers
and platforms. On my Xeon I see a more modest 5-10% improvement on
super-smack, depending on the configuration.

Please report back your findings. Hopefully with the tunables present I
can experiment and get the settings right for a wide array of machines.

Thanks,
Jeff

---------- Forwarded message ----------
Date: Fri, 19 Jan 2007 21:56:08 +0000 (UTC)
From: Jeff Roberson
To: src-committers@FreeBSD.org, cvs-src@FreeBSD.org, cvs-all@FreeBSD.org
Subject: cvs commit: src/sys/kern sched_ule.c

jeff        2007-01-19 21:56:08 UTC

  FreeBSD src repository

  Modified files:
    sys/kern             sched_ule.c
  Log:
  Major revamp of ULE's cpu load balancing:
  - Switch back to direct modification of remote CPU run queues. This
    added a lot of complexity with questionable gain. It's easy enough
    to reimplement if it's shown to help on huge machines.
  - Re-implement the old tdq_transfer() call as tdq_pickidle(). Change
    sched_add() so we have selectable cpu choosers and simplify the
    logic a bit here.
  - Implement tdq_pickpri() as the new default cpu chooser.
    This algorithm is similar to Solaris's in that it tries to always
    run the threads with the best priorities. It is actually slightly
    more complex than Solaris's algorithm because we also tend to favor
    the local cpu over other cpus, which helps latency and also
    potentially enables cache sharing between the waking thread and the
    woken thread.
  - Add a bunch of tunables that can be used to measure effects of
    different load balancing strategies. Most of these will go away
    once the algorithm is more definite.
  - Add a new mechanism to steal threads from busy cpus when we idle.
    This is enabled with kern.sched.steal_busy and
    kern.sched.busy_thresh. The threshold is the required length of a
    tdq's run queue before another cpu will be able to steal runnable
    threads. This prevents most queue imbalances that contribute to
    long latencies.

  Revision  Changes    Path
  1.179     +293 -240  src/sys/kern/sched_ule.c
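
For illustration only, here is a minimal sketch of the busy_thresh steal
check described in the commit log above. It is not taken from
sched_ule.c: struct cpu_queue, pick_steal_victim() and steal_one() are
hypothetical names, each run queue is reduced to a plain counter, and
treating a queue as stealable once its length is greater than or equal
to the threshold is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU run queue; not the kernel's struct tdq. */
struct cpu_queue {
    int load;   /* number of runnable threads queued on this CPU */
};

/*
 * Return the index of the most loaded CPU whose run queue length has
 * reached busy_thresh, or -1 if no CPU qualifies. The idle CPU (self)
 * is never chosen as its own victim.
 * Assumption: ">=" is used for the threshold comparison.
 */
static int
pick_steal_victim(const struct cpu_queue *q, size_t ncpu, size_t self,
    int busy_thresh)
{
    int victim = -1;
    int best = 0;

    for (size_t i = 0; i < ncpu; i++) {
        if (i == self)
            continue;
        if (q[i].load >= busy_thresh && q[i].load > best) {
            best = q[i].load;
            victim = (int)i;
        }
    }
    return victim;
}

/*
 * Called on behalf of an idle CPU: move one runnable thread from the
 * busiest qualifying CPU to self. Returns 1 if a thread was stolen and
 * 0 if every other queue is shorter than busy_thresh.
 */
static int
steal_one(struct cpu_queue *q, size_t ncpu, size_t self, int busy_thresh)
{
    int victim = pick_steal_victim(q, ncpu, self, busy_thresh);

    if (victim < 0)
        return 0;
    q[victim].load--;
    q[self].load++;
    return 1;
}
```

With the default threshold of 4, an idle cpu in this model leaves a
queue of length 3 alone; lowering the threshold to 2 lets it take a
thread from that queue, which is the trade-off the busy_thresh tunable
exposes.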