Date: Mon, 1 Oct 2007 23:57:20 -0700 (PDT)
From: Jeff Roberson <jroberson@chesapeake.net>
To: Bruce Evans <brde@optusnet.com.au>
Cc: cvs-all@FreeBSD.org, src-committers@FreeBSD.org, cvs-src@FreeBSD.org,
    Jeff Roberson <jeff@FreeBSD.org>, Garance A Drosehn <gad@FreeBSD.org>,
    Ben Kaduk <minimarmot@gmail.com>
Subject: Re: cvs commit: src/sys/kern sched_ule.c
Message-ID: <20071001234448.A539@10.0.0.1>
In-Reply-To: <20071001172620.X1839@besplex.bde.org>
References: <20070930040318.094E345018@ptavv.es.net>
    <20070930153430.U583@10.0.0.1>
    <20071001172620.X1839@besplex.bde.org>

On Mon, 1 Oct 2007, Bruce Evans wrote:

> On Sun, 30 Sep 2007, Jeff Roberson wrote:
>
>> On Sat, 29 Sep 2007, Kevin Oberman wrote:
>
>>> YMMV, but ULE seems to generally work better than 4BSD for interactive
>>> uniprocessor systems. The preferred scheduler for uniprocessor servers
>>> is less clear, but many tests have shown ULE does better for those
>>> systems in the majority of cases.
>>
>> I feel it's safe to say desktop behavior on UP is definitely superior.
>
> This is unsafe to say.
>
>> I think there is no significant difference on UP between 4BSD and ULE
>
> This may be safe to say, but is inconsistent with the above.
>
>> except perhaps in context switching microbenchmarks where ULE falls behind.
>
> It is safe to say that interactive users cannot notice insignificant
> differences.  It takes a micro-benchmark to notice possibly-significant
> differences of hundreds or even thousands of nanoseconds for context
> switching.

Well, speaking of context switch microbenchmarks...

I recently looked at lmbench but was dissatisfied with the way it measures.
Specifically, I want to see how context switch times scale as you add lots
of threads that are running concurrently.  The #procs argument to lat_ctx
does not run these processes concurrently.  They each are woken in turn as
a token passes through a chain of pipes.

I wrote a simple tool that does a given number of switches with a given
number of processes.  I then simply time the total execution with 'time'.
This avoids the overhead of pipes, sleep/wakeup, and other complexities.
Instead, it uses sched_yield().

The tool is available at:

http://people.freebsd.org/~jeff/yield.c

and yield.sh is what I have been using to measure.

I found that ULE on UP was 10% slower than 4BSD at 1 and 10 concurrent
threads and 5% slower at 100.  It broke even at 1000 and was about 22%
faster at 5,000.
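
The measurement described above can be sketched roughly as follows.  This
is not the yield.c linked above, only a minimal illustration of the same
idea under stated assumptions: the function name run_yielders(), the
YIELD_MAIN compile switch, and the choice to count yields per process
(rather than in total) are all hypothetical and not taken from the
original tool.

```c
/* Sketch of a sched_yield() context-switch load; NOT the original yield.c. */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork nprocs children that each call sched_yield() nyields times
 * (per-process count is an assumption), then wait for all of them.
 * Returns 0 if every child ran to completion, -1 otherwise.
 */
static int
run_yielders(int nprocs, long nyields)
{
	int started = 0, failed = 0, status;

	for (int i = 0; i < nprocs; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			failed = 1;	/* stop forking, still reap below */
			break;
		}
		if (pid == 0) {
			/* Child: give up the CPU repeatedly, then exit. */
			for (long n = 0; n < nyields; n++)
				sched_yield();
			_exit(0);
		}
		started++;
	}

	/* Reap every child that was started, even after a fork failure. */
	for (int i = 0; i < started; i++) {
		if (wait(&status) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failed = 1;
	}
	return (failed ? -1 : 0);
}

#ifdef YIELD_MAIN
int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s nprocs nyields\n", argv[0]);
		return (1);
	}
	return (run_yielders(atoi(argv[1]), atol(argv[2])) == 0 ? 0 : 1);
}
#endif
```

Built with something like "cc -DYIELD_MAIN -o yield_sketch yield_sketch.c"
(file name hypothetical), it can be run under time(1), for example
"time ./yield_sketch 10 100000".  No timing is done inside the program, so
the elapsed time includes fork() and wait() as well as the yields.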

Then I wrote:

http://people.freebsd.org/~jeff/ulefaster.diff

This is indistinguishable from 4BSD at 1, 10, 100, and 1000 threads while
being 24% faster at 5,000.  The 5,000 case is anomalous.  I think after
100 we must no longer fit in cache.  At 5,000 the time to fork() and
wait() actually shows up significantly.

Here's output for 4BSD on UP:

        5.69 real         1.17 user         4.48 sys
        7.66 real         1.60 user         6.02 sys
        8.37 real         1.90 user         6.43 sys
       37.96 real        14.28 user        23.26 sys
       68.50 real        14.16 user        45.20 sys

And ULE with the above patch:

        5.62 real         1.23 user         4.36 sys
        7.73 real         1.97 user         5.74 sys
        8.34 real         2.01 user         6.30 sys
       38.00 real        13.60 user        24.20 sys
       52.42 real        13.84 user        38.32 sys

I did multiple runs but didn't average them.  They always ended up in the
same ballpark and the patch made such a significant change that I didn't
bother to record and analyze multiple runs.

On SMP ULE pays a price for the per-cpu run queue locks.  How well does
that pay off?  Here's ULE on an 8-core Opteron:

        3.91 real         0.35 user         3.55 sys
        1.70 real         0.44 user         6.63 sys
        1.25 real         1.77 user         8.10 sys
        4.49 real        14.46 user        21.43 sys
       14.32 real        25.58 user        88.07 sys

And 4BSD on the same:

       39.38 real         0.59 user        38.77 sys
       62.47 real         0.84 user       493.07 sys
       66.42 real        12.23 user       517.77 sys
       69.38 real        25.13 user       523.52 sys
      131.33 real        33.33 user       930.52 sys

The combination of reduced scheduler locking and improved cache affinity
pays off at about 10x the switch throughput of 4BSD.  The actual cost of
the extra synchronization in ULE is about a 5% penalty as measured with
smp.disabled = 1; however, I lost that data and am not interested in
rebooting 3 more times to reclaim it.

Cheers,
Jeff

>
> ULE may give higher priority to interactive processes, but most loss of
> interactivity is caused by blocking on I/O, and there is nothing
> a scheduler can do to speed up slow or overloaded devices.
>
> Bruce
>