Date: Thu, 07 Jul 2011 22:42:39 +0300
From: Andriy Gapon <avg@FreeBSD.org>
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc: FreeBSD Current <freebsd-current@FreeBSD.org>, "Hartmann, O." <ohartman@zedat.fu-berlin.de>, Nathan Whitehorn <nwhitehorn@FreeBSD.org>
Subject: Re: Heavy I/O blocks FreeBSD box for several seconds
Message-ID: <4E160C2F.8020001@FreeBSD.org>
In-Reply-To: <20110707151440.GA75537@troutmask.apl.washington.edu>
References: <20110706170132.GA68775@troutmask.apl.washington.edu> <5080.1309971941@critter.freebsd.dk> <20110706180001.GA69157@troutmask.apl.washington.edu> <4E14A54A.4050106@freebsd.org> <4E155FF9.5090905@FreeBSD.org> <20110707151440.GA75537@troutmask.apl.washington.edu>

on 07/07/2011 18:14 Steve Kargl said the following:
> On Thu, Jul 07, 2011 at 10:27:53AM +0300, Andriy Gapon wrote:
>> on 06/07/2011 21:11 Nathan Whitehorn said the following:
>>> On 07/06/11 13:00, Steve Kargl wrote:
>>>> AFAICT, it is a cpu affinity issue. If I launch n+1 MPI images
>>>> on a system with n cpus/cores, then 2 (and sometimes 3) images
>>>> are stuck on a cpu and those 2 (or 3) images ping-pong on that
>>>> cpu. I recall trying to use renice(8) to force some load
>>>> balancing, but vaguely remember that it did not help.
>>>
>>> I've seen exactly this problem with multi-threaded math libraries, as well.
>>
>> Exactly the same? Let's see.
>>
>>> Using parallel GotoBLAS on FreeBSD gives terrible performance because the
>>> threads keep migrating between CPUs, causing frequent cache misses.
[*]-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> So Steve reports that if he has Nthr > Ncpu, then some threads are "over-glued"
>> to a particular CPU, which results in sub-optimal scheduling for those threads.
>> I have to guess that Steve would want to see the threads being shuffled between
>> CPUs to produce more even CPU load.
>
> I'm using OpenMPI. These are N > Ncpu processes not threads,

I used 'thread' in a sense of a kernel thread. It shouldn't actually matter if
it's a process or a thread in userland in this context.

> and without
> the loss of generality let N = Ncpu + 1. It is a classic master-slave
> situation where 1 process initializes all others. The n-1 slave processes
> are then independent of each other. After 20 minutes or so of number
> crunching, each slave sends a few 10s of KB of data to the master. The
> master collects all the data, writes it to disk, and then sends the
> slaves the next set of computations to do. The computations are nearly
> identical, so each slave finishes it task in the same amount of time. The
> problem appears to be that 2 slaves are bound to the same cpu and the
> remaining N - 3 slaves are bound to a specific cpu. The N - 3 slaves
> finish their task, send data to the master, and then spin (chewing up
> nearly 100% cpu) waiting for the 2 ping-ponging slaves to finishes.
> This causes a stall in the computation. When a complete computation
> takes days to complete, theses stall become problematic. So, yes, I
> want the processes to get a more uniform access to cpus via migration
> to other cpus. This is what 4BSD appears to do.

I would imagine that periodic rebalancing would take care of this, but probably
the ULE rebalancing algorithm is not perfect.
There was a suggestion on performance@ to try to use a lower value for
kern.sched.steal_thresh, a value of 1 was recommended:
http://article.gmane.org/gmane.os.freebsd.performance/3459

>> On the other hand, you report that your threads keep being shuffled between CPUs
>> (I presume for Nthr == Ncpu case, where Nthr is a count of the number-crunching
>> threads). And I guess that you want them to stay glued to particular CPUs.
>>
>> So how is this the same problem? In fact, it sounds like somewhat opposite.
>> The only thing in common is that you both don't like how ULE works.
>
> Well, it may be similar in that N - 2 threads are bound to N - 2
> cpus, and the remaining 2 threads are ping ponging on the last

It could be, but Nathan has never said this [*] and I also have never seen this
in my very limited experiments with GotoBLAS.

> remaining cpu. I suspect that GotoBLAS has a large amount
> communication between threads, and once again the computations
> stalls waiting of the 2 threads to either finish battling for the
> 1 cpu or perhaps the process uses pthread_yield() in some clever
> way to try to get load balancing.
>

-- 
Andriy Gapon
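
P.S. For illustration only, here is a rough sketch of the arithmetic behind
the stall that Steve describes. It is a toy model, not a description of how
ULE or 4BSD actually schedules: it assumes that slaves sharing a CPU get equal
time slices, and it ignores the master and all communication cost. NCPU = 4
and the names round_minutes and spin_minutes are hypothetical; the 20 minutes
per round comes from Steve's description.

```python
# Toy model (assumptions: equal time slicing, no master, no communication).

NCPU = 4             # hypothetical CPU count, chosen only for illustration
WORK_MINUTES = 20.0  # "20 minutes or so" of number crunching per round


def round_minutes(slaves_per_cpu, work_minutes):
    """Minutes until the slowest slave finishes one round.

    A slave that shares its CPU with k - 1 other slaves is assumed to get
    1/k of that CPU, so it needs k * work_minutes of wall-clock time.
    """
    return max(k * work_minutes for k in slaves_per_cpu if k > 0)


def spin_minutes(slaves_per_cpu, work_minutes):
    """Total slave-minutes spent spinning while waiting for the slowest slave."""
    total = round_minutes(slaves_per_cpu, work_minutes)
    return sum(k * (total - k * work_minutes) for k in slaves_per_cpu)


# Reported placement: 2 slaves share one CPU, the other N - 3 = NCPU - 2
# slaves each have a CPU, and one CPU runs no slave at all.
stuck = [2] + [1] * (NCPU - 2) + [0]

# Placement Steve would like to see: one slave per CPU.
spread = [1] * NCPU

for name, placement in (("stuck", stuck), ("spread", spread)):
    print(f"{name}: round takes {round_minutes(placement, WORK_MINUTES)} min, "
          f"{spin_minutes(placement, WORK_MINUTES)} slave-minutes spent spinning")
```

Under these assumptions every round is paced by the two slaves that share a
CPU, which is why a job that runs for days would be sensitive to it.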