Date: Sat, 10 May 2008 12:53:15 +1000
From: Aristedes Maniatis <ari@ish.com.au>
To: John Baldwin <jhb@FreeBSD.org>
Cc: bzeeb+freebsd+lor@zabbadoz.net, jeff@freebsd.org, Jurgen Weber <jurgen@ish.com.au>, freebsd-stable@freebsd.org, davidxu@freebsd.org
Subject: Re: LOR on sleepqueue chain locks, Was: LOR sleepq/scrlock
Message-ID: <0DB3A235-DF87-4413-90ED-E38BC44CA2B3@ish.com.au>
In-Reply-To: <200804221334.35001.jhb@freebsd.org>
References: <77E81AD6-FBCC-4D30-A5CB-A9B918D4793F@ish.com.au> <200804181314.24974.jhb@freebsd.org> <D3B47B32-BB24-4DE9-A609-D2BB66AD5A95@ish.com.au> <200804221334.35001.jhb@freebsd.org>
On 23/04/2008, at 3:34 AM, John Baldwin wrote:

>>> The real problem at the bottom of the screen though is a real issue.
>>> It's a LOR of two different sleepqueue chain locks. The problem is
>>> that when setrunnable() encounters a swapped out thread it tries to
>>> wakeup proc0, but if proc0 is asleep (which is typical) then its
>>> thread lock is a sleep queue chain lock, so waking up a swapped out
>>> thread from wakeup() will usually trigger this LOR.
>>>
>>> I think the best fix is to not have setrunnable() kick proc0
>>> directly. Perhaps setrunnable() should return an int and return true
>>> if proc0 needs to be awakened and false otherwise. Then the sleepq
>>> code (b/c only sleeping threads can be swapped out anyway) can return
>>> that value from sleepq_resume_thread() and can call kick_proc0()
>>> directly once it has dropped all of its own locks.
>>>
>>> --
>>> John Baldwin
>>
>> The way you describe it, it almost sounds like this LOR should be
>> happening for everyone, all the time. To try and eliminate the factors
>> which trigger it for us, we tried the following: removed PAE from
>> kernel, disabled PF. Neither of these things made any difference and
>> the error is fairly quickly reproducible (within a couple of hours
>> running various things to load the machine). The one thing we did not
>> test yet is removing ZFS from the picture. Note also that this box ran
>> for years and years on FreeBSD 4.x without a hiccup (non PAE, ipfw
>> instead of pf and no ZFS of course).
>
> There are two things. 1) Most people who run witness (that I know of)
> don't run it on spinlocks because of the overhead, so LORs of spin
> locks are less well-reported than LORs of other locks (mutexes,
> rwlocks, etc.). 2) You have to have enough load on the box to swap out
> active processes to get into this situation. Between those I think
> that is why this is not more widely reported.
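For reference, the fix John proposes in the quoted text can be sketched in user space. This is only a minimal model under stated assumptions, not kernel code and not a patch: struct model_lock, struct model_thread, and every model_* function are hypothetical names invented for illustration, and the two "chain locks" are plain flags rather than real spin locks. It shows only the ordering idea: the setrunnable() stand-in reports whether proc0 needs a kick, and the caller performs the kick after releasing its own lock, so the two chain locks are never held together.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a sleepqueue chain lock (single-threaded model). */
struct model_lock { bool held; };

static struct model_lock waker_chain;   /* chain lock held by the waking path */
static struct model_lock proc0_chain;   /* chain lock proc0 is asleep on */

static int locks_held;      /* chain locks held right now */
static int max_locks_held;  /* highest number ever held at once */
static int proc0_kicks;     /* how many times proc0 was kicked */

static void model_lock_acquire(struct model_lock *l)
{
    assert(!l->held);
    l->held = true;
    if (++locks_held > max_locks_held)
        max_locks_held = locks_held;
}

static void model_lock_release(struct model_lock *l)
{
    assert(l->held);
    l->held = false;
    locks_held--;
}

/* Hypothetical model of a thread: only the state this sketch needs. */
struct model_thread {
    bool swapped_out;
    bool runnable;
};

/* Model of kick_proc0(): needs proc0's own chain lock to wake it. */
static void model_kick_proc0(void)
{
    model_lock_acquire(&proc0_chain);
    proc0_kicks++;
    model_lock_release(&proc0_chain);
}

/*
 * Model of the proposed setrunnable(): it no longer kicks proc0 itself.
 * It returns true when the thread is swapped out, meaning the caller
 * must wake proc0 later.
 */
static bool model_setrunnable(struct model_thread *td)
{
    td->runnable = true;
    return td->swapped_out;
}

/*
 * Model of the wakeup path: remember the return value while holding the
 * waker's chain lock, drop that lock, and only then kick proc0.
 */
static void model_wakeup(struct model_thread *td)
{
    bool need_kick;

    model_lock_acquire(&waker_chain);
    need_kick = model_setrunnable(td);
    model_lock_release(&waker_chain);

    if (need_kick)
        model_kick_proc0();
}
```

In this model, calling model_wakeup() on a swapped-out thread kicks proc0 exactly once and never holds more than one chain lock at a time, which is the property the proposal is after.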
Hi John,

Thanks for your efforts so far to track this LOR down. I've been keeping an eye on cvs logs, but haven't seen anything which looks like a patch for this.

* is this still outstanding?
* or will it be addressed soon?
* if not, should I create a PR so that it doesn't get forgotten?
* in our case, although we can trigger it quickly with some load, the problem occurs (and causes a complete machine lock) even under < 10% load. Not sure if the combination of PAE/ZFS/SCHED_ULE exacerbates that in any way compared to a 'standard' build.

Thank you
Ari Maniatis

-------------------------->
ish
http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001   fax +61 2 9550 4001
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A