Date: Tue, 8 Jun 2004 13:20:08 -0400 (EDT)
From: Robert Watson
To: Ali Niknam
cc: freebsd-hackers@FreeBSD.org
cc: John Baldwin
In-Reply-To: <00bd01c44cb5$ccf5f840$0400a8c0@redguy>
Subject: Re: FreeBSD 5.2.1: Mutex/Spinlock starvation?

On Mon, 7 Jun 2004, Ali Niknam wrote:

> > There isn't a timeout.  Rather, the lock spins so long as the current
> > owning thread is executing on another CPU.
>
> Interesting. Is there a way to 'lock' CPU's so that they always run on
> 'another' CPU ?
>
> Unfortunately as we speak the server is down again :( This all makes me
> wonder whether I should simply go back to 4.10.

No one would blame you for backing off -CURRENT to -STABLE.
On the other hand, having high workloads against -CURRENT is going to be
critical to identifying weaknesses in -CURRENT so we can improve them.
Unfortunately, it's something of a chicken-and-egg problem...

> I decreased the maximum number of apache children to 1400 and the server
> seems to be barely holding on:
> last pid:  2483;  load averages: 75.77, 28.63, 11.40  up 0+00:04:32  19:35:07
> 1438 processes: 2 running, 294 sleeping, 1142 lock
> CPU states:  6.2% user,  0.0% nice, 62.6% system,  7.5% interrupt, 23.8% idle
> Mem: 698M Active, 27M Inact, 209M Wired, 440K Cache, 96M Buf, 1068M Free
> Swap: 512M Total, 512M Free
>
> Are there anymore quite stable things to do ? That is except for upping
> to current, which I frankly feel is too dangerous...

There are a number of known weaknesses in 5.2.1 that are resolved in
-CURRENT, but the update would also involve substantial risk, as there's
some heavy moving going on in -CURRENT to improve network performance,
etc.  I haven't followed all of your system description in detail, but it
seems like the primary thing to do right now, assuming you are still able
to keep 5.2.1 running on the box and are able to futz with the
configuration some, is to identify the specific source of the problem
you're experiencing.  Clearly, too much work is going on in the kernel.
The question is, what work?  It's likely you're running into an expensive
edge case; it's possible it's resolved in HEAD, and it could be that a
low-risk backport would resolve it.  It's also possible you're running
into an unresolved problem in HEAD.

The best-case scenario from my perspective would be that you could provide
an equivalent workload against a test box where we could experiment with a
number of debugging settings, as well as simply trying -CURRENT...  It
sounds like we've already tried some of the easy plugs, such as switching
schedulers, enabling adaptive mutexes, etc.
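
For reference, the adaptive mutex behaviour quoted at the top of this message (no timeout; a waiter spins only while the lock's owner is executing on another CPU) can be modelled roughly as below. This is a toy user-space sketch in Python, not the FreeBSD kernel code: the names `Thread`, `AdaptiveMutex`, and `contend` are hypothetical, and the assumption that a waiter blocks as soon as the owner is off-CPU is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thread:
    """Hypothetical stand-in for a kernel thread."""
    name: str
    # CPU the thread is executing on, or None if it is not running.
    running_on_cpu: Optional[int] = None


@dataclass
class AdaptiveMutex:
    """Hypothetical stand-in for a mutex with an owner field."""
    owner: Optional[Thread] = None


def contend(mutex: AdaptiveMutex, waiter: Thread) -> str:
    """Decide what a waiter does on one pass over a contended lock.

    Returns "acquire", "spin", or "block".
    """
    if mutex.owner is None:
        # Uncontended: take the lock.
        mutex.owner = waiter
        return "acquire"
    if mutex.owner.running_on_cpu is not None:
        # Owner is executing on some other CPU, so it may release the
        # lock soon: keep spinning. There is no timeout on this.
        return "spin"
    # Owner is not running (preempted or asleep): spinning would only
    # burn CPU, so the waiter gives up its CPU and blocks instead.
    return "block"
```

In this model, a box with many waiters and an owner that stays on-CPU for a long time would show those waiters burning system time in the "spin" branch, which is one of the things worth ruling in or out when looking for where the kernel time is going.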

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research