From: "Ali Niknam"
Date: Thu, 3 Jun 2004 23:43:18 +0200
Subject: FreeBSD 5.2.1: Mutex/Spinlock starvation?
List-Id: Technical Discussions relating to FreeBSD

Hi Guys,

First of all: this is my first posting in this group, so please be gentle :)

The other day I was upgrading a system from FreeBSD 4.5 (single CPU) to FreeBSD 5.2.1 (dual CPU) and I came across a terrible problem.

The system is used as a rather busy webserver, continuously running about 1200 Apache processes and about 200 MySQL pthreads.

The problem I ran into is that when Apache starts, it needs to create a lot of children quickly. When it does so, after about a minute or so a couple of the children go into the "Giant" state.
After a few seconds, more and more processes go into the Giant state, up to the point where the system becomes totally unresponsive (even to keyboard input). The only remedy is to disconnect the UTP cable, wait a few seconds, and then kill everything.

Now the nice part: this happens only if I set Apache's MaxClients above 1250. Below 1250 the same scenario happens, but after a minute or so the system recovers!

Unfortunately I do not know enough about the internals of BSD to make a very educated guess, but I'll give it a shot nevertheless: my estimate is that, due to the tremendous number of 'locked' processes, the system is simply starved of CPU and cannot do anything. My guess is that the locking mechanism uses some kind of 'spin' to wait until the resource is unlocked (whichever resource it is; probably something network related, though).

This is based on the fact that it does not happen if you slightly decrease the number of Apache processes: in that case the same scenario plays out, but after a minute or so the system recovers (probably because it has just enough CPU left to handle everything as Apache hits its limit?).

If this is indeed the case, I was thinking of something like a sysctl MUTEX_BLOCK_THRESHOLD, set to something like 50. If the system detects that the number of locked processes is higher than this number, it stops 'spinning' for resources and instead uses a 'blocking' mechanism (simply putting the processes in a 'waiting' queue).

I would be very interested to hear what this problem could be; perhaps I can test a little if someone has solutions (I can't test much, unfortunately; it's a production system).

Best Regards,
Ali Niknam
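
For illustration only, the threshold idea described above could be sketched in userland C roughly as follows. This is not FreeBSD kernel code and says nothing about how the 5.2.1 mutex implementation actually behaves; it is a minimal sketch built on POSIX threads and C11 atomics. The names hybrid_lock, hybrid_lock_acquire and SPIN_LIMIT are made up for the example, and MUTEX_BLOCK_THRESHOLD is the hypothetical tunable from the post, not an existing sysctl.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MUTEX_BLOCK_THRESHOLD 50   /* hypothetical tunable from the post */
#define SPIN_LIMIT 1000            /* hypothetical cap on spin attempts */

struct hybrid_lock {
    pthread_mutex_t mtx;
    atomic_int waiters;            /* threads currently trying to acquire */
};

static void hybrid_lock_init(struct hybrid_lock *l)
{
    assert(l != NULL);
    pthread_mutex_init(&l->mtx, NULL);
    atomic_init(&l->waiters, 0);
}

/*
 * Acquire the lock. Returns 1 if it was obtained by spinning,
 * 0 if the caller fell back to a blocking wait.
 */
static int hybrid_lock_acquire(struct hybrid_lock *l)
{
    int spun = 0;

    assert(l != NULL);
    atomic_fetch_add(&l->waiters, 1);

    for (int i = 0; i < SPIN_LIMIT; i++) {
        /* Too many contenders: stop burning CPU and go to sleep. */
        if (atomic_load(&l->waiters) > MUTEX_BLOCK_THRESHOLD)
            break;
        /* Few contenders: poll the lock without sleeping. */
        if (pthread_mutex_trylock(&l->mtx) == 0) {
            spun = 1;
            break;
        }
    }

    /* Blocking path: the thread sleeps until the mutex is handed over. */
    if (!spun)
        pthread_mutex_lock(&l->mtx);

    atomic_fetch_sub(&l->waiters, 1);
    return spun;
}

static void hybrid_lock_release(struct hybrid_lock *l)
{
    assert(l != NULL);
    pthread_mutex_unlock(&l->mtx);
}
```

In this sketch the waiter count is the only signal used to switch from spinning to blocking; a real kernel would have other inputs to weigh (for example whether the lock owner is currently running on another CPU), so treat this purely as a picture of the proposal, not as a fix.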