From: markham breitbach
Date: Wed, 11 Aug 2010 15:43:43 -0600
To: Julian Elischer
Cc: freebsd-performance@freebsd.org
Subject: Re: massive load average spikes
Message-ID: <4C63198F.4040003@ssimicro.com>
In-Reply-To: <4C630156.6060203@elischer.org>

> load average is a time averaged thing and in the case of a
> 'thundering herd' problem you will see the LA spike up and
> come down again over time.
>
> Do you see any problem as a result of this? Or is it just curiosity?
>
> you might want to use KTR or ktrace with scheduling events if you
> really want to see the reason for this. It could just be a sampling
> error when some 'tick' coincides with the sampling..
>

I have not seen any noticeable performance degradation when the LA spikes like this; the main nuisance was Sendmail's behaviour. I have since set the options "RefuseLA=0" and "QueueLA=0" to avoid long stretches of SMTP being unavailable while the load average settled back down.

At this point it is really just a nagging feeling that something is misbehaving and that it's going to bite me when I least expect it (it always does!), so I would like to track down the source of the problem, but I'm not even sure where to begin looking. I have run ktrace on sendmail and dovecot but did not see anything that stood out, although I don't really know whether I would recognize the problem in a kdump anyway (too much information!).

I'm not at all familiar with KTR, however. Is this something that can be run on a production host, or should it be isolated to a dev box? I have cloned the jail into a dev environment on identical hardware, but I only see the issue in production. I'm not sure whether this is a matter of insufficient load or just not enough random strangeness outside of production.

Any suggestions for how KTR might help pin this down, or what to look for?
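
For what it's worth, here is a rough sketch of why a brief burst can show up as a spike that takes minutes to "average itself out". It assumes the 1-minute load average is an exponentially damped average of the runnable-process count, sampled about every 5 seconds; the function name and the workload numbers are made up for illustration and are not measurements from this host.

```python
import math


def simulate_load_average(runnable_counts, interval_s=5.0, window_s=60.0, start=0.0):
    """Return the damped average after each sample of the run-queue length.

    Assumption: each sample is blended into the running value with a decay
    factor of exp(-interval / window), as in a classic Unix load average.
    """
    decay = math.exp(-interval_s / window_s)
    load = start
    history = []
    for n in runnable_counts:
        load = load * decay + n * (1.0 - decay)
        history.append(load)
    return history


# Hypothetical workload: idle for 30 s, then 40 runnable processes seen on
# two consecutive samples (a 'thundering herd'), then idle for 2 minutes.
samples = [0] * 6 + [40] * 2 + [0] * 24
history = simulate_load_average(samples)
peak = max(history)

for i, la in enumerate(history):
    print(f"t={i * 5:4d}s  runnable={samples[i]:3d}  la1={la:6.2f}")
```

In this toy model two busy samples are enough to push the 1-minute figure past 6, and it is still not back to zero two minutes later, which is the kind of lingering value that a nonzero RefuseLA or QueueLA threshold would react to.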