From owner-freebsd-performance@FreeBSD.ORG Wed Aug 11 17:39:17 2010
Return-Path:
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id CFB9D1065674
    for ; Wed, 11 Aug 2010 17:39:17 +0000 (UTC)
    (envelope-from markham_breitbach@ssimicro.com)
Received: from mail.ssimicro.com (mail.ssimicro.com [64.247.129.10])
    by mx1.freebsd.org (Postfix) with ESMTP id 9FAF38FC17
    for ; Wed, 11 Aug 2010 17:39:17 +0000 (UTC)
Received: from beaver.ssimicro.com (beaver.ssimicro.com [199.247.84.12])
    (authenticated bits=0)
    by mail.ssimicro.com (8.14.4/8.14.4) with ESMTP id o7BH09vK089100
    (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
    for ; Wed, 11 Aug 2010 11:00:09 -0600 (MDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.96.1 at mail.ssimicro.com
Message-ID: <4C62D827.2030409@ssimicro.com>
Date: Wed, 11 Aug 2010 11:04:39 -0600
From: markham breitbach
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: freebsd-performance@freebsd.org
X-Enigmail-Version: 1.1.1
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: massive load average spikes
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 11 Aug 2010 17:39:17 -0000

Good Day,

I am running into an issue where I am seeing load average on a server suddenly jump from
nominal values around 0.5 to anywhere from 10 up over 70 in under 1 second. This does not
seem to be related to CPU overload, and LA immediately begins to fall back again to
nominal. This does not seem to happen with any regular frequency, and can happen several
times an hour or not for hours.

I am running 6.4-RELEASE-p8 with the SMP kernel and interrupt polling enabled.
(I have also tried without either)

The server is running a mail server in a jail (sendmail, dovecot, etc) with the jail being
a full "build world" and servicing about 2000 users for SMTP, POP3 and IMAP. The hardware
is a Silicon Mechanics (Super Micro) dual 4core Xeon (E5405) with 4GB RAM and multiple
7200 RPM sata.

I have tried watching top, vmstat, iostat and systat to see if I can correlate something
to these spikes in load average, but nothing really stands out.

Can anyone suggest what may be causing this or how to track that down?

Many Thanks,

Markham Breitbach
SSi Network Operations

From owner-freebsd-performance@FreeBSD.ORG Wed Aug 11 18:00:00 2010
Return-Path:
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 22DFB1065673
    for ; Wed, 11 Aug 2010 18:00:00 +0000 (UTC)
    (envelope-from cswiger@mac.com)
Received: from asmtpout028.mac.com (asmtpout028.mac.com [17.148.16.103])
    by mx1.freebsd.org (Postfix) with ESMTP id 0A92D8FC0C
    for ; Wed, 11 Aug 2010 17:59:59 +0000 (UTC)
MIME-version: 1.0
Content-transfer-encoding: 7BIT
Content-type: text/plain; charset=us-ascii
Received: from cswiger1.apple.com ([17.209.4.71])
    by asmtp028.mac.com (Sun Java(tm) System Messaging Server 6.3-8.01 (built Dec 16 2008; 32bit))
    with ESMTPSA id <0L70006FM1Z2CT70@asmtp028.mac.com>
    for freebsd-performance@freebsd.org; Wed, 11 Aug 2010 10:59:27 -0700 (PDT)
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0
    ipscore=0 phishscore=0 bulkscore=0 adultscore=0 classifier=spam adjust=0
    reason=mlx engine=6.0.2-1004200000 definitions=main-1008110142
X-Proofpoint-Virus-Version: vendor=fsecure
    engine=2.50.10432:5.0.10011,1.0.148,0.0.0000
    definitions=2010-08-11_07:2010-08-11, 2010-08-11, 1970-01-01 signatures=0
From: Chuck Swiger
In-reply-to: <4C62D827.2030409@ssimicro.com>
Date: Wed, 11 Aug 2010 10:59:26 -0700
Message-id: <949C0FF2-04AA-4440-82B0-F44A13B8F0C2@mac.com>
References: <4C62D827.2030409@ssimicro.com>
To: markham breitbach
X-Mailer: Apple Mail (2.1081)
Cc: freebsd-performance@freebsd.org
Subject: Re: massive load average spikes
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 11 Aug 2010 18:00:00 -0000

Hi--

On Aug 11, 2010, at 10:04 AM, markham breitbach wrote:
> I am running into an issue where I am seeing load average on a server suddenly jump from
> nominal values around 0.5 to anywhere from 10 up over 70 in under 1 second. This does not
> seem to be related to CPU overload, and LA immediately begins to fall back again to
> nominal. This does not seem to happen with any regular frequency, and can happen several
> times an hour or not for hours.
[ ... ]
> Can anyone suggest what may be causing this or how to track that down?

>From the (limited) available data, I'd imagine someone is doing wardialling of your mail service to try common username/password combinations and break in. Especially if they are connecting via POP3S / IMAPS ports and doing SSL negotiation, there's a very high burst of CPU load, as imap or pop daemons get forked to handle the requests, then quit immediately afterwards when the login attempt fails. You won't see much change in memory loading unless they do get a valid login since the Dovecot daemons are already resident & there's no real I/O made to disk until it looks up a real user's mail.

Looking at tcpdump for new connection requests or checking the Dovecot mail logs for a slew of attempted logins for invalid users, and correlating with your load spikes would be a way of checking on this theory....

Regards,
--
-Chuck

From owner-freebsd-performance@FreeBSD.ORG Wed Aug 11 18:56:52 2010
Return-Path:
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 2C4E71065674
    for ; Wed, 11 Aug 2010 18:56:52 +0000 (UTC)
    (envelope-from markham_breitbach@ssimicro.com)
Received: from mail.ssimicro.com (mail.ssimicro.com [64.247.129.10])
    by mx1.freebsd.org (Postfix) with ESMTP id 090218FC13
    for ; Wed, 11 Aug 2010 18:56:51 +0000 (UTC)
Received: from beaver.ssimicro.com (beaver.ssimicro.com [199.247.84.12])
    (authenticated bits=0)
    by mail.ssimicro.com (8.14.4/8.14.4) with ESMTP id o7BIqJ9S095973
    (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT)
    for ; Wed, 11 Aug 2010 12:52:20 -0600 (MDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.96.1 at mail.ssimicro.com
Message-ID: <4C62F272.4030703@ssimicro.com>
Date: Wed, 11 Aug 2010 12:56:50 -0600
From: markham breitbach
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: freebsd-performance@freebsd.org
References: <4C62D827.2030409@ssimicro.com> <949C0FF2-04AA-4440-82B0-F44A13B8F0C2@mac.com>
In-Reply-To: <949C0FF2-04AA-4440-82B0-F44A13B8F0C2@mac.com>
X-Enigmail-Version: 1.1.1
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: massive load average spikes
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 11 Aug 2010 18:56:52 -0000

On 11/08/10 11:59 AM, Chuck Swiger wrote:
> Hi--
>
> On Aug 11, 2010, at 10:04 AM, markham breitbach wrote:
>> I am running into an issue where I am seeing load average on a server suddenly jump from
>> nominal values around 0.5 to anywhere from 10 up over 70 in under 1 second. This does not
>> seem to be related to CPU overload, and LA immediately begins to fall back again to
>> nominal. This does not seem to happen with any regular frequency, and can happen several
>> times an hour or not for hours.
> [ ... ]
>> Can anyone suggest what may be causing this or how to track that down?
> >From the (limited) available data, I'd imagine someone is doing wardialling of your mail service to try common username/password combinations and break in. Especially if they are connecting via POP3S / IMAPS ports and doing SSL negotiation, there's a very high burst of CPU load, as imap or pop daemons get forked to handle the requests, then quit immediately afterwards when the login attempt fails. You won't see much change in memory loading unless they do get a valid login since the Dovecot daemons are already resident & there's no real I/O made to disk until it looks up a real user's mail.
>
> Looking at tcpdump for new connection requests or checking the Dovecot mail logs for a slew of attempted logins for invalid users, and correlating with your load spikes would be a way of checking on this theory....
>
> Regards,

Sorry for the limited data, It's hard to know where to draw the line between useful data
and information overload, but I'm more than happy to supply whatever other info you might
find useful.

I did take a look at my dovecot logs, and there are not more than a couple of failed auth
attempts in any given minute. Sendmail logs don't show any excessive activity when LA
spikes either.

"vmstat -w1" shows occasional spikes of processes in the run queue, but that doesn't
usually correlate to spikes in load average (although sometimes it is close).

here is a sample of vmstat and an approximately correlating output of load average for
~20s. Notice the load average spikes >40, but there are virtually no processes in the
run queue.
(/var/mail is isolated on ad10)

 procs      memory              page                   disks              faults          cpu
 r b w      avm     fre   flt  re  pi  po    fr  sr  ad4  ad6  ad8  ad10     in     sy     cs  us  sy  id
 0 1 2  1852712  141476  2535   1   1   0  1834   0    0    0    0     7  31232   5202   4403   1   2  97
 0 1 1  1852176  141184  2022   0   0   0  1673   0   14   14    0    16  31278   4706   3826   0   1  98
 0 1 2  1850264  142468  2213   0   0   0  2234   0   29   29    0    10  31394   5251   4948   0   2  98
 0 1 0  1851364  142584  1717   0   0   0  1407   0    0    0    0     0  31251   3869   4753   0   3  97
 0 1 2  1852200  141712  2054   0   0   0  1500   0    4    4    0     0  31197   3893   3393   0   1  99
 1 1 1  1857440  138980  2306   0   0   0  1384   0    7    6    0     0  31420   6814   5436   0   2  97
 0 1 2  1857984  138380  2631   0   0   0  1992   0    8    9    0    10  31469   6318   4227   0   3  97
 0 1 0  1856708  138576  2372   0   0   0  2032   0    1    1    0     0  31496   6473   4839   0   2  98
 0 1 3  1857044  138176  3602   0   0   0  2899   0    1    1    0     0  31573   9621   6006   1   3  96
 0 1 0  1856836  138208  1120   0   0   0  1106   0    1    1    0   270  32221   6226   5031   0   1  99
 2 1 1  1855824  138500  2522   0   0   0  2196   0   15   15    0    11  31619   9254   5394   0   2  97
 0 1 0  1854304  138936  2380   0   0   0  2671   0   22   22    0    20  31484   8465   5864   2   3  96
 3 1 1  1857960  136608  3026   0   0   0  1941   0    0    0    0     0  31331   9048   5327   0   3  97
 0 1 1  1865832  133420  7232   0   0   0  5103   0   14   14    0    12  31721  19322  11197   1   9  90
 3 1 0  1872044  129148  3629   0   0   0  1982   0    4    5    0     0  31904  11714   5716   0   3  97
 0 1 0  1868948  131136  4417   0   0   0  4303   0   39   39    0    38  31937  12498   7073   1   4  95
 0 1 2  1868220  131748  2117   0   0   0  1905   0    2    2    0     0  31203   4858   3604   1   2  98
 0 1 1  1867152  132172  1518   0   0   0  1367   0    0    0    0     3  31202   3190   3923   0   2  98
 0 1 2  1867016  132296  1556   0   0   0  1325   0    0    0    0     0  31133   2802   3568   0   2  98
 0 1 1  1864572  132672  2020   0   0   0  1715   0    0    0    0     0  31286   4487   5098   0   3  97
 0 1 0  1869548  130208  2117   0   0   0  1235   0    1    1    0     1  31283   4378   3211   0   1  99
 0 1 0  1868416  130040  1767   0   0   0  1485   0    0    0    0     2  31379   4929   4294   0   1  98

Wed Aug 11 12:40:55 MDT 2010
0.60 3.38 3.79
0.60 3.38 3.79
0.55 3.33 3.77
0.55 3.33 3.77
0.55 3.33 3.77
0.55 3.33 3.77
40.94 11.66 6.70
40.94 11.66 6.70
40.94 11.66 6.70
40.94 11.66 6.70
Wed Aug 11 12:41:05 MDT 2010
40.94 11.66 6.70
40.94 11.66 6.70
37.67 11.46 6.66
37.67 11.46 6.66
37.67 11.46 6.66
37.67 11.46 6.66
37.67 11.46 6.66
34.65 11.27 6.63
34.65 11.27 6.63
34.65 11.27 6.63

From owner-freebsd-performance@FreeBSD.ORG Wed Aug 11 20:15:36 2010
Return-Path:
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id A3FD51065672
    for ; Wed, 11 Aug 2010 20:15:36 +0000 (UTC)
    (envelope-from julian@elischer.org)
Received: from out-0.mx.aerioconnect.net (out-0-15.mx.aerioconnect.net [216.240.47.75])
    by mx1.freebsd.org (Postfix) with ESMTP id 5E1438FC0C
    for ; Wed, 11 Aug 2010 20:15:36 +0000 (UTC)
Received: from idiom.com (postfix@mx0.idiom.com [216.240.32.160])
    by out-0.mx.aerioconnect.net (8.13.8/8.13.8) with ESMTP id o7BK0B03026245;
    Wed, 11 Aug 2010 13:00:11 -0700
X-Client-Authorized: MaGic Cook1e
X-Client-Authorized: MaGic Cook1e
Received: from julian-mac.elischer.org (h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137])
    by idiom.com (Postfix) with ESMTP id 902432D6012;
    Wed, 11 Aug 2010 13:00:10 -0700 (PDT)
Message-ID: <4C630156.6060203@elischer.org>
Date: Wed, 11 Aug 2010 13:00:22 -0700
From: Julian Elischer
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.1.11) Gecko/20100711 Thunderbird/3.0.6
MIME-Version: 1.0
To: markham breitbach
References: <4C62D827.2030409@ssimicro.com> <949C0FF2-04AA-4440-82B0-F44A13B8F0C2@mac.com> <4C62F272.4030703@ssimicro.com>
In-Reply-To: <4C62F272.4030703@ssimicro.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.67 on 216.240.47.51
Cc: freebsd-performance@freebsd.org
Subject: Re: massive load average spikes
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 11 Aug 2010 20:15:36 -0000

On 8/11/10 11:56 AM, markham breitbach wrote:
>
>
> On 11/08/10 11:59 AM, Chuck Swiger wrote:
>> Hi--
[...]
>
> Sorry for the limited data, It's hard to know where to draw the line between useful data
> and information overload, but I'm more than happy to supply whatever other info you might
> find useful.
>
> I did take a look at my dovecot logs, and there are not more than a couple of failed auth
> attempts in any given minute. Sendmail logs don't show any excessive activity when LA
> spikes either.
>
> "vmstat -w1" shows occasional spikes of processes in the run queue, but that doesn't
> usually correlate to spikes in load average (although sometimes it is close).
[...]
>

load average is a time averaged thing and in the case of a
'thundering herd' problem you will see the LA spike up and
come down again over time.

Do you see any problem as a result of this? Or is it just curiosity?

you might want to use KTR or ktrace with scheduling events if you
really want to see the reason for this. It could just be a sampling
error when some 'tick' coincides with the sampling..

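One rough way to check the sampling-coincidence idea against the figures posted in this thread. This is only a sketch under an assumed model: that the kernel updates each load average about every 5 seconds as an exponentially damped average, new = old * e + n * (1 - e), with e = exp(-5/60), exp(-5/300) and exp(-5/900) for the 1, 5 and 15 minute figures, and n the number of runnable threads seen at that instant. The function names and the 5 second interval are illustrative assumptions, not taken from the kernel source. Solving for n from the values just before and just after the jump:

```python
import math

SAMPLE_INTERVAL = 5.0            # seconds between updates (assumed nominal value)
WINDOWS = (60.0, 300.0, 900.0)   # 1, 5 and 15 minute averaging windows


def decay(window, interval=SAMPLE_INTERVAL):
    """Per-update decay factor of an exponentially damped average."""
    return math.exp(-interval / window)


def implied_runnable(before, after, window, interval=SAMPLE_INTERVAL):
    """Solve after = before * e + n * (1 - e) for n."""
    e = decay(window, interval)
    return (after - before * e) / (1.0 - e)


# Load averages posted in the thread, just before and just after the spike.
before = (0.55, 3.33, 3.77)
after = (40.94, 11.66, 6.70)

estimates = [implied_runnable(b, a, w) for b, a, w in zip(before, after, WINDOWS)]
for window, n in zip(WINDOWS, estimates):
    print(f"{int(window // 60):2d}-minute average implies roughly {n:.0f} runnable threads")

# With nothing runnable at the next update, the 1-minute figure should simply
# decay by one factor; compare this with the 37.67 seen in the posted samples.
print(f"one decay step from 40.94 gives {40.94 * decay(60.0):.2f}")
```

Under those assumptions all three averages point at a single update that saw on the order of 500 runnable threads, followed by plain decay, which fits one very short burst landing exactly on a sampling tick rather than sustained load. The two-decimal rounding of the posted values (and any fixed-point arithmetic in the kernel) limits how precisely the three estimates agree.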
>
> Wed Aug 11 12:40:55 MDT 2010
> 0.60 3.38 3.79
> 0.60 3.38 3.79
> 0.55 3.33 3.77
> 0.55 3.33 3.77
> 0.55 3.33 3.77
> 0.55 3.33 3.77
> 40.94 11.66 6.70
> 40.94 11.66 6.70
> 40.94 11.66 6.70
> 40.94 11.66 6.70
> Wed Aug 11 12:41:05 MDT 2010
> 40.94 11.66 6.70
> 40.94 11.66 6.70
> 37.67 11.46 6.66
> 37.67 11.46 6.66
> 37.67 11.46 6.66
> 37.67 11.46 6.66
> 37.67 11.46 6.66
> 34.65 11.27 6.63
> 34.65 11.27 6.63
> 34.65 11.27 6.63
>
>
>
>

From owner-freebsd-performance@FreeBSD.ORG Wed Aug 11 21:43:47 2010
Return-Path:
Delivered-To: freebsd-performance@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id ACF5C1065674
    for ; Wed, 11 Aug 2010 21:43:47 +0000 (UTC)
    (envelope-from markham_breitbach@ssimicro.com)
Received: from mail.ssimicro.com (mail.ssimicro.com [64.247.129.10])
    by mx1.freebsd.org (Postfix) with ESMTP id 797D48FC17
    for ; Wed, 11 Aug 2010 21:43:47 +0000 (UTC)
Received: from beaver.ssimicro.com (beaver.ssimicro.com [199.247.84.12])
    (authenticated bits=0)
    by mail.ssimicro.com (8.14.4/8.14.4) with ESMTP id o7BLdCGY058963
    (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
    Wed, 11 Aug 2010 15:39:12 -0600 (MDT)
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.96.1 at mail.ssimicro.com
Message-ID: <4C63198F.4040003@ssimicro.com>
Date: Wed, 11 Aug 2010 15:43:43 -0600
From: markham breitbach
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2.8) Gecko/20100802 Thunderbird/3.1.2
MIME-Version: 1.0
To: Julian Elischer
References: <4C62D827.2030409@ssimicro.com> <949C0FF2-04AA-4440-82B0-F44A13B8F0C2@mac.com> <4C62F272.4030703@ssimicro.com> <4C630156.6060203@elischer.org>
In-Reply-To: <4C630156.6060203@elischer.org>
X-Enigmail-Version: 1.1.1
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: freebsd-performance@freebsd.org
Subject: Re: massive load average spikes
X-BeenThere: freebsd-performance@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Performance/tuning
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 11 Aug 2010 21:43:47 -0000

> load average is a time averaged thing and in the case of a
> 'thundering herd' problem you will see the LA spike up and
> come down again over time.
>
> Do you see any problem as a result of this? Or is it just curiosity?
>
> you might want to use KTR or ktrace with scheduling events if you
> really want to see the reason for this. It could just be a sampling
> error when some 'tick' coincides with the sampling..
>
>

I have not seen any noticeable performance degradation when the LA spikes like this, and
the main nuisance of this was Sendmail's behaviour. I have since set the options
"RefuseLA=0" and "QueueLA=0" to avoid long stretches of SMTP being unavailable while the
load averaged itself out.

At this point it is really just a nagging feeling that something is misbehaving and it's
going to bite me when I least expect it (it always does!), so I would like to try and
track down the source of the problems, but I'm not even sure where to begin looking. I
have run some ktrace on sendmail and dovecot, but did not see anything that stood out,
although I don't really know if I would recognize the problem in a kdump anyway (Too much
information!)

I'm not at all familiar with KTR, however. Is this something that can be run on a
production host or should it be isolated to a dev box? I have cloned the jail into a dev
environment on identical hardware, but only see the issue under production. I'm not sure
if this is a factor of insufficient load or just not enough random strangeness outside of
production.

Any suggestions for how KTR might help pin this down or what to look for?
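
As a possible first step before reaching for KTR, a small logger can at least pin an exact timestamp on each jump so it can be lined up against cron jobs, mail logs or a process listing taken at that moment. The sketch below is a hypothetical helper, not an existing tool: `find_jumps`, `watch` and the threshold of 5.0 are illustrative choices. The only system call it relies on is Python's standard `os.getloadavg()`.

```python
import os
import time


def find_jumps(samples, threshold=5.0):
    """Return (timestamp, before, after) for every rise in the 1-minute load
    average larger than `threshold` between two consecutive samples.

    `samples` is a list of (timestamp, one_minute_load) pairs.
    """
    jumps = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur - prev > threshold:
            jumps.append((ts, prev, cur))
    return jumps


def watch(polls, interval=1.0, read=os.getloadavg, clock=time.time, sleep=time.sleep):
    """Collect `polls` samples of the 1-minute load average, one per `interval`
    seconds. The reader, clock and sleep functions are injectable for testing."""
    samples = []
    for _ in range(polls):
        samples.append((clock(), read()[0]))
        sleep(interval)
    return samples


# Replay of the 1-minute figures posted earlier in the thread, using the
# sample index in place of a real timestamp.
posted = [0.60, 0.60, 0.55, 0.55, 0.55, 0.55, 40.94, 40.94, 40.94, 40.94]
jumps = find_jumps(list(enumerate(posted)))
print(jumps)
```

On a live host something like `find_jumps(watch(3600))` would cover an hour of one-second samples. Polling from userland once a second cannot show what was runnable inside the kernel at the sampling tick, so this only narrows down when to look, not what caused the spike.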