From owner-freebsd-stable@FreeBSD.ORG Thu Mar 25 19:31:00 2010
Return-Path:
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8DC281065670
	for ; Thu, 25 Mar 2010 19:31:00 +0000 (UTC)
	(envelope-from bra@fsn.hu)
Received: from people.fsn.hu (people.fsn.hu [195.228.252.137])
	by mx1.freebsd.org (Postfix) with ESMTP id 3DEF88FC1D
	for ; Thu, 25 Mar 2010 19:30:59 +0000 (UTC)
Received: by people.fsn.hu (Postfix, from userid 1001)
	id E59DB23B1E5; Thu, 25 Mar 2010 20:30:57 +0100 (CET)
X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MF-ACE0E1EA [pR: 23.8687]
X-CRM114-CacheID: sfid-20100325_20305_63671EE7
X-CRM114-Status: Good ( pR: 23.8687 )
Message-ID: <4BABB9F0.6010506@fsn.hu>
Date: Thu, 25 Mar 2010 20:30:56 +0100
From: Attila Nagy
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.23) Gecko/20090817 Thunderbird/2.0.0.23 Mnenhy/0.7.6.0
MIME-Version: 1.0
To: pyunyh@gmail.com
References: <4BAB718C.3090001@fsn.hu> <20100325183628.GD1278@michelle.cdnetworks.com>
In-Reply-To: <20100325183628.GD1278@michelle.cdnetworks.com>
X-Stationery: 0.4.10
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.3 (people.fsn.hu); Thu, 25 Mar 2010 20:30:56 +0100 (CET)
Cc: Mailing List FreeBSD Stable
Subject: Re: 8-STABLE freezes on UDP traffic (DNS), 7.x doesn't
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 25 Mar 2010 19:31:00 -0000

Pyun YongHyeon wrote:
> On Thu, Mar 25, 2010 at 03:22:04PM +0100, Attila Nagy wrote:
>
>> Hi,
>>
>> I have some recursive nameservers, running unbound and 7.2-STABLE #0:
>> Wed Sep 2 13:37:17 CEST 2009 on a bunch of HP BL460c machines (bce
>> interfaces).
>> These work OK.
>>
>> During the process of migrating to 8.x, I've upgraded one of these
>> machines to 8.0-STABLE #25: Tue Mar 9 18:15:34 CET 2010 (the date
>> indicates the approximate time when the source was checked out from
>> cvsup.hu.freebsd.org; I don't know the exact revision).
>>
>> The first problem was that the machine occasionally lost network access
>> for some minutes. I could log in on the console and see the processes
>> involved in network IO sitting in the "keglim" state, but couldn't do
>> any network IO. This lasted for some minutes, then everything came back
>> to normal.
>> I fixed this issue by raising kern.ipc.nmbclusters to 51200 (double its
>> default size), after which I no longer see these blackouts.
>>
>> But now the machine freezes. It can run for about a day, and then it
>> just freezes. I can't even break into the debugger by sending it an NMI.
>> top says:
>> last pid: 92428; load averages: 0.49, 0.40, 0.38  up 0+21:13:18  07:41:43
>> 43 processes: 2 running, 38 sleeping, 1 zombie, 2 lock
>> CPU: 1.3% user, 0.0% nice, 1.3% system, 26.0% interrupt, 71.3% idle
>> Mem: 1682M Active, 99M Inact, 227M Wired, 5444K Cache, 44M Buf, 5899M Free
>> Swap:
>>
>>   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
>> 45011 bind        4  49    0  1734M  1722M RUN    2  37:42 22.17% unbound
>>   712 bind        3  44    0 70892K 19904K uwait  0  71:07  3.86% python2.6
>>
>> The common factor in these freezes seems to be the high interrupt count.
>> Normally, under load, the CPU times look like this:
>> CPU: 3.5% user, 0.0% nice, 1.8% system, 0.4% interrupt, 94.4% idle
>>
>> I observed one "freeze" where top remained running and everything was
>> 0%, except interrupt, which was exactly 25% (the machine has four
>> cores), and another where I could save the following console output:
>> CPU: 0.0% user, 0.0% nice, 0.2% system, 50.0% interrupt, 49.8% idle
>>
>
> When you see a high number of interrupts, could you check whether they
> come from bce(4)?
> I guess you can use systat(1) to check how many
> interrupts are generated by bce(4).
>
I've tried that multiple times, but haven't yet caught a moment when the machine was still alive (so the script could run) and the interrupt count was elevated.
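For what it's worth, the mbuf-cluster exhaustion behind those "keglim" stalls can be spotted in `netstat -m` output before raising the limit. A minimal sketch of the arithmetic, using a made-up sample line in place of real `netstat -m` output (the numbers are invented for illustration):

```shell
# Hypothetical `netstat -m` line; real output on the box will differ.
sample='25600/1024/26624/25600 mbuf clusters in use (current/cache/total/max)'

# Split on '/' and spaces: field 1 is current usage, field 4 the cap
# (kern.ipc.nmbclusters).  Usage near 100% of the cap is what produces
# the "keglim" stalls described above.
usage=$(echo "$sample" | awk -F'[/ ]' '{ printf "%.0f", 100 * $1 / $4 }')
echo "mbuf clusters: ${usage}% of the configured maximum in use"
```

On the live machine one would run `netstat -m` directly and, if usage sits near the cap, raise the kern.ipc.nmbclusters loader tunable (e.g. in /boot/loader.conf), as was done here.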
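One way to quantify the question above, had the machine stayed responsive, is to difference two snapshots of the per-device counters that `vmstat -i` reports (systat(1)'s vmstat screen shows the same counters live). A sketch, with invented bce0 lines and an assumed irq number standing in for real `vmstat -i` output:

```shell
# Two hypothetical `vmstat -i` lines for bce0, taken 10 seconds apart;
# the irq number and totals are made up for illustration.
before='irq256: bce0                  123400        110'
after='irq256: bce0                  173400        160'
interval=10

# Field 3 is the cumulative interrupt count; differencing it over the
# sampling interval gives the rate.  A large jump here during one of the
# 25%/50% interrupt-time episodes would point the finger at bce(4).
c0=$(echo "$before" | awk '{ print $3 }')
c1=$(echo "$after"  | awk '{ print $3 }')
rate=$(( (c1 - c0) / interval ))
echo "bce0: ${rate} interrupts/sec"
```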