From owner-freebsd-stable@FreeBSD.ORG Tue May 3 09:22:59 2011 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 639CE106567A; Tue, 3 May 2011 09:22:59 +0000 (UTC) (envelope-from dhartmei@insomnia.benzedrine.cx) Received: from insomnia.benzedrine.cx (106-30.3-213.fix.bluewin.ch [213.3.30.106]) by mx1.freebsd.org (Postfix) with ESMTP id 8EE768FC12; Tue, 3 May 2011 09:22:57 +0000 (UTC) Received: from insomnia.benzedrine.cx (localhost.benzedrine.cx [127.0.0.1]) by insomnia.benzedrine.cx (8.14.1/8.13.4) with ESMTP id p439MvwH015703 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO); Tue, 3 May 2011 11:22:57 +0200 (MEST) Received: (from dhartmei@localhost) by insomnia.benzedrine.cx (8.14.1/8.12.10/Submit) id p439Mvm4009848; Tue, 3 May 2011 11:22:57 +0200 (MEST) Date: Tue, 3 May 2011 11:22:57 +0200 From: Daniel Hartmeier To: Jeremy Chadwick Message-ID: <20110503092257.GC9657@insomnia.benzedrine.cx> References: <20110503015854.GA31444@icarus.home.lan> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20110503015854.GA31444@icarus.home.lan> User-Agent: Mutt/1.5.12-2006-07-14 Cc: freebsd-stable@freebsd.org, freebsd-pf@freebsd.org Subject: Re: RELENG_8 pf stack issue (state count spiraling out of control) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 03 May 2011 09:22:59 -0000 On Mon, May 02, 2011 at 06:58:54PM -0700, Jeremy Chadwick wrote: > Here's one piece of core.0.txt which makes no sense to me -- the "rate" > column. I have a very hard time believing that was the interrupt rate > of all the relevant devices at the time (way too high). Maybe this data > becomes wrong only during a coredump? The total column I could believe. > > ------------------------------------------------------------------------ > vmstat -i > > interrupt total rate > irq4: uart0 54768 912 > irq6: fdc0 1 0 > irq17: uhci1+ 172 2 > irq23: uhci3 ehci1+ 2367 39 > cpu0: timer 13183882632 219731377 > irq256: em0 260491055 4341517 > irq257: em1 127555036 2125917 > irq258: ahci0 225923164 3765386 > cpu2: timer 13183881837 219731363 > cpu1: timer 13002196469 216703274 > cpu3: timer 13183881783 219731363 > Total 53167869284 886131154 > ------------------------------------------------------------------------ > > Here's what a normal "vmstat -i" shows from the command-line: > > # vmstat -i > interrupt total rate > irq4: uart0 518 0 > irq6: fdc0 1 0 > irq23: uhci3 ehci1+ 145 0 > cpu0: timer 19041199 1999 > irq256: em0 614280 64 > irq257: em1 168529 17 > irq258: ahci0 355536 37 > cpu2: timer 19040462 1999 > cpu1: timer 19040458 1999 > cpu3: timer 19040454 1999 > Total 77301582 8119 The cpu0-3 timer totals seem consistent in the first output: 13183881783/1999/60/60/24 matches 76 days of uptime. The high rate in the first output comes from vmstat.c dointr()'s division of the total by the uptime: struct timespec sp; clock_gettime(CLOCK_MONOTONIC, &sp); uptime = sp.tv_sec; for (i = 0; i < nintr; i++) { printf("%-*s %20lu %10lu\n", istrnamlen, intrname, *intrcnt, *intrcnt / uptime); } >From this we can deduce that the value of uptime must have been 13183881783/219731363 = 60 (seconds). Since the uptime was 76 days (and not just 60 seconds), the CLOCK_MONOTONIC clock must have reset, wrapped, or been overwritten. I don't know how that's possible, but if this means that the kernel variable time_second was possibly going back, that could very well have messed up pf's state purging. Daniel