From owner-freebsd-stable@FreeBSD.ORG Tue Sep 28 17:43:27 2010 Return-Path: Delivered-To: stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 29CDF106564A for ; Tue, 28 Sep 2010 17:43:27 +0000 (UTC) (envelope-from jdc@koitsu.dyndns.org) Received: from qmta09.emeryville.ca.mail.comcast.net (qmta09.emeryville.ca.mail.comcast.net [76.96.30.96]) by mx1.freebsd.org (Postfix) with ESMTP id 0E3778FC08 for ; Tue, 28 Sep 2010 17:43:26 +0000 (UTC) Received: from omta15.emeryville.ca.mail.comcast.net ([76.96.30.71]) by qmta09.emeryville.ca.mail.comcast.net with comcast id CDaN1f00B1Y3wxoA9HjSiM; Tue, 28 Sep 2010 17:43:26 +0000 Received: from koitsu.dyndns.org ([98.248.41.155]) by omta15.emeryville.ca.mail.comcast.net with comcast id CHjR1f00C3LrwQ28bHjRSN; Tue, 28 Sep 2010 17:43:26 +0000 Received: by icarus.home.lan (Postfix, from userid 1000) id 7231F9B418; Tue, 28 Sep 2010 10:43:25 -0700 (PDT) Date: Tue, 28 Sep 2010 10:43:25 -0700 From: Jeremy Chadwick To: Don Lewis Message-ID: <20100928174325.GA69044@icarus.home.lan> References: <20100928143955.GA33940@libertas.local.camdensoftware.com> <201009281715.o8SHFY2W059599@gw.catspoiler.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201009281715.o8SHFY2W059599@gw.catspoiler.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: stable@FreeBSD.org, sterling@camdensoftware.com Subject: Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Sep 2010 17:43:27 -0000 On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote: > My time source is another FreeBSD box with a GPS receiver on my LAN. My > other client machine isn't seeing these time jumps. The only messages > from ntp in its log from this period are these: > > Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001 > Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001 > Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001 > Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001 I'm speaking purely about ntpd below this point -- almost certainly a separate problem/issue, but I'll explain it anyway. I'm not under the impression that the calcru messages indicate RTC clock drift, but I'd need someone like John Baldwin to validate my statement. Back to ntpd: you can addressing the above messages by adding "maxpoll 9" to your "server" lines in ntp.conf. The comment we use in our ntp.conf that documents the well-known problem: # maxpoll 9 is used to work around PLL/FLL flipping, which happens at # exactly 1024 seconds (the default maxpoll value). Another FreeBSD # user recommended using 9 instead: # http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html > > I don't know if that has any connection to time(1) running slower -- but > > perhaps ntpd is aggressively adjusting your clock? > > It seems to be pretty stable when the machine is idle: > > % ntpq -c pe > remote refid st t when poll reach delay offset jitter > ============================================================================== > *gw.catspoiler.o .GPS. 1 u 8 64 377 0.168 -0.081 0.007 > > Not too much degradation under CPU load: > > % ntpq -c pe > remote refid st t when poll reach delay offset jitter > ============================================================================== > *gw.catspoiler.o .GPS. 1 u 40 64 377 0.166 -0.156 0.026 > > I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it > much, either. > > % ntpq -c pe > remote refid st t when poll reach delay offset jitter > ============================================================================== > *gw.catspoiler.o .GPS. 1 u 35 64 377 0.169 -0.106 0.009 Still speaking purely about ntpd: The above doesn't indicate a single problem. The deltas shown in both delay, offset, and jitter are all 100% legitimate. A dd (to induce more interrupt use) isn't going to exacerbate the problem (depending on your system configuration, IRQ setup, local APIC, etc.). How about writing a small shell script that runs every minute in a cronjob that does vmstat -i >> /some/file.log? Then when you see calcru messages, look around the time frame where vmstat -i was run. Look for high interrupt rates, aside from those associated with cpuX devices. Next, you need to let ntpd run for quite a bit longer than what you did above. Your poll maximum is only 64, indicating ntpd had recently been restarted, or that your offset deviates greatly (my guess is ntpd being restarted). poll will increase over time (64, 128, 256, 512, and usually max out at 1024), depending on how "stable" the clock is. when is a counter that increments, and does clock syncing (if needed) once it reaches poll. You'd see unstable system clock indications in your syslog as well (indicated by actual +/- clock drift lines occurring regularly. These aren't the same as 2001/6001 PLL/FLL mode flipping). Sorry if this is a bit paragraph/much to take in. You might also try stopping ntpd, removing /var/db/ntpd.drift, and restarting ntpd -- then check back in about 48 hours (no I'm not kidding). This is especially necessary if you've replaced the motherboard or taken the disks from System A and stuck them in System B. All that said: I'm not convinced ntpd has anything to do with your problem. EIST or EIST-like capabilities (such as Cool'n'Quiet) are often the source of the problem. "device cpufreq" might solve your issue entirely, hard to say. -- | Jeremy Chadwick jdc@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |