From owner-freebsd-stable@FreeBSD.ORG Tue Sep 28 17:15:45 2010 Return-Path: Delivered-To: stable@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A7B6E1065673 for ; Tue, 28 Sep 2010 17:15:45 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (adsl-75-1-14-242.dsl.scrm01.sbcglobal.net [75.1.14.242]) by mx1.freebsd.org (Postfix) with ESMTP id 6B9A88FC16 for ; Tue, 28 Sep 2010 17:15:45 +0000 (UTC) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id o8SHFY2W059599; Tue, 28 Sep 2010 10:15:38 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <201009281715.o8SHFY2W059599@gw.catspoiler.org> Date: Tue, 28 Sep 2010 10:15:34 -0700 (PDT) From: Don Lewis To: sterling@camdensoftware.com In-Reply-To: <20100928143955.GA33940@libertas.local.camdensoftware.com> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Cc: stable@FreeBSD.org Subject: Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Sep 2010 17:15:45 -0000 On 28 Sep, Chip Camden wrote: > Quoth Don Lewis on Monday, 27 September 2010: >> CPU time accounting is broken on one of my machines running 8-STABLE. I >> ran a test with a simple program that just loops and consumes CPU time: >> >> % time ./a.out >> 94.544u 0.000s 19:14.10 8.1% 62+2054k 0+0io 0pf+0w >> >> The display in top shows the process with WCPU at 100%, but TIME >> increments very slowly. >> >> Several hours after booting, I got a bunch of "calcru: runtime went >> backwards" messages, but they stopped right away and never appeared >> again. >> >> Aug 23 13:40:07 scratch ntpd[1159]: ntpd 4.2.4p5-a (1) >> Aug 23 13:43:18 scratch ntpd[1160]: kernel time sync status change 2001 >> Aug 23 18:05:57 scratch dbus-daemon: [system] Reloaded configuration >> Aug 23 18:06:16 scratch dbus-daemon: [system] Reloaded configuration >> Aug 23 18:12:40 scratch ntpd[1160]: time reset +18.059948 s >> [snip] >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6836685136 usec to 5425839798 usec for pid 1526 (csh) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 4747 usec to 2403 usec for pid 1519 (csh) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 5265 usec to 2594 usec for pid 1494 (hald-addon-mouse-sy) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 7818 usec to 3734 usec for pid 1488 (console-kit-daemon) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 977 usec to 459 usec for pid 1480 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 958 usec to 450 usec for pid 1479 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 957 usec to 449 usec for pid 1478 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 952 usec to 447 usec for pid 1477 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 959 usec to 450 usec for pid 1476 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 975 usec to 458 usec for pid 1475 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1026 usec to 482 usec for pid 1474 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1333 usec to 626 usec for pid 1473 (getty) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 2469 usec to 1160 usec for pid 1440 (inetd) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 719 usec to 690 usec for pid 1402 (sshd) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 120486 usec to 56770 usec for pid 1360 (cupsd) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6204 usec to 2914 usec for pid 1289 (dbus-daemon) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 179 usec to 84 usec for pid 1265 (moused) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 22156 usec to 10407 usec for pid 1041 (nfsd) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1292 usec to 607 usec for pid 1032 (mountd) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 8801 usec to 4134 usec for pid 664 (devd) >> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 19 usec to 9 usec for pid 9 (sctp_iterator) >> >> >> If I reboot and run the test again, the CPU time accounting seems to be >> working correctly. >> % time ./a.out >> 1144.226u 0.000s 19:06.62 99.7% 5+168k 0+0io 0pf+0w >> > > > I notice that before the calcru messages, ntpd reset the clock by > 18 seconds -- that probably accounts for that. Interesting observation. Since this happened so early in the log, I thought that this time change was the initial time change after boot, but taking a closer look, the time change occurred about 4 1/2 hours after boot. The calcru messages occured another 5 1/2 hours after that. I also just noticed that this log info was from the August 23rd kernel, before I noticed the CPU time accounting problem, and not the latest occurance. Here's the latest log info: Sep 23 16:33:50 scratch ntpd[1144]: ntpd 4.2.4p5-a (1) Sep 23 16:37:03 scratch ntpd[1145]: kernel time sync status change 2001 Sep 23 17:43:47 scratch ntpd[1145]: time reset +276.133928 s Sep 23 17:43:47 scratch ntpd[1145]: kernel time sync status change 6001 Sep 23 17:47:15 scratch ntpd[1145]: kernel time sync status change 2001 Sep 23 19:02:48 scratch ntpd[1145]: time reset +291.507262 s Sep 23 19:02:48 scratch ntpd[1145]: kernel time sync status change 6001 Sep 23 19:06:37 scratch ntpd[1145]: kernel time sync status change 2001 Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 1120690857 u sec to 367348485 usec for pid 1518 (csh) Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 5403 usec to 466 usec for pid 1477 (hald-addon-mouse-sy) Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 7511 usec to 1502 usec for pid 1472 (hald-runner) Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 17323 usec t o 12470 usec for pid 1472 (hald-runner) [snip] The time jumps are even larger. There is still the large interval between the last ntp message and the first calcru message. My time source is another FreeBSD box with a GPS receiver on my LAN. My other client machine isn't seeing these time jumps. The only messages from ntp in its log from this period are these: Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001 Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001 Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001 Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001 > I don't know if that has any connection to time(1) running slower -- but > perhaps ntpd is aggressively adjusting your clock? It seems to be pretty stable when the machine is idle: % ntpq -c pe remote refid st t when poll reach delay offset jitter ============================================================================== *gw.catspoiler.o .GPS. 1 u 8 64 377 0.168 -0.081 0.007 Not too much degradation under CPU load: % ntpq -c pe remote refid st t when poll reach delay offset jitter ============================================================================== *gw.catspoiler.o .GPS. 1 u 40 64 377 0.166 -0.156 0.026 I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it much, either. % ntpq -c pe remote refid st t when poll reach delay offset jitter ============================================================================== *gw.catspoiler.o .GPS. 1 u 35 64 377 0.169 -0.106 0.009