From owner-freebsd-stable@FreeBSD.ORG Tue Sep 28 21:00:19 2010 Return-Path: Delivered-To: stable@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A2B37106567A for ; Tue, 28 Sep 2010 21:00:19 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (adsl-75-1-14-242.dsl.scrm01.sbcglobal.net [75.1.14.242]) by mx1.freebsd.org (Postfix) with ESMTP id 661EA8FC18 for ; Tue, 28 Sep 2010 21:00:18 +0000 (UTC) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id o8SL04CY060428; Tue, 28 Sep 2010 14:00:08 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <201009282100.o8SL04CY060428@gw.catspoiler.org> Date: Tue, 28 Sep 2010 14:00:04 -0700 (PDT) From: Don Lewis To: freebsd@jdc.parodius.com In-Reply-To: <20100928174325.GA69044@icarus.home.lan> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Cc: stable@FreeBSD.org, sterling@camdensoftware.com Subject: Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Sep 2010 21:00:19 -0000 On 28 Sep, Jeremy Chadwick wrote: > On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote: >> My time source is another FreeBSD box with a GPS receiver on my LAN. My >> other client machine isn't seeing these time jumps. The only messages >> from ntp in its log from this period are these: >> >> Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001 >> Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001 >> Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001 >> Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001 > > I'm speaking purely about ntpd below this point -- almost certainly a > separate problem/issue, but I'll explain it anyway. I'm not under the > impression that the calcru messages indicate RTC clock drift, but I'd > need someone like John Baldwin to validate my statement. I don't think the problems are directly related. I think the calcru messages get triggered by clcok frequency changes that get detected and change the tick to usec conversion ratio. > Back to ntpd: you can addressing the above messages by adding "maxpoll > 9" to your "server" lines in ntp.conf. The comment we use in our > ntp.conf that documents the well-known problem: Thanks I'll try that. > # maxpoll 9 is used to work around PLL/FLL flipping, which happens at > # exactly 1024 seconds (the default maxpoll value). Another FreeBSD > # user recommended using 9 instead: > # http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html > >> > I don't know if that has any connection to time(1) running slower -- but >> > perhaps ntpd is aggressively adjusting your clock? >> >> It seems to be pretty stable when the machine is idle: >> >> % ntpq -c pe >> remote refid st t when poll reach delay offset jitter >> ============================================================================== >> *gw.catspoiler.o .GPS. 1 u 8 64 377 0.168 -0.081 0.007 >> >> Not too much degradation under CPU load: >> >> % ntpq -c pe >> remote refid st t when poll reach delay offset jitter >> ============================================================================== >> *gw.catspoiler.o .GPS. 1 u 40 64 377 0.166 -0.156 0.026 >> >> I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it >> much, either. >> >> % ntpq -c pe >> remote refid st t when poll reach delay offset jitter >> ============================================================================== >> *gw.catspoiler.o .GPS. 1 u 35 64 377 0.169 -0.106 0.009 > > Still speaking purely about ntpd: > > The above doesn't indicate a single problem. The deltas shown in both > delay, offset, and jitter are all 100% legitimate. A dd (to induce more > interrupt use) isn't going to exacerbate the problem (depending on your > system configuration, IRQ setup, local APIC, etc.). I was hoping to do something to provoke clock interrupt loss. I don't see any problems when this machine is idle. The last two times that the calcru messages have occured where when I booted this machine to build a bunch of ports. I don't see any problems when this machine is idle. Offset and jitter always look really good whenever I've looked. > How about writing a small shell script that runs every minute in a > cronjob that does vmstat -i >> /some/file.log? Then when you see calcru > messages, look around the time frame where vmstat -i was run. Look for > high interrupt rates, aside from those associated with cpuX devices. Ok, I'll give this a try. Just for reference, this is what is currently reported: % vmstat -i interrupt total rate irq0: clk 60683442 1000 irq1: atkbd0 6 0 irq8: rtc 7765537 127 irq9: acpi0 13 0 irq10: ohci0 ehci1+ 10275064 169 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 21 0 irq14: ata0 90982 1 irq15: nfe0 ata1 18363 0 I'm not sure why I'm getting USB interrupts. There aren't any USB devices plugged into this machine. # usbconfig dump_info ugen0.1: at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON ugen1.1: at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON ugen2.1: at usbus2, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON ugen3.1: at usbus3, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON > Next, you need to let ntpd run for quite a bit longer than what you did > above. Your poll maximum is only 64, indicating ntpd had recently been > restarted, or that your offset deviates greatly (my guess is ntpd being > restarted). poll will increase over time (64, 128, 256, 512, and > usually max out at 1024), depending on how "stable" the clock is. when > is a counter that increments, and does clock syncing (if needed) once it > reaches poll. You'd see unstable system clock indications in your > syslog as well (indicated by actual +/- clock drift lines occurring > regularly. These aren't the same as 2001/6001 PLL/FLL mode flipping). > Sorry if this is a bit paragraph/much to take in. Yes, these readings were shortly after I rebooted the machine. It's been up a while longer now: % uptime 1:48PM up 16:44, 2 users, load averages: 0.00, 0.00, 0.00 % ntpq -c pe remote refid st t when poll reach delay offset jitter ============================================================================== *gw.catspoiler.o .GPS. 1 u 179 256 377 0.159 0.202 0.053 > You might also try stopping ntpd, removing /var/db/ntpd.drift, and > restarting ntpd -- then check back in about 48 hours (no I'm not > kidding). This is especially necessary if you've replaced the > motherboard or taken the disks from System A and stuck them in System B. I don't think the problem is the drift file. Most of the time ntp is really stable. I haven't seen any indication of clock drift that might be causing ntp to step the clock, but I haven't happened to look in the midst of port building. > All that said: I'm not convinced ntpd has anything to do with your > problem. EIST or EIST-like capabilities (such as Cool'n'Quiet) are > often the source of the problem. "device cpufreq" might solve your > issue entirely, hard to say. My kernel config includes GENERIC, which contains cpufreq. Also, kern.timecounter.hardware is ACPI-fast, which shouldn't be affected by CPU clock speed changes. This shows up in dmesg: powernow0: on cpu0