From owner-freebsd-stable@FreeBSD.ORG Tue Sep 28 23:45:03 2010 Return-Path: Delivered-To: stable@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id A6278106564A for ; Tue, 28 Sep 2010 23:45:03 +0000 (UTC) (envelope-from truckman@FreeBSD.org) Received: from gw.catspoiler.org (adsl-75-1-14-242.dsl.scrm01.sbcglobal.net [75.1.14.242]) by mx1.freebsd.org (Postfix) with ESMTP id 631418FC12 for ; Tue, 28 Sep 2010 23:45:03 +0000 (UTC) Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.13.3/8.13.3) with ESMTP id o8SNiqSK060715; Tue, 28 Sep 2010 16:44:56 -0700 (PDT) (envelope-from truckman@FreeBSD.org) Message-Id: <201009282344.o8SNiqSK060715@gw.catspoiler.org> Date: Tue, 28 Sep 2010 16:44:52 -0700 (PDT) From: Don Lewis To: freebsd@jdc.parodius.com In-Reply-To: <20100928174325.GA69044@icarus.home.lan> MIME-Version: 1.0 Content-Type: TEXT/plain; charset=us-ascii Cc: stable@FreeBSD.org, sterling@camdensoftware.com Subject: Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Sep 2010 23:45:03 -0000 On 28 Sep, Jeremy Chadwick wrote: > Still speaking purely about ntpd: > > The above doesn't indicate a single problem. The deltas shown in both > delay, offset, and jitter are all 100% legitimate. A dd (to induce more > interrupt use) isn't going to exacerbate the problem (depending on your > system configuration, IRQ setup, local APIC, etc.). > > How about writing a small shell script that runs every minute in a > cronjob that does vmstat -i >> /some/file.log? Then when you see calcru > messages, look around the time frame where vmstat -i was run. Look for > high interrupt rates, aside from those associated with cpuX devices. Looking at the timestamps of things and comparing to my logs, I discovered that the last instance of ntp instability happened when I was running "make index" in /usr/ports. I tried it again with entertaining results. After a while, the machine became unresponsive. I was logged in over ssh and it stopped echoing keystrokes. In parallel I was running a script that echoed the date, the results of "vmstat -i", and the results of "ntpq -c pe". The latter showed jitter and offset going insane. Eventually "make index" finished and the machine was responsive again, but the time was way off and ntpd croaked because the necessary time correction was too large. Nothing else anomalous showed up in the logs. Hmn, about half an hour after ntpd died I started my CPU time accounting test and two minutes into that test I got a spew of calcru messages ... Tue Sep 28 14:52:27 PDT 2010 interrupt total rate irq0: clk 64077827 999 irq1: atkbd0 26 0 irq8: rtc 8199966 127 irq9: acpi0 19 0 irq10: ohci0 ehci1+ 10356112 161 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96064 1 irq15: nfe0 ata1 23350 0 Total 82885524 1293 remote refid st t when poll reach delay offset jitter ============================================================================== *gw.catspoiler.o .GPS. 1 u 137 128 377 0.195 0.111 0.030 Tue Sep 28 14:53:27 PDT 2010 interrupt total rate irq0: clk 64137854 999 irq1: atkbd0 26 0 irq8: rtc 8207648 127 irq9: acpi0 19 0 irq10: ohci0 ehci1+ 10360184 161 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96154 1 irq15: nfe0 ata1 23379 0 Total 82957424 1293 remote refid st t when poll reach delay offset jitter ============================================================================== *gw.catspoiler.o .GPS. 1 u 56 128 377 0.195 0.111 853895. Tue Sep 28 14:54:27 PDT 2010 interrupt total rate irq0: clk 64197881 999 irq1: atkbd0 26 0 irq8: rtc 8215329 127 irq9: acpi0 21 0 irq10: ohci0 ehci1+ 10360777 161 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96244 1 irq15: nfe0 ata1 23405 0 Total 83025843 1293 remote refid st t when poll reach delay offset jitter ============================================================================== *gw.catspoiler.o .GPS. 1 u 116 128 377 0.195 0.111 853895. Tue Sep 28 14:55:27 PDT 2010 interrupt total rate irq0: clk 64257907 999 irq1: atkbd0 26 0 irq8: rtc 8223011 127 irq9: acpi0 21 0 irq10: ohci0 ehci1+ 10360836 161 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96334 1 irq15: nfe0 ata1 23424 0 Total 83093719 1292 remote refid st t when poll reach delay offset jitter ============================================================================== gw.catspoiler.o .GPS. 1 u 48 128 377 0.197 2259195 2091608 Tue Sep 28 14:56:27 PDT 2010 interrupt total rate irq0: clk 64317933 999 irq1: atkbd0 26 0 irq8: rtc 8230692 127 irq9: acpi0 21 0 irq10: ohci0 ehci1+ 10360857 161 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96424 1 irq15: nfe0 ata1 23448 0 Total 83161561 1292 remote refid st t when poll reach delay offset jitter ============================================================================== gw.catspoiler.o .GPS. 1 u 108 128 377 0.197 2259195 2091608 Tue Sep 28 14:57:27 PDT 2010 interrupt total rate irq0: clk 64377960 999 irq1: atkbd0 26 0 irq8: rtc 8238374 127 irq9: acpi0 21 0 irq10: ohci0 ehci1+ 10360869 160 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96514 1 irq15: nfe0 ata1 23469 0 Total 83229393 1292 remote refid st t when poll reach delay offset jitter ============================================================================== gw.catspoiler.o .GPS. 1 u 39 128 377 0.176 2259195 1909368 Tue Sep 28 14:59:51 PDT 2010 interrupt total rate irq0: clk 64521959 999 irq1: atkbd0 26 0 irq8: rtc 8256801 127 irq9: acpi0 21 0 irq10: ohci0 ehci1+ 10360941 160 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96730 1 irq15: nfe0 ata1 23641 0 Total 83392279 1292 remote refid st t when poll reach delay offset jitter ============================================================================== gw.catspoiler.o .GPS. 1 u 55 128 377 0.174 2259195 1707791 Tue Sep 28 15:00:51 PDT 2010 interrupt total rate irq0: clk 64581986 999 irq1: atkbd0 26 0 irq8: rtc 8264482 127 irq9: acpi0 21 0 irq10: ohci0 ehci1+ 10361001 160 irq11: fwohci0 ahc+ 132133 2 irq12: psm0 27 0 irq14: ata0 96820 1 irq15: nfe0 ata1 23658 0 Total 83460154 1292 remote refid st t when poll reach delay offset jitter ============================================================================== gw.catspoiler.o .GPS. 1 u 115 128 377 0.174 2259195 1707791 Sep 28 15:16:06 scratch ntpd[1141]: time correction of 2259 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.