Date: Mon, 21 Nov 2005 19:40:22 -0500 (EST)
From: Charles Sprickman
To: John Baldwin
Cc: freebsd-hackers@freebsd.org, Uwe Doering
Subject: Re: 4.8 "Alternate system clock has died" error
In-Reply-To: <200511211149.01165.jhb@freebsd.org>
References: <200511182215.04399.jhb@freebsd.org> <437F79F1.5040706@geminix.org> <200511211149.01165.jhb@freebsd.org>
List-Id: Technical Discussions relating to FreeBSD

On Mon, 21 Nov 2005, John Baldwin wrote:

> On Saturday 19 November 2005 02:16 pm, Uwe Doering wrote:
>> John Baldwin wrote:
>>> On Friday 18 November 2005 10:05 pm, Charles Sprickman wrote:
>>>> I tried
>>>> this query on -stable, hoping someone here can help me further
>>>> understand and troubleshoot this.
>>>>
>>>> Reference:
>>>> http://thread.gmane.org/gmane.os.freebsd.stable/32837
>>>>
>>>> In short, top, ps report 0% CPU on all processes as of a few weeks ago.
>>>> "systat -vmstat" hands out the "Alternate system clock has died" error.
>>>>
>>>> Box is running 4.8-p24 and has been up 425 days. Nothing out of the
>>>> ordinary except for the above symptoms. In searching the various
>>>> lists/newsgroups, it seems that the other folks with this problem have
>>>> fixed it in various ways:
>>>>
>>>> -early 4.x users referenced a PR that was committed before 4.8
>>>> -some 5.3 users reported this with unknown resolution/cause
>>>> -sending init a HUP was suggested (tried it, no luck)
>>>> -setting kern.timecounter.method: 1 (tried it, no luck)
>>>> -one user seemed to actually have a dead timer
>>>
>>> Actually, there was a patch that was committed in 5.4 and 6.0 for this
>>> issue. You can see the diff here:
>>>
>>> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/i386/isa/clock.c.diff?r1=1.213&r2=1.214&f=h
>>>
>>> That patch would probably backport to 4.x fairly easily.
>>
>> I just looked at RELENG_4, and yes, backporting should be easy. Though
>> I haven't tried it yet on our machines.
>>
>> I wonder, however, what's writing to the RTC on a running server. Could
>> this event perhaps have been triggered by the recent Daylight Saving
>> Time switch?
>
> Yep. Also, if you are using ntp, then the adjustments to the time are getting
> pushed back to the RTC as well.

I run ntp everywhere.

So it certainly looks easy enough for me to change the first two sections
of the diff referenced above, but I'm having issues finding that last one
in cpu_initclocks(). It looks like that section really has changed quite a bit.
(see v.1.206)

The original PR that references this is against 4.something and only
patches in one place:

http://www.freebsd.org/cgi/query-pr.cgi?pr=17800

What's my best course of action to try and fix this? It looks like I can
take the first two hunks of that cvsweb diff and then add on the one-liner
from the PR, but I have no idea what that's actually doing. My experience
with C is limited to making very small changes to existing work, and
nothing quite as important as this one file appears to be (from reading
the commit logs on it).

Is there any interest in moving this back to 4-STABLE?

And lastly, is there any snippet of code that can twiddle the clock from
userspace and determine whether it's wedged or dead? Scheduling a reboot of
this machine gets much, much more complicated if I need to have another
box standing by due to a truly dead timer.

Thanks so much to both of you for your help...

Charles

> --
> John Baldwin <>< http://www.FreeBSD.org/~jhb/
> "Power Users Use the Power to Serve" = http://www.FreeBSD.org
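
On the userspace check asked about above, here is a minimal sketch rather
than a definitive test. It assumes that the "Alternate system clock has
died" message appears when the per-state CPU tick counters stop advancing
between samples, that the kernel exports those counters as the
kern.cp_time sysctl (not verified for 4.8), and that a Python interpreter
is available on the box. The 5-tick threshold and all function names are
hypothetical. It can only confirm the symptom from userspace; it cannot
tell a wedged RTC from dead hardware.

```python
import subprocess
import time

MIN_TICKS = 5  # hypothetical threshold: fewer ticks than this counts as stalled


def parse_cp_time(text):
    """Parse the whitespace-separated tick counters printed by sysctl."""
    return [int(field) for field in text.split()]


def ticks_elapsed(before, after):
    """Sum the per-state tick deltas between two samples."""
    if len(before) != len(after):
        raise ValueError("samples have different numbers of CPU states")
    return sum(b - a for a, b in zip(before, after))


def clock_looks_stalled(before, after, min_ticks=MIN_TICKS):
    """Return True if the counters barely moved between the two samples."""
    return ticks_elapsed(before, after) < min_ticks


def sample_cp_time():
    """Read kern.cp_time via sysctl(8); assumes this OID exists on the box."""
    out = subprocess.check_output(["sysctl", "-n", "kern.cp_time"])
    return parse_cp_time(out.decode("ascii"))


def main(interval=5):
    """Take two samples `interval` seconds apart and report the result."""
    try:
        first = sample_cp_time()
        time.sleep(interval)
        second = sample_cp_time()
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not read kern.cp_time: %s" % exc)
        return
    if clock_looks_stalled(first, second):
        print("kern.cp_time did not advance: statistics clock looks stalled")
    else:
        print("kern.cp_time advanced by %d ticks" % ticks_elapsed(first, second))


if __name__ == "__main__":
    main()
```

Running it once before and once after applying a backported patch would
show whether the counters have started moving again. Counter wraparound is
ignored in this sketch.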