From owner-freebsd-stable@FreeBSD.ORG  Fri Nov 18 23:56:25 2005
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: stable@freebsd.org
Delivered-To: freebsd-stable@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 7326F16A41F
	for <stable@freebsd.org>; Fri, 18 Nov 2005 23:56:25 +0000 (GMT)
	(envelope-from gemini@geminix.org)
Received: from gen129.n001.c02.escapebox.net (gen129.n001.c02.escapebox.net
	[213.73.91.129])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 911A243D46
	for <stable@freebsd.org>; Fri, 18 Nov 2005 23:56:24 +0000 (GMT)
	(envelope-from gemini@geminix.org)
Message-ID: <437E6A26.6050407@geminix.org>
Date: Sat, 19 Nov 2005 00:56:22 +0100
From: Uwe Doering <gemini@geminix.org>
Organization: Private UNIX Site
User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.7.12) Gecko/20051117
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Charles Sprickman <spork@bway.net>
References: <Pine.OSX.4.61.0511171853010.709@charles-sprickmans-computer.local>
	<437D91FD.8050809@geminix.org>
	<Pine.OSX.4.61.0511181729080.709@charles-sprickmans-computer.local>
In-Reply-To: <Pine.OSX.4.61.0511181729080.709@charles-sprickmans-computer.local>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Received: from gemini by geminix.org with esmtpsa (TLSv1:AES256-SHA:256)
	(Exim 4.54 (FreeBSD))
	id 1EdG5f-00093Y-30; Sat, 19 Nov 2005 00:56:23 +0100
Cc: stable@freebsd.org
Subject: Re: 4.8 "alternate system clock has died" error
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Nov 2005 23:56:25 -0000

Charles Sprickman wrote:
> On Fri, 18 Nov 2005, Uwe Doering wrote:
>> Charles Sprickman wrote:
>>
>>> I've been digging through Google for more information on this.  I 
>>> have a 4.8 box that's been up for about 430 days.  In the last week 
>>> or so, top and ps have started reporting all CPU usage numbers as 
>>> zero, and running "systat -vmstat" results in the message "The 
>>> alternate system clock has died! Reverting to ``pigs'' display".
>>> [...]
>>
>> We had this once at work, quite a while ago.  The "alternate system 
>> clock" is in fact the Real Time Clock (RTC) on the mainboard.  In our 
>> case we were lucky in that it was just the quartz device that failed 
>> due to an improperly soldered lead which finally came off.  We fixed 
>> the soldering and the problem was gone.
> 
> Are there any tools to verify that the RTC is working? 

"systat -vmstat" will show you the interrupt that the RTC drives.  In our 
case it's irq8, which is in fact labeled "rtc".  It is supposed to run 
at 128 Hz.  Under load the displayed rate can drop below that, which is 
normal.
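
To put a number on this, you can read the cumulative interrupt count 
for the rtc line twice (for example from "vmstat -i") and divide the 
difference by the elapsed time.  Here is a minimal Python sketch, 
assuming you copy the two counts by hand.  The function names and the 
1% tolerance are made up for illustration, and 128 Hz is simply the 
figure quoted above:

```python
NOMINAL_RTC_HZ = 128  # nominal rate mentioned in this message


def interrupt_rate(count_start, count_end, seconds):
    """Average interrupts per second between two cumulative readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds


def classify(rate_hz, nominal_hz=NOMINAL_RTC_HZ, tolerance=0.01):
    """Rough verdict on a measured rate (the tolerance is an assumption)."""
    if rate_hz <= 0:
        return "stalled"          # counter did not advance at all
    if rate_hz < nominal_hz * (1 - tolerance):
        return "below nominal"    # can be normal on a loaded machine
    return "nominal"


# Example: two readings taken 60 seconds apart.
rate = interrupt_rate(5000000, 5007680, 60)
print(rate, classify(rate))  # 128.0 nominal
```

A counter that does not advance at all between the two readings is the 
interesting case here; a rate somewhat below 128 is not alarming by 
itself.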

> I don't exactly 
> understand what the RTC is, but would the machine not be suffering some 
> other problems if there was an actual hardware failure?  Doesn't the 
> system rely on this to time everything from the processors to memory to 
> PCI slots and interrupts?

No, the RTC drives only the interrupt that is responsible for collecting 
the CPU usage data.  When it fails, the CPU usage in "top", "ps" etc. 
just drops to zero, as you've observed, but the server continues to run. 
If the failure is permanent, though, the machine refuses to boot.  At 
least that's what happened in our case.  Apparently the RTC chip is 
essential to the mainboard's boot sequence.  For instance, the initial 
date and time information comes from this chip.
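
To illustrate why the numbers go to zero rather than the machine 
stopping, here is a toy model in Python of tick-based usage accounting. 
It is only a sketch of the general idea (each statistics tick charges 
whatever is running at that moment), not the kernel's actual code, and 
all names in it are hypothetical:

```python
from collections import Counter


def cpu_usage(samples, processes):
    """Percent CPU per process from a list of per-tick observations.

    samples holds one entry per statistics tick, naming whatever was
    running when that tick fired.  No ticks means nothing to charge.
    """
    total = len(samples)
    counts = Counter(samples)
    return {
        name: (100.0 * counts[name] / total) if total else 0.0
        for name in processes
    }


# Ticks arriving: usage is spread over whatever was running.
print(cpu_usage(["httpd", "httpd", "sshd", "idle"], ["httpd", "sshd"]))
# {'httpd': 50.0, 'sshd': 25.0}

# Ticks no longer arriving: the processes still run, but every figure is zero.
print(cpu_usage([], ["httpd", "sshd"]))
# {'httpd': 0.0, 'sshd': 0.0}
```

In this model the processes themselves are unaffected by the missing 
ticks; only the bookkeeping goes blank, which matches what you see.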

On the other hand, if a reset corrects the problem then the RTC chip 
probably got hung, or there is a problem with the interrupt controller 
it is connected to.  On a properly working mainboard this shouldn't 
happen, of course.

> Is there any simple way to figure out if this is hardware or software?

I don't know of any.  However, we have been running FreeBSD almost since 
4.0, on various mainboards, UP and SMP, and we've never seen these 
symptoms except in the one case mentioned above.  So I suppose it's not 
a kernel bug.  I haven't looked at the PR database, though.

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net