From: Uwe Doering
Organization: Private UNIX Site
Date: Sat, 19 Nov 2005 00:56:22 +0100
To: Charles Sprickman
Cc: stable@freebsd.org
Subject: Re: 4.8 "alternate system clock has died" error
Message-ID: <437E6A26.6050407@geminix.org>
References: <437D91FD.8050809@geminix.org>

Charles Sprickman wrote:
> On Fri, 18 Nov 2005, Uwe Doering wrote:
>> Charles Sprickman wrote:
>>
>>> I've been digging through Google for more information on this.  I
>>> have a 4.8 box that's been up for about 430 days.  In the last week
>>> or so, top and ps have started reporting all CPU usage numbers as
>>> zero, and running "systat -vmstat" results in the message "The
>>> alternate system clock has died! Reverting to ``pigs'' display".
>>> [...]
>>
>> We had this once at work, quite a while ago.  The "alternate system
>> clock" is in fact the Real Time Clock (RTC) on the mainboard.  In our
>> case we were lucky in that it was just the quartz device that failed
>> due to an improperly soldered lead which finally came off.  We fixed
>> the soldering and the problem was gone.
>
> Are there any tools to verify that the RTC is working?

"systat -vmstat" will show you the interrupt that it drives.  In our
case it's irq8, which is in fact labeled "rtc".  It is supposed to run
at 128 Hz.  Under load it can drop to some lower value.  This is normal.

> I don't exactly
> understand what the RTC is, but would the machine not be suffering some
> other problems if there was an actual hardware failure?  Doesn't the
> system rely on this to time everything from the processors to memory to
> PCI slots and interrupts?

No, the RTC drives only the interrupt that is responsible for collecting
the CPU usage data.  When it fails, the CPU usage in "top", "ps" etc.
just drops to zero, as you've observed, but the server continues to run.

If the failure is permanent, the machine refuses to boot, though.  At
least that's what happened in our case.  Apparently the RTC chip is
essential to the mainboard's boot sequence.  For instance, the initial
date and time information comes from this chip.

On the other hand, if a reset corrects the problem, then the RTC chip
probably got hung, or there is a problem with the interrupt controller
it is connected to.  On a properly working mainboard this shouldn't
happen, of course.

> Is there any simple way to figure out if this is hardware or software?

I don't know of any.  However, we have been running FreeBSD almost since
4.0, on various mainboards, UP and SMP, and we've never seen these
symptoms except in the one case mentioned above.  So I suppose it's not
a kernel bug.  I haven't looked at the PR database, though.

   Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
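
On the question of verifying the RTC: the check described above (watching the "rtc" interrupt and comparing its rate against the expected 128 Hz) could be scripted roughly as follows. This is only a sketch, not a tested tool. It assumes two snapshots of "vmstat -i" output taken a known number of seconds apart, and that each data line ends in two numeric columns (total, rate). The function names and the sample text are made up for illustration; the real output layout varies between FreeBSD releases.

```python
import re

# Hypothetical sample snapshots, taken 10 seconds apart (made-up numbers).
SAMPLE_BEFORE = """\
interrupt      total      rate
clk irq0     5000000       100
rtc irq8     1000000       128
Total        6000000       228
"""
SAMPLE_AFTER = """\
interrupt      total      rate
clk irq0     5001000       100
rtc irq8     1001280       128
Total        6002280       228
"""


def parse_interrupt_totals(vmstat_output):
    """Map interrupt name -> cumulative total from vmstat -i style text."""
    totals = {}
    for line in vmstat_output.splitlines():
        fields = line.split()
        # A data line ends in two integers (total, rate); this skips the
        # header line and any blank lines.
        if len(fields) < 3:
            continue
        if not (fields[-1].isdigit() and fields[-2].isdigit()):
            continue
        name = " ".join(fields[:-2])
        totals[name] = int(fields[-2])
    return totals


def interrupt_rate(before, after, seconds, label="rtc"):
    """Return interrupts per second for the source named `label`, or None."""
    start = parse_interrupt_totals(before)
    end = parse_interrupt_totals(after)
    for name, total in end.items():
        # Accept both "rtc irq8" and "irq8: rtc" style names.
        tokens = re.split(r"[\s:]+", name)
        if label in tokens and name in start:
            return (total - start[name]) / seconds
    return None


if __name__ == "__main__":
    rate = interrupt_rate(SAMPLE_BEFORE, SAMPLE_AFTER, seconds=10)
    print("rtc interrupts per second:", rate)
```

A rate near 128 would suggest the RTC interrupt is still being delivered. A rate of zero over several samples would match the "died" symptom. As noted above, a somewhat lower value under load is normal.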