From owner-freebsd-i386@FreeBSD.ORG Sun Jan 13 20:00:04 2008 Return-Path: Delivered-To: freebsd-i386@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5C1D416A41A for ; Sun, 13 Jan 2008 20:00:04 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 4330E13C458 for ; Sun, 13 Jan 2008 20:00:04 +0000 (UTC) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1]) by freefall.freebsd.org (8.14.2/8.14.2) with ESMTP id m0DK03mO080607 for ; Sun, 13 Jan 2008 20:00:04 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.2/8.14.1/Submit) id m0DK03RF080592; Sun, 13 Jan 2008 20:00:03 GMT (envelope-from gnats) Date: Sun, 13 Jan 2008 20:00:03 GMT Message-Id: <200801132000.m0DK03RF080592@freefall.freebsd.org> To: freebsd-i386@FreeBSD.org From: Bruce Evans Cc: Subject: Re: i386/119574: 7.0-RC1 times out in calibrate_clocks() X-BeenThere: freebsd-i386@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: Bruce Evans List-Id: I386-specific issues for FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 13 Jan 2008 20:00:04 -0000 The following reply was made to PR i386/119574; it has been noted by GNATS. From: Bruce Evans To: "Oliver B. Warzecha" Cc: FreeBSD-gnats-submit@FreeBSD.org, freebsd-i386@FreeBSD.org Subject: Re: i386/119574: 7.0-RC1 times out in calibrate_clocks() Date: Mon, 14 Jan 2008 00:11:34 +1100 (EST) On Fri, 11 Jan 2008, Oliver B. Warzecha wrote: >> Description: > The 7.0-RC1 GENERIC kernel fails to recognize that the seconds counter > changes during the loops in calibrate_clocks() in sys/i386/isa/clocks.c > > The same machine passes this routine quickly with 6.2-RELEASE-p8. It > is a 200 MHz AMD K6 on an ASUS P55T2P4S board. Oops. > Through generous use of debugging printf()'s while searching, I could > limit the possible causes to this lines: > [... moved this later] > As I noticed that rtcin() was reworked during the change to 7.0, > (http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/i386/isa/clock.c.diff?r1=1.222.2.2;r2=1.239.2.1;f=h) > it occurred to me that perhaps there might be some condition > in the new code. Probably. I wrote both. Apparently I optimized rtcin() too much. Try adding some delays to it, and/or more debugging code. For debugging code, just read the RTC index register and verify that it is set to the memory value that is supposed to track it. Check after writing to it and when the write is avoided because it already has the desired value according to the memory value. > As I said, on a 6.2-RELEASE-p8 kernel it all goes smoothly, I just > checked with boot -v: > Calibrating clock(s) ... i8254 clock: 1193296 Hz > CLK_USE_I8254_CALIBRATION not specified - using default frequency > Timecounter "i8254" frequency 1193182 Hz quality 0 > > It does *not* say "failed, using default i8254 clock [...]" which > would indicate the "fault" label. The calibration isn't really used unless you configure CLK_USE_I8254_CALIBRATION. Otherwise, it is just a sanity check. I stopped using this configuration a few years ago when I switched to a more dynamic calibration scheme. >> Fix: > One possible workaround would be to reset the timeout variable > to a more sane value, for the first loop it gets set to > 100000000. But then this loop seems to work okay, so I have > no idea what a sane value would be. At least it would fail > fast. 100 million is supposed to be 100 seconds with a large safety margin. This depends on rtcin() taking at least 1 usec by doing slow hardware accesses. But 100 seconds is too long and rtcin() normally takes much longer than 1us. I reduced it to 1 million locally. You should reduce it to debug this (so that it times out after < 10 seconds). > timeout = 100000000; > > /* Read the mc146818A seconds counter. */ > for (;;) { > if (!(rtcin(RTC_STATUSA) & RTCSA_TUP)) { > sec = rtcin(RTC_SEC); > break; > } > if (--timeout == 0) > goto fail; > } > // printf("seconds counter read ... \n"); The optimization might break this by going too fast -- it now reads the status register as fast as possible, where before it had lots of delays. It still has the delays (supposed to be done more carefully than before when switching to reading the seconds register, so I would expect the read of the seconds register to be as correct as before, but it apparently isn't. > /* Wait for the mC146818A seconds counter to change. */ > start_sec = sec; > for (;;) { > if (!(rtcin(RTC_STATUSA) & RTCSA_TUP)) { > sec = rtcin(RTC_SEC); > if (sec != start_sec) > break; > } > if (--timeout == 0) > goto fail; > } > // printf("seconds counter changed ... \n"); > (my debugging printf()s commented out with C++ comments) This should behaves similarly. > The first output "seconds counter read" gets printed very fast. Do you think it gets printed too fast? It should take 1/2 second on average to reach the next seconds boundary, then another second to calibrate. Configure CLK_CALIBRATION_LOOP and boot with -v to see the calibration repeated precisely every second if it is working. > After that the machine counts back from 0 to 0 and after some minutes > he falls out of the loop and "seconds counter changed" gets never > printed. You could also try watching the RTCSA_TUP bit. This should change every second and stay set for a while. Some logic would be required to limit printfs to 1 per change. This bit alone could be used to detect seconds boundaries. I just noticed that I reduced the accuracy of the calibration by adding delays to rtcin(). I had the delays commented out in my version but put them back when I optimized the usual case of rtcin() being called from the RTC interrupt handler (the index register is normally constant then). The (relative) accuracy is limited the time that it takes to read the seconds register and the time it takes to read the I8254, and though the code is arranged so that the inaccuracy is much less than the sum of these times, increasing the times doesn't help. There is probably also more jitter from branch misprediction in rtcin() when switching indexes. So watching only the RTCSA_TUP bit would be good for accuracy too. The seconds register could be read in non-time-critical code after reading all the other registers as a sanity check (check that it is incremented by 1; the current code only checks that it changes). Bruce