From owner-freebsd-hackers Sun Jun 2 20:57:56 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id UAA00718 for hackers-outgoing; Sun, 2 Jun 1996 20:57:56 -0700 (PDT)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id UAA00712 for ; Sun, 2 Jun 1996 20:57:50 -0700 (PDT)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id NAA22079; Mon, 3 Jun 1996 13:52:44 +1000
Date: Mon, 3 Jun 1996 13:52:44 +1000
From: Bruce Evans
Message-Id: <199606030352.NAA22079@godzilla.zeta.org.au>
To: deborah@microunity.com, gusw@zedat.fu-berlin.de
Subject: Re: Adaptec 2940 U makes fatal bus resets!
Cc: freebsd-hackers@freebsd.org
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

>I thus made a 2.2-960501-SNAP kernel and tried it out -- nothing went
>better with the new kernel. So, I had to help myself: The problem, and
>the fixes for it -- yes, it seems like I fixed the problem -- point
>into a major problem that FreeBSD might have particularly on fast
>machines: timeout timers or counters seem to be initialized too small,

Maybe.

>and thus, timeout states occur prematurely. Two evidences from
>different parts of the kernel: (1) the fdc driver and (2) the aic7xxx
>driver.

>(1) FDC driver

>Please look at this (i386/isa/fdc.c):

>int
>in_fdc(fdcu_t fdcu)
>{
>	int baseport = fdc_data[fdcu].baseport;
>	int i, j = 100000;
>	while ((i = inb(baseport+FDSTS) & (NE7_DIO|NE7_RQM))
>		!= (NE7_DIO|NE7_RQM) && j-- > 0)
>		if (i == NE7_RQM)
>			return fdc_err(fdcu, "ready for output in input\n");
>	if (j <= 0)
>		return fdc_err(fdcu, "input ready timeout\n");
>...

>This is obviously a counter, not a timer. My machine is fast, it
>counts considerably more in the same amount of time, and thus results
>in nasty timeouts (that even lock the machine sometimes)

It actually acts as a timer. inb() is very slow on all machines.
On all ISA machines, inb() takes about 1-1.25 usec. On PCI machines, it
may be faster, but it probably won't be more than a few times faster,
and certainly can't be more than 100 times faster. The initial count is
large enough to allow for a speedup of a few thousand. On my ASUS
P55TP4XE (rev.2.4), inb(0x1f0+FDSTS) actually takes 1180 ns, so the loop
goes only about 10/9 or 11/9 times as fast as on my slow ISA systems,
and the loop times out after about 118 ms (100000 reads at 1180 ns
each).

Timeouts occurred because of a bug elsewhere in the driver and unusual
behaviour of the UMC i/o chip. The chip sometimes interrupted early in
response to i/o commands. This caused the driver to enter the spinloop
too early and busy-wait until i/o completion. 118 ms is long enough for
i/o to complete in most cases except after a seek, when it usually takes
slightly less than one disk revolution (200 ms or 167 ms) for i/o to
complete. Increasing the timeout masked the problem. The fix was to
clear all the interrupts generated by reset instead of just one. This
has been fixed in -current and -stable for a couple of months.

> We need to depend the init value of j on the speed of the
>machine. And, after all, we shouldn't just count and block the whole
>machine from doing better things. Insert a tsleep()!

Interrupt handlers can't call tsleep(). In this case, there is nothing
better to do than to busy-wait, since setting up a timeout would take
much longer than the expected wait time.

>First I define a constant with the counter value times 10, for a basic
>safety, such that it can be predefined as an option in the config
>file. I use the old value 100000 for my i486/33 ISA machine, and the
>times 10 value for the i586/133 PCI -- the timeouts didn't occur since
>I did this! But one can clearly watch the machine hang for a few
>milliseconds, when e.g. fdformat(8) is running (see how the regular
>blinking of the cursor stucks) -- I bet that a tsleep() instead of the
>counter would fix this for ever.
I saw it by watching systat. An i586/133 PCI shouldn't have a 10%
overhead for floppy interrupts! This also showed that increasing the
timeout was the wrong fix.

>O.K. that's for the FD controller driver, but the real nasty thing
>will be fixed now!

>(2) the PCI ahc driver (i386/scsi/aic7xxx.c)

I don't know much about this.

>O.K. since I no longer trust the time/tick/hz management and proper
>adjustment of my kernel to high CPU speeds, I decided to just increase
>the timeout values by the same factor of 10.

The timeout() and clock interrupt and higher level parts (including
everything to do with hz) can be trusted.

>void
>ahc_scb_timeout(unit, ahc, scb)
>...
>	timeout(ahc_timeout, (caddr_t)scb, ( TOFACT * 2 * hz));

2 seconds was already a lot. The fatal problem is probably in poor
handling of SCSI errors.

Bruce
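
The timing argument in the reply above can be checked with a little arithmetic. The following is only a sketch, not code from the fdc driver: the function names are hypothetical, and the 300 rpm and 360 rpm spindle speeds are assumed here because they correspond to the 200 ms and 167 ms revolution times mentioned in the message. It suggests why a count of 100000 can expire before one revolution when each inb() costs 1180 ns, and why a count ten times larger would hide the early-interrupt bug rather than fix it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: worst-case duration of a polling loop, in
 * microseconds, when each iteration is dominated by one port read.
 */
static uint64_t
spin_timeout_us(uint64_t iterations, uint64_t ns_per_read)
{
	assert(ns_per_read > 0);
	return iterations * ns_per_read / 1000;
}

/* Hypothetical helper: time for one disk revolution, in microseconds. */
static uint64_t
revolution_us(uint64_t rpm)
{
	assert(rpm > 0);
	return 60ULL * 1000000ULL / rpm;
}

/*
 * Returns 1 if the polling loop can give up before the disk has
 * completed one revolution, 0 otherwise.
 */
static int
spin_expires_before_revolution(uint64_t iterations, uint64_t ns_per_read,
    uint64_t rpm)
{
	return spin_timeout_us(iterations, ns_per_read) < revolution_us(rpm);
}
```

With the figures from the message, spin_timeout_us(100000, 1180) gives 118000 us, which is shorter than a 200000 us revolution at 300 rpm. Raising the count to 1000000 gives 1180000 us, which outlasts a revolution, so the timeout no longer fires even though the driver still enters the loop too early.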