From owner-freebsd-hackers Sun Jun 2 19:09:42 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3)
        id TAA25534 for hackers-outgoing; Sun, 2 Jun 1996 19:09:42 -0700 (PDT)
Received: from fub46.zedat.fu-berlin.de (fub46.fddi1.fu-berlin.de [160.45.1.46])
        by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id TAA25523 for ;
        Sun, 2 Jun 1996 19:09:14 -0700 (PDT)
Received: by fub46.zedat.fu-berlin.de (Smail3.1.29.1) id ;
        Mon, 3 Jun 1996 04:09:04 +0200 (MES)
Message-Id:
Date: Mon, 3 Jun 1996 04:09:04 +0200 (MES)
From: gusw@zedat.fu-berlin.de (Gunther Schadow)
To: deborah@microunity.com, gusw@zedat.fu-berlin.de
Subject: Re: Adaptec 2940 U makes fatal bus resets!
Cc: freebsd-hackers@freebsd.org
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

Yeah! I finally got rid of that problem! And there is some more to say
about timeout timers/counters ... please read on.

My first posting for help:

>> Hi,
>> my new machine has an Adaptec 2940 Ultra SCSI host adapter with
>> an IBM DORS SCSI2 (2 GB) disk, a CD ROM and a DAT drive attached to
>> it. Now, when I write to the DAT, everything seems O.K., however,
>> when trying to read, I sometimes get:
>> ahc0: target 6, lun 0 (st0) timed out
>> st0(ahc0:6:0): BUS DEVICE RESET message queued
>> and then:
>> st0(ahc0:6:0): Target Busy
>> I can't figure out what is wrong here, since sometimes reading works
>> just fine.

The only answer I got:

> Did you see my posting on almost the same day reporting
> problems with the same controller as you have and
> a CD-ROM drive? I can get the machine to boot, but
> when I mount from the CD I see the exact same error
> messages as you do. From our mutual problems, I suspect
> the problem is with the controller, not the SCSI
> devices attached. The controller is rather new.
> I don't have the ability to submit a problem report
> using send-pr right now - do you? Have you submitted
> a report yet? Did you get any response?
> I am copying this reply to the freebsd-hackers list
> in hope that someone there might already be working on
> the problem, or be interested enough to read the
> two articles in the newsgroup before they expire.
> -deborah bennett

Now, the good news:

SCSI BUS DEVICE RESET problems seem to be fixed

I was going mad, since the upgrade to a new Pentium 133MHz with an
Adaptec 2940 Ultra, and finally from FreeBSD-1.1.5 to 2.1-RELEASE,
turned out to be a bad deal! I never ever had such a short mean time
between two breakdowns of our fine BSD! In the last couple of hours,
the system was never up for more than one hour -- not even
386BSD-0.0new was that unstable!

I thus built a 2.2-960501-SNAP kernel and tried it out -- nothing got
better with the new kernel. So I had to help myself.

The problem, and the fixes for it -- yes, it seems like I fixed the
problem -- point to a major problem that FreeBSD might have,
particularly on fast machines: timeout timers or counters seem to be
initialized too small, and thus timeouts are declared prematurely. Two
pieces of evidence from different parts of the kernel: (1) the fdc
driver and (2) the aic7xxx driver.

(1) FDC driver

Please look at this (i386/isa/fdc.c):

    int
    in_fdc(fdcu_t fdcu)
    {
        int baseport = fdc_data[fdcu].baseport;
        int i, j = 100000;
        while ((i = inb(baseport+FDSTS) & (NE7_DIO|NE7_RQM))
            != (NE7_DIO|NE7_RQM) && j-- > 0)
            if (i == NE7_RQM)
                return fdc_err(fdcu, "ready for output in input\n");
        if (j <= 0)
            return fdc_err(fdcu, "input ready timeout\n");
        ...

This is obviously a counter, not a timer. My machine is fast, so it
counts through considerably more iterations in the same amount of
time, and the counter runs out before the controller is ready. The
result is nasty timeouts (that even lock up the machine sometimes).

We need to make the initial value of j depend on the speed of the
machine. And, after all, we shouldn't just count and block the whole
machine from doing better things. Insert a tsleep()!

I don't have time to fix this properly now, so I just appended a 0 to
the counter's initial value.
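To illustrate the counter-versus-timer point, here is a minimal user-space sketch of a polling loop whose limit is measured by a clock rather than by an iteration count. It is not kernel code and not a patch for the fdc driver. The names poll_status, status_fn, never_ready and ready_on_third_read are hypothetical, the callback merely stands in for inb(baseport+FDSTS), and 0xC0 is an arbitrary "ready" pattern. It uses the standard clock() function, which measures processor time; that tracks real time here only because the loop never sleeps.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for inb(baseport+FDSTS): returns a status byte. */
typedef unsigned (*status_fn)(void *ctx);

/*
 * Poll until (status & mask) == want, or until timeout_ms milliseconds
 * of processor time have passed.  The limit is expressed in time, so a
 * faster CPU just polls more often inside the same window instead of
 * giving up sooner.  Returns true on success, false on timeout.
 */
static bool
poll_status(status_fn read_status, void *ctx, unsigned mask, unsigned want,
    long timeout_ms)
{
    clock_t start = clock();
    clock_t limit = (clock_t)((double)timeout_ms * CLOCKS_PER_SEC / 1000.0);

    do {
        if ((read_status(ctx) & mask) == want)
            return true;
    } while (clock() - start < limit);
    return false;
}

/* Fake device that never becomes ready (assumption, for the demo only). */
static unsigned
never_ready(void *ctx)
{
    (void)ctx;
    return 0;
}

/* Fake device that reports the ready pattern 0xC0 on its third read. */
static unsigned
ready_on_third_read(void *ctx)
{
    int *reads = ctx;

    return (++*reads >= 3) ? 0xC0u : 0u;
}
```

This still busy-waits, so it only addresses the premature timeout; giving up the CPU while waiting, as with the tsleep() idea, is a separate change.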
This is part of my work-around for the timeout counter bug that
resides in in_fdc(), fd_in(), and fd_out().

First I define a constant with the counter value times 10, for basic
safety, in such a way that it can be predefined as an option in the
config file. I use the old value 100000 for my i486/33 ISA machine,
and the times-10 value for the i586/133 PCI -- the timeouts haven't
occurred since I did this! But one can clearly watch the machine hang
for a few milliseconds when e.g. fdformat(8) is running (see how the
regular blinking of the cursor gets stuck) -- I bet that a tsleep()
instead of the counter would fix this for good.

    #ifndef FDC_TIMEOUT_CNT
    # define FDC_TIMEOUT_CNT 1000000 /* added a 0 for safety */
    #endif
    ...
    int
    in_fdc(fdcu_t fdcu)
    {
        int baseport = fdc_data[fdcu].baseport;
        int i, j = FDC_TIMEOUT_CNT /* definition above in this file (GS) */;
        while ((i = inb(baseport+FDSTS) & (NE7_DIO|NE7_RQM))
            != (NE7_DIO|NE7_RQM) && j-- > 0)
            if (i == NE7_RQM)
    ...
    int
    out_fdc(fdcu_t fdcu, int x)
    {
        int baseport = fdc_data[fdcu].baseport;
        int i;

        /* Check that the direction bit is set */
        i = FDC_TIMEOUT_CNT /* ditto GS */;
        while ((inb(baseport+FDSTS) & NE7_DIO) && i-- > 0);
        if (i <= 0)
            return fdc_err(fdcu, "direction bit not set\n");

        /* Check that the floppy controller is ready for a command */
        i = FDC_TIMEOUT_CNT /* ditto GS */;
        while ((inb(baseport+FDSTS) & NE7_RQM) == 0 && i-- > 0);
        if (i <= 0)
            return fdc_err(fdcu, "output ready timeout\n");

O.K., that's it for the FD controller driver, but now for the fix to
the really nasty thing!

(2) the PCI ahc driver (i386/scsi/aic7xxx.c)

I regularly ran into failures with the following messages:

    ahc0: target 6, lun 0 (st0) timed out
    st0(ahc0:6:0): BUS DEVICE RESET message queued.
    st0(ahc0:6:0): Target Busy
    last message repeated 23 times

and even worse

    ahc0: target 0, lun 0 (sd0) timed out
    sd0(ahc0:0:0): BUS DEVICE RESET message queued.
and (since this is my /, /usr, and swap disk?) the machine never
recovers from this and eventually hangs or panics, either way leaving
a damaged filesystem behind (fortunately the ufs code is so robust
that the damage is never fatal).

I concluded that this failure is closely related to swap activity. It
does not happen when there is high load on the machine and disk, but
rather when there is almost no activity and some processes are hanging
around waiting and eventually get swapped out ... and ... BANG!, game
over! :-(

O.K., since I no longer trust the time/tick/hz management and the
proper adjustment of my kernel to high CPU speeds, I decided to just
increase the timeout values by the same factor of 10.

    #define TOFACT 10 /* Time Out FACTor */

Then search for every timeout() function call and increase the 3rd
parameter, which seems to be the timer value, by that factor, e.g.
like here:

    void
    ahc_scb_timeout(unit, ahc, scb)
    ...
        scb->datalen[1] = 0;
        scb->datalen[2] = 0;
        outb(SCBCNT + iobase, 0x80);
        outsb(SCBARRAY+iobase,scb,SCB_DOWN_SIZE);
        outb(SCBCNT + iobase, 0);
        ahc_add_waiting_scb(iobase, scb, list_second);
        timeout(ahc_timeout, (caddr_t)scb, ( TOFACT * 2 * hz));
                                             ^^^^^^^^^^^^^^^
    #ifdef AHC_DEBUG
        if(ahc_debug & AHC_SHOWABORTS) {
            sc_print_addr(scb->xs->sc_link);
            printf("BUS DEVICE RESET message queued.\n");
                    ^^^^^^^^^^^^^^^^^^^^^^^^
                    This was the last message before the kernel died

Now I think (hope) that I have finally gotten rid of this problem. But
let me suggest that you kernel gurus out there think about the timing
problem with fast CPUs. I've already seen some messages about clock
adjustment when I booted my 2.2-*-SNAP kernel, but in any case that
didn't help, and the FDC driver doesn't even use any timer value, not
even the wrongly initialized hz. It just counts down!

Anyway, thank you for the great work on FreeBSD!
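For a sense of scale, the arithmetic behind `TOFACT * 2 * hz` can be sketched as below, on the understanding that the third argument of timeout() is a count of clock ticks and hz is the number of ticks per second. The helper names timeout_ticks and ticks_to_ms are hypothetical, and hz = 100 is only an assumed value, since the message does not state it.

```c
/*
 * Hypothetical helper: tick count to pass to timeout() for a delay of
 * `seconds`, multiplied by a safety factor such as TOFACT.
 */
static int
timeout_ticks(int seconds, int factor, int hz)
{
    return factor * seconds * hz;
}

/* Hypothetical helper: milliseconds covered by `ticks` at `hz` ticks/s. */
static long
ticks_to_ms(int ticks, int hz)
{
    return (long)ticks * 1000L / hz;
}
```

With the assumed hz = 100, the original `2 * hz` is 200 ticks (2 seconds), and `TOFACT * 2 * hz` with TOFACT = 10 is 2000 ticks (20 seconds). A tick-based timeout like this does not shrink on a faster CPU as long as hz matches the real tick rate, which is the difference from the count-down loop in the FDC driver.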
regards
Gunther Schadow

PS: the Sound Blaster driver has a bug as well: it won't compile with
option JAZZ16, since JAZZ_DMA16 (used in i386/isa/sound/sb_dsp.c) is
never defined. I think that these lines belong in sound_config.h:

    #ifdef JAZZ16
    #define JAZZ_DMA16 1
    #endif

At least, that worked for me.