Date:      Mon, 3 Jun 1996 04:09:04 +0200 (MES)
From:      gusw@zedat.fu-berlin.de (Gunther Schadow)
To:        deborah@microunity.com, gusw@zedat.fu-berlin.de
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Adaptec 2940 U makes fatal bus resets!
Message-ID:  <m0uQP4q-000dR6C@fub46.zedat.fu-berlin.de>

Yeah! I finally got rid of that problem! And there is more to say
about timeout timers/counters ... please read on.

My first posting for help:

>>    Hi,

>>    my new machine has an Adaptec 2940 Ultra SCSI host adapter with
>>    an IBM DORS SCSI2 (2 GB) disk, a CD ROM and a DAT drive attached to
>>    it. Now, when I write to the DAT, everything seems O.K., however,
>>    when trying to read, I sometimes get:

>>    ahc0: target 6, lun 0 (st0) timed out
>>    st0(ahc0:6:0): BUS DEVICE RESET message queued
>>    and then:
>>    st0(ahc0:6:0): Target Busy

>>    I can't figure out, what is wrong here, since sometimes reading works
>>    just fine.

The only answer I got:

> Did you see my posting on almost the same day reporting
> problems with the same controller as you have and
> a CD-ROM drive? I can get the machine to boot, but
> when I mount from the CD I see the exact same error
> messages as you do. From our mutual problems, I suspect
> the problem is with the controller, not the SCSI
> devices attached. The controller is rather new.

> I don't have the ability to submit a problem report
> using send-pr right now - do you? Have you submitted
> a report yet? Did you get any response?

> I am copying this reply to the freebsd-hackers list
> in hope that someone there might already be working on
> the problem, or be interested enough to read the
> two articles in the newsgroup before they expire.

> -deborah bennett

Now, the good news:
SCSI BUS DEVICE RESET Problems seem to be fixed

I was going mad, since the upgrade to a new Pentium 133MHz with AHC
2940 Ultra and finally from FreeBSD-1.1.5 to 2.1-RELEASE turned out to
be a bad deal! I have never had such a short mean time between two
breakdowns of our fine BSD! In the last couple of hours, the system
was never up for more than one hour -- not even 386BSD-0.0new was so
unstable!

I thus built a 2.2-960501-SNAP kernel and tried it out -- nothing got
better with the new kernel. So, I had to help myself. The problem, and
the fixes for it -- yes, it seems like I fixed the problem -- point
to a major problem that FreeBSD might have, particularly on fast
machines: timeout timers or counters seem to be initialized too small,
and thus timeouts fire prematurely. Two pieces of evidence from
different parts of the kernel: (1) the fdc driver and (2) the aic7xxx
driver.

(1) FDC driver

Please look at this (i386/isa/fdc.c):

int
in_fdc(fdcu_t fdcu)
{
	int baseport = fdc_data[fdcu].baseport;
	int i, j = 100000;
	while ((i = inb(baseport+FDSTS) & (NE7_DIO|NE7_RQM))
		!= (NE7_DIO|NE7_RQM) && j-- > 0)
		if (i == NE7_RQM)
			return fdc_err(fdcu, "ready for output in input\n");
	if (j <= 0)
		return fdc_err(fdcu, "input ready timeout\n");
...

This is obviously a counter, not a timer. My machine is fast, so it
counts down considerably faster in the same amount of time, which
results in nasty timeouts (that sometimes even lock up the machine).
  We need to make the initial value of j depend on the speed of the
machine. And, after all, we shouldn't just count and block the whole
machine from doing better things. Insert a tsleep()!
  I don't have time to fix this now, so I just append a 0
to the counter init value.
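
To illustrate the difference between a counter and a timer, here is a
minimal user-space sketch (not kernel code, and not the fdc driver's
actual fix) of a poll loop bounded by elapsed time instead of by an
iteration count. The names poll_until_ready(), ready_after_n_polls()
and never_ready() are hypothetical, and clock_gettime() stands in for
whatever clock the kernel would use; inside the kernel one would more
likely sleep or delay between polls.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical status callback: returns nonzero once the device is ready. */
typedef int (*ready_fn)(void *ctx);

/* Current monotonic time in microseconds. */
static int64_t
now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/*
 * Poll until ready() succeeds or timeout_us microseconds have elapsed.
 * Returns 0 on success, -1 on timeout.  The bound is wall-clock time,
 * so it does not shrink when the CPU gets faster.
 */
static int
poll_until_ready(ready_fn ready, void *ctx, int64_t timeout_us)
{
	int64_t deadline;

	assert(ready != NULL && timeout_us >= 0);
	deadline = now_us() + timeout_us;
	do {
		if (ready(ctx))
			return 0;
	} while (now_us() < deadline);
	return -1;
}

/* Hypothetical stand-in device: becomes ready after *ctx more polls. */
static int
ready_after_n_polls(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return 0;
	}
	return 1;
}

/* Hypothetical stand-in device that never becomes ready. */
static int
never_ready(void *ctx)
{
	(void)ctx;
	return 0;
}
```

With this shape the same timeout value gives the same wall-clock bound
on an i486/33 and on an i586/133, although it still busy-waits.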

This is part of my work-around for the timeout counter bug that resides
in in_fdc(), fd_in(), and fd_out().

First I define a constant with the counter value times 10, for basic
safety, in such a way that it can be predefined as an option in the
config file. I use the old value 100000 for my i486/33 ISA machine,
and the times-10 value for the i586/133 PCI -- the timeouts haven't
occurred since I did this! But one can clearly watch the machine hang
for a few milliseconds when e.g. fdformat(8) is running (see how the
regular blinking of the cursor stalls) -- I bet that a tsleep() instead
of the counter would fix this for good.

#ifndef FDC_TIMEOUT_CNT
# define FDC_TIMEOUT_CNT 1000000  /* added a 0 for safety */
#endif

...

in_fdc(fdcu_t fdcu)
{
	int baseport = fdc_data[fdcu].baseport;
	int i, j = FDC_TIMEOUT_CNT /* definition above in this file (GS) */;
	while ((i = inb(baseport+FDSTS) & (NE7_DIO|NE7_RQM))
		!= (NE7_DIO|NE7_RQM) && j-- > 0)
		if (i == NE7_RQM)
	
...

int
out_fdc(fdcu_t fdcu, int x)
{
	int baseport = fdc_data[fdcu].baseport;
	int i;

	/* Check that the direction bit is set */
	i = FDC_TIMEOUT_CNT /* ditto GS */;
	while ((inb(baseport+FDSTS) & NE7_DIO) && i-- > 0);
	if (i <= 0) return fdc_err(fdcu, "direction bit not set\n");

	/* Check that the floppy controller is ready for a command */
	i = FDC_TIMEOUT_CNT /* ditto GS */;
	while ((inb(baseport+FDSTS) & NE7_RQM) == 0 && i-- > 0);
	if (i <= 0) return fdc_err(fdcu, "output ready timeout\n");

O.K., so much for the FD controller driver, but now for the really
nasty thing!

(2) the PCI ahc driver (i386/scsi/aic7xxx.c)

I experienced regular accidents with the following message:

ahc0: target 6, lun 0 (st0) timed out
st0(ahc0:6:0): BUS DEVICE RESET message queued.
st0(ahc0:6:0): Target Busy
last message repeated 23 times

and even worse

ahc0: target 0, lun 0 (sd0) timed out
sd0(ahc0:0:0): BUS DEVICE RESET message queued.

and (since this is my /, /usr, and swap disk?) the machine will
never recover from this and eventually hangs or panics, in any case
leaving a damaged filesystem behind (fortunately the ufs code is so
robust that the damage is never fatal). I concluded that this failure
is closely related to swap activity, since it does not particularly
happen when there is high load on machine and disk, but rather when
there is almost no activity, with some processes hanging around
waiting and eventually being swapped out ... and ... BANG!, game over! :-(

O.K. since I no longer trust the time/tick/hz management and proper
adjustment of my kernel to high CPU speeds, I decided to just increase
the timeout values by the same factor of 10.

#define TOFACT 10 /* Time Out FACTor */

Then search for every timeout() function call and multiply the 3rd
parameter, which seems to be the timer value, by that factor, e.g. like
here:

void
ahc_scb_timeout(unit, ahc, scb)
...

			scb->datalen[1] = 0;
			scb->datalen[2] = 0;
			outb(SCBCNT + iobase, 0x80);
			outsb(SCBARRAY+iobase,scb,SCB_DOWN_SIZE);
			outb(SCBCNT + iobase, 0);
			ahc_add_waiting_scb(iobase, scb, list_second);
			timeout(ahc_timeout, (caddr_t)scb, ( TOFACT * 2 * hz));
							    ^^^^^^^^^^^^^

#ifdef AHC_DEBUG
			if(ahc_debug & AHC_SHOWABORTS) {
				sc_print_addr(scb->xs->sc_link);
				printf("BUS DEVICE RESET message queued.\n");
					^^^^^^^^^^^^^^^^^^^^^^^
					This was the last message before
					the kernel suicided
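
For scale, here is a small sketch of the arithmetic behind the
timeout() call above, on the understanding that the third argument is
a count of clock ticks and that hz is the number of ticks per second.
The helper names seconds_to_ticks() and ticks_to_ms() are hypothetical,
and hz = 100 is an assumption (the message does not state the value).

```c
#include <assert.h>

#define TOFACT 10	/* Time Out FACTor, as in the message */

/* Convert whole seconds to clock ticks, given hz ticks per second. */
static int
seconds_to_ticks(int seconds, int hz)
{
	assert(seconds >= 0 && hz > 0);
	return seconds * hz;
}

/* Convert clock ticks back to milliseconds of wall-clock time. */
static long
ticks_to_ms(long ticks, int hz)
{
	assert(ticks >= 0 && hz > 0);
	return ticks * 1000L / hz;
}
```

Under that assumption, 2 * hz is 200 ticks (2 seconds) and
TOFACT * 2 * hz is 2000 ticks (20 seconds). Because the value is
expressed in terms of hz, it stays at 20 seconds of wall-clock time
for any hz, provided the clock really ticks hz times per second.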

Now, I think (hope) that I have finally gotten rid of this problem. But
let me suggest that you kernel gurus out there think about the timing
problem with fast CPUs. I've already seen some messages about clock
adjustment when I booted my 2.2-*-SNAP kernel, but in any case that
didn't help, and the FDC driver doesn't even use any timer value, not
even the wrongly initialized hz. It just counts down!

Anyway, thank you for that great work on FreeBSD!

regards
Gunther Schadow

PS: the Sound Blaster driver has a bug as well: it won't compile with
option JAZZ16, since JAZZ_DMA16 (used in i386/isa/sound/sb_dsp.c) is
never defined. I think that these lines belong in sound_config.h:

#ifdef JAZZ16
#define JAZZ_DMA16 1
#endif

at least, that worked for me.


