Date:      Thu, 21 Aug 1997 12:16:18 -0500
From:      Doug Ledford <dledford@dialnet.net>
To:        "Ulrich Windl" <ulrich.windl@rz.uni-regensburg.de>
Cc:        Leonard Zubkoff <lnz@dandelion.com>, aic7xxx@freebsd.org, linux-scsi@vger.rutgers.edu, Harald Koenig <koenig@tat.physik.uni-tuebingen.de>, Hubert Mantel <mantel@suse.de>
Subject:   Re: "read defect list" with 2.0.30-pre7 and patch Aug19 
Message-ID:  <199708211716.MAA16115@dledford.dialnet.net>
In-Reply-To: Your message of "Thu, 21 Aug 1997 09:16:52 +0200." <5CA15F646FE@rkdvmks1.ngate.uni-regensburg.de>

--------
> (My floppy with the Aug19-2 patch had a CRC error, so I had to use 
> the Aug19 patch)

The differences between those two patches would not have affected this 
particular problem anyway, so that's fine.


> Having enabled the statistics, I found out that I have statistics for
> non-existing SCSI IDs and LUNs -- maybe the read was there, but not the
> LUN ;-) The question is if you want to support SCSI plug and play,
> what condition should you check? At least accesses in two categories?

I think there is a missing check for MSG_WDTR and MSG_SDTR to avoid counting
devices during the TEST UNIT READY messages.  That is easy enough to fix; I
just haven't looked at the proc stuff in a while, so I didn't notice that we
were picking up non-existent devices.  I'll fix that in my next patch set.

> 
> Despite that, the information given should be much more compact; for
> cat /proc/scsi/aic7xxx/0 I got a bunch of:
> 
> ...possible overflow at loop 0:8
>                              0:8
>                              1:8
>                              0:8
>                              1:8
>                              2:8
>                              0:8
>                              1:8
>                              2:8
>                              0:8
>                              1:8
>                              2:8
> 

Heinz Mauelshagen has sent me a modified aic7xxx_proc.c file that should fix
these messages.  I have asked him to send it to Dan, so I expect this
problem to be gone in the near future.

> Resource allocation: Shouldn't the driver use a hardware-identifier instead
> of a software-identifier when registering resources? Currently the driver
> uses generic "aic7xxx", not the actual CHIP, and not the PCI bus & device.
> With multiple cards the approach seems ambiguous (talking about /proc/ioports
> and /proc/interrupts).

It could be modified to do this, but in my book it falls farther down the
list of things to be done.  Most people know where each of their cards is
in terms of interrupts and whatnot simply from the boot messages,
regardless of the proc registration of the irq handler.

> Unfortunately the kernel still bombs out badly, but I was able to get 
> at least some information onto a file on my IDE harddisk; I even had 
> symbolic information. I added another log to show how consistent the
> fault is.
> 
> Still, as expected earlier, there seems to be an undetected buffer 
> overflow in the kernel that overwrites some SCSI data structures (at 
> least). The code of the fault looked OK, but the RAM accesses probably 
> had bad values.

> 22:04:11 scsi0 channel 0 : resetting for second half of retries.
> 22:04:11 SCSI bus is being reset for host 0 channel 0.

This part makes sense.  We had two overflow errors (they didn't show in the
log, but they would have if you had booted with aic7xxx=verbose).  We
returned an error code to the mid level scsi code both times.  The mid level
scsi code decided we needed to reset the bus to try and fix the problem.

> 22:04:11 EIP:    0010:[scsi_mark_host_reset+15/28]

And then the mid level code died in here.  Hmmmm.....

> 22:04:11 Call Trace: [scsi_reset+399/776] [scsi_done+1162/1672] [aic7xxx_isr+1117/1424] [do_IRQ+45/80] [IRQ11_interrupt+95/144] [hard_idle+31/56] [sys_idle+59/112] 
> 22:04:11        [system_call+85/128] [init+0/656] [start_kernel+429/440] 

Call trace is clean.  What it doesn't show is the following:
scsi_reset
    \ -> aic7xxx_reset
              \ -> aic7xxx_reset_channel
                            | -> aic7xxx_reset_current_bus
                            | -> aic7xxx_reset_device
                                        \ -> aic7xxx_search_qinfifo
                            \ -> aic7xxx_run_done_queue
                                        | -> aic7xxx_done ?
                                        \ -> aic7xxx_done_cmds_complete
                                                \ -> scsi_done ?

The two calls marked with ? would only be made if you had other outstanding
commands on the bus besides the one we are resetting over.

So, given those extra calls on the stack, I guess there is an *outside*
possibility of a stack overflow, but to be quite honest, I doubt it.  Most
of these functions have rather small stack usage these days.

> 
> I suspect it's not the aic7xxx, I suspect someone else shot some memory with
> an undetected overflow...

I agree.  I think the next place to look is into the buffers passed by the 
mid level scsi code down to the aic7xxx driver for picking up that defect 
list.

> And I did not configure SCSI generic support -- Should I?

I don't think it would have mattered unless your program is capable of 
reading the defect list from a generic device (instead of from the disk 
device) using the big buffers that Leonard Zubkoff mentioned.

Actually, the whole problem with this defect list on your drive is
strikingly similar to the problem with the aic7xxx_proc.c file.  With disk
devices, we expect to know how much data we are transferring.  Any time we
read a sector, we do know.  If it comes up either short or long, it's an
error.  When asking for a defect list, though, the situation is not nearly
so clear.  Here, we are asking for a list and passing a buffer that we hope
is big enough.  In the case of your drive, it isn't, so we get an overflow.
The sequencer automatically turns off data transfers to memory when this
overflow occurs, so that isn't affecting anything unless the buffer length
the mid level code gave us was wrong, and I doubt that.  That's why reading
a defect list can cause as many problems as it does here.  Maybe that should
fall into a "don't do that" category of operations :)
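
One way to avoid guessing the buffer size can be sketched as follows.  This
is only an illustration, not what the program in question or the driver
does.  It assumes the usual SCSI-2 layout of the READ DEFECT DATA (10) reply,
where a 4-byte header carries the defect list length big-endian in bytes 2
and 3, so a first header-only request can tell the caller how large the
second request must be.  The names defect_list_bytes_needed() and
defect_buffer_fits() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of the READ DEFECT DATA (10) reply header. */
#define DEFECT_HDR_LEN 4

/*
 * Given the 4-byte header from a first, header-only request, return the
 * total number of bytes (header plus list) needed to hold the full reply.
 * Hypothetical helper, not part of any driver.
 */
static size_t defect_list_bytes_needed(const uint8_t hdr[DEFECT_HDR_LEN])
{
    assert(hdr != NULL);
    /* Bytes 2..3: defect list length, most significant byte first. */
    size_t list_len = ((size_t)hdr[2] << 8) | hdr[3];
    return DEFECT_HDR_LEN + list_len;
}

/* Return 1 if a buffer of buf_len bytes can hold the whole reply, else 0. */
static int defect_buffer_fits(const uint8_t hdr[DEFECT_HDR_LEN], size_t buf_len)
{
    return defect_list_bytes_needed(hdr) <= buf_len;
}
```

A caller would issue the command once with a 4-byte allocation length, feed
the header to defect_list_bytes_needed(), and only then allocate and issue
the full-size request, so that neither a short nor a long transfer is left
to chance.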

What really strikes me as weird is where the program died.  It died during
the orb instruction that sets the bitfields in scsi_mark_device_reset().

0x19a428 <scsi_mark_host_reset>:        movl   0x4(%esp,1),%eax
                                        Grab Scsi_Host *Host
0x19a42c <scsi_mark_host_reset+4>:      movl   0x10(%eax),%edx
                                        SCptr = Host->host_queue;
0x19a42f <scsi_mark_host_reset+7>:      testl  %edx,%edx
                                        if (SCptr)
0x19a431 <scsi_mark_host_reset+9>:      je     0x19a442 <scsi_mark_host_reset+26>
0x19a433 <scsi_mark_host_reset+11>:     nop
0x19a434 <scsi_mark_host_reset+12>:     movl   0x4(%edx),%eax
                                        get * to SCptr->device
0x19a437 <scsi_mark_host_reset+15>:     orb    $0xc0,0x4b(%eax)
                                        set device->was_reset = 1
                                        set device->expecting_cc_ua = 1

This is where we choked.  Why?  Because %eax, the base address of the
Scsi_Device structure, was pointing outside the kernel's address range.
Had it been pointing inside the kernel area, we would have been tromping on
unknown memory instead of faulting.  Whatever caused this is a potential
memory scribble bug and needs to be found.  The question is, why did the
Scsi_Cmnd *SCptr have an invalid address for its ->device pointer?  Hmmm,
I'm going to Cc: this to Leonard in case he might have an idea.  As far as
the aic7xxx_reset code is concerned, it barely touched the Scsi_Cmnd *
passed to it during the reset (and I have had the exact same reset sequence
on my machine here at the house and it didn't fail).  I suspect that
somewhere along the line in the failed attempts to read the defect list,
some Scsi_Cmnd structure got corrupted and it just showed up when we did the
mark_bus_reset call.  Ulrich, could you send me a copy of the program you
are using to read the defect list?  I want to see if I can duplicate any of
this here at the farm.
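
For anyone following along without the kernel source, the logic annotated in
the disassembly can be restated as a small user-space model.  This is a
sketch under assumptions, not the kernel code: the model_* structures are
simplified stand-ins with made-up layouts, the `next` link and the walk over
the whole queue are assumed (the annotations only cover the first command),
and the NULL test is added purely for illustration.  Note that such a test
catches a NULL ->device pointer only; it cannot catch a pointer that has been
overwritten with garbage, which is the failure seen in the trace.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real layouts. */
struct model_device {
    unsigned was_reset:1;
    unsigned expecting_cc_ua:1;
};

struct model_cmnd {
    struct model_cmnd *next;      /* assumed queue link */
    struct model_device *device;  /* the pointer that was bad in the oops */
};

struct model_host {
    struct model_cmnd *host_queue;
};

/*
 * Walk the host's command queue and flag each command's device as reset.
 * Commands with a NULL device pointer are skipped.  Returns the number of
 * commands whose device was marked.
 */
static int model_mark_host_reset(struct model_host *host)
{
    int marked = 0;

    assert(host != NULL);
    for (struct model_cmnd *cmd = host->host_queue; cmd != NULL; cmd = cmd->next) {
        if (cmd->device == NULL)
            continue;
        /* Equivalent of the single orb $0xc0 in the disassembly. */
        cmd->device->was_reset = 1;
        cmd->device->expecting_cc_ua = 1;
        marked++;
    }
    return marked;
}
```

The point of the model is that the function itself is trivial: if it faults,
the damage was done earlier, by whatever wrote over the command structure.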



-- 
*****************************************************************************
* Doug Ledford                      *   Unix, Novell, Dos, Windows 3.x,     *
* dledford@dialnet.net    873-DIAL  *     WfW, Windows 95 & NT Technician   *
*   PPP access $14.95/month         *****************************************
*   Springfield, MO and surrounding * Usenet news, e-mail and shell account.*
*   communities.  Sign-up online at * Web page creation and hosting, other  *
*   873-9000 V.34                   * services available, call for info.    *
*****************************************************************************
