Date: Thu, 21 Aug 1997 12:16:18 -0500
From: Doug Ledford <dledford@dialnet.net>
To: "Ulrich Windl" <ulrich.windl@rz.uni-regensburg.de>
Cc: Leonard Zubkoff <lnz@dandelion.com>, aic7xxx@freebsd.org, linux-scsi@vger.rutgers.edu, Harald Koenig <koenig@tat.physik.uni-tuebingen.de>, Hubert Mantel <mantel@suse.de>
Subject: Re: "read defect list" with 2.0.30-pre7 and patch Aug19
Message-ID: <199708211716.MAA16115@dledford.dialnet.net>
In-Reply-To: Your message of "Thu, 21 Aug 1997 09:16:52 +0200." <5CA15F646FE@rkdvmks1.ngate.uni-regensburg.de>
> (My floppy with the Aug19-2 patch had a CRC error, so I had to use
> the Aug19 patch)

The differences between those two patches would not have affected this
particular problem anyway, so that's fine.

> Having enabled the statistics, I found out that I have statistics for
> non-existing SCSI IDs and LUNs -- maybe the read was there, but not the
> LUN ;-) The question is if you want to support SCSC plug and play,
> what condition should you check? At least accesses in two categories?

I think there is a missing check for MSG_WDTR and MSG_SDTR to avoid
counting devices during the TEST UNIT READY messages. Easy enough to fix,
I just haven't looked at the proc stuff in a while so I didn't notice that
we were picking up non-existent devices. I'll fix that in my next patch set.

>
> Despite of that the information given should be much more compact; for
> cat /proc/scsi/aic7xxx/0 I got a bunch of:
>
> ...possible overflow at loop 0:8
> 0:8
> 1:8
> 0:8
> 1:8
> 2:8
> 0:8
> 1:8
> 2:8
> 0:8
> 1:8
> 2:8
>

Heinz Mauelshagen has sent me a modified aic7xxx_proc.c file that should
fix these messages, I have instructed him to send it to Dan, so I would
think this problem will be gone in the near future.

> Resource allocation: SHouldn't the driver use a hardware-identifier instead
> of a software-identifier when registering resources? Currently the driver
> uses generic "aic7xxx", not the actual CHIP, and not the PCI bus & device.
> With multiple cards the approach seems ambiguous (talking about /proc/ioports
> and /proc/interrupts).

It could be modified to do this, but this would fall farther down on a list
of things to be done in my book. Most people know where each of their cards
are in terms of interrupts and what not simply by the boot messages,
regardless of the proc registration of the irq handler.

> Unfortunately the kernel still bombs out badly, but I was able to get
> at least some information onto a file on my IDE harddisk; I even had
> symbolic information.
> I added another log to show how consistent the
> fault is.
>
> Still, as expected earlier, there seems to be a undetected buffer
> overflow in the kernel that overwrites some SCSI data structures (at
> least). The code of the fault looked OK, but the RAM accesses had
> probably bad values.

> 22:04:11 scsi0 channel 0 : resetting for second half of retries.
> 22:04:11 SCSI bus is being reset for host 0 channel 0.

This part makes sense. We had two overflow errors (they didn't show in the
log, but if you boot with aic7xxx=verbose, they would have then). We
returned an error code to the mid level scsi code both times. The mid level
scsi code decided we needed to reset the bus to try and fix the problem.

> 22:04:11 EIP: 0010:[scsi_mark_host_reset+15/28]

And then the mid level code died in here. Hmmmm.....

> 22:04:11 Call Trace: [scsi_reset+399/776] [scsi_done+1162/1672] [aic7xxx_isr+1117/1424] [do_IRQ+45/80] [IRQ11_interrupt+95/144] [hard_idle+31/56] [sys_idle+59/112]
> 22:04:11 [system_call+85/128] [init+0/656] [start_kernel+429/440]

Call trace is clean. What it doesn't show is the following:

scsi_reset
  \ -> aic7xxx_reset
  \ -> aic7xxx_reset_channel
  | -> aic7xxx_reset_current_bus
  | -> aic7xxx_reset_device
  \ -> aic7xxx_search_qinfifo
  \ -> aic7xxx_run_done_queue
  | -> aic7xxx_done ?
  \ -> aic7xxx_done_cmds_complete
  \ -> scsi_done ?

The two calls with ? would only get called if you had other outstanding
commands on the bus besides the one we are resetting over. So, given those
extra calls on the stack, I guess there is an *outside* possibility of a
stack overflow, but to be quite honest, I doubt it. Most of these functions
have rather small stack usage any more.

>
> I suspect it's not the aic7xxx, I suspect someone else shot some memory with
> an undetected overflow...

I agree. I think the next place to look is into the buffers passed by the
mid level scsi code down to the aic7xxx driver for picking up that defect
list.
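To make the buffer question concrete, here is a minimal sketch (not driver
code, and not what the mid level code does) of how a caller could check
whether its buffer is large enough for a READ DEFECT DATA (10) response. It
relies on the 4-byte header the drive returns, whose bytes 2-3 carry the
defect list length as described for this command in SCSI-2. The names
defect_header, parse_defect_header, defect_bytes_needed and
defect_list_fits are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define READ_DEFECT_DATA_10 0x37   /* SCSI opcode for READ DEFECT DATA (10) */
#define DEFECT_HEADER_LEN   4      /* reserved, format, 2-byte list length  */

/* Hypothetical holder for the decoded header fields. */
struct defect_header {
    uint8_t  format;       /* byte 1: PLIST/GLIST bits and list format     */
    uint16_t list_length;  /* bytes 2-3: list length in bytes, big-endian  */
};

/* Decode the 4-byte header; returns 0 on success, -1 on bad arguments. */
int parse_defect_header(const uint8_t *buf, size_t len,
                        struct defect_header *out)
{
    if (buf == NULL || out == NULL || len < DEFECT_HEADER_LEN)
        return -1;
    out->format = buf[1];
    out->list_length = (uint16_t)((buf[2] << 8) | buf[3]);
    return 0;
}

/* Total bytes the drive reports for the response: header + descriptors. */
size_t defect_bytes_needed(const struct defect_header *h)
{
    assert(h != NULL);
    return (size_t)DEFECT_HEADER_LEN + h->list_length;
}

/* Nonzero if a buffer of alloc_len bytes holds the whole response. */
int defect_list_fits(const struct defect_header *h, size_t alloc_len)
{
    return defect_bytes_needed(h) <= alloc_len;
}
```

A caller could fetch just the header first, run it through
parse_defect_header(), and only then decide how big a buffer to hand down
for the full list. Whether that would avoid the overflow seen here is
untested.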
> And I did not configure SCSI generic support -- Should I?

I don't think it would have mattered unless your program is capable of
reading the defect list from a generic device (instead of from the disk
device) using the big buffers that Leonard Zubkoff mentioned.

Actually, the whole problem with this defect list on your drive is
strikingly similar to the problem with the aic7xxx_proc.c file. With disk
devices, we expect to know how much data we are transferring. Any time we
read a sector, we do know. If it comes up either short or long, it's an
error. When asking for a defect list though, the situation is not nearly so
clear. Here, we are asking for a list, and passing a buffer that we hope is
big enough. In the case of your drive, it isn't. So, we get an overflow.
The sequencer automatically turns off data transfers to memory when this
overflow occurs, so that isn't affecting anything unless the buffer length
the mid level code gave us was wrong, but I doubt that. That's why reading
a defect list can cause as many problems as it does. Maybe that should fall
into a "don't do that" category of operations :)

What really strikes me as weird is where the program died. It died during
the orb instruction to set the bitfields in scsi_mark_device_reset().

0x19a428 <scsi_mark_host_reset>:     movl 0x4(%esp,1),%eax
        Grab Scsi_Host *Host
0x19a42c <scsi_mark_host_reset+4>:   movl 0x10(%eax),%edx
        SCptr = Host->host_queue;
0x19a42f <scsi_mark_host_reset+7>:   testl %edx,%edx
        if (SCptr)
0x19a431 <scsi_mark_host_reset+9>:   je 0x19a442 <scsi_mark_host_reset+26>
0x19a433 <scsi_mark_host_reset+11>:  nop
0x19a434 <scsi_mark_host_reset+12>:  movl 0x4(%edx),%eax
        get * to SCptr->device
0x19a437 <scsi_mark_host_reset+15>:  orb $0xc0,0x4b(%eax)
        set device->was_reset = 1
        set device->expecting_cc_ua = 1

This is where we choked. Why? Because %eax, the address for the base of the
Scsi_Device structure, was pointing outside of the range of kernel bounds.
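For readers who would rather see that in C, here is a rough user-space
model of what the disassembly above appears to be doing. It is a sketch
under assumptions, not the kernel source: the model_* types are simplified
stand-ins that do not reproduce the real Scsi_Host, Scsi_Cmnd or
Scsi_Device layouts or offsets, and the loop over a next pointer is my
assumption, since the listing only shows the first queue entry being tested.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins, NOT the real kernel structure definitions. */
struct model_device {
    unsigned was_reset       : 1;
    unsigned expecting_cc_ua : 1;
};

struct model_cmnd {
    struct model_cmnd   *next;    /* assumed link to the next queued command */
    struct model_device *device;  /* the pointer that was bad in the oops    */
};

struct model_host {
    struct model_cmnd *host_queue;
};

/* Models the inlined "orb $0xc0": both flags are set on the device. */
static void model_mark_device_reset(struct model_device *dev)
{
    dev->was_reset = 1;
    dev->expecting_cc_ua = 1;
}

/*
 * Walk the host's command queue and mark each command's device.
 * Note that cmd->device is dereferenced without any validity check,
 * so a corrupted pointer faults (or scribbles on memory) right here.
 */
void model_mark_host_reset(struct model_host *host)
{
    struct model_cmnd *cmd;

    assert(host != NULL);
    for (cmd = host->host_queue; cmd != NULL; cmd = cmd->next)
        model_mark_device_reset(cmd->device);
}
```

The point of the model is only that the write goes through whatever
address the command's device pointer holds at that moment, which is why a
corrupted command structure shows up as a fault at this instruction.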
Had it been pointing inside the kernel area, we would have been tromping on
unknown memory. Whatever caused this is a potential memory scribble bug and
needs to be found. The question is, why did the Scsi_Cmnd *SCptr have an
invalid address for its ->device pointer? Hmmm, I'm going to Cc: this to
Leonard in case he might have an idea. As far as the aic7xxx_reset code is
concerned, it barely touched the Scsi_Cmnd * passed to it during the reset
(and I have had the exact same reset sequence on my machine here at the
house and it didn't fail). I suspect that somewhere along the line in the
failed attempts to read the defect list, some Scsi_Cmnd structure got
corrupted and it just showed up when we did the mark_bus_reset call.

Ulrich, could you send me a copy of the program you are using to read the
defects list, I want to see if I can duplicate any of this here at the farm.

-- 
*****************************************************************************
* Doug Ledford                      * Unix, Novell, Dos, Windows 3.x,       *
* dledford@dialnet.net    873-DIAL  * WfW, Windows 95 & NT Technician       *
* PPP access $14.95/month           *****************************************
* Springfield, MO and surrounding   * Usenet news, e-mail and shell account.*
* communities. Sign-up online at    * Web page creation and hosting, other  *
* 873-9000 V.34                     * services available, call for info.    *
*****************************************************************************