From owner-aic7xxx Thu Feb 12 22:32:52 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id WAA01945 for aic7xxx-outgoing; Thu, 12 Feb 1998 22:32:52 -0800 (PST) (envelope-from owner-aic7xxx@FreeBSD.ORG)
Received: from dledford.dialnet.net (root@dledford.dialnet.net [206.65.249.116]) by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id WAA01932 for ; Thu, 12 Feb 1998 22:32:47 -0800 (PST) (envelope-from dledford@dialnet.net)
Received: from dialnet.net (localhost [127.0.0.1]) by dledford.dialnet.net (8.8.5/8.8.4) with ESMTP id AAA22624; Fri, 13 Feb 1998 00:32:27 -0600
Message-ID: <34E3E8FB.EA2CF286@dialnet.net>
Date: Fri, 13 Feb 1998 00:32:27 -0600
From: Doug Ledford
X-Mailer: Mozilla 4.04 [en] (X11; I; Linux 2.0.33 i686)
MIME-Version: 1.0
To: mikebw@bilow.bilow.uu.ids.net
CC: aic7xxx Mailing List
Subject: Re: HELP
References: <4e33ebd2@bilow.bilow.uu.ids.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-aic7xxx@FreeBSD.ORG
Precedence: bulk

Mike Bilow wrote:
>
> Doug Ledford wrote in a message to Mike Bilow:
>
> > (scsi0:0:3:0) Synchronous at 6.67MHz, offset 15.
> > Vendor: ARCHIVE Model: Python 28388-XXX Rev: 5.45
> > Type: Sequential-Access ANSI SCSI revision: 02
> > Detected scsi tape st0 at scsi0, channel 0, id 3, lun 0
> > scsi : detected 1 SCSI tape 2 SCSI disks total.
> * * *
>
> DL> This all looks right and good.
>
> No, it certainly doesn't. There are no Archive Python models which should
> negotiate 6.67 MHz. There are some old models which should negotiate 5.0 MHz,
> but nearly all (including this one) should negotiate 10.0 MHz. The only
> exceptions would be OEM drives, but these would not identify as "ARCHIVE."

Whether or not it should negotiate at 5 or 10 or whatever MHz isn't what
concerns me (I don't have one of these drives to know what it is supposed
to negotiate at).
However, there were no errors generated and it negotiated properly
(assuming that either the tape drive or the controller limited the rate to
6.67 MHz; if the tape drive shouldn't negotiate at this rate, then the SCSI
BIOS device settings need to be checked).

> > (scsi0:0:0:0) Data overrun detected in Data-In phase, tag 14;
> > Have seen Data Phase. Length=28672, NumSGs=5.
> > sg[0] - Addr 0xb32000 : Length 4096
> > sg[1] - Addr 0xb37000 : Length 12288
> > sg[2] - Addr 0xb3b000 : Length 4096
> > sg[3] - Addr 0xb3e000 : Length 4096
> > sg[4] - Addr 0xb40000 : Length 4096
>
> DL> I see these occasionally from certain drives, usually under
> DL> heavy load. Normally, they are nothing to worry about.
>
> I would not regard this as normal behavior.

I didn't say they were normal behavior, and I would like to track them
down, but they normally aren't anything to worry about: we immediately
retry the command after this error, and the sequencer won't let us corrupt
kernel memory during the overrun. I have to assume that an overrun during
a data out phase also isn't corrupting the hard drive, or else the drive's
firmware would be at fault. We can't force a drive to take more than it
wants. If the drive thinks that it has completed the transfer and we
don't, you get an underrun instead of an overrun. You should only get an
overrun when the drive thinks it isn't done while we think it is. In those
cases bogus data may get written to a portion of the hard drive, and then,
because of the retry, it immediately gets rewritten with the correct data.

> > scsi : aborting command due to timeout : pid 313876, scsi0, channel 0,
> > id 0, lun 0 Read (6) 0d f0 70 5a 00
> > scsi : aborting command due to timeout : pid 313874, scsi0, channel 0,
> > id 0, lun 0 Read (6) 0d f0 18 38 00
> > scsi : aborting command due to timeout : pid 313878, scsi0, channel 0,
> > id 0, lun 0 Read (6) 16 84 e6 02 00
>
> DL> This usually indicates a drive that is either wedged itself
> DL> or has wedged the bus.
>
> Correct! The drive has wedged itself, because its internal fail-safe has cut
> in. As I said, this is usually the result of a tape jam, but there are other
> cases in which it can happen.

If the tape drive wedged itself, it wouldn't matter. The above error
messages refer to the hard disk. There is no indication of a full bus
reset ever being issued, so I can only assume that the condition was
corrected via the use of the abort call. In that case, the worst that
would happen would be a bus device reset of the hard disk and nothing to
the tape drive.

> > (scsi0:0:0:0) No active SCB for reconnecting target - Issuing BUS
> > DEVICE RESET.
>
> DL> I've seen one other of these since the 5.0.5 release, and
> DL> this indicates an error in the driver somewhere. I'm
> DL> currently hunting for it. However, the hunt would be much
> DL> easier if I had more information :) For instance, it looks
> DL> like the system was booted without the aic7xxx=verbose
> DL> option because I'm not seeing any calls to aic7xxx_abort due
> DL> to the above lines but those calls should exist.
>
> The SCSI bus behavior of the tape drive when its fail-safe sensors trip is
> considered to be undefined. This is a mechanical safety issue, since you may
> have motors whose power is abruptly ripped away. The drive should not hang the
> SCSI bus, at least not if disconnection is enabled, but it could happen.

That error message isn't about the tape drive; it's about the hard disk.
The condition of the tape drive is irrelevant in this situation, since the
particular error is a reconnection from the hard drive, which couldn't
happen if the bus were wedged, followed by a failure to find the command
associated with that reconnection.

>
> > (scsi0:0:0:0) SAVED_TCL=0x0, ARG_1=0xe, SEQADDR=0x100
> > (scsi0:0:0:0) Synchronous at 10.0MHz, offset 15.
> > st0: Error with sense data: Current error st09:00: sense key Medium
> > Error
> > Additional sense indicates Sequential positioning error
>
> DL> Either a bad tape, or the equivalent of an "mt -f /dev/st0
> DL> fsf x" command where x is too high. In other words, the
> DL> software has tried to space forward past x filemarks and
> DL> there weren't that many file marks on the tape.
>
> Whoops! The drive is now negotiating 10.0 MHz as it originally should have
> done. Why now? I think I know the answer, but I'll save it for later...

No, the tape drive isn't. Read the line headers; that's the hard disk
renegotiating. The tape drive is SCSI ID 4, the Conner disk is ID 0. The
fact that there is a negotiation message for the disk and not the tape
only confirms what I was suspecting earlier: there was no bus reset, and
the disk drive is renegotiating as a result of the BUS_DEVICE_RESET it
received in the message before this one.

>
> > I am willing to change hardware if that might be a more elegant
> > solution, but I am not clear, from the info that I see, as to which
> > device is really causing the problems.
> > Thank you for your patience in reading through all of this stuff, and
> > thank you in advance for your help and suggestions.
>
> DL> Well, the only hardware I would try changing at the moment
> DL> is your tape drive. Beyond that, just try to replicate
> DL> these errors after rebuilding your kernel with verbose SCSI
> DL> reporting (if it isn't already enabled) and after rebooting
> DL> with aic7xxx=verbose (or aic7xxx=verbose:0xffff for even
> DL> more info) and then send me the full logs of what happened.
> DL> From there, I should be able to figure it out.
>
> And the murderer is...
>
> THE POWER SUPPLY!
>
> Based on considerable experience with Archive Python drives, I would lay odds
> that your power supply is slacking off on its +12 VDC. All of your problems
> could be explained as motor motion faults, which are most likely to occur at
> startup.
> The wrong speed being negotiated on startup and the right speed being
> negotiated later is strongly suggestive of power supply noise getting the
> microcontroller in the tape drive confused. Almost nothing in a modern machine
> will use +12 VDC,
           ^^^^^^^
Try every Seagate hard drive in existence: they all use 12V power. So do
HP drives and Quantum hard drives. I have yet to find a single SCSI hard
drive that doesn't use +12V for the spindle motor.

> and you could get a motherboard, a 3.5-inch hard drive, and a
> 3.5-inch floppy drive to run even with a power supply that has no +12 VDC
> output at all.
>
> The tape drive is messing up at startup because you have the power-on self-test
> feature enabled, which is the proper default, and it goes through all sorts of
> machinations which make motor spikes while the bus parameters are being worked
> out. All of your other failures during the course of operation are most likely
> when the positioning motors start and stop, drawing very high current. Low
> supply voltage leads to excessive current draw, and sucking excessive current
> on the motors is exactly how the drive detects tape jams for fail-safe
> purposes. In all probability, you are also seeing a flashing red light on the
> front of the tape drive, but you may not have noticed this.

If the Archive drives are more sensitive to the +12V level than most hard
drives, then this could indeed be the case. Most hard drives will spin
down their spindles at around 11.1 to 11.2 volts. The rated voltage
tolerance on those drives is typically 11.4 to 12.6V, so they have a
little better tolerance than the rated +/-5%. If the Archive is just as
tolerant of voltage as the hard drives are, then I would seriously doubt
this as a possibility.
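As a rough sketch of the arithmetic behind those numbers: the 12V nominal and the +/-5% rating give the 11.4 to 12.6V band, and the 11.1 to 11.2V spin-down figures are simply the ones quoted above, not values taken from any drive's data sheet. The function names here are made up for illustration.

```python
# Sketch only: checks the +12V tolerance arithmetic discussed above.
# NOMINAL_V and TOLERANCE come from the text (12V rail, rated +/-5%);
# the spin-down voltages are the approximate figures quoted in the message.

NOMINAL_V = 12.0
TOLERANCE = 0.05


def tolerance_band(nominal, tolerance):
    """Return the (low, high) voltage limits for a fractional tolerance."""
    return nominal * (1 - tolerance), nominal * (1 + tolerance)


def percent_below_nominal(voltage, nominal=NOMINAL_V):
    """Return how far a voltage sits below nominal, as a percentage."""
    return (nominal - voltage) / nominal * 100


low, high = tolerance_band(NOMINAL_V, TOLERANCE)
print(f"rated band: {low:.1f} V to {high:.1f} V")

for spin_down_v in (11.2, 11.1):
    sag = percent_below_nominal(spin_down_v)
    print(f"{spin_down_v} V is {sag:.1f}% below nominal")
```

By this reckoning a spindle that keeps turning down to 11.1 or 11.2V is riding out a sag of roughly 6.7 to 7.5 percent, which is the "little better than +/-5%" margin mentioned above.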
I would still recommend, as the first course of action, trying a new tape.
(I said "tape drive" in my original email, which was an oops; I meant to
just try a new tape in the same tape drive, possibly even one that you
have recently taken a bulk eraser to, just to make sure it's clean.)

--

Doug Ledford
Opinions expressed are my own, but they should be everybody's.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe aic7xxx" in the body of the message