From: "Petriz, Pablo"
To: "'Todd Denniston'"
Cc: "'aic7xxx@freebsd.org'"
Date: Thu, 8 Jul 2004 13:54:23 -0300
Subject: RE: Many SCSI errors
Message-ID: <1CEC5A75042ED51180E300A024E99257AD835C@zeus.sc.com>
List-Id: Adaptec Device Drivers in FreeBSD and Linux

That 3rd disk was replaced; it is now running OK in another box. But that
problem was solved with an entirely new Promise tower, now with 1.1.0.30
firmware.

Since your last mail I've been testing the tower in many ways. I connected
the tower to an old PC running WebPAM to log any errors, but it doesn't
detect this transmission error. On most hangs the tower seems to keep
working OK and we only have to reboot the host; only once did we have to
reboot both to reconnect.

The tests:

1) We tried lowering the speed of the SCSI bus (from 320 to 20!)
and then ran badblocks against the entire 6-disk RAID5, but it failed
and generated the same "Transmission error detected".

2) We configured the tower as JBOD and then ran badblocks on every single
HD (from 1 to 6). It ran OK on HDs 1, 2, 3 and 4, then it generated the
transmission error on HD5. We had to reboot the host and the tower. Then
we ran it again and it was OK; HD6 was OK too (but suffering hangs with
the same messages). So the conclusion is that the HDs are OK and the
problem is either transmission (cable, SCSI cards, terminators) or
software (SCSI driver or tower firmware).

3) We changed the SCSI cable and tested again using badblocks, but the
same error happened: "Transmission error detected", and out.

4) Let's try something more radical. You know my hardware is an Intel
SE7501BR2 (AIC-7901 on board) with RH Linux 9 (2.4.20) using
aic79xx-2.0.10-rh90.i686.rpm for the SCSI. We disabled the onboard SCSI
and added an old Adaptec AHA-2049U. Kudzu detected it and loaded the
aic7xxx module. We could see the tower, so we tested the 6 HDs again with
badblocks and everything worked fine!!! (But it's a little slow: that
card has a max rate of 10 MB/s.)

What is the conclusion now? Maybe it's the onboard SCSI, maybe it's that
the aic7xxx module works differently from the aic79xx module, maybe it
will fail the next minute. I don't know, but I don't want this as "the
solution".

I've sent two messages to Promise support but have had no response yet.

I don't have many options left to test. The next things we will try are:

- Install all the firmware upgrades for the motherboard and test.
- Try to run the onboard SCSI AIC-7901 with the aic7xxx module (is that
  somehow possible?).
- Install an old card that uses the aic7xxx module but is faster than the
  2049. (I don't like this, but...)
- Finally: install some Windows OS on the host and test.
  (This is my last resort, and I hope I can find a Linux solution to
  this problem.)

I'm thinking that the problem is not the tower or the firmware itself, but
the way the Linux driver "talks" to the Promise firmware. This is only a
conclusion from my tests; I'm not a driver programmer and I can't go and
read the code. On the other hand, you are seeing weird symptoms like the
one you told me about with the HD removal... I feel lost in the fog.

Hope this helps you (and me).

PABLO

> -----Original Message-----
> From: Todd Denniston [mailto:Todd.Denniston@ssa.crane.navy.mil]
> Sent: Thursday, July 8, 2004 11:57
> To: Petriz, Pablo
> Subject: Re: Many SCSI errors
>
>
> "Petriz, Pablo" wrote:
> >
> > Hello Todd
> >
> > > From: Todd Denniston
> > >
> > You have 2 RM8000 and, if I've understood correctly, one works fine
> > and the other doesn't. What's the difference between the two?
> >
> > This tower is driving me nuts. We bought it in December and it
> > worked fine for 1 or 2 weeks until we turned it off; then it began
> > to rebuild the 3rd disk. We changed it and rebuilt the new one, and
> > everything seemed to be OK, but turn off, turn on, and it rebuilds
> > again.
>
> Do you still have the disk (that was 3rd at the time)?
> Is that disk still setting physically in the Promise array?
>
> The reason I ask is, 12 days ago I removed from the array a drive which
> I know to be bad [1]. I know it should not have made any difference
> though, because the drive was only physically in the array, it was not
> locked in so there should not have been power or communications to it.
> Since I have removed it, I put the system in a configuration where
> before it would last ~16 hours max before lock up, and yet it has been
> running for 12 days. The only change is the physical removal of the bad
> drive!
>
> It would both thrill me and make me mad to find out that a drive just
> setting in the array with no power could cause these problems!
>
>
> [1] at least from the perspective of the badblocks program.
>
> --
> Todd Denniston
> Crane Division, Naval Surface Warfare Center (NSWC Crane)
> Harnessing the Power of Technology for the Warfighter
>