Date: Sat, 29 Mar 1997 12:27:28 -0500
From: Rohit Dube <rohit@cs.umd.edu>
To: scsi@freebsd.org
Cc: rohit@cs.umd.edu
Subject: Re: AHA2940 bug(s) still exist in 2.2.1
Message-ID: <199703291727.MAA07478@seine.cs.umd.edu>
Hi,

I had posted the following to the hardware list earlier. I am reposting it to the scsi list with some minor edits, in the hope that it may give the developers some additional clues.

-->

I am seeing some weird problems with a couple of machines running 2.2-970225-GAMMA. Every night when we run amanda's 'amdump', these machines crash. The crash can also be triggered by a 'dump' to /dev/null or by a 'dd'. (It is not entirely deterministic, but all three crash the machines most of the time.) We tried 2.1.5, 2.1.7, 2.2-961006-SNAP and 2.2.1, and all of them exhibit the same behaviour.

We have the following hardware on the machines that are crashing (curtailed dmesg output showing only the PCI devices):

    Probing for devices on PCI bus 0:
    chip0 <Intel 82439> rev 3 on pci0:0
    chip1 <Intel 82371SB PCI-ISA bridge> rev 1 on pci0:7:0
    chip2 <Intel 82371SB IDE interface> rev 0 on pci0:7:1
    vga0 <VGA-compatible display device> rev 0 int a irq 12 on pci0:9
    de0 <Digital 21140A Fast Ethernet> rev 32 int a irq 10 on pci0:11
    de0: SMC 9332BDT 21140A [10-100Mb/s] pass 2.0
    de0: address 00:00:c0:03:6b:f9
    ahc0 <Adaptec 2940 Ultra SCSI host adapter> rev 0 int a irq 11 on pci0:12
    ahc0: aic7880 Single Channel, SCSI Id=7, 16 SCBs
    ahc0 waiting for scsi devices to settle
    (ahc0:0:0): "MICROP 4421-07 0329SJ 0329" type 0 fixed SCSI 2
    sd0(ahc0:0:0): Direct-Access 2047MB (4193360 512 byte sectors)
    (ahc0:6:0): "SONY CD-ROM CDU-76S 1.2d" type 5 removable SCSI 2
    cd0(ahc0:6:0): CD-ROM cd present [400000 x 2048 byte records]

The console shows the following error messages (they are not logged, as the disk is inaccessible):

    sd0(ahc0:0:0): no longer in timeout
    ahc0: Issued Channel A Bus Reset: 2 SCBs aborted
    Clearing bus reset
    Clearing 'in-reset' flag
    sd0(ahc0:0:0): SCB 0x1 - timed out while idle
    LASTPHASE == 0x1, SCSISIGI = 0x0
    SEQADDR == 0x12

The above message repeats with different values for SEQADDR. The first message that gets printed says something like 'timed out in command phase'.
I can't quote it exactly, as it happened in the middle of the night and scrolled off the screen. After a reset following this occurrence, the disk is not visible even to the Adaptec probe at boot-up; we have to power-cycle the machine. The block position at which the error is triggered varies, by the way.

Has anybody else seen a problem like this before, or does anyone know what is going on here? Any help is greatly appreciated! We just can't afford to have these machines go down every night while doing a backup!

Thanks.
--rohit.

PS: I am attaching the output of 'scsi -f /dev/rsd0 -m1' and 'df' here, in case it is of any use in tracking down this problem.

    # scsi -f /dev/rsd0 -m1
    AWRE (Auto Write Reallocation Enbld): 1
    ARRE (Auto Read Reallocation Enbld): 1
    TB (Transfer Block): 0
    RC (Read Continuous): 0
    EER (Enable Early Recovery): 0
    PER (Post Error): 0
    DTE (Disable Transfer on Error): 0
    DCR (Disable Correction): 0
    Read Retry Count: 14
    Correction Span: 28
    Head Offset Count: 0
    Data Strobe Offset Count: 0
    Write Retry Count: 15
    Recovery Time Limit: 0

    # df
    Filesystem    1K-blocks    Used   Avail Capacity  Mounted on
    /dev/sd0a         47183   13098   30311    30%    /
    /dev/sd0s1f     1822738  504147 1172772    30%    /usr
    /dev/sd0s1e       98479    1372   89229     2%    /var
    procfs                4       4       0   100%    /proc
    amd:96                0       0       0   100%    /fs

<--
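For readers unfamiliar with the 'scsi -m1' listing: it is the drive's Read-Write Error Recovery mode page (page code 01h). The following is a minimal sketch, assuming the SCSI-2 byte layout of that page, of how the raw page bytes map onto the fields printed above. The function name `decode_rw_error_recovery`, the dictionary keys, and the sample bytes are hypothetical; the sample was reconstructed from the printed values, not captured from the drive.

```python
"""Hypothetical decoder for the SCSI-2 Read-Write Error Recovery mode page (01h).

Byte offsets follow the SCSI-2 standard; the names used here are illustrative
and not part of any FreeBSD interface.
"""

# Byte 2 of the page holds eight one-bit flags, most significant bit first.
FLAG_BITS = [
    ("AWRE", 7), ("ARRE", 6), ("TB", 5), ("RC", 4),
    ("EER", 3), ("PER", 2), ("DTE", 1), ("DCR", 0),
]


def decode_rw_error_recovery(page: bytes) -> dict:
    """Decode a 12-byte page 01h (2-byte page header plus 10 parameter bytes)."""
    if len(page) < 12:
        raise ValueError("page 01h must be at least 12 bytes long")
    if (page[0] & 0x3F) != 0x01:  # low 6 bits of byte 0 are the page code
        raise ValueError("not a Read-Write Error Recovery page")
    fields = {name: (page[2] >> bit) & 1 for name, bit in FLAG_BITS}
    fields["Read Retry Count"] = page[3]
    # Largest error burst (in bits) for which correction may be attempted.
    fields["Correction Span"] = page[4]
    # The two offset counts are two's-complement signed values.
    fields["Head Offset Count"] = int.from_bytes(page[5:6], "big", signed=True)
    fields["Data Strobe Offset Count"] = int.from_bytes(page[6:7], "big", signed=True)
    # Byte 7 is reserved.
    fields["Write Retry Count"] = page[8]
    # Byte 9 is reserved; bytes 10-11 are a big-endian limit in milliseconds.
    fields["Recovery Time Limit"] = int.from_bytes(page[10:12], "big")
    return fields


if __name__ == "__main__":
    # Reconstructed from the listing above (AWRE=ARRE=1, 14 read retries,
    # correction span 28, 15 write retries); not a capture from the drive.
    sample = bytes([0x01, 0x0A, 0xC0, 14, 28, 0, 0, 0, 15, 0, 0, 0])
    for name, value in decode_rw_error_recovery(sample).items():
        print(f"{name}: {value}")
```

Per the SCSI-2 definitions, PER = 0 means the drive does not report errors it recovers on its own, and DCR = 0 means ECC correction is enabled.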