From: "Marc G. Fournier"
To: freebsd-scsi@freebsd.org
Cc: Scott Long
Date: Sun, 27 Apr 2003 10:05:42 -0300 (ADT)
Message-ID: <20030427093916.N8333@hub.org>
Subject: changed cable, server still hangs after ~24hrs ...

OK, after the last hang I got the techs to replace the SCSI cable in the box, which made no difference.

I've removed the KVA_PAGES option from the kernel so that nothing 'weird' is configured into it, and aaccli now works for the 5400 (I haven't been able to get my hands on a version for the 2120s yet). I'm not sure what sort of information I should be looking for (or even what is safe to run), but does any of the output below show *anything*?
Note that this enclosure is one of the Intel SR2200s, and I'm still getting the occasional 'Time-out', which to me indicates a problem, but according to the controller:

AAC0> disk show smart
Executing: disk show smart

        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
C:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:00:0  Y        6                 Y          N            0
0:01:0  Y        6                 Y          N            0
0:02:0  Y        6                 Y          N            0
0:03:0  Y        6                 Y          N            0
0:04:0  Y        6                 Y          N            0
0:05:0  Y        6                 Y          N            0

I would have expected the Error Count to increase by at least 1 if there were a problem at the hardware level ... no?

The system itself is a dual PIII with 4GB of RAM, on an Intel motherboard and chassis, so the only SCSI cable involved is the one from the motherboard to the backplane itself.

The hangs are similar to the original ones, where I'd get TIMEOUT messages scrolling up the screen, but since Scott's last "fix" for the 2G allocation issue, I no longer get the actual error messages. On each hang I've asked the techs to press Ctrl-Alt-Esc, but, as before, that doesn't work :(

Help? Is there anything else I can get the techs to try, to eliminate hardware as the cause? :(

neptune# grep aac /var/log/messages
Apr 27 07:42:02 neptune /kernel: aac0: **Monitor** ID(0:05:0) Abort Time-out. Resetting bus.
Apr 27 07:42:05 neptune /kernel: aac0: **Monitor** SCSI bus reset issued on channel 0
Apr 27 09:29:19 neptune /kernel: aac0: mem 0xf8000000-0xfbffffff irq 2 at device 9.0 on pci1
Apr 27 09:29:19 neptune /kernel: aac0: i960RX 100MHz, 48MB cache memory, optional battery present
Apr 27 09:29:19 neptune /kernel: aac0: Kernel 4.0-0, Build 5770, S/N 232fb7
Apr 27 09:29:19 neptune /kernel: aac0: Supported Options=1f7e
Apr 27 09:29:20 neptune /kernel: aacd0: on aac0
Apr 27 09:29:20 neptune /kernel: aacd0: 174993MB (358387200 sectors)
Apr 27 09:29:20 neptune /kernel: Mounting root from ufs:/dev/aacd0s1a
neptune# zgrep aac /var/log/messages.0.gz
neptune# zgrep aac /var/log/messages.1.gz
neptune# zgrep aac /var/log/messages.2.gz
Apr 24 14:56:45 neptune /kernel: aac0: **Monitor** ID(0:05:0) Abort Time-out. Resetting bus.
Apr 24 14:56:48 neptune /kernel: aac0: **Monitor** SCSI bus reset issued on channel 0
neptune# zgrep aac /var/log/messages.3.gz
Apr 23 02:20:20 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 116328, size: 4096
Apr 23 02:20:29 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 104256, size: 4096
Apr 23 02:20:30 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 111896, size: 4096
Apr 23 02:20:30 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 116304, size: 4096
Apr 23 02:20:30 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 112576, size: 4096
Apr 23 02:20:30 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 116952, size: 4096
Apr 23 02:20:30 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 113144, size: 4096
Apr 23 02:20:30 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 87424, size: 4096
Apr 23 02:20:30 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 116312, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 117016, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 116408, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 43984, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 116296, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 111224, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 112440, size: 8192
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 104840, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 111856, size: 4096
Apr 23 02:20:31 neptune /kernel: swap_pager: indefinite wait buffer: device: #aacd/0x20001, blkno: 15208, size: 4096
Apr 23 02:20:31 neptune /kernel: aac0: **Monitor** ID(0:01:0) Abort Time-out. Resetting bus.
Apr 23 02:20:31 neptune /kernel: aac0: **Monitor** SCSI bus reset issued on channel 0
Apr 23 10:39:30 neptune /kernel: aac0: mem 0xf8000000-0xfbffffff irq 2 at device 9.0 on pci1
Apr 23 10:39:30 neptune /kernel: aac0: i960RX 100MHz, 48MB cache memory, optional battery present
Apr 23 10:39:30 neptune /kernel: aac0: Kernel 4.0-0, Build 5770, S/N 232fb7
Apr 23 10:39:30 neptune /kernel: aac0: Supported Options=1f7e
Apr 23 10:39:30 neptune /kernel: aacd0: on aac0
Apr 23 10:39:30 neptune /kernel: aacd0: 174993MB (358387200 sectors)
Apr 23 10:39:30 neptune /kernel: Mounting root from ufs:/dev/aacd0s1a
Apr 23 23:32:39 neptune /kernel: aac0: **Monitor** ID(0:01:0) Abort Time-out. Resetting bus.
Apr 23 23:32:42 neptune /kernel: aac0: **Monitor** SCSI bus reset issued on channel 0

AAC0> controller details
Executing: controller details

Controller Information
----------------------
Remote Computer: S
Device Name: S
Controller Type: No Info
Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 232FB7
Number of Buses: 1
Devices per Bus: 15
Controller CPU: i960 R series
Controller CPU Speed: 100 Mhz
Controller Memory: 64 Mbytes
Battery State: Not Present

Component Revisions
-------------------
CLI: 1.0-0 (Build #5263)
API: 1.0-0 (Build #5263)
Miniport Driver: 4.0-0 (Build #5770)
Controller Software: 4.0-0 (Build #5770)
Controller BIOS: 4.0-0 (Build #5770)
Controller Firmware: (Build #5770)
Controller Hardware: 2.64

Scsi    Partition      Container   MultiLevel
C:ID:L  Offset:Size    Num Type    Num Type    R/W
------  -------------  --- ------  --- ------  ---
0:00:0  64.0KB:34.1GB  0   RAID-5  0   None    RW
0:01:0  64.0KB:34.1GB  0   RAID-5  0   None    RW
0:02:0  64.0KB:34.1GB  0   RAID-5  0   None    RW
0:03:0  64.0KB:34.1GB  0   RAID-5  0   None    RW
0:04:0  64.0KB:34.1GB  0   RAID-5  0   None    RW
0:05:0  64.0KB:34.1GB  0   RAID-5  0   None    RW

        Smart    Method of         Enable
        Capable  Informational     Exception  Performance  Error
C:ID:L  Device   Exceptions(MRIE)  Control    Enabled      Count
------  -------  ----------------  ---------  -----------  ------
0:00:0  Y        6                 Y          N            0
0:01:0  Y        6                 Y          N            0
0:02:0  Y        6                 Y          N            0
0:03:0  Y        6                 Y          N            0
0:04:0  Y        6                 Y          N            0
0:05:0  Y        6                 Y          N            0
0:06:0  N
0:06:1  N
0:06:2  N
0:06:3  N
0:06:4  N
0:06:5  N
0:06:6  N
0:06:7  N

C:ID:L  Device Type     Blocks     Bytes/Block  Usage             Shared  Rate
------  --------------  ---------  -----------  ----------------  ------  ----
0:00:0  Disk            71687372   512          Initialized       NO      320
0:01:0  Disk            71687372   512          Initialized       NO      320
0:02:0  Disk            71687372   512          Initialized       NO      320
0:03:0  Disk            71687372   512          Initialized       NO      320
0:04:0  Disk            71687372   512          Initialized       NO      320
0:05:0  Disk            71687372   512          Initialized       NO      320

Num            Total   Oth  Stripe           Scsi    Partition
Label  Type    Size    Ctr  Size    Usage    C:ID:L  Offset:Size
-----  ------  ------  ---  ------  -------  ------  -------------
0      RAID-5  170GB        64KB    Open     0:00:0  64.0KB:34.1GB
/dev/aacd0 FreeBSD                           0:01:0  64.0KB:34.1GB
                                             0:02:0  64.0KB:34.1GB
                                             0:03:0  64.0KB:34.1GB
                                             0:04:0  64.0KB:34.1GB
                                             0:05:0  64.0KB:34.1GB

Enclosure ID  (C:ID:L)  Fan  Power  Slot  Sensor  Door  Speaker  Standard  Diagnostic
------------  --------  ---  -----  ----  ------  ----  -------  --------  ----------
0             0:06:0    0    2      7     1       0     No       SAF-TE    PASSED
1             0:06:1    0    0      0     0       0     No       SAF-TE    FAILED
2             0:06:2    0    0      0     0       0     No       SAF-TE    FAILED
3             0:06:3    0    0      0     0       0     No       SAF-TE    FAILED
4             0:06:4    0    0      0     0       0     No       SAF-TE    FAILED
5             0:06:5    0    0      0     0       0     No       SAF-TE    FAILED
6             0:06:6    0    0      0     0       0     No       SAF-TE    FAILED
7             0:06:7    0    0      0     0       0     No       SAF-TE    FAILED

AAC0> enclosure show temperature
Executing: enclosure show temperature

Enclosure ID  (C:ID:L)  Sensor  Temperature  Threshold  Status
------------  --------  ------  -----------  ---------  --------
0             0:06:0    0       87 F         120        NORMAL

Is there any other information that I can pull?
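The grep/zgrep pass over the rotated logs above can be collapsed into a single run that counts the "Abort Time-out" events per C:ID:L target, which makes it easier to see whether the resets keep landing on the same drive (0:05:0 and 0:01:0 in the excerpts). This is only a minimal sketch, not an aaccli or FreeBSD tool: the helper names (tally_timeouts, tally_files) are hypothetical, and it assumes the kernel message format shown above and the messages, messages.N.gz rotation naming seen in /var/log.

```python
import glob
import gzip
import re
from collections import Counter

# Assumed format, taken from the log lines above, e.g.
# "aac0: **Monitor** ID(0:05:0) Abort Time-out. Resetting bus."
TIMEOUT_RE = re.compile(r"aac\d+: \*\*Monitor\*\* ID\((\d+:\d+:\d+)\) Abort Time-out")


def open_log(path):
    """Open a plain or gzip-compressed syslog file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def tally_timeouts(lines):
    """Count Abort Time-out events per C:ID:L target in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def tally_files(pattern="/var/log/messages*"):
    """Hypothetical helper: tally the live log plus its rotations."""
    total = Counter()
    for path in sorted(glob.glob(pattern)):
        try:
            with open_log(path) as fh:
                total.update(tally_timeouts(fh))
        except (OSError, EOFError):
            # Skip files we cannot read (permissions, truncated or bad gzip).
            continue
    return total


if __name__ == "__main__":
    for target, n in sorted(tally_files().items()):
        print(f"{target}: {n} abort time-out(s)")
```

Run as root on the affected box; a target that dominates the counts would at least point at a specific drive or backplane slot rather than the controller as a whole.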