From owner-freebsd-stable  Fri Feb 21 12:53:01 1997
Return-Path: <owner-stable>
Received: (from root@localhost)
          by freefall.freebsd.org (8.8.5/8.8.5) id MAA22026
          for stable-outgoing; Fri, 21 Feb 1997 12:53:01 -0800 (PST)
Received: from thompson.ebay.com (thompson.ebay.com [206.184.213.42])
          by freefall.freebsd.org (8.8.5/8.8.5) with ESMTP id MAA22018
          for <freebsd-stable@freebsd.org>; Fri, 21 Feb 1997 12:52:59 -0800 (PST)
Received: from pete.ebay.com ([205.215.226.130]) by thompson.ebay.com (8.8.5/8.6.12) with SMTP id MAA16125 for <freebsd-stable@freebsd.org>; Fri, 21 Feb 1997 12:52:54 -0800 (PST)
Received: by pete.ebay.com with Microsoft Mail
	id <01BC1FF5.EB538860@pete.ebay.com>; Fri, 21 Feb 1997 12:51:13 -0800
Message-ID: <01BC1FF5.EB538860@pete.ebay.com>
From: pete helme <pete@ebay.com>
To: "'freebsd-stable@freebsd.org'" <freebsd-stable@freebsd.org>
Subject: SCSI idle hangs and reboots with 2.1.7
Date: Fri, 21 Feb 1997 12:51:11 -0800
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by freefall.freebsd.org id MAA22020
Sender: owner-stable@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

We're running a heavily loaded web server that ran fine on 2.1.5. After we upgraded to 2.1.6 and then 2.1.7, things went sour, and we're now getting frequent hangs and reboots.

The only evidence we have of what's happening is the following, which would be scrolling by on the console when the machine could no longer be accessed on the net:

Feb 18 10:18:07 calculus /kernel: sd0(ahc0:0:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Feb 18 10:35:28 calculus /kernel: SEQADDR == 0xd
Feb 18 10:35:28 calculus /kernel: Clearing bus reset
Feb 18 10:35:28 calculus /kernel: Clearing 'in-reset' flag
Feb 18 10:35:28 calculus /kernel: sd1(ahc0:1:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Feb 18 10:35:28 calculus /kernel: SEQADDR == 0x10
Feb 18 10:35:28 calculus /kernel: sd1(ahc0:1:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Feb 18 10:35:28 calculus /kernel: SEQADDR == 0xc
Feb 18 10:35:28 calculus /kernel: ahc0: Issued Channel A Bus Reset. 2 SCBs aborted
Feb 18 10:35:28 calculus /kernel: Clearing bus reset
Feb 18 10:35:28 calculus /kernel: Clearing 'in-reset' flag
Feb 18 10:35:29 calculus /kernel: sd1(ahc0:1:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Feb 18 10:38:48 calculus /kernel: SEQADDR == 0xd
Feb 18 10:38:49 calculus /kernel: sd1(ahc0:1:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Feb 18 10:38:49 calculus /kernel: SEQADDR == 0x8
Feb 18 10:38:49 calculus /kernel: ahc0: Issued Channel A Bus Reset. 2 SCBs aborted
Feb 18 10:57:59 calculus /kernel: Clearing bus reset

When we first upgraded to 2.1.6 we were getting hangs a couple of times a day. When we'd run in and look at the server, we'd see stuff like the above scrolling on the console. The server could be pinged, but we couldn't telnet into it. It was essentially dead and we'd have to reboot. Usually 12 hours later the same thing would happen and we'd have to reboot the machine again. On one occasion we couldn't even ping the machine and it was completely stuck.

We've upgraded to the "final" 2.1.7 and things appear to be better, but we still get a reboot at least once a day. Now it doesn't get stuck; instead it usually reboots itself a few minutes after the idle timeout is logged. Here's the latest syslog snippet:

Feb 21 02:12:51 calculus /kernel: sd0(ahc0:0:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Feb 21 02:12:51 calculus /kernel: SEQADDR == 0x8
Feb 21 02:15:58 calculus /kernel: FreeBSD 2.1.7-RELEASE #0: Thu Feb 20 18:07:36 PST 1997

As you can see, we got the idle report again and it rebooted itself a couple of minutes later.
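
For anyone who wants to measure that gap over a longer log, here is a minimal Python sketch. It assumes the standard syslog "Mon DD HH:MM:SS" prefix (no year is logged) and treats a "/kernel: FreeBSD " line as the boot banner. The function names are made up for illustration; the input lines are the ones quoted above.

```python
from datetime import datetime

LOG = """\
Feb 21 02:12:51 calculus /kernel: sd0(ahc0:0:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Feb 21 02:12:51 calculus /kernel: SEQADDR == 0x8
Feb 21 02:15:58 calculus /kernel: FreeBSD 2.1.7-RELEASE #0: Thu Feb 20 18:07:36 PST 1997
"""

def stamp(line):
    # The first 15 characters of a syslog line are "Mon DD HH:MM:SS".
    # No year is present, so only differences within one year are meaningful.
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")

def gaps_before_reboot(lines):
    """Yield (timeout_time, boot_time, gap) for each boot banner that follows a timeout."""
    last_timeout = None
    for line in lines:
        if "timed out while idle" in line:
            last_timeout = stamp(line)
        elif "/kernel: FreeBSD " in line and last_timeout is not None:
            boot = stamp(line)
            yield last_timeout, boot, boot - last_timeout
            last_timeout = None  # count each reboot once

results = list(gaps_before_reboot(LOG.splitlines()))
for timeout, boot, gap in results:
    print(f"timeout {timeout:%H:%M:%S} -> boot {boot:%H:%M:%S}: {gap.total_seconds():.0f} s")
```

For the snippet above this reports a gap of 187 seconds, i.e. just over three minutes between the last idle timeout and the boot banner.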

We thought maybe it was the SCSI chain, so we swapped the Adaptec 2940 for one from a different machine, but that made no difference. We also checked the cable and it seems fine. The idle timeout does not always happen on the same device: both drives 0 and 1 have shown these errors, so we doubt it's the drives themselves. Again, it was working fine last week before we upgraded the OS.

We've been getting a lot of these rtq_reallyold messages too:

Feb 20 19:22:11 calculus /kernel: in_rtqtimo: adjusted rtq_reallyold to 10

...but we've heard they are innocuous. We did see at least one instance, though, where there was one of these messages and then, in the same second in the log, the SCSI idles started.
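
To check how often the two messages really share a second, a rough Python sketch like the following could be run over the full log. It is only an illustration: the function name is hypothetical, and it matches on the literal message text quoted in this post. The two sample lines are taken from the excerpts above and do not coincide; the instance mentioned in the previous paragraph is not among the excerpts quoted here.

```python
from collections import defaultdict

def coinciding_seconds(lines):
    """Return timestamps (as logged) at which both message types appear."""
    seen = defaultdict(set)
    for line in lines:
        ts = line[:15]  # syslog prefix "Mon DD HH:MM:SS"
        if "in_rtqtimo: adjusted rtq_reallyold" in line:
            seen[ts].add("rtq")
        elif "timed out while idle" in line:
            seen[ts].add("scsi")
    # dicts keep insertion order, so results come back in log order
    return [ts for ts, kinds in seen.items() if kinds == {"rtq", "scsi"}]

EXCERPT = [
    "Feb 20 19:22:11 calculus /kernel: in_rtqtimo: adjusted rtq_reallyold to 10",
    "Feb 21 02:12:51 calculus /kernel: sd0(ahc0:0:0): timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0",
]
print(coinciding_seconds(EXCERPT))
```

A same-second match would not prove a causal link, but a count across several days would show whether the one observed instance was a coincidence.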

We've tried running Apache 1.1.3, 1.2b4, and 1.2b6, and that hasn't made any difference to the crashes. We also backed out the kernel changes we had made to SOMAXCONN and the TCP options, to get closer to a generic kernel, but that hasn't gotten rid of the SCSI idles. This is running on a Pentium Pro 200 machine with 128 MB of RAM. A couple of our other machines that have had the 2.1.7 upgrade appear to be OK.

If anyone has any ideas what's going on, please let us know!

Thanks.

pete@ebay.com