From owner-freebsd-stable@FreeBSD.ORG  Sat Sep  2 19:21:04 2006
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Message-ID: <7579f7fb0609021220y2d530c93pebb59bb2c0a70945@mail.gmail.com>
Date: Sat, 2 Sep 2006 12:20:49 -0700
From: "Matthew Jacob" <lydianconcepts@gmail.com>
To: "Alex Salazar" <umbilical.blisters@gmail.com>
In-Reply-To: <40c4bb930609020223h50c43537n1c8b32081ef5c1bf@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <40c4bb930609020223h50c43537n1c8b32081ef5c1bf@mail.gmail.com>
Cc: freebsd-current@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: Several issues on Dell 1950/2950 servers (6-STABLE and
	7-CURRENT)

>
> The OS booted up and the SAS controller was now detected and supported by
> the mpt(4) driver:
> ---
> mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
> 0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
> mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
> mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
> mpt0: [GIANT-LOCKED]
> mpt0: MPI Version=1.5.12.0
> ---
>
> And the related errors showed up immediately, for the first time:
> ---
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> mpt0: mpt_cam_event: 0x12
> mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> --

These are device arrival events.

>
> When the bootstrap process reached the SCSI probe, there were
> no activity on the screen for about five minutes, so I was forced to use
> the power off button, and after rebooting, the same symptoms were evident,
> so I rebooted the machine once again, this time in verbose mode.
>
> This debug information was being printed on the screen, one character at time,
> at about 1 char/sec:
>
> (probe8:mpt0:0:8:0): error 22

What's at target 8? It isn't happy for a variety of reasons. Oh- I see
from below- it's an SES instance that drops dead if given something
at > lun 0.

> (probe8:mpt0:0:8:0): Unretryable Error
> ---
> pass0 at mpt0 bus 0 target 0 lun 0
> pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
> As a workaround, I disabled the APICs (hint.apic.0.disabled),
> and that ~15 minutes delay at boot up, now was gone. Fine.
>
> (BTW, 7-CURRENT has the same problem, but without that huge delay)

Do you have APIC disabled for 7-CURRENT also?

>
> Once I was logged in the server, I proceeded to populate my ports tree,
> by using portsnap(8), so, when I extracted the tarball (portsnap extract),
> there was a lot of the following error message, at about 1 message per second:
>
> mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Queue Full events from the SAS firmware.
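
To see how often each of these unhandled events shows up, the console
output can be tallied with a few lines of script. This is only a rough
sketch, not part of mpt(4): the function names are made up for
illustration, and the meanings in the table are limited to what is
stated in this thread (0x12 and 0x16 as device arrival events, 0xe as
Queue Full). Any other code is reported as unidentified rather than
guessed at.

```python
import re
from collections import Counter

# Meanings taken only from the replies in this thread; nothing else is assumed.
EVENT_NOTES = {
    0x0E: "Queue Full (SAS firmware)",
    0x12: "device arrival",
    0x16: "device arrival",
}

# Matches the mpt(4) console line quoted above, with or without a "> " prefix.
EVENT_RE = re.compile(r"Unhandled Event Notify Frame\. Event (0x[0-9a-fA-F]+)")


def tally_events(lines):
    """Count unhandled event codes found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


def describe(code):
    """Return the meaning given in this thread, or a placeholder."""
    return EVENT_NOTES.get(code, "not identified in this thread")


if __name__ == "__main__":
    sample = [
        "mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).",
        "mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).",
        "mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).",
    ]
    for code, count in sorted(tally_events(sample).items()):
        print(f"event 0x{code:02x}: {count} time(s), {describe(code)}")
```

Feeding it the saved output of dmesg (for example, the lines of
/var/log/messages) instead of the sample list gives a per-event count.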

>
> Once in a while, an error message like below, showed up:
> --
> (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
> (da0:mpt0:0:0:0): CAM Status: SCSI Status Error
> (da0:mpt0:0:0:0): SCSI Status: Check Condition
> (da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
> (da0:mpt0:0:0:0): Scsi bus reset occurred

Somebody is resetting the bus periodically. We (FreeBSD) aren't
volitionally doing this that I'm aware of here.
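
For anyone reading the quoted error, the CDB bytes can be unpacked by
hand or with a small helper. The sketch below assumes the standard
10-byte READ(10)/WRITE(10) layout (opcode in byte 0, big-endian LBA in
bytes 2-5, big-endian transfer length in bytes 7-8); decode_rw10 is a
hypothetical name, not a CAM function.

```python
# Opcodes for the two 10-byte read/write commands.
RW10_NAMES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_rw10(cdb_text):
    """Decode a READ(10)/WRITE(10) CDB printed as space-separated hex bytes.

    Returns (command name, starting LBA, transfer length in blocks).
    """
    cdb = [int(token, 16) for token in cdb_text.split()]
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    if cdb[0] not in RW10_NAMES:
        raise ValueError(f"opcode 0x{cdb[0]:02x} is not READ(10)/WRITE(10)")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2..5
    nblocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7..8
    return RW10_NAMES[cdb[0]], lba, nblocks


if __name__ == "__main__":
    name, lba, nblocks = decode_rw10("2a 0 1 55 6f 5f 0 0 20 0")
    print(f"{name} lba=0x{lba:08x} blocks={nblocks}")
```

For the CDB quoted above this gives a WRITE(10) of 0x20 (32) blocks
starting at LBA 0x01556f5f, i.e. an ordinary write that happened to be
in flight when the reset was reported.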

> In order to perform those diagnostics, I had to install a SuSe Linux
> Enterprise Server 9, which was also shipped with this machine)

Which is a good way of saying that LSI-Logic support isn't very
evident on FreeBSD.

>
> After reinstalling FreeBSD, I logged remotely into the server, via ssh,
> and fetched the ports snapshot again and extracted once more.
>
> Suddenly, the screen activity ceased and the network connection timed out.
>
> Locally, on the server, there was a lot of mpt(4) errors and warnings.
> ---
> (da0:mpt0:0:0:0): CAM Status 0x18
> (da0:mpt0:0:0:0): Retrying Command
> (... and about 500 more lines like those...)

Hmm.

> ---
>
> And finally, those errors from mpt(4):
>
> ---
> request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
> request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
> request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
> (... and about 300 more lines like those ...)
> ---
>
> which were followed by the same number of lines like these:
> ---
> mpt0: completing timedout/aborted req 0xc4c4a080:44717
> mpt0: completing timedout/aborted req 0xc4c4b430:44718
> mpt0: completing timedout/aborted req 0xc4c4cd80:44719
> ---
>
> and finishing with this line:
> ---
> mpt0: Timedout requests already complete. Interrupts may not be functioning.
> ---
>

I've seen this on Supermicro EM64T in the past on 7-current, but that
went away about 3-4 weeks ago. It really seemed to me that this was
indeed an interrupt related problem.
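
One way to check the quoted log against that theory is to pair every
"timed out" request with its later "completing timedout/aborted req"
line. The following is a hedged sketch that only parses the two message
formats quoted above; unmatched_timeouts is a made-up helper name and
the script knows nothing about the driver's internals.

```python
import re

# The two mpt(4) message formats quoted in this thread.
TIMEOUT_RE = re.compile(r"request (0x[0-9a-f]+):(\d+) timed out")
COMPLETE_RE = re.compile(r"completing timedout/aborted req (0x[0-9a-f]+):(\d+)")


def unmatched_timeouts(lines):
    """Return (request address, serial) pairs that timed out but were
    never reported as completed afterwards in the given log lines."""
    timed_out = set()
    completed = set()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out.add((match.group(1), int(match.group(2))))
            continue
        match = COMPLETE_RE.search(line)
        if match:
            completed.add((match.group(1), int(match.group(2))))
    return sorted(timed_out - completed)


if __name__ == "__main__":
    log = [
        "request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)",
        "request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)",
        "mpt0: completing timedout/aborted req 0xc4c4a080:44717",
        "mpt0: completing timedout/aborted req 0xc4c4b430:44718",
    ]
    print(unmatched_timeouts(log))
```

An empty result means every timed-out request was later found complete,
which fits the "Timedout requests already complete" message: the
commands finished, but the completions were not noticed in time.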

Yup, sounds like a mess here.