From: "Matthew Jacob"
To: "Alex Salazar"
Cc: freebsd-current@freebsd.org, freebsd-stable@freebsd.org
Date: Sat, 2 Sep 2006 12:20:49 -0700
Subject: Re: Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
Message-ID: <7579f7fb0609021220y2d530c93pebb59bb2c0a70945@mail.gmail.com>
In-Reply-To: <40c4bb930609020223h50c43537n1c8b32081ef5c1bf@mail.gmail.com>

>
> The OS booted up and the SAS controller was now detected and supported by
> the mpt(4) driver:
> ---
> mpt0: port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
> 0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
> mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
> mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
> mpt0: [GIANT-LOCKED]
> mpt0: MPI Version=1.5.12.0
> ---
>
> And the related errors showed up immediately, for the first time:
> ---
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> mpt0: mpt_cam_event: 0x12
> mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> --

These are device arrival events.

>
> When the bootstrap process reached the SCSI probe, there were
> no activity on the screen for about five minutes, so I was forced to use
> the power off button, and after rebooting, the same symptoms were evident,
> so I rebooted the machine once again, this time in verbose mode.
>
> This debug information was being printed on the screen, one character at time,
> at about 1 char/sec:
>
> (probe8:mpt0:0:8:0): error 22

What's at target 8? It isn't happy for a variety of reasons. Oh- I see
from below- it's an SES instance that drops dead if given something at > lun 0.

> (probe8:mpt0:0:8:0): Unretryable Error
> ---
> pass0 at mpt0 bus 0 target 0 lun 0
> pass0: Fixed Direct Access SCSI-5 device
>
> As a workaround, I disabled the APICs (hint.apic.0.disabled),
> and that ~15 minutes delay at boot up, now was gone.

Fine.

>
> (BTW, 7-CURRENT has the same problem, but without that huge delay)

Do you have APIC disabled for 7-CURRENT also?
>
> Once I was logged in the server, I proceeded to populate my ports tree,
> by using portsnap(8), so, when I extracted the tarball (portsnap extract),
> there was a lot of the following error message, at about 1 message per second:
>
> mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Queue Full events from the SAS firmware.

>
> Once in a while, an error message like below, showed up:
> --
> (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
> (da0:mpt0:0:0:0): CAM Status: SCSI Status Error
> (da0:mpt0:0:0:0): SCSI Status: Check Condition
> (da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
> (da0:mpt0:0:0:0): Scsi bus reset occurred

Somebody is resetting the bus periodically. We (FreeBSD) aren't
volitionally doing this that I'm aware of here.

> In order to perform those diagnostics, I had to install a SuSe Linux
> Enterprise Server 9, which was also shipped with this machine)

Which is a good way of saying that LSI-Logic support isn't very evident
on FreeBSD.

>
> After reinstalling FreeBSD, I logged remotely into the server, via ssh,
> and fetched the ports snapshot again and extracted once more.
>
> Suddenly, the screen activity ceased and the network connection timed out.
>
> Locally, on the server, there was a lot of mpt(4) errors and warnings.
> ---
> (da0:mpt0:0:0:0): CAM Status 0x18
> (da0:mpt0:0:0:0): Retrying Command
> (... and about 500 more lines like those...)

Hmm.

> ---
>>
> And finally, those errors from mpt(4):
>
> ---
> request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
> request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
> request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
> (... and about 300 more lines like those ...)
> ---
>
> which were followed by the same number of lines like these:
> ---
> mpt0: completing timedout/aborted req 0xc4c4a080:44717
> mpt0: completing timedout/aborted req 0xc4c4b430:44718
> mpt0: completing timedout/aborted req 0xc4c4cd80:44719
> ---
>
> and finishing with this line:
> ---
> mpt0: Timedout requests already complete. Interrupts may not be functioning.
> ---
>

I've seen this on Supermicro EM64T in the past on 7-current, but that went
away about 3-4 weeks ago.

It really seemed to me that this was indeed an interrupt related problem.

Yup, sounds like a mess here.
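
For anyone triaging a similar console log, here is a minimal sketch (not part of the original exchange) that tallies the "Unhandled Event Notify Frame" codes in saved mpt(4) output. The labels in EVENT_NOTES come only from the explanations given in this thread (0x12 and 0x16 as device arrival, 0xe as Queue Full); they are an assumption for illustration, not an authoritative MPI event table. The function names tally_events and describe are hypothetical.

```python
import re
from collections import Counter

# Labels taken from the replies in this thread only (assumption, not an
# authoritative MPI event table). Unlisted codes are reported as "unknown".
EVENT_NOTES = {
    0x12: "device arrival (per this thread)",
    0x16: "device arrival (per this thread)",
    0x0E: "queue full from SAS firmware (per this thread)",
}

# Matches the driver message exactly as it appears in the quoted logs.
EVENT_RE = re.compile(r"Unhandled Event Notify Frame\. Event (0x[0-9a-fA-F]+)")


def tally_events(log_text):
    """Count unhandled mpt event codes found in a block of console output."""
    counts = Counter()
    for match in EVENT_RE.finditer(log_text):
        counts[int(match.group(1), 16)] += 1
    return counts


def describe(counts):
    """Return one summary line per event code, sorted by numeric code."""
    lines = []
    for code, n in sorted(counts.items()):
        note = EVENT_NOTES.get(code, "unknown")
        lines.append(f"0x{code:x}: {n} ({note})")
    return lines


# Sample input copied from the log excerpts quoted above.
SAMPLE = """\
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).
"""

if __name__ == "__main__":
    for line in describe(tally_events(SAMPLE)):
        print(line)
```

The regex also matches lines that carry a leading "> " quote marker, so the script can be run on text pasted straight out of a mail archive as well as on a saved dmesg.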