Date: Sat, 2 Sep 2006 12:20:49 -0700
From: "Matthew Jacob" <lydianconcepts@gmail.com>
To: "Alex Salazar" <umbilical.blisters@gmail.com>
Cc: freebsd-current@freebsd.org, freebsd-stable@freebsd.org
Subject: Re: Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
Message-ID: <7579f7fb0609021220y2d530c93pebb59bb2c0a70945@mail.gmail.com>
In-Reply-To: <40c4bb930609020223h50c43537n1c8b32081ef5c1bf@mail.gmail.com>
References: <40c4bb930609020223h50c43537n1c8b32081ef5c1bf@mail.gmail.com>
>
> The OS booted up and the SAS controller was now detected and supported by
> the mpt(4) driver:
> ---
> mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
> 0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
> mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
> mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
> mpt0: [GIANT-LOCKED]
> mpt0: MPI Version=1.5.12.0
> ---
>
> And the related errors showed up immediately, for the first time:
> ---
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> mpt0: mpt_cam_event: 0x12
> mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
> mpt0: mpt_cam_event: 0x16
> mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
> --

These are device arrival events.

>
> When the bootstrap process reached the SCSI probe, there were
> no activity on the screen for about five minutes, so I was forced to use
> the power off button, and after rebooting, the same symptoms were evident,
> so I rebooted the machine once again, this time in verbose mode.
>
> This debug information was being printed on the screen, one character at time,
> at about 1 char/sec:
>
> (probe8:mpt0:0:8:0): error 22

What's at target 8? It isn't happy for a variety of reasons.
Oh- I see from below- it's an SES instance that drops dead if given something at > lun 0.

> (probe8:mpt0:0:8:0): Unretryable Error
> ---
> pass0 at mpt0 bus 0 target 0 lun 0
> pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
>
> As a workaround, I disabled the APICs (hint.apic.0.disabled),
> and that ~15 minutes delay at boot up, now was gone.

Fine.

>
> (BTW, 7-CURRENT has the same problem, but without that huge delay)

Do you have APIC disabled for 7-CURRENT also?
>
> Once I was logged in the server, I proceeded to populate my ports tree,
> by using portsnap(8), so, when I extracted the tarball (portsnap extract),
> there was a lot of the following error message, at about 1 message per second:
>
> mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Queue Full events from the SAS firmware.

>
> Once in a while, an error message like below, showed up:
> --
> (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
> (da0:mpt0:0:0:0): CAM Status: SCSI Status Error
> (da0:mpt0:0:0:0): SCSI Status: Check Condition
> (da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
> (da0:mpt0:0:0:0): Scsi bus reset occurred

Somebody is resetting the bus periodically. We (freebsd) aren't
volitionally doing this that I'm aware of here.

> In order to perform those diagnostics, I had to install a SuSe Linux
> Enterprise Server 9, which was also shipped with this machine)

Which is a good way of saying that LSI-Logic support isn't very evident
on FreeBSD.

>
> After reinstalling FreeBSD, I logged remotely into the server, via ssh,
> and fetched the ports snapshot again and extracted once more.
>
> Suddenly, the screen activity ceased and the network connection timed out.
>
> Locally, on the server, there was a lot of mpt(4) errors and warnings.
> ---
> (da0:mpt0:0:0:0): CAM Status 0x18
> (da0:mpt0:0:0:0): Retrying Command
> (... and about 500 more lines like those...)

Hmm.

> ---
>>
> And finally, those errors from mpt(4):
>
> ---
> request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
> request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
> request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
> (... and about 300 more lines like those ...)
> ---
>
> which were followed by the same number of lines like these:
> ---
> mpt0: completing timedout/aborted req 0xc4c4a080:44717
> mpt0: completing timedout/aborted req 0xc4c4b430:44718
> mpt0: completing timedout/aborted req 0xc4c4cd80:44719
> ---
>
> and finishing with this line:
> ---
> mpt0: Timedout requests already complete. Interrupts may not be functioning.
> ---
>

I've seen this on Supermicro EM64T in the past on 7-current, but that went away about 3-4 weeks ago.
It really seemed to me that this was indeed an interrupt related problem.
Yup, sounds like a mess here.
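
For anyone triaging a similar console log, here is a rough sketch (not part of the driver or of any FreeBSD tool) that tallies the "Unhandled Event Notify Frame" lines by event code. The function names, the NOTES table, and the SAMPLE lines are made up for illustration; the labels only repeat the readings given in this reply (0x16 and 0x12 as device arrival, 0xe as Queue Full) and are not taken from the MPI specification.

```python
import re
from collections import Counter

# Matches the mpt(4) console message quoted in this thread.
EVENT_RE = re.compile(r"Unhandled Event Notify Frame\. Event (0x[0-9a-fA-F]+)")

# Hypothetical label table: these readings come from the reply above,
# not from the MPI specification.
NOTES = {
    0x12: "device arrival (per this reply)",
    0x16: "device arrival (per this reply)",
    0x0E: "Queue Full from the SAS firmware (per this reply)",
}


def count_unhandled_events(lines):
    """Count unhandled event notifications, keyed by numeric event code."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


def summarize(counts):
    """Return one human-readable line per event code, sorted by code."""
    return [
        f"0x{code:02x}: {n} occurrence(s), {NOTES.get(code, 'no note')}"
        for code, n in sorted(counts.items())
    ]


# Sample lines copied from the messages quoted in this thread.
SAMPLE = [
    "mpt0: mpt_cam_event: 0x16",
    "mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).",
    "mpt0: mpt_cam_event: 0x12",
    "mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).",
    "mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE",
    "mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).",
    "mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).",
]

if __name__ == "__main__":
    for row in summarize(count_unhandled_events(SAMPLE)):
        print(row)
```

To use it on a real log, pass the lines of a saved dmesg capture to count_unhandled_events() instead of SAMPLE. A steady stream of one code (such as the roughly one per second 0xe messages described above) then stands out from the one-off events seen at boot.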
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?7579f7fb0609021220y2d530c93pebb59bb2c0a70945>