Date:      Sat, 2 Sep 2006 04:23:21 -0500
From:      "Alex Salazar" <umbilical.blisters@gmail.com>
To:        freebsd-stable@freebsd.org
Cc:        freebsd-current@freebsd.org
Subject:   Several issues on Dell 1950/2950 servers (6-STABLE and 7-CURRENT)
Message-ID:  <40c4bb930609020223h50c43537n1c8b32081ef5c1bf@mail.gmail.com>

Apologies for the long message, and thanks in advance for any response.

I've just bought one of those new generation Dell servers, specifically,
the PowerEdge 1950.

It is a dual-processor system with Intel Dual-Core Xeon 5050 CPUs
(3.0 GHz, 667MHz FSB) and 1GB of 533MHz RAM.

This server has an LSI Logic SAS 5/i integrated adapter and dual embedded
Broadcom NetXtreme II 5708 Gigabit Ethernet NICs.

When I tried to install from a FreeBSD 6.0-RELEASE i386 CD I had at hand,
no hard disc was detected.

After finding out that the SAS controller was not supported on that release,
I grabbed the most recent 6.1-STABLE i386 snapshot (200608) and tried again.
This time, the hard disc was detected properly.

The installation succeeded and, after the post-install configuration,
the system was restarted.

The OS booted up and the SAS controller was now detected and supported by
the mpt(4) driver:
---
mpt0: <LSILogic SAS Adapter> port 0xec00-0xecff mem 0xfc4fc000-0xfc4fffff,
0xfc4e0000-0xfc4effff irq 64 at device 8.0 on pci2
mpt0: Reserved 0x100 bytes for rid 0x10 type 4 at 0xec00
mpt0: Reserved 0x4000 bytes for rid 0x14 type 3 at 0xfc4fc000
mpt0: [GIANT-LOCKED]
mpt0: MPI Version=1.5.12.0
---

The related errors then showed up immediately, for the first time:
---
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
mpt0: mpt_cam_event: MPI_EVENT_SAS_DEVICE_STATUS_CHANGE
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
---

When the bootstrap process reached the SCSI probe, there was no activity
on the screen for about five minutes, so I was forced to use the power
button. After rebooting, the same symptoms appeared, so I rebooted the
machine once again, this time in verbose mode.

This debug information was printed on the screen one character at a time,
at about 1 char/sec:

(probe8:mpt0:0:8:0): error 22
(probe8:mpt0:0:8:0): Unretryable Error
(probe8:mpt0:0:8:0): error 22
(probe8:mpt0:0:8:0): Unretryable Error
(probe0:mpt0:0:0:1): error 22
(probe0:mpt0:0:0:1): Unretryable Error
(probe1:mpt0:0:8:1): Unexpected Bus Free
(probe1:mpt0:0:8:1): Retrying Command
...
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): Retrying Command
(probe0:mpt0:0:8:7): Unexpected Bus Free
(probe0:mpt0:0:8:7): error 5
(probe0:mpt0:0:8:7): Retries Exausted

After 18 (eighteen) minutes, the error messages ceased, and the boot process
continued as usual:
---
pass0 at mpt0 bus 0 target 0 lun 0
pass0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
pass0: Serial Number K40C1Q5K
pass0: 300.000MB/s transfers, Tagged Queueing Enabled
pass1 at mpt0 bus 0 target 8 lun 0
pass1: <DP BACKPLANE 1.00> Fixed Enclosure Services SCSI-5 device
pass1: 300.000MB/s transfers, Tagged Queueing Enabled
ses0 at mpt0 bus 0 target 8 lun 0
ses0: <DP BACKPLANE 1.00> Fixed Enclosure Services SCSI-5 device
ses0: 300.000MB/s transfers, Tagged Queueing Enabled
ses0: SCSI-3 SES Device
GEOM: new dida0 at mpt0 bus 0 target 0 lun 0
da0: <MAXTOR ATLAS15K2_073SAS BP00> Fixed Direct Access SCSI-5 device
da0: Serial Number K40C1Q5K
da0: 300.000MB/s transfers, Tagged Queueing Enabled
da0: 70007MB (143374650 512 byte sectors: 255H 63S/T 8924C)
---

As a workaround, I disabled the APICs (hint.apic.0.disabled),
and that ~15 minute delay at boot was gone. Fine.

(BTW, 7-CURRENT has the same problem, but without that huge delay)

Once I was logged into the server, I proceeded to populate my ports tree
using portsnap(8). When I extracted the tarball (portsnap extract),
the following error message appeared repeatedly, about once per second:

mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).

Once in a while, an error message like the one below showed up:
---
(da0:mpt0:0:0:0): WRITE(10). CDB: 2a 0 1 55 6f 5f 0 0 20 0
(da0:mpt0:0:0:0): CAM Status: SCSI Status Error
(da0:mpt0:0:0:0): SCSI Status: Check Condition
(da0:mpt0:0:0:0): UNIT ATTENTION asc:29,2
(da0:mpt0:0:0:0): Scsi bus reset occurred
---

After running some diagnostics included on the utility CDs shipped with this
server, I concluded this was a software issue:
---
Device Name     : SAS Disk 0:0
Description     : SAS MAXTOR ATLAS15K2_073SAS
Test Name       : Disk Self Test
Passes          : 1
Result          : passed
Start Time      : Mon Aug 21 02:04:06 2006
Completion Time : Mon Aug 21 02:26:12 2006
Result Event    : The test operation completed successfully
---

(In order to perform those diagnostics, I had to install SUSE Linux
Enterprise Server 9, which was also shipped with this machine.)

After reinstalling FreeBSD, I logged into the server remotely via ssh,
fetched the ports snapshot again, and extracted it once more.

Suddenly, the screen activity ceased and the network connection timed out.

Locally, on the server, there were a lot of mpt(4) errors and warnings.
---
(da0:mpt0:0:0:0): CAM Status 0x18
(da0:mpt0:0:0:0): Retrying Command
(... and about 500 more lines like those...)
---

Then came some bce(4) errors, which caused the network interface to be shut down:
---
bce0: ../../../dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting
bce0: link state changed to DOWN
bce0: Gigabit link up
bce0: link state changed to UP
bce0: ../../../dev/bce/if_bce.c(5032): Watchdog timeout occurred, resetting
---

And finally, these errors from mpt(4):

---
request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)
request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)
request 0xc4c4cd80:44719 timed out for ccb 0xc4c52800 (req->ccb 0xc4c52800)
(... and about 300 more lines like those ...)
---

which were followed by the same number of lines like these:
---
mpt0: completing timedout/aborted req 0xc4c4a080:44717
mpt0: completing timedout/aborted req 0xc4c4b430:44718
mpt0: completing timedout/aborted req 0xc4c4cd80:44719
---

and finishing with this line:
---
mpt0: Timedout requests already complete. Interrupts may not be functioning.
---
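
For anyone triaging a similar /var/log/messages, here is a minimal Python
sketch that pairs the "timed out" requests with the "completing
timedout/aborted" lines and reports any request that never got a completion
message. The regular expressions are derived only from the sample lines
quoted above, and the function name is hypothetical; other mpt(4) versions
may word these messages differently.

```python
import re

# Patterns are assumptions based solely on the log lines quoted in this message.
TIMED_OUT = re.compile(r"request (0x[0-9a-f]+):(\d+) timed out for ccb (0x[0-9a-f]+)")
COMPLETED = re.compile(r"completing timedout/aborted req (0x[0-9a-f]+):(\d+)")


def unmatched_timeouts(lines):
    """Return (address, serial) pairs that timed out but were never reported as completed."""
    timed_out, completed = set(), set()
    for line in lines:
        m = TIMED_OUT.search(line)
        if m:
            timed_out.add((m.group(1), int(m.group(2))))
            continue
        m = COMPLETED.search(line)
        if m:
            completed.add((m.group(1), int(m.group(2))))
    return sorted(timed_out - completed)


# Shortened sample: two timeouts, only the first one has a completion line.
sample = [
    "request 0xc4c4a080:44717 timed out for ccb 0xc4e41400 (req->ccb 0xc4e41400)",
    "request 0xc4c4b430:44718 timed out for ccb 0xc4ca5800 (req->ccb 0xc4ca5800)",
    "mpt0: completing timedout/aborted req 0xc4c4a080:44717",
]
print(unmatched_timeouts(sample))
```

An empty result on the full log would agree with the observation that every
timed-out request was followed by a matching completion line.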


After an hour and a half, the system was still unstable and I was forced to
reboot it.




Those are the main issues (mpt(4) and bce(4)) with this hardware
configuration and FreeBSD (6-STABLE and 7-CURRENT). However,
two other problems showed up as well.

1. The first network interface (labeled "Gb 1" on the server's case)
was detected as bce1, and the second one (Gb 2) as bce0.
---
bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci8
bce0: Reserved 0x2000000 bytes for rid 0x10 type 3 at 0xf4000000
bce0: ASIC ID 0x57081010; Revision (B1); PCI-X 64-bit 133MHz
miibus0: <MII bus> on bce0
brgphy0: <BCM5708C 10/100/1000baseTX PHY> on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX,
         1000baseTX, 1000baseTX-FDX, auto
bce0: bpf attached
bce0: Ethernet address: 00:13:72:f9:xx:xx
bce0: [MPSAFE]
...
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci4
bce1: Reserved 0x2000000 bytes for rid 0x10 type 3 at 0xf8000000
bce1: ASIC ID 0x57081010; Revision (B1); PCI-X 64-bit 133MHz
miibus1: <MII bus> on bce1
brgphy1: <BCM5708C 10/100/1000baseTX PHY> on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX,
         1000baseTX, 1000baseTX-FDX, auto
bce1: bpf attached
bce1: Ethernet address: 00:13:72:f9:xx:xx
bce1: [MPSAFE]
---

According to this log, bce0 is on pci8, while bce1 is on pci4.


2. Sometimes, the server refuses to be halted or rebooted via
shutdown(8) or reboot(8).
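
For item 1, the device-to-bus mapping can be pulled out of a dmesg log
mechanically. The following is a rough Python sketch, assuming attach lines
look like the bce(4) lines quoted above (a "name: <description>" line,
optionally wrapped onto an indented continuation line). The function name is
hypothetical, and a continuation line that is not indented would be missed.

```python
import re

# Both patterns are assumptions based on the bce(4) attach lines quoted above.
ATTACH = re.compile(r"^(?P<dev>[a-z]+\d+): <")
LOCATION = re.compile(r"at device (?P<slot>\d+\.\d+) on (?P<bus>pci\d+)")


def pci_locations(dmesg_text):
    """Map each device name to (bus, slot) as reported at attach time."""
    found = {}
    current = None
    for line in dmesg_text.splitlines():
        m = ATTACH.match(line)
        if m:
            current = m.group("dev")
        elif not line[:1].isspace():
            current = None  # an unindented, unrelated line ends the wrapped attach message
        loc = LOCATION.search(line)
        if loc and current and current not in found:
            found[current] = (loc.group("bus"), loc.group("slot"))
    return found


sample = """\
bce0: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf4000000-0xf5ffffff irq 16 at device 0.0 on pci8
bce0: Reserved 0x2000000 bytes for rid 0x10 type 3 at 0xf4000000
bce1: <Broadcom NetXtreme II BCM5708 1000Base-T (B1), v0.9.6>
     mem 0xf8000000-0xf9ffffff irq 16 at device 0.0 on pci4
"""
for dev, (bus, slot) in sorted(pci_locations(sample).items()):
    print(dev, bus, slot)
```

This only restates what the log already says (bce0 on pci8, bce1 on pci4);
it does not explain why the numbering is reversed relative to the case labels.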


Any hint will be appreciated :)


/var/run/dmesg.boot (verbose log)
http://bsdero.tripod.com/dmesg.boot.txt

/var/log/messages (with remarks)
http://bsdero.tripod.com/messages.txt

pciconf -lv output
http://bsdero.tripod.com/pciconf.txt

-- 
Alex Salazar


