Date:      Mon, 24 Mar 2014 05:32:55 -0700
From:      Sean Bruno <sbruno@ignoranthack.me>
To:        "Nagy, Attila" <bra@fsn.hu>
Cc:        freebsd-scsi@freebsd.org
Subject:   Re: Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
Message-ID:  <1395664375.25687.1.camel@powernoodle.corp.yahoo.com>
In-Reply-To: <532FFF5E.2010900@fsn.hu>
References:  <532FFF5E.2010900@fsn.hu>


On Mon, 2014-03-24 at 10:48 +0100, Nagy, Attila wrote:
> Hi,
>
> I have an HP DL360G7 with a HP SmartArray P812 in it, which crashes
> exactly (well, some minutes plus or minus, but on the graph it's nearly
> the same) at 0.44 days of uptime no matter what I do, load the machine
> until it's so hot, I can't touch it, or just leave it idle.
> The P812 has an HP MDS600 connected to it with 70 1TB disks, with a 6
> disk RAID6 (ADG) setup. The volumes have 128k stripe size, because I use
> ZFS on top of them.
> The zpool is simply a stripe of the RAID6 volumes.
> What may be important: the controller's RAID6 initialization is still
> ongoing.
>
> In the first sentence idle means the zpool/zfs is just mounted and only
> some stat()s happening on them (crashes after 0.44 days) and fully
> loaded means gstat shows around 100% utilization on the disks nearly all
> the time (crashes after 0.44 days also).
>
> I've already tried with stable/9@r260621 and stable/10@r262152, it's the
> same.
> I've also tried with Linux (Ubuntu 13.10, hpsa driver, zfs on linux
> 0.6.2), it doesn't crash (neither idle or loaded).
> Already swapped the machine and the P812 to a different one, no effect.
> Everything (DL360, P812, MDS600, disks) has the latest firmware.
>
> The currently used ZFS is created under Linux to see whether this causes
> the problems, but of course there are many different things in the two
> OS (kernel, HP SA driver, block/SCSI layer and even ZFS is somewhat
> different).
> Linux works, FreeBSD crashes no matter what I do.
>
> The exact message I can see is (ciss0 is the built-in P411):
> ciss1: ADAPTER HEARTBEAT FAILED
>
>
> Fatal trap 1: privileged instruction fault while in kernel mode
> cpuid = 0; apic id = 00
> instruction pointer     = 0x20:0xfffffe0c59ff795d
> stack pointer           = 0x28:0xfffffe0baf1ab9d0
> frame pointer           = 0x28:0xfffffe0baf1aba20
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                          = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 12 (swi4: clock)
> trap number             = 1
> panic: privileged instruction fault
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0baf1ab560
> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
> panic() at panic+0x155/frame 0xfffffe0baf1ab690
> trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
> trap() at trap+0x794/frame 0xfffffe0baf1ab910
> calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
> --- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp = 0xfffffe0baf1aba20 ---
> (null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
> softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
> kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
> Uptime: 10h18m12s
> (da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da4:ciss1:0:0:0): CAM status: Command timeout
> (da4:ciss1:0:0:0): Error 5, Retries exhausted
> (da4:ciss1:0:0:0): Synchronize cache failed
> (da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da5:ciss1:0:1:0): CAM status: Command timeout
> (da5:ciss1:0:1:0): Error 5, Retries exhausted
> (da5:ciss1:0:1:0): Synchronize cache failed
> (da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da6:ciss1:0:2:0): CAM status: Command timeout
> (da6:ciss1:0:2:0): Error 5, Retries exhausted
> (da6:ciss1:0:2:0): Synchronize cache failed
> (da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da7:ciss1:0:3:0): CAM status: Command timeout
> (da7:ciss1:0:3:0): Error 5, Retries exhausted
> (da7:ciss1:0:3:0): Synchronize cache failed
> (da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da8:ciss1:0:4:0): CAM status: Command timeout
> (da8:ciss1:0:4:0): Error 5, Retries exhausted
> (da8:ciss1:0:4:0): Synchronize cache failed
> (da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da9:ciss1:0:5:0): CAM status: Command timeout
> (da9:ciss1:0:5:0): Error 5, Retries exhausted
> (da9:ciss1:0:5:0): Synchronize cache failed
> (da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da10:ciss1:0:6:0): CAM status: Command timeout
> (da10:ciss1:0:6:0): Error 5, Retries exhausted
> (da10:ciss1:0:6:0): Synchronize cache failed
> (da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da11:ciss1:0:7:0): CAM status: Command timeout
> (da11:ciss1:0:7:0): Error 5, Retries exhausted
> (da11:ciss1:0:7:0): Synchronize cache failed
> (da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da12:ciss1:0:8:0): CAM status: Command timeout
> (da12:ciss1:0:8:0): Error 5, Retries exhausted
> (da12:ciss1:0:8:0): Synchronize cache failed
> (da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da13:ciss1:0:9:0): CAM status: Command timeout
> (da13:ciss1:0:9:0): Error 5, Retries exhausted
> (da13:ciss1:0:9:0): Synchronize cache failed
> (da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
> (da14:ciss1:0:10:0): CAM status: Command timeout
> (da14:ciss1:0:10:0): Error 5, Retries exhausted
> (da14:ciss1:0:10:0): Synchronize cache failed
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
>
> Dmesg says:
> ciss1: <HP Smart Array P812> port 0x5000-0x50ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
> ciss1: PERFORMANT Transport
> da5 at ciss1 bus 0 scbus2 target 1 lun 0
> da4 at ciss1 bus 0 scbus2 target 0 lun 0
> da6 at ciss1 bus 0 scbus2 target 2 lun 0
> da7 at ciss1 bus 0 scbus2 target 3 lun 0
> da8 at ciss1 bus 0 scbus2 target 4 lun 0
> da9 at ciss1 bus 0 scbus2 target 5 lun 0
> da10 at ciss1 bus 0 scbus2 target 6 lun 0
> da11 at ciss1 bus 0 scbus2 target 7 lun 0
> da12 at ciss1 bus 0 scbus2 target 8 lun 0
> da13 at ciss1 bus 0 scbus2 target 9 lun 0
> da14 at ciss1 bus 0 scbus2 target 10 lun 0
>
> I also find it interesting that the machine's IML (Integrated Management
> Log) contains this message after every crash:
> POST Error: 1719 - A controller failure event occurred prior to this power-up
>
> Which might show that the controller indeed locks up, but why does it do
> this under FreeBSD and doesn't under Linux?
> I've already tried
> hw.ciss.nop_message_heartbeat=1;ciss_force_transport=1;ciss_force_interrupt=1
> without any effect (it freezes after the same time).
>
> Last time during the POST the controller said:
> Slot 2  HP Smart Array P812 Controller       (1024MB, v6.40)  11 Logical Drives
> 1719-Slot 2 Drive Array - A controller failure event occurred prior to this
>       power-up.  (Previous lock up code = 0x13)
>
> Any ideas on what could cause this?
>
> Thanks,
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"


Can you open a PR on this?  I'd like to keep tracking ciss(4) issues.
It seems like there is something odd with our driver when using multiple
controllers.

sean
