Date: Mon, 24 Mar 2014 05:32:55 -0700
From: Sean Bruno <sbruno@ignoranthack.me>
To: "Nagy, Attila" <bra@fsn.hu>
Cc: freebsd-scsi@freebsd.org
Subject: Re: Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
Message-ID: <1395664375.25687.1.camel@powernoodle.corp.yahoo.com>
In-Reply-To: <532FFF5E.2010900@fsn.hu>
References: <532FFF5E.2010900@fsn.hu>

On Mon, 2014-03-24 at 10:48 +0100, Nagy, Attila wrote:
> Hi,
>
> I have an HP DL360G7 with a HP SmartArray P812 in it, which crashes
> exactly (well, some minutes plus or minus, but on the graph it's nearly
> the same) at 0.44 days of uptime no matter what I do, load the machine
> until it's so hot, I can't touch it, or just leave it idle.
> The P812 has an HP MDS600 connected to it with 70 1TB disks, with a 6
> disk RAID6 (ADG) setup. The volumes have 128k stripe size, because I use
> ZFS on top of them.
> The zpool is simply a stripe of the RAID6 volumes.
> What may be important: the controller's RAID6 initialization is still
> ongoing.
>
> In the first sentence idle means the zpool/zfs is just mounted and only
> some stat()s happening on them (crashes after 0.44 days) and fully
> loaded means gstat shows around 100% utilization on the disks nearly all
> the time (crashes after 0.44 days also).
>
> I've already tried with stable/9@r260621 and stable/10@r262152, it's the
> same.
> I've also tried with Linux (Ubuntu 13.10, hpsa driver, zfs on linux
> 0.6.2), it doesn't crash (neither idle or loaded).
> Already swapped the machine and the P812 to a different one, no effect.
> Everything (DL360, P812, MDS600, disks) has the latest firmware.
>
> The currently used ZFS is created under Linux to see whether this causes
> the problems, but of course there are many different things in the two
> OS (kernel, HP SA driver, block/SCSI layer and even ZFS is somewhat
> different).
> Linux works, FreeBSD crashes no matter what I do.
>
> The exact message I can see is (ciss0 is the built-in P411):
> ciss1: ADAPTER HEARTBEAT FAILED
>
>
> Fatal trap 1: privileged instruction fault while in kernel mode
> cpuid = 0; apic id = 00
> instruction pointer = 0x20:0xfffffe0c59ff795d
> stack pointer = 0x28:0xfffffe0baf1ab9d0
> frame pointer = 0x28:0xfffffe0baf1aba20
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 12 (swi4: clock)
> trap number = 1
> panic: privileged instruction fault
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
> 0xfffffe0baf1ab560
> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
> panic() at panic+0x155/frame 0xfffffe0baf1ab690
> trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
> trap() at trap+0x794/frame 0xfffffe0baf1ab910
> calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
> --- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp =
> 0xfffffe0baf1aba20 ---
> (null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
> softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
> kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
> kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
> Uptime: 10h18m12s
> (da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da4:ciss1:0:0:0): CAM status: Command timeout
> (da4:ciss1:0:0:0): Error 5, Retries exhausted
> (da4:ciss1:0:0:0): Synchronize cache failed
> (da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da5:ciss1:0:1:0): CAM status: Command timeout
> (da5:ciss1:0:1:0): Error 5, Retries exhausted
> (da5:ciss1:0:1:0): Synchronize cache failed
> (da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da6:ciss1:0:2:0): CAM status: Command timeout
> (da6:ciss1:0:2:0): Error 5, Retries exhausted
> (da6:ciss1:0:2:0): Synchronize cache failed
> (da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da7:ciss1:0:3:0): CAM status: Command timeout
> (da7:ciss1:0:3:0): Error 5, Retries exhausted
> (da7:ciss1:0:3:0): Synchronize cache failed
> (da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da8:ciss1:0:4:0): CAM status: Command timeout
> (da8:ciss1:0:4:0): Error 5, Retries exhausted
> (da8:ciss1:0:4:0): Synchronize cache failed
> (da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da9:ciss1:0:5:0): CAM status: Command timeout
> (da9:ciss1:0:5:0): Error 5, Retries exhausted
> (da9:ciss1:0:5:0): Synchronize cache failed
> (da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da10:ciss1:0:6:0): CAM status: Command timeout
> (da10:ciss1:0:6:0): Error 5, Retries exhausted
> (da10:ciss1:0:6:0): Synchronize cache failed
> (da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da11:ciss1:0:7:0): CAM status: Command timeout
> (da11:ciss1:0:7:0): Error 5, Retries exhausted
> (da11:ciss1:0:7:0): Synchronize cache failed
> (da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da12:ciss1:0:8:0): CAM status: Command timeout
> (da12:ciss1:0:8:0): Error 5, Retries exhausted
> (da12:ciss1:0:8:0): Synchronize cache failed
> (da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da13:ciss1:0:9:0): CAM status: Command timeout
> (da13:ciss1:0:9:0): Error 5, Retries exhausted
> (da13:ciss1:0:9:0): Synchronize cache failed
> (da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00
> 00 00
> (da14:ciss1:0:10:0): CAM status: Command timeout
> (da14:ciss1:0:10:0): Error 5, Retries exhausted
> (da14:ciss1:0:10:0): Synchronize cache failed
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
>
> Dmesg says:
> ciss1: <HP Smart Array P812> port 0x5000-0x50ff mem
> 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
> ciss1: PERFORMANT Transport
> da5 at ciss1 bus 0 scbus2 target 1 lun 0
> da4 at ciss1 bus 0 scbus2 target 0 lun 0
> da6 at ciss1 bus 0 scbus2 target 2 lun 0
> da7 at ciss1 bus 0 scbus2 target 3 lun 0
> da8 at ciss1 bus 0 scbus2 target 4 lun 0
> da9 at ciss1 bus 0 scbus2 target 5 lun 0
> da10 at ciss1 bus 0 scbus2 target 6 lun 0
> da11 at ciss1 bus 0 scbus2 target 7 lun 0
> da12 at ciss1 bus 0 scbus2 target 8 lun 0
> da13 at ciss1 bus 0 scbus2 target 9 lun 0
> da14 at ciss1 bus 0 scbus2 target 10 lun 0
>
> I also find it interesting that the machine's IML (Integrated Management
> Log) contains this message after every crash:
> POST Error: 1719 - A controller failure event occurred prior to this
> power-up
>
> Which might show that the controller indeed locks up, but why does it do
> this under FreeBSD and doesn't under Linux?
> I've already tried
> hw.ciss.nop_message_heartbeat=1;ciss_force_transport=1;ciss_force_interrupt=1
> without any effect (it freezes after the same time).
>
> Last time during the POST the controller said:
> Slot 2 HP Smart Array P812 Controller (1024MB, v6.40) 11 Logical
> Drives
> 1719-Slot 2 Drive Array - A controller failure event occurred prior to this
> power-up. (Previous lock up code = 0x13)
>
> Any ideas on what could cause this?
>
> Thanks,
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

Can you open a p/r on this? I'd like to keep tracking ciss(4) issues. It
seems like there is something odd with our driver when using multiple
controllers.

sean
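
For comparing crashes against the reported 0.44-day figure, it may help to convert the "Uptime:" line from the panic output into days. The following is only a rough sketch: the function name uptime_to_days is made up for illustration, and the assumed format (optional days, hours and minutes followed by seconds, as in "10h18m12s") is inferred from the single line quoted above.

```python
import re

SECONDS_PER_DAY = 86400

def uptime_to_days(line):
    """Convert a panic 'Uptime: 10h18m12s' line to fractional days.

    Hypothetical helper; the d/h/m/s layout is an assumption based on
    the one 'Uptime:' line in the quoted console output.
    """
    m = re.search(r"Uptime:\s*(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(\d+)s", line)
    if m is None:
        raise ValueError("no Uptime field found in: %r" % line)
    # Missing fields (for example, no 'd' part) count as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    total = days * SECONDS_PER_DAY + hours * 3600 + minutes * 60 + seconds
    return total / SECONDS_PER_DAY

if __name__ == "__main__":
    # 10h18m12s is 37092 seconds.
    print("%.2f" % uptime_to_days("Uptime: 10h18m12s"))  # 0.43
```

For the quoted crash this gives about 0.43 days, roughly 15 minutes short of 0.44 days (10h33m36s), which fits the "some minutes plus or minus" in the report.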