Date: Mon, 24 Mar 2014 10:48:14 +0100
From: "Nagy, Attila" <bra@fsn.hu>
To: freebsd-scsi@freebsd.org
Subject: Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
Message-ID: <532FFF5E.2010900@fsn.hu>
Hi,

I have an HP DL360G7 with an HP SmartArray P812 in it, which crashes at almost exactly 0.44 days of uptime (give or take a few minutes, but on the graph it looks nearly the same every time). It makes no difference whether I load the machine until it is too hot to touch or just leave it idle.

The P812 has an HP MDS600 connected to it with 70 1TB disks, in a 6-disk RAID6 (ADG) setup. The volumes have a 128k stripe size, because I use ZFS on top of them. The zpool is simply a stripe of the RAID6 volumes. What may be important: the controller's RAID6 initialization is still ongoing.

Above, "idle" means the zpool/zfs is just mounted and only a few stat()s happen on it (crashes after 0.44 days), and "fully loaded" means gstat shows around 100% utilization on the disks nearly all the time (also crashes after 0.44 days).

I've already tried stable/9@r260621 and stable/10@r262152; the result is the same. I've also tried Linux (Ubuntu 13.10, hpsa driver, ZFS on Linux 0.6.2), and it doesn't crash there, neither idle nor loaded. I have already swapped the machine and the P812 for different ones, with no effect. Everything (DL360, P812, MDS600, disks) has the latest firmware.

The ZFS pool currently in use was created under Linux, to see whether that affects the problem, but of course many things differ between the two OSes (kernel, HP SA driver, block/SCSI layer, and even ZFS is somewhat different). Linux works; FreeBSD crashes no matter what I do.
The exact message I can see is (ciss0 is the built-in P411):

ciss1: ADAPTER HEARTBEAT FAILED
Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xfffffe0c59ff795d
stack pointer = 0x28:0xfffffe0baf1ab9d0
frame pointer = 0x28:0xfffffe0baf1aba20
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock)
trap number = 1
panic: privileged instruction fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0baf1ab560
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
panic() at panic+0x155/frame 0xfffffe0baf1ab690
trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
trap() at trap+0x794/frame 0xfffffe0baf1ab910
calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
--- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp = 0xfffffe0baf1aba20 ---
(null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
Uptime: 10h18m12s
(da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da4:ciss1:0:0:0): CAM status: Command timeout
(da4:ciss1:0:0:0): Error 5, Retries exhausted
(da4:ciss1:0:0:0): Synchronize cache failed
(da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:ciss1:0:1:0): CAM status: Command timeout
(da5:ciss1:0:1:0): Error 5, Retries exhausted
(da5:ciss1:0:1:0): Synchronize cache failed
(da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:ciss1:0:2:0): CAM status: Command timeout
(da6:ciss1:0:2:0): Error 5, Retries exhausted
(da6:ciss1:0:2:0): Synchronize cache failed
(da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da7:ciss1:0:3:0): CAM status: Command timeout
(da7:ciss1:0:3:0): Error 5, Retries exhausted
(da7:ciss1:0:3:0): Synchronize cache failed
(da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da8:ciss1:0:4:0): CAM status: Command timeout
(da8:ciss1:0:4:0): Error 5, Retries exhausted
(da8:ciss1:0:4:0): Synchronize cache failed
(da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da9:ciss1:0:5:0): CAM status: Command timeout
(da9:ciss1:0:5:0): Error 5, Retries exhausted
(da9:ciss1:0:5:0): Synchronize cache failed
(da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da10:ciss1:0:6:0): CAM status: Command timeout
(da10:ciss1:0:6:0): Error 5, Retries exhausted
(da10:ciss1:0:6:0): Synchronize cache failed
(da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da11:ciss1:0:7:0): CAM status: Command timeout
(da11:ciss1:0:7:0): Error 5, Retries exhausted
(da11:ciss1:0:7:0): Synchronize cache failed
(da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:ciss1:0:8:0): CAM status: Command timeout
(da12:ciss1:0:8:0): Error 5, Retries exhausted
(da12:ciss1:0:8:0): Synchronize cache failed
(da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da13:ciss1:0:9:0): CAM status: Command timeout
(da13:ciss1:0:9:0): Error 5, Retries exhausted
(da13:ciss1:0:9:0): Synchronize cache failed
(da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da14:ciss1:0:10:0): CAM status: Command timeout
(da14:ciss1:0:10:0): Error 5, Retries exhausted
(da14:ciss1:0:10:0): Synchronize cache failed
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
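
As a quick sanity check on the "0.44 days" figure, the uptime printed at panic time can be converted to days. This is only a minimal Python sketch and not part of the original report: the function name uptime_to_days is hypothetical, and the regular expression assumes the uptime is printed in the d/h/m/s form seen in the log above.

```python
import re

# Assumed format: optional days/hours/minutes followed by seconds,
# e.g. "Uptime: 10h18m12s" or "Uptime: 1d2h3m4s".
UPTIME_RE = re.compile(r"Uptime:\s*(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(\d+)s")


def uptime_to_days(line: str) -> float:
    """Convert a panic-time 'Uptime: ...' line to fractional days."""
    match = UPTIME_RE.search(line)
    if match is None:
        raise ValueError(f"no uptime found in: {line!r}")
    # Missing fields (e.g. no 'd' part) count as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds / 86400


print(round(uptime_to_days("Uptime: 10h18m12s"), 2))  # 0.43
```

For the crash shown here this gives roughly 0.43 days, which agrees with the 0.44 days seen on the graph to within a few minutes.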
Dmesg says:

ciss1: <HP Smart Array P812> port 0x5000-0x50ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
ciss1: PERFORMANT Transport
da5 at ciss1 bus 0 scbus2 target 1 lun 0
da4 at ciss1 bus 0 scbus2 target 0 lun 0
da6 at ciss1 bus 0 scbus2 target 2 lun 0
da7 at ciss1 bus 0 scbus2 target 3 lun 0
da8 at ciss1 bus 0 scbus2 target 4 lun 0
da9 at ciss1 bus 0 scbus2 target 5 lun 0
da10 at ciss1 bus 0 scbus2 target 6 lun 0
da11 at ciss1 bus 0 scbus2 target 7 lun 0
da12 at ciss1 bus 0 scbus2 target 8 lun 0
da13 at ciss1 bus 0 scbus2 target 9 lun 0
da14 at ciss1 bus 0 scbus2 target 10 lun 0

I also find it interesting that the machine's IML (Integrated Management Log) contains this message after every crash:

POST Error: 1719 - A controller failure event occurred prior to this power-up

This might show that the controller really does lock up, but why does it do so under FreeBSD and not under Linux?

I've already tried hw.ciss.nop_message_heartbeat=1;ciss_force_transport=1;ciss_force_interrupt=1 without any effect (it freezes after the same amount of time).

Last time, during POST, the controller said:

Slot 2 HP Smart Array P812 Controller (1024MB, v6.40) 11 Logical Drives
1719-Slot 2 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13)

Any ideas on what could cause this?

Thanks,