From owner-freebsd-bugs@FreeBSD.ORG Mon Mar 24 16:50:02 2014
Delivered-To: freebsd-bugs@smarthost.ysv.freebsd.org
Resent-Date: Mon, 24 Mar 2014 16:50:01 GMT
Resent-Message-Id: <201403241650.s2OGo1Yh023484@freefall.freebsd.org>
Resent-From: FreeBSD-gnats-submit@FreeBSD.org (GNATS Filer)
Resent-To: freebsd-bugs@FreeBSD.org
Resent-Reply-To: FreeBSD-gnats-submit@FreeBSD.org, Nagy@FreeBSD.org, Attila
Message-Id: <201403241643.s2OGhIFD055201@cgiserv.freebsd.org>
Date: Mon, 24 Mar 2014 16:43:18 GMT
From: Nagy@FreeBSD.org, Attila
To: freebsd-gnats-submit@FreeBSD.org
X-Send-Pr-Version: www-3.1
Subject: kern/187903: Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: Bug reports
X-List-Received-Date: Mon, 24 Mar 2014 16:50:02 -0000

>Number: 187903
>Category: kern
>Synopsis: Only 0.44 (always) days of uptime with ciss (w/HP SA P812)
>Confidential: no
>Severity: non-critical
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Mon Mar 24 16:50:00 UTC 2014
>Closed-Date:
>Last-Modified:
>Originator: Nagy, Attila
>Release: stable/9@r260621 and stable/10@r262152
>Organization:
>Environment:
>Description:
I have an HP DL360G7 with an HP SmartArray P812 in it, which crashes at exactly 0.44 days of uptime (well, some minutes plus or minus, but on the graph it's nearly the same) no matter what I do: whether I load the machine until it's too hot to touch or just leave it idle.
The P812 has an HP MDS600 connected to it with 70 1TB disks, set up as 6-disk RAID6 (ADG) volumes. The volumes have a 128k stripe size, because I use ZFS on top of them. The zpool is simply a stripe of the RAID6 volumes.
What may be important: the controller's RAID6 initialization is still ongoing.
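
As a sanity check on the 0.44-day figure, the Uptime: line printed at panic time (10h18m12s in the console output quoted in this report) can be converted to days. The following is only a minimal Python sketch: the helper name uptime_to_days is hypothetical, and it assumes the string has the [Nd][Nh][Nm]Ns shape seen in that line.

```python
import re


def uptime_to_days(uptime: str) -> float:
    """Convert an uptime string such as '10h18m12s' to a number of days.

    Assumption: days, hours and minutes are optional and appear in that
    order, followed by a mandatory seconds field.
    """
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(\d+)s", uptime)
    if match is None:
        raise ValueError(f"unrecognized uptime string: {uptime!r}")
    # Missing fields (for example no 'd' part) count as zero.
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    return total_seconds / 86400  # 86400 seconds per day


if __name__ == "__main__":
    # Uptime value taken from the panic output in this report.
    print(round(uptime_to_days("10h18m12s"), 2))
```

For 10h18m12s this is 37092 seconds, or about 0.43 days, which is within the "some minutes plus or minus" of 0.44 days (38016 seconds) described above.
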
In the first sentence, "idle" means the zpool/ZFS is just mounted with only some stat()s happening on it (crashes after 0.44 days), and "fully loaded" means gstat shows around 100% utilization on the disks nearly all the time (also crashes after 0.44 days).

I've already tried stable/9@r260621 and stable/10@r262152; the behavior is the same.
I've also tried Linux (Ubuntu 13.10, hpsa driver, ZFS on Linux 0.6.2), and it doesn't crash, either idle or loaded.
I have already swapped both the machine and the P812 for different ones, with no effect. Everything (DL360, P812, MDS600, disks) has the latest firmware.
The ZFS pool currently in use was created under Linux to see whether this causes the problems, but of course many things differ between the two OSes (kernel, HP SA driver, block/SCSI layer, and even ZFS is somewhat different).
Linux works, FreeBSD crashes no matter what I do.

The exact message I can see is (ciss0 is the built-in P411):
ciss1: ADAPTER HEARTBEAT FAILED
Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 0; apic id = 00
instruction pointer = 0x20:0xfffffe0c59ff795d
stack pointer = 0x28:0xfffffe0baf1ab9d0
frame pointer = 0x28:0xfffffe0baf1aba20
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi4: clock)
trap number = 1
panic: privileged instruction fault
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0baf1ab560
kdb_backtrace() at kdb_backtrace+0x39/frame 0xfffffe0baf1ab610
panic() at panic+0x155/frame 0xfffffe0baf1ab690
trap_fatal() at trap_fatal+0x3a2/frame 0xfffffe0baf1ab6f0
trap() at trap+0x794/frame 0xfffffe0baf1ab910
calltrap() at calltrap+0x8/frame 0xfffffe0baf1ab910
--- trap 0x1, rip = 0xfffffe0c59ff795d, rsp = 0xfffffe0baf1ab9d0, rbp = 0xfffffe0baf1aba20 ---
(null)() at 0xfffffe0c59ff795d/frame 0xfffffe0baf1aba20
softclock_call_cc() at softclock_call_cc+0x16c/frame 0xfffffe0000e77120
kernphys() at 0xffffffff/frame 0xfffffe0000e778a0
kernphys() at 0xffffffff/frame 0xfffffe0000e78aa0
kernphys() at 0xffffffff/frame 0xfffffe0000e78c20
Uptime: 10h18m12s
(da4:ciss1:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da4:ciss1:0:0:0): CAM status: Command timeout
(da4:ciss1:0:0:0): Error 5, Retries exhausted
(da4:ciss1:0:0:0): Synchronize cache failed
(da5:ciss1:0:1:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da5:ciss1:0:1:0): CAM status: Command timeout
(da5:ciss1:0:1:0): Error 5, Retries exhausted
(da5:ciss1:0:1:0): Synchronize cache failed
(da6:ciss1:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:ciss1:0:2:0): CAM status: Command timeout
(da6:ciss1:0:2:0): Error 5, Retries exhausted
(da6:ciss1:0:2:0): Synchronize cache failed
(da7:ciss1:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da7:ciss1:0:3:0): CAM status: Command timeout
(da7:ciss1:0:3:0): Error 5, Retries exhausted
(da7:ciss1:0:3:0): Synchronize cache failed
(da8:ciss1:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da8:ciss1:0:4:0): CAM status: Command timeout
(da8:ciss1:0:4:0): Error 5, Retries exhausted
(da8:ciss1:0:4:0): Synchronize cache failed
(da9:ciss1:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da9:ciss1:0:5:0): CAM status: Command timeout
(da9:ciss1:0:5:0): Error 5, Retries exhausted
(da9:ciss1:0:5:0): Synchronize cache failed
(da10:ciss1:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da10:ciss1:0:6:0): CAM status: Command timeout
(da10:ciss1:0:6:0): Error 5, Retries exhausted
(da10:ciss1:0:6:0): Synchronize cache failed
(da11:ciss1:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da11:ciss1:0:7:0): CAM status: Command timeout
(da11:ciss1:0:7:0): Error 5, Retries exhausted
(da11:ciss1:0:7:0): Synchronize cache failed
(da12:ciss1:0:8:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da12:ciss1:0:8:0): CAM status: Command timeout
(da12:ciss1:0:8:0): Error 5, Retries exhausted
(da12:ciss1:0:8:0): Synchronize cache failed
(da13:ciss1:0:9:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da13:ciss1:0:9:0): CAM status: Command timeout
(da13:ciss1:0:9:0): Error 5, Retries exhausted
(da13:ciss1:0:9:0): Synchronize cache failed
(da14:ciss1:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da14:ciss1:0:10:0): CAM status: Command timeout
(da14:ciss1:0:10:0): Error 5, Retries exhausted
(da14:ciss1:0:10:0): Synchronize cache failed
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...

Dmesg says:
ciss1: port 0x5000-0x50ff mem 0xfbe00000-0xfbffffff,0xfbdf0000-0xfbdf0fff irq 24 at device 0.0 on pci9
ciss1: PERFORMANT Transport
da5 at ciss1 bus 0 scbus2 target 1 lun 0
da4 at ciss1 bus 0 scbus2 target 0 lun 0
da6 at ciss1 bus 0 scbus2 target 2 lun 0
da7 at ciss1 bus 0 scbus2 target 3 lun 0
da8 at ciss1 bus 0 scbus2 target 4 lun 0
da9 at ciss1 bus 0 scbus2 target 5 lun 0
da10 at ciss1 bus 0 scbus2 target 6 lun 0
da11 at ciss1 bus 0 scbus2 target 7 lun 0
da12 at ciss1 bus 0 scbus2 target 8 lun 0
da13 at ciss1 bus 0 scbus2 target 9 lun 0
da14 at ciss1 bus 0 scbus2 target 10 lun 0

I also find it interesting that the machine's IML (Integrated Management Log) contains this message after every crash:
POST Error: 1719 - A controller failure event occurred prior to this power-up
This might show that the controller does indeed lock up, but why does it do so under FreeBSD and not under Linux?

I've already tried hw.ciss.nop_message_heartbeat=1;ciss_force_transport=1;ciss_force_interrupt=1 without any effect (it freezes after the same amount of time).

Last time during the POST the controller said:
Slot 2 HP Smart Array P812 Controller (1024MB, v6.40) 11 Logical Drives
1719-Slot 2 Drive Array - A controller failure event occurred prior to this power-up.
(Previous lock up code = 0x13)

Any ideas on what could cause this?

Mailing list link:
http://lists.freebsd.org/pipermail/freebsd-scsi/2014-March/006292.html
>How-To-Repeat:
(At least here) Create 11 RAID6 volumes (6 disks each) on a SmartArray P812 with a 128k stripe size, format them with ZFS, leave the system alone for 0.44 days, and it crashes.
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: