From: Dennis Glatting
To: Andriy Gapon
Cc: freebsd-fs@freebsd.org
Subject: Discovered stangeness (Was: ZFS hang status update)
Date: Sun, 21 Oct 2012 20:15:26 -0700

As noted in my previous email, camcontrol against the SSD (da0) would hang and did so across a reboot. I decided to remove the SSD from the system.

When I disconnected the SSD and rebooted, the boot process included these messages:

run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 180 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 240 seconds for xpt_config

The system would eventually continue but hang later in the boot sequence, never reaching the command prompt, at this point:

Timecounter "TSC-low" frequency 8594011 Hz quality 800

I removed power from the system and tried again. No luck. I reconnected the SSD, rebooted in verbose mode, and eventually got this:

Timecounter "TSC-low" frequency 8594011 Hz quality 800
GEOM_PART: partition 1 is not aligned on 4096 bytes
GEOM_PART: partition 2 is not aligned on 4096 bytes

What I eventually discovered is that one of the two disks of the OS RAID1 array is suddenly toast. Maybe this is a coincidence, but could it be that the driver is confusing the two LSI chips?

I am in the process of rebuilding this system.

BTW, I installed ZFS-on-Linux under CentOS 6.3 on one of my other systems that would spontaneously reboot when I would issue a "zfs send" of a data set to it from another system. That system was issued a job with substantial load and has been up for only four hours. It'll be interesting to see if anything happens.
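
For reference, the GEOM_PART alignment warnings above amount to a modulo check on each partition's starting byte offset. Below is a minimal sketch of that arithmetic, assuming 512-byte logical sectors; the starting sectors 63 and 2048 are hypothetical values chosen for illustration and are not taken from this system.

```python
SECTOR_SIZE = 512   # assumed logical sector size in bytes
ALIGNMENT = 4096    # alignment named in the GEOM_PART warning


def is_aligned(start_sector, sector_size=SECTOR_SIZE, alignment=ALIGNMENT):
    """Return True if the partition's starting byte offset is a multiple of alignment."""
    return (start_sector * sector_size) % alignment == 0


# Hypothetical starting sectors, for illustration only.
for start in (63, 2048):
    status = "aligned" if is_aligned(start) else "not aligned"
    print(f"start sector {start}: {status} on {ALIGNMENT} bytes")
```

A partition that starts at sector 63 (a common legacy layout) fails this check, while one that starts at sector 2048 passes. The warning itself is about performance on 4K-sector media and would not by itself explain a hang.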