Subject: Discovered strangeness (Was: ZFS hang status update)
From: Dennis Glatting <dg@pki2.com>
To: Andriy Gapon <avg@freebsd.org>
In-Reply-To: <1350711509.86715.59.camel@btw.pki2.com>
References: <1350698905.86715.33.camel@btw.pki2.com>
 <1350711509.86715.59.camel@btw.pki2.com>
Content-Type: text/plain; charset="ISO-8859-1"
Date: Sun, 21 Oct 2012 20:15:26 -0700
Message-ID: <1350875726.86715.134.camel@btw.pki2.com>
Cc: freebsd-fs@freebsd.org

As noted in my previous email, camcontrol against the SSD (da0) would
hang, and it continued to hang after a reboot. I decided to remove the
SSD from the system.

When I disconnected the SSD and rebooted, the boot process included
these messages:

run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 180 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 240 seconds for xpt_config

The system would eventually continue, but then hung later in the boot
sequence at this point, never reaching the command prompt:

Timecounter "TSC-low" frequency 8594011 Hz quality 800

I removed power from the system and tried again. No luck. I reconnected
the SSD, rebooted in verbose mode, and eventually got this:

Timecounter "TSC-low" frequency 8594011 Hz quality 800
GEOM_PART: partition 1 is not aligned on 4096 bytes
GEOM_PART: partition 2 is not aligned on 4096 bytes

What I eventually discovered is that one of the two disks in the OS
RAID1 array is suddenly toast. Maybe this is a coincidence, but could
the driver be confusing the two LSI chips?

I am in the process of rebuilding this system.


BTW, I installed ZFS-on-Linux under CentOS 6.3 on one of my other
systems, one that would spontaneously reboot when I issued a "zfs send"
of a data set to it from another system. That system has been given a
job with substantial load and has been up for only four hours. It'll be
interesting to see if anything happens.