From owner-freebsd-fs@FreeBSD.ORG Fri Dec 7 13:07:10 2012
Message-ID: <50C1E9FB.4020108@gmail.com>
Date: Fri, 07 Dec 2012 14:07:07 +0100
From: Johan Hendriks <joh.hendriks@gmail.com>
To: freebsd-fs@freebsd.org
Subject: Re: ZFS hang
In-Reply-To: <50C1DDE8.9030503@icritical.com>

Matt Burke wrote:
> After rebooting the box, I've just seen this on the console (after
> 'Setting hostid'):
>
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 53 8 0 1 0 0
> (da8:isci0:0:0:0): CAM status: SCSI Status Error
> (da8:isci0:0:0:0): SCSI status: Check Condition
> (da8:isci0:0:0:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
> (da8:isci0:0:0:0): Info: 0x4215378
> (da8:isci0:0:0:0): Retrying command (per sense data)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 53 8 0 1 0 0
> (da8:isci0:0:0:0): CAM status: SCSI Status Error
> (da8:isci0:0:0:0): SCSI status: Check Condition
> (da8:isci0:0:0:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
> (da8:isci0:0:0:0): Info: 0x4215378
> (da8:isci0:0:0:0): Retrying command (per sense data)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 53 8 0 1 0 0
> (da8:isci0:0:0:0): CAM status: SCSI Status Error
> (da8:isci0:0:0:0): SCSI status: Check Condition
> (da8:isci0:0:0:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
> (da8:isci0:0:0:0): Info: 0x4215378
> (da8:isci0:0:0:0): Retrying command (per sense data)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 53 8 0 1 0 0
> (da8:isci0:0:0:0): CAM status: SCSI Status Error
> (da8:isci0:0:0:0): SCSI status: Check Condition
> (da8:isci0:0:0:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
> (da8:isci0:0:0:0): Info: 0x4215378
> (da8:isci0:0:0:0): Retrying command (per sense data)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 53 8 0 1 0 0
> (da8:isci0:0:0:0): CAM status: SCSI Status Error
> (da8:isci0:0:0:0): SCSI status: Check Condition
> (da8:isci0:0:0:0): SCSI sense: MEDIUM ERROR asc:ffffffff,ffffffff (Reserved ASC/ASCQ pair)
> (da8:isci0:0:0:0): Info: 0x4215378
> (da8:isci0:0:0:0): Error 5, Retries exhausted
>
> and then again for the following:
>
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 65 8 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 75 8 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 21 76 8 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 82 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 83 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 84 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 94 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 95 58 0 1 0 0 (only 2 retries)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 9b 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 a4 58 0 1 0 0 (only 1 retry)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 a5 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 a6 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 b4 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 b5 58 0 1 0 0 (2 retries)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 b6 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 bc 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 c7 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 d7 58 0 1 0 0 (1 retry)
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 d8 58 0 1 0 0
> (da8:isci0:0:0:0): READ(10). CDB: 28 0 4 1 e8 58 0 1 0 0
>
> Obviously, the cause of my problems would seem to be a hosed disk.
> However, the kernel msgbuf shows no complaints from the drive before
> the reboot.
>
> da8 is a 60GB OCZ Agility 3 SSD (purchased prior to realising just how
> unreliable they are). According to the SMART data, it has had just
> 146GB of reads and 278GB of writes over 3 power cycles, with only 3
> months of power-on time, similar to the others that have failed (~60%
> failure rate for ours).
>
> I can understand the drive failing; I just can't understand how it
> hung the system. I have had a similar thing happen on one of these
> machines before (with GENERIC and no dumpdev, so no debugging) with
> one of these disks on an Areca HBA.
>
> I've also had these drives fail on the onboard SATA controller, along
> with SAS drives on the SAS controllers, with no undesirable effects
> (other than having to swap them out).
>
> Could there be a problem with ATA devices on SCSI controllers which is
> causing failures to be silently dropped? Is ZFS lacking a timeout on
> IO calls?
>
> I'm going to move all these SSDs onto the SATA controller and see if I
> can replicate the problem, but I'm not holding my breath for a
> conclusive result.

I had something similar. This was on 9.0 with an LSI 9211-8i controller and a Supermicro backplane. One disk, a Seagate 300 GB SAS drive, was failing with similar kernel messages, and ZFS got stuck on it. It hung the system completely, which was no fun at all because we use ZFS as the storage for our VMware hypervisors. After replacing that drive all was fine again, but it looks like ZFS, or something else, cannot handle this situation properly.
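For anyone else chasing this, a few commands that may help confirm whether a drive is the culprit before the pool wedges completely. This is just a sketch: the device name da8 is taken from the report above (adjust for your system), and smartctl assumes the sysutils/smartmontools port is installed.

```shell
# Scan the kernel message buffer for CAM-level media errors; the
# pattern matches the messages quoted in this thread.
dmesg | grep -E 'Retries exhausted|MEDIUM ERROR' || echo "no media errors logged"

# Ask ZFS which pools it considers unhealthy. In the hangs described
# here the pool may still report ONLINE, which is part of the problem.
zpool status -x

# Query the drive's own SMART error log directly.
smartctl -l error /dev/da8
```

None of this fixes the hang, but it at least shows whether the drive itself logged anything before ZFS stalled.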
I also had a disk that was known to have bad sectors. That disk was ejected from the pool as it should be, with a message that too many errors had occurred on it.

Regards,
Johan