From owner-freebsd-stable@FreeBSD.ORG  Sat Oct  2 13:43:32 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Message-ID: <4CA73702.5080203@langille.org>
Date: Sat, 02 Oct 2010 09:43:30 -0400
From: Dan Langille <dan@langille.org>
Organization: The FreeBSD Diary
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US;
	rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4
MIME-Version: 1.0
To: freebsd-stable <freebsd-stable@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: out of HDD space - zfs degraded

Overnight I was running a zfs send | zfs receive (both within the same 
system / zpool).  The pool ran out of space, a drive went offline, 
and the pool is now degraded.

This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18 
23:43:48 EDT 2010.

The following logs are also available at 
http://www.langille.org/tmp/zfs-space.txt <- no line wrapping

This is what was running:

# time zfs send storage/bacula@transfer | mbuffer | zfs receive storage/compressed/bacula-mbuffer
in @  0.0 kB/s, out @  0.0 kB/s, 3670 GB total, buffer 100% full
cannot receive new filesystem stream: out of space
mbuffer: error: outputThread: error writing to <stdout> at offset 0x395917c4000: Broken pipe

summary: 3670 GByte in 10 h 40 min 97.8 MB/s
mbuffer: warning: error during output to <stdout>: Broken pipe
warning: cannot send 'storage/bacula@transfer': Broken pipe

real    640m48.423s
user    8m52.660s
sys     211m40.862s
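
A rough pre-flight check might have flagged this before a 10-hour run. The sketch below is Python and was not run on this box. It converts the human-readable sizes that zfs list prints into bytes and compares a dataset's REFER against the pool's AVAIL. The helper names (to_bytes, fits) are made up for illustration. The check is deliberately pessimistic: it ignores whatever compression on the destination would save, and the figures are the ones from the zfs list output later in this message, so they only approximate what was free before the transfer started.

```python
# Hypothetical helpers; the size strings are what `zfs list` prints.
UNITS = {"B": 0, "K": 1, "M": 2, "G": 3, "T": 4, "P": 5}


def to_bytes(size: str) -> int:
    """Convert a zfs-style size such as '1.91T' or '46.9K' to bytes."""
    size = size.strip()
    suffix = size[-1].upper()
    if suffix in UNITS:
        # zfs reports sizes in powers of 1024
        return int(float(size[:-1]) * 1024 ** UNITS[suffix])
    return int(float(size))  # plain byte count, no suffix


def fits(refer: str, avail: str) -> bool:
    """True if an uncompressed copy of `refer` would fit in `avail`."""
    return to_bytes(refer) <= to_bytes(avail)


if __name__ == "__main__":
    # REFER of storage/bacula versus AVAIL of the pool
    print(fits("4.29T", "1.91T"))
```

With these numbers the check fails: an uncompressed second copy of storage/bacula does not fit in the remaining space, so the receive can only succeed if compression saves enough.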


Looking in the logs, I see this:

Oct  2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device
Oct  2 00:50:54 kraken kernel: siisch0: Timeout on slot 30
Oct  2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
Oct  2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT
Oct  2 00:50:55 kraken kernel: siisch0: Timeout on slot 30
Oct  2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
Oct  2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT
Oct  2 00:50:56 kraken kernel: siisch0: Timeout on slot 30
Oct  2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
Oct  2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT
Oct  2 00:50:57 kraken kernel: siisch0: Timeout on slot 30
Oct  2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
Oct  2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT
Oct  2 00:50:58 kraken kernel: siisch0: Timeout on slot 30
Oct  2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
Oct  2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT
Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage path=/dev/gpt/disk06-live offset=270336 size=8192 error=6
Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize cache failed
Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry
Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6
Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6
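
The error= field in the "vdev I/O failure" entries is an errno value. Errno 6 is ENXIO ("Device not configured" on FreeBSD), which fits a device that has just been removed. The sketch below is a small Python illustration of pulling the fields out of such an entry and naming the errno. It assumes each entry has been joined back onto a single line, and the function name parse_vdev_failure is invented for this example.

```python
import errno
import re

# Matches a "ZFS: vdev I/O failure" entry once it is on a single line.
VDEV_RE = re.compile(
    r"ZFS: vdev I/O failure, zpool=(?P<pool>\S+)\s+path=(?P<path>\S+)\s+"
    r"offset=(?P<offset>\d+)\s+size=(?P<size>\d+)\s+error=(?P<error>\d+)"
)


def parse_vdev_failure(line: str):
    """Return the fields of a vdev I/O failure entry, or None if no match."""
    m = VDEV_RE.search(line)
    if m is None:
        return None
    err = int(m.group("error"))
    return {
        "pool": m.group("pool"),
        "path": m.group("path"),
        "offset": int(m.group("offset")),
        "size": int(m.group("size")),
        # Symbolic errno name; falls back to the number if unknown
        "errno": errno.errorcode.get(err, str(err)),
    }


if __name__ == "__main__":
    entry = (
        "Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage "
        "path=/dev/gpt/disk06-live offset=270336 size=8192 error=6"
    )
    print(parse_vdev_failure(entry))
```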

$ zpool status
   pool: storage
  state: DEGRADED
  scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go
config:

         NAME                 STATE     READ WRITE CKSUM
         storage              DEGRADED     0     0     0
           raidz2             DEGRADED     0     0     0
             gpt/disk01-live  ONLINE       0     0     0
             gpt/disk02-live  ONLINE       0     0     0
             gpt/disk03-live  ONLINE       0     0     0
             gpt/disk04-live  ONLINE       0     0     0
             gpt/disk05-live  ONLINE       0     0     0
             gpt/disk06-live  REMOVED      0     0     0
             gpt/disk07-live  ONLINE       0     0     0
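
For anyone who wants to watch for this state from a script, here is a minimal Python sketch that picks the unhealthy leaf devices out of the config table above. It relies only on the indentation zpool status uses to show nesting. The name unhealthy_leaves and the SAMPLE text are illustrative; it has only been reasoned through against the output shown here, not tested against other pool layouts.

```python
SAMPLE = """\
config:

         NAME                 STATE     READ WRITE CKSUM
         storage              DEGRADED     0     0     0
           raidz2             DEGRADED     0     0     0
             gpt/disk05-live  ONLINE       0     0     0
             gpt/disk06-live  REMOVED      0     0     0
             gpt/disk07-live  ONLINE       0     0     0
"""


def unhealthy_leaves(status: str):
    """Return (name, state) for leaf devices whose state is not ONLINE."""
    rows = []
    in_config = False
    for raw in status.splitlines():
        fields = raw.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if not in_config:
            continue
        if not fields:
            if rows:
                break  # blank line ends the config table
            continue
        if len(fields) >= 2:
            indent = len(raw) - len(raw.lstrip())
            rows.append((indent, fields[0], fields[1]))
    bad = []
    for i, (indent, name, state) in enumerate(rows):
        # A leaf is a row not followed by a more deeply indented row.
        is_leaf = i + 1 == len(rows) or rows[i + 1][0] <= indent
        if is_leaf and state != "ONLINE":
            bad.append((name, state))
    return bad


if __name__ == "__main__":
    print(unhealthy_leaves(SAMPLE))
```

The pool and raidz2 rows are also DEGRADED, but they are skipped because they are containers rather than leaf devices.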

$ zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
storage                    6.97T  1.91T  1.75G  /storage
storage/bacula             4.72T  1.91T  4.29T  /storage/bacula
storage/compressed         2.25T  1.91T  46.9K  /storage/compressed
storage/compressed/bacula  2.25T  1.91T  42.7K  /storage/compressed/bacula
storage/pgsql              5.50G  1.91T  5.50G  /storage/pgsql

$ sudo camcontrol devlist
Password:
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus2 target 0 lun 0 (pass1,ada1)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus3 target 0 lun 0 (pass2,ada2)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus4 target 0 lun 0 (pass3,ada3)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus5 target 0 lun 0 (pass4,ada4)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus6 target 0 lun 0 (pass5,ada5)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus7 target 0 lun 0 (pass6,ada6)
<ST380815AS 4.AAB>                 at scbus8 target 0 lun 0 (pass7,ada7)
<TSSTcorp CDDVDW SH-S223C SB01>    at scbus9 target 0 lun 0 (cd0,pass8)
<WDC WD1600AAJS-75M0A0 02.03E02>   at scbus10 target 0 lun 0 (pass9,ada8)
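
Note that ada0 is absent from the listing above, which matches the "removing device entry" message in the logs. As a small illustration, this Python sketch finds gaps in the adaN numbering of camcontrol devlist output. The function name is invented, and a gap is only a hint: unit numbers can be missing for reasons other than a dropped disk.

```python
import re

DEVLIST = """\
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus2 target 0 lun 0 (pass1,ada1)
<Hitachi HDS722020ALA330 JKAOA28A>  at scbus3 target 0 lun 0 (pass2,ada2)
<ST380815AS 4.AAB>                 at scbus8 target 0 lun 0 (pass7,ada7)
<TSSTcorp CDDVDW SH-S223C SB01>    at scbus9 target 0 lun 0 (cd0,pass8)
"""


def missing_ada_units(devlist: str):
    """Return adaN names absent from 0..max(N) in camcontrol devlist output."""
    units = set()
    for line in devlist.splitlines():
        # The peripheral names sit in the trailing parentheses.
        m = re.search(r"\(([^)]*)\)\s*$", line)
        if not m:
            continue
        for name in m.group(1).split(","):
            um = re.fullmatch(r"ada(\d+)", name.strip())
            if um:
                units.add(int(um.group(1)))
    if not units:
        return []
    return [f"ada{n}" for n in range(max(units) + 1) if n not in units]


if __name__ == "__main__":
    # DEVLIST is an abbreviated excerpt, so more gaps show up here
    # than in the full listing.
    print(missing_ada_units(DEVLIST))
```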

I'm not yet sure if the drive is fully dead or not.  This is not a 
hot-swap box.

I'm guessing the first step is to get ada0 back online and then back into 
the zpool.  However, I'm reluctant to run 'camcontrol rescan' on this box, 
as it froze the system the last time I tried that:

   http://docs.freebsd.org/cgi/mid.cgi?4C78FF01.5020500

Any suggestions for getting the drive back online and the zpool stabilized?

-- 
Dan Langille - http://langille.org/