From owner-freebsd-stable@FreeBSD.ORG Sat Oct 2 13:43:32 2010 Return-Path: Delivered-To: freebsd-stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 7D35E1065670 for ; Sat, 2 Oct 2010 13:43:32 +0000 (UTC) (envelope-from dan@langille.org) Received: from nyi.unixathome.org (nyi.unixathome.org [64.147.113.42]) by mx1.freebsd.org (Postfix) with ESMTP id 4DF008FC16 for ; Sat, 2 Oct 2010 13:43:32 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by nyi.unixathome.org (Postfix) with ESMTP id 7369F509E3 for ; Sat, 2 Oct 2010 14:43:31 +0100 (BST) X-Virus-Scanned: amavisd-new at unixathome.org Received: from nyi.unixathome.org ([127.0.0.1]) by localhost (nyi.unixathome.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id gw8UFyjs+vXi for ; Sat, 2 Oct 2010 14:43:31 +0100 (BST) Received: from smtp-auth.unixathome.org (smtp-auth.unixathome.org [10.4.7.7]) (Authenticated sender: hidden) by nyi.unixathome.org (Postfix) with ESMTPSA id 0084D509A3 for ; Sat, 2 Oct 2010 14:43:30 +0100 (BST) Message-ID: <4CA73702.5080203@langille.org> Date: Sat, 02 Oct 2010 09:43:30 -0400 From: Dan Langille Organization: The FreeBSD Diary User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4 MIME-Version: 1.0 To: freebsd-stable Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: out of HDD space - zfs degraded X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 02 Oct 2010 13:43:32 -0000 Overnight I was running a zfs send | zfs receive (both within the same system / zpool). The system ran out of space, a drive went off line, and the system is degraded. This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18 23:43:48 EDT 2010. The following logs are also available at http://www.langille.org/tmp/zfs-space.txt <- no line wrapping This is what was running: # time zfs send storage/bacula@transfer | mbuffer | zfs receive storage/compressed/bacula-mbuffer in @ 0.0 kB/s, out @ 0.0 kB/s, 3670 GB total, buffer 100% fullcannot receive new filesystem stream: out of space mbuffer: error: outputThread: error writing to at offset 0x395917c4000: Broken pipe summary: 3670 GByte in 10 h 40 min 97.8 MB/s mbuffer: warning: error during output to : Broken pipe warning: cannot send 'storage/bacula@transfer': Broken pipe real 640m48.423s user 8m52.660s sys 211m40.862s Looking in the logs, I see this: Oct 2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device Oct 2 00:50:54 kraken kernel: siisch0: Timeout on slot 30 Oct 2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 Oct 2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT Oct 2 00:50:55 kraken kernel: siisch0: Timeout on slot 30 Oct 2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 Oct 2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT Oct 2 00:50:56 kraken kernel: siisch0: Timeout on slot 30 Oct 2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 Oct 2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT Oct 2 00:50:57 kraken kernel: siisch0: Timeout on slot 30 Oct 2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 Oct 2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT Oct 2 00:50:58 kraken kernel: siisch0: Timeout on slot 30 Oct 2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000 Oct 2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage path=/dev/gpt/disk06-live offset=270336 size=8192 error=6 Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize cache failed Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6 Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6 $ zpool status pool: storage state: DEGRADED scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go config: NAME STATE READ WRITE CKSUM storage DEGRADED 0 0 0 raidz2 DEGRADED 0 0 0 gpt/disk01-live ONLINE 0 0 0 gpt/disk02-live ONLINE 0 0 0 gpt/disk03-live ONLINE 0 0 0 gpt/disk04-live ONLINE 0 0 0 gpt/disk05-live ONLINE 0 0 0 gpt/disk06-live REMOVED 0 0 0 gpt/disk07-live ONLINE 0 0 0 $ zfs list NAME USED AVAIL REFER MOUNTPOINT storage 6.97T 1.91T 1.75G /storage storage/bacula 4.72T 1.91T 4.29T /storage/bacula storage/compressed 2.25T 1.91T 46.9K /storage/compressed storage/compressed/bacula 2.25T 1.91T 42.7K /storage/compressed/bacula storage/pgsql 5.50G 1.91T 5.50G /storage/pgsql $ sudo camcontrol devlist Password: at scbus2 target 0 lun 0 (pass1,ada1) at scbus3 target 0 lun 0 (pass2,ada2) at scbus4 target 0 lun 0 (pass3,ada3) at scbus5 target 0 lun 0 (pass4,ada4) at scbus6 target 0 lun 0 (pass5,ada5) at scbus7 target 0 lun 0 (pass6,ada6) at scbus8 target 0 lun 0 (pass7,ada7) at scbus9 target 0 lun 0 (cd0,pass8) at scbus10 target 0 lun 0 (pass9,ada8) I'm not yet sure if the drive is fully dead or not. This is not a hot-swap box. I'm guessing the first step is to get ada0 back online and then in the zpool. However, I'm reluctant to do a 'camcontrol scan' on this box as it it froze up the system the last time I tried that: http://docs.freebsd.org/cgi/mid.cgi?4C78FF01.5020500 Any suggestions for getting the drive back online and the zpool stabilized? -- Dan Langille - http://langille.org/