Date: Sat, 02 Oct 2010 18:09:25 -0400
From: Dan Langille <dan@langille.org>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>, freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: out of HDD space - zfs degraded
Message-ID: <4CA7AD95.9040703@langille.org>
In-Reply-To: <20101002141921.GC70283@icarus.home.lan>
References: <4CA73702.5080203@langille.org> <20101002141921.GC70283@icarus.home.lan>

On 10/2/2010 10:19 AM, Jeremy Chadwick wrote:
> On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote:
>> Overnight I was running a zfs send | zfs receive (both within the
>> same system / zpool). The system ran out of space, a drive went off
>> line, and the system is degraded.
>>
>> This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18
>> 23:43:48 EDT 2010.
>>
>> The following logs are also available at
>> http://www.langille.org/tmp/zfs-space.txt <- no line wrapping
>>
>> This is what was running:
>>
>> # time zfs send storage/bacula@transfer | mbuffer | zfs receive
>> storage/compressed/bacula-mbuffer
>> in @ 0.0 kB/s, out @ 0.0 kB/s, 3670 GB total, buffer 100% full
>> cannot receive new filesystem stream: out of space
>> mbuffer: error: outputThread: error writing to <stdout> at offset
>> 0x395917c4000: Broken pipe
>>
>> summary: 3670 GByte in 10 h 40 min 97.8 MB/s
>> mbuffer: warning: error during output to <stdout>: Broken pipe
>> warning: cannot send 'storage/bacula@transfer': Broken pipe
>>
>> real 640m48.423s
>> user 8m52.660s
>> sys 211m40.862s
>>
>>
>> Looking in the logs, I see this:
>>
>> Oct 2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device
>> Oct 2 00:50:54 kraken kernel: siisch0: Timeout on slot 30
>> Oct 2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct 2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct 2 00:50:55 kraken kernel: siisch0: Timeout on slot 30
>> Oct 2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct 2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct 2 00:50:56 kraken kernel: siisch0: Timeout on slot 30
>> Oct 2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct 2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct 2 00:50:57 kraken kernel: siisch0: Timeout on slot 30
>> Oct 2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct 2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct 2 00:50:58 kraken kernel: siisch0: Timeout on slot 30
>> Oct 2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct 2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>> path=/dev/gpt/disk06-live offset=270336 size=8192 error=6
>>
>> Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize
>> cache failed
>> Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry
>>
>> Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>> path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6
>> Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>> path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6
>>
>> $ zpool status
>>   pool: storage
>>  state: DEGRADED
>>  scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go
>> config:
>>
>>         NAME                 STATE     READ WRITE CKSUM
>>         storage              DEGRADED     0     0     0
>>           raidz2             DEGRADED     0     0     0
>>             gpt/disk01-live  ONLINE       0     0     0
>>             gpt/disk02-live  ONLINE       0     0     0
>>             gpt/disk03-live  ONLINE       0     0     0
>>             gpt/disk04-live  ONLINE       0     0     0
>>             gpt/disk05-live  ONLINE       0     0     0
>>             gpt/disk06-live  REMOVED      0     0     0
>>             gpt/disk07-live  ONLINE       0     0     0
>>
>> $ zfs list
>> NAME                        USED  AVAIL  REFER  MOUNTPOINT
>> storage                    6.97T  1.91T  1.75G  /storage
>> storage/bacula             4.72T  1.91T  4.29T  /storage/bacula
>> storage/compressed         2.25T  1.91T  46.9K  /storage/compressed
>> storage/compressed/bacula  2.25T  1.91T  42.7K  /storage/compressed/bacula
>> storage/pgsql              5.50G  1.91T  5.50G  /storage/pgsql
>>
>> $ sudo camcontrol devlist
>> Password:
>> <Hitachi HDS722020ALA330 JKAOA28A>  at scbus2 target 0 lun 0 (pass1,ada1)
>> <Hitachi HDS722020ALA330 JKAOA28A>  at scbus3 target 0 lun 0 (pass2,ada2)
>> <Hitachi HDS722020ALA330 JKAOA28A>  at scbus4 target 0 lun 0 (pass3,ada3)
>> <Hitachi HDS722020ALA330 JKAOA28A>  at scbus5 target 0 lun 0 (pass4,ada4)
>> <Hitachi HDS722020ALA330 JKAOA28A>  at scbus6 target 0 lun 0 (pass5,ada5)
>> <Hitachi HDS722020ALA330 JKAOA28A>  at scbus7 target 0 lun 0 (pass6,ada6)
>> <ST380815AS 4.AAB>                  at scbus8 target 0 lun 0 (pass7,ada7)
>> <TSSTcorp CDDVDW SH-S223C SB01>     at scbus9 target 0 lun 0 (cd0,pass8)
>> <WDC WD1600AAJS-75M0A0 02.03E02>    at scbus10 target 0 lun 0 (pass9,ada8)
>>
>> I'm not yet sure if the drive is fully dead or not. This is not a
>> hot-swap box.
>
> It looks to me like the disk labelled gpt/disk06-live literally stopped
> responding to commands. The errors you see are coming from the OS and
> the siis(4) controller, and both indicate the actual hard disk isn't
> responding to the ATA command READ LOG EXT. error=6 means Device not
> configured.
>
> I can't see how/why running out of space would cause this. It looks
> more like that you had a hardware issue of some sort happen during the
> course of the operations you were running. It may not have happened
> until now because you hadn't utilised writes to that area of the disk
> (could have bad sectors there, or physical media/platter problems).
>
> Please provide smartctl -a output for the drive that's gpt/disk06-live,
> which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't
> it? Sigh...). Please put the results up on the web somewhere, not
> copy-pasted, otherwise I have to do a bunch of manual work with regards
> to line wrapping/etc... I'll provide an analysis of SMART stats for
> you, to see if anything crazy happened to the disk itself.

It is ada0, I'm sure, based on the 'lost device' mentioned in
/var/log/messages above.

I'm getting nowhere. /dev/ada0 does not exist so there is nothing for
smartctl to work on.

$ sudo smartctl -a /dev/ada0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/ada0: Unable to detect device type
Smartctl: please specify device type with the -d option.

Use smartctl -h to get a usage summary

$ sudo smartctl -d ata -a /dev/ada0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl open device: /dev/ada0 failed: No such file or directory

$ ls -l /dev/ada0*
ls: /dev/ada0*: No such file or directory

I am tempted to reboot or do a camcontrol rescan.

-- 
Dan Langille - http://langille.org/
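
The reasoning in this message (the kernel reported ada0 as lost, ada0 is missing from the camcontrol devlist output, and the ZFS failures carry error=6) can be cross-checked mechanically. What follows is only a rough sketch in Python, not anything that was run on the affected machine. The sample lines are copied from the logs quoted above; the function names, the regular expressions, and the assumption that these log formats are stable are all illustrative and hypothetical.

```python
import errno
import re

# Sample lines copied from the /var/log/messages excerpt quoted in this thread.
LOG_LINES = [
    "Oct 2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device",
    "Oct 2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage "
    "path=/dev/gpt/disk06-live offset=270336 size=8192 error=6",
    "Oct 2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry",
]

# Sample lines copied from the `camcontrol devlist` output quoted in this thread.
DEVLIST_LINES = [
    "<Hitachi HDS722020ALA330 JKAOA28A> at scbus2 target 0 lun 0 (pass1,ada1)",
    "<Hitachi HDS722020ALA330 JKAOA28A> at scbus7 target 0 lun 0 (pass6,ada6)",
    "<ST380815AS 4.AAB> at scbus8 target 0 lun 0 (pass7,ada7)",
    "<TSSTcorp CDDVDW SH-S223C SB01> at scbus9 target 0 lun 0 (cd0,pass8)",
    "<WDC WD1600AAJS-75M0A0 02.03E02> at scbus10 target 0 lun 0 (pass9,ada8)",
]

# Hypothetical patterns, written to fit the sample lines above only.
LOST_RE = re.compile(r"\((\w+):\w+:\d+:\d+:\d+\): lost device")
VDEV_RE = re.compile(r"path=(\S+)\s.*?error=(\d+)")
PERIPH_RE = re.compile(r"\(([^)]*)\)\s*$")


def lost_devices(lines):
    """Return the sorted device names the kernel reported as lost."""
    found = set()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


def attached_devices(lines):
    """Return peripheral names from devlist lines, skipping passN entries."""
    names = set()
    for line in lines:
        match = PERIPH_RE.search(line)
        if match:
            for name in match.group(1).split(","):
                if not name.startswith("pass"):
                    names.add(name)
    return names


def vdev_errors(lines):
    """Return (vdev path, symbolic errno name) for each ZFS I/O failure line."""
    result = []
    for line in lines:
        match = VDEV_RE.search(line)
        if match:
            code = int(match.group(2))
            result.append((match.group(1), errno.errorcode.get(code, str(code))))
    return result


if __name__ == "__main__":
    attached = attached_devices(DEVLIST_LINES)
    for dev in lost_devices(LOG_LINES):
        state = "still attached" if dev in attached else "no longer attached"
        print(dev, state)
    for path, name in vdev_errors(LOG_LINES):
        print(path, name)
```

With the sample data, the lost device is absent from the attached set, which matches the missing /dev/ada0 node. Error 6 maps to ENXIO, the errno whose FreeBSD message is "Device not configured", as Jeremy noted.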