Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 02 Oct 2010 18:09:25 -0400
From:      Dan Langille <dan@langille.org>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>,  freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: out of HDD space - zfs degraded
Message-ID:  <4CA7AD95.9040703@langille.org>
In-Reply-To: <20101002141921.GC70283@icarus.home.lan>
References:  <4CA73702.5080203@langille.org> <20101002141921.GC70283@icarus.home.lan>

next in thread | previous in thread | raw e-mail | index | archive | help
On 10/2/2010 10:19 AM, Jeremy Chadwick wrote:
> On Sat, Oct 02, 2010 at 09:43:30AM -0400, Dan Langille wrote:
>> Overnight I was running a zfs send | zfs receive (both within the
>> same system / zpool).  The system ran out of space, a drive went off
>> line, and the system is degraded.
>>
>> This is a raidz2 array running on FreeBSD 8.1-STABLE #0: Sat Sep 18
>> 23:43:48 EDT 2010.
>>
>> The following logs are also available at
>> http://www.langille.org/tmp/zfs-space.txt<- no line wrapping
>>
>> This is what was running:
>>
>> # time zfs send storage/bacula@transfer | mbuffer | zfs receive
>> storage/compressed/bacula-mbuffer
>> in @  0.0 kB/s, out @  0.0 kB/s, 3670 GB total, buffer 100%
>> fullcannot receive new filesystem stream: out of space
>> mbuffer: error: outputThread: error writing to<stdout>  at offset
>> 0x395917c4000: Broken pipe
>>
>> summary: 3670 GByte in 10 h 40 min 97.8 MB/s
>> mbuffer: warning: error during output to<stdout>: Broken pipe
>> warning: cannot send 'storage/bacula@transfer': Broken pipe
>>
>> real    640m48.423s
>> user    8m52.660s
>> sys     211m40.862s
>>
>>
>> Looking in the logs, I see this:
>>
>> Oct  2 00:50:53 kraken kernel: (ada0:siisch0:0:0:0): lost device
>> Oct  2 00:50:54 kraken kernel: siisch0: Timeout on slot 30
>> Oct  2 00:50:54 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct  2 00:50:54 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct  2 00:50:55 kraken kernel: siisch0: Timeout on slot 30
>> Oct  2 00:50:55 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct  2 00:50:55 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct  2 00:50:56 kraken kernel: siisch0: Timeout on slot 30
>> Oct  2 00:50:56 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct  2 00:50:56 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct  2 00:50:57 kraken kernel: siisch0: Timeout on slot 30
>> Oct  2 00:50:57 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct  2 00:50:57 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct  2 00:50:58 kraken kernel: siisch0: Timeout on slot 30
>> Oct  2 00:50:58 kraken kernel: siisch0: siis_timeout is 00040000 ss
>> 40000000 rs 40000000 es 00000000 sts 801f0040 serr 00000000
>> Oct  2 00:50:58 kraken kernel: siisch0: Error while READ LOG EXT
>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>> path=/dev/gpt/disk06-live offset=270336 size=8192 error=6
>>
>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): Synchronize
>> cache failed
>> Oct  2 00:50:59 kraken kernel: (ada0:siisch0:0:0:0): removing device entry
>>
>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>> path=/dev/gpt/disk06-live offset=2000187564032 size=8192 error=6
>> Oct  2 00:50:59 kraken root: ZFS: vdev I/O failure, zpool=storage
>> path=/dev/gpt/disk06-live offset=2000187826176 size=8192 error=6
>>
>> $ zpool status
>>    pool: storage
>>   state: DEGRADED
>>   scrub: scrub in progress for 5h32m, 17.16% done, 26h44m to go
>> config:
>>
>>          NAME                 STATE     READ WRITE CKSUM
>>          storage              DEGRADED     0     0     0
>>            raidz2             DEGRADED     0     0     0
>>              gpt/disk01-live  ONLINE       0     0     0
>>              gpt/disk02-live  ONLINE       0     0     0
>>              gpt/disk03-live  ONLINE       0     0     0
>>              gpt/disk04-live  ONLINE       0     0     0
>>              gpt/disk05-live  ONLINE       0     0     0
>>              gpt/disk06-live  REMOVED      0     0     0
>>              gpt/disk07-live  ONLINE       0     0     0
>>
>> $ zfs list
>> NAME                        USED  AVAIL  REFER  MOUNTPOINT
>> storage                    6.97T  1.91T  1.75G  /storage
>> storage/bacula             4.72T  1.91T  4.29T  /storage/bacula
>> storage/compressed         2.25T  1.91T  46.9K  /storage/compressed
>> storage/compressed/bacula  2.25T  1.91T  42.7K  /storage/compressed/bacula
>> storage/pgsql              5.50G  1.91T  5.50G  /storage/pgsql
>>
>> $ sudo camcontrol devlist
>> Password:
>> <Hitachi HDS722020ALA330 JKAOA28A>   at scbus2 target 0 lun 0 (pass1,ada1)
>> <Hitachi HDS722020ALA330 JKAOA28A>   at scbus3 target 0 lun 0 (pass2,ada2)
>> <Hitachi HDS722020ALA330 JKAOA28A>   at scbus4 target 0 lun 0 (pass3,ada3)
>> <Hitachi HDS722020ALA330 JKAOA28A>   at scbus5 target 0 lun 0 (pass4,ada4)
>> <Hitachi HDS722020ALA330 JKAOA28A>   at scbus6 target 0 lun 0 (pass5,ada5)
>> <Hitachi HDS722020ALA330 JKAOA28A>   at scbus7 target 0 lun 0 (pass6,ada6)
>> <ST380815AS 4.AAB>                  at scbus8 target 0 lun 0 (pass7,ada7)
>> <TSSTcorp CDDVDW SH-S223C SB01>     at scbus9 target 0 lun 0 (cd0,pass8)
>> <WDC WD1600AAJS-75M0A0 02.03E02>    at scbus10 target 0 lun 0 (pass9,ada8)
>>
>> I'm not yet sure if the drive is fully dead or not.  This is not a
>> hot-swap box.
>
> It looks to me like the disk labelled gpt/disk06-live literally stopped
> responding to commands.  The errors you see are coming from the OS and
> the siis(4) controller, and both indicate the actual hard disk isn't
> responding to the ATA command READ LOG EXT.  error=6 means Device not
> configured.
>
> I can't see how/why running out of space would cause this.  It looks
> more like that you had a hardware issue of some sort happen during the
> course of the operations you were running.  It may not have happened
> until now because you hadn't utilised writes to that area of the disk
> (could have bad sectors there, or physical media/platter problems).
>
> Please provide smartctl -a output for the drive that's gpt/disk06-live,
> which I assume is /dev/ada6 (glabel sure makes correlation easy, doesn't
> it?  Sigh...).  Please put the results up on the web somewhere, not
> copy-pasted, otherwise I have to do a bunch of manual work with regarsd
> to line wrapping/etc...  I'll provide an analysis of SMART stats for
> you, to see if anything crazy happened to the disk itself.

It is ada0, I'm sure, based on the 'lost device' mentioned in 
/var/log/messages above.

I'm getting nowhere.  /dev/ada0 does not exist so there is nothing for 
smartctl to work on.

  $ sudo smartctl -a /dev/ada0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/ada0: Unable to detect device type
Smartctl: please specify device type with the -d option.

Use smartctl -h to get a usage summary




$ sudo smartctl -d ata -a /dev/ada0da0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl open device: /dev/ada0 failed: No such file or directory


$ ls -l /dev/ada0*
ls: /dev/ada0*: No such file or directory

I am tempted to reboot or do a camontrol scan.

-- 
Dan Langille - http://langille.org/



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4CA7AD95.9040703>