Date:      Tue, 31 May 2011 02:47:22 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Olaf Seibert <O.Seibert@cs.ru.nl>
Cc:        freebsd-stable@freebsd.org, Dan Nelson <dnelson@allantgroup.com>
Subject:   Re: ZFS I/O errors
Message-ID:  <20110531094722.GA96712@icarus.home.lan>
In-Reply-To: <20110531092556.GD6733@twoquid.cs.ru.nl>
References:  <20110530093546.GX6733@twoquid.cs.ru.nl> <20110530101051.GA49825@twoquid.cs.ru.nl> <20110530103349.GA73825@icarus.home.lan> <20110530110946.GC6733@twoquid.cs.ru.nl> <20110530171909.GE6688@dan.emsphone.com> <20110531092556.GD6733@twoquid.cs.ru.nl>

On Tue, May 31, 2011 at 11:25:56AM +0200, Olaf Seibert wrote:
> On Mon 30 May 2011 at 12:19:10 -0500, Dan Nelson wrote:
> > The ZFS compression code will panic if it can't allocate the buffer needed
> > to store the compressed data, so that's unlikely to be your problem.  The
> > only time I have seen an "illegal byte sequence" error was when trying to
> > copy raw disk images containing ZFS pools to different disks, and the
> > destination disk was a different size than the original.  I wasn't even able
> > to import the pool in that case, though.  
> 
> Yet somehow some incorrect data got written, it seems. That never
> happened before, fortunately, even though we had crashes before that
> seemed to be related to ZFS running out of memory.
> 
> > The zfs IO code overloads the EILSEQ error code and uses it as a "checksum
> > error" code.  Returning that error for the same block on all disks is
> > definitely weird.  Could you have run a partitioning tool, or some other
> > program that would have done direct writes to all of your component disks?
> 
> I hope I would remember doing that if I did!
> 
> > Your scrub is also a bit worrying - 24k checksum errors definitely shouldn't
> > occur during normal usage.
> 
> It turns out that the errors are easy to provoke: they happen every time
> I do an ls of the affected directories. There were processes running
> that were likely to be trying to write to the same directories (the file
> system is exported over NFS), so in that case it is easy to imagine that
> the numbers rack up quickly.
> 
> I moved those directories to the side, for the moment, but I haven't
> been able to delete them yet. The data is a bit bigger than we're able
> to back up, so "just restoring a backup" isn't an easy thing to do.
> Possibly I could make a new filesystem in the same pool, if that would
> do the trick; the pool isn't more than 50% full, but the affected
> filesystem is the biggest one in it.
> 
> The end result of the scrub is as follows:
> 
>   pool: tank
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub completed after 12h56m with 3 errors on Mon May 30 23:56:47 2011
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0 6.38K
>           raidz2    ONLINE       0     0 25.4K
>             da0     ONLINE       0     0     0
>             da1     ONLINE       0     0     0
>             da2     ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
> 
> errors: Permanent errors have been detected in the following files:
> 
>         tank/vol-fourquid-1:<0x0>
>         tank/vol-fourquid-1@saturday:<0x0>
>         /tank/vol-fourquid-1/.zfs/snapshot/saturday/backups/dumps/dump_usr_friday.dump
>         /tank/vol-fourquid-1/.zfs/snapshot/saturday/sverberne/CLEF-IP11/parts_abs+desc
>         /tank/vol-fourquid-1/.zfs/snapshot/sunday/sverberne/CLEF-IP11/parts_abs+desc
>         /tank/vol-fourquid-1/.zfs/snapshot/monday/sverberne/CLEF-IP11/parts_abs+desc
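
To put a concrete face on Dan's point above about ZFS overloading EILSEQ
as its checksum-error code: on FreeBSD, EILSEQ is errno 86 ("Illegal byte
sequence"), and that's what I'd expect to see when touching any of the
objects listed above.  A rough sketch only; the path is just one of the
flagged files from that list, and the exact failure mode will vary:

    # List the objects ZFS has flagged with permanent (checksum) errors.
    zpool status -v tank

    # Reading one of the flagged files should fail with EILSEQ ("Illegal
    # byte sequence"); listing a flagged directory fails the same way.
    dd if=/tank/vol-fourquid-1/.zfs/snapshot/saturday/backups/dumps/dump_usr_friday.dump \
        of=/dev/null bs=1m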
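
On the idea of making a new filesystem in the same pool: a minimal
sketch, with a hypothetical dataset name, assuming a plain file-level
copy is acceptable (a zfs send of the affected snapshots would most
likely trip over the same checksum errors):

    # Hypothetical target dataset; set mountpoint/properties as appropriate.
    zfs create tank/vol-fourquid-2

    # File-level copy; rsync (net/rsync in ports) reports read errors on the
    # damaged paths and carries on, so keep a note of anything it skips.
    rsync -a /tank/vol-fourquid-1/ /tank/vol-fourquid-2/

    # Only once the copy has been verified:
    # zfs destroy -r tank/vol-fourquid-1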

Mickael Maillot responded to this thread, pointing out that situations
like this could be caused by bad RAM.  I admit that's a possibility;
with ZFS in use, the largest consumer of memory (volume-wise) on the
system would be the ZFS ARC.  I don't know if you'd necessarily see
things like sig11's on random daemons, etc. (it often depends on where
in the address range the bad DRAM chip sits).

Can you rule out bad RAM by letting something like memtest86+ run for
12-24 hours?  It's not an infallible utility, but for simple faults it
will usually detect/report errors within the first 15-30 minutes.
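
If taking the box down for memtest86+ isn't practical straight away, a
rough in-OS alternative is sysutils/memtester from ports; it can only
exercise memory the kernel will hand it, so it's no substitute for the
boot-time test, and the size/iteration numbers below are just examples:

    # Lock and test roughly 4 GB of RAM, 3 passes (the size argument is in
    # megabytes by default); run as root so it can mlock() the region.
    memtester 4096 3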

Please keep in mind that even if you have ECC RAM, testing with
memtest86+ would be worthwhile.  Single-bit errors are correctable by
ECC, while multi-bit errors aren't (though they are detectable).  "ChipKill" (see
Wikipedia please) might work around this problem, but I've never
personally used it (never seen it on any Intel systems I've used, only
AMD systems).
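
If you're not certain the box is actually running with ECC enabled, one
quick sanity check (assuming the BIOS populates the SMBIOS tables
honestly) is sysutils/dmidecode from ports:

    # "Error Correction Type" under the Physical Memory Array entry should
    # read something like "Single-bit ECC" or "Multi-bit ECC", not "None".
    dmidecode -t memory | grep -i 'error correction'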

Finally, depending on what CPU model you have, northbridge problems
(older systems) or on-die MCH (newer CPUs, e.g. Core iX and recent Xeon)
problems could manifest themselves like this.  However, in those
situations I'd imagine you'd be seeing a lot of other oddities on the
system, not just with ZFS.

Newer systems which support MCA (again see Wikipedia; Machine Check
Architecture) would/should throw MCEs, which FreeBSD 8.x should
absolutely notice/report (you'd see a lot of nastigrams on the console).
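
A quick way to check for that on 8.x (sysctl names as I remember them;
they may differ slightly between releases):

    # Whether the kernel's MCA support is enabled, and how many machine-check
    # records it has collected so far.
    sysctl hw.mca.enabled hw.mca.count

    # Decoded machine checks are logged with an "MCA:" prefix.
    dmesg | grep '^MCA:'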

I think that about does it for my ideas/blabbing on that topic.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



