Date: Wed, 16 Dec 2009 16:21:10 GMT
From: Tom Payne <Tom.Payne@unige.ch>
To: freebsd-gnats-submit@FreeBSD.org
Subject: misc/141685: zfs corruption on adaptec 5805 raid controller
Message-ID: <200912161621.nBGGLAF8035555@www.freebsd.org>
Resent-Message-ID: <200912161630.nBGGU1tN084593@freefall.freebsd.org>
>Number:         141685
>Category:       misc
>Synopsis:       zfs corruption on adaptec 5805 raid controller
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Dec 16 16:30:01 UTC 2009
>Closed-Date:
>Last-Modified:
>Originator:     Tom Payne
>Release:        8.0-RELEASE
>Organization:
ISDC
>Environment:
FreeBSD isdc3202.isdc.unige.ch 8.0-RELEASE FreeBSD 8.0-RELEASE #0: Sat Nov 21 15:02:08 UTC 2009     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>Description:
Short version: zfs on a new 5.44T Adaptec 5805 hardware RAID5 partition reports lots of zfs checksum errors. Tests claim that the hardware is working correctly.

Long version:

I have an Adaptec RAID 5805 controller with eight 1TB SAS disks:

# dmesg | grep aac
aac0: <Adaptec RAID 5805> mem 0xfbc00000-0xfbdfffff irq 16 at device 0.0 on pci9
aac0: Enabling 64-bit address support
aac0: Enable Raw I/O
aac0: Enable 64-bit array
aac0: New comm. interface enabled
aac0: [ITHREAD]
aac0: Adaptec 5805, aac driver 2.0.0-1
aacp0: <SCSI Passthrough Bus> on aac0
aacp1: <SCSI Passthrough Bus> on aac0
aacp2: <SCSI Passthrough Bus> on aac0
aacd0: <RAID 5> on aac0
aacd0: 16370MB (33525760 sectors)
aacd1: <RAID 5> on aac0
aacd1: 6657011MB (13633558528 sectors)

It's configured with a small partition (aacd0) for the root filesystem; the rest (aacd1) is a single large zpool:

# zpool create tank aacd1
# zfs list | head -n 2
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank   792G  5.44T    18K  none

After a few days of light use (rsync'ing data from older disk servers) zfs reports lots of checksum errors:

# zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption.
        Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 1h17m with 49 errors on Mon Dec 14 13:35:50 2009
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0    98
          aacd1     ONLINE       0     0   196

These 49 errors are in various files scattered across the 200+ zfs filesystems on the disk. /var/log/messages contains, for example:

# grep ZFS /var/log/messages
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=79622307840 size=131072
Dec 14 13:23:50 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=77752696832 size=131072
Dec 14 13:27:47 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: checksum mismatch, zpool=tank path=/dev/aacd1 offset=1409111293952 size=131072
Dec 14 13:28:07 isdc3202 root: ZFS: zpool I/O failure, zpool=tank error=86

The 49 checksum errors occur at 49 different offsets in three distinct ranges:

    70743228416..  84649705472  ( 6)
  1406828281856..1441780858880  (14)
  2749871030272..2817199702016  (29)
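(For reference, the distinct offsets above can be pulled straight out of /var/log/messages with a pipeline along these lines; the sed expression is only illustrative and simply matches the message format shown above.)

# grep 'ZFS: checksum mismatch' /var/log/messages \
    | sed -n 's/.*offset=\([0-9]*\).*/\1/p' \
    | sort -n | uniq                          # one line per distinct offset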
The Adaptec controller firmware was updated to the latest version (at the time of writing) after the first errors were observed. More errors have been observed since the firmware update.

# arcconf getversion
Controllers found: 1
Controller #1
==============
Firmware                 : 5.2-0 (17544)
Staged Firmware          : 5.2-0 (17544)
BIOS                     : 5.2-0 (17544)
Driver                   : 5.2-0 (17544)
Boot Flash               : 5.2-0 (17544)

I ran a verify task on the RAID controller with

# arcconf task start 1 logicaldrive 1 verify noprompt

As far as I can tell, this verify task did not find any errors. The array status is still reported as "optimal" and there appears to be nothing in the logs.

A 24-hour memory test with memtest86+ version 4.00 did not detect any memory errors.

Similar problems have previously been reported with zfs on USB drives:

http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005510.html

As I understand it, the situation is:

- zfs has checksum errors
- the hardware RAID believes that the data on disk is consistent
- there are no obvious memory problems

Could this be a FreeBSD bug?
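If it would help narrow this down, a further check I can run is to write a file of known content and read it back after an export/import, to see whether corruption also appears on freshly written data. A rough sketch (untested; /tank/test stands for any dataset in the pool with a real mountpoint, since tank itself has MOUNTPOINT none):

# dd if=/dev/random of=/tank/test/blob bs=1m count=4096   # write ~4GB of random data
# sha256 /tank/test/blob                                  # record the digest
# zpool export tank && zpool import tank                  # force the next read to come from disk
# sha256 /tank/test/blob                                  # compare with the first digest

If the two digests differ, the corruption is being introduced somewhere in the current write/read path rather than only affecting older data.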
>How-To-Repeat:
Unknown
>Fix:
Unknown
>Release-Note:
>Audit-Trail:
>Unformatted: