From owner-freebsd-current@FreeBSD.ORG Fri Feb 22 02:15:09 2008
From: Richard Todd <rmtodd@ichotolot.servalan.com>
To: Gavin Atkinson
Cc: freebsd-current@freebsd.org
Date: Thu, 21 Feb 2008 19:33:08 -0600
Subject: Re: ZFS: data sometimes not getting correctly flushed to disk (possible mmap/write issue)?
In-Reply-To: <1203359855.43782.60.camel@buffy.york.ac.uk> (Gavin Atkinson's message of "Mon, 18 Feb 2008 18:37:35 +0000")
References: <20080215231507.EAB7D641@mx2.synetsystems.com> <1203359855.43782.60.camel@buffy.york.ac.uk>

Gavin Atkinson writes:

> On Fri, 2008-02-15 at 16:59 -0600, Richard Todd wrote:
>> Further investigation and coming up with test cases brought me to the
>> surprising conclusion that ZFS was to blame -- under certain conditions
>> ZFS seems not to be writing all the data in the file out to disk
>> correctly.  The file appears to be good only so long as the file's data
>> is in cache, but the data actually on disk doesn't match what's in cache.
>
> I don't currently have a system that I can test this with, but is there
> any chance you could make the following change to your script?
>
> #!/bin/sh
> NAME=$1
> LEN=$2
> rm $NAME
> ./a.out $NAME $LEN
> ./gen2.py $NAME
> md5 $NAME
> + cp $NAME /some/ufs/filesystem/$NAME.1
> ls -l $NAME
> sudo umount /u3
> sudo mount -t zfs u3 /u3
> md5 $NAME
> + cp $NAME /some/ufs/filesystem/$NAME.2
> ls -l $NAME
> + md5 /some/ufs/filesystem/$NAME.?
>
> and when you find files that differ, compare the two copies stored on
> the UFS filesystem with cmp(1)?  I'd be interested to know exactly what
> the corruption looks like - is a single bit different?  Or was an entire
> page destroyed?

Neither, oddly enough.  I modified the script in much the way you requested,
copying the file to a file named "before" on a UFS filesystem before the
unmount and to one named "after" after the remount.  The cmp -l output shows
a difference of 10 bytes that are non-zero in the good "before" copy and 0 in
the bad "after" copy.
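For reference, here's roughly what the modified test.sh looks like, as
reconstructed from the sh -x trace below ("before"/"after" land on
/bootsy/tmp, which is a UFS filesystem here; a.out and gen2.py are the
test-file generators from my original message):

#!/bin/sh
# Roughly the modified test.sh, reconstructed from the sh -x trace below.
NAME=$1
LEN=$2
rm $NAME
./a.out $NAME $LEN            # generate the test file on the ZFS pool
./gen2.py $NAME
md5 $NAME                     # checksum while the file's data is still cached
cp $NAME /bootsy/tmp/before   # stash a copy on UFS before the unmount
ls -l $NAME
sudo umount /u3
sudo mount -t zfs u3 /u3      # force ZFS to reread the file from disk
md5 $NAME                     # checksum what actually made it to disk
ls -l $NAME
cp $NAME /bootsy/tmp/after    # stash a post-remount copy on UFS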
Script started on Mon Feb 18 14:09:47 2008
1 ichotolot ~/bug[ 2:09PM] Z% sh -x test.sh /u3/tmp/foo1.ogg 403
+ NAME=/u3/tmp/foo1.ogg
+ LEN=403
+ rm /u3/tmp/foo1.ogg
+ ./a.out /u3/tmp/foo1.ogg 403
+ ./gen2.py /u3/tmp/foo1.ogg
+ md5 /u3/tmp/foo1.ogg
MD5 (/u3/tmp/foo1.ogg) = 72b9b992f31332fc1682cad1e97fa7e5
+ cp /u3/tmp/foo1.ogg /bootsy/tmp/before
+ ls -l /u3/tmp/foo1.ogg
-rw-r--r--  1 rmtodd  wheel  6238218 Feb 18 14:10 /u3/tmp/foo1.ogg
+ sudo umount /u3
+ sudo mount -t zfs u3 /u3
+ md5 /u3/tmp/foo1.ogg
MD5 (/u3/tmp/foo1.ogg) = 3abd285e657684f4a2f89043f230b164
+ ls -l /u3/tmp/foo1.ogg
-rw-r--r--  1 rmtodd  wheel  6238218 Feb 18 14:10 /u3/tmp/foo1.ogg
+ cp /u3/tmp/foo1.ogg /bootsy/tmp/after
2 ichotolot ~/bug[ 2:10PM] Z% md5 /bootsy/tmp/{before,after}
MD5 (/bootsy/tmp/before) = 72b9b992f31332fc1682cad1e97fa7e5
MD5 (/bootsy/tmp/after) = 3abd285e657684f4a2f89043f230b164
3 ichotolot ~/bug[ 2:10PM] Z% cmp -l !$
cmp -l /bootsy/tmp/{before,after}
6238209 236   0
6238210  71   0
6238211 310   0
6238212 372   0
6238213  16   0
6238214  45   0
6238215 220   0
6238216 112   0
6238217 376   0
6238218   1   0
4 ichotolot ~/bug[ 2:10PM] Z% exit
Script done on Mon Feb 18 14:11:28 2008

An additional, important note: I've discovered that this program can trigger
more serious lossage in ZFS than I mentioned above: not only can you get
corrupted data on disk, you can manage to corrupt the ZFS pool space maps,
possibly to the point that ZFS panics on boot.  So if anyone out there wants
to test this, best do it on a ZFS pool you don't mind losing.

(I *did* manage to recover the data off the ZFS pool I trashed, and this info
might prove useful to someone else in a similar situation.  The procedure,
roughly described, was as follows (there's a rough command sketch for steps
5)-8) after the list):

1) Boot the system, record where the ZFS panic was.
2) Turn off the disks with the corrupted pool.  (Fortunately, mine was on an
   external FireWire disk pair.)
3) Reboot the system, study where the panic was, and hack on space_map.c to
   disable the sanity checks on the space map that triggered the panic.
   Build and install the new kernel.
4) Turn the external disks back on, and repeat steps 1)-3) until you manage
   to get the external pool mounted.
5) My external pool (u3) was a mirrored pair of disks, so what I did next was
   detach one disk from u3 and make it a ZFS pool of its own, call it u3new.
6) Rsync everything from /u3 to /u3new.
7) Destroy the old u3 pool and add its disk as a mirror to the u3new mirror.
8) Reset the mountpoint property of u3new to make it mount at /u3 as before.
9) Wonder briefly why the ZFS designers implemented "zfs rename" but not a
   "zpool rename". :-)

Fortunately, I had relatively decent backups of u3 in case the above didn't
work.)
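In terms of actual commands, steps 5)-8) come down to roughly the following
sketch (da0 and da1 are just stand-ins for the two disks that made up the u3
mirror; substitute your own device names):

# Hypothetical device names; da1 is the disk being split off the old mirror.
zpool detach u3 da1              # pull one disk out of the damaged mirror
zpool create u3new da1           # make it a single-disk pool of its own
rsync -a /u3/ /u3new/            # copy everything off the damaged pool
zpool destroy u3                 # give up on the old pool...
zpool attach u3new da1 da0       # ...and re-add its disk as a mirror of u3new
zfs set mountpoint=/u3 u3new     # make the new pool mount at /u3 as before

(zpool create/attach may want -f, since the disks will still carry labels
from the old pool.)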
Anyway, I've done further testing on a different machine, one whose only ZFS
pool isn't used for anything more irreplaceable than local cvsup storage.
The panic/filesystem corruption doesn't seem to be as reliably repeatable as
the data corruption -- the data corruption I mentioned in my original message
seems to *always* happen when the program is run with the same arguments,
while the FS corruption doesn't.  I do have some coredumps from this testing,
though.  Here's the one where it panicked while running the program; note
that the crash is apparently somewhere in one of the syncer threads.

Script started on Thu Feb 21 19:23:47 2008
blo-rakane# kgdb kernel.debug /u2/tmp/vmcore.183
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
Reading symbols from /boot/kernel/geom_mirror.ko...done.
Loaded symbols for /boot/kernel/geom_mirror.ko
Reading symbols from /boot/kernel/zfs.ko...done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/linux.ko...done.
Loaded symbols for /boot/kernel/linux.ko
Reading symbols from /usr/local/modules/fuse.ko...done.
Loaded symbols for /usr/local/modules/fuse.ko
Reading symbols from /boot/kernel/green_saver.ko...done.
Loaded symbols for /boot/kernel/green_saver.ko

Unread portion of the kernel message buffer:
ZFS(panic): freeing free segment (vdev=0 offset=3d40000 size=20000)
panic: ZFS
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper(80b3e6ee,a7880840,8079a70f,80b6c6e0,1,...) at db_trace_self_wrapper+0x26
kdb_backtrace(80b6c6e0,1,838e4615,a788084c,1,...) at kdb_backtrace+0x29
panic(838e4615,a7880970,838e4605,838e4657,2853465a,...) at panic+0x10f
zfs_panic_recover(838e4657,0,0,3d40000,0,...) at zfs_panic_recover+0x9a
metaslab_free_dva(297,0,0,83795000,297,...) at metaslab_free_dva+0x340
metaslab_free(83795000,84377374,297,0,0,...) at metaslab_free+0x8a
zio_free_blk(83795000,84377374,297,0,83b8b210,...) at zio_free_blk+0x55
zil_sync(842afc00,84453700,7,3,2,...) at zil_sync+0x160
dmu_objset_sync(842af200,83ab38a0,84453700,0,83728498,...) at dmu_objset_sync+0x1a9
dsl_pool_sync(83728400,297,0,0,0,...) at dsl_pool_sync+0x98
spa_sync(83795000,297,0,13a,83728540,...) at spa_sync+0x3f8
txg_sync_thread(83728400,a7880d38,80b37e3c,30c,83beeab0,...) at txg_sync_thread+0x1ea
fork_exit(838ab370,83728400,a7880d38) at fork_exit+0xb8
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xa7880d70, ebp = 0 ---
KDB: enter: panic
Physical memory: 625 MB
Dumping 146 MB: 131 115 99 83 67 51 35 19 3

#0  doadump () at pcpu.h:195
195             __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:195
#1  0x80492f49 in db_fncall (dummy1=-2141834399, dummy2=0, dummy3=-1484257740,
    dummy4=0xa7880624 "\200\233D\203@şB\203\001") at /usr/src/sys/ddb/db_command.c:514
#2  0x804934cc in db_command (last_cmdp=0x80c361d4, cmd_table=0x0, dopager=1)
    at /usr/src/sys/ddb/db_command.c:411
#3  0x804935da in db_command_loop () at /usr/src/sys/ddb/db_command.c:464
#4  0x80494d7d in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228
#5  0x807c3826 in kdb_trap (type=3, code=0, tf=0xa78807cc) at /usr/src/sys/kern/subr_kdb.c:510
#6  0x80aa2a5b in trap (frame=0xa78807cc) at /usr/src/sys/i386/i386/trap.c:647
#7  0x80a8872b in calltrap () at /usr/src/sys/i386/i386/exception.s:146
#8  0x807c39aa in kdb_enter (why=0x80b3b898 "panic", msg=0x80b3b898 "panic") at cpufunc.h:60
#9  0x8079a72c in panic (fmt=0x838e4615 "ZFS") at /usr/src/sys/kern/kern_shutdown.c:556
#10 0x838a75ea in zfs_panic_recover (fmt=Variable "fmt" is not available.
) at cmn_err.h:73
#11 0x8389b0c0 in metaslab_free_dva (spa=0x83795000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
#12 0x8389b17a in metaslab_free (spa=0x83795000, bp=0x84377374, txg=663, now=0)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:1005
#13 0x838c51e5 in zio_free_blk (spa=0x83795000, bp=0x84377374, txg=Unhandled dwarf expression opcode 0x93
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zio.c:1858
#14 0x838c2600 in zil_sync (zilog=0x842afc00, tx=0x84453700)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:1156
#15 0x838859f9 in dmu_objset_sync (os=0x842af200, pio=0x83ab38a0, tx=0x84453700)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:821
#16 0x838964f8 in dsl_pool_sync (dp=0x83728400, txg=Unhandled dwarf expression opcode 0x93
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:188
#17 0x838a3cb8 in spa_sync (spa=0x83795000, txg=663)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/spa.c:2989
#18 0x838ab55a in txg_sync_thread (arg=0x83728400)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/txg.c:331
#19 0x80779628 in fork_exit (callout=0x838ab370, arg=0x83728400, frame=0xa7880d38)
    at /usr/src/sys/kern/kern_fork.c:788
#20 0x80a887a0 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:212
(kgdb) fr 11
#11 0x8389b0c0 in metaslab_free_dva (spa=0x83795000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
906                     zfs_panic_recover("freeing free segment "
(kgdb) l
901                             }
902                     }
903             }
904
905             if (!allocd) {
906                     zfs_panic_recover("freeing free segment "
907                         "(vdev=%llu offset=%llx size=%llx)",
908                         (longlong_t)vdev, (longlong_t)offset,
909                         (longlong_t)size);
910             }
(kgdb) q
blo-rakane#

And here's the later crash on reboot as /etc/rc.d/zfs tries to mount the ZFS
pool:

Script started on Thu Feb 21 19:25:16 2008
blo-rakane# kgdb kernel.debug /u2/tmp/vmcore.184
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
Reading symbols from /boot/kernel/geom_mirror.ko...done.
Loaded symbols for /boot/kernel/geom_mirror.ko
Reading symbols from /boot/kernel/zfs.ko...done.
Loaded symbols for /boot/kernel/zfs.ko

Unread portion of the kernel message buffer:
[I deleted all but the last bits--rmtodd]
<118>Setting hostuuid: 84d7a50d-e797-11db-98e4-02004c01aaf5.
<118>Setting hostid: 0x5241ef96.
<118>Mounting local file systems:
<118>.
WARNING: ZFS is considered to be an experimental feature in FreeBSD.
ZFS WARNING: Recommended minimum kmem_size is 512MB; expect unstable behavior.
             Consider tuning vm.kmem_size and vm.kmem_size_max in /boot/loader.conf.
ZFS filesystem version 6
ZFS storage pool version 6
ZFS(panic): freeing free segment (vdev=0 offset=3d40000 size=20000)
panic: ZFS
cpuid = 0
Uptime: 15s
Got here!
#1
Physical memory: 625 MB
Dumping 46 MB: 31 15

#0  doadump () at pcpu.h:195
195             __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:195
#1  0x8079a46a in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:417
#2  0x8079a743 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:572
#3  0x838a65ea in zfs_panic_recover (fmt=Variable "fmt" is not available.
) at cmn_err.h:73
#4  0x8389a0c0 in metaslab_free_dva (spa=0x8360d000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
#5  0x8389a17a in metaslab_free (spa=0x8360d000, bp=0xa78688d0, txg=665, now=0)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:1005
#6  0x838c41e5 in zio_free_blk (spa=0x8360d000, bp=0xa78688d0, txg=Unhandled dwarf expression opcode 0x93
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zio.c:1858
#7  0x838c3165 in zil_parse (zilog=0x83735c00, parse_blk_func=0x838c13f0, parse_lr_func=0x838c2f60,
    arg=0x83bbeb80, txg=663)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:254
#8  0x838c34c4 in zil_destroy (zilog=0x83735c00, keep_first=0)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:449
#9  0x838c3771 in zil_replay (os=0x83737be0, arg=0x83613000, txgp=0x83613024, replay_func=0x838e9660)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:1562
#10 0x838d4338 in zfs_mount (vfsp=0x8379c538, td=0x83a86440)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:507
#11 0x80813829 in vfs_donmount (td=0x83a86440, fsflags=0, fsoptions=0x837c2e00)
    at /usr/src/sys/kern/vfs_mount.c:1014
#12 0x80814aa2 in nmount (td=0x83a86440, uap=0xa7868cfc) at /usr/src/sys/kern/vfs_mount.c:423
#13 0x80aa21a3 in syscall (frame=0xa7868d38) at /usr/src/sys/i386/i386/trap.c:1034
#14 0x80a88790 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:203
#15 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) fr 4
#4  0x8389a0c0 in metaslab_free_dva (spa=0x8360d000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
906                     zfs_panic_recover("freeing free segment "
(kgdb) l
901                             }
902                     }
903             }
904
905             if (!allocd) {
906                     zfs_panic_recover("freeing free segment "
907                         "(vdev=%llu offset=%llx size=%llx)",
908                         (longlong_t)vdev, (longlong_t)offset,
909                         (longlong_t)size);
910             }
(kgdb) qiut
Undefined command: "qiut".  Try "help".
(kgdb) quit
blo-rakane# exit
blo-rakane# exit

Script done on Thu Feb 21 19:26:13 2008