From owner-freebsd-current@FreeBSD.ORG Fri Feb 22 02:15:09 2008
From: Richard Todd <rmtodd@ichotolot.servalan.com>
To: Gavin Atkinson
Cc: freebsd-current@freebsd.org
Date: Thu, 21 Feb 2008 19:33:08 -0600
Subject: Re: ZFS: data sometimes not getting correctly flushed to disk (possible mmap/write issue)?
In-Reply-To: <1203359855.43782.60.camel@buffy.york.ac.uk> (Gavin Atkinson's message of "Mon, 18 Feb 2008 18:37:35 +0000")
References: <20080215231507.EAB7D641@mx2.synetsystems.com> <1203359855.43782.60.camel@buffy.york.ac.uk>

Gavin Atkinson writes:

> On Fri, 2008-02-15 at 16:59 -0600, Richard Todd wrote:
>> Further investigation and coming up with test cases brought me to the
>> surprising conclusion that ZFS was to blame -- under certain conditions
>> ZFS seems not to be writing all the data in the file out to disk
>> correctly.  The file appears to be good only so long as the file's data
>> is in cache, but the data actually on disk doesn't match what's in cache.
>
> I don't currently have a system that I can test this with, but is there
> any chance you could make the following change to your script?
>
> #!/bin/sh
> NAME=$1
> LEN=$2
> rm $NAME
> ./a.out $NAME $LEN
> ./gen2.py $NAME
> md5 $NAME
> + cp $NAME /some/ufs/filesystem/$NAME.1
> ls -l $NAME
> sudo umount /u3
> sudo mount -t zfs u3 /u3
> md5 $NAME
> + cp $NAME /some/ufs/filesystem/$NAME.2
> ls -l $NAME
> + md5 /some/ufs/filesystem/$NAME.?
>
> and when you find files that differ, compare the two copies stored on
> the UFS filesystem with cmp(1)?  I'd be interested to know exactly what
> the corruption looks like - is a single bit different?  Or was an entire
> page destroyed?

Neither, oddly enough.  I modified the script in much the way you requested,
copying the file to a file named "before" on a UFS filesystem before the
unmount and to one named "after" after the remount.  The cmp -l output shows
a difference of 10 bytes that are non-zero in the good "before" copy and 0 in
the bad "after" copy.
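For reference, here's roughly what the modified test.sh looks like, as
reconstructed from the sh -x trace below ("before"/"after" land on
/bootsy/tmp, which is a UFS filesystem here; a.out and gen2.py are the
test-file generators from my original message):

#!/bin/sh
# Roughly the modified test.sh, reconstructed from the sh -x trace below.
NAME=$1
LEN=$2
rm $NAME
./a.out $NAME $LEN            # generate the test file on the ZFS pool
./gen2.py $NAME
md5 $NAME                     # checksum while the file's data is still cached
cp $NAME /bootsy/tmp/before   # stash a copy on UFS before the unmount
ls -l $NAME
sudo umount /u3
sudo mount -t zfs u3 /u3      # force ZFS to reread the file from disk
md5 $NAME                     # checksum what actually made it to disk
ls -l $NAME
cp $NAME /bootsy/tmp/after    # stash a post-remount copy on UFS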
Script started on Mon Feb 18 14:09:47 2008
1 ichotolot ~/bug[ 2:09PM] Z% sh -x test.sh /u3/tmp/foo1.ogg 403
+ NAME=/u3/tmp/foo1.ogg
+ LEN=403
+ rm /u3/tmp/foo1.ogg
+ ./a.out /u3/tmp/foo1.ogg 403
+ ./gen2.py /u3/tmp/foo1.ogg
+ md5 /u3/tmp/foo1.ogg
MD5 (/u3/tmp/foo1.ogg) = 72b9b992f31332fc1682cad1e97fa7e5
+ cp /u3/tmp/foo1.ogg /bootsy/tmp/before
+ ls -l /u3/tmp/foo1.ogg
-rw-r--r--  1 rmtodd  wheel  6238218 Feb 18 14:10 /u3/tmp/foo1.ogg
+ sudo umount /u3
+ sudo mount -t zfs u3 /u3
+ md5 /u3/tmp/foo1.ogg
MD5 (/u3/tmp/foo1.ogg) = 3abd285e657684f4a2f89043f230b164
+ ls -l /u3/tmp/foo1.ogg
-rw-r--r--  1 rmtodd  wheel  6238218 Feb 18 14:10 /u3/tmp/foo1.ogg
+ cp /u3/tmp/foo1.ogg /bootsy/tmp/after
2 ichotolot ~/bug[ 2:10PM] Z% md5 /bootsy/tmp/{before,after}
MD5 (/bootsy/tmp/before) = 72b9b992f31332fc1682cad1e97fa7e5
MD5 (/bootsy/tmp/after) = 3abd285e657684f4a2f89043f230b164
3 ichotolot ~/bug[ 2:10PM] Z% cmp -l !$
cmp -l /bootsy/tmp/{before,after}
6238209 236   0
6238210  71   0
6238211 310   0
6238212 372   0
6238213  16   0
6238214  45   0
6238215 220   0
6238216 112   0
6238217 376   0
6238218   1   0
4 ichotolot ~/bug[ 2:10PM] Z% exit
Script done on Mon Feb 18 14:11:28 2008

An additional, important note: I've discovered that this program can trigger
more serious lossage in ZFS than I mentioned above: not only can you get
corrupted data on disk, you can manage to corrupt the ZFS pool space maps,
possibly to the point that ZFS panics on boot.  So if anyone out there wants
to test this, best do it on a ZFS pool you don't mind losing.

(I *did* manage to recover the data off the ZFS pool I trashed, and this info
might prove useful to someone else in a similar situation.  The procedure,
roughly described, was as follows (there's a rough command sketch for steps
5)-8) after the list):

1) Boot the system, record where the ZFS panic was.
2) Turn off the disks with the corrupted pool.  (Fortunately, mine was on an
   external FireWire disk pair.)
3) Reboot the system, study where the panic was, and hack on space_map.c to
   disable the sanity checks on the space map that triggered the panic.
   Build and install the new kernel.
4) Turn the external disks back on, and repeat steps 1)-3) until you manage
   to get the external pool mounted.
5) My external pool (u3) was a mirrored pair of disks, so what I did next was
   detach one disk from u3 and make it a ZFS pool of its own, call it u3new.
6) Rsync everything from /u3 to /u3new.
7) Destroy the old u3 pool and add its disk as a mirror to the u3new mirror.
8) Reset the mountpoint property of u3new to make it mount at /u3 as before.
9) Wonder briefly why the ZFS designers implemented "zfs rename" but not a
   "zpool rename". :-)

Fortunately, I had relatively decent backups of u3 in case the above didn't
work.)
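In terms of actual commands, steps 5)-8) come down to roughly the following
sketch (da0 and da1 are just stand-ins for the two disks that made up the u3
mirror; substitute your own device names):

# Hypothetical device names; da1 is the disk being split off the old mirror.
zpool detach u3 da1              # pull one disk out of the damaged mirror
zpool create u3new da1           # make it a single-disk pool of its own
rsync -a /u3/ /u3new/            # copy everything off the damaged pool
zpool destroy u3                 # give up on the old pool...
zpool attach u3new da1 da0       # ...and re-add its disk as a mirror of u3new
zfs set mountpoint=/u3 u3new     # make the new pool mount at /u3 as before

(zpool create/attach may want -f, since the disks will still carry labels
from the old pool.)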
Anyway, I've done further testing on a different machine, one whose only ZFS
pool isn't used for anything more irreplaceable than local cvsup storage.
The panic/filesystem corruption doesn't seem to be as reliably repeatable as
the data corruption -- the data corruption I mentioned in my original message
seems to *always* happen when the program is run with the same arguments,
while the FS corruption doesn't.  I do have some coredumps from this testing,
though.  Here's the one where it panicked while running the program; note
that the crash is apparently somewhere in one of the syncer threads.

Script started on Thu Feb 21 19:23:47 2008
blo-rakane# kgdb kernel.debug /u2/tmp/vmcore.183
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
Reading symbols from /boot/kernel/geom_mirror.ko...done.
Loaded symbols for /boot/kernel/geom_mirror.ko
Reading symbols from /boot/kernel/zfs.ko...done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/linux.ko...done.
Loaded symbols for /boot/kernel/linux.ko
Reading symbols from /usr/local/modules/fuse.ko...done.
Loaded symbols for /usr/local/modules/fuse.ko
Reading symbols from /boot/kernel/green_saver.ko...done.
Loaded symbols for /boot/kernel/green_saver.ko

Unread portion of the kernel message buffer:
ZFS(panic): freeing free segment (vdev=0 offset=3d40000 size=20000)
panic: ZFS
cpuid = 1
KDB: stack backtrace:
db_trace_self_wrapper(80b3e6ee,a7880840,8079a70f,80b6c6e0,1,...) at db_trace_self_wrapper+0x26
kdb_backtrace(80b6c6e0,1,838e4615,a788084c,1,...) at kdb_backtrace+0x29
panic(838e4615,a7880970,838e4605,838e4657,2853465a,...) at panic+0x10f
zfs_panic_recover(838e4657,0,0,3d40000,0,...) at zfs_panic_recover+0x9a
metaslab_free_dva(297,0,0,83795000,297,...) at metaslab_free_dva+0x340
metaslab_free(83795000,84377374,297,0,0,...) at metaslab_free+0x8a
zio_free_blk(83795000,84377374,297,0,83b8b210,...) at zio_free_blk+0x55
zil_sync(842afc00,84453700,7,3,2,...) at zil_sync+0x160
dmu_objset_sync(842af200,83ab38a0,84453700,0,83728498,...) at dmu_objset_sync+0x1a9
dsl_pool_sync(83728400,297,0,0,0,...) at dsl_pool_sync+0x98
spa_sync(83795000,297,0,13a,83728540,...) at spa_sync+0x3f8
txg_sync_thread(83728400,a7880d38,80b37e3c,30c,83beeab0,...) at txg_sync_thread+0x1ea
fork_exit(838ab370,83728400,a7880d38) at fork_exit+0xb8
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xa7880d70, ebp = 0 ---
KDB: enter: panic
Physical memory: 625 MB
Dumping 146 MB: 131 115 99 83 67 51 35 19 3

#0  doadump () at pcpu.h:195
195             __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:195
#1  0x80492f49 in db_fncall (dummy1=-2141834399, dummy2=0, dummy3=-1484257740,
    dummy4=0xa7880624 "\200\233D\203@şB\203\001") at /usr/src/sys/ddb/db_command.c:514
#2  0x804934cc in db_command (last_cmdp=0x80c361d4, cmd_table=0x0, dopager=1)
    at /usr/src/sys/ddb/db_command.c:411
#3  0x804935da in db_command_loop () at /usr/src/sys/ddb/db_command.c:464
#4  0x80494d7d in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228
#5  0x807c3826 in kdb_trap (type=3, code=0, tf=0xa78807cc) at /usr/src/sys/kern/subr_kdb.c:510
#6  0x80aa2a5b in trap (frame=0xa78807cc) at /usr/src/sys/i386/i386/trap.c:647
#7  0x80a8872b in calltrap () at /usr/src/sys/i386/i386/exception.s:146
#8  0x807c39aa in kdb_enter (why=0x80b3b898 "panic", msg=0x80b3b898 "panic") at cpufunc.h:60
#9  0x8079a72c in panic (fmt=0x838e4615 "ZFS") at /usr/src/sys/kern/kern_shutdown.c:556
#10 0x838a75ea in zfs_panic_recover (fmt=Variable "fmt" is not available.
) at cmn_err.h:73
#11 0x8389b0c0 in metaslab_free_dva (spa=0x83795000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
#12 0x8389b17a in metaslab_free (spa=0x83795000, bp=0x84377374, txg=663, now=0)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:1005
#13 0x838c51e5 in zio_free_blk (spa=0x83795000, bp=0x84377374, txg=Unhandled dwarf expression opcode 0x93
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zio.c:1858
#14 0x838c2600 in zil_sync (zilog=0x842afc00, tx=0x84453700)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:1156
#15 0x838859f9 in dmu_objset_sync (os=0x842af200, pio=0x83ab38a0, tx=0x84453700)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:821
#16 0x838964f8 in dsl_pool_sync (dp=0x83728400, txg=Unhandled dwarf expression opcode 0x93
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:188
#17 0x838a3cb8 in spa_sync (spa=0x83795000, txg=663)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/spa.c:2989
#18 0x838ab55a in txg_sync_thread (arg=0x83728400)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/txg.c:331
#19 0x80779628 in fork_exit (callout=0x838ab370, arg=0x83728400, frame=0xa7880d38)
    at /usr/src/sys/kern/kern_fork.c:788
#20 0x80a887a0 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:212
(kgdb) fr 11
#11 0x8389b0c0 in metaslab_free_dva (spa=0x83795000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
906                     zfs_panic_recover("freeing free segment "
(kgdb) l
901                             }
902                     }
903             }
904
905             if (!allocd) {
906                     zfs_panic_recover("freeing free segment "
907                         "(vdev=%llu offset=%llx size=%llx)",
908                         (longlong_t)vdev, (longlong_t)offset,
909                         (longlong_t)size);
910             }
(kgdb) q
blo-rakane#

And here's the later crash on reboot as /etc/rc.d/zfs tries to mount the ZFS
pool:

Script started on Thu Feb 21 19:25:16 2008
blo-rakane# kgdb kernel.debug /u2/tmp/vmcore.184
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".
Reading symbols from /boot/kernel/geom_mirror.ko...done.
Loaded symbols for /boot/kernel/geom_mirror.ko
Reading symbols from /boot/kernel/zfs.ko...done.
Loaded symbols for /boot/kernel/zfs.ko

Unread portion of the kernel message buffer:
[I deleted all but the last bits--rmtodd]
<118>Setting hostuuid: 84d7a50d-e797-11db-98e4-02004c01aaf5.
<118>Setting hostid: 0x5241ef96.
<118>Mounting local file systems:
<118>.
WARNING: ZFS is considered to be an experimental feature in FreeBSD.
ZFS WARNING: Recommended minimum kmem_size is 512MB; expect unstable behavior.
             Consider tuning vm.kmem_size and vm.kmem_size_max in /boot/loader.conf.
ZFS filesystem version 6
ZFS storage pool version 6
ZFS(panic): freeing free segment (vdev=0 offset=3d40000 size=20000)
panic: ZFS
cpuid = 0
Uptime: 15s
Got here!
#1
Physical memory: 625 MB
Dumping 46 MB: 31 15

#0  doadump () at pcpu.h:195
195             __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:195
#1  0x8079a46a in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:417
#2  0x8079a743 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:572
#3  0x838a65ea in zfs_panic_recover (fmt=Variable "fmt" is not available.
) at cmn_err.h:73
#4  0x8389a0c0 in metaslab_free_dva (spa=0x8360d000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
#5  0x8389a17a in metaslab_free (spa=0x8360d000, bp=0xa78688d0, txg=665, now=0)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:1005
#6  0x838c41e5 in zio_free_blk (spa=0x8360d000, bp=0xa78688d0, txg=Unhandled dwarf expression opcode 0x93
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zio.c:1858
#7  0x838c3165 in zil_parse (zilog=0x83735c00, parse_blk_func=0x838c13f0, parse_lr_func=0x838c2f60,
    arg=0x83bbeb80, txg=663)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:254
#8  0x838c34c4 in zil_destroy (zilog=0x83735c00, keep_first=0)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:449
#9  0x838c3771 in zil_replay (os=0x83737be0, arg=0x83613000, txgp=0x83613024, replay_func=0x838e9660)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zil.c:1562
#10 0x838d4338 in zfs_mount (vfsp=0x8379c538, td=0x83a86440)
    at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:507
#11 0x80813829 in vfs_donmount (td=0x83a86440, fsflags=0, fsoptions=0x837c2e00)
    at /usr/src/sys/kern/vfs_mount.c:1014
#12 0x80814aa2 in nmount (td=0x83a86440, uap=0xa7868cfc) at /usr/src/sys/kern/vfs_mount.c:423
#13 0x80aa21a3 in syscall (frame=0xa7868d38) at /usr/src/sys/i386/i386/trap.c:1034
#14 0x80a88790 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:203
#15 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) fr 4
#4  0x8389a0c0 in metaslab_free_dva (spa=0x8360d000, dva=Variable "dva" is not available.
) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/metaslab.c:906
906                     zfs_panic_recover("freeing free segment "
(kgdb) l
901                             }
902                     }
903             }
904
905             if (!allocd) {
906                     zfs_panic_recover("freeing free segment "
907                         "(vdev=%llu offset=%llx size=%llx)",
908                         (longlong_t)vdev, (longlong_t)offset,
909                         (longlong_t)size);
910             }
(kgdb) qiut
Undefined command: "qiut".  Try "help".
(kgdb) quit
blo-rakane# exit
blo-rakane# exit

Script done on Thu Feb 21 19:26:13 2008