Date: Thu, 27 May 2010 11:35:34 +0300
From: Andriy Gapon <avg@freebsd.org>
To: Robert Noland <rnoland@freebsd.org>, freebsd-fs@freebsd.org
Cc: freebsd-hackers@freebsd.org, Roman Divacky <rdivacky@freebsd.org>
Subject: Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14
Message-ID: <4BFE2ED6.1070402@freebsd.org>
In-Reply-To: <4BEC040E.9080303@FreeBSD.org>
References: <4BEBA334.6080101@icyb.net.ua> <4BEC040E.9080303@FreeBSD.org>
I think I nailed this problem now. What was additionally needed was the following change:

 	if (!vdev || !vdev->v_read)
 		return (EIO);
-	if (vdev->v_read(vdev, bp, &zio_gb, offset, SPA_GANGBLOCKSIZE))
+	if (vdev->v_read(vdev, NULL, &zio_gb, offset, SPA_GANGBLOCKSIZE))
 		return (EIO);

Full patch is here:
http://people.freebsd.org/~avg/boot-zfs-gang.diff

Apparently I am not as smart as Roman :) because I couldn't find the bug by just staring at this rather small function (for a couple of hours), so I had to reproduce the problem to catch it. Hence I am copying hackers@ to share a couple of tricks that were new to me. Perhaps they could help someone else some other day.

First, after very helpful hints that I received in parallel from pjd and two Oracle/Sun developers, it became very easy to create a pool with files that contain gang blocks. One can set the metaslab_gang_bang variable in metaslab.c to some value < 128K; blocks larger than metaslab_gang_bang are then allocated as gang blocks with a 25% chance. I personally did something similar but slightly more deterministic:

--- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
@@ -1572,6 +1572,12 @@ zio_dva_allocate(zio_t *zio)
 	ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
 	ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
 
+	/*XXX XXX XXX XXX*/
+	if (zio->io_size > 8 * 1024) {
+		return (zio_write_gang_block(zio));
+	}
+	/*XXX XXX XXX XXX*/
+
 	error = metaslab_alloc(spa, mc, zio->io_size, bp,
 	    zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);

This ensured that any block larger than 8K would be a gang block. Then I compiled zfs.ko with this change and put it into a virtual machine, where I created a pool and populated its root/boot filesystem with the /boot directory. I booted the virtual machine from the new virtual disk and immediately hit the problem.

So far, so good, but still no clue why zfsboot crashes upon encountering a gang block. So I decided to debug the crash with gdb. Standard steps:

$ qemu ... -S -s
$ gdb ...
(gdb) target remote localhost:1234

Now I didn't want to single-step through the whole boot process, so I decided to get some help from gdb. Here's a trick:

(gdb) add-symbol-file /usr/obj/usr/src/sys/boot/i386/gptzfsboot/gptzfsboot.out 0xa000

gptzfsboot.out is an ELF image produced by GCC, which then gets transformed into a raw binary and then into the final BTX binary (gptzfsboot). gptzfsboot.out is built without much debugging data, but at least it contains information about function names. Perhaps it is even possible to compile gptzfsboot.out with a higher debug level; debugging would then be much more pleasant.

0xA000 is where _code_ from gptzfsboot.out ends up being loaded in memory. BTW, having only shallow knowledge of the boot chain and BTX, I didn't know this address. Another gdb trick helped me:

(gdb) append memory boot.memdump 0x0 0x10000

This command dumps the memory contents in the range 0x0-0x10000 to a file named boot.memdump. Then I produced a hex dump of it and searched for the byte sequence with which gptzfsboot.bin starts (the raw binary produced from gptzfsboot.out); the search is sketched below. Of course, the memory dump should be taken after gptzfsboot is loaded into memory :) Catching the right moment requires a little bit of boot process knowledge. I caught it with:

(gdb) b *0xC000

That is, the memory dump was taken after gdb stopped at the above breakpoint. After that it was a piece of cake.
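For concreteness, the search mentioned above needs nothing beyond standard tools; a minimal sketch, assuming the file names used above (the actual byte pattern depends on the build, so the one below is just a placeholder):

$ hexdump -C gptzfsboot.bin | head -n 1
$ hexdump -C boot.memdump | grep -F '<leading bytes from the previous command>'

The offset of the matching line in boot.memdump is where the raw binary sits in memory.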
I set a breakpoint on the zio_read_gang function (after the add-symbol-file command) and then stepi-ed through the code (that is, instruction by instruction). The following command made it easier to see what was getting executed:

(gdb) display/i 0xA000 + $eip

I quickly stepped through the code and saw that a large value was passed to vdev_read as the 'bytes' parameter, when it should have been 512. The oversized read into a buffer allocated on the stack smashed the stack, and that was the end. Backtracking the call chain in the source code, I immediately noticed the bp condition in vdev_read_phys and realized what the problem was.
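To make the root cause concrete, here is the shape of that logic in vdev_read_phys() in sys/boot/zfs/zfsimpl.c. This is a condensed sketch from memory, not the literal source; in particular, the exact signature of the underlying read callback may differ:

static int
vdev_read_phys(vdev_t *vdev, const blkptr_t *bp, void *buf,
    off_t offset, size_t size)
{
	size_t psize;

	/* A block pointer, when supplied, overrides the explicit size. */
	if (bp != NULL)
		psize = BP_GET_PSIZE(bp);
	else
		psize = size;

	/*
	 * zio_read_gang() passed the gang bp here, so psize became the
	 * physical size recorded in the block pointer instead of the
	 * SPA_GANGBLOCKSIZE (512) bytes that the caller's on-stack gang
	 * header buffer can hold; hence the smashed stack.  Passing NULL
	 * makes psize fall back to the explicit size argument.
	 */
	return (vdev->v_phys_read(vdev, vdev->v_read_priv, offset, buf, psize));
}

Hope this is useful reading.
--
Andriy Gapon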