Date: Thu, 27 May 2010 15:40:07 +0100
From: Doug Rabson <dfr@rabson.org>
To: Andriy Gapon <avg@freebsd.org>
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Roman Divacky <rdivacky@freebsd.org>, Robert Noland <rnoland@freebsd.org>
Subject: Re: bin/144214: zfsboot fails on gang block after upgrade to zfs v14
Message-ID: <AANLkTinza8LKXH5BrlhHsTtAwzeAgcgwOKSlpPBnuFLM@mail.gmail.com>
In-Reply-To: <4BFE2ED6.1070402@freebsd.org>
References: <4BEBA334.6080101@icyb.net.ua> <4BEC040E.9080303@FreeBSD.org> <4BFE2ED6.1070402@freebsd.org>
On 27 May 2010 09:35, Andriy Gapon <avg@freebsd.org> wrote:

> I think I nailed this problem now.
> What was additionally needed was the following change:
>         if (!vdev || !vdev->v_read)
>                 return (EIO);
> -       if (vdev->v_read(vdev, bp, &zio_gb, offset, SPA_GANGBLOCKSIZE))
> +       if (vdev->v_read(vdev, NULL, &zio_gb, offset, SPA_GANGBLOCKSIZE))
>                 return (EIO);
>
> Full patch is here:
> http://people.freebsd.org/~avg/boot-zfs-gang.diff
>
> Apparently I am not as smart as Roman :) because I couldn't find the bug
> by just staring at this rather small function (for a couple of hours),
> so I had to reproduce the problem to catch it. Hence I am copying
> hackers@ to share a couple of tricks that were new to me. Perhaps they
> could help someone else some other day.
>
> First, after very helpful hints that I received in parallel from pjd and
> two Oracle/Sun developers, it became very easy to create a pool
> containing files with gang blocks.
> One can set the metaslab_gang_bang variable in metaslab.c to some value
> < 128K, and then blocks with size greater than metaslab_gang_bang will
> be allocated as gang blocks with 25% chance. I personally did something
> similar but slightly more deterministic:
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
> @@ -1572,6 +1572,12 @@ zio_dva_allocate(zio_t *zio)
>         ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
>         ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
>
> +       /*XXX XXX XXX XXX*/
> +       if (zio->io_size > 8 * 1024) {
> +               return (zio_write_gang_block(zio));
> +       }
> +       /*XXX XXX XXX XXX*/
> +
>         error = metaslab_alloc(spa, mc, zio->io_size, bp,
>             zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);
>
> This ensured that any block > 8K would be a gang block.
> Then I compiled zfs.ko with this change and put it into a virtual
> machine, where I created a pool and populated its root/boot filesystem
> with the /boot directory.
> I booted the virtual machine from the new virtual disk and immediately
> hit the problem.
>
> So far, so good, but still no clue why zfsboot crashes upon encountering
> a gang block.
>
> So I decided to debug the crash with gdb.
> Standard steps:
> $ qemu ... -S -s
> $ gdb
> ...
> (gdb) target remote localhost:1234
>
> Now, I didn't want to single-step through the whole boot process, so I
> decided to get some help from gdb. Here's a trick:
> (gdb) add-symbol-file /usr/obj/usr/src/sys/boot/i386/gptzfsboot/gptzfsboot.out 0xa000
>
> gptzfsboot.out is an ELF image produced by GCC, which then gets
> transformed into a raw binary and then into the final BTX binary
> (gptzfsboot).
> gptzfsboot.out is built without much debugging data, but at least it
> contains information about function names. Perhaps it's even possible to
> compile gptzfsboot.out with a higher debug level; then debugging would
> be much more pleasant.
>
> 0xA000 is where _code_ from gptzfsboot.out ends up being loaded in
> memory.
> BTW, having only shallow knowledge of the boot chain and BTX, I didn't
> know this address. Another GDB trick helped me:
> (gdb) append memory boot.memdump 0x0 0x10000
>
> This command dumps the memory content in the range 0x0-0x10000 to a file
> named boot.memdump. Then I produced a hex dump and searched for the byte
> sequence with which gptzfsboot.bin starts (the raw binary produced from
> gptzfsboot.out).
>
> Of course, the memory dump should be taken after gptzfsboot is loaded
> into memory :)
> Catching the right moment requires a little bit of boot process
> knowledge.
> I caught it with:
> (gdb) b *0xC000
>
> That is, the memory dump was taken after gdb stopped at the above
> breakpoint.
>
> After that it was a piece of cake. I set a breakpoint on the
> zio_read_gang function (after the add-symbol-file command) and then
> stepi-ed through the code (that is, instruction by instruction).
> The following command made it easier to see what's getting executed:
> (gdb) display/i 0xA000 + $eip
>
> I quickly stepped through the code and saw that a large value was passed
> to vdev_read as the 'bytes' parameter. But this should have been 512.
> The oversized read into a buffer allocated on the stack smashed the
> stack, and that was the end.
>
> Backtracking the call chain in the source code, I immediately noticed
> the bp condition in vdev_read_phys and realized what the problem was.
>
> Hope this makes for useful reading.
>

Excellent work - thanks for looking into this. I still think it's easier to
debug this code in userland using a shim that redirects the zfsboot I/O
calls to simple read system calls...
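
The shim idea can be sketched in a few lines of C. This is only a minimal
sketch under stated assumptions, not the actual zfsboot code: mock_vdev,
mock_bp, shim_phys_read, mock_vdev_read, read_gang_header and
make_disk_image are all hypothetical names, and GANG_HDR_SIZE merely stands
in for SPA_GANGBLOCKSIZE. The physical read is a plain pread(2) on a file,
and the size selection imitates, in simplified form, the bp condition
described above, so an oversized read shows up as an error code instead of
a smashed stack.

```c
#define _POSIX_C_SOURCE 200809L

#include <sys/types.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

#define GANG_HDR_SIZE 512	/* stand-in for SPA_GANGBLOCKSIZE */

/* Hypothetical stand-in for a block pointer: only the physical size. */
struct mock_bp {
	size_t psize;
};

/* Hypothetical stand-in for a vdev whose "disk" is an ordinary file. */
struct mock_vdev {
	int fd;
};

/*
 * The shim itself: where boot code would call into the BIOS, do a plain
 * pread(2). cap is the capacity of buf, so a request larger than the
 * buffer is reported instead of overrunning it.
 */
int
shim_phys_read(struct mock_vdev *vd, void *buf, size_t cap, off_t off,
    size_t len)
{
	assert(buf != NULL);
	if (len > cap)
		return (EOVERFLOW);
	ssize_t n = pread(vd->fd, buf, len, off);
	return (n == (ssize_t)len ? 0 : EIO);
}

/*
 * Simplified imitation of the condition described in the thread: when bp
 * is non-NULL the read size is taken from bp, not from "bytes".
 */
int
mock_vdev_read(struct mock_vdev *vd, const struct mock_bp *bp, void *buf,
    size_t cap, off_t off, size_t bytes)
{
	size_t psize = (bp != NULL) ? bp->psize : bytes;

	return (shim_phys_read(vd, buf, cap, off, psize));
}

/* Read a gang header into a 512-byte stack buffer, with or without bp. */
int
read_gang_header(struct mock_vdev *vd, const struct mock_bp *bp)
{
	char hdr[GANG_HDR_SIZE];

	return (mock_vdev_read(vd, bp, hdr, sizeof(hdr), 0, GANG_HDR_SIZE));
}

/* Create an unlinked, zero-filled temporary file to act as the disk. */
int
make_disk_image(size_t size)
{
	char path[] = "/tmp/zfsshimXXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return (-1);
	unlink(path);
	if (ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return (-1);
	}
	return (fd);
}
```

With a 1 MB image from make_disk_image(), read_gang_header(&vd, NULL)
returns 0, whereas passing a mock_bp whose psize is 128 * 1024 returns
EOVERFLOW. In this mock, that error is the userland counterpart of the
stack smash, and it can be examined with an ordinary debugger.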