Date:      Fri, 31 Jul 2009 19:10:03 +0200
From:      Thomas Backman <serenity@exscape.org>
To:        Pawel Jakub Dawidek <pjd@freebsd.org>
Cc:        freebsd-fs@freebsd.org, FreeBSD current <freebsd-current@freebsd.org>, Andriy Gapon <avg@FreeBSD.org>
Subject:   Re: zfs: Fatal trap 12: page fault while in kernel mode
Message-ID:  <C208A1B6-719B-497F-B108-528109F87F15@exscape.org>
In-Reply-To: <BEC7EB06-3B69-4BBC-B143-409E55C1F3A8@exscape.org>
References:  <20090727072503.GA52309@jpru.ffm.jpru.de> <20090729084723.GD1586@garage.freebsd.pl> <F4F82B3E-C119-40EF-9AA4-937052876D1E@exscape.org> <4A7030B6.8010205@icyb.net.ua> <97D5950F-4E4D-4446-AC22-92679135868D@exscape.org> <4A7048A9.4020507@icyb.net.ua> <52AA86CB-6C06-4370-BA73-CE19175467D0@exscape.org> <4A705299.8060504@icyb.net.ua> <D3491B77-DA5C-4E10-BE1D-D6EF8CFB112E@exscape.org> <4A7054E1.5060402@icyb.net.ua> <5918824D-A67C-43E6-8685-7B72A52B9CAE@exscape.org> <4A705E50.8070307@icyb.net.ua> <4A70728C.7020004@freebsd.org> <6D47A34B-0753-4CED-BF3D-C505B37748FC@exscape.org> <4A708455.5070304@freebsd.org> <86983A55-E5C4-4C04-A4C7-0AE9A9EE37A3@exscape.org> <4A718E03.6030909@freebsd.org> <71A038EC-02B1-4606-96C2-5E84BE80F005@exscape.org> <4A719CA4.4060400@freebsd.org> <19347561-3CE6-40B3-930A-EB9925D3AFD1@exscape.org> <4A71AD29.10705@freebsd.org> <7544AED1-1216-4A24-B287-F54117641F76@exscape.org> <4A71B239.8060007@freebsd.org> <3AA3C1CB-CEF7-46CC-A9C7-1648093D679E@exscape.org> <4A71BED8.7050300@freebsd.org> <C7FEDA88-7A89-45DB-BD16-4C8816D17E0D@exscape.org> <BEC7EB06-3B69-4BBC-B143-409E55C1F3A8@exscape.org>


On Jul 30, 2009, at 20:29, Thomas Backman wrote:

> On Jul 30, 2009, at 18:41, Thomas Backman wrote:
>
>> On Jul 30, 2009, at 17:40, Andriy Gapon wrote:
>>> on 30/07/2009 18:25 Thomas Backman said the following:
>>>> PS. I'll test Pawel's patch sometime after dinner. ;)
>>>
>>> I believe that you should get a perfect result with it.
>>>
>>> -- Andriy Gapon
>> If I dare say it, you were right! I've been testing for about half  
>> an hour or so (probably a bit more) now.
>> Still using DEBUG_VFS_LOCKS, and I've tried the test case several  
>> times, ran an initial backup (i.e. destroy target pool and send| 
>> recv the entire pool) and a few incrementals. Rebooted, tried it  
>> again. No panic, no problems! :)
>> Let's hope it stays this way.
>>
>> So, in short: With that patch (copied here just in case: http://exscape.org/temp/zfs_vnops.working.patch 
>>  ) and the libzfs patch linked previously, it appears zfs send/recv  
>> works plain fine. I have yet to try it with clone/promote and  
>> stuff, but since that gave the same panic that this solved, I'm  
>> hoping there will be no problems with that anymore.
>
> Arrrgh!
> I guess I spoke too soon after all... new panic yet again. :(
> *sigh* It feels as if this will never become stable right now.  
> (Maybe that's because I've spent all day and most of yesterday too  
> on this ;)
>
> [... same panic as I'm posting in the reply below snipped ...]
>
> Unfortunately, I'm not sure I can reproduce this reliably, since it  
> worked a bunch of times both before and after my previous mail.
>
> Oh, and I'm still using -DDEBUG=1 and DEBUG_VFS_LOCKS... If this  
> isn't a new panic because of the changes, perhaps it was triggered  
> now and never before because of the -DDEBUG?

OK, I created a "test case" that triggers this panic for me every  
time, and reproduced it on another machine, so it should, uh, "work"  
for anyone reading this as well.

Here are my patches, and the script used to reproduce the panic:
(This assumes that you've got a clean SVN/cvsup source tree. If you  
have any of the patches mentioned below, remove them from the .patch  
first.)
http://exscape.org/temp/zfs_destroy_panic_patches.patch (contains:  
James R. Van Artsdalen's libzfs_sendrecv patch that makes it not  
coredump(...), activating ZFS debugging (-DDEBUG=1), and Pawel's  
zfs_vnops.c patch.)
http://exscape.org/temp/zfs_destroy_panic.sh (needs bash and 200MB  
free on your /root/-containing FS, unless you change the variables at  
the top; usage: "bash ...sh crash")

You'll need to rebuild zfs.ko and libzfs, and if you use zfs.ko  
already, of course, reboot. (The libzfs patch can be installed and  
used without rebooting.)

1) cd /usr/src; fetch http://exscape.org/temp/zfs_destroy_panic_patches.patch && patch < zfs_destroy_panic_patches.patch
2) cd /usr/src/cddl/lib/libzfs/ ; make && make install
3) cd /usr/src/sys/modules/zfs ; make && make install
3b) (reboot, or kldload zfs)
4) fetch http://exscape.org/temp/zfs_destroy_panic.sh && bash zfs_destroy_panic.sh crash
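
For convenience, the steps above could be wrapped in a small script, roughly like the sketch below. This is only a sketch and not something from my actual setup: the DRY_RUN switch, the run helper and the build/crash stage names are made up for illustration. It defaults to just printing the commands; nothing is executed unless DRY_RUN=0 is set.

```sh
#!/bin/sh
# Sketch of steps 1-4 as two stages. DRY_RUN, run(), build_stage() and
# crash_stage() are hypothetical names, not part of the patches or the script.
set -eu

DRY_RUN="${DRY_RUN:-1}"            # 1 = only print commands (default), 0 = run them
BASE_URL="http://exscape.org/temp"
SRC="/usr/src"

# Print a command line; execute it in a subshell only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$*"
    fi
}

# Steps 1-3: apply the patches, rebuild and install libzfs and zfs.ko.
build_stage() {
    run "cd $SRC && fetch $BASE_URL/zfs_destroy_panic_patches.patch && patch < zfs_destroy_panic_patches.patch"
    run "cd $SRC/cddl/lib/libzfs && make && make install"
    run "cd $SRC/sys/modules/zfs && make && make install"
}

# Step 4: fetch and run the test case. Only do this after step 3b
# (reboot, or kldload zfs), so the rebuilt module is the one in use.
crash_stage() {
    run "fetch $BASE_URL/zfs_destroy_panic.sh && bash zfs_destroy_panic.sh crash"
}

case "${1:-build}" in
    build) build_stage ;;
    crash) crash_stage ;;
    *) echo "usage: $0 [build|crash]" >&2; exit 2 ;;
esac
```

Saved under any name you like, running it with "build" covers steps 1-3 and running it with "crash" after the reboot covers step 4; in both cases set DRY_RUN=0 to actually execute the commands.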

My output (snipped for brevity, most is useless stuff from dd, etc.):
(I prepended a >> to output written by my script; the rest is from  
zfs. This isn't in the script itself.)

 >> Creating pools
 >> Creating filesystems
 >> Creating snapshot(s)
 >> Doing initial clone to slave pool
receiving full stream of crashtestmaster@backup-20090731-185218 into crashtestslave@backup-20090731-185218
received 15.0KB stream in 1 seconds (15.0KB/sec)
receiving full stream of crashtestmaster/testroot@backup-20090731-185218 into crashtestslave/testroot@backup-20090731-185218
received 15.0KB stream in 1 seconds (15.0KB/sec)
receiving full stream of crashtestmaster/testroot/testfs@backup-20090731-185218 into crashtestslave/testroot/testfs@backup-20090731-185218
received 1.02MB stream in 1 seconds (1.02MB/sec)
 >> Initial step done!
 >> Destroying testfs
 >> Taking snapshots
 >> Starting backup...
sending from @backup-20090731-185218 to crashtestmaster@backup-20090731-185226-11214-7776
sending from @backup-20090731-185218 to crashtestmaster/testroot@backup-20090731-185226-11214-7776
attempting destroy crashtestslave/testroot/testfs@backup-20090731-185218
success
attempting destroy crashtestslave/testroot/testfs
success
receiving incremental stream of crashtestmaster@backup-20090731-185226-11214-7776 into crashtestslave@backup-20090731-185226-11214-7776
received 312B stream in 1 seconds (312B/sec)
receiving incremental stream of crashtestmaster/testroot@backup-20090731-185226-11214-7776 into crashtestslave/testroot@backup-20090731-185226-11214-7776
[... panic, no more output ...]


DDB info, etc (from the original box; not the same run as above, but  
the same panic, so...):

Unread portion of the kernel message buffer:
panic: solaris assert: ((zp)->z_vnode)->v_usecount > 0, file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c, line: 920
cpuid = 0
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
panic() at panic+0x182
zfsvfs_teardown() at zfsvfs_teardown+0x24d
zfs_suspend_fs() at zfs_suspend_fs+0x2b
zfs_ioc_recv() at zfs_ioc_recv+0x28b
zfsdev_ioctl() at zfsdev_ioctl+0x8a
devfs_ioctl_f() at devfs_ioctl_f+0x77
kern_ioctl() at kern_ioctl+0xf6
ioctl() at ioctl+0xfd
syscall() at syscall+0x28f
Xfast_syscall() at Xfast_syscall+0xe1
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x800fe5f7c, rsp = 0x7fffffff8ee8, rbp = 0x7fffffff9c20 ---
KDB: enter: panic
panic: from debugger
cpuid = 0
Uptime: 25m47s
Physical memory: 2030 MB
Dumping 1663 MB: ...

#11 0xffffffff8033abcb in panic (fmt=Variable "fmt" is not available.
)
     at /usr/src/sys/kern/kern_shutdown.c:558
#12 0xffffffff80b0ec5d in zfsvfs_teardown () from /boot/kernel/zfs.ko
#13 0x0000000000100000 in ?? ()
#14 0xffffff0048a7e250 in ?? ()
#15 0xffffff0048a7e000 in ?? ()
#16 0xffffff00063c0000 in ?? ()
#17 0xffffff803e8f27a0 in ?? ()
#18 0xffffff803e8f27d0 in ?? ()
#19 0xffffff803e8f2770 in ?? ()
#20 0xffffff803e8f2740 in ?? ()
#21 0xffffffff80b0ecab in zfs_suspend_fs () from /boot/kernel/zfs.ko
Previous frame inner to this frame (corrupt stack?)

I commented out -DDEBUG=1 and rebuilt+installed just the zfs module,  
and the panic appears to be gone. With DEBUG, it panicked every time  
(and I tried it at least 4-5 times). Without, it has worked flawlessly  
three times in a row, as has my regular backup.

So, the big, TL;DR question is: is the ASSERT() unnecessary, as Andriy  
suggested it *might* be, or is this a real issue that actually needs  
fixing? It doesn't feel right to just paper over a potential bug by  
ignoring a failed assertion...
Pawel?

Regards,
Thomas


