Date: Sun, 18 Feb 2018 00:54:21 +0200
From: Andriy Gapon <avg@FreeBSD.org>
To: Andrew Reilly <areilly@bigpond.net.au>, kib@freebsd.org, Gleb Smirnoff <glebius@FreeBSD.org>
Cc: current@freebsd.org
Subject: Re: Since last week (today) current on my Ryzen box is unstable
Message-ID: <cc3ae685-5f0e-d968-7b08-60a4836093e1@FreeBSD.org>
In-Reply-To: <0CEA9D55-D488-42EC-BBDE-D0B7CE58BAEA@bigpond.net.au>
References: <0CEA9D55-D488-42EC-BBDE-D0B7CE58BAEA@bigpond.net.au>
On 17/02/2018 14:16, Andrew Reilly wrote:
> Today's rebuild has given me uptimes of below an hour, usually. The box will
> stay up in single user mode long enough to rebuild world/kernel, but
> multi-user it is panicking at
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1592
>
> The backtrace shows that it gets to this panic from a sendfile() syscall. The
> line above is in the middle of a big edit that's part of svn revision 329363.
> The tripping assertion seems to suggest that m->valid != 0, for whatever
> that's worth.

I am doing a bit of an offline investigation with Andrew and it seems that the
actual panic message is this:

panic: vm_page_assert_xbusied: page 0xfffff807ebbd8f98 not exclusive busy @ /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1592

The stack is this:

vpanic() at vpanic/frame 0xfffffe00b3c36390
dmu_read_pages() at dmu_read_pages+0x535/frame 0xfffffe00b3c36460
zfs_freebsd_getpages() at zfs_freebsd_getpages+0x24c/frame 0xfffffe00b3c36510
VOP_GETPAGES_APV() at VOP_GETPAGES_APV+0xd9/frame 0xfffffe00b3c36540
vop_stdgetpages_async() at vop_stdgetpages_async+0x49/frame 0xfffffe00b3c36590
VOP_GETPAGES_ASYNC_APV() at VOP_GETPAGES_ASYNC_APV+0xd9/frame 0xfffffe00b3c365c0
vnode_pager_getpages_async() at vnode_pager_getpages_async+0x81/frame 0xfffffe00b3c36650
vn_sendfile() at vn_sendfile+0xe70/frame 0xfffffe00b3c368e0
sendfile() at sendfile+0x149/frame 0xfffffe00b3c36980
amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe00b3c36ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0x7fffffffdb00

I looked at the sendfile_swapin() code and it seems that it uses the pager API
in an undocumented way. Specifically, it inserts bogus_page into the array of
requested pages.

For starters, bogus_page is not busied, while VOP_GETPAGES is documented to
require that all requested pages be exclusively busied.
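
The mismatch described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: every name prefixed with model_ is hypothetical, the page structure models nothing but the exclusive-busy state, and the real vm_page, bogus_page and VOP_GETPAGES are considerably more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vm_page; only the busy state is modeled. */
struct model_page {
	bool xbusied;
};

/* Stand-in for the kernel's shared bogus_page, which is not busied. */
static struct model_page model_bogus_page = { .xbusied = false };

/*
 * Models a caller that replaces the page at index idx (assumed to be
 * already valid) with the shared placeholder before calling the pager.
 */
static void
model_substitute_bogus(struct model_page **ma, size_t idx)
{
	ma[idx] = &model_bogus_page;
}

/*
 * Models a pager that enforces the documented contract: every requested
 * page must be exclusively busied.  Returns the index of the first
 * offending page, or -1 if the whole array satisfies the contract.
 * (The kernel would panic here instead of returning an index.)
 */
static int
model_getpages_strict(struct model_page **ma, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (!ma[i]->xbusied)
			return ((int)i);
	}
	return (-1);
}

/*
 * Models a pager that has been taught about the placeholder: slots
 * holding model_bogus_page are skipped, all others are still checked.
 */
static int
model_getpages_tolerant(struct model_page **ma, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (ma[i] == &model_bogus_page)
			continue;
		if (!ma[i]->xbusied)
			return ((int)i);
	}
	return (-1);
}
```

With three busied pages, model_getpages_strict() accepts the array; once model_substitute_bogus() replaces the middle slot, the strict check rejects index 1 while the tolerant one still accepts the array. Which of the two behaviours is the correct one is exactly the question being raised here.
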
Second, I have always had the impression that bogus_page is an implementation
detail of the unified buffer / page cache and that other code need not be
aware of it.

So, my opinion is that the sendfile code uses a "clever hack" that happens to
work with the buffer cache based filesystems, but that the hack is a bug.
So, I'd prefer that the problem be fixed in that code.

But I am open to being convinced that all VOP_GETPAGES implementations,
including the one in ZFS, must be made aware of bogus_page. Or, at least, that
they should not verify that the requested pages are busied.

-- 
Andriy Gapon