Date:      Sun, 18 Feb 2018 09:28:30 +0200
From:      Andriy Gapon <avg@FreeBSD.org>
To:        Gleb Smirnoff <glebius@FreeBSD.org>
Cc:        Andrew Reilly <areilly@bigpond.net.au>, kib@FreeBSD.org, current@FreeBSD.org
Subject:   Re: Since last week (today) current on my Ryzen box is unstable
Message-ID:  <431f3e00-c66a-8e2e-6c61-a315a6353d1d@FreeBSD.org>
In-Reply-To: <20180218023545.GE93303@FreeBSD.org>
References:  <0CEA9D55-D488-42EC-BBDE-D0B7CE58BAEA@bigpond.net.au> <cc3ae685-5f0e-d968-7b08-60a4836093e1@FreeBSD.org> <20180218023545.GE93303@FreeBSD.org>

On 18/02/2018 04:35, Gleb Smirnoff wrote:
>   Andriy,
> 
> On Sun, Feb 18, 2018 at 12:54:21AM +0200, Andriy Gapon wrote:
> A> > Today's rebuild has given me uptimes of below an hour, usually.  The box will stay up in single user mode long enough to rebuild world/kernel, but multi-user it is panicking at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1592
> A> > 
> A> > The backtrace shows that it gets to this panic from a sendfile() syscall.  The line above is in the middle of a big edit that's part of svn revision 329363.  The tripping assertion seems to suggest that m->valid != 0, for whatever that's worth.
> A> 
> A> I am doing a bit of an offline investigation with Andrew and it seems that the
> A> actual panic message is this:
> A> 
> A> panic: vm_page_assert_xbusied: page 0xfffff807ebbd8f98 not exclusive busy @ /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1592
> A> 
> A> The stack is this:
> A> vpanic() at vpanic/frame 0xfffffe00b3c36390
> A> dmu_read_pages() at dmu_read_pages+0x535/frame 0xfffffe00b3c36460
> A> zfs_freebsd_getpages() at zfs_freebsd_getpages+0x24c/frame 0xfffffe00b3c36510
> A> VOP_GETPAGES_APV() at VOP_GETPAGES_APV+0xd9/frame 0xfffffe00b3c36540
> A> vop_stdgetpages_async() at vop_stdgetpages_async+0x49/frame 0xfffffe00b3c36590
> A> VOP_GETPAGES_ASYNC_APV() at VOP_GETPAGES_ASYNC_APV+0xd9/frame 0xfffffe00b3c365c0
> A> vnode_pager_getpages_async() at vnode_pager_getpages_async+0x81/frame 0xfffffe00b3c36650
> A> vn_sendfile() at vn_sendfile+0xe70/frame 0xfffffe00b3c368e0
> A> sendfile() at sendfile+0x149/frame 0xfffffe00b3c36980
> A> amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe00b3c36ab0
> A> fast_syscall_common() at fast_syscall_common+0x101/frame 0x7fffffffdb00
> A> 
> A> I looked at sendfile_swapin() code and it seems that it uses the pager API in an
> A> undocumented way.  Specifically, it inserts bogus_page into the array of
> A> requested pages.  For starters, bogus_page is not busied, whereas VOP_GETPAGES is
> A> documented to require that all requested pages be exclusively busied.  Second, I always had
> A> an impression that bogus_page is an implementation detail of the unified buffer
> A> / page cache and that other code need not be aware of it.
> A> 
> A> So, my opinion is that the sendfile code uses a "clever hack" that happens to
> A> work with the buffer cache based filesystems, but the hack is nevertheless a bug.
> A> I'd prefer that the problem be fixed in that code.
> A> But I am open to being convinced that all VOP_GETPAGES implementations,
> A> including that in ZFS, must be made aware of bogus_page.  Or, at least, that
> A> they should not verify that the requested pages are busied.
> 
> This is an optimization that improves throughput when the file memory cache is
> fragmented. Why don't you like adding the code to zfs_freebsd_getpages()?

I cited two reasons above and expected to hear some counter-points to them rather
than to have them ignored :-)
If we settle upon allowing bogus_page to be used in ma[], then that will
obviously need to be documented.

-- 
Andriy Gapon


