Date: Sun, 5 Feb 2012 18:47:29 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Pawel Jakub Dawidek
Cc: arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
Message-ID: <20120205164729.GH3283@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120205114345.GE30033@garage.freebsd.pl>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<20120205114345.GE30033@garage.freebsd.pl>

On Sun, Feb 05, 2012 at 12:43:46PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Feb 03, 2012 at 09:37:19PM +0200, Konstantin Belousov wrote:
> > The FreeBSD I/O infrastructure has a well-known deadlock caused by a
> > vnode lock order reversal when the buffers supplied to the read(2) or
> > write(2) syscalls are backed by an mmapped file.
> >
> > I previously published patches to convert the i/o path to use VMIO,
> > based on the Jeff Roberson proposal, see
> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> > deadlock. Since that work is very intrusive and did not get any
> > follow-up, it stalled.
> >
> > Below is a very lightweight patch whose only goal is to fix the
> > deadlock in the least intrusive way. This is possible after FreeBSD
> > got the vm_fault_quick_hold_pages(9) and
> > vm_fault_disable_pagefaults(9) KPIs.
> > http://people.freebsd.org/~kib/misc/vm1.3.patch
>
> +struct rl_q_entry *
> +rlqentry_alloc()
>
> Missing 'void'.
Yes, it is a style inconsistency. Fixed.
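
(As an aside, to make the deadlock scenario described at the top concrete,
here is a minimal userland sketch of the access pattern that produces the
reversal. Illustration only, not part of the patch; the file names are
invented and error handling is minimal.)

/*
 * Sketch: the read(2) destination buffer is an mmap(2) of another
 * file, so a page fault taken inside uiomove() must page in from the
 * second vnode while the first vnode lock is held.
 */
#include <sys/mman.h>

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 1024 * 1024;
	int fd1 = open("/tmp/file1", O_RDONLY);		/* vp1 */
	int fd2 = open("/tmp/file2", O_RDWR);		/* vp2 */
	char *buf;

	if (fd1 == -1 || fd2 == -1 || ftruncate(fd2, (off_t)len) == -1)
		exit(1);

	/* buf1 from the description above: backed by the pages of vp2. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd2, 0);
	if (buf == MAP_FAILED)
		exit(1);

	/*
	 * If a page of buf is not resident, this read(2) faults with
	 * vp1 locked and the fault needs vp2: lock order vp1 -> vp2.
	 * A second thread doing the mirror image (reading from vp2 into
	 * an mmap of vp1) takes vp2 -> vp1, and the two can deadlock.
	 */
	if (read(fd1, buf, len) == -1)
		exit(1);
	return (0);
}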
>
> +static int
> +rangelock_incompatible(const struct rl_q_entry *e1,
> +    const struct rl_q_entry *e2)
> +{
> +
> +	if ((e1->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ &&
> +	    (e2->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ)
> +		return (0);
> +#define	IN_RANGE(a, e) (a >= e->rl_q_start && a < e->rl_q_end)
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
> +		return (1);
> +#undef	IN_RANGE
> +	return (0);
> +}
>
> The checks here are a bit wrong and incomplete.
>
> This is correct:
>
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
>
> This is not:
>
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
>
> After simplifying one of those checks we get:
>
> 	if (end1 >= start2 && end1 < end2)
>
> This will be true if end1 == start2, but in this case it should be
> false.

Agreed, my tests are too strict there. But note that the only consequence
is that some lock requests are postponed.

>
> There are also some checks missing. If the first range covers the entire
> second range or vice versa, you will return that the locks are
> compatible, e.g.:
>
> start1-----start2-----end2-----end1

Doesn't IN_RANGE(e2->rl_q_start, e1) cover this case?
In essence, the checks I wrote verify whether any region border lies
inside the other region.

>
> The following check should cover all the cases:
>
> 	if (e1->rl_q_start < e2->rl_q_end &&
> 	    e1->rl_q_end > e2->rl_q_start) {
> 		return (1);
> 	}

Your check is much more elegant, I changed the code as you suggested.

>
> +static void
> +rangelock_calc_block(struct rangelock *lock)
> +{
> +	struct rl_q_entry *entry, *entry1, *whead;
> +
> +	if (lock->rl_currdep == TAILQ_FIRST(&lock->rl_waiters) &&
> +	    lock->rl_currdep != NULL)
> +		lock->rl_currdep = TAILQ_NEXT(lock->rl_currdep, rl_q_link);
> +	for (entry = lock->rl_currdep; entry;
> +	     entry = TAILQ_NEXT(entry, rl_q_link)) {
> +		TAILQ_FOREACH(entry1, &lock->rl_waiters, rl_q_link) {
> +			if (rangelock_incompatible(entry, entry1))
> +				goto out;
> +			if (entry1 == entry)
> +				break;
> +		}
> +	}
> +out:
> +	lock->rl_currdep = entry;
> +	TAILQ_FOREACH(whead, &lock->rl_waiters, rl_q_link) {
> +		if (whead == lock->rl_currdep)
> +			break;
> +		if (!(whead->rl_q_flags & RL_LOCK_GRANTED)) {
> +			whead->rl_q_flags |= RL_LOCK_GRANTED;
> +			wakeup(whead);
> +		}
> +	}
> +}
>
> This function doesn't look like it is going to scale. Searching for
> incompatible locks looks very expensive.

Well, it is not. I am aware of using a tree structure for the range lock,
but this will probably have to wait some time.

>
> This function seems to be responsible for waking up waiters. In case of
> range locking this might be tricky, as unlocking a read lock on the given
> range doesn't mean the range can be exclusively locked by a waiter, as
> part of this range might be read-locked multiple times, or a different
> range might be read-locked which is also covered by the waiting
> writer.
>
> Could you elaborate a bit on how this all works together? Does the
> function look through all the waiters and for each waiter go through
> all the active locks?
>
> Do you protect somehow against starving writers?

Due to a different review, I have already added comments to the rangelock
implementation.
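
(Going back to the overlap check above: a standalone userland sketch, not
the kernel code, of the boundary case where the two tests disagree. The
read/read shortcut is omitted; only the range test is mimicked.)

#include <sys/types.h>

#include <stdio.h>

struct range {
	off_t	start;
	off_t	end;		/* exclusive, i.e. [start, end) */
};

/* Border-based test, as in the original patch. */
static int
overlap_borders(const struct range *r1, const struct range *r2)
{
#define	IN_RANGE(a, r)	((a) >= (r)->start && (a) < (r)->end)
	return (IN_RANGE(r1->start, r2) || IN_RANGE(r2->start, r1) ||
	    IN_RANGE(r1->end, r2) || IN_RANGE(r2->end, r1));
#undef IN_RANGE
}

/* The simpler test suggested in the review. */
static int
overlap_simple(const struct range *r1, const struct range *r2)
{
	return (r1->start < r2->end && r1->end > r2->start);
}

int
main(void)
{
	/* Touching but disjoint ranges: [0, 10) and [10, 20). */
	struct range a = { 0, 10 }, b = { 10, 20 };

	/*
	 * Prints "borders: 1 simple: 0": the border-based test reports a
	 * conflict where there is none, so the waiter is postponed.
	 */
	printf("borders: %d simple: %d\n",
	    overlap_borders(&a, &b), overlap_simple(&a, &b));
	return (0);
}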
Basically, the rangelocks in kern_rangelock.c are fully fair. The locks
are granted in the order of the waiters' arrival (the order is defined by
the order of acquiring the interlock; in the latest version it is a sleep
channel spinlock). Freeing a range causes a rescan of the queue of waiters
starting from rl_currdep, which points to the first blocked waiter. If the
next waiter is compatible with all granted regions, it is moved before
rl_currdep and woken up.

>
> +	KASSERT(entry->rl_q_flags & RL_LOCK_GRANTED, ("XXX"));
> +	KASSERT(entry->rl_q_start == base, ("XXX"));
> +	KASSERT(entry->rl_q_end >= base + len, ("XXX"));
>
> I expect those XXX will be replaced with real messages in the final
> version? We could really use a simple ASSERT() without an obligatory
> message in the kernel...

We have MPASS. These XXX are fixed.

> +void *
> +rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_READ;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
> +
> +void *
> +rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_WRITE;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
>
> Those functions are nearly identical, maybe they can be converted to
> something like this:
>
> static void *
> rangelock_lock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk,
>     int flags)
> {
> 	struct rl_q_entry *entry;
> 	struct thread *td;
>
> 	td = curthread;
> 	if (td->td_rlqe != NULL) {
> 		entry = td->td_rlqe;
> 		td->td_rlqe = NULL;
> 	} else
> 		entry = rlqentry_alloc();
> 	entry->rl_q_flags = flags;
> 	entry->rl_q_start = base;
> 	entry->rl_q_end = base + len;
> 	return (rangelock_enqueue(lock, entry, ilk));
> }
>
> void *
> rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_READ));
> }
>
> void *
> rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_WRITE));
> }
>
> Or even merge rangelock_lock() with rangelock_enqueue()?

Done exactly this.

>
> @@ -199,6 +200,7 @@ thread_init(void *mem, int size, int flags)
>
>  	td->td_sleepqueue = sleepq_alloc();
>  	td->td_turnstile = turnstile_alloc();
> +	td->td_rlqe = rlqentry_alloc();
>
> What is the purpose of the td_rlqe field? From what I see a thread can
> lock more than one range. Could you elaborate on that as well?

td_rlqe is a cached rangelock entry for the typical case of a thread not
doing recursive i/o. It makes it possible to avoid memory allocation on
the i/o hot path by using the entry allocated at thread creation time.
Since thread creation already allocates many chunks of memory, and a
typical thread performs much more i/o than it undergoes creation, this
shortens the i/o path. If td_rlqe is already consumed, a new entry is
allocated at lock time.
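
(For illustration, the same one-element per-thread cache pattern in a
generic userland form. This is not the kernel code, just the idea behind
td_rlqe rendered with malloc(3) and a thread-local pointer; how the real
patch refills the cache on unlock is an assumption here and may differ.)

#include <stdlib.h>

struct entry {
	long	start;
	long	end;
	int	flags;
};

/* Plays the role of td->td_rlqe: filled once, reused on the hot path. */
static _Thread_local struct entry *cached;

static struct entry *
entry_get(void)
{
	struct entry *e;

	if (cached != NULL) {
		e = cached;			/* common case: no allocation */
		cached = NULL;
	} else
		e = malloc(sizeof(*e));		/* nested/recursive use */
	return (e);
}

static void
entry_put(struct entry *e)
{
	if (cached == NULL)
		cached = e;			/* refill the one-entry cache */
	else
		free(e);
}

int
main(void)
{
	struct entry *e;

	cached = malloc(sizeof(*cached));	/* what thread creation would do */
	e = entry_get();			/* hot path, reuses the cache */
	if (e == NULL)
		return (1);
	e->start = 0;
	e->end = 4096;
	e->flags = 0;
	entry_put(e);				/* entry goes back to the cache */
	return (0);
}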
>
> Do you assert somehow that the same thread is trying to write-lock the
> range it already read-locked?

No. With the latest patch it indeed can be done, because I started
recording range owners. For now, I added a TODO comment mentioning the
possible assert.

>
> --- a/sys/kern/subr_syscall.c
> +++ b/sys/kern/subr_syscall.c
> @@ -177,6 +177,12 @@ syscallret(struct thread *td, int error, struct syscall_args *sa __unused)
>  	KASSERT(td->td_locks == 0,
>  	    ("System call %s returning with %d locks held",
>  	    syscallname(p, sa->code), td->td_locks));
> +	KASSERT((td->td_pflags & TDP_NOFAULTING) == 0,
> +	    ("System call %s returning with pagefaults disabled",
> +	    syscallname(p, sa->code)));
> +	KASSERT((td->td_pflags & TDP_NOSLEEPING) == 0,
> +	    ("System call %s returning with sleep disabled",
> +	    syscallname(p, sa->code)));
>
> We can commit those separately, right?

Definitely. Rangelocks will be committed separately too. The vfs_vnops.c
change is self-contained.

>
>  #ifdef MAC
>  	if ((ioflg & IO_NOMACCHECK) == 0) {
> -		if (rw == UIO_READ)
> -			error = mac_vnode_check_read(active_cred, file_cred,
> -			    vp);
> -		else
> +		if (rw == UIO_WRITE)
>  			error = mac_vnode_check_write(active_cred, file_cred,
>  			    vp);
>  	}
>
> I don't see the mac_vnode_check_read() call being moved anywhere.
> Was that intended?

No, thank you for pointing this out. mdf already noted this.

>
> + * The vn_io_fault() is a wrapper around vn_read() and vn_write() to
> + * prevent the following deadlock:
> + *
> + * Assume that the thread A reads from the vnode vp1 info userspace
>
> s/info/into/
>
> + * buffer buf1 backed by the pages of vnode vp2. If a page in buf1 is
> + * currently not resident, then system ends up with the call chain
> + *   vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
> + *     vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
> + * which establishes lock order vp1->vn_lock, then vp2->vn_lock.
> + * If, at the same time, thread B reads from vnode vp2 into buffer buf2
> + * backed by the pages of vnode vp1, and some page in buf2 is not
> + * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
> + *
> + * To prevent the lock order reversal and deadlock, vn_io_fault() does
> + * not allow page faults to happen during VOP_READ or VOP_WRITE().
>
> s/VOP_READ/VOP_READ()/
>
> +	prot = VM_PROT_READ;
> +	if ((fp->f_flag & O_APPEND) || !(flags & FOF_OFFSET))
> +		/* For appenders, punt and lock the whole range. */
> +		rl_cookie = vn_rangelock_wlock(vp, 0, (size_t)-1);
>
> On 32-bit archs size_t is still 32 bits, AFAIR, and the file size can be
> bigger than that. I think we should also use off_t for the length (GEOM
> does that).

Agreed, but I have already moved to specifying (start, end) instead of
(start, len).

>
> +	len = uio->uio_iov->iov_len;
> +	addr = (vm_offset_t)uio->uio_iov->iov_base;
> +	end = round_page(addr + len);
> +	cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +	if (cnt > io_hold_cnt) {
> +		len = io_hold_cnt * PAGE_SIZE;
> +again:
> +		end = round_page(addr + len);
> +		cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +		if (cnt > io_hold_cnt) {
> +			len -= PAGE_SIZE;
> +			goto again;
> +		}
> +	}
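
(For reference, this is what the snippet above is clamping: a user buffer
of io_hold_cnt * PAGE_SIZE bytes that does not start on a page boundary
spans io_hold_cnt + 1 pages. A standalone userland rendition follows, with
the usual macro definitions spelled out; the io_hold_cnt value of 16 is
picked only for illustration.)

#include <stdio.h>

#define	PAGE_SIZE	4096UL
#define	trunc_page(x)	((x) & ~(PAGE_SIZE - 1))
#define	round_page(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define	howmany(x, y)	(((x) + ((y) - 1)) / (y))

int
main(void)
{
	unsigned long io_hold_cnt = 16;		/* illustrative value */
	unsigned long addr = 0x10000UL + 100;	/* misaligned iov_base */
	unsigned long len = io_hold_cnt * PAGE_SIZE;
	unsigned long end = round_page(addr + len);
	unsigned long cnt = howmany(end - trunc_page(addr), PAGE_SIZE);

	/* Prints 17: one page more than io_hold_cnt, hence the shaving loop. */
	printf("pages spanned: %lu, io_hold_cnt: %lu\n", cnt, io_hold_cnt);
	return (0);
}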
>
> This is a bit hard to understand immediately. Maybe something like the
> following is a bit more readable (I assume this goto could happen only
> once):
>
> 	len = MIN(uio->uio_iov->iov_len, io_hold_cnt * PAGE_SIZE);
> 	addr = (vm_offset_t)uio->uio_iov->iov_base;
> 	end = round_page(addr + len);
> 	if (howmany(end - trunc_page(addr), PAGE_SIZE) > io_hold_cnt) {
> 		len -= PAGE_SIZE;
> 		end = round_page(addr + len);
> 	}

I think it can happen twice, if both the start and the len are perfectly
mis-aligned.

>
> +vn_truncate(struct file *fp, off_t length, struct ucred *active_cred,
> +    struct thread *td)
> [...]
> +	/*
> +	 * Lock the range where the shortening takes place. An increase of
> +	 * the file size does not need a rangelock, but it is faster to lock
> +	 * the range than to call VOP_GETATTR to get the current size and
> +	 * deal with races.
> +	 */
> +	rl_cookie = vn_rangelock_wlock(vp, length, -1);
>
> Hmm. In case of increasing the file size don't we need the range lock
> because we might be racing with a write append?

We need to prevent the split i/o from being performed on disjoint ranges
of the file if a truncation happens between the splits. I changed the
rangelock obtained in vn_truncate to cover the whole range. The bug
appeared because I imported the rangelock from the bigger patch, which
indeed handled the expansions. Thank you for pointing this out.

>
> BTW. Comments in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c
> might be a useful read.

Thank you for the review.

http://people.freebsd.org/~kib/misc/vm1.8.patch