Date: Sun, 5 Feb 2012 18:47:29 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Pawel Jakub Dawidek
Cc: arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
Message-ID: <20120205164729.GH3283@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120205114345.GE30033@garage.freebsd.pl>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<20120205114345.GE30033@garage.freebsd.pl>

On Sun, Feb 05, 2012 at 12:43:46PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Feb 03, 2012 at 09:37:19PM +0200, Konstantin Belousov wrote:
> > The FreeBSD I/O infrastructure has a well-known deadlock caused by a
> > vnode lock order reversal when the buffers supplied to the read(2) or
> > write(2) syscalls are backed by an mmapped file.
> >
> > I previously published patches to convert the i/o path to use VMIO,
> > based on the Jeff Roberson proposal, see
> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> > deadlock. Since that work is very intrusive and did not get any
> > follow-up, it stalled.
> >
> > Below is a very lightweight patch whose only goal is to fix the
> > deadlock in the least intrusive way. This is possible after FreeBSD
> > got the vm_fault_quick_hold_pages(9) and
> > vm_fault_disable_pagefaults(9) KPIs.
> > http://people.freebsd.org/~kib/misc/vm1.3.patch
>
> +struct rl_q_entry *
> +rlqentry_alloc()
>
> Missing 'void'.
Yes, it is a style inconsistency. Fixed.
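
(As an aside, to make the deadlock scenario described at the top concrete,
here is a minimal userland sketch of the access pattern that produces the
reversal. Illustration only, not part of the patch; the file names are
invented and error handling is minimal.)

/*
 * Sketch: the read(2) destination buffer is an mmap(2) of another
 * file, so a page fault taken inside uiomove() must page in from the
 * second vnode while the first vnode lock is held.
 */
#include <sys/mman.h>

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	size_t len = 1024 * 1024;
	int fd1 = open("/tmp/file1", O_RDONLY);		/* vp1 */
	int fd2 = open("/tmp/file2", O_RDWR);		/* vp2 */
	char *buf;

	if (fd1 == -1 || fd2 == -1 || ftruncate(fd2, (off_t)len) == -1)
		exit(1);

	/* buf1 from the description above: backed by the pages of vp2. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd2, 0);
	if (buf == MAP_FAILED)
		exit(1);

	/*
	 * If a page of buf is not resident, this read(2) faults with
	 * vp1 locked and the fault needs vp2: lock order vp1 -> vp2.
	 * A second thread doing the mirror image (reading from vp2 into
	 * an mmap of vp1) takes vp2 -> vp1, and the two can deadlock.
	 */
	if (read(fd1, buf, len) == -1)
		exit(1);
	return (0);
}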
>
> +static int
> +rangelock_incompatible(const struct rl_q_entry *e1,
> +    const struct rl_q_entry *e2)
> +{
> +
> +	if ((e1->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ &&
> +	    (e2->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ)
> +		return (0);
> +#define	IN_RANGE(a, e) (a >= e->rl_q_start && a < e->rl_q_end)
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
> +		return (1);
> +#undef	IN_RANGE
> +	return (0);
> +}
>
> The checks here are a bit wrong and incomplete.
>
> This is correct:
>
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
>
> This is not:
>
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
>
> After simplifying one of those checks we get:
>
> 	if (end1 >= start2 && end1 < end2)
>
> This will be true if end1 == start2, but in this case it should be
> false.

Agreed, my tests are too strict there. But note that the only consequence
is that some lock requests are postponed.

>
> There are also some checks missing. If the first range covers the entire
> second range or vice versa, you will return that the locks are
> compatible, e.g.:
>
> start1-----start2-----end2-----end1

Doesn't IN_RANGE(e2->rl_q_start, e1) cover this case?
In essence, the checks I wrote verify whether any region border lies
inside the other region.

>
> The following check should cover all the cases:
>
> 	if (e1->rl_q_start < e2->rl_q_end &&
> 	    e1->rl_q_end > e2->rl_q_start) {
> 		return (1);
> 	}

Your check is much more elegant, I changed the code as you suggested.

>
> +static void
> +rangelock_calc_block(struct rangelock *lock)
> +{
> +	struct rl_q_entry *entry, *entry1, *whead;
> +
> +	if (lock->rl_currdep == TAILQ_FIRST(&lock->rl_waiters) &&
> +	    lock->rl_currdep != NULL)
> +		lock->rl_currdep = TAILQ_NEXT(lock->rl_currdep, rl_q_link);
> +	for (entry = lock->rl_currdep; entry;
> +	     entry = TAILQ_NEXT(entry, rl_q_link)) {
> +		TAILQ_FOREACH(entry1, &lock->rl_waiters, rl_q_link) {
> +			if (rangelock_incompatible(entry, entry1))
> +				goto out;
> +			if (entry1 == entry)
> +				break;
> +		}
> +	}
> +out:
> +	lock->rl_currdep = entry;
> +	TAILQ_FOREACH(whead, &lock->rl_waiters, rl_q_link) {
> +		if (whead == lock->rl_currdep)
> +			break;
> +		if (!(whead->rl_q_flags & RL_LOCK_GRANTED)) {
> +			whead->rl_q_flags |= RL_LOCK_GRANTED;
> +			wakeup(whead);
> +		}
> +	}
> +}
>
> This function doesn't look like it is going to scale. Searching for
> incompatible locks looks very expensive.

Well, it is not. I am aware of using a tree structure for the range lock,
but this will probably have to wait some time.

>
> This function seems to be responsible for waking up waiters. In case of
> range locking this might be tricky, as unlocking a read lock on the given
> range doesn't mean the range can be exclusively locked by a waiter, as
> part of this range might be read-locked multiple times, or a different
> range might be read-locked which is also covered by the waiting
> writer.
>
> Could you elaborate a bit on how this all works together? Does the
> function look through all the waiters and for each waiter go through
> all the active locks?
>
> Do you protect somehow against starving writers?

Due to a different review, I have already added comments to the rangelock
implementation.
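
(Going back to the overlap check above: a standalone userland sketch, not
the kernel code, of the boundary case where the two tests disagree. The
read/read shortcut is omitted; only the range test is mimicked.)

#include <sys/types.h>

#include <stdio.h>

struct range {
	off_t	start;
	off_t	end;		/* exclusive, i.e. [start, end) */
};

/* Border-based test, as in the original patch. */
static int
overlap_borders(const struct range *r1, const struct range *r2)
{
#define	IN_RANGE(a, r)	((a) >= (r)->start && (a) < (r)->end)
	return (IN_RANGE(r1->start, r2) || IN_RANGE(r2->start, r1) ||
	    IN_RANGE(r1->end, r2) || IN_RANGE(r2->end, r1));
#undef IN_RANGE
}

/* The simpler test suggested in the review. */
static int
overlap_simple(const struct range *r1, const struct range *r2)
{
	return (r1->start < r2->end && r1->end > r2->start);
}

int
main(void)
{
	/* Touching but disjoint ranges: [0, 10) and [10, 20). */
	struct range a = { 0, 10 }, b = { 10, 20 };

	/*
	 * Prints "borders: 1 simple: 0": the border-based test reports a
	 * conflict where there is none, so the waiter is postponed.
	 */
	printf("borders: %d simple: %d\n",
	    overlap_borders(&a, &b), overlap_simple(&a, &b));
	return (0);
}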
Basically, the rangelocks in kern_rangelock.c are fully fair. The locks
are granted in the order of the waiters' arrival (the order is defined by
the order of acquiring the interlock; in the latest version it is a sleep
channel spinlock). Freeing a range causes a rescan of the queue of waiters
starting from rl_currdep, which points to the first blocked waiter. If the
next waiter is compatible with all granted regions, it is moved before
rl_currdep and woken up.

>
> +	KASSERT(entry->rl_q_flags & RL_LOCK_GRANTED, ("XXX"));
> +	KASSERT(entry->rl_q_start == base, ("XXX"));
> +	KASSERT(entry->rl_q_end >= base + len, ("XXX"));
>
> I expect those XXX will be replaced with real messages in the final
> version? We could really use a simple ASSERT() without an obligatory
> message in the kernel...

We have MPASS. These XXX are fixed.

> +void *
> +rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_READ;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
> +
> +void *
> +rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_WRITE;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
>
> Those functions are nearly identical, maybe they can be converted to
> something like this:
>
> static void *
> rangelock_lock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk,
>     int flags)
> {
> 	struct rl_q_entry *entry;
> 	struct thread *td;
>
> 	td = curthread;
> 	if (td->td_rlqe != NULL) {
> 		entry = td->td_rlqe;
> 		td->td_rlqe = NULL;
> 	} else
> 		entry = rlqentry_alloc();
> 	entry->rl_q_flags = flags;
> 	entry->rl_q_start = base;
> 	entry->rl_q_end = base + len;
> 	return (rangelock_enqueue(lock, entry, ilk));
> }
>
> void *
> rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_READ));
> }
>
> void *
> rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_WRITE));
> }
>
> Or even merge rangelock_lock() with rangelock_enqueue()?

Done exactly this.

>
> @@ -199,6 +200,7 @@ thread_init(void *mem, int size, int flags)
>
>  	td->td_sleepqueue = sleepq_alloc();
>  	td->td_turnstile = turnstile_alloc();
> +	td->td_rlqe = rlqentry_alloc();
>
> What is the purpose of the td_rlqe field? From what I see a thread can
> lock more than one range. Could you elaborate on that as well?

td_rlqe is a cached rangelock entry for the typical case of a thread not
doing recursive i/o. It makes it possible to avoid memory allocation on
the i/o hot path by using the entry allocated at thread creation time.
Since thread creation already allocates many chunks of memory, and a
typical thread performs much more i/o than it undergoes creation, this
shortens the i/o path. If td_rlqe is already consumed, a new entry is
allocated at lock time.
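
(For illustration, the same one-element per-thread cache pattern in a
generic userland form. This is not the kernel code, just the idea behind
td_rlqe rendered with malloc(3) and a thread-local pointer; how the real
patch refills the cache on unlock is an assumption here and may differ.)

#include <stdlib.h>

struct entry {
	long	start;
	long	end;
	int	flags;
};

/* Plays the role of td->td_rlqe: filled once, reused on the hot path. */
static _Thread_local struct entry *cached;

static struct entry *
entry_get(void)
{
	struct entry *e;

	if (cached != NULL) {
		e = cached;			/* common case: no allocation */
		cached = NULL;
	} else
		e = malloc(sizeof(*e));		/* nested/recursive use */
	return (e);
}

static void
entry_put(struct entry *e)
{
	if (cached == NULL)
		cached = e;			/* refill the one-entry cache */
	else
		free(e);
}

int
main(void)
{
	struct entry *e;

	cached = malloc(sizeof(*cached));	/* what thread creation would do */
	e = entry_get();			/* hot path, reuses the cache */
	if (e == NULL)
		return (1);
	e->start = 0;
	e->end = 4096;
	e->flags = 0;
	entry_put(e);				/* entry goes back to the cache */
	return (0);
}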
>
> Do you assert somehow that the same thread is trying to write-lock the
> range it already read-locked?

No. With the latest patch it indeed can be done, because I started
recording range owners. For now, I added a TODO comment mentioning the
possible assert.

>
> --- a/sys/kern/subr_syscall.c
> +++ b/sys/kern/subr_syscall.c
> @@ -177,6 +177,12 @@ syscallret(struct thread *td, int error, struct syscall_args *sa __unused)
>  	KASSERT(td->td_locks == 0,
>  	    ("System call %s returning with %d locks held",
>  	    syscallname(p, sa->code), td->td_locks));
> +	KASSERT((td->td_pflags & TDP_NOFAULTING) == 0,
> +	    ("System call %s returning with pagefaults disabled",
> +	    syscallname(p, sa->code)));
> +	KASSERT((td->td_pflags & TDP_NOSLEEPING) == 0,
> +	    ("System call %s returning with sleep disabled",
> +	    syscallname(p, sa->code)));
>
> We can commit those separately, right?

Definitely. Rangelocks will be committed separately too. The vfs_vnops.c
change is self-contained.

>
>  #ifdef MAC
>  	if ((ioflg & IO_NOMACCHECK) == 0) {
> -		if (rw == UIO_READ)
> -			error = mac_vnode_check_read(active_cred, file_cred,
> -			    vp);
> -		else
> +		if (rw == UIO_WRITE)
>  			error = mac_vnode_check_write(active_cred, file_cred,
>  			    vp);
>  	}
>
> I don't see the mac_vnode_check_read() call being moved anywhere.
> Was that intended?

No, thank you for pointing this out. mdf already noted this.

>
> + * The vn_io_fault() is a wrapper around vn_read() and vn_write() to
> + * prevent the following deadlock:
> + *
> + * Assume that the thread A reads from the vnode vp1 info userspace
>
> s/info/into/
>
> + * buffer buf1 backed by the pages of vnode vp2. If a page in buf1 is
> + * currently not resident, then system ends up with the call chain
> + *   vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
> + *     vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
> + * which establishes lock order vp1->vn_lock, then vp2->vn_lock.
> + * If, at the same time, thread B reads from vnode vp2 into buffer buf2
> + * backed by the pages of vnode vp1, and some page in buf2 is not
> + * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
> + *
> + * To prevent the lock order reversal and deadlock, vn_io_fault() does
> + * not allow page faults to happen during VOP_READ or VOP_WRITE().
>
> s/VOP_READ/VOP_READ()/
>
> +	prot = VM_PROT_READ;
> +	if ((fp->f_flag & O_APPEND) || !(flags & FOF_OFFSET))
> +		/* For appenders, punt and lock the whole range. */
> +		rl_cookie = vn_rangelock_wlock(vp, 0, (size_t)-1);
>
> On 32-bit archs size_t is still 32 bits, AFAIR, and the file size can be
> bigger than that. I think we should also use off_t for the length (GEOM
> does that).

Agreed, but I have already moved to specifying (start, end) instead of
(start, len).

>
> +	len = uio->uio_iov->iov_len;
> +	addr = (vm_offset_t)uio->uio_iov->iov_base;
> +	end = round_page(addr + len);
> +	cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +	if (cnt > io_hold_cnt) {
> +		len = io_hold_cnt * PAGE_SIZE;
> +again:
> +		end = round_page(addr + len);
> +		cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +		if (cnt > io_hold_cnt) {
> +			len -= PAGE_SIZE;
> +			goto again;
> +		}
> +	}
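
(For reference, this is what the snippet above is clamping: a user buffer
of io_hold_cnt * PAGE_SIZE bytes that does not start on a page boundary
spans io_hold_cnt + 1 pages. A standalone userland rendition follows, with
the usual macro definitions spelled out; the io_hold_cnt value of 16 is
picked only for illustration.)

#include <stdio.h>

#define	PAGE_SIZE	4096UL
#define	trunc_page(x)	((x) & ~(PAGE_SIZE - 1))
#define	round_page(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define	howmany(x, y)	(((x) + ((y) - 1)) / (y))

int
main(void)
{
	unsigned long io_hold_cnt = 16;		/* illustrative value */
	unsigned long addr = 0x10000UL + 100;	/* misaligned iov_base */
	unsigned long len = io_hold_cnt * PAGE_SIZE;
	unsigned long end = round_page(addr + len);
	unsigned long cnt = howmany(end - trunc_page(addr), PAGE_SIZE);

	/* Prints 17: one page more than io_hold_cnt, hence the shaving loop. */
	printf("pages spanned: %lu, io_hold_cnt: %lu\n", cnt, io_hold_cnt);
	return (0);
}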
>
> This is a bit hard to understand immediately. Maybe something like the
> following is a bit more readable (I assume this goto could happen only
> once):
>
> 	len = MIN(uio->uio_iov->iov_len, io_hold_cnt * PAGE_SIZE);
> 	addr = (vm_offset_t)uio->uio_iov->iov_base;
> 	end = round_page(addr + len);
> 	if (howmany(end - trunc_page(addr), PAGE_SIZE) > io_hold_cnt) {
> 		len -= PAGE_SIZE;
> 		end = round_page(addr + len);
> 	}

I think it can happen twice, if both the start and the len are perfectly
mis-aligned.

>
> +vn_truncate(struct file *fp, off_t length, struct ucred *active_cred,
> +    struct thread *td)
> [...]
> +	/*
> +	 * Lock the range where the shortening takes place. An increase of
> +	 * the file size does not need a rangelock, but it is faster to lock
> +	 * the range than to call VOP_GETATTR to get the current size and
> +	 * deal with races.
> +	 */
> +	rl_cookie = vn_rangelock_wlock(vp, length, -1);
>
> Hmm. In case of increasing the file size don't we need the range lock
> because we might be racing with a write append?

We need to prevent the split i/o from being performed on disjoint ranges
of the file if a truncation happens between the splits. I changed the
rangelock obtained in vn_truncate to cover the whole range. The bug
appeared because I imported the rangelock from the bigger patch, which
indeed handled the expansions. Thank you for pointing this out.

>
> BTW. Comments in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c
> might be a useful read.

Thank you for the review.

http://people.freebsd.org/~kib/misc/vm1.8.patch