Date: Sun, 5 Feb 2012 18:47:29 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Pawel Jakub Dawidek <pjd@freebsd.org>
Cc: arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
Message-ID: <20120205164729.GH3283@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120205114345.GE30033@garage.freebsd.pl>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120205114345.GE30033@garage.freebsd.pl>
On Sun, Feb 05, 2012 at 12:43:46PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Feb 03, 2012 at 09:37:19PM +0200, Konstantin Belousov wrote:
> > The FreeBSD I/O infrastructure has a well-known issue with a deadlock
> > caused by vnode lock order reversal when buffers supplied to read(2)
> > or write(2) syscalls are backed by a mmaped file.
> >
> > I previously published the patches to convert the i/o path to use
> > VMIO, based on the Jeff Roberson proposal, see
> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> > deadlock. Since that work is very intrusive and did not get any
> > follow-up, it stalled.
> >
> > Below is a very lightweight patch whose only goal is to fix the
> > deadlock in the least intrusive way. This is possible after FreeBSD
> > got the vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9)
> > KPIs.
> > http://people.freebsd.org/~kib/misc/vm1.3.patch
>
> +struct rl_q_entry *
> +rlqentry_alloc()
>
> Missing 'void'.

Yes, it is a style inconsistency. Fixed.

>
> +static int
> +rangelock_incompatible(const struct rl_q_entry *e1,
> +    const struct rl_q_entry *e2)
> +{
> +
> +	if ((e1->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ &&
> +	    (e2->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ)
> +		return (0);
> +#define IN_RANGE(a, e) (a >= e->rl_q_start && a < e->rl_q_end)
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
> +		return (1);
> +#undef IN_RANGE
> +	return (0);
> +}
>
> The checks here are a bit wrong and incomplete.
>
> This is correct:
>
> +	if (IN_RANGE(e1->rl_q_start, e2) || IN_RANGE(e2->rl_q_start, e1) ||
>
> This is not:
>
> +	    IN_RANGE(e1->rl_q_end, e2) || IN_RANGE(e2->rl_q_end, e1))
>
> After simplifying one of those checks we get:
>
> 	if (end1 >= start2 && end1 < end2)
>
> This will be true if end1 == start2, but in this case it should be
> false.

Agreed, my tests are too strict there. But note that the only consequence
is that some lock requests are postponed.

>
> There are also some checks missing. If the first range covers the entire
> second range or vice versa, you will return that the locks are
> compatible, e.g.
>
> 	start1-----start2-----end2-----end1

Doesn't IN_RANGE(e2->rl_q_start, e1) cover this case? In essence, the
checks I wrote verify whether any one region border lies inside the
other region.

>
> The following check should cover all the cases:
>
> 	if (e1->rl_q_start < e2->rl_q_end &&
> 	    e1->rl_q_end > e2->rl_q_start) {
> 		return (1);
> 	}

Your check is much more elegant, I changed the code as you suggested.

>
> +static void
> +rangelock_calc_block(struct rangelock *lock)
> +{
> +	struct rl_q_entry *entry, *entry1, *whead;
> +
> +	if (lock->rl_currdep == TAILQ_FIRST(&lock->rl_waiters) &&
> +	    lock->rl_currdep != NULL)
> +		lock->rl_currdep = TAILQ_NEXT(lock->rl_currdep, rl_q_link);
> +	for (entry = lock->rl_currdep; entry;
> +	     entry = TAILQ_NEXT(entry, rl_q_link)) {
> +		TAILQ_FOREACH(entry1, &lock->rl_waiters, rl_q_link) {
> +			if (rangelock_incompatible(entry, entry1))
> +				goto out;
> +			if (entry1 == entry)
> +				break;
> +		}
> +	}
> +out:
> +	lock->rl_currdep = entry;
> +	TAILQ_FOREACH(whead, &lock->rl_waiters, rl_q_link) {
> +		if (whead == lock->rl_currdep)
> +			break;
> +		if (!(whead->rl_q_flags & RL_LOCK_GRANTED)) {
> +			whead->rl_q_flags |= RL_LOCK_GRANTED;
> +			wakeup(whead);
> +		}
> +	}
> +}
>
> This function doesn't look like it is going to scale. Searching for
> incompatible locks looks very expensive.

Well, it is not.
I am aware of the tree structure used for the range lock, but this
probably has to wait some time.

>
> This function seems to be responsible for waking up waiters. In case of
> range locking this might be tricky, as unlocking a read lock on the
> given range doesn't mean the range can be exclusively locked by a
> waiter, as part of this range might be read-locked multiple times, or a
> different range might be read-locked which is also covered by the
> waiting writer.
>
> Could you elaborate a bit on how this all works together? Does the
> function look through all the waiters and for each waiter go through
> all the active locks?
>
> Do you protect somehow against starving writers?

Due to a different review, I already added comments to the rangelock
implementation. Basically, the rangelocks in kern_rangelock.c are fully
fair. The locks are granted in the order of waiter arrival (the order is
defined as the order of acquiring the interlock; in the latest version it
is a sleep channel spinlock). Freeing a range causes a rescan of the
queue of waiters starting from rl_currdep. The rl_currdep points to the
first blocked waiter. If the next waiter is compatible with all granted
regions, it is moved before rl_currdep and woken up.

>
> +	KASSERT(entry->rl_q_flags & RL_LOCK_GRANTED, ("XXX"));
> +	KASSERT(entry->rl_q_start == base, ("XXX"));
> +	KASSERT(entry->rl_q_end >= base + len, ("XXX"));
>
> I expect those XXX will be replaced with real messages in the final
> version? We could really use simple ASSERT() without obligatory message
> in the kernel...

We have MPASS. These XXX are fixed.

> +void *
> +rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_READ;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
> +
> +void *
> +rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> +{
> +	struct rl_q_entry *entry;
> +	struct thread *td;
> +
> +	td = curthread;
> +	if (td->td_rlqe != NULL) {
> +		entry = td->td_rlqe;
> +		td->td_rlqe = NULL;
> +	} else
> +		entry = rlqentry_alloc();
> +	entry->rl_q_flags = RL_LOCK_WRITE;
> +	entry->rl_q_start = base;
> +	entry->rl_q_end = base + len;
> +	return (rangelock_enqueue(lock, entry, ilk));
> +}
>
> Those functions are nearly identical, maybe they can be converted to
> something like this:
>
> static void *
> rangelock_lock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk,
>     int flags)
> {
> 	struct rl_q_entry *entry;
> 	struct thread *td;
>
> 	td = curthread;
> 	if (td->td_rlqe != NULL) {
> 		entry = td->td_rlqe;
> 		td->td_rlqe = NULL;
> 	} else
> 		entry = rlqentry_alloc();
> 	entry->rl_q_flags = flags;
> 	entry->rl_q_start = base;
> 	entry->rl_q_end = base + len;
> 	return (rangelock_enqueue(lock, entry, ilk));
> }
>
> void *
> rangelock_rlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_READ));
> }
>
> void *
> rangelock_wlock(struct rangelock *lock, off_t base, size_t len, struct mtx *ilk)
> {
>
> 	return (rangelock_lock(lock, base, len, ilk, RL_LOCK_WRITE));
> }
>
> Or even merge rangelock_lock() with rangelock_enqueue()?

Done exactly this.
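
For readers following the thread, here is a minimal userland sketch of
the overlap test agreed on above. The struct layout, the function name
ranges_overlap(), and the test values are assumptions made for
illustration only, and the read/read compatibility short-circuit from
the patch is omitted.

#include <assert.h>
#include <sys/types.h>

/* Simplified stand-in for the patch's entry type; only the range fields. */
struct rl_q_entry {
	off_t	rl_q_start;
	off_t	rl_q_end;	/* exclusive end of the locked range */
};

/*
 * Two ranges conflict iff they overlap; ranges that merely touch
 * (end1 == start2) are compatible.
 */
static int
ranges_overlap(const struct rl_q_entry *e1, const struct rl_q_entry *e2)
{

	return (e1->rl_q_start < e2->rl_q_end &&
	    e1->rl_q_end > e2->rl_q_start);
}

int
main(void)
{
	struct rl_q_entry a = { 0, 100 }, b = { 100, 200 }, c = { 50, 60 };

	assert(!ranges_overlap(&a, &b));	/* adjacent ranges do not conflict */
	assert(ranges_overlap(&a, &c));		/* containment is detected */
	assert(ranges_overlap(&c, &a));		/* and the test is symmetric */
	return (0);
}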
>
> @@ -199,6 +200,7 @@ thread_init(void *mem, int size, int flags)
>
>  	td->td_sleepqueue = sleepq_alloc();
>  	td->td_turnstile = turnstile_alloc();
> +	td->td_rlqe = rlqentry_alloc();
>
> What is the purpose of the td_rlqe field? From what I see a thread can
> lock more than one range. Could you elaborate on that as well?

td_rlqe is a cached rangelock entry used for the typical case of a thread
not doing recursive i/o. In this case, it is possible to avoid memory
allocation on the i/o hot path by using the entry allocated during thread
creation. Since thread creation already allocates many chunks of memory,
and a typical thread performs i/o much more often than it is created,
this shortens the i/o path. If td_rlqe is already consumed, a new entry
is allocated at lock time.

>
> Do you assert somehow that the same thread is trying to write-lock the
> range it already read-locked?

No. With the latest patch it indeed can be done, because I started to
record range owners. For now, I added a TODO comment mentioning the
possible assert.

>
> --- a/sys/kern/subr_syscall.c
> +++ b/sys/kern/subr_syscall.c
> @@ -177,6 +177,12 @@ syscallret(struct thread *td, int error, struct syscall_args *sa __unused)
>  	KASSERT(td->td_locks == 0,
>  	    ("System call %s returning with %d locks held",
>  	    syscallname(p, sa->code), td->td_locks));
> +	KASSERT((td->td_pflags & TDP_NOFAULTING) == 0,
> +	    ("System call %s returning with pagefaults disabled",
> +	    syscallname(p, sa->code)));
> +	KASSERT((td->td_pflags & TDP_NOSLEEPING) == 0,
> +	    ("System call %s returning with sleep disabled",
> +	    syscallname(p, sa->code)));
>
> We can commit those separately, right?

Definitely. Rangelocks will be committed separately too. The vfs_vnops.c
change is self-contained.

>
>  #ifdef MAC
>  	if ((ioflg & IO_NOMACCHECK) == 0) {
> -		if (rw == UIO_READ)
> -			error = mac_vnode_check_read(active_cred, file_cred,
> -			    vp);
> -		else
> +		if (rw == UIO_WRITE)
>  			error = mac_vnode_check_write(active_cred, file_cred,
>  			    vp);
>  	}
>
> I don't see the mac_vnode_check_read() call being moved anywhere.
> Was that intended?

No, thank you for pointing this out. mdf already noted this.

>
> + * The vn_io_fault() is a wrapper around vn_read() and vn_write() to
> + * prevent the following deadlock:
> + *
> + * Assume that the thread A reads from the vnode vp1 info userspace
>
> s/info/into/
>
> + * buffer buf1 backed by the pages of vnode vp2. If a page in buf1 is
> + * currently not resident, then system ends up with the call chain
> + *   vn_read() -> VOP_READ(vp1) -> uiomove() -> [Page Fault] ->
> + *   vm_fault(buf1) -> vnode_pager_getpages(vp2) -> VOP_GETPAGES(vp2)
> + * which establishes lock order vp1->vn_lock, then vp2->vn_lock.
> + * If, at the same time, thread B reads from vnode vp2 into buffer buf2
> + * backed by the pages of vnode vp1, and some page in buf2 is not
> + * resident, we get a reversed order vp2->vn_lock, then vp1->vn_lock.
> + *
> + * To prevent the lock order reversal and deadlock, vn_io_fault() does
> + * not allow page faults to happen during VOP_READ or VOP_WRITE().
>
> s/VOP_READ/VOP_READ()/
>
> +	prot = VM_PROT_READ;
> +	if ((fp->f_flag & O_APPEND) || !(flags & FOF_OFFSET))
> +		/* For appenders, punt and lock the whole range. */
> +		rl_cookie = vn_rangelock_wlock(vp, 0, (size_t)-1);
>
> On 32bit archs size_t is still 32bit, AFAIR, and the file size can be
> bigger than that. I think we should also use off_t for length (GEOM
> does that).

Agreed, but I moved to specifying (start, end) instead of (start, len)
already.
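
A rough userland model of the td_rlqe fast path discussed earlier in
this reply may help make it concrete. The names rlqentry_get(),
rlqentry_put(), and struct thread_model are invented here for
illustration, and the detail of refilling the per-thread cache on unlock
is an assumption rather than something shown in the quoted patch.

#include <stdlib.h>
#include <sys/types.h>

struct rl_q_entry {
	off_t	rl_q_start;
	off_t	rl_q_end;
	int	rl_q_flags;
};

/* Userland stand-in for struct thread; only the cached entry matters here. */
struct thread_model {
	struct rl_q_entry *td_rlqe;	/* preallocated at "thread creation" */
};

static struct rl_q_entry *
rlqentry_get(struct thread_model *td)
{
	struct rl_q_entry *entry;

	if (td->td_rlqe != NULL) {
		/* Typical case: reuse the cached entry, no allocation on the i/o path. */
		entry = td->td_rlqe;
		td->td_rlqe = NULL;
	} else {
		/* Recursive i/o already consumed the cache; fall back to allocation. */
		entry = calloc(1, sizeof(*entry));
	}
	return (entry);
}

static void
rlqentry_put(struct thread_model *td, struct rl_q_entry *entry)
{

	if (td->td_rlqe == NULL)
		td->td_rlqe = entry;	/* refill the cache for the next i/o */
	else
		free(entry);
}

int
main(void)
{
	struct rl_q_entry cached = { 0, 0, 0 };
	struct thread_model td = { &cached };
	struct rl_q_entry *e1, *e2;

	e1 = rlqentry_get(&td);		/* takes the cached entry */
	e2 = rlqentry_get(&td);		/* "recursive" i/o: falls back to calloc() */
	rlqentry_put(&td, e1);		/* refills the cache */
	rlqentry_put(&td, e2);		/* cache already full, entry is freed */
	return (0);
}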
>
> +	len = uio->uio_iov->iov_len;
> +	addr = (vm_offset_t)uio->uio_iov->iov_base;
> +	end = round_page(addr + len);
> +	cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +	if (cnt > io_hold_cnt) {
> +		len = io_hold_cnt * PAGE_SIZE;
> +again:
> +		end = round_page(addr + len);
> +		cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
> +		if (cnt > io_hold_cnt) {
> +			len -= PAGE_SIZE;
> +			goto again;
> +		}
> +	}
>
> This is a bit hard to understand immediately. Maybe something like the
> following is a bit more readable (I assume this goto could happen only
> once):
>
> 	len = MIN(uio->uio_iov->iov_len, io_hold_cnt * PAGE_SIZE);
> 	addr = (vm_offset_t)uio->uio_iov->iov_base;
> 	end = round_page(addr + len);
> 	if (howmany(end - trunc_page(addr), PAGE_SIZE) > io_hold_cnt) {
> 		len -= PAGE_SIZE;
> 		end = round_page(addr + len);
> 	}

I think it can happen twice, if both the start and the length are
perfectly misaligned.

>
> +vn_truncate(struct file *fp, off_t length, struct ucred *active_cred,
> +    struct thread *td)
> [...]
> +	/*
> +	 * Lock the range where the shortening take place. Increase of
> +	 * file size does not need rangelock, but it is faster to lock
> +	 * the range then call VOP_GETATTR to get the current size and
> +	 * deal with races.
> +	 */
> +	rl_cookie = vn_rangelock_wlock(vp, length, -1);
>
> Hmm. In case of increasing the file size don't we need the range lock
> because we might be racing with a write append?

We need to exclude the split i/o from being performed on disjoint ranges
of the file if a truncation happens between the split parts. I changed
the rangelock obtained in vn_truncate to cover the whole range. The bug
appeared because I imported the rangelock from the bigger patch, which
indeed handled the expansions. Thank you for pointing this out.

>
> BTW. Comments in sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c
> might be a useful read.

Thank you for the review.

http://people.freebsd.org/~kib/misc/vm1.8.patch
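
For what it is worth, a small standalone program modelling the clamping
loop quoted above makes the misaligned case easier to see. The
io_hold_cnt value, the example address, and the userland versions of the
page macros are assumptions for illustration, not values or code taken
from the patch.

#include <stdio.h>

#define	PAGE_SIZE	4096UL
#define	trunc_page(x)	((x) & ~(PAGE_SIZE - 1))
#define	round_page(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define	howmany(x, y)	(((x) + ((y) - 1)) / (y))

int
main(void)
{
	unsigned long io_hold_cnt = 16;		/* arbitrary cap, in pages */
	unsigned long addr = 0x10123;		/* misaligned "user" address */
	unsigned long len = 1024 * 1024;	/* iov_len larger than the cap */
	unsigned long end, cnt;

	end = round_page(addr + len);
	cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
	if (cnt > io_hold_cnt) {
		len = io_hold_cnt * PAGE_SIZE;
again:
		end = round_page(addr + len);
		cnt = howmany(end - trunc_page(addr), PAGE_SIZE);
		if (cnt > io_hold_cnt) {
			/* Misalignment adds a page to the span; shrink and retry. */
			len -= PAGE_SIZE;
			goto again;
		}
	}
	printf("clamped len = %lu bytes (%lu pages spanned)\n",
	    len, howmany(end - trunc_page(addr), PAGE_SIZE));
	return (0);
}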
