Date: Sun, 26 Feb 2012 15:02:54 +0100 From: Attilio Rao <attilio@freebsd.org> To: Konstantin Belousov <kostikbel@gmail.com> Cc: arch@freebsd.org, Florian Smeets <flo@freebsd.org>, Pawel Jakub Dawidek <pjd@freebsd.org> Subject: Re: Prefaulting for i/o buffers Message-ID: <CAJ-FndDZpDXqDRR=kT_eQcHbeg3vdiUjnygy1=QLvVuumUsgBw@mail.gmail.com> In-Reply-To: <20120225210339.GM55074@deviant.kiev.zoral.com.ua> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com> <20120225151334.GH1344@garage.freebsd.pl> <CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com> <20120225210339.GM55074@deviant.kiev.zoral.com.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
Il 25 febbraio 2012 22:03, Konstantin Belousov <kostikbel@gmail.com> ha scritto: > On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote: >> Il 25 febbraio 2012 16:13, Pawel Jakub Dawidek <pjd@freebsd.org> ha scritto: >> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote: >> >> Il 03 febbraio 2012 19:37, Konstantin Belousov <kostikbel@gmail.com> ha scritto: >> >> > FreeBSD I/O infrastructure has well known issue with deadlock caused >> >> > by vnode lock order reversal when buffers supplied to read(2) or >> >> > write(2) syscalls are backed by mmaped file. >> >> > >> >> > I previously published the patches to convert i/o path to use VMIO, >> >> > based on the Jeff Roberson proposal, see >> >> > http://wiki.freebsd.org/VM6. As a side effect, the VM6 fixed the >> >> > deadlock. Since that work is very intrusive and did not got any >> >> > follow-up, it get stalled. >> >> > >> >> > Below is very lightweight patch which only goal is to fix deadlock in >> >> > the least intrusive way. This is possible after FreeBSD got the >> >> > vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9) KPIs. >> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch >> >> >> >> Hi, >> >> I was reviewing: >> >> http://people.freebsd.org/~kib/misc/vm1.11.patch >> >> >> >> and I think it is great. It is simple enough and I don't have further >> >> comments on it. > Thank you. > > This spoiled an announce I intended to send this weekend :) > >> >> >> >> However, as a side note, I was thinking if we could get one day at the >> >> point to integrate rangelocks into vnodes lockmgr directly. >> >> It would be a huge patch, rewrtiting the locking of several members of >> >> vnodes likely, but I think it would be worth it in terms of cleaness >> >> of the interface and less overhead. Also, it would be interesting to >> >> consider merging rangelock implementation in ZFS' one, at some point. >> > >> > I personal opinion about rangelocks and many other VFS features we >> > currently have is that it is good idea in theory, but in practise it >> > tends to overcomplicate VFS. >> > >> > I'm in opinion that we should move as much stuff as we can to individual >> > file systems. We try to implement everything in VFS itself in hope that >> > this will simplify file systems we have. It then turns out only one file >> > system is really using this stuff (most of the time it is UFS) and this >> > is PITA for all the other file systems as well as maintaining VFS. VFS >> > became so complicated over the years that there are maybe few people >> > that can understand it, and every single change to VFS is a huge risk of >> > potentially breaking some unrelated parts. >> >> I think this is questionable due to the following assets: >> - If the problem is filesystems writers having trouble in >> understanding the necessary locking we should really provide cleaner >> and more complete documentation. One would think the same with our VM >> subsystem, but at least in that case there is plenty of comments that >> help understanding how to deal with vm_object, vm_pages locking during >> their lifelines. >> - Our primitives may be more complicated than the >> 'all-in-the-filesystem' one, but at least they offer a complete and >> centralized view over the resources we have allocated in the whole >> system and they allow building better policies about how to manage >> them. One problem I see here, is that those policies are not fully >> implemented, tuned or just got outdated, removing one of the highest >> beneficial that we have by making vnodes so generic >> >> About the thing I mentioned myself: >> - As long as the same path now has both range-locking and vnode >> locking I don't see as a good idea to keep both separated forever. >> Merging them seems to me an important evolution (not only helping >> shrinking the number of primitives themselves but also introducing >> less overhead and likely rewamped scalability for vnodes (but I think >> this needs a deep investigation). > The proper direction to move there is to designate the vnode lock for > the vnode structure protection, and have the range lock protect the > i/o atomicity. This is somewhat done in the proposed patch (since > now vnode lock does not protect the i/o operation, but only chunked > i/o transactions inside the operation). > > The Jeff idea of using page cache as the source of i/o data (implemented > in the VM6 patchset) pushes the idea much further. E.g., the write > does not obtain the write vnode lock typically (but sometimes it had, > to extend the vnode). > > Probably, I will revive VM6 after this change is landed. About that I guess we might be careful. The first thing would be having a very scalable VM subsystem and recent benchmarks have shown that this is not yet the case (Florian, CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, also with the vmcontention patch, shows a lot on contention on vm_object, pmap lock and vm_page_queue_lock. We have some plans for every of them, we will discuss on a separate thread if you prefer). This is just to say, that we may need more work in underground areas to bring VM6 to the point it will really make a difference. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ-FndDZpDXqDRR=kT_eQcHbeg3vdiUjnygy1=QLvVuumUsgBw>