From: Attilio Rao
To: Konstantin Belousov
Cc: arch@freebsd.org, Florian Smeets, Pawel Jakub Dawidek
Date: Sun, 26 Feb 2012 15:02:54 +0100
Subject: Re: Prefaulting for i/o buffers

On 25 February 2012 22:03, Konstantin Belousov wrote:
> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>> On 25 February 2012 16:13, Pawel Jakub Dawidek wrote:
>> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>> >> On 3 February 2012 19:37, Konstantin Belousov wrote:
>> >> > The FreeBSD I/O infrastructure has a well-known deadlock, caused
>> >> > by a vnode lock order reversal when the buffers supplied to the
>> >> > read(2) or write(2) syscalls are backed by an mmaped file.
>> >> >
>> >> > I previously published patches to convert the i/o path to VMIO,
>> >> > based on the Jeff Roberson proposal, see
>> >> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
>> >> > deadlock. Since that work is very intrusive and did not get any
>> >> > follow-up, it stalled.
>> >> >
>> >> > Below is a very lightweight patch whose only goal is to fix the
>> >> > deadlock in the least intrusive way. This became possible after
>> >> > FreeBSD got the vm_fault_quick_hold_pages(9) and
>> >> > vm_fault_disable_pagefaults(9) KPIs.
>> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch
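(For list readers who have not opened the patch: the fix has roughly the
following shape. This is only a sketch; the vm_fault_*() and
vm_page_unhold_pages() KPIs are the ones named above, while IO_HOLD_CNT
and the locked_uiomove() helper are illustrative names, not code taken
from the patch.)

/*
 * Sketch only: hold the pages backing the user buffer *before* taking
 * the vnode lock, then forbid page faults while the lock is held, so a
 * fault on an mmaped file can never recurse into the vnode lock.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/proc.h>
#include <sys/uio.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/pmap.h>
#include <vm/vm_map.h>
#include <vm/vm_page.h>
#include <vm/vm_extern.h>

#define	IO_HOLD_CNT	16	/* pages held per transaction (illustrative) */

/* Hypothetical helper: copies one chunk and advances the uio. */
static int locked_uiomove(struct vnode *, struct uio *, vm_page_t *, int);

static int
read_prefaulted(struct vnode *vp, struct uio *uio)
{
	vm_page_t ma[IO_HOLD_CNT + 1];	/* +1 for an unaligned start */
	vm_offset_t addr;
	vm_size_t len;
	int cnt, error, save;

	error = 0;
	while (error == 0 && uio->uio_resid > 0) {
		addr = (vm_offset_t)uio->uio_iov->iov_base;
		len = MIN(uio->uio_iov->iov_len, IO_HOLD_CNT * PAGE_SIZE);

		/*
		 * This may fault and sleep, but the vnode lock is not
		 * held yet.  VM_PROT_WRITE because read(2) stores into
		 * the user buffer.
		 */
		cnt = vm_fault_quick_hold_pages(
		    &curthread->td_proc->p_vmspace->vm_map,
		    addr, len, VM_PROT_WRITE, ma, IO_HOLD_CNT + 1);
		if (cnt == -1)
			return (EFAULT);

		/* Faults are now forbidden until the lock is dropped. */
		save = vm_fault_disable_pagefaults();
		vn_lock(vp, LK_SHARED | LK_RETRY);
		error = locked_uiomove(vp, uio, ma, cnt);
		VOP_UNLOCK(vp, 0);
		vm_fault_enable_pagefaults(save);

		vm_page_unhold_pages(ma, cnt);
	}
	return (error);
}

The key point is that vm_fault_quick_hold_pages() runs before
vn_lock(), so the fault path never executes with the vnode lock held
and the reversal disappears.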
>> >> Hi,
>> >> I was reviewing:
>> >> http://people.freebsd.org/~kib/misc/vm1.11.patch
>> >>
>> >> and I think it is great. It is simple enough and I don't have
>> >> further comments on it.
> Thank you.
>
> This spoiled an announcement I intended to send this weekend :)
>
>> >> However, as a side note, I was wondering whether we could one day
>> >> get to the point of integrating rangelocks into the vnode lockmgr
>> >> directly. It would be a huge patch, likely rewriting the locking of
>> >> several vnode members, but I think it would be worth it in terms of
>> >> cleanliness of the interface and lower overhead. Also, it would be
>> >> interesting to consider merging our rangelock implementation with
>> >> ZFS's at some point.
>> >
>> > My personal opinion about rangelocks, and many other VFS features we
>> > currently have, is that they are a good idea in theory, but in
>> > practice they tend to overcomplicate VFS.
>> >
>> > I'm of the opinion that we should move as much stuff as we can to
>> > the individual file systems. We try to implement everything in VFS
>> > itself in the hope that this will simplify the file systems we have.
>> > It then turns out that only one file system really uses the stuff
>> > (most of the time it is UFS), and it is a PITA for all the other
>> > file systems, as well as for maintaining VFS. VFS has become so
>> > complicated over the years that maybe only a few people can
>> > understand it, and every single change to VFS carries a huge risk of
>> > breaking some unrelated part.
>>
>> I think this is questionable, for the following reasons:
>> - If the problem is filesystem writers having trouble understanding
>> the necessary locking, we should really provide cleaner and more
>> complete documentation. One could say the same about our VM subsystem,
>> but at least there we have plenty of comments that help in
>> understanding how to deal with vm_object and vm_page locking over
>> their lifetimes.
>> - Our primitives may be more complicated than the
>> 'all-in-the-filesystem' approach, but at least they offer a complete
>> and centralized view of the resources allocated in the whole system,
>> and they allow building better policies for managing them. One problem
>> I see here is that those policies are not fully implemented, not well
>> tuned, or simply outdated, which takes away one of the biggest
>> benefits of making vnodes so generic.
>>
>> About the idea I mentioned myself:
>> - As long as the same path now has both range-locking and vnode
>> locking, I don't think it is a good idea to keep the two separated
>> forever. Merging them seems to me an important evolution: not only
>> would it shrink the number of primitives, it would also introduce less
>> overhead and likely revamp vnode scalability (though I think this
>> needs deeper investigation).
> The proper direction to move in is to designate the vnode lock for
> protecting the vnode structure, and have the range lock protect i/o
> atomicity. This is somewhat done in the proposed patch (the vnode lock
> no longer protects the whole i/o operation, only the chunked i/o
> transactions inside it).
>
> Jeff's idea of using the page cache as the source of i/o data
> (implemented in the VM6 patchset) pushes this much further. E.g., a
> write typically does not obtain the vnode write lock (though sometimes
> it must, to extend the vnode).
>
> Probably, I will revive VM6 after this change is landed.

About that, I think we should be careful. The first prerequisite is a
truly scalable VM subsystem, and recent benchmarks have shown that this
is not yet the case: Florian (CC'ed) can share some pmc/LOCK_PROFILING
analysis of pgsql which, even with the vmcontention patch, shows a lot
of contention on the vm_object lock, the pmap lock and
vm_page_queue_lock. We have plans for each of them; we can discuss them
in a separate thread if you prefer.
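To make the vnode-lock/range-lock split you describe concrete, my
understanding of the intended shape is the sketch below. The
vn_rangelock_*() names follow the patch under review and may differ in
what is finally committed; write_chunk() is a hypothetical helper for
one bounded transfer.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/uio.h>
#include <sys/vnode.h>

/* Hypothetical helper: transfers one chunk and advances the uio. */
static int write_chunk(struct vnode *, struct uio *, int);

static int
atomic_write(struct vnode *vp, struct uio *uio, int ioflag)
{
	void *cookie;
	int error;

	/* Exclude overlapping i/o for the span of the whole request. */
	cookie = vn_rangelock_wlock(vp, uio->uio_offset,
	    uio->uio_offset + uio->uio_resid);

	error = 0;
	while (error == 0 && uio->uio_resid > 0) {
		/* The vnode lock covers only one short transaction. */
		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
		error = write_chunk(vp, uio, ioflag);
		VOP_UNLOCK(vp, 0);
	}

	vn_rangelock_unlock(vp, cookie);
	return (error);
}

With this arrangement the range lock provides the atomicity guarantee
for the whole request, while threads touching disjoint ranges of the
same vnode only ever contend on the short, chunk-sized vnode lock hold
times.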
All this is just to say that we may need more work in these underlying
areas to bring VM6 to the point where it will really make a difference.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein