From owner-freebsd-hackers@FreeBSD.ORG Wed Sep 10 06:10:28 2014
Message-ID: <540FEB43.5030704@rice.edu>
Date: Wed, 10 Sep 2014 01:10:11 -0500
From: Alan Cox
To: Konstantin Belousov
Cc: alc@freebsd.org, Pieter de Goeje, hackers@freebsd.org
Subject: Re: mmap MAP_NOSYNC regression in 10.x
In-Reply-To: <20140909083847.GB2737@kib.kiev.ua>
References: <540903FF.6010602@degoeje.nl>
 <20140905080633.GM2737@kib.kiev.ua> <5409A4F8.6020204@degoeje.nl>
 <20140905123826.GP2737@kib.kiev.ua> <540BA416.7010106@rice.edu>
 <20140908084126.GX2737@kib.kiev.ua> <540DEE8F.5080005@rice.edu>
 <20140909083847.GB2737@kib.kiev.ua>
List-Id: Technical Discussions relating to FreeBSD

On 09/09/2014 03:38, Konstantin Belousov wrote:
> On Mon, Sep 08, 2014 at 12:59:43PM -0500, Alan Cox wrote:
>> On 09/08/2014 03:41, Konstantin Belousov wrote:
>>> On Sat, Sep 06, 2014 at 07:17:26PM -0500, Alan Cox wrote:
>>>> On 09/05/2014 07:38, Konstantin Belousov wrote:
>>>>> On Fri, Sep 05, 2014 at 01:56:40PM +0200, Pieter de Goeje wrote:
>>>>>> Thanks, works for me!
>>>>> I realized that the patch contains yet another bug.  The oflags page
>>>>> flags update is protected by the exclusive vm object lock, which is
>>>>> only held in shared mode on the fast path.
>>>>> Below is the fixed patch, where I take the page lock around setting
>>>>> VPO_NOSYNC (exclusive lock owners cannot race with the fast path,
>>>>> since we own the shared lock, and two parallel fast path executions
>>>>> are handled by the page lock).
>>>> Suppose that the page is clean and two threads are executing this code
>>>> concurrently.  One's map entry has MAP_NOSYNC set, and the other's
>>>> doesn't.  Let's call these threads NOSYNC and SYNC, respectively.
>>>>
>>>> Suppose that the thread SYNC is slightly ahead.  It has already
>>>> performed "m->oflags &= ~VPO_NOSYNC;" and now it's about to perform
>>>> "vm_page_dirty(fs.m);".  However, just before the thread SYNC calls
>>>> vm_page_dirty(), the thread NOSYNC evaluates "m->dirty == 0", which is
>>>> still true, and it performs "m->oflags |= VPO_NOSYNC;"
>>>>
>>>> This can't happen on the slow path.  That is, a fault by a thread
>>>> without MAP_NOSYNC set on its map entry will reliably clear VPO_NOSYNC.
>>> As I understand things, it is indeed not possible on the slow path, due
>>> to PG_RW only being set from pmap_enter(), am I right?  I.e., this is
>>> another place where the rule 'no PG_RW without PG_M' is important.
>>
>> Yes, it's not possible, but I'm a little confused by the rest of your
>> question, specifically, the statement "no PG_RW without PG_M".  Did you
>> actually mean "no PG_M without PG_RW"?
> No, I meant what I wrote, and I was wrong.
>
>>
>>> Let me formulate my question another way: what are the guarantees we
>>> provide to applications when the same page is mapped with and without
>>> MAP_NOSYNC simultaneously?  Is it contractually guaranteed that any
>>> write through a !MAP_NOSYNC entry triggers a write-back within the
>>> syncer's activity period?
>>
>> Yes, that is the intent.  However, I can think of at least one case
>> where the existing code doesn't work as intended.  Suppose that the
>> first fault on a !MAP_NOSYNC entry is triggered by a read access.  Then,
>> vm_fault() won't call vm_page_dirty(), but it will nonetheless install a
>> mapping in the pmap that allows write access.  Now, suppose this same
>> process writes to the page.  Finally, suppose that the second fault
>> happens on a MAP_NOSYNC entry.  That fault will see a clean page, i.e.,
>> m->dirty == 0, and set VPO_NOSYNC on the page, even though the first
>> faulting process that wants the page sync'ed has dirtied the page.
> Yes, you are right.  I convinced myself that this cannot happen, due to
> the false assumption above, and the fact that page flushing removes the
> write bit from the pte.

There is another case that I've wondered about and not had the time to
check out.  What happens when a page is mapped by an entry with
MAP_NOSYNC and a write(2) is performed?  Is the page flushed or not?

>>
>>>> The best course of action may be to fall back to the slow path if you
>>>> actually need to change VPO_NOSYNC's state.  Usually, you won't need
>>>> to.
>>>>
>>> Let me first try to improve the original patch to handle
>>> MAP_ENTRY_NOSYNC on the fast path as well.  It seems to be one of the
>>> cases where parallel faults are actually useful.
>>
>> I think it may be time to take a step back, decide what semantics we
>> really want, and see if there is a better way of implementing those
>> semantics.  The current approach based on toggling VPO_NOSYNC only
>> really works for the simplest cases.
> Still, I believe that the current form of the patch completely repeats
> the existing semantics of the slow path on the fast path.
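(As an aside: to make the read-fault case I described above concrete,
here is a minimal userland sketch of the same file page mapped once with
MAP_NOSYNC and once without.  The file name is made up and error
handling is omitted; the comments describe the suspected sequence, not a
reliable reproducer, since the outcome depends on how the kernel handles
each fault.)

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("/tmp/nosync-demo", O_RDWR | O_CREAT, 0644);
	char *sync_map, *nosync_map;
	volatile char c;

	ftruncate(fd, 4096);

	/* This mapping expects normal write-back by the syncer. */
	sync_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	/* This mapping asks the syncer to skip its dirty pages. */
	nosync_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_NOSYNC, fd, 0);

	c = sync_map[0];	/* 1: read fault on the !NOSYNC entry */
	(void)c;
	sync_map[0] = 1;	/* 2: if the read fault installed a
				 * writable pte, this write takes no new
				 * fault and m->dirty can stay 0 */
	nosync_map[0] = 2;	/* 3: this fault may then see a "clean"
				 * page and set VPO_NOSYNC, although the
				 * !NOSYNC mapping dirtied the page */

	munmap(sync_map, 4096);
	munmap(nosync_map, 4096);
	close(fd);
	return (0);
}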
>
> I may propose avoiding putting vm_page_dirty() inside the scope of the
> page lock in vm_fault_dirty(), mostly to simplify the logic, not to
> provide any useful optimization.

I don't think that you can do that.  It leads to the race that I
described in my first email.

>>
>>> One more note: the previous patch handled the m->oflags inconsistency
>>> when setting VPO_NOSYNC, but missed the clear one line later.
>>> I think that extending the page lock to also cover vm_page_dirty()
>>> would fix both the race you described and the second manipulation of
>>> oflags.
>>>
>>> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
>>> index 30b0456..944b479 100644
>>> --- a/sys/vm/vm_fault.c
>>> +++ b/sys/vm/vm_fault.c
>>> @@ -174,6 +174,70 @@ unlock_and_deallocate(struct faultstate *fs)
>>>  	}
>>>  }
>>>  
>>> +static void
>>> +vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot,
>>> +    vm_prot_t fault_type, int fault_flags, boolean_t set_wd)
>>> +{
>>> +	boolean_t need_dirty;
>>> +
>>> +	if (((prot & VM_PROT_WRITE) == 0 &&
>>> +	    (fault_flags & VM_FAULT_DIRTY) == 0) ||
>>> +	    (m->oflags & VPO_UNMANAGED) != 0)
>>> +		return;
>>> +
>>> +	VM_OBJECT_ASSERT_LOCKED(m->object);
>>> +
>>> +	need_dirty = ((fault_type & VM_PROT_WRITE) != 0 &&
>>> +	    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
>>> +	    (fault_flags & VM_FAULT_DIRTY) != 0;
>>> +
>>> +	if (set_wd)
>>> +		vm_object_set_writeable_dirty(m->object);
>>> +	else
>>> +		/*
>>> +		 * If two callers of vm_fault_dirty() with set_wd ==
>>> +		 * FALSE, one for the map entry with MAP_ENTRY_NOSYNC
>>> +		 * flag set, other with flag clear, race, it is
>>> +		 * possible for the no-NOSYNC thread to see m->dirty
>>> +		 * != 0 and not clear VPO_NOSYNC.  Take vm_page lock
>>> +		 * around manipulation of VPO_NOSYNC and
>>> +		 * vm_page_dirty() call, to avoid the race and keep
>>> +		 * m->oflags consistent.
>>> +		 */
>>> +		vm_page_lock(m);
>>> +
>>> +	/*
>>> +	 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
>>> +	 * if the page is already dirty to prevent data written with
>>> +	 * the expectation of being synced from not being synced.
>>> +	 * Likewise if this entry does not request NOSYNC then make
>>> +	 * sure the page isn't marked NOSYNC.  Applications sharing
>>> +	 * data should use the same flags to avoid ping ponging.
>>> +	 */
>>> +	if ((entry->eflags & MAP_ENTRY_NOSYNC) != 0) {
>>> +		if (m->dirty == 0) {
>>> +			m->oflags |= VPO_NOSYNC;
>>> +		}
>>> +	} else {
>>> +		m->oflags &= ~VPO_NOSYNC;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If the fault is a write, we know that this page is being
>>> +	 * written NOW so dirty it explicitly to save on
>>> +	 * pmap_is_modified() calls later.
>>> +	 *
>>> +	 * Also tell the backing pager, if any, that it should remove
>>> +	 * any swap backing since the page is now dirty.
>>> +	 */
>>> +	if (need_dirty)
>>> +		vm_page_dirty(m);
>>> +	if (!set_wd)
>>> +		vm_page_unlock(m);
>>> +	if (need_dirty)
>>> +		vm_pager_page_unswapped(m);
>>> +}
>>> +
>>>  /*
>>>   *	TRYPAGER - used by vm_fault to calculate whether the pager for the
>>>   *	current object *might* contain the page.
>>> @@ -321,11 +385,8 @@ RetryFault:;
>>>  			vm_page_hold(m);
>>>  			vm_page_unlock(m);
>>>  		}
>>> -		if ((fault_type & VM_PROT_WRITE) != 0 &&
>>> -		    (m->oflags & VPO_UNMANAGED) == 0) {
>>> -			vm_page_dirty(m);
>>> -			vm_pager_page_unswapped(m);
>>> -		}
>>> +		vm_fault_dirty(fs.entry, m, prot, fault_type, fault_flags,
>>> +		    FALSE);
>>>  		VM_OBJECT_RUNLOCK(fs.first_object);
>>>  		if (!wired)
>>>  			vm_fault_prefault(&fs, vaddr, 0, 0);
>>> @@ -898,42 +959,7 @@ vnode_locked:
>>>  	if (hardfault)
>>>  		fs.entry->next_read = fs.pindex + faultcount - reqpage;
>>>  
>>> -	if (((prot & VM_PROT_WRITE) != 0 ||
>>> -	    (fault_flags & VM_FAULT_DIRTY) != 0) &&
>>> -	    (fs.m->oflags & VPO_UNMANAGED) == 0) {
>>> -		vm_object_set_writeable_dirty(fs.object);
>>> -
>>> -		/*
>>> -		 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
>>> -		 * if the page is already dirty to prevent data written with
>>> -		 * the expectation of being synced from not being synced.
>>> -		 * Likewise if this entry does not request NOSYNC then make
>>> -		 * sure the page isn't marked NOSYNC.  Applications sharing
>>> -		 * data should use the same flags to avoid ping ponging.
>>> -		 */
>>> -		if (fs.entry->eflags & MAP_ENTRY_NOSYNC) {
>>> -			if (fs.m->dirty == 0)
>>> -				fs.m->oflags |= VPO_NOSYNC;
>>> -		} else {
>>> -			fs.m->oflags &= ~VPO_NOSYNC;
>>> -		}
>>> -
>>> -		/*
>>> -		 * If the fault is a write, we know that this page is being
>>> -		 * written NOW so dirty it explicitly to save on
>>> -		 * pmap_is_modified() calls later.
>>> -		 *
>>> -		 * Also tell the backing pager, if any, that it should remove
>>> -		 * any swap backing since the page is now dirty.
>>> -		 */
>>> -		if (((fault_type & VM_PROT_WRITE) != 0 &&
>>> -		    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
>>> -		    (fault_flags & VM_FAULT_DIRTY) != 0) {
>>> -			vm_page_dirty(fs.m);
>>> -			vm_pager_page_unswapped(fs.m);
>>> -		}
>>> -	}
>>> -
>>> +	vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, TRUE);
>>>  	vm_page_assert_xbusied(fs.m);
>>>  
>>>  	/*
>>> diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
>>> index f12b76c..a45648d 100644
>>> --- a/sys/vm/vm_page.h
>>> +++ b/sys/vm/vm_page.h
>>> @@ -147,7 +147,7 @@ struct vm_page {
>>>  	uint16_t hold_count;	/* page hold count (P) */
>>>  	uint16_t flags;		/* page PG_* flags (P) */
>>>  	uint8_t aflags;		/* access is atomic */
>>> -	uint8_t oflags;		/* page VPO_* flags (O) */
>>> +	uint8_t oflags;		/* page VPO_* flags (OM) */
>>>  	uint8_t	queue;		/* page queue index (P,Q) */
>>>  	int8_t psind;		/* pagesizes[] index (O) */
>>>  	int8_t segind;
>>> @@ -163,8 +163,9 @@ struct vm_page {
>>>  /*
>>>   * Page flags stored in oflags:
>>>   *
>>> - * Access to these page flags is synchronized by the lock on the object
>>> - * containing the page (O).
>>> + * Access to these page flags is synchronized by the exclusive lock on
>>> + * the object containing the page, or combination of shared object
>>> + * lock and the page lock (OM).
>>>   *
>>>   * Note: VPO_UNMANAGED (used by OBJT_DEVICE, OBJT_PHYS and OBJT_SG)
>>>   *	indicates that the page is not under PV management but