From owner-freebsd-hackers@FreeBSD.ORG Wed Sep 10 06:10:28 2014
Message-ID: <540FEB43.5030704@rice.edu>
Date: Wed, 10 Sep 2014 01:10:11 -0500
From: Alan Cox
To: Konstantin Belousov
Cc: alc@freebsd.org, Pieter de Goeje, hackers@freebsd.org
Subject: Re: mmap MAP_NOSYNC regression in 10.x
In-Reply-To: <20140909083847.GB2737@kib.kiev.ua>
References: <540903FF.6010602@degoeje.nl>
 <20140905080633.GM2737@kib.kiev.ua> <5409A4F8.6020204@degoeje.nl>
 <20140905123826.GP2737@kib.kiev.ua> <540BA416.7010106@rice.edu>
 <20140908084126.GX2737@kib.kiev.ua> <540DEE8F.5080005@rice.edu>
 <20140909083847.GB2737@kib.kiev.ua>
List-Id: Technical Discussions relating to FreeBSD

On 09/09/2014 03:38, Konstantin Belousov wrote:
> On Mon, Sep 08, 2014 at 12:59:43PM -0500, Alan Cox wrote:
>> On 09/08/2014 03:41, Konstantin Belousov wrote:
>>> On Sat, Sep 06, 2014 at 07:17:26PM -0500, Alan Cox wrote:
>>>> On 09/05/2014 07:38, Konstantin Belousov wrote:
>>>>> On Fri, Sep 05, 2014 at 01:56:40PM +0200, Pieter de Goeje wrote:
>>>>>> Thanks, works for me!
>>>>> I realized that the patch contains yet another bug.  The oflags page
>>>>> flags update is protected by the exclusive vm object lock, which is
>>>>> only held in shared mode on the fast path.
>>>>> Below is the fixed patch, where I take the page lock around setting
>>>>> VPO_NOSYNC (exclusive lock owners cannot race with the fast path,
>>>>> since we own the shared lock, and two parallel fast path executions
>>>>> are handled by the page lock).
>>>> Suppose that the page is clean and two threads are executing this code
>>>> concurrently.  One's map entry has MAP_NOSYNC set, and the other's
>>>> doesn't.  Let's call these threads NOSYNC and SYNC, respectively.
>>>>
>>>> Suppose that the thread SYNC is slightly ahead.  It has already
>>>> performed "m->oflags &= ~VPO_NOSYNC;" and now it's about to perform
>>>> "vm_page_dirty(fs.m);".  However, just before the thread SYNC calls
>>>> vm_page_dirty(), the thread NOSYNC evaluates "m->dirty == 0", which is
>>>> still true, and it performs "m->oflags |= VPO_NOSYNC;"
>>>>
>>>> This can't happen on the slow path.  That is, a fault by a thread
>>>> without MAP_NOSYNC set on its map entry will reliably clear VPO_NOSYNC.
>>> As I understand things, it is indeed not possible on the slow path, due
>>> to PG_RW only being set from pmap_enter(), am I right?  I.e., this is
>>> another place where the rule 'no PG_RW without PG_M' is important.
>>
>> Yes, it's not possible, but I'm a little confused by the rest of your
>> question, specifically, the statement "no PG_RW without PG_M".  Did you
>> actually mean "no PG_M without PG_RW"?
> No, I meant what I wrote, and I was wrong.
>
>>
>>> Let me formulate my question another way: what are the guarantees we
>>> provide to applications when the same page is mapped with and without
>>> MAP_NOSYNC simultaneously?  Is it contractually guaranteed that any
>>> write through a !MAP_NOSYNC entry triggers a write-back within the
>>> syncer's activity period?
>>
>> Yes, that is the intent.  However, I can think of at least one case
>> where the existing code doesn't work as intended.  Suppose that the
>> first fault on a !MAP_NOSYNC entry is triggered by a read access.  Then,
>> vm_fault() won't call vm_page_dirty(), but it will nonetheless install a
>> mapping in the pmap that allows write access.  Now, suppose this same
>> process writes to the page.  Finally, suppose that the second fault
>> happens on a MAP_NOSYNC entry.  That fault will see a clean page, i.e.,
>> m->dirty == 0, and set VPO_NOSYNC on the page, even though the first
>> faulting process that wants the page sync'ed has dirtied the page.
> Yes, you are right.  I convinced myself that this cannot happen, due to
> the false assumption above, and the fact that page flushing removes the
> write bit from the pte.

There is another case that I've wondered about and not had the time to
check out.  What happens when a page is mapped by an entry with
MAP_NOSYNC and a write(2) is performed?  Is the page flushed or not?

>>
>>>> The best course of action may be to fall back to the slow path if you
>>>> actually need to change VPO_NOSYNC's state.  Usually, you won't need
>>>> to.
>>>>
>>> Let me first try to improve the original patch to handle
>>> MAP_ENTRY_NOSYNC on the fast path as well.  It seems to be one of the
>>> cases where parallel faults are actually useful.
>>
>> I think it may be time to take a step back, decide what semantics we
>> really want, and see if there is a better way of implementing those
>> semantics.  The current approach based on toggling VPO_NOSYNC only
>> really works for the simplest cases.
> Still, I believe that the current form of the patch completely repeats
> the existing semantics of the slow path on the fast path.
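(As an aside: to make the read-fault case I described above concrete,
here is a minimal userland sketch of the same file page mapped once with
MAP_NOSYNC and once without.  The file name is made up and error
handling is omitted; the comments describe the suspected sequence, not a
reliable reproducer, since the outcome depends on how the kernel handles
each fault.)

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("/tmp/nosync-demo", O_RDWR | O_CREAT, 0644);
	char *sync_map, *nosync_map;
	volatile char c;

	ftruncate(fd, 4096);

	/* This mapping expects normal write-back by the syncer. */
	sync_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	/* This mapping asks the syncer to skip its dirty pages. */
	nosync_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_NOSYNC, fd, 0);

	c = sync_map[0];	/* 1: read fault on the !NOSYNC entry */
	(void)c;
	sync_map[0] = 1;	/* 2: if the read fault installed a
				 * writable pte, this write takes no new
				 * fault and m->dirty can stay 0 */
	nosync_map[0] = 2;	/* 3: this fault may then see a "clean"
				 * page and set VPO_NOSYNC, although the
				 * !NOSYNC mapping dirtied the page */

	munmap(sync_map, 4096);
	munmap(nosync_map, 4096);
	close(fd);
	return (0);
}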
>
> I may propose avoiding putting vm_page_dirty() inside the scope of the
> page lock in vm_fault_dirty(), mostly to simplify the logic, not to
> provide any useful optimization.

I don't think that you can do that.  It leads to the race that I
described in my first email.

>>
>>> One more note: the previous patch handled the m->oflags inconsistency
>>> when setting VPO_NOSYNC, but missed the clear one line later.
>>> I think that extending the page lock to also cover vm_page_dirty()
>>> would fix both the race you described and the second manipulation of
>>> oflags.
>>>
>>> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
>>> index 30b0456..944b479 100644
>>> --- a/sys/vm/vm_fault.c
>>> +++ b/sys/vm/vm_fault.c
>>> @@ -174,6 +174,70 @@ unlock_and_deallocate(struct faultstate *fs)
>>>  	}
>>>  }
>>>  
>>> +static void
>>> +vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot,
>>> +    vm_prot_t fault_type, int fault_flags, boolean_t set_wd)
>>> +{
>>> +	boolean_t need_dirty;
>>> +
>>> +	if (((prot & VM_PROT_WRITE) == 0 &&
>>> +	    (fault_flags & VM_FAULT_DIRTY) == 0) ||
>>> +	    (m->oflags & VPO_UNMANAGED) != 0)
>>> +		return;
>>> +
>>> +	VM_OBJECT_ASSERT_LOCKED(m->object);
>>> +
>>> +	need_dirty = ((fault_type & VM_PROT_WRITE) != 0 &&
>>> +	    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
>>> +	    (fault_flags & VM_FAULT_DIRTY) != 0;
>>> +
>>> +	if (set_wd)
>>> +		vm_object_set_writeable_dirty(m->object);
>>> +	else
>>> +		/*
>>> +		 * If two callers of vm_fault_dirty() with set_wd ==
>>> +		 * FALSE, one for the map entry with MAP_ENTRY_NOSYNC
>>> +		 * flag set, other with flag clear, race, it is
>>> +		 * possible for the no-NOSYNC thread to see m->dirty
>>> +		 * != 0 and not clear VPO_NOSYNC.  Take vm_page lock
>>> +		 * around manipulation of VPO_NOSYNC and
>>> +		 * vm_page_dirty() call, to avoid the race and keep
>>> +		 * m->oflags consistent.
>>> +		 */
>>> +		vm_page_lock(m);
>>> +
>>> +	/*
>>> +	 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
>>> +	 * if the page is already dirty to prevent data written with
>>> +	 * the expectation of being synced from not being synced.
>>> +	 * Likewise if this entry does not request NOSYNC then make
>>> +	 * sure the page isn't marked NOSYNC.  Applications sharing
>>> +	 * data should use the same flags to avoid ping ponging.
>>> +	 */
>>> +	if ((entry->eflags & MAP_ENTRY_NOSYNC) != 0) {
>>> +		if (m->dirty == 0) {
>>> +			m->oflags |= VPO_NOSYNC;
>>> +		}
>>> +	} else {
>>> +		m->oflags &= ~VPO_NOSYNC;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If the fault is a write, we know that this page is being
>>> +	 * written NOW so dirty it explicitly to save on
>>> +	 * pmap_is_modified() calls later.
>>> +	 *
>>> +	 * Also tell the backing pager, if any, that it should remove
>>> +	 * any swap backing since the page is now dirty.
>>> +	 */
>>> +	if (need_dirty)
>>> +		vm_page_dirty(m);
>>> +	if (!set_wd)
>>> +		vm_page_unlock(m);
>>> +	if (need_dirty)
>>> +		vm_pager_page_unswapped(m);
>>> +}
>>> +
>>>  /*
>>>   *	TRYPAGER - used by vm_fault to calculate whether the pager for the
>>>   *	current object *might* contain the page.
>>> @@ -321,11 +385,8 @@ RetryFault:;
>>>  			vm_page_hold(m);
>>>  			vm_page_unlock(m);
>>>  		}
>>> -		if ((fault_type & VM_PROT_WRITE) != 0 &&
>>> -		    (m->oflags & VPO_UNMANAGED) == 0) {
>>> -			vm_page_dirty(m);
>>> -			vm_pager_page_unswapped(m);
>>> -		}
>>> +		vm_fault_dirty(fs.entry, m, prot, fault_type, fault_flags,
>>> +		    FALSE);
>>>  		VM_OBJECT_RUNLOCK(fs.first_object);
>>>  		if (!wired)
>>>  			vm_fault_prefault(&fs, vaddr, 0, 0);
>>> @@ -898,42 +959,7 @@ vnode_locked:
>>>  	if (hardfault)
>>>  		fs.entry->next_read = fs.pindex + faultcount - reqpage;
>>>  
>>> -	if (((prot & VM_PROT_WRITE) != 0 ||
>>> -	    (fault_flags & VM_FAULT_DIRTY) != 0) &&
>>> -	    (fs.m->oflags & VPO_UNMANAGED) == 0) {
>>> -		vm_object_set_writeable_dirty(fs.object);
>>> -
>>> -		/*
>>> -		 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
>>> -		 * if the page is already dirty to prevent data written with
>>> -		 * the expectation of being synced from not being synced.
>>> -		 * Likewise if this entry does not request NOSYNC then make
>>> -		 * sure the page isn't marked NOSYNC.  Applications sharing
>>> -		 * data should use the same flags to avoid ping ponging.
>>> -		 */
>>> -		if (fs.entry->eflags & MAP_ENTRY_NOSYNC) {
>>> -			if (fs.m->dirty == 0)
>>> -				fs.m->oflags |= VPO_NOSYNC;
>>> -		} else {
>>> -			fs.m->oflags &= ~VPO_NOSYNC;
>>> -		}
>>> -
>>> -		/*
>>> -		 * If the fault is a write, we know that this page is being
>>> -		 * written NOW so dirty it explicitly to save on
>>> -		 * pmap_is_modified() calls later.
>>> -		 *
>>> -		 * Also tell the backing pager, if any, that it should remove
>>> -		 * any swap backing since the page is now dirty.
>>> -		 */
>>> -		if (((fault_type & VM_PROT_WRITE) != 0 &&
>>> -		    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
>>> -		    (fault_flags & VM_FAULT_DIRTY) != 0) {
>>> -			vm_page_dirty(fs.m);
>>> -			vm_pager_page_unswapped(fs.m);
>>> -		}
>>> -	}
>>> -
>>> +	vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, TRUE);
>>>  	vm_page_assert_xbusied(fs.m);
>>>  
>>>  	/*
>>> diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
>>> index f12b76c..a45648d 100644
>>> --- a/sys/vm/vm_page.h
>>> +++ b/sys/vm/vm_page.h
>>> @@ -147,7 +147,7 @@ struct vm_page {
>>>  	uint16_t hold_count;	/* page hold count (P) */
>>>  	uint16_t flags;		/* page PG_* flags (P) */
>>>  	uint8_t aflags;		/* access is atomic */
>>> -	uint8_t oflags;		/* page VPO_* flags (O) */
>>> +	uint8_t oflags;		/* page VPO_* flags (OM) */
>>>  	uint8_t	queue;		/* page queue index (P,Q) */
>>>  	int8_t psind;		/* pagesizes[] index (O) */
>>>  	int8_t segind;
>>> @@ -163,8 +163,9 @@ struct vm_page {
>>>  /*
>>>   * Page flags stored in oflags:
>>>   *
>>> - * Access to these page flags is synchronized by the lock on the object
>>> - * containing the page (O).
>>> + * Access to these page flags is synchronized by the exclusive lock on
>>> + * the object containing the page, or combination of shared object
>>> + * lock and the page lock (OM).
>>>   *
>>>   * Note: VPO_UNMANAGED (used by OBJT_DEVICE, OBJT_PHYS and OBJT_SG)
>>>   *	indicates that the page is not under PV management but