From owner-freebsd-hackers@FreeBSD.ORG Tue Sep 9 08:39:02 2014 Return-Path: Delivered-To: hackers@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 02538D2E; Tue, 9 Sep 2014 08:39:02 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 7E1112F2; Tue, 9 Sep 2014 08:39:01 +0000 (UTC) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.14.9/8.14.9) with ESMTP id s898cmTo074518 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 9 Sep 2014 11:38:48 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua s898cmTo074518 Received: (from kostik@localhost) by tom.home (8.14.9/8.14.9/Submit) id s898clCr074516; Tue, 9 Sep 2014 11:38:47 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 9 Sep 2014 11:38:47 +0300 From: Konstantin Belousov To: Alan Cox Subject: Re: mmap MAP_NOSYNC regression in 10.x Message-ID: <20140909083847.GB2737@kib.kiev.ua> References: <540903FF.6010602@degoeje.nl> <20140905080633.GM2737@kib.kiev.ua> <5409A4F8.6020204@degoeje.nl> <20140905123826.GP2737@kib.kiev.ua> <540BA416.7010106@rice.edu> <20140908084126.GX2737@kib.kiev.ua> <540DEE8F.5080005@rice.edu> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="6iAIcqJi9p/aaN8Y" Content-Disposition: inline In-Reply-To: <540DEE8F.5080005@rice.edu> User-Agent: Mutt/1.5.23 (2014-03-12) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.0 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on tom.home Cc: alc@freebsd.org, Pieter de Goeje , hackers@freebsd.org X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 09 Sep 2014 08:39:02 -0000 --6iAIcqJi9p/aaN8Y Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Sep 08, 2014 at 12:59:43PM -0500, Alan Cox wrote: > On 09/08/2014 03:41, Konstantin Belousov wrote: > > On Sat, Sep 06, 2014 at 07:17:26PM -0500, Alan Cox wrote: > >> On 09/05/2014 07:38, Konstantin Belousov wrote: > >>> On Fri, Sep 05, 2014 at 01:56:40PM +0200, Pieter de Goeje wrote: > >>>> Thanks, works for me! > >>> I realized that the patch contains yet another bug. The oflags page > >>> flags update is protected by the exclusive vm object lock, which is o= nly > >>> held in shared mode on the fast path. Below is the fixed patch, where > >>> I take the page lock around setting VPO_NOSYNC (exclusive lock owners > >>> cannot race with fast path since we own shared lock, and two parallel > >>> fast path execution would be handled by the page lock). > >> > >> Suppose that the page is clean and two threads are executing this code > >> concurrently. One's map entry has MAP_NOSYNC set, and the other's > >> doesn't. Let's call these threads NOSYNC and SYNC, respectively. > >> > >> Suppose that the thread SYNC is slightly ahead. It has already > >> performed "m->oflags &=3D ~VPO_NOSYNC;" and now it's about to perform > >> "vm_page_dirty(fs.m);". However, just before the thread SYNC calls > >> vm_page_dirty(), the thread NOSYNC evaluates "m->dirty =3D=3D 0", whic= h is > >> still true, and it performs "m->oflags |=3D VPO_NOSYNC; " > >> > >> This can't happen on the slow path. That is, a fault by a thread > >> without MAP_NOSYNC set on its map entry will reliably clear VPO_NOSYNC. > > As I understand things, it is indeed not possible on the slow path, due > > to PG_RW only set from pmap_enter(), am I right ? I.e. this is another > > place where the rule 'no PG_RW without PG_M' is important. >=20 >=20 > Yes, it's not possible, but I'm a little confused by the rest of your > question, specifically, the statement "no PG_RW without PG_M". Did you > actually mean "no PG_M without PG_RW"? No, I mean what I did wrote, and I was wrong. >=20 >=20 > > Let me formulate my question another way: what are the guarantees we > > provide to the applications when the same page is mapped with and > > without MAP_NOSYNC simultaneously ? Is it contractually guaranteed that > > any write from !MAP_NOSYNC entry triggers write in the syncer activity > > period ? >=20 >=20 > Yes, that is the intent. However, I can think of at least one case > where the existing code doesn't work as intended. Suppose that the > first fault on a !MAP_NOSYNC entry is triggered by a read access. Then, > vm_fault() won't call vm_page_dirty(), but it will nonetheless install a > mapping in the pmap that allows write access. Now, suppose this same > process writes to the page. Finally, suppose that the second fault > happens on a MAP_NOSYNC entry. That fault will see a clean page, i.e., > m->dirty =3D=3D 0, and set VPO_NOSYNC on the page, even though the first > faulting process that wants the page sync'ed has dirtied the page. Yes, you are right. I convinced myself that this cannot happen, due to the false assumption above, and the fact that page flushing removes write bit from pte. >=20 >=20 >=20 > >> The best course of action may be to fall back to the slow path if you > >> actually need to change VPO_NOSYNC's state. Usually, you won't need t= o. > >> > > Let me first try to improve the original patch to handle > > MAP_ENTRY_NOSYNC on fast path as well. It seems to be one of the cases > > when the parallel faults are actually useful. >=20 >=20 > I think it may be time to take a step back, decide what semantics we > really want, and see if there is a better way of implementing those > semantics. The current approach based on toggling VPO_NOSYNC only > really works for the simplest cases. Still, I believe that the current form of the patch completely repeats the existing semantic of the slow path in the fast path. I may propose to avoid putting vm_page_dirty() in the scope of page lock for vm_fault_dirty(), mostly to simplify the logic, not to provide any useful optimization. >=20 >=20 > > One more note: the previous patch handled m->oflags inconsistency for > > setting VPO_NOSYNC operation, but missed the clear one line later. > > I think that increasing the page lock to cover also the vm_page_dirty() > > would fix the race you described, and the second manipulation with > > oflags. > > > > diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c > > index 30b0456..944b479 100644 > > --- a/sys/vm/vm_fault.c > > +++ b/sys/vm/vm_fault.c > > @@ -174,6 +174,70 @@ unlock_and_deallocate(struct faultstate *fs) > > } > > } > > =20 > > +static void > > +vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot, > > + vm_prot_t fault_type, int fault_flags, boolean_t set_wd) > > +{ > > + boolean_t need_dirty; > > + > > + if (((prot & VM_PROT_WRITE) =3D=3D 0 && > > + (fault_flags & VM_FAULT_DIRTY) =3D=3D 0) || > > + (m->oflags & VPO_UNMANAGED) !=3D 0) > > + return; > > + > > + VM_OBJECT_ASSERT_LOCKED(m->object); > > + > > + need_dirty =3D ((fault_type & VM_PROT_WRITE) !=3D 0 && > > + (fault_flags & VM_FAULT_CHANGE_WIRING) =3D=3D 0) || > > + (fault_flags & VM_FAULT_DIRTY) !=3D 0; > > + > > + if (set_wd) > > + vm_object_set_writeable_dirty(m->object); > > + else > > + /* > > + * If two callers of vm_fault_dirty() with set_wd =3D=3D > > + * FALSE, one for the map entry with MAP_ENTRY_NOSYNC > > + * flag set, other with flag clear, race, it is > > + * possible for the no-NOSYNC thread to see m->dirty > > + * !=3D 0 and not clear VPO_NOSYNC. Take vm_page lock > > + * around manipulation of VPO_NOSYNC and > > + * vm_page_dirty() call, to avoid the race and keep > > + * m->oflags consistent. > > + */ > > + vm_page_lock(m); > > + > > + /* > > + * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC > > + * if the page is already dirty to prevent data written with > > + * the expectation of being synced from not being synced. > > + * Likewise if this entry does not request NOSYNC then make > > + * sure the page isn't marked NOSYNC. Applications sharing > > + * data should use the same flags to avoid ping ponging. > > + */ > > + if ((entry->eflags & MAP_ENTRY_NOSYNC) !=3D 0) { > > + if (m->dirty =3D=3D 0) { > > + m->oflags |=3D VPO_NOSYNC; > > + } > > + } else { > > + m->oflags &=3D ~VPO_NOSYNC; > > + } > > + > > + /* > > + * If the fault is a write, we know that this page is being > > + * written NOW so dirty it explicitly to save on > > + * pmap_is_modified() calls later. > > + * > > + * Also tell the backing pager, if any, that it should remove > > + * any swap backing since the page is now dirty. > > + */ > > + if (need_dirty) > > + vm_page_dirty(m); > > + if (!set_wd) > > + vm_page_unlock(m); > > + if (need_dirty) > > + vm_pager_page_unswapped(m); > > +} > > + > > /* > > * TRYPAGER - used by vm_fault to calculate whether the pager for the > > * current object *might* contain the page. > > @@ -321,11 +385,8 @@ RetryFault:; > > vm_page_hold(m); > > vm_page_unlock(m); > > } > > - if ((fault_type & VM_PROT_WRITE) !=3D 0 && > > - (m->oflags & VPO_UNMANAGED) =3D=3D 0) { > > - vm_page_dirty(m); > > - vm_pager_page_unswapped(m); > > - } > > + vm_fault_dirty(fs.entry, m, prot, fault_type, fault_flags, > > + FALSE); > > VM_OBJECT_RUNLOCK(fs.first_object); > > if (!wired) > > vm_fault_prefault(&fs, vaddr, 0, 0); > > @@ -898,42 +959,7 @@ vnode_locked: > > if (hardfault) > > fs.entry->next_read =3D fs.pindex + faultcount - reqpage; > > =20 > > - if (((prot & VM_PROT_WRITE) !=3D 0 || > > - (fault_flags & VM_FAULT_DIRTY) !=3D 0) && > > - (fs.m->oflags & VPO_UNMANAGED) =3D=3D 0) { > > - vm_object_set_writeable_dirty(fs.object); > > - > > - /* > > - * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC > > - * if the page is already dirty to prevent data written with > > - * the expectation of being synced from not being synced. > > - * Likewise if this entry does not request NOSYNC then make > > - * sure the page isn't marked NOSYNC. Applications sharing > > - * data should use the same flags to avoid ping ponging. > > - */ > > - if (fs.entry->eflags & MAP_ENTRY_NOSYNC) { > > - if (fs.m->dirty =3D=3D 0) > > - fs.m->oflags |=3D VPO_NOSYNC; > > - } else { > > - fs.m->oflags &=3D ~VPO_NOSYNC; > > - } > > - > > - /* > > - * If the fault is a write, we know that this page is being > > - * written NOW so dirty it explicitly to save on=20 > > - * pmap_is_modified() calls later. > > - * > > - * Also tell the backing pager, if any, that it should remove > > - * any swap backing since the page is now dirty. > > - */ > > - if (((fault_type & VM_PROT_WRITE) !=3D 0 && > > - (fault_flags & VM_FAULT_CHANGE_WIRING) =3D=3D 0) || > > - (fault_flags & VM_FAULT_DIRTY) !=3D 0) { > > - vm_page_dirty(fs.m); > > - vm_pager_page_unswapped(fs.m); > > - } > > - } > > - > > + vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, TRUE); > > vm_page_assert_xbusied(fs.m); > > =20 > > /* > > diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h > > index f12b76c..a45648d 100644 > > --- a/sys/vm/vm_page.h > > +++ b/sys/vm/vm_page.h > > @@ -147,7 +147,7 @@ struct vm_page { > > uint16_t hold_count; /* page hold count (P) */ > > uint16_t flags; /* page PG_* flags (P) */ > > uint8_t aflags; /* access is atomic */ > > - uint8_t oflags; /* page VPO_* flags (O) */ > > + uint8_t oflags; /* page VPO_* flags (OM) */ > > uint8_t queue; /* page queue index (P,Q) */ > > int8_t psind; /* pagesizes[] index (O) */ > > int8_t segind; > > @@ -163,8 +163,9 @@ struct vm_page { > > /* > > * Page flags stored in oflags: > > * > > - * Access to these page flags is synchronized by the lock on the object > > - * containing the page (O). > > + * Access to these page flags is synchronized by the exclusive lock on > > + * the object containing the page, or combination of shared object > > + * lock and the page lock (OM). > > * > > * Note: VPO_UNMANAGED (used by OBJT_DEVICE, OBJT_PHYS and OBJT_SG) > > * indicates that the page is not under PV management but --6iAIcqJi9p/aaN8Y Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJUDryXAAoJEJDCuSvBvK1BKWMQAIu9EE++i0GUBWHz0jiTx3iz rrD4XdwOvNJq4kicpIpvv4d5Q8uzEAOECIYXXVxoHkcyXrtZImFikWUCXAHsIRSU p7ZkxW6NBrWYtsbBuJSQvYINyC6fn+iQYvk10O0CyXBchFHc4um2QB0FqkXui4xF vXNiVFWSe7RBChInGeJZ8H8t2iq73TUqlHuIoRSm/4Fu9PRgSHDH6GA6gY707nLq D+Cb37N9l9OUaV0SVdA+OqiuBi9jGq7/lSPIvQz3HCUmQrWmxP4pPIJOfNTYMQJs jpSHfyaWLcwcGnX1oycVWoPOreuWXT/cfq3jrbEZ4xM3iNrFv09G+o6EQQZIsTzO WoBxtwWygyI73TC4K9i91McLxi+DsUyRM4N6A5qlQ3dQLbFmiGyFtcl7HpNZvdga AqRilih9CXJFi5StuL7ySsG17+1QnJqkLSAbkBcAaHL8K3+aMUBK02oK4UW7X7gI 3q5CBclctvfSogX0DQAxpUzaKZkK0jJgqhpW/tMGA5q1wmJAITwXpw2ckqD7btHj i2L8mtKeGReqdjop4GikXclL7sMCu/HMmpWDNJb/v6nNXM+imX7lCn4MGLAkkZwD lp8t+mVwIOvF0WaJpsE8DoMlV5AfS6WZ1so5S/byldl7kvDuyLeTNAPiLHNWekaq 0A5CaSSr+yCYp0o/P+yL =NVhN -----END PGP SIGNATURE----- --6iAIcqJi9p/aaN8Y--