Date:      Tue, 9 Sep 2014 11:38:47 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Alan Cox <alc@rice.edu>
Cc:        alc@freebsd.org, Pieter de Goeje <pieter@degoeje.nl>, hackers@freebsd.org
Subject:   Re: mmap MAP_NOSYNC regression in 10.x
Message-ID:  <20140909083847.GB2737@kib.kiev.ua>
In-Reply-To: <540DEE8F.5080005@rice.edu>
References:  <540903FF.6010602@degoeje.nl> <CAJUyCcNiLwLuL9crpQBjSdg4ED5kR53fPjyJG3HNmP5Roor8RQ@mail.gmail.com> <20140905080633.GM2737@kib.kiev.ua> <5409A4F8.6020204@degoeje.nl> <20140905123826.GP2737@kib.kiev.ua> <540BA416.7010106@rice.edu> <20140908084126.GX2737@kib.kiev.ua> <540DEE8F.5080005@rice.edu>

On Mon, Sep 08, 2014 at 12:59:43PM -0500, Alan Cox wrote:
> On 09/08/2014 03:41, Konstantin Belousov wrote:
> > On Sat, Sep 06, 2014 at 07:17:26PM -0500, Alan Cox wrote:
> >> On 09/05/2014 07:38, Konstantin Belousov wrote:
> >>> On Fri, Sep 05, 2014 at 01:56:40PM +0200, Pieter de Goeje wrote:
> >>>> Thanks, works for me!
> >>> I realized that the patch contains yet another bug. The oflags page
> >>> flags update is protected by the exclusive vm object lock, which is
> >>> only held in shared mode on the fast path. Below is the fixed patch,
> >>> where I take the page lock around setting VPO_NOSYNC (exclusive lock
> >>> owners cannot race with the fast path since we own the shared lock,
> >>> and two parallel fast path executions would be handled by the page
> >>> lock).
> >>
> >> Suppose that the page is clean and two threads are executing this code
> >> concurrently.  One's map entry has MAP_NOSYNC set, and the other's
> >> doesn't.  Let's call these threads NOSYNC and SYNC, respectively.
> >>
> >> Suppose that the thread SYNC is slightly ahead.  It has already
> >> performed "m->oflags &= ~VPO_NOSYNC;" and now it's about to perform
> >> "vm_page_dirty(fs.m);".  However, just before the thread SYNC calls
> >> vm_page_dirty(), the thread NOSYNC evaluates "m->dirty == 0", which
> >> is still true, and it performs "m->oflags |= VPO_NOSYNC;".
> >>
> >> This can't happen on the slow path.  That is, a fault by a thread
> >> without MAP_NOSYNC set on its map entry will reliably clear VPO_NOSYNC.
> > As I understand things, it is indeed not possible on the slow path,
> > because PG_RW is only set from pmap_enter(), am I right?  I.e., this
> > is another place where the rule 'no PG_RW without PG_M' is important.
>
>
> Yes, it's not possible, but I'm a little confused by the rest of your
> question, specifically, the statement "no PG_RW without PG_M".  Did you
> actually mean "no PG_M without PG_RW"?
No, I meant what I wrote, and I was wrong.
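
For the record, the invariant Alan points at is the converse one: a pte
may have the modified bit set only if it is also writable.  A
hypothetical assertion in the style of the amd64 pmap (a sketch, not a
quote of the actual code):

	/* "No PG_M without PG_RW": a modified pte must be writable. */
	KASSERT((pte & (PG_M | PG_RW)) != PG_M,
	    ("pte %#lx has PG_M set but PG_RW clear", pte));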

>
>
> > Let me formulate my question another way:  what are the guarantees we
> > provide to applications when the same page is mapped with and without
> > MAP_NOSYNC simultaneously?  Is it contractually guaranteed that any
> > write through a !MAP_NOSYNC entry triggers a write-back within the
> > syncer activity period?
>
>
> Yes, that is the intent.  However, I can think of at least one case
> where the existing code doesn't work as intended.  Suppose that the
> first fault on a !MAP_NOSYNC entry is triggered by a read access.  Then,
> vm_fault() won't call vm_page_dirty(), but it will nonetheless install a
> mapping in the pmap that allows write access.  Now, suppose this same
> process writes to the page.  Finally, suppose that the second fault
> happens on a MAP_NOSYNC entry.  That fault will see a clean page, i.e.,
> > m->dirty == 0, and set VPO_NOSYNC on the page, even though the first
> faulting process that wants the page sync'ed has dirtied the page.
Yes, you are right.  I convinced myself that this could not happen,
based on the false assumption above and on the fact that page flushing
removes the write bit from the pte.
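
To make the scenario concrete, here is a minimal userland sketch
(hypothetical file name, error handling omitted) that maps the same
file page once without and once with MAP_NOSYNC:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	char *sync_map, *nosync_map;
	volatile char c;
	int fd;

	fd = open("/tmp/f", O_RDWR | O_CREAT, 0644);
	ftruncate(fd, 4096);
	sync_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	nosync_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_NOSYNC, fd, 0);

	c = sync_map[0];	/* read fault, pte becomes writable */
	sync_map[0] = 1;	/* write, no fault, dirty only in the pte */
	c = nosync_map[0];	/* fault: m->dirty == 0, sets VPO_NOSYNC */
	return (0);
}

After the third access the page carries VPO_NOSYNC even though the
write was done through the mapping that did not request MAP_NOSYNC.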

>
>
>
> >> The best course of action may be to fall back to the slow path if you
> >> actually need to change VPO_NOSYNC's state.  Usually, you won't need to.
> >>
> > Let me first try to improve the original patch to handle
> > MAP_ENTRY_NOSYNC on the fast path as well.  It seems to be one of the
> > cases where parallel faults are actually useful.
>
>
> I think it may be time to take a step back, decide what semantics we
> really want, and see if there is a better way of implementing those
> semantics.  The current approach based on toggling VPO_NOSYNC only
> really works for the simplest cases.

Still, I believe that the current form of the patch faithfully
replicates the existing slow-path semantics on the fast path.

I may propose taking vm_page_dirty() out of the scope of the page lock
in vm_fault_dirty(), mostly to simplify the logic rather than to
provide any useful optimization.
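
For reference, the interleaving you described upthread, and how the
page lock in the patch closes it (a timeline sketch, not compilable
code):

/*
 *  thread SYNC (!MAP_ENTRY_NOSYNC)     thread NOSYNC (MAP_ENTRY_NOSYNC)
 *  -------------------------------     --------------------------------
 *  vm_page_lock(m)
 *  m->oflags &= ~VPO_NOSYNC            vm_page_lock(m)  <- blocks
 *  vm_page_dirty(m)
 *  vm_page_unlock(m)
 *                                      m->dirty != 0, so VPO_NOSYNC is
 *                                      left clear
 *                                      vm_page_unlock(m)
 */

Without the page lock also covering vm_page_dirty(), NOSYNC could
observe m->dirty == 0 after SYNC had cleared VPO_NOSYNC but before it
dirtied the page, and would incorrectly set VPO_NOSYNC.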

>
>
> > One more note: the previous patch handled the m->oflags inconsistency
> > for the VPO_NOSYNC set operation, but missed the clear one line later.
> > I think that extending the page lock to also cover vm_page_dirty()
> > would fix the race you described, as well as the second manipulation
> > of oflags.
> >
> > diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
> > index 30b0456..944b479 100644
> > --- a/sys/vm/vm_fault.c
> > +++ b/sys/vm/vm_fault.c
> > @@ -174,6 +174,70 @@ unlock_and_deallocate(struct faultstate *fs)
> >  	}
> >  }
> > 
> > +static void
> > +vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot,
> > +    vm_prot_t fault_type, int fault_flags, boolean_t set_wd)
> > +{
> > +	boolean_t need_dirty;
> > +
> > +	if (((prot & VM_PROT_WRITE) == 0 &&
> > +	    (fault_flags & VM_FAULT_DIRTY) == 0) ||
> > +	    (m->oflags & VPO_UNMANAGED) != 0)
> > +		return;
> > +
> > +	VM_OBJECT_ASSERT_LOCKED(m->object);
> > +
> > +	need_dirty = ((fault_type & VM_PROT_WRITE) != 0 &&
> > +	    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
> > +	    (fault_flags & VM_FAULT_DIRTY) != 0;
> > +
> > +	if (set_wd)
> > +		vm_object_set_writeable_dirty(m->object);
> > +	else
> > +		/*
> > +		 * If two callers of vm_fault_dirty() with set_wd ==
> > +		 * FALSE, one for the map entry with MAP_ENTRY_NOSYNC
> > +		 * flag set, other with flag clear, race, it is
> > +		 * possible for the no-NOSYNC thread to see m->dirty
> > +		 * != 0 and not clear VPO_NOSYNC.  Take vm_page lock
> > +		 * around manipulation of VPO_NOSYNC and
> > +		 * vm_page_dirty() call, to avoid the race and keep
> > +		 * m->oflags consistent.
> > +		 */
> > +		vm_page_lock(m);
> > +
> > +	/*
> > +	 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
> > +	 * if the page is already dirty to prevent data written with
> > +	 * the expectation of being synced from not being synced.
> > +	 * Likewise if this entry does not request NOSYNC then make
> > +	 * sure the page isn't marked NOSYNC.  Applications sharing
> > +	 * data should use the same flags to avoid ping ponging.
> > +	 */
> > +	if ((entry->eflags & MAP_ENTRY_NOSYNC) != 0) {
> > +		if (m->dirty == 0) {
> > +			m->oflags |= VPO_NOSYNC;
> > +		}
> > +	} else {
> > +		m->oflags &= ~VPO_NOSYNC;
> > +	}
> > +
> > +	/*
> > +	 * If the fault is a write, we know that this page is being
> > +	 * written NOW so dirty it explicitly to save on
> > +	 * pmap_is_modified() calls later.
> > +	 *
> > +	 * Also tell the backing pager, if any, that it should remove
> > +	 * any swap backing since the page is now dirty.
> > +	 */
> > +	if (need_dirty)
> > +		vm_page_dirty(m);
> > +	if (!set_wd)
> > +		vm_page_unlock(m);
> > +	if (need_dirty)
> > +		vm_pager_page_unswapped(m);
> > +}
> > +
> >  /*
> >   * TRYPAGER - used by vm_fault to calculate whether the pager for the
> >   *	      current object *might* contain the page.
> > @@ -321,11 +385,8 @@ RetryFault:;
> >  			vm_page_hold(m);
> >  			vm_page_unlock(m);
> >  		}
> > -		if ((fault_type & VM_PROT_WRITE) != 0 &&
> > -		    (m->oflags & VPO_UNMANAGED) == 0) {
> > -			vm_page_dirty(m);
> > -			vm_pager_page_unswapped(m);
> > -		}
> > +		vm_fault_dirty(fs.entry, m, prot, fault_type, fault_flags,
> > +		    FALSE);
> >  		VM_OBJECT_RUNLOCK(fs.first_object);
> >  		if (!wired)
> >  			vm_fault_prefault(&fs, vaddr, 0, 0);
> > @@ -898,42 +959,7 @@ vnode_locked:
> >  	if (hardfault)
> >  		fs.entry->next_read = fs.pindex + faultcount - reqpage;
> > 
> > -	if (((prot & VM_PROT_WRITE) != 0 ||
> > -	    (fault_flags & VM_FAULT_DIRTY) != 0) &&
> > -	    (fs.m->oflags & VPO_UNMANAGED) == 0) {
> > -		vm_object_set_writeable_dirty(fs.object);
> > -
> > -		/*
> > -		 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
> > -		 * if the page is already dirty to prevent data written with
> > -		 * the expectation of being synced from not being synced.
> > -		 * Likewise if this entry does not request NOSYNC then make
> > -		 * sure the page isn't marked NOSYNC.  Applications sharing
> > -		 * data should use the same flags to avoid ping ponging.
> > -		 */
> > -		if (fs.entry->eflags & MAP_ENTRY_NOSYNC) {
> > -			if (fs.m->dirty == 0)
> > -				fs.m->oflags |= VPO_NOSYNC;
> > -		} else {
> > -			fs.m->oflags &= ~VPO_NOSYNC;
> > -		}
> > -
> > -		/*
> > -		 * If the fault is a write, we know that this page is being
> > -		 * written NOW so dirty it explicitly to save on
> > -		 * pmap_is_modified() calls later.
> > -		 *
> > -		 * Also tell the backing pager, if any, that it should remove
> > -		 * any swap backing since the page is now dirty.
> > -		 */
> > -		if (((fault_type & VM_PROT_WRITE) != 0 &&
> > -		    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
> > -		    (fault_flags & VM_FAULT_DIRTY) != 0) {
> > -			vm_page_dirty(fs.m);
> > -			vm_pager_page_unswapped(fs.m);
> > -		}
> > -	}
> > -
> > +	vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, TRUE);
> >  	vm_page_assert_xbusied(fs.m);
> > 
> >  	/*
> > diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
> > index f12b76c..a45648d 100644
> > --- a/sys/vm/vm_page.h
> > +++ b/sys/vm/vm_page.h
> > @@ -147,7 +147,7 @@ struct vm_page {
> >  	uint16_t hold_count;		/* page hold count (P) */
> >  	uint16_t flags;			/* page PG_* flags (P) */
> >  	uint8_t aflags;			/* access is atomic */
> > -	uint8_t oflags;			/* page VPO_* flags (O) */
> > +	uint8_t oflags;			/* page VPO_* flags (OM) */
> >  	uint8_t	queue;			/* page queue index (P,Q) */
> >  	int8_t psind;			/* pagesizes[] index (O) */
> >  	int8_t segind;
> > @@ -163,8 +163,9 @@ struct vm_page {
> >  /*
> >   * Page flags stored in oflags:
> >   *
> > - * Access to these page flags is synchronized by the lock on the object
> > - * containing the page (O).
> > + * Access to these page flags is synchronized by the exclusive lock on
> > + * the object containing the page, or combination of shared object
> > + * lock and the page lock (OM).
> >   *
> >   * Note: VPO_UNMANAGED (used by OBJT_DEVICE, OBJT_PHYS and OBJT_SG)
> >   * 	 indicates that the page is not under PV management but
