Date:      Wed, 10 Sep 2014 12:47:05 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Alan Cox <alc@rice.edu>
Cc:        alc@freebsd.org, Pieter de Goeje <pieter@degoeje.nl>, hackers@freebsd.org
Subject:   Re: mmap MAP_NOSYNC regression in 10.x
Message-ID:  <20140910094705.GG2737@kib.kiev.ua>
In-Reply-To: <540FEB43.5030704@rice.edu>
References:  <540903FF.6010602@degoeje.nl> <CAJUyCcNiLwLuL9crpQBjSdg4ED5kR53fPjyJG3HNmP5Roor8RQ@mail.gmail.com> <20140905080633.GM2737@kib.kiev.ua> <5409A4F8.6020204@degoeje.nl> <20140905123826.GP2737@kib.kiev.ua> <540BA416.7010106@rice.edu> <20140908084126.GX2737@kib.kiev.ua> <540DEE8F.5080005@rice.edu> <20140909083847.GB2737@kib.kiev.ua> <540FEB43.5030704@rice.edu>



On Wed, Sep 10, 2014 at 01:10:11AM -0500, Alan Cox wrote:
> On 09/09/2014 03:38, Konstantin Belousov wrote:
> > On Mon, Sep 08, 2014 at 12:59:43PM -0500, Alan Cox wrote:
> >> On 09/08/2014 03:41, Konstantin Belousov wrote:
> >>> On Sat, Sep 06, 2014 at 07:17:26PM -0500, Alan Cox wrote:
> >>>> On 09/05/2014 07:38, Konstantin Belousov wrote:
> >>>>> On Fri, Sep 05, 2014 at 01:56:40PM +0200, Pieter de Goeje wrote:
> >>>>>> Thanks, works for me!
> >>>>> I realized that the patch contains yet another bug. The oflags page
> >>>>> flags update is protected by the exclusive vm object lock, which is
> >>>>> only held in shared mode on the fast path. Below is the fixed patch,
> >>>>> where I take the page lock around setting VPO_NOSYNC (exclusive lock
> >>>>> owners cannot race with the fast path since we own the shared lock,
> >>>>> and two parallel fast path executions are handled by the page lock).
> >>>> Suppose that the page is clean and two threads are executing this
> >>>> code concurrently.  One's map entry has MAP_NOSYNC set, and the
> >>>> other's doesn't.  Let's call these threads NOSYNC and SYNC,
> >>>> respectively.
> >>>>
> >>>> Suppose that the thread SYNC is slightly ahead.  It has already
> >>>> performed "m->oflags &= ~VPO_NOSYNC;" and now it's about to perform
> >>>> "vm_page_dirty(fs.m);".  However, just before the thread SYNC calls
> >>>> vm_page_dirty(), the thread NOSYNC evaluates "m->dirty == 0", which
> >>>> is still true, and it performs "m->oflags |= VPO_NOSYNC;".
> >>>>
> >>>> This can't happen on the slow path.  That is, a fault by a thread
> >>>> without MAP_NOSYNC set on its map entry will reliably clear
> >>>> VPO_NOSYNC.
> >>> As I understand things, it is indeed not possible on the slow path,
> >>> due to PG_RW only being set from pmap_enter(), am I right?  I.e. this
> >>> is another place where the rule 'no PG_RW without PG_M' is important.
> >>
> >> Yes, it's not possible, but I'm a little confused by the rest of your
> >> question, specifically, the statement "no PG_RW without PG_M".  Did you
> >> actually mean "no PG_M without PG_RW"?
> > No, I meant what I wrote, and I was wrong.
> >
> >>
> >>> Let me formulate my question another way: what are the guarantees we
> >>> provide to applications when the same page is mapped with and
> >>> without MAP_NOSYNC simultaneously?  Is it contractually guaranteed
> >>> that any write from a !MAP_NOSYNC entry triggers a write within the
> >>> syncer activity period?
> >>
> >> Yes, that is the intent.  However, I can think of at least one case
> >> where the existing code doesn't work as intended.  Suppose that the
> >> first fault on a !MAP_NOSYNC entry is triggered by a read access.
> >> Then, vm_fault() won't call vm_page_dirty(), but it will nonetheless
> >> install a mapping in the pmap that allows write access.  Now, suppose
> >> this same process writes to the page.  Finally, suppose that the
> >> second fault happens on a MAP_NOSYNC entry.  That fault will see a
> >> clean page, i.e., m->dirty == 0, and set VPO_NOSYNC on the page, even
> >> though the first faulting process that wants the page sync'ed has
> >> dirtied the page.
> > Yes, you are right.  I convinced myself that this cannot happen, due to
> > the false assumption above, and the fact that page flushing removes the
> > write bit from the pte.
>
>
> There is another case that I've wondered about and not had the time to
> check out.  What happens when a page is mapped by an entry with
> MAP_NOSYNC and a write(2) is performed?  Is the page being flushed or not?
Do you mean write(2) to the file range represented by the page?  If so,
the page's mapping mode (or the existence of a mapping at all) is
irrelevant there.  A (delayed) write is recorded at the buffer layer as a
B_DELWRI buffer, for filesystems which use the buffer cache.  Typically,
the page flush operation also converts dirty pages into the same dirty
buffers.

It is the syncer or the buffer daemon which writes out the buffers.

>
>
> >>
> >>
> >>>> The best course of action may be to fall back to the slow path if you
> >>>> actually need to change VPO_NOSYNC's state.  Usually, you won't need
> >>>> to.
> >>>>
> >>> Let me first try to improve the original patch to handle
> >>> MAP_ENTRY_NOSYNC on the fast path as well.  It seems to be one of the
> >>> cases where parallel faults are actually useful.
> >>
> >> I think it may be time to take a step back, decide what semantics we
> >> really want, and see if there is a better way of implementing those
> >> semantics.  The current approach based on toggling VPO_NOSYNC only
> >> really works for the simplest cases.
> > Still, I believe that the current form of the patch exactly replicates
> > the existing semantics of the slow path on the fast path.
> >
> > I would propose moving vm_page_dirty() out of the scope of the page
> > lock in vm_fault_dirty(), mostly to simplify the logic, not to
> > provide any useful optimization.
>
>
> I don't think that you can do that.  It leads to the race that I
> described in my first email.

Yes, but my point is that since there are other ways for dirtiness from
!MAP_NOSYNC write accesses to be ignored, there is no new harm from the
supposedly rare race.

Anyway, my proposal to move the vm_page_dirty() from under the page
lock was to make the vm_fault_dirty() code less convoluted, not to
micro-optimize things.
>
>
> >>
> >>> One more note: the previous patch handled the m->oflags inconsistency
> >>> when setting VPO_NOSYNC, but missed the clear one line later.
> >>> I think that extending the page lock to also cover vm_page_dirty()
> >>> would fix the race you described, as well as the second manipulation
> >>> of oflags.
> >>>
> >>> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
> >>> index 30b0456..944b479 100644
> >>> --- a/sys/vm/vm_fault.c
> >>> +++ b/sys/vm/vm_fault.c
> >>> @@ -174,6 +174,70 @@ unlock_and_deallocate(struct faultstate *fs)
> >>>  	}
> >>>  }
> >>>
> >>> +static void
> >>> +vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot,
> >>> +    vm_prot_t fault_type, int fault_flags, boolean_t set_wd)
> >>> +{
> >>> +	boolean_t need_dirty;
> >>> +
> >>> +	if (((prot & VM_PROT_WRITE) == 0 &&
> >>> +	    (fault_flags & VM_FAULT_DIRTY) == 0) ||
> >>> +	    (m->oflags & VPO_UNMANAGED) != 0)
> >>> +		return;
> >>> +
> >>> +	VM_OBJECT_ASSERT_LOCKED(m->object);
> >>> +
> >>> +	need_dirty = ((fault_type & VM_PROT_WRITE) != 0 &&
> >>> +	    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
> >>> +	    (fault_flags & VM_FAULT_DIRTY) != 0;
> >>> +
> >>> +	if (set_wd)
> >>> +		vm_object_set_writeable_dirty(m->object);
> >>> +	else
> >>> +		/*
> >>> +		 * If two callers of vm_fault_dirty() with set_wd ==
> >>> +		 * FALSE, one for the map entry with MAP_ENTRY_NOSYNC
> >>> +		 * flag set, other with flag clear, race, it is
> >>> +		 * possible for the no-NOSYNC thread to see m->dirty
> >>> +		 * != 0 and not clear VPO_NOSYNC.  Take vm_page lock
> >>> +		 * around manipulation of VPO_NOSYNC and
> >>> +		 * vm_page_dirty() call, to avoid the race and keep
> >>> +		 * m->oflags consistent.
> >>> +		 */
> >>> +		vm_page_lock(m);
> >>> +
> >>> +	/*
> >>> +	 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
> >>> +	 * if the page is already dirty to prevent data written with
> >>> +	 * the expectation of being synced from not being synced.
> >>> +	 * Likewise if this entry does not request NOSYNC then make
> >>> +	 * sure the page isn't marked NOSYNC.  Applications sharing
> >>> +	 * data should use the same flags to avoid ping ponging.
> >>> +	 */
> >>> +	if ((entry->eflags & MAP_ENTRY_NOSYNC) != 0) {
> >>> +		if (m->dirty == 0) {
> >>> +			m->oflags |= VPO_NOSYNC;
> >>> +		}
> >>> +	} else {
> >>> +		m->oflags &= ~VPO_NOSYNC;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * If the fault is a write, we know that this page is being
> >>> +	 * written NOW so dirty it explicitly to save on
> >>> +	 * pmap_is_modified() calls later.
> >>> +	 *
> >>> +	 * Also tell the backing pager, if any, that it should remove
> >>> +	 * any swap backing since the page is now dirty.
> >>> +	 */
> >>> +	if (need_dirty)
> >>> +		vm_page_dirty(m);
> >>> +	if (!set_wd)
> >>> +		vm_page_unlock(m);
> >>> +	if (need_dirty)
> >>> +		vm_pager_page_unswapped(m);
> >>> +}
> >>> +
> >>>  /*
> >>>   * TRYPAGER - used by vm_fault to calculate whether the pager for the
> >>>   *	      current object *might* contain the page.
> >>> @@ -321,11 +385,8 @@ RetryFault:;
> >>>  			vm_page_hold(m);
> >>>  			vm_page_unlock(m);
> >>>  		}
> >>> -		if ((fault_type & VM_PROT_WRITE) != 0 &&
> >>> -		    (m->oflags & VPO_UNMANAGED) == 0) {
> >>> -			vm_page_dirty(m);
> >>> -			vm_pager_page_unswapped(m);
> >>> -		}
> >>> +		vm_fault_dirty(fs.entry, m, prot, fault_type, fault_flags,
> >>> +		    FALSE);
> >>>  		VM_OBJECT_RUNLOCK(fs.first_object);
> >>>  		if (!wired)
> >>>  			vm_fault_prefault(&fs, vaddr, 0, 0);
> >>> @@ -898,42 +959,7 @@ vnode_locked:
> >>>  	if (hardfault)
> >>>  		fs.entry->next_read = fs.pindex + faultcount - reqpage;
> >>>
> >>> -	if (((prot & VM_PROT_WRITE) != 0 ||
> >>> -	    (fault_flags & VM_FAULT_DIRTY) != 0) &&
> >>> -	    (fs.m->oflags & VPO_UNMANAGED) == 0) {
> >>> -		vm_object_set_writeable_dirty(fs.object);
> >>> -
> >>> -		/*
> >>> -		 * If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
> >>> -		 * if the page is already dirty to prevent data written with
> >>> -		 * the expectation of being synced from not being synced.
> >>> -		 * Likewise if this entry does not request NOSYNC then make
> >>> -		 * sure the page isn't marked NOSYNC.  Applications sharing
> >>> -		 * data should use the same flags to avoid ping ponging.
> >>> -		 */
> >>> -		if (fs.entry->eflags & MAP_ENTRY_NOSYNC) {
> >>> -			if (fs.m->dirty == 0)
> >>> -				fs.m->oflags |= VPO_NOSYNC;
> >>> -		} else {
> >>> -			fs.m->oflags &=3D ~VPO_NOSYNC;
> >>> -		}
> >>> -
> >>> -		/*
> >>> -		 * If the fault is a write, we know that this page is being
> >>> -		 * written NOW so dirty it explicitly to save on
> >>> -		 * pmap_is_modified() calls later.
> >>> -		 *
> >>> -		 * Also tell the backing pager, if any, that it should remove
> >>> -		 * any swap backing since the page is now dirty.
> >>> -		 */
> >>> -		if (((fault_type & VM_PROT_WRITE) != 0 &&
> >>> -		    (fault_flags & VM_FAULT_CHANGE_WIRING) == 0) ||
> >>> -		    (fault_flags & VM_FAULT_DIRTY) != 0) {
> >>> -			vm_page_dirty(fs.m);
> >>> -			vm_pager_page_unswapped(fs.m);
> >>> -		}
> >>> -	}
> >>> -
> >>> +	vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, TRUE);
> >>>  	vm_page_assert_xbusied(fs.m);
> >>>
> >>>  	/*
> >>> diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
> >>> index f12b76c..a45648d 100644
> >>> --- a/sys/vm/vm_page.h
> >>> +++ b/sys/vm/vm_page.h
> >>> @@ -147,7 +147,7 @@ struct vm_page {
> >>>  	uint16_t hold_count;		/* page hold count (P) */
> >>>  	uint16_t flags;			/* page PG_* flags (P) */
> >>>  	uint8_t aflags;			/* access is atomic */
> >>> -	uint8_t oflags;			/* page VPO_* flags (O) */
> >>> +	uint8_t oflags;			/* page VPO_* flags (OM) */
> >>>  	uint8_t	queue;			/* page queue index (P,Q) */
> >>>  	int8_t psind;			/* pagesizes[] index (O) */
> >>>  	int8_t segind;
> >>> @@ -163,8 +163,9 @@ struct vm_page {
> >>>  /*
> >>>   * Page flags stored in oflags:
> >>>   *
> >>> - * Access to these page flags is synchronized by the lock on the object
> >>> - * containing the page (O).
> >>> + * Access to these page flags is synchronized by the exclusive lock on
> >>> + * the object containing the page, or combination of shared object
> >>> + * lock and the page lock (OM).
> >>>   *
> >>>   * Note: VPO_UNMANAGED (used by OBJT_DEVICE, OBJT_PHYS and OBJT_SG)
> >>>   * 	 indicates that the page is not under PV management but



