Date:      Wed, 4 Jul 2012 12:45:01 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Pavlo <devgs@ukr.net>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: mmap() incoherency on hi I/O load (FS is zfs)
Message-ID:  <20120704094501.GJ2337@deviant.kiev.zoral.com.ua>
In-Reply-To: <1480.1341393955.14971952305407262720@ffe6.ukr.net>
References:  <20120704090633.GH2337@deviant.kiev.zoral.com.ua> <91943.1339669820.1305529125424791552@ffe15.ukr.net> <23856.1341389256.6316487571580649472@ffe17.ukr.net> <1480.1341393955.14971952305407262720@ffe6.ukr.net>


On Wed, Jul 04, 2012 at 12:25:55PM +0300, Pavlo wrote:
>
>   --- Original message ---
>  From: "Konstantin Belousov" <kostikbel@gmail.com>
>  To: "Pavlo" <devgs@ukr.net>
>  Date: 4 July 2012, 12:06:44
>  Subject: Re: mmap() incoherency on hi I/O load (FS is zfs)
>
> > On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote:
> > >
> > > --- Original message ---
> > > From: "Pavlo" <devgs@ukr.net>
> > > To: freebsd-fs@freebsd.org
> > > Date: 14 June 2012, 13:30:20
> > > Subject: mmap() incoherency on hi I/O load (FS is zfs)
> > >
> > > > There's a case where some parts of files that are mapped and then
> > > > modified get corrupted. By corrupted I mean some data is ok (the
> > > > part that was written using write()/pwrite()) but some looks like
> > > > it never existed. It is as if the data sat in buffers for some
> > > > time, while several processes simultaneously (access was
> > > > synchronised, of course) used the shared pages and reported its
> > > > existence, but after some time passed the processes screamed that
> > > > it is now lost. Only the part of the data written with pwrite() was
> > > > there; everything that was written via mmap() is zero.
> > > >
> > > > So, as I said, it occurs under high I/O load, when 4+ background
> > > > processes are indexing a huge amount of data. I also want to note
> > > > that it never occurred in the life of our project while we used
> > > > mmap() under the same I/O stress conditions, as long as the mapping
> > > > was done for the whole file or just a part (the header) starting
> > > > from the beginning of the file. The first time we used mapping of
> > > > individual pages, just to save RAM, this popped up.
> > > >
> > > > The solution for this problem is an msync() before any munmap().
> > > > But the man page says:
> > > >
> > > The msync() system call is usually not needed since BSD implements a
> > > coherent file system buffer cache.  However, it may be used to
> > > associate dirty VM pages with file system buffers and thus cause them
> > > to be flushed to physical media sooner rather than later.
> > > >
> > > > Any thoughts? Thanks.
> > > >
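> > > > For reference, a minimal sketch of that msync()-before-munmap()
> > > > pattern (addr and len here stand for whatever shared mapping is
> > > > about to be dropped):
> > > >
> > > > #include <sys/mman.h>
> > > > #include <err.h>
> > > >
> > > > /* flush dirty pages of a MAP_SHARED region before dropping it */
> > > > static void
> > > > unmap_with_sync(void *addr, size_t len)
> > > > {
> > > >     if (msync(addr, len, MS_SYNC) == -1)
> > > >         err(1, "msync");
> > > >     if (munmap(addr, len) == -1)
> > > >         err(1, "munmap");
> > > > }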
> > >
> > > So I tracked the issue to the place where it occurs. When I commit
> > > data to a file using mmap() and pwrite() side by side, sometimes
> > > 'newer data' is overwritten by 'older data'. From time to time the
> > > 'older data' can be something written either with mmap() or with
> > > pwrite(). It never happens when I use exclusively mmap() or
> > > exclusively pwrite(). This issue also reproduces on UFS. I think
> > > there is a problem keeping mmapped pages and the FS cache in sync.
> > I am curious how you label the data as newer and older.
>
> I have a list header like:
>
> struct XXX
> {
>     uint32_t alloc_size;
>     uint32_t list_size;
>     node_t   node[1];
> };
>
> First I initialise it with pwrite(), setting, for example, alloc_size to
> 10 and everything else to 0.
>
> Then I add elements with mmap():
>
> 1. Workers log the elements' existence...
> 2. Workers log the elements' existence...
> ... the same thing for a few seconds.
> X. One of the workers cries that the list is empty.
>
> Then I inspect the core file and see that the list looks as if it had
> just been initialised with pwrite(), i.e. alloc_size equals 10 and
> everything else is 0. It is hard to reproduce because it happens only
> under really high I/O load, and out of tens of thousands of such files
> only a couple get corrupted.
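>
> Roughly, the pattern looks like this (the file name, sizes and node_t
> here are simplified stand-ins for what we really use):
>
> #include <sys/mman.h>
> #include <err.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <string.h>
> #include <unistd.h>
>
> typedef struct { uint64_t key; } node_t;   /* stand-in for the real node */
>
> struct XXX
> {
>     uint32_t alloc_size;
>     uint32_t list_size;
>     node_t   node[1];
> };
>
> int
> main(void)
> {
>     /* alloc_size = 10 -> room for 10 nodes in total */
>     size_t len = sizeof(struct XXX) + 9 * sizeof(node_t);
>
>     int fd = open("list.dat", O_RDWR | O_CREAT, 0644);
>     if (fd == -1 || ftruncate(fd, (off_t)len) == -1)
>         err(1, "open/ftruncate");
>
>     /* initialise the header with pwrite(): alloc_size = 10, the rest 0 */
>     struct XXX hdr;
>     memset(&hdr, 0, sizeof(hdr));
>     hdr.alloc_size = 10;
>     if (pwrite(fd, &hdr, sizeof(hdr), 0) != (ssize_t)sizeof(hdr))
>         err(1, "pwrite");
>
>     /* append one element through a shared mapping */
>     struct XXX *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
>         MAP_SHARED, fd, 0);
>     if (m == MAP_FAILED)
>         err(1, "mmap");
>     m->node[m->list_size].key = 42;
>     m->list_size++;
>
>     munmap(m, len);
>     close(fd);
>     return 0;
> }
>
> (In the real code each worker maps only the individual pages it touches,
> not the whole file.)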
>
> >
> > I do admit the possibility of a race in the ZFS double-copy
> > implementation of the mmap/cache coherency, but I am somewhat skeptical
> > about the same possibility for UFS. What you are saying might indicate
> > that we lose the modified/dirty bits for the page, but that would cause
> > much more fireworks than just an occasional race with write.
> >
> > What version of the system? Does the machine swap?
You just ignored these ^^^^^^^^^^^^ questions.

>
> Okay, after msync() helped but didn't fix the issue (it just reduced the
> occurrence), I did the next thing: I tracked modification of the mmapped
> pages using mprotect(). At the end of the session, before munmap(), I
> saved the modified pages, then called munmap(), and then wrote those
> pages back to disk.
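>
> Roughly, the tracking looked like this (a simplified sketch: the file
> name, size and the single test store are made up, most error handling is
> trimmed, and the write-fault handler is just one way to do what I
> described):
>
> #include <sys/mman.h>
> #include <err.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> static char *base;            /* start of the shared mapping */
> static size_t maplen;         /* mapping length, multiple of the page size */
> static unsigned char *dirty;  /* one flag per page */
> static long pgsz;
>
> /* first write to a read-only page lands here: mark it dirty, unlock it */
> static void
> on_fault(int sig, siginfo_t *si, void *ctx)
> {
>     char *p = (char *)si->si_addr;
>     (void)sig; (void)ctx;
>     if (p < base || p >= base + maplen)
>         abort();                          /* not our mapping: a real crash */
>     size_t i = (size_t)(p - base) / (size_t)pgsz;
>     dirty[i] = 1;
>     mprotect(base + i * (size_t)pgsz, (size_t)pgsz, PROT_READ | PROT_WRITE);
> }
>
> int
> main(void)
> {
>     pgsz = sysconf(_SC_PAGESIZE);
>     maplen = (size_t)pgsz * 4;
>
>     int fd = open("session.dat", O_RDWR | O_CREAT, 0644);
>     if (fd == -1 || ftruncate(fd, (off_t)maplen) == -1)
>         err(1, "open/ftruncate");
>
>     /* map read-only so the first write to each page is caught */
>     base = mmap(NULL, maplen, PROT_READ, MAP_SHARED, fd, 0);
>     if (base == MAP_FAILED)
>         err(1, "mmap");
>     dirty = calloc(maplen / (size_t)pgsz, 1);
>
>     struct sigaction sa;
>     memset(&sa, 0, sizeof(sa));
>     sigemptyset(&sa.sa_mask);
>     sa.sa_sigaction = on_fault;
>     sa.sa_flags = SA_SIGINFO;
>     sigaction(SIGSEGV, &sa, NULL);
>     sigaction(SIGBUS, &sa, NULL);         /* the fault may arrive as SIGBUS */
>
>     base[100] = 'x';                      /* a modification; faults once */
>
>     /* end of session: save the dirty pages, munmap(), write them back */
>     char *save = malloc(maplen);
>     memcpy(save, base, maplen);
>     munmap(base, maplen);
>     for (size_t i = 0; i < maplen / (size_t)pgsz; i++)
>         if (dirty[i])
>             pwrite(fd, save + i * (size_t)pgsz, (size_t)pgsz,
>                 (off_t)(i * (size_t)pgsz));
>     close(fd);
>     return 0;
> }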
>
> Later, a worker accessed those pages again with mmap(), modified them,
> and for some parts of those pages did a read() instead of accessing them
> via mmap(). What read() returned was the data committed in the previous
> session with write(), but not the data that had just been modified by the
> same process via mmap(). We reproduced this again and again on UFS on
> FreeBSD, and only under high I/O load. However, we could never reproduce
> it on Linux (ext4).
>
So you are saying that the following sequence:
	1. write at offset X
	2. write into the shared mapping of the same file at offset X
	3. read at offset X
performed by a single thread can return the data from point (1) instead of
the data from point (2)?
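
For concreteness, a minimal single-threaded sketch of that sequence (the
file name, size and offset are arbitrary); the claimed bad outcome would be
the final pread() returning 'A' instead of 'B':

	#include <sys/mman.h>
	#include <err.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int
	main(void)
	{
		int fd = open("coherency.dat", O_RDWR | O_CREAT, 0644);
		if (fd == -1 || ftruncate(fd, 4096) == -1)
			err(1, "open/ftruncate");

		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			err(1, "mmap");

		if (pwrite(fd, "A", 1, 100) != 1)	/* 1. write at offset X */
			err(1, "pwrite");
		p[100] = 'B';				/* 2. store via the shared mapping */
		char c;
		if (pread(fd, &c, 1, 100) != 1)		/* 3. read at offset X */
			err(1, "pread");
		printf("read back: %c (expected B)\n", c);
		return 0;
	}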

Knowing how write is implemented for UFS, I find this quite impossible.

If the actions are executed in different processes/threads, say process 1
executes (1, 2) and process 2 executes (3), or process 1 executes (1) and
process 2 executes (2, 3), then my first guess would be a lack of proper
synchronization between the actions. This would indeed make possible
exactly the outcome I described.
> >
> > >
> > > I will try to make a test that reliably reproduces the issue.
> > Yes, an isolated test case is the best route forward. It would either
> > show a bug or demonstrate a misunderstanding on your part.
>
> I am trying, but it's really hard to make an example that reproduces this
> issue.
This seems to be the only way forward, at least for you.
And do answer the version/swap questions.

>
> Thanks for the reply.



