From: Konstantin Belousov
To: Pavlo
Date: Wed, 4 Jul 2012 12:45:01 +0300
Message-ID: <20120704094501.GJ2337@deviant.kiev.zoral.com.ua>
In-Reply-To: <1480.1341393955.14971952305407262720@ffe6.ukr.net>
Cc: freebsd-fs@freebsd.org
Subject: Re: mmap() incoherency on hi I/O load (FS is zfs)

On Wed, Jul 04, 2012 at 12:25:55PM +0300, Pavlo wrote:
> --- Original message ---
> From: "Konstantin Belousov"
> To: "Pavlo"
> Date: 4 July 2012, 12:06:44
> Subject: Re: mmap() incoherency on hi I/O load (FS is zfs)
>
> > On Wed, Jul 04, 2012 at 11:07:36AM +0300, Pavlo wrote:
> > > --- Original message ---
> > > From: "Pavlo"
> > > To: freebsd-fs@freebsd.org
> > > Date: 14 June 2012, 13:30:20
> > > Subject: mmap() incoherency on hi I/O load (FS is zfs)
> > >
> > > > There is a case where some parts of files that are mapped and
> > > > then modified get corrupted. By corrupted I mean that some data
> > > > is OK (the data written using write()/pwrite()), but some looks
> > > > like it never existed. It seems it was in buffers for some time,
> > > > while several processes simultaneously (access was synchronised,
> > > > of course) used the shared pages and reported its existence. But
> > > > after some time passed, the processes screamed that it was now
> > > > lost. Only the part of the data written with pwrite() was there.
> > > > Everything that was written via mmap() is zero.
> > > >
> > > > As I said, it occurs under high I/O load, when 4+ background
> > > > processes are indexing a huge amount of data.
> > > > Also I want to note that it never occurred in the life of our
> > > > project while we used mmap() under the same I/O stress
> > > > conditions, when the mapping covered a whole file or just a part
> > > > (the header) starting from the beginning of the file. The first
> > > > time we mapped individual pages, just to save RAM, this popped up.
> > > >
> > > > The solution for this problem is msync() before any munmap().
> > > > But the man page says:
> > > >
> > > > > The msync() system call is usually not needed since BSD
> > > > > implements a coherent file system buffer cache. However, it
> > > > > may be used to associate dirty VM pages with file system
> > > > > buffers and thus cause them to be flushed to physical media
> > > > > sooner rather than later.
> > > >
> > > > Any thoughts? Thanks.
> > >
> > > So I tracked the issue to the place where it occurs. When I commit
> > > data to a file using mmap() and pwrite() side by side, sometimes the
> > > 'newest data' is overwritten by 'older data'. The 'older data' can
> > > be something written either with mmap() or with pwrite(). It never
> > > happens when I use exclusively mmap() or exclusively pwrite(). This
> > > issue reproduces on UFS as well. I think there is a problem keeping
> > > mmapped pages and the FS cache in sync.
> >
> > I am curious how you label data as newer or older.
>
> I have a list header like:
>
> struct XXX
> {
>     uint32_t alloc_size;
>     uint32_t list_size;
>     node_t node[1];
> };
>
> First I init it with pwrite(), setting for example alloc_size to 10
> and everything else to 0.
>
> Then I add elements with mmap().
>
> 1. Workers log the elements' existence...
> 2. Workers log the elements' existence...
> ... same thing for a few seconds.
> X. One of the workers cries that the list is empty.
>
> Then I inspect the core file and see that the list looks as if it was
> just initialised with pwrite(), i.e. alloc_size equals 10 and
> everything else is 0.
> It is hard to reproduce because it happens only under really high I/O
> load. And out of tens of thousands of such files, only a couple get
> corrupted.
>
> > I do admit a possibility of a race in the ZFS double-copy
> > implementation of the mmap/cache coherency, but I am somewhat
> > skeptical about the same possibility for UFS. What you are saying
> > might indicate that we lose modified/dirty bits for the page, but
> > that would cause many more fireworks than just an occasional race
> > with write.
> >
> > What version of the system? Does the machine swap?

You just ignored these questions.

> Okay, after msync() helped but didn't fix the issue (it just reduced
> the occurrence), I did the next thing: I tracked modification of
> mmapped pages using mprotect(). At the end of a session, before
> munmap(), I saved the modified pages, then called munmap(), then wrote
> those pages back to disk.
>
> Later a worker accessed those pages again with mmap(), modified them,
> and for some parts of those pages did read() instead of accessing them
> via mmap(). What read() returned was the data committed in the previous
> session with write(), not the data that was just modified by the same
> process via mmap(). We reproduce this again and again on UFS on
> FreeBSD, and only under high I/O load. We could never reproduce this
> on Linux (ext4).

So you are saying that the following sequence:
1. write at offset X
2. write into the shared mapping of the same file at offset X
3. read at offset X
performed by a single thread can return the data from point (1) instead
of the data from point (2)? Knowing how write is implemented for UFS, I
find this quite impossible.

If the actions are executed in different processes/threads, say
process 1 executes (1, 2) and process 2 executes (3), or process 1
executes (1) and process 2 executes (2, 3), then my first guess would
be a lack of proper synchronization between the actions. This would
indeed make possible exactly the outcome I described.
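For reference, a minimal single-threaded sketch of steps (1)-(3) could look like the following. The function name, the payload strings, and the use of pread() in place of lseek() plus read() are assumptions made for illustration; none of this is taken from the reporter's code, and it does not generate the background I/O load under which the problem was reported.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): runs steps (1)-(3) once in
 * the calling thread. Returns 0 if the read saw the data stored through
 * the mapping, 1 if it saw the older pwrite() data, -1 on a syscall error.
 */
int
check_mmap_read_coherency(const char *path, off_t off)
{
    static const char older[] = "written-by-pwrite";
    static const char newer[] = "written-by-mmap!!";
    char buf[sizeof(newer)];
    long pagesize = sysconf(_SC_PAGESIZE);
    off_t map_off;
    size_t delta, map_len;
    char *p;
    int fd, ret = -1;

    /* Both payloads must have the same length for the comparison. */
    assert(sizeof(older) == sizeof(newer));
    if (pagesize <= 0 || off < 0)
        return (-1);

    /* mmap() needs a page-aligned file offset, so map from the page start. */
    map_off = off - (off % pagesize);
    delta = (size_t)(off - map_off);
    map_len = delta + sizeof(newer);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return (-1);

    /* (1) write at offset X; this also extends the file to cover the mapping. */
    if (pwrite(fd, older, sizeof(older), off) != (ssize_t)sizeof(older))
        goto out_close;

    /* (2) write into the shared mapping of the same file at offset X. */
    p = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, map_off);
    if (p == MAP_FAILED)
        goto out_close;
    memcpy(p + delta, newer, sizeof(newer));

    /* (3) read at offset X, deliberately without msync() in between. */
    if (pread(fd, buf, sizeof(buf), off) == (ssize_t)sizeof(buf))
        ret = (memcmp(buf, newer, sizeof(newer)) == 0) ? 0 : 1;

    munmap(p, map_len);
out_close:
    close(fd);
    return (ret);
}
```

With a coherent buffer cache the function is expected to return 0. A return value of 1 from a single thread would correspond to the outcome described as "quite impossible" above. To approximate the reported conditions, it would have to be called repeatedly on many files while other processes generate heavy I/O, which this sketch does not attempt.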
> > > I will try to make a test that reliably reproduces the issue.
> >
> > Yes, an isolated test case is the best route forward. It would
> > either show a bug or demonstrate a misunderstanding on your part.
>
> I am trying, but it's really hard to make an example that reproduces
> this issue.

This seems to be the only way forward, at least for you.
And do answer the version/swap question.

> Thanks for the reply.