From owner-freebsd-fs@FreeBSD.ORG Thu Nov 28 19:19:04 2013 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 13168760 for ; Thu, 28 Nov 2013 19:19:04 +0000 (UTC) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 0F8551FF3 for ; Thu, 28 Nov 2013 19:19:02 +0000 (UTC) Received: from tom.home (kostik@localhost [127.0.0.1]) by kib.kiev.ua (8.14.7/8.14.7) with ESMTP id rASJIppo069406; Thu, 28 Nov 2013 21:18:51 +0200 (EET) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.8.3 kib.kiev.ua rASJIppo069406 Received: (from kostik@localhost) by tom.home (8.14.7/8.14.7/Submit) id rASJIpw8069405; Thu, 28 Nov 2013 21:18:51 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 28 Nov 2013 21:18:51 +0200 From: Konstantin Belousov To: Rick Macklem Subject: Re: RFC: NFS client patch to reduce sychronous writes Message-ID: <20131128191851.GM59496@kib.kiev.ua> References: <20131128072435.GI59496@kib.kiev.ua> <820090900.22521963.1385648137865.JavaMail.root@uoguelph.ca> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="rNM3Wo3yRXFr03ac" Content-Disposition: inline In-Reply-To: <820090900.22521963.1385648137865.JavaMail.root@uoguelph.ca> User-Agent: Mutt/1.5.22 (2013-10-16) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no version=3.3.2 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on tom.home Cc: Kirk McKusick , FreeBSD FS X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.16 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 28 Nov 2013 19:19:04 -0000 --rNM3Wo3yRXFr03ac Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Nov 28, 2013 at 09:15:37AM -0500, Rick Macklem wrote: > Kostik wrote: > > On Wed, Nov 27, 2013 at 07:19:47PM -0500, Rick Macklem wrote: > > > Kirk wrote: > > > > > Date: Wed, 27 Nov 2013 17:50:48 -0500 (EST) > > > > > From: Rick Macklem > > > > > To: Konstantin Belousov > > > > > Subject: Re: RFC: NFS client patch to reduce sychronous writes > > > > >=20 > > > > > Kostik wrote: > > > > >> Sorry, I do not understand the question. mmap(2) itself does > > > > >> not > > > > >> change > > > > >> file size. But if mmaped area includes the last page, I still > > > > >> think > > > > >> that the situation I described before is possible. > > > > >=20 > > > > > Yes, I'll need to look at this. If it is a problem, all I can > > > > > think > > > > > of > > > > > is bzeroing all new pages when they're allocated to the buffer > > > > > cache. > > > > >=20 > > > > > Thanks for looking at it, rick > > > > > ps: Btw, jhb@'s patch didn't have the bzeroing in it. > > > >=20 > > > > The ``fix'' of bzero'ing every buffer cache page was made to > > > > UFS/FFS > > > > for this problem and it killed write performance of the > > > > filesystem > > > > by nearly half. We corrected this by only doing the bzero when > > > > the > > > > file is mmap'ed which helped things considerably (since most > > > > files > > > > being written are not also bmap'ed). > > > >=20 > > > > Kirk > > > >=20 > > > Ok, thanks. I've been trying to reproduce the problem over NFS and > > > haven't been able to break my patch. I was using the attached > > > trivial > > > test program and would simply make a copy of the source file (529 > > > bytes) > > > to test on. I got the same results both locally and over NFS: > > > - built without -DWRITEIT, the setting of a value after EOF would > > > be > > > lost, because nothing grew the file from 529 bytes to over > > > 4080bytes. > > > - built with -DWRITEIT, both the 'A' and 'B' are in the result, > > > since > > > my patch bzeros the grown segment in the write(2) syscall. > > >=20 > > > - If I move the write (code in #ifdef WRITEIT) to after the "*cp" > > > of the mapped page, the 'A' assigned to "*cp" gets lost for > > > both UFS and NFS. > > > Is this correct behaviour? > > >=20 > > > If it is correct behaviour, I can't see how the patch is broken, > > > but > > > if you think it might still be, I'll look at doing what Kirk > > > suggests, > > > which is bzeroing all new buffer cache pages when the file is > > > mmap()d. > > >=20 > > Replying there, since text description is more informative than the > > code. > >=20 > > You cannot get the situation I described, with single process. > > You should have a writer in one thread, and reader through the mmaped > > area in another. Even than, the race window is thin. > >=20 > > Let me describe the issue which could exist one more time: > >=20 > > Thread A (writer) issued write(2). The kernel does two things: > > 1. zeroes part of the last buffer of the affected file. > > 2. kernel uiomove()s the write data into the buffer b_data. > >=20 > > Now, assume that thread B has the same file mmaped somewhere, and > > accesses the page of the buffer after the [1] but before [2]. Than, > > it would see zeroes instead of the valid content. > >=20 > > I said that this breaks write/mmap consistency, since thread B can > > see a content in the file which was never written there. The > > condition > > is transient, it self-repairs after thread A passes point 2. > >=20 > Ok, but doesn't that exist now? >=20 > Without the patch, when the Thread A (writer) appends to the file, > the np->n_size is grown and vnode_pager_setsize() is called with > the new, larger size, followed by an allocbuf(). > { at around line# 1066,1073 of nfs_clbio.c. Same code exists in the > old client and was cloned. } > Then vn_io_fault_uiomove() is called at line#1195. >=20 > Without the patch, Thread B would get whatever garbage is in the > page(s) if it accesses the latter (just grown) part of the buffer between > 1 and 2, would it not? >=20 > With the patch, it will most likely get 0s instead of garbage, if > it accesses the latter (just grown) part of the buffer between 1. and 2. > There is actually a short time just after the vnode_pager_setsize() > and before the bzeroing, that it would still get garbage. > (Note that the patch only zeros out the part of the last page that > was "grown" by the write in 1. that is past the previous EOF. > If I bzeroing data that was before the previous EOF, I can see > the breakage, but the patch doesn't do that. Unless I've written > buggy code, of course, but I think it's correct.) >=20 > So, I don't see any difference between the unpatched and patched > version? AFAIU, you bzero the whole region in the buffer which is subject to i/o, not the new extent. E.g., if the file size mod biosize is non-zero, but the current write starts on the buffer boundary, than then range from the buffer start to the old EOF is zeroed for some moment, before being overwritten by user data. --rNM3Wo3yRXFr03ac Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (FreeBSD) iQIcBAEBAgAGBQJSl5caAAoJEJDCuSvBvK1BDxoQAJ3izrlf3y3xpjTglmsFXGJ0 AbLjaiptVr35dLPK0cPMuia/zScjmOPx1gZCjg4AZVuQ6dPMqsadIIhORDu+ekGF r2RaTPW570P0a0IDoAkZwpNVhBf8WnJ9SEvt9N43H6DaEVVoRSGtwqSlV/vbk10o qC4B//QhRlNpQa4drP1AJRd1DPCKOzjTmPnWeSYJiYmp8x2aZUWfb3N6OykG8m/f o9O4Y20YrIg9gAdw207niunH3umzanrV40TDzH+8BGs66mUrH2rmwNZbbdAW8SEE fZkOgG76Ixc8OhHfT1GT6xQbfr925lh9aa+ODAMQ4SDzWvfCRGA0H24iLCXe+LCy GVBeTCO1B7y5O271a3BunxzvuRuBeo0fELh0wPQo365iM2WJ9Nk8C4tCCuxqNNlQ aCBymFehLKekCUTK3plhMoxmy5tnf3y8s9rkZdg9j24ATGUrGKFAMZZ69AEeyGS7 qFV5BDGsAWWyXg3xVmZhgPVcMhs1KyX4kkoAloAbY/nkdtvmFSdE2NpFGZrNdm8Q zBeJRV1P4g6mKuSXHF8FDLsqYMxdtwhsRtXJYOpdvqxU6wDEwEdR4LeIdxMFU9n7 YY+81si4GPZe0cxE5Yz0Az4obyhLYbj+7DQ5nWBWgWjuOLvuuaE3Eq0IZqrmSkEm Hbe6CVwMdQPDoxdGuFiq =EGIt -----END PGP SIGNATURE----- --rNM3Wo3yRXFr03ac--