From: Kostik Belousov <kostikbel@gmail.com>
To: Andrey Zonov
Cc: alc@freebsd.org, freebsd-stable@freebsd.org, Jeremy Chadwick
Subject: Re: directory listing hangs in "ufs" state
Date: Thu, 15 Dec 2011 15:01:11 +0200
Message-ID: <20111215130111.GN50300@deviant.kiev.zoral.com.ua>

On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick wrote:
>
> > On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
> > > On 14.12.2011 22:22, Jeremy Chadwick wrote:
> > > > On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
> > > >> Hi Jeremy,
> > > >>
> > > >> This is not a hardware problem; I've already checked that. I also
> > > >> ran fsck today and got no errors.
> > > >>
> > > >> After some more exploration of how mongodb works, I found that when
> > > >> the listing hangs, one of the mongodb threads is in the "biowr"
> > > >> state for a long time. According to the ktrace output, it
> > > >> periodically calls msync(MS_SYNC).
> > > >>
> > > >> If I remove the msync() calls from mongodb, how often will the data
> > > >> be synced by the OS?
> > > >>
> > > >> --
> > > >> Andrey Zonov
> > > >>
> > > >> On 14.12.2011 2:15, Jeremy Chadwick wrote:
> > > >>> On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
> > > >>>>
> > > >>>> Do you have any ideas what is going on, or how to catch the
> > > >>>> problem?
> > > >>>
> > > >>> Assuming this isn't a file on the root filesystem, try booting the
> > > >>> machine in single-user mode and using "fsck -f" on the filesystem
> > > >>> in question.
> > > >>>
> > > >>> Can you verify there are no problems with the disk this file lives
> > > >>> on as well (smartctl -a /dev/disk)? I doubt this is the problem,
> > > >>> but thought I'd mention it.
> > > >
> > > > I have no real answer, I'm sorry. msync(2) indicates it's effectively
> > > > deprecated (see BUGS). It looks like this is effectively an mmap
> > > > version of fsync(2).
> > >
> > > I replaced msync(2) with fsync(2). Unfortunately, from the man pages
> > > it is not obvious that I can do this. Anyway, thanks.
> >
> > Sorry, that wasn't what I was implying. Let me try to explain
> > differently.
> >
> > msync(2) looks, to me, like an mmap-specific version of fsync(2). Based
> > on the man page, it seems that with msync() you can effectively
> > guarantee flushing of certain pages within an mmap()'d region to disk.
> > fsync() would cause **all** buffers/internal pages to be flushed to
> > disk.
> >
> > One would need to look at the mongodb code to find out what it's
> > actually doing with msync(). That is to say, if it's doing something
> > like this (I probably have the semantics wrong -- I've never spent much
> > time with mmap()):
> >
> >     fd = open("/some/file", O_RDWR);
> >     ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> >     ret = msync(ptr, 65536, MS_SYNC);
> >     /* the length must cover the pages to flush; there is no
> >        "length 0 means the whole mapping" form in POSIX */
> >
> > Then this, to me, would be mostly equivalent to:
> >
> >     fd = open("/some/file", O_RDWR);
> >     ret = fsync(fd);
> >
> > Otherwise, if it's calling msync() only on an address/location within
> > the region ptr points to, then that may be more efficient (fewer pages
> > to flush).
> >
> They call msync() for the whole file, so there will not be any
> difference.
>
> > The mmap() arguments -- specifically flags (see man page) -- also play
> > a role here. The one that catches my attention is MAP_NOSYNC. So you
> > may need to look at the mongodb code to figure out what its mmap()
> > call is.
> >
> > One might wonder why they don't just use open() with O_SYNC. I imagine
> > that has to do with, again, performance; possibly they don't want all
> > I/O synchronous, and would rather flush certain pages in the mmap'd
> > region to disk as needed. I see the legitimacy in that approach (vs.
> > just using O_SYNC).
> >
> > There's really no easy way for me to tell you which is more efficient,
> > better, etc. without spending a lot of time with a benchmarking program
> > that tests all of this, *plus* an entire system (world) built with
> > profiling.
> >
> I ran mongodb with fsync() for two hours and got the following:
>                     STARTED   INBLK    OUBLK  MAJFLT   MINFLT
> Thu Dec 15 10:34:52 2011          3   192744     314  3080182
>
> This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U
> mongodb'.
>
> Then I ran it with the default msync():
>                     STARTED   INBLK    OUBLK  MAJFLT   MINFLT
> Thu Dec 15 12:34:53 2011          0  7241555      79  5401945
>
> There are also two graphs of disk busy time [1] [2].
>
> The difference is significant: a factor of 37! That is what I expected
> to get.
>
> In the comments for vm_object_page_clean() I found this:
>
>  * When stuffing pages asynchronously, allow clustering.  XXX we need a
>  * synchronous clustering mode implementation.
>
> To me this means that msync(MS_SYNC) flushes each page to disk in its
> own I/O transaction. If we multiply 4K by 37 we get about 150K, which
> matches the size of a single clustered transaction in my experiment.
>
> +alc@, kib@
>
> Am I right? Is there any plan to implement this?

The current buffer clustering code can do only async writes. In fact, I am
not quite sure what would constitute sync clustering, because the ability
to delay a write is essential for being able to cluster at all.

Also, I am not sure that the lack of clustering is the biggest problem.
IMO, the fact that each write is synchronous is the first problem there.
It would be quite a lot of work to add tracking of the issued writes to
vm_object_page_clean() and down the stack, especially due to the custom
page-write VOPs in several filesystems.
The only guarantee that POSIX requires from msync(MS_SYNC) is that the
writes have finished when the syscall returns, not that the writes are
issued synchronously. Below is a hack which should help if the msync()ed
region covers the mapping of the whole file, since it is then possible to
fsync() the file after all writes have been scheduled asynchronously. It
causes an unneeded metadata update, but I think it would still be much
faster.

diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 250b769..a9de554 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
 	vm_object_t backing_object;
 	struct vnode *vp;
 	struct mount *mp;
-	int flags;
+	int flags, fsync_after;
 
 	if (object == NULL)
 		return;
@@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
 		(void) vn_start_write(vp, &mp, V_WAIT);
 		vfslocked = VFS_LOCK_GIANT(vp->v_mount);
 		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
-		flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
-		flags |= invalidate ? OBJPC_INVAL : 0;
+		if (syncio && !invalidate && offset == 0 &&
+		    OFF_TO_IDX(size) == object->size) {
+			/*
+			 * If syncing the whole mapping of the file,
+			 * it is faster to schedule all the writes in
+			 * async mode, also allowing the clustering,
+			 * and then wait for i/o to complete.
+			 */
+			flags = 0;
+			fsync_after = TRUE;
+		} else {
+			flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
+			flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
+			fsync_after = FALSE;
+		}
 		VM_OBJECT_LOCK(object);
 		vm_object_page_clean(object, offset, offset + size, flags);
 		VM_OBJECT_UNLOCK(object);
+		if (fsync_after)
+			(void) VOP_FSYNC(vp, MNT_WAIT, curthread);
 		VOP_UNLOCK(vp, 0);
 		VFS_UNLOCK_GIANT(vfslocked);
 		vn_finished_write(mp);