Date:      Fri, 23 Dec 2011 01:39:22 -0600
From:      Alan Cox <alc@rice.edu>
To:        Kostik Belousov <kostikbel@gmail.com>
Cc:        alc@freebsd.org, Andrey Zonov <andrey@zonov.org>,
           freebsd-stable@freebsd.org, Jeremy Chadwick <freebsd@jdc.parodius.com>
Subject:   Re: directory listing hangs in "ufs" state
Message-ID:  <4EF4302A.1080708@rice.edu>
In-Reply-To: <20111222094836.GD50300@deviant.kiev.zoral.com.ua>
References:  <4EE7BF77.5000504@zonov.org>
             <20111213221501.GA85563@icarus.home.lan>
             <4EE8E6E3.7050202@zonov.org>
             <20111214182252.GA5176@icarus.home.lan>
             <4EE8FD3E.8030902@zonov.org>
             <20111214204201.GA7372@icarus.home.lan>
             <CANU_PUGtjjxP-qLjEqb2wVnL_QGJvtApnaD8SSF4zLksY4ME6A@mail.gmail.com>
             <20111215130111.GN50300@deviant.kiev.zoral.com.ua>
             <4EF21146.9010107@zonov.org>
             <20111222094836.GD50300@deviant.kiev.zoral.com.ua>
On 12/22/2011 03:48, Kostik Belousov wrote:
> On Wed, Dec 21, 2011 at 09:03:02PM +0400, Andrey Zonov wrote:
>> On 15.12.2011 17:01, Kostik Belousov wrote:
>>> On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
>>>> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
>>>> <freebsd@jdc.parodius.com> wrote:
>>>>
>>>>> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
>>>>>> On 14.12.2011 22:22, Jeremy Chadwick wrote:
>>>>>>> On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
>>>>>>>> Hi Jeremy,
>>>>>>>>
>>>>>>>> This is not a hardware problem, I've already checked that. I also
>>>>>>>> ran fsck today and got no errors.
>>>>>>>>
>>>>>>>> After some more exploration of how mongodb works, I found that when
>>>>>>>> the listing hangs, one of the mongodb threads is in "biowr" state
>>>>>>>> for a long time. According to the ktrace output, it periodically
>>>>>>>> calls msync(MS_SYNC).
>>>>>>>>
>>>>>>>> If I remove the msync() calls from mongodb, how often will the data
>>>>>>>> be synced by the OS?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrey Zonov
>>>>>>>>
>>>>>>>> On 14.12.2011 2:15, Jeremy Chadwick wrote:
>>>>>>>>> On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
>>>>>>>>>> Have you any ideas what is going on, or how to catch the problem?
>>>>>>>>> Assuming this isn't a file on the root filesystem, try booting the
>>>>>>>>> machine in single-user mode and using "fsck -f" on the filesystem
>>>>>>>>> in question.
>>>>>>>>>
>>>>>>>>> Can you verify there are no problems with the disk this file lives
>>>>>>>>> on as well (smartctl -a /dev/disk)? I'm doubting this is the
>>>>>>>>> problem, but thought I'd mention it.
>>>>>>> I have no real answer, I'm sorry. msync(2) indicates it's effectively
>>>>>>> deprecated (see BUGS). It looks like it is effectively an mmap
>>>>>>> version of fsync(2).
>>>>>> I replaced msync(2) with fsync(2). Unfortunately, from the man pages
>>>>>> it is not obvious that I can do this. Anyway, thanks.
>>>>> Sorry, that wasn't what I was implying. Let me try to explain
>>>>> differently.
>>>>>
>>>>> msync(2) looks, to me, like an mmap-specific version of fsync(2). Based
>>>>> on the man page, it seems that with msync() you can effectively
>>>>> guarantee flushing of certain pages within an mmap()'d region to disk.
>>>>> fsync() would cause **all** buffers/internal pages to be flushed to
>>>>> disk.
>>>>>
>>>>> One would need to look at the code of mongodb to find out what it's
>>>>> actually doing with msync(). That is to say, if it's doing something
>>>>> like this (I probably have the semantics wrong -- I've never spent much
>>>>> time with mmap()):
>>>>>
>>>>> fd  = open("/some/file", O_RDWR);
>>>>> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>>>>> ret = msync(ptr, 65536, MS_SYNC);
>>>>> /* or alternatively, this:
>>>>> ret = msync(ptr, NULL, MS_SYNC);
>>>>> */
>>>>>
>>>>> Then this, to me, would be mostly the equivalent of:
>>>>>
>>>>> fd  = open("/some/file", O_RDWR);
>>>>> ret = fsync(fd);
>>>>>
>>>>> Otherwise, if it's calling msync() only on an address/location within
>>>>> the region ptr points to, then that may be more efficient (fewer pages
>>>>> to flush).
>>>>>
>>>> They call msync() for the whole file. So, there will not be any
>>>> difference.
>>>>
>>>>
>>>>> The mmap() arguments -- specifically flags (see man page) -- also play
>>>>> a role here. The one that catches my attention is MAP_NOSYNC. So you
>>>>> may need to look at the mongodb code to figure out what its mmap()
>>>>> call is.
>>>>>
>>>>> One might wonder why they don't just use open() with O_SYNC. I
>>>>> imagine that has to do with, again, performance; possibly they don't
>>>>> want all I/O synchronous, and would rather flush certain pages in the
>>>>> mmap'd region to disk as needed. I see the legitimacy in that approach
>>>>> (vs. just using O_SYNC).
>>>>>
>>>>> There's really no easy way for me to tell you which is more efficient,
>>>>> better, blah blah without spending a lot of time with a benchmarking
>>>>> program that tests all of this, *plus* an entire system (world) built
>>>>> with profiling.
>>>>>
>>>> I ran mongodb with fsync() for two hours and got the following:
>>>>
>>>> STARTED                    INBLK    OUBLK  MAJFLT   MINFLT
>>>> Thu Dec 15 10:34:52 2011       3   192744     314  3080182
>>>>
>>>> This is the output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
>>>>
>>>> Then I ran it with the default msync():
>>>>
>>>> STARTED                    INBLK    OUBLK  MAJFLT   MINFLT
>>>> Thu Dec 15 12:34:53 2011       0  7241555      79  5401945
>>>>
>>>> There are also two graphs of disk busyness [1] [2].
>>>>
>>>> The difference is significant -- a factor of 37! That is what I
>>>> expected to get.
>>>>
>>>> In the comments for vm_object_page_clean() I found this:
>>>>
>>>>  * When stuffing pages asynchronously, allow clustering.  XXX we need a
>>>>  * synchronous clustering mode implementation.
>>>>
>>>> To me this means that msync(MS_SYNC) flushes every page to disk in a
>>>> single I/O transaction. If we multiply 4K by 37 we get about 150K,
>>>> which is the size of a single transaction in my experience.
>>>>
>>>> +alc@, kib@
>>>>
>>>> Am I right? Is there any plan to implement this?
>>> The current buffer clustering code can do only async writes. In fact, I
>>> am not quite sure what would constitute sync clustering, because the
>>> ability to delay a write is important to be able to cluster at all.
>>>
>>> Also, I am not sure that the lack of clustering is the biggest problem.
>>> IMO, the fact that each write is sync is the first problem there. It
>>> would be quite a bit of work to add tracking of the issued writes to
>>> vm_object_page_clean() and down the stack, especially due to the custom
>>> page-write VOPs in several filesystems.
>>>
>>> The only guarantee that POSIX requires from msync(MS_SYNC) is that the
>>> writes are finished when the syscall returns, not that the writes are
>>> done synchronously. Below is a hack which should help if the msync()ed
>>> region contains the mapping of the whole file, since then it is
>>> possible to schedule all the writes asynchronously and fsync() the
>>> file afterwards. It will cause an unneeded metadata update, but I
>>> think it would still be much faster.
>>>
>>>
>>> diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
>>> index 250b769..a9de554 100644
>>> --- a/sys/vm/vm_object.c
>>> +++ b/sys/vm/vm_object.c
>>> @@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>>>  	vm_object_t backing_object;
>>>  	struct vnode *vp;
>>>  	struct mount *mp;
>>> -	int flags;
>>> +	int flags, fsync_after;
>>>
>>>  	if (object == NULL)
>>>  		return;
>>> @@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
>>>  		(void) vn_start_write(vp, &mp, V_WAIT);
>>>  		vfslocked = VFS_LOCK_GIANT(vp->v_mount);
>>>  		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>>> -		flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
>>> -		flags |= invalidate ? OBJPC_INVAL : 0;
>>> +		if (syncio && !invalidate && offset == 0 &&
>>> +		    OFF_TO_IDX(size) == object->size) {
>>> +			/*
>>> +			 * If syncing the whole mapping of the file,
>>> +			 * it is faster to schedule all the writes in
>>> +			 * async mode, also allowing the clustering,
>>> +			 * and then wait for i/o to complete.
>>> +			 */
>>> +			flags = 0;
>>> +			fsync_after = TRUE;
>>> +		} else {
>>> +			flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
>>> +			flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
>>> +			fsync_after = FALSE;
>>> +		}
>>>  		VM_OBJECT_LOCK(object);
>>>  		vm_object_page_clean(object, offset, offset + size, flags);
>>>  		VM_OBJECT_UNLOCK(object);
>>> +		if (fsync_after)
>>> +			(void) VOP_FSYNC(vp, MNT_WAIT, curthread);
>>>  		VOP_UNLOCK(vp, 0);
>>>  		VFS_UNLOCK_GIANT(vfslocked);
>>>  		vn_finished_write(mp);
>> Thanks, this patch works. Performance is the same as with fsync().
>>
>> Actually, Linux uses fsync() inside of msync() if MS_SYNC is set:
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD
>>
> I see, indeed Linux fully fsyncs the whole file if even a single page of
> it appears to be (non-shadowed) mmapped into the msync(MS_SYNC) region.
> I am not sure that we should follow this behaviour.
>
> Alan, do you agree with the patch above ?

Yes, it's ok.

Alan
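
For illustration only, the userland workaround discussed above -- calling
fsync() on the descriptor instead of msync(MS_SYNC) over the whole mapping --
might look roughly like the minimal sketch below. The file path, mapping size,
and error handling are placeholders, not code taken from mongodb:

    #include <sys/mman.h>

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
            /* Hypothetical file and mapping size, for illustration only. */
            const char *path = "/tmp/example.dat";
            size_t len = 65536;
            int fd;
            char *p;

            if ((fd = open(path, O_RDWR | O_CREAT, 0644)) == -1)
                    err(1, "open");
            if (ftruncate(fd, (off_t)len) == -1)
                    err(1, "ftruncate");

            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");

            p[0] = 1;       /* dirty a page through the shared mapping */

            /*
             * Instead of msync(p, len, MS_SYNC), which on an unpatched
             * kernel issues synchronous page writes, fsync() flushes the
             * file's dirty pages and buffers and waits for the I/O to
             * complete, allowing the writes to be clustered.
             */
            if (fsync(fd) == -1)
                    err(1, "fsync");

            if (munmap(p, len) == -1)
                    err(1, "munmap");
            close(fd);
            return (0);
    }

This relies on the same observation made above: POSIX only requires that the
writes have completed by the time msync(MS_SYNC) returns, so scheduling them
asynchronously and then waiting, which is effectively what fsync() does here,
provides the same guarantee while still allowing clustering.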