Date: Wed, 21 Dec 2011 21:03:02 +0400
From: Andrey Zonov <andrey@zonov.org>
To: Kostik Belousov <kostikbel@gmail.com>
Cc: alc@freebsd.org, freebsd-stable@freebsd.org, Jeremy Chadwick <freebsd@jdc.parodius.com>
Subject: Re: directory listing hangs in "ufs" state
Message-ID: <4EF21146.9010107@zonov.org>
In-Reply-To: <20111215130111.GN50300@deviant.kiev.zoral.com.ua>
References: <4EE7BF77.5000504@zonov.org> <20111213221501.GA85563@icarus.home.lan> <4EE8E6E3.7050202@zonov.org> <20111214182252.GA5176@icarus.home.lan> <4EE8FD3E.8030902@zonov.org> <20111214204201.GA7372@icarus.home.lan> <CANU_PUGtjjxP-qLjEqb2wVnL_QGJvtApnaD8SSF4zLksY4ME6A@mail.gmail.com> <20111215130111.GN50300@deviant.kiev.zoral.com.ua>
On 15.12.2011 17:01, Kostik Belousov wrote:
> On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
>> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
>> <freebsd@jdc.parodius.com>wrote:
>>
>>> On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
>>>> On 14.12.2011 22:22, Jeremy Chadwick wrote:
>>>>> On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
>>>>>> Hi Jeremy,
>>>>>>
>>>>>> This is not a hardware problem, I've already checked that. I also ran
>>>>>> fsck today and got no errors.
>>>>>>
>>>>>> After some more exploration of how mongodb works, I found that when the
>>>>>> listing hangs, one of the mongodb threads is in "biowr" state for a long
>>>>>> time. It periodically calls msync(MS_SYNC), according to the ktrace
>>>>>> output.
>>>>>>
>>>>>> If I remove the msync() calls from mongodb, how often will the data
>>>>>> be synced by the OS?
>>>>>>
>>>>>> --
>>>>>> Andrey Zonov
>>>>>>
>>>>>> On 14.12.2011 2:15, Jeremy Chadwick wrote:
>>>>>>> On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
>>>>>>>>
>>>>>>>> Have you any ideas what is going on? or how to catch the problem?
>>>>>>>
>>>>>>> Assuming this isn't a file on the root filesystem, try booting the
>>>>>>> machine in single-user mode and using "fsck -f" on the filesystem in
>>>>>>> question.
>>>>>>>
>>>>>>> Can you verify there are no problems with the disk this file lives on as
>>>>>>> well (smartctl -a /dev/disk)? I'm doubting this is the problem, but
>>>>>>> thought I'd mention it.
>>>>>
>>>>> I have no real answer, I'm sorry. msync(2) indicates it's effectively
>>>>> deprecated (see BUGS). It looks like this is effectively a mmap-version
>>>>> of fsync(2).
>>>>
>>>> I replaced msync(2) with fsync(2). Unfortunately, from man pages it
>>>> is not obvious that I can do this. Anyway, thanks.
>>>
>>> Sorry, that wasn't what I was implying. Let me try to explain
>>> differently.
>>>
>>> msync(2) looks, to me, like an mmap-specific version of fsync(2). Based
>>> on the man page, it seems that with msync() you can effectively
>>> guarantee flushing of certain pages within an mmap()'d region to disk.
>>> fsync() would cause **all** buffers/internal pages to be flushed to
>>> disk.
>>>
>>> One would need to look at the code to mongodb to find out what it's
>>> actually doing with msync(). That is to say, if it's doing something
>>> like this (I probably have the semantics wrong -- I've never spent much
>>> time with mmap()):
>>>
>>> fd = open("/some/file", O_RDWR);
>>> ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>>> ret = msync(ptr, 65536, MS_SYNC);
>>> /* or alternatively, this (on FreeBSD a length of 0 flushes all
>>>  * modified pages in the region containing ptr):
>>> ret = msync(ptr, 0, MS_SYNC);
>>> */
>>>
>>> Then this, to me, would be mostly the equivalent to:
>>>
>>> fd = open("/some/file", O_RDWR); /* fsync() takes a descriptor, not a FILE * */
>>> ret = fsync(fd);
>>>
>>> Otherwise, if it's calling msync() only on an address/location within
>>> the region ptr points to, then that may be more efficient (fewer pages to
>>> flush).
>>>
>>
>> They call msync() for the whole file. So, there will not be any difference.
>>
>>
>>> The mmap() arguments -- specifically flags (see man page) -- also play
>>> a role here. The one that catches my attention is MAP_NOSYNC. So you
>>> may need to look at the mongodb code to figure out what its mmap()
>>> call is.
>>>
>>> One might wonder why they don't just use open() with O_SYNC. I
>>> imagine that has to do with, again, performance; possibly they don't want
>>> all I/O synchronous, and would rather flush certain pages in the mmap'd
>>> region to disk as needed. I see the legitimacy in that approach (vs.
>>> just using O_SYNC).
>>>
>>> There's really no easy way for me to tell you which is more efficient,
>>> better, blah blah without spending a lot of time with a benchmarking
>>> program that tests all of this, *plus* an entire system (world) built
>>> with profiling.
>>>
>>
>> I ran mongodb for two hours with fsync() and got the following:
>> STARTED INBLK OUBLK MAJFLT MINFLT
>> Thu Dec 15 10:34:52 2011 3 192744 314 3080182
>>
>> This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
>>
>> Then I ran it with default msync():
>> STARTED INBLK OUBLK MAJFLT MINFLT
>> Thu Dec 15 12:34:53 2011 0 7241555 79 5401945
>>
>> There are also two graphs of disk busyness [1] [2].
>>
>> The difference in OUBLK is significant: a factor of 37! That is what I
>> expected to get.
>>
>> In the comments for vm_object_page_clean() I found this:
>>
>> * When stuffing pages asynchronously, allow clustering. XXX we need a
>> * synchronous clustering mode implementation.
>>
>> To me this means that msync(MS_SYNC) flushes every page to disk in its own
>> I/O transaction. Multiplying 4K by 37 gives about 150K, which in my
>> experience is the size of a single clustered transaction.
>>
>> +alc@, kib@
>>
>> Am I right? Is there any plan to implement this?
> The current buffer clustering code can only do async writes. In fact, I
> am not quite sure what would constitute sync clustering, because the
> ability to delay the write is important to be able to cluster at all.
>
> Also, I am not sure that lack of clustering is the biggest problem.
> IMO, the fact that each write is sync is the first problem there. It
> would be quite a lot of work to add tracking of the issued writes to
> vm_object_page_clean() and down the stack, esp. due to the custom page
> write vops in several fses.
>
> The only guarantee that POSIX requires from msync(MS_SYNC) is that
> the writes are finished when the syscall returns, not that the
> writes are done synchronously. Below is a hack which should help if
> the msync()ed region contains the mapping of the whole file, since
> it is then possible to fsync() the file after all writes are scheduled
> asynchronously. It will cause an unneeded metadata update, but I think
> it would still be much faster.
>
>
> diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
> index 250b769..a9de554 100644
> --- a/sys/vm/vm_object.c
> +++ b/sys/vm/vm_object.c
> @@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
> vm_object_t backing_object;
> struct vnode *vp;
> struct mount *mp;
> - int flags;
> + int flags, fsync_after;
>
> if (object == NULL)
> return;
> @@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
> (void) vn_start_write(vp, &mp, V_WAIT);
> vfslocked = VFS_LOCK_GIANT(vp->v_mount);
> vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
> - flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> - flags |= invalidate ? OBJPC_INVAL : 0;
> + if (syncio && !invalidate && offset == 0 &&
> + OFF_TO_IDX(size) == object->size) {
> + /*
> + * If syncing the whole mapping of the file,
> + * it is faster to schedule all the writes in
> + * async mode, also allowing the clustering,
> + * and then wait for i/o to complete.
> + */
> + flags = 0;
> + fsync_after = TRUE;
> + } else {
> + flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
> + flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
> + fsync_after = FALSE;
> + }
> VM_OBJECT_LOCK(object);
> vm_object_page_clean(object, offset, offset + size, flags);
> VM_OBJECT_UNLOCK(object);
> + if (fsync_after)
> + (void) VOP_FSYNC(vp, MNT_WAIT, curthread);
> VOP_UNLOCK(vp, 0);
> VFS_UNLOCK_GIANT(vfslocked);
> vn_finished_write(mp);
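
For anyone following along, here is a rough userland sketch of the condition the patch uses to pick the new path. It is not kernel code: the 4 KiB page size and the names SKETCH_PAGE_SHIFT, SKETCH_OFF_TO_IDX and takes_fsync_path are assumptions made up for illustration; only the shape of the test (syncio, no invalidate, offset 0, size covering the whole object) comes from the diff above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT	12

/* Stand-in for the kernel's OFF_TO_IDX(): byte offset -> page index. */
#define SKETCH_OFF_TO_IDX(off)	((uint64_t)(off) >> SKETCH_PAGE_SHIFT)

/*
 * Returns true when the patched vm_object_sync() would schedule the
 * writes asynchronously and then call VOP_FSYNC(), i.e. when a
 * synchronous, non-invalidating sync covers the object from offset 0
 * for exactly object_pages pages.
 */
static bool
takes_fsync_path(bool syncio, bool invalidate, uint64_t offset,
    uint64_t size, uint64_t object_pages)
{
	return (syncio && !invalidate && offset == 0 &&
	    SKETCH_OFF_TO_IDX(size) == object_pages);
}
```

So msync(MS_SYNC) over a whole 16-page mapping would take the new path, while a sync that starts at a non-zero offset, or one that also invalidates, would keep the old per-page synchronous behaviour.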
Thanks, this patch works. Performance is the same as with fsync().
Actually, Linux uses fsync() inside of msync() if MS_SYNC is set.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=mm/msync.c;h=632df4527c0122062d9332a0d483835274ed62f6;hb=HEAD
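
To make the comparison concrete, below is a minimal sketch of the two ways of flushing a fully mapped file that were compared in this thread. It is not mongodb's code; the function name flush_mapped_file and its parameters are hypothetical. It also assumes that fsync() on the descriptor writes out pages dirtied through a MAP_SHARED mapping, which is what I observed on FreeBSD but is not something I checked for other systems.

```c
#include <sys/types.h>
#include <sys/mman.h>

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Create (or truncate) the file at path, map len bytes of it shared,
 * dirty every page, and then flush either with msync(MS_SYNC) over the
 * whole mapping (use_fsync == 0) or with fsync() on the descriptor
 * (use_fsync != 0). Returns 0 on success, -1 on any failure.
 */
static int
flush_mapped_file(const char *path, size_t len, int use_fsync)
{
	char *p;
	int fd, ret;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd == -1)
		return (-1);
	if (ftruncate(fd, (off_t)len) == -1) {
		close(fd);
		return (-1);
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return (-1);
	}
	memset(p, 'x', len);	/* dirty every page of the mapping */
	ret = use_fsync ? fsync(fd) : msync(p, len, MS_SYNC);
	if (munmap(p, len) == -1)
		ret = -1;
	if (close(fd) == -1)
		ret = -1;
	return (ret);
}
```

Calling it once with use_fsync set to 0 and once with 1 on a large file, while watching the oublock column of ps as above, should show whether the two paths issue a different number of writes on a given kernel.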
--
Andrey Zonov
