From owner-freebsd-hackers@FreeBSD.ORG  Wed May 23 06:02:19 2012
Message-ID: <4FBC7D66.2080605@zonov.org>
Date: Wed, 23 May 2012 10:02:14 +0400
From: Andrey Zonov <andrey@zonov.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
	rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: Alan Cox <alc@rice.edu>
References: <4F7B495D.3010402@zonov.org>
	<20120404071746.GJ2358@deviant.kiev.zoral.com.ua>
	<4F7DC037.9060803@rice.edu>
	<201204091126.25260.jhb@freebsd.org> <4F845D9B.10004@rice.edu>
	<4F851F87.3050206@zonov.org> <4F9DD372.1020001@rice.edu>
In-Reply-To: <4F9DD372.1020001@rice.edu>
Content-Type: text/plain; charset=windows-1251; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Konstantin Belousov <kostikbel@gmail.com>, freebsd-hackers@freebsd.org,
	alc@freebsd.org
Subject: Re: problems with mmap() and disk caching

On 4/30/12 3:49 AM, Alan Cox wrote:
> On 04/11/2012 01:07, Andrey Zonov wrote:
>> On 10.04.2012 20:19, Alan Cox wrote:
>>> On 04/09/2012 10:26, John Baldwin wrote:
>>>> On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote:
>>>>> On 04/04/2012 02:17, Konstantin Belousov wrote:
>>>>>> On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I open the file, then call mmap() on the whole file and get a pointer,
>>>>>>> then I work with this pointer. I expect that a page needs to be
>>>>>>> touched only once to get it into memory (disk cache?), but this
>>>>>>> doesn't work!
>>>>>>>
>>>>>>> I wrote a test (attached) and ran it on a 1G file generated from
>>>>>>> /dev/random; the result is the following:
>>>>>>>
>>>>>>> Prepare file:
>>>>>>> # swapoff -a
>>>>>>> # newfs /dev/ada0b
>>>>>>> # mount /dev/ada0b /mnt
>>>>>>> # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024
>>>>>>>
>>>>>>> Purge cache:
>>>>>>> # umount /mnt
>>>>>>> # mount /dev/ada0b /mnt
>>>>>>>
>>>>>>> Run test:
>>>>>>> $ ./mmap /mnt/random-1024 30
>>>>>>> mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0)
>>>>>>> mmap: 2 pass took: 7.356670 (none: 261648; res: 496; super: 0; other: 0)
>>>>>>> mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0)
>>>>>>> mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0)
>>>>>>> mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0)
>>>>>>> mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0)
>>>>>>> mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0)
>>>>>>> mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0)
>>>>>>> mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0)
>>>>>>> mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0)
>>>>>>> mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0)
>>>>>>> mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0)
>>>>>>> mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0)
>>>>>>> mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0)
>>>>>>> mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0)
>>>>>>> mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0)
>>>>>>> mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0)
>>>>>>> mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0)
>>>>>>> mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0)
>>>>>>> mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0)
>>>>>>> mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0)
>>>>>>> mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0)
>>>>>>> mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0)
>>>>>>> mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0)
>>>>>>> mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0)
>>>>>>> mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0)
>>>>>>> mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0)
>>>>>>> mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>> mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>> mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>>
>>>>>>> If I run this:
>>>>>>> $ cat /mnt/random-1024 > /dev/null
>>>>>>> before the test, then the result is the following:
>>>>>>>
>>>>>>> $ ./mmap /mnt/random-1024 5
>>>>>>> mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>> mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>> mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>> mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>> mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0)
>>>>>>>
>>>>>>> This is what I expect. But why doesn't this work without reading the
>>>>>>> file manually?
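
The test program was an attachment and is not reproduced in this message.
A minimal sketch of what one pass might look like, assuming it touches one
byte per page and then counts resident pages with mincore(). The function
name touch_and_count() is hypothetical, timing is omitted, and only the
"res" count is computed (not "super" or "other"):

```c
#define _DEFAULT_SOURCE	/* exposes mincore() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef MINCORE_INCORE
#define MINCORE_INCORE	0x1	/* residency bit; glibc does not name it */
#endif

/*
 * Map the whole file, read one byte from every page, then ask mincore()
 * how many pages are resident.  Returns the resident page count, or -1
 * on error; *total receives the number of pages in the file.
 */
long
touch_and_count(const char *path, long *total)
{
	struct stat st;
	volatile unsigned char sink = 0;
	unsigned char *p, *vec;
	long i, npages, pagesz, resident;
	int fd;

	pagesz = sysconf(_SC_PAGESIZE);
	assert(pagesz > 0 && total != NULL);
	if ((fd = open(path, O_RDONLY)) == -1)
		return (-1);
	if (fstat(fd, &st) == -1 || st.st_size == 0) {
		close(fd);
		return (-1);
	}
	npages = (st.st_size + pagesz - 1) / pagesz;
	*total = npages;
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping outlives the descriptor */
	if (p == MAP_FAILED)
		return (-1);
	if ((vec = malloc(npages)) == NULL) {
		munmap(p, st.st_size);
		return (-1);
	}
	for (i = 0; i < npages; i++)
		sink += p[i * pagesz];	/* read one byte from each page */

	resident = -1;
	/* vec's type differs between systems (char * vs unsigned char *). */
	if (mincore(p, st.st_size, (void *)vec) == 0) {
		resident = 0;
		for (i = 0; i < npages; i++)
			if (vec[i] & MINCORE_INCORE)
				resident++;
	}
	free(vec);
	munmap(p, st.st_size);
	return (resident);
}
```

Calling this repeatedly on the same file and printing the return value
next to the elapsed time would give one "pass" line per call.
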
>>>>>> Issue seems to be in some change of the behaviour of the reserv or
>>>>>> phys allocator. I Cc:ed Alan.
>>>>> I'm pretty sure that the behavior here hasn't significantly changed in
>>>>> about twelve years. Otherwise, I agree with your analysis.
>>>>>
>>>>> On more than one occasion, I've been tempted to change:
>>>>>
>>>>> pmap_remove_all(mt);
>>>>> if (mt->dirty != 0)
>>>>>         vm_page_deactivate(mt);
>>>>> else
>>>>>         vm_page_cache(mt);
>>>>>
>>>>> to:
>>>>>
>>>>> vm_page_dontneed(mt);
>>>>>
>>>>> because I suspect that the current code does more harm than good. In
>>>>> theory, it saves activations of the page daemon. However, more often
>>>>> than not, I suspect that we are spending more on page reactivations
>>>>> than we are saving on page daemon activations. The sequential access
>>>>> detection heuristic is just too easily triggered. For example, I've
>>>>> seen it triggered by demand paging of the gcc text segment. Also, I
>>>>> think that pmap_remove_all() and especially vm_page_cache() are too
>>>>> severe for a detection heuristic that is so easily triggered.
>>>> Are you planning to commit this?
>>>>
>>>
>>> Not yet. I did some tests with a file that was several times larger than
>>> DRAM, and I didn't like what I saw. Initially, everything behaved as
>>> expected, but about halfway through the test the bulk of the pages were
>>> active. Despite the call to pmap_clear_reference() in
>>> vm_page_dontneed(), the page daemon is finding the pages to be
>>> referenced and reactivating them. The net result is that the time it
>>> takes to read the file (from a relatively fast SSD) goes up by about
>>> 12%. So, this still needs work.
>>>
>>
>> Hi Alan,
>>
>> What do you think about attached patch?
>>
>>
>
> Sorry for the slow reply, I've been rather busy for the past couple of
> weeks. What you propose is clearly good for sequential accesses, but not
> so good for random accesses. Keep in mind, the potential costs of
> unconditionally increasing the read window include not only wasted I/O
> but also increased memory pressure. Rather than argue about which is
> more important, sequential or random access, I think it's more
> productive to replace the sequential access heuristic. The current
> heuristic is just not that sophisticated. It's easy to do better.
>
> The attached patch implements a new heuristic, which starts with the
> same initial read window as the current heuristic, but arithmetically
> grows the window on sequential page faults. From a stylistic standpoint,
> this patch also cleanly separates the "read ahead" logic from the "cache
> behind" logic.
>
> At the same time, this new heuristic is more selective about performing
> cache behind. It requires three or four sequential page faults before
> cache behind is enabled. More precisely, it requires the read ahead
> window to reach its maximum size before cache behind is enabled.
>
> For long, sequential accesses, the results of my performance tests are
> just as good as unconditionally increasing the window size. I'm also seeing
> fewer pages needlessly cached by the cache behind heuristic. That said,
> there is still room for improvement. We are still not achieving the same
> sequential performance as "dd", and there are still more pages being
> cached than I would like.
>
> Alan
>
>

I've tested your patch extensively and it showed good results.  I've
committed it to our tree, and it will soon be on our production cluster.

Thanks a lot for your help and your improvements!

-- 
Andrey Zonov