Date:      Sat, 12 Oct 2013 12:59:19 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Dmitry Sivachenko <trtrmitya@gmail.com>
Cc:        "hackers@freebsd.org" <hackers@freebsd.org>
Subject:   Re: mmap() question
Message-ID:  <20131012095919.GI41229@kib.kiev.ua>
In-Reply-To: <A5E3C0A2-F0D5-47B1-8992-4B9DA347C275@gmail.com>
References:  <95E0B821-BF9B-4EBF-A1E5-1DDCBB1C3D1B@gmail.com> <20131011051702.GE41229@kib.kiev.ua> <A5E3C0A2-F0D5-47B1-8992-4B9DA347C275@gmail.com>


On Fri, Oct 11, 2013 at 09:57:24AM +0400, Dmitry Sivachenko wrote:
>
> On 11.10.2013, at 9:17, Konstantin Belousov <kostikbel@gmail.com> wrote:
>
> > On Wed, Oct 09, 2013 at 03:42:27PM +0400, Dmitry Sivachenko wrote:
> >> Hello!
> >>
> >> I have a program which mmap()s a lot of large files (with a total size larger than RAM, and I have no swap), but it needs only small parts of those files at a time.
> >>
> >> My understanding is that with mmap(), when I access some memory region, the OS reads the relevant portion of the file from disk and caches the result in memory.  If there is no free memory, the OS will purge a previously read part of an mmap'ed file to free memory for the new chunk.
> >>
> >> But this is not the case.  I use the following simple program, which gets a list of files as command line arguments, mmap()s them all, and then selects a random file and random 1K parts of that file and computes an XOR of the bytes in that region.
> >> After some time the program dies:
> >> pid 63251 (a.out), uid 1232, was killed: out of swap space
> >>
> >> It seems I incorrectly understand how mmap() works; can you please clarify what's going wrong?
> >>
> >> I expect the program to run indefinitely, purging some regions out of RAM and reading in the relevant parts of the files.
> >>
> >
> > You did not specify several very important parameters for your test:
> > 1. total amount of RAM installed
>
> 24GB
>
> > 2. count of the test files and size of the files
>
> To be precise: I used 57 files with sizes varying from 74MB to 19GB.
> The total size of these files is 270GB.
>
> > 3. which filesystem the files are located on
>
> UFS on an SSD drive
>
> > 4. version of the system.
>
> FreeBSD 9.2-PRERELEASE #0 r254880M: Wed Aug 28 11:07:54 MSK 2013

I was not able to reproduce the situation locally. I even tried starting
a lot of threads accessing the mapped regions, to try to outrun the
pagedaemon. The user threads sleep on the disk read, while the pagedaemon
has plenty of time to rebalance the queues. It might be a case where the
SSD indeed makes a difference.

Still, I see how this situation could appear. The code which triggers
the OOM kill never fires if there is free space in the swap, so the
absence of swap is a necessary condition to trigger the bug.  Next, the
OOM calculation does not account for the possibility that almost all
pages on the queues can be reused. It simply fires if free pages are
depleted too much or the free target cannot be reached.

IMO one possible solution is to account for the queued pages in
addition to the swap space.  This is not entirely accurate, since some
pages on the queues cannot be reused, at least transiently.  The most
precise algorithm would count the hold and busy pages globally and
subtract this count from the queue lengths, but that is probably too costly.

Instead, I think we can rely on the numbers counted by the pagedaemon
threads during their passes.  Due to the transient nature of the
pagedaemon failures, this should be fine.

Below is a prototype patch, against HEAD.  It is not applicable to
stable; please use a HEAD kernel for the test.

diff --git a/sys/sys/vmmeter.h b/sys/sys/vmmeter.h
index d2ad920..ee5159a 100644
--- a/sys/sys/vmmeter.h
+++ b/sys/sys/vmmeter.h
@@ -93,9 +93,10 @@ struct vmmeter {
 	u_int v_free_min;	/* (c) pages desired free */
 	u_int v_free_count;	/* (f) pages free */
 	u_int v_wire_count;	/* (a) pages wired down */
-	u_int v_active_count;	/* (q) pages active */
+	u_int v_active_count;	/* (a) pages active */
 	u_int v_inactive_target; /* (c) pages desired inactive */
-	u_int v_inactive_count;	/* (q) pages inactive */
+	u_int v_inactive_count;	/* (a) pages inactive */
+	u_int v_queue_sticky;	/* (a) pages on queues but cannot process */
 	u_int v_cache_count;	/* (f) pages on cache queue */
 	u_int v_cache_min;	/* (c) min pages desired on cache queue */
 	u_int v_cache_max;	/* (c) max pages in cached obj (unused) */
diff --git a/sys/vm/vm_meter.c b/sys/vm/vm_meter.c
index 713a2be..4bb1f1f 100644
--- a/sys/vm/vm_meter.c
+++ b/sys/vm/vm_meter.c
@@ -316,6 +316,7 @@ VM_STATS_VM(v_active_count, "Active pages");
 VM_STATS_VM(v_inactive_target, "Desired inactive pages");
 VM_STATS_VM(v_inactive_count, "Inactive pages");
 VM_STATS_VM(v_cache_count, "Pages on cache queue");
+VM_STATS_VM(v_queue_sticky, "Pages which cannot be moved from queues");
 VM_STATS_VM(v_cache_min, "Min pages on cache queue");
 VM_STATS_VM(v_cache_max, "Max pages on cached queue");
 VM_STATS_VM(v_pageout_free_min, "Min pages reserved for kernel");
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 7846702..6943a0e 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -226,6 +226,7 @@ struct vm_domain {
 	long vmd_segs;	/* bitmask of the segments */
 	boolean_t vmd_oom;
 	int vmd_pass;	/* local pagedaemon pass */
+	int vmd_queue_sticky;	/* pages on queues which cannot be processed */
 	struct vm_page vmd_marker; /* marker for pagedaemon private use */
 };
 
diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index 5660b56..a62cf97 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -896,7 +896,7 @@ vm_pageout_scan(struct vm_domain *vmd, int pass)
 {
 	vm_page_t m, next;
 	struct vm_pagequeue *pq;
-	int page_shortage, maxscan, pcount;
+	int failed_scan, page_shortage, maxscan, pcount;
 	int addl_page_shortage;
 	vm_object_t object;
 	int act_delta;
@@ -960,6 +960,7 @@ vm_pageout_scan(struct vm_domain *vmd, int pass)
 	 */
 	pq = &vmd->vmd_pagequeues[PQ_INACTIVE];
 	maxscan = pq->pq_cnt;
+	failed_scan = 0;
 	vm_pagequeue_lock(pq);
 	queues_locked = TRUE;
 	for (m = TAILQ_FIRST(&pq->pq_pl);
@@ -1012,6 +1013,7 @@ vm_pageout_scan(struct vm_domain *vmd, int pass)
 			vm_page_unlock(m);
 			VM_OBJECT_WUNLOCK(object);
 			addl_page_shortage++;
+			failed_scan++;
 			continue;
 		}
 
@@ -1075,6 +1077,7 @@ vm_pageout_scan(struct vm_domain *vmd, int pass)
 			 * loop over the active queue below.
 			 */
 			addl_page_shortage++;
+			failed_scan++;
 			goto relock_queues;
 		}
 
@@ -1229,6 +1232,7 @@ vm_pageout_scan(struct vm_domain *vmd, int pass)
 				 */
 				if (vm_page_busied(m)) {
 					vm_page_unlock(m);
+					failed_scan++;
 					goto unlock_and_continue;
 				}
 
@@ -1241,6 +1245,7 @@ vm_pageout_scan(struct vm_domain *vmd, int pass)
 					vm_page_requeue_locked(m);
 					if (object->flags & OBJ_MIGHTBEDIRTY)
 						vnodes_skipped++;
+					failed_scan++;
 					goto unlock_and_continue;
 				}
 				vm_pagequeue_unlock(pq);
@@ -1386,6 +1391,11 @@ relock_queues:
 		m = next;
 	}
 	vm_pagequeue_unlock(pq);
+
+	atomic_add_int(&cnt.v_queue_sticky, failed_scan -
+	    vmd->vmd_queue_sticky);
+	vmd->vmd_queue_sticky = failed_scan;
+
 #if !defined(NO_SWAPPING)
 	/*
 	 * Idle process swapout -- run once per second.
@@ -1433,10 +1443,15 @@ static int vm_pageout_oom_vote;
 static void
 vm_pageout_mightbe_oom(struct vm_domain *vmd, int pass)
 {
+	u_int queues_count;
 	int old_vote;
 
-	if (pass <= 1 || !((swap_pager_avail < 64 && vm_page_count_min()) ||
-	    (swap_pager_full && vm_paging_target() > 0))) {
+	queues_count = cnt.v_active_count + cnt.v_inactive_count -
+	    cnt.v_queue_sticky;
+	if (pass <= 1 || !((swap_pager_avail < 64 && vm_page_count_min() &&
+	    queues_count <= cnt.v_free_min) ||
+	    (swap_pager_full && vm_paging_target() > 0 &&
+	    queues_count <= vm_paging_target()))) {
 		if (vmd->vmd_oom) {
 			vmd->vmd_oom = FALSE;
 			atomic_subtract_int(&vm_pageout_oom_vote, 1);
