FreeBSD Mail Archives

Date:      Sun, 5 Apr 2009 18:59:18 +0300
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Hans Ottevanger <hansot@iae.nl>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: mlockall() failure and direction for possible solution
Message-ID:  <20090405155918.GO31897@deviant.kiev.zoral.com.ua>
In-Reply-To: <49D89B50.3000304@iae.nl>
References:  <49D89B50.3000304@iae.nl>


On Sun, Apr 05, 2009 at 01:51:44PM +0200, Hans Ottevanger wrote:
> Hi folks,
>
> As has been noted before, there is an issue with the mlockall() system
> call always failing on (at least) the amd64 architecture. This is quite
> evident by the automounter (as configured out-of-the-box) printing error
> messages on startup like:
>
> Couldn't lock process pages in memory using mlockall()
>
> I have verified the occurrence of this issue on the amd64 platform on
> 7.1-STABLE and 8.0-CURRENT. On the i386 platform this problem does not
> occur.
>
> To investigate this issue a bit further I ran the following trivial
> program:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> int main(int argc, char *argv[])
> {
>         if (mlockall(MCL_CURRENT|MCL_FUTURE) == -1)
>                 perror(argv[0]);
>
>         char command[80];
>         snprintf(command, 80, "procstat -v %d", getpid());
>         system(command);
>
>         exit(0);
> }
>
> which yields (using 8.0-CURRENT as of today, on an Intel DP965LT board
> with a Q6600 and 8 Gbyte RAM, GENERIC kernel stripped of unused devices,
> output folded to 72 characters per line):
>
> /mltest: Resource temporarily unavailable
>   PID              START                END PRT  RES PRES REF SHD FL TP
> PATH
>  1064           0x400000           0x401000 r-x    1    0   1   0 CN vn
> /root/mlockall/mltest
>  1064           0x500000           0x501000 rw-    1    0   1   0 CN df
>  1064           0x501000           0x600000 rwx  255    0   1   0 -- df
>  1064        0x800500000        0x80052c000 r-x   44    0  64  31 CN vn
> /libexec/ld-elf.so.1
>  1064        0x80052c000        0x800534000 rw-    8    0   1   0 C- df
>  1064        0x80062b000        0x800633000 rw-    8    0   1   0 CN vn
> /libexec/ld-elf.so.1
>  1064        0x800633000        0x80063f000 rw-   12    0   1   0 C- df
>  1064        0x80063f000        0x80072e000 r-x  239    0 128  62 CN vn
> /lib/libc.so.7
>  1064        0x80072e000        0x80072f000 r-x    1    0   1   0 CN vn
> /lib/libc.so.7
>  1064        0x80072f000        0x80082f000 r-x   51    0 128  62 CN vn
> /lib/libc.so.7
>  1064        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
> /lib/libc.so.7
>  1064        0x80084f000        0x800865000 rw-    6    0   1   0 CN df
>  1064        0x800900000        0x800965000 rw-  101    0   1   0 -- df
>  1064        0x800965000        0x800a00000 rw-  155    0   1   0 -- df
>  1064     0x7ffffffe0000     0x800000000000 rwx    3    0   1   0 C- df
>
> I have hunted down the exact location in the kernel where the call to
> mlockall() returns an error (just using printf's, debugging using
> Firewire proved not to be as trivial to set up as it was just a few
> years ago). It appears that while wiring the memory, finally vm_fault()
> is called and it bails out at line 412 of vm_fault.c. The virtual
> address of the page that the system is attempting to wire (argument
> vaddr of vm_fault()) is 0x800762000. From the procstat output above it
> appears that this is in the third region backed by /lib/libc.so.7.
>
> This made me think that the issue might be somehow related to the way
> in which dynamic libraries are linked at runtime. Indeed, if the above
> program is linked -statically- it does not fail. Also if the program is
> compiled and linked -dynamically- on an i386 platform and run on an
> amd64, it runs successfully.
>
> To make a long story at least a bit shorter, I found that the problem
> is in /usr/src/libexec/rtld-elf/map_object.c at line 156. Here a
> contiguous region is staked out for the code and data. For the amd64,
> where the required alignment of the segments is 1 Mbyte, this causes a
> region to be mapped that is far larger than the library file by which
> it is backed. Addresses that are not backed by the file cannot be
> resident and hence the region cannot be locked into memory. On the i386
> architecture this problem does not occur since the alignment of the
> segments is just 4 Kbytes. I suspect that the problem also occurs at
> least on the sparc64 architecture.
>
> As a first step to a possible solution you can apply the attached
> (provisional) patch, that uses an anonymous, read-only mapping to create
> the required region.
>
> The output of the above program then becomes:
>
>   PID              START                END PRT  RES PRES REF SHD FL TP
> PATH
>  1302           0x400000           0x401000 r-x    1    0   1   0 CN vn
> /root/mlockall/mltest
>  1302           0x500000           0x501000 rw-    1    0   1   0 -- df
>  1302        0x800500000        0x80052c000 r-x   44    0   8   4 CN vn
> /libexec/ld-elf.so.1
>  1302        0x80052c000        0x800534000 rw-    8    0   1   0 -- df
>  1302        0x80062b000        0x800633000 rw-    8    0   1   0 C- vn
> /libexec/ld-elf.so.1
>  1302        0x800633000        0x80063f000 rw-   12    0   1   0 -- df
>  1302        0x80063f000        0x80072e000 r-x  239    0 124  62 CN vn
> /lib/libc.so.7
>  1302        0x80072e000        0x80072f000 r-x    1    0   1   0 C- vn
> /lib/libc.so.7
>  1302        0x80072f000        0x80082f000 r--  256    0   1   0 -- df
>  1302        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
> /lib/libc.so.7
>  1302        0x80084f000        0x800865000 rw-   22    0   1   0 -- df
>  1302     0x7ffffffe0000     0x800000000000 rwx   32    0   1   0 -- df
>
> i.e. mlockall() does not return an error anymore.
>
> I still have the following questions:
>
> 1. Is it worth the trouble to solve the mlockall() problem at all? Should
> I file a PR?
Yes. Do as you want, but I see no reason.

Your analysis looks correct and useful.
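
As a cross-check, the numbers in your first procstat listing can be
worked through by hand or with a few lines of C. This is only a sketch:
the struct and function names are made up for illustration, and a
4 Kbyte page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 Kbyte pages, as on amd64 and i386. */
#define PAGE_SIZE_ASSUMED 4096ULL

/* One row of the procstat -v listing: the address range [start, end). */
struct vm_range {
	uint64_t start;
	uint64_t end;
};

/* Number of pages spanned by a range; both ends must be page aligned. */
uint64_t
range_pages(struct vm_range r)
{
	assert(r.start <= r.end);
	assert(r.start % PAGE_SIZE_ASSUMED == 0);
	assert(r.end % PAGE_SIZE_ASSUMED == 0);
	return ((r.end - r.start) / PAGE_SIZE_ASSUMED);
}

/* Nonzero if addr lies inside the range. */
int
range_contains(struct vm_range r, uint64_t addr)
{
	return (addr >= r.start && addr < r.end);
}

/* Zero-based index of the page holding addr, counted from r.start. */
uint64_t
page_index(struct vm_range r, uint64_t addr)
{
	assert(range_contains(r, addr));
	return ((addr - r.start) / PAGE_SIZE_ASSUMED);
}
```

For the third libc.so.7 row (0x80072f000 to 0x80082f000), range_pages()
gives 256 pages, i.e. exactly 1 Mbyte, and page_index() for the faulting
address 0x800762000 gives 51, which equals the RES count of 51 in that
row. That is consistent with the fault hitting the first page past the
resident ones, although the listing alone does not prove it.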

>
> 2. Can someone confirm that it also occurs on the other 64 bit
> architectures?
>
> 3. It might be more elegant to use PROT_NONE instead of PROT_READ when
> just staking out the address space. Currently mlockall() returns an
> error when attempting that, so most likely mlockall() would need to be
> changed to ignore regions mapped with PROT_NONE. On the other hand, the
> pthread implementation uses PROT_NONE to create red zones on the stack
> and mlockall() apparently succeeds with threaded applications (using the
> provided patch). Any opinions/ideas/hints?
I think it is better to unmap the holes than to cover them with some
mapping.

Please try this patch instead.

diff --git a/libexec/rtld-elf/map_object.c b/libexec/rtld-elf/map_object.c
index 2d06074..3266af0 100644
--- a/libexec/rtld-elf/map_object.c
+++ b/libexec/rtld-elf/map_object.c
@@ -83,6 +83,7 @@ map_object(int fd, const char *path, const struct stat *sb)
     Elf_Addr bss_vaddr;
     Elf_Addr bss_vlimit;
     caddr_t bss_addr;
+    size_t hole;
 
     hdr = get_elf_header(fd, path);
     if (hdr == NULL)
@@ -91,8 +92,7 @@ map_object(int fd, const char *path, const struct stat *sb)
     /*
      * Scan the program header entries, and save key information.
      *
-     * We rely on there being exactly two load segments, text and data,
-     * in that order.
+     * We expect that the loadable segments are ordered by load address.
      */
     phdr = (Elf_Phdr *) ((char *)hdr + hdr->e_phoff);
     phsize  = hdr->e_phnum * sizeof (phdr[0]);
@@ -214,6 +214,17 @@ map_object(int fd, const char *path, const struct stat *sb)
 		return NULL;
 	    }
 	}
+
+	/* Unmap the region between two non-adjusted ELF segments */
+	if (i < nsegs) {
+	    hole = trunc_page(segs[i + 1]->p_vaddr) - bss_vlimit;
+	    if (hole > 0 && munmap(mapbase + bss_vlimit, hole) == -1) {
+		_rtld_error("%s: munmap hole failed: %s", path,
+		    strerror(errno));
+		return NULL;
+	    }
+	}
+
 	if (phdr_vaddr == 0 && data_offset <= hdr->e_phoff &&
 	  (data_vlimit - data_vaddr + data_offset) >=
 	  (hdr->e_phoff + hdr->e_phnum * sizeof (Elf_Phdr))) {
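
The idea behind the patch (reserve one contiguous range, then unmap the
gap between segments) can be tried outside rtld with a small userland
sketch. It is an illustration under assumptions, not the rtld code: it
uses anonymous memory instead of file-backed ELF segments, the function
names are hypothetical, and the probe relies on msync() reporting ENOMEM
for unmapped pages, as POSIX specifies.

```c
/* Exposes MAP_ANON on glibc under strict -std modes; unused on FreeBSD. */
#define _DEFAULT_SOURCE

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Reserve npages of anonymous memory in one mapping, then unmap the
 * pages [hole_first, hole_first + hole_npages) inside it.
 * Returns the base address, or NULL on failure.
 */
void *
reserve_with_hole(size_t npages, size_t hole_first, size_t hole_npages)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
	char *base;

	assert(hole_first + hole_npages <= npages);

	base = mmap(NULL, npages * pagesz, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return (NULL);

	/* Punch the hole; the pages on either side stay mapped. */
	if (hole_npages > 0 &&
	    munmap(base + hole_first * pagesz, hole_npages * pagesz) == -1) {
		munmap(base, npages * pagesz);
		return (NULL);
	}
	return (base);
}

/*
 * Returns 1 if the given page of the reservation is mapped, 0 if it is
 * not, and -1 if msync() failed for some other reason.
 */
int
page_is_mapped(void *base, size_t page)
{
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

	if (msync((char *)base + page * pagesz, pagesz, MS_ASYNC) == 0)
		return (1);
	return (errno == ENOMEM ? 0 : -1);
}
```

Calling reserve_with_hole(4, 1, 2) and probing each page with
page_is_mapped() should report pages 0 and 3 as mapped and pages 1 and 2
as unmapped. With the hole gone there is nothing left in that range for
mlockall() to wire.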


Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20090405155918.GO31897>
