From: Kostik Belousov
To: Hans Ottevanger
Cc: freebsd-hackers@freebsd.org
Date: Sun, 5 Apr 2009 18:59:18 +0300
Subject: Re: mlockall() failure and direction for possible solution
Message-ID: <20090405155918.GO31897@deviant.kiev.zoral.com.ua>
In-Reply-To: <49D89B50.3000304@iae.nl>

On Sun, Apr 05, 2009 at 01:51:44PM +0200, Hans Ottevanger wrote:
> Hi folks,
>
> As has been noted before, there is an issue with the mlockall() system
> call always failing on (at least) the amd64 architecture. This is quite
> evident by the automounter (as configured out-of-the-box) printing error
> messages on startup like:
>
> Couldn't lock process pages in memory using mlockall()
>
> I have verified the occurrence of this issue on the amd64 platform on
> 7.1-STABLE and 8.0-CURRENT. On the i386 platform this problem does not
> occur.
>
> To investigate this issue a bit further I ran the following trivial
> program:
>
> #include <sys/mman.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
>     if (mlockall(MCL_CURRENT|MCL_FUTURE) == -1)
>         perror(argv[0]);
>
>     char command[80];
>     snprintf(command, 80, "procstat -v %d", getpid());
>     system(command);
>
>     exit(0);
> }
>
> which yields (using CURRENT-8.0 as of today, on an Intel DP965LT board
> with a Q6600 and 8 Gbyte RAM, GENERIC kernel stripped of unused devices,
> output folded to 72 characters per line):
>
> /mltest: Resource temporarily unavailable
>   PID          START            END PRT RES PRES REF SHD FL TP
> PATH
>  1064       0x400000       0x401000 r-x   1    0   1   0 CN vn
> /root/mlockall/mltest
>  1064       0x500000       0x501000 rw-   1    0   1   0 CN df
>  1064       0x501000       0x600000 rwx 255    0   1   0 -- df
>  1064    0x800500000    0x80052c000 r-x  44    0  64  31 CN vn
> /libexec/ld-elf.so.1
>  1064    0x80052c000    0x800534000 rw-   8    0   1   0 C- df
>  1064    0x80062b000    0x800633000 rw-   8    0   1   0 CN vn
> /libexec/ld-elf.so.1
>  1064    0x800633000    0x80063f000 rw-  12    0   1   0 C- df
>  1064    0x80063f000    0x80072e000 r-x 239    0 128  62 CN vn
> /lib/libc.so.7
>  1064    0x80072e000    0x80072f000 r-x   1    0   1   0 CN vn
> /lib/libc.so.7
>  1064    0x80072f000    0x80082f000 r-x  51    0 128  62 CN vn
> /lib/libc.so.7
>  1064    0x80082f000    0x80084f000 rw-  32    0   1   0 C- vn
> /lib/libc.so.7
>  1064    0x80084f000    0x800865000 rw-   6    0   1   0 CN df
>  1064    0x800900000    0x800965000 rw- 101    0   1   0 -- df
>  1064    0x800965000    0x800a00000 rw- 155    0   1   0 -- df
>  1064 0x7ffffffe0000 0x800000000000 rwx   3    0   1   0 C- df
>
> I have hunted down the exact location in the kernel where the call to
> mlockall() returns an error (just using printf's, debugging using
> Firewire proved not to be as trivial to set up as it was just a few
> years ago). It appears that while wiring the memory, finally vm_fault()
> is called and it bails out at line 412 of vm_fault.c.
> The virtual address of the page that the system is attempting to wire
> (argument vaddr of vm_fault()) is 0x800762000. From the procstat
> output above it appears that this is in the third region backed by
> /lib/libc.so.7.
>
> This made me think that the issue might be somehow related to the way
> in which dynamic libraries are linked at runtime. Indeed, if the above
> program is linked -statically- it does not fail. Also, if the program
> is compiled and linked -dynamically- on an i386 platform and run on an
> amd64, it runs successfully.
>
> To make a long story at least a bit shorter, I found that the problem
> is in /usr/src/libexec/rtld-elf/map_object.c at line 156. Here a
> contiguous region is staked out for the code and data. For the amd64,
> where the required alignment of the segments is 1 Mbyte, this causes a
> region to be mapped that is far larger than the library file by which
> it is backed. Addresses that are not backed by the file cannot be
> resident and hence the region cannot be locked into memory. On the i386
> architecture this problem does not occur since the alignment of the
> segments is just 4 Kbytes. I suspect that the problem also occurs at
> least on the sparc64 architecture.
>
> As a first step to a possible solution you can apply the attached
> (provisional) patch, that uses an anonymous, read-only mapping to
> create the required region.
>
> The output of the above program then becomes:
>
>   PID          START            END PRT RES PRES REF SHD FL TP
> PATH
>  1302       0x400000       0x401000 r-x   1    0   1   0 CN vn
> /root/mlockall/mltest
>  1302       0x500000       0x501000 rw-   1    0   1   0 -- df
>  1302    0x800500000    0x80052c000 r-x  44    0   8   4 CN vn
> /libexec/ld-elf.so.1
>  1302    0x80052c000    0x800534000 rw-   8    0   1   0 -- df
>  1302    0x80062b000    0x800633000 rw-   8    0   1   0 C- vn
> /libexec/ld-elf.so.1
>  1302    0x800633000    0x80063f000 rw-  12    0   1   0 -- df
>  1302    0x80063f000    0x80072e000 r-x 239    0 124  62 CN vn
> /lib/libc.so.7
>  1302    0x80072e000    0x80072f000 r-x   1    0   1   0 C- vn
> /lib/libc.so.7
>  1302    0x80072f000    0x80082f000 r-- 256    0   1   0 -- df
>  1302    0x80082f000    0x80084f000 rw-  32    0   1   0 C- vn
> /lib/libc.so.7
>  1302    0x80084f000    0x800865000 rw-  22    0   1   0 -- df
>  1302 0x7ffffffe0000 0x800000000000 rwx  32    0   1   0 -- df
>
> i.e. mlockall() does not return an error anymore.
>
> I still have the following questions:
>
> 1. Is it worth the trouble to solve the mlockall() problem at all?
> Should I file a PR?

Yes, it is worth it. File a PR if you want, but I see no reason to.
Your analysis looks correct and useful.

>
> 2. Can someone confirm that it also occurs on the other 64 bit
> architectures?
>
> 3. It might be more elegant to use PROT_NONE instead of PROT_READ when
> just staking out the address space. Currently mlockall() returns an
> error when attempting that, so most likely mlockall() would need to be
> changed to ignore regions mapped with PROT_NONE. On the other hand, the
> pthread implementation uses PROT_NONE to create red zones on the stack
> and mlockall() apparently succeeds with threaded applications (using
> the provided patch). Any opinions/ideas/hints?

I think it is better to unmap the holes than to create an extra
mapping over them. Please try this patch instead.

diff --git a/libexec/rtld-elf/map_object.c b/libexec/rtld-elf/map_object.c
index 2d06074..3266af0 100644
--- a/libexec/rtld-elf/map_object.c
+++ b/libexec/rtld-elf/map_object.c
@@ -83,6 +83,7 @@ map_object(int fd, const char *path, const struct stat *sb)
     Elf_Addr bss_vaddr;
     Elf_Addr bss_vlimit;
     caddr_t bss_addr;
+    size_t hole;
 
     hdr = get_elf_header(fd, path);
     if (hdr == NULL)
@@ -91,8 +92,7 @@ map_object(int fd, const char *path, const struct stat *sb)
     /*
      * Scan the program header entries, and save key information.
      *
-     * We rely on there being exactly two load segments, text and data,
-     * in that order.
+     * We expect that the loadable segments are ordered by load address.
      */
     phdr = (Elf_Phdr *) ((char *)hdr + hdr->e_phoff);
     phsize = hdr->e_phnum * sizeof (phdr[0]);
@@ -214,6 +214,17 @@ map_object(int fd, const char *path, const struct stat *sb)
 		return NULL;
 	    }
 	}
+
+	/* Unmap the region between two non-adjusted ELF segments */
+	if (i < nsegs) {
+	    hole = trunc_page(segs[i + 1]->p_vaddr) - bss_vlimit;
+	    if (hole > 0 && munmap(mapbase + bss_vlimit, hole) == -1) {
+		_rtld_error("%s: munmap hole failed: %s", path,
+		    strerror(errno));
+		return NULL;
+	    }
+	}
+
 	if (phdr_vaddr == 0 && data_offset <= hdr->e_phoff &&
 	  (data_vlimit - data_vaddr + data_offset) >=
 	  (hdr->e_phoff + hdr->e_phnum * sizeof (Elf_Phdr))) {
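
For illustration only, here is a minimal userland sketch of the idea
behind the patch above: reserve one contiguous span, place two segments
inside it, and munmap() the gap between them. It is not the rtld code.
Anonymous memory stands in for the library file, the sizes are arbitrary,
and the names page_is_mapped() and unmap_hole_demo() are hypothetical
helpers made up for this example. It assumes that msync() reports ENOMEM
for an unmapped page, which is what POSIX specifies.

```c
#define _DEFAULT_SOURCE 1	/* exposes MAP_ANON on glibc; harmless elsewhere */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: 1 if the page at addr is mapped, 0 if it is not. */
static int
page_is_mapped(void *addr, size_t pagesz)
{
	/* msync() fails with ENOMEM when the range contains unmapped pages. */
	if (msync(addr, pagesz, MS_ASYNC) == -1 && errno == ENOMEM)
		return (0);
	return (1);
}

/*
 * Hypothetical demo: returns 0 on success and stores whether the first
 * segment, the hole, and the second segment are still mapped afterwards.
 */
static int
unmap_hole_demo(int *seg1_mapped, int *hole_mapped, int *seg2_mapped)
{
	long ps = sysconf(_SC_PAGESIZE);
	size_t pagesz, total, seg1_len, seg2_off, hole;
	char *base;

	if (ps <= 0)
		return (-1);
	pagesz = (size_t)ps;
	total = 16 * pagesz;		/* whole reserved span */
	seg1_len = 2 * pagesz;		/* stand-in for the text segment */
	seg2_off = 12 * pagesz;		/* stand-in for the aligned data segment */
	hole = seg2_off - seg1_len;	/* gap created by the alignment */

	/* Stake out one contiguous region (anonymous, not file-backed here). */
	base = mmap(NULL, total, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return (-1);

	/* Place the two segments at fixed offsets inside the reservation. */
	if (mmap(base, seg1_len, PROT_READ,
	    MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED ||
	    mmap(base + seg2_off, total - seg2_off, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED) {
		munmap(base, total);
		return (-1);
	}

	/* Unmap the gap between the two segments, as the patch does. */
	if (hole > 0 && munmap(base + seg1_len, hole) == -1) {
		munmap(base, total);
		return (-1);
	}

	*seg1_mapped = page_is_mapped(base, pagesz);
	*hole_mapped = page_is_mapped(base + seg1_len, pagesz);
	*seg2_mapped = page_is_mapped(base + seg2_off, pagesz);

	/* Release the two remaining mappings. */
	munmap(base, seg1_len);
	munmap(base + seg2_off, total - seg2_off);
	return (0);
}
```

Calling unmap_hole_demo() from a small main() should report both
segments as mapped and the hole as unmapped. Once the hole is gone there
is no mapping left in that range for a later mlockall() to wire, which
is the effect the patch is after.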