From owner-freebsd-hackers@FreeBSD.ORG  Sun Apr  5 12:12:02 2009
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 47EEC106564A
	for <freebsd-hackers@freebsd.org>; Sun,  5 Apr 2009 12:12:02 +0000 (UTC)
	(envelope-from hansot@iae.nl)
Received: from smtp-vbr9.xs4all.nl (smtp-vbr9.xs4all.nl [194.109.24.29])
	by mx1.freebsd.org (Postfix) with ESMTP id A3D9F8FC19
	for <freebsd-hackers@freebsd.org>; Sun,  5 Apr 2009 12:12:01 +0000 (UTC)
	(envelope-from hansot@iae.nl)
Received: from merom.hotsoft.nl (beasties.demon.nl [82.161.3.114])
	by smtp-vbr9.xs4all.nl (8.13.8/8.13.8) with ESMTP id n35BpjYs078494
	for <freebsd-hackers@freebsd.org>; Sun, 5 Apr 2009 13:51:45 +0200 (CEST)
	(envelope-from hansot@iae.nl)
Message-ID: <49D89B50.3000304@iae.nl>
Date: Sun, 05 Apr 2009 13:51:44 +0200
From: Hans Ottevanger <hansot@iae.nl>
User-Agent: Thunderbird 2.0.0.21 (X11/20090322)
MIME-Version: 1.0
To: freebsd-hackers@freebsd.org
Content-Type: multipart/mixed; boundary="------------090600010407000403070006"
X-Virus-Scanned: by XS4ALL Virus Scanner
Subject: mlockall() failure and direction for possible solution
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 05 Apr 2009 12:12:02 -0000

This is a multi-part message in MIME format.
--------------090600010407000403070006
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hi folks,

As has been noted before, there is an issue with the mlockall() system
call always failing on (at least) the amd64 architecture. This is quite
evident from the automounter (as configured out-of-the-box) printing
error messages on startup like:

Couldn't lock process pages in memory using mlockall()

I have verified the occurrence of this issue on the amd64 platform on
7.1-STABLE and 8.0-CURRENT. On the i386 platform this problem does not
occur.

To investigate this issue a bit further I ran the following trivial program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
         if (mlockall(MCL_CURRENT|MCL_FUTURE) == -1)
                 perror(argv[0]);

         char command[80];
         snprintf(command, 80, "procstat -v %d", getpid());
         system(command);

         exit(0);
}

which yields (using 8.0-CURRENT as of today, on an Intel DP965LT board
with a Q6600 and 8 Gbyte RAM, GENERIC kernel stripped of unused devices,
output folded to 72 characters per line):

/mltest: Resource temporarily unavailable
   PID              START                END PRT  RES PRES REF SHD FL TP
PATH
  1064           0x400000           0x401000 r-x    1    0   1   0 CN vn
/root/mlockall/mltest
  1064           0x500000           0x501000 rw-    1    0   1   0 CN df
  1064           0x501000           0x600000 rwx  255    0   1   0 -- df
  1064        0x800500000        0x80052c000 r-x   44    0  64  31 CN vn
/libexec/ld-elf.so.1
  1064        0x80052c000        0x800534000 rw-    8    0   1   0 C- df
  1064        0x80062b000        0x800633000 rw-    8    0   1   0 CN vn
/libexec/ld-elf.so.1
  1064        0x800633000        0x80063f000 rw-   12    0   1   0 C- df
  1064        0x80063f000        0x80072e000 r-x  239    0 128  62 CN vn
/lib/libc.so.7
  1064        0x80072e000        0x80072f000 r-x    1    0   1   0 CN vn
/lib/libc.so.7
  1064        0x80072f000        0x80082f000 r-x   51    0 128  62 CN vn
/lib/libc.so.7
  1064        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
/lib/libc.so.7
  1064        0x80084f000        0x800865000 rw-    6    0   1   0 CN df
  1064        0x800900000        0x800965000 rw-  101    0   1   0 -- df
  1064        0x800965000        0x800a00000 rw-  155    0   1   0 -- df
  1064     0x7ffffffe0000     0x800000000000 rwx    3    0   1   0 C- df

I have hunted down the exact location in the kernel where the call to
mlockall() returns an error (just using printf's, debugging using
Firewire proved not to be as trivial to set up as it was just a few
years ago). It appears that while wiring the memory, vm_fault() is
eventually called and it bails out at line 412 of vm_fault.c. The
virtual address of the page that the system is attempting to wire
(argument vaddr of vm_fault()) is 0x800762000. From the procstat output
above it appears that this is in the third region backed by
/lib/libc.so.7.
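
The lookup can also be done mechanically. What follows is only a
minimal sketch, not something used in the investigation itself: the
region bounds are copied from the four /lib/libc.so.7 lines of the
procstat output above, and the names struct region, libc_regions and
find_region() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One address range from the procstat -v output (hypothetical type). */
struct region {
        uint64_t start;         /* first address, inclusive */
        uint64_t end;           /* last address + 1, exclusive */
};

/* The four ranges listed with PATH /lib/libc.so.7, in output order. */
static const struct region libc_regions[] = {
        { 0x80063f000ULL, 0x80072e000ULL },     /* r-x */
        { 0x80072e000ULL, 0x80072f000ULL },     /* r-x */
        { 0x80072f000ULL, 0x80082f000ULL },     /* r-x */
        { 0x80082f000ULL, 0x80084f000ULL },     /* rw- */
};
static const size_t libc_nregions =
    sizeof(libc_regions) / sizeof(libc_regions[0]);

/* Return the 0-based index of the range containing addr, or -1. */
static int
find_region(const struct region *r, size_t n, uint64_t addr)
{
        assert(r != NULL);
        for (size_t i = 0; i < n; i++)
                if (addr >= r[i].start && addr < r[i].end)
                        return ((int)i);
        return (-1);
}
```

find_region(libc_regions, libc_nregions, 0x800762000ULL) returns 2,
i.e. the third entry, and that entry spans exactly 0x100000 bytes
(1 Mbyte).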

This made me think that the issue might be somehow related to the way in
which dynamic libraries are linked at runtime. Indeed, if the above
program is linked -statically- it does not fail. Also, if the program is
compiled and linked -dynamically- on an i386 platform and run on an
amd64, it runs successfully.

To make a long story at least a bit shorter, I found that the problem is
in /usr/src/libexec/rtld-elf/map_object.c at line 156. Here a contiguous
region is staked out for the code and data. On amd64, where the
required alignment of the segments is 1 Mbyte, this causes a region to
be mapped that is far larger than the library file by which it is
backed. Addresses that are not backed by the file cannot be resident and
hence the region cannot be locked into memory. On the i386 architecture
this problem does not occur since the alignment of the segments is just
4 Kbytes. I suspect that the problem also occurs at least on the sparc64
architecture.

As a first step to a possible solution you can apply the attached
(provisional) patch, which uses an anonymous, read-only mapping to
create the required region.

The output of the above program then becomes:

   PID              START                END PRT  RES PRES REF SHD FL TP
PATH
  1302           0x400000           0x401000 r-x    1    0   1   0 CN vn
/root/mlockall/mltest
  1302           0x500000           0x501000 rw-    1    0   1   0 -- df
  1302        0x800500000        0x80052c000 r-x   44    0   8   4 CN vn
/libexec/ld-elf.so.1
  1302        0x80052c000        0x800534000 rw-    8    0   1   0 -- df
  1302        0x80062b000        0x800633000 rw-    8    0   1   0 C- vn
/libexec/ld-elf.so.1
  1302        0x800633000        0x80063f000 rw-   12    0   1   0 -- df
  1302        0x80063f000        0x80072e000 r-x  239    0 124  62 CN vn
/lib/libc.so.7
  1302        0x80072e000        0x80072f000 r-x    1    0   1   0 C- vn
/lib/libc.so.7
  1302        0x80072f000        0x80082f000 r--  256    0   1   0 -- df
  1302        0x80082f000        0x80084f000 rw-   32    0   1   0 C- vn
/lib/libc.so.7
  1302        0x80084f000        0x800865000 rw-   22    0   1   0 -- df
  1302     0x7ffffffe0000     0x800000000000 rwx   32    0   1   0 -- df

i.e. mlockall() does not return an error anymore.

I still have the following questions:

1. Is it worth the trouble to solve the mlockall() problem at all ?
Should I file a PR ?

2. Can someone confirm that it also occurs on the other 64 bit 
architectures ?

3. It might be more elegant to use PROT_NONE instead of PROT_READ when 
just staking out the address space. Currently mlockall() returns an 
error when attempting that, so most likely mlockall() would need to be 
changed to ignore regions mapped with PROT_NONE. On the other hand, the 
pthread implementation uses PROT_NONE to create red zones on the stack 
and mlockall() apparently succeeds with threaded applications (using the 
provided patch). Any opinions/ideas/hints ?

Kind regards,

Hans


--------------090600010407000403070006
Content-Type: text/plain;
 name="rtld.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="rtld.diff"

Index: map_object.c
===================================================================
RCS file: /home/ncvs/src/libexec/rtld-elf/map_object.c,v
retrieving revision 1.19
diff -u -r1.19 map_object.c
--- map_object.c	18 Mar 2009 13:40:37 -0000	1.19
+++ map_object.c	5 Apr 2009 10:53:31 -0000
@@ -153,8 +153,8 @@
     mapsize = base_vlimit - base_vaddr;
     base_addr = hdr->e_type == ET_EXEC ? (caddr_t) base_vaddr : NULL;
 
-    mapbase = mmap(base_addr, mapsize, convert_prot(segs[0]->p_flags),
-      convert_flags(segs[0]->p_flags), fd, base_offset);
+    mapbase = mmap(base_addr, mapsize, PROT_READ,
+      MAP_NOCORE|MAP_ANON, -1, 0);
     if (mapbase == (caddr_t) -1) {
 	_rtld_error("%s: mmap of entire address space failed: %s",
 	  path, strerror(errno));
@@ -175,8 +175,7 @@
 	data_addr = mapbase + (data_vaddr - base_vaddr);
 	data_prot = convert_prot(segs[i]->p_flags);
 	data_flags = convert_flags(segs[i]->p_flags) | MAP_FIXED;
-	/* Do not call mmap on the first segment - this is redundant */
-	if (i && mmap(data_addr, data_vlimit - data_vaddr, data_prot,
+	if (mmap(data_addr, data_vlimit - data_vaddr, data_prot,
 	  data_flags, fd, data_offset) == (caddr_t) -1) {
 	    _rtld_error("%s: mmap of data failed: %s", path, strerror(errno));
 	    return NULL;

--------------090600010407000403070006--