Date:      Thu, 15 Nov 2012 18:07:27 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Chris Rees <utisoft@gmail.com>
Cc:        src-committers@FreeBSD.org, "Simon J. Gerraty" <sjg@FreeBSD.org>, svn-src-all@FreeBSD.org, "David E. O'Brien" <obrien@FreeBSD.org>, svn-src-head@FreeBSD.org, Konstantin Belousov <kostikbel@gmail.com>
Subject:   Re: svn commit: r242102 - in head: contrib/bmake usr.bin/bmake
Message-ID:  <20121115151622.J1179@besplex.bde.org>
In-Reply-To: <CADLo83_TQ0213jeo16J5X=vdKVbbYPq=WN2HZJCLkKMCP=RkFA@mail.gmail.com>
References:  <201210252318.q9PNI6IQ069461@svn.freebsd.org> <20121114172823.GA20127@dragon.NUXI.org> <20121114184837.GA73505@kib.kiev.ua> <CADLo83_TQ0213jeo16J5X=vdKVbbYPq=WN2HZJCLkKMCP=RkFA@mail.gmail.com>

On Wed, 14 Nov 2012, Chris Rees wrote:

> On 14 Nov 2012 18:49, "Konstantin Belousov" <kostikbel@gmail.com> wrote:
>>
>> On Wed, Nov 14, 2012 at 09:28:23AM -0800, David O'Brien wrote:
>>> On Thu, Oct 25, 2012 at 11:18:06PM +0000, Simon J. Gerraty wrote:
>>>> Log:
>>>>   Merge bmake-20121010
>>>
>>> Hi Simon,
>>> I was kicking the tires on this and noticed bmake is dynamically linked.
>>>
>>> Can you change it to being statically linked?
>>>
>>> This issue most recently came up in freebsd-current.  See thread pieces
>>>
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033460.html
>>>
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033472.html
>>>
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033473.html
>>
>> As you see, I prefer to not introduce new statically linked binaries into
>> base.  If, for unfortunate turns of events, bmake is changed to be
>> statically linked, please obey WITH_SHARED_TOOLCHAIN.
>
> Or a /rescue/bmake for when speed is a concern would also be acceptable.

Yes, the big rescue executable is probably even better than dynamic linkage
for pessimizing speeds.  Sizes on freefall now:

%    text	   data	    bss	    dec	    hex	filename
%  130265	   1988	   9992	 142245	  22ba5	/bin/sh
% 5256762	 133964	2220464	7611190	 742336	/rescue/sh
% -r--r--r--  1 root  wheel  3738610 Nov 11 06:48 /usr/lib/libc.a

The dynamically linked /bin/sh is deceptively small, although it is larger
than the statically linked /bin/sh in FreeBSD-1 in return for few new
features.  When executed, it expands to 16.5MB with a 10MB RSS.  I don't
know how much of that is malloc bloat that wouldn't need to be copied on
fork, but it is a lot just to map.  /rescue/sh starts at 5MB and expands
to 15.5MB with a 9.25MB RSS when executed.  So it is slightly smaller, and
its slowness is determined by its non-locality.  Perhaps its non-locality
is not as good for pessimization as libc's.
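For what it's worth, figures like the ones above can be read off with
size(1) and ps(1) (my assumed method; the mail doesn't say how the RSS
numbers were taken):

```shell
# Sketch (assumed method): size(1) gives the text/data/bss columns;
# ps(1) gives the VSZ/RSS of a running instance of the shell.
command -v size >/dev/null && size /bin/sh
/bin/sh -c 'ps -o vsz= -o rss= -p $$'
```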

I don't use dynamic linkage of course.  /bin/sh is bloated by static
linkage (or rather libc) in the FreeBSD-~5.2 that I usually run:

    text	   data	    bss	    dec	    hex	filename
  649623	   8192	  64056	 721871	  b03cf	/bin/sh

but this "only" expands to 864K with a 580K RSS when executed.  This can be
forked a little faster than 10MB of RSS.   In practice the timings for

     time whatever/sh -c 'for i in $(jot 1000 1); do echo -n; done'

are:

     freefall /bin/sh:    6.93 real 1.69 user 5.16 sys
     freefall /rescue/sh: 6.86 real 1.65 user 5.13 sys
     local    /bin/sh:    0.21 real 0.01 user 0.18 sys
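The loop generalizes to any shell and any command; a sketch (seq(1)
stands in for BSD jot(1) where jot is missing, and `whatever/sh` is
whichever shell is under test):

```shell
# Sketch of the benchmark: time 1000 iterations of a cheap command
# under a given shell.  seq(1) stands in for BSD jot(1).
sh_under_test=/bin/sh      # placeholder for the mail's "whatever/sh"
cmd=':'                    # builtin no-op; swap in /bin/rm, true, etc.
time "$sh_under_test" -c "for i in \$(seq 1 1000); do $cmd; done"
```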

freefall:
FreeBSD 10.0-CURRENT #4 r242881M: Sun Nov 11 05:30:05 UTC 2012
     root@freefall.freebsd.org:/usr/obj/usr/src/sys/FREEFALL amd64
CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz (2666.82-MHz K8-class CPU)
   Origin = "GenuineIntel"  Id = 0x206c2  Family = 0x6  Model = 0x2c  Stepping = 2

local:
FreeBSD 5.2-CURRENT #4395: Sun Apr  8 12:15:03 EST 2012
     bde@besplex.bde.org:/c/obj/usr/src/sys/compile/BESPLEX.fw
...
CPU: AMD Athlon(tm) 64 Processor 3200+ (2010.05-MHz 686-class CPU)
   Origin = "AuthenticAMD"  Id = 0xf48  Stepping = 8

freefall may be pessimized by INVARIANTS.  It is pessimized by /bin/echo
being dynamically linked.  Normally shells use builtin echo so the speed
of /bin/echo is unimportant.  There is also some strangeness in the timing
for /bin/echo specifically.  Changing 'echo -n' to
'/bin/rm -f /etc/nonesuch' or /usr/bin/true reduces the times on freefall
by almost a factor of 2, although rm is larger and has to do more:

freefall:
    text	   data	    bss	    dec	    hex	filename
    2661	    540	      8	   3209	    c89	/bin/echo
   11026	    884	    152	  12062	   2f1e	/bin/rm
    1420	    484	      8	   1912	    778	/usr/bin/true
(all dynamically linked to libc only.  truss verifies that rm does a little
more).
     freefall /bin/sh    echo: 6.93 real 1.69 user 5.16 sys
     freefall /bin/sh    rm:   3.83 real 0.91 user 2.84 sys
     freefall /bin/sh    true: 3.68 real 0.75 user 2.85 sys
     freefall /rescue/sh echo: 6.86 real 1.65 user 5.13 sys
     freefall /rescue/sh rm:   3.69 real 0.83 user 2.78 sys
     freefall /rescue/sh true: 3.67 real 0.85 user 2.74 sys
     local    /bin/sh    echo: 0.21 real 0.01 user 0.18 sys
     local    /bin/sh    rm:   0.22 real 0.02 user 0.19 sys
     local    /bin/sh    true: 0.18 real 0.01 user 0.17 sys
local:
    text	   data	    bss	    dec	    hex	filename
   11926	     60	    768	  12754	   31d2	/bin/echo
  380758	   6752	  61772	 449282	  6db02	/bin/rm
    1639	     40	    604	   2283	    8eb	/usr/bin/true
(all statically linked.  I managed to debloat crtso and libc enough for
/usr/bin/true to be small.  The sources for /bin/echo are excessively
optimized for space in the executable -- they have contortions to avoid
using printf.)  But this is useless in -current, since crtso and libc
drag in printf, so the null program int main(){} has size:

freefall (amd64):
    text	   data	    bss	    dec	    hex	filename
  316370	  12156	  55184	 383710	  5dade	null-static
    1452	    484	      8	   1944	    798	null-dynamic
local (i386):
    text	   data	    bss	    dec	    hex	filename
    1490	     40	    604	   2134	    856	null-static
    1203	    208	     32	   1443	    5a3	null-dynamic

Putting this null program in the jot loop gives a truer indication of the
cost of a statically linked shell:

     freefall /bin/sh    null-static:  6.36 real 1.51 user 4.45 sys
     freefall /bin/sh    null-dynamic: 3.92 real 0.85 user 2.71 sys
     local    /bin/sh    null-static:  0.18 real 0.00 user 0.18 sys
     local    /bin/sh    null-dynamic: 0.58 real 0.09 user 0.49 sys

The last 2 lines show the expected large cost of dynamic linkage for
a small program (3 times slower), but the freefall lines show strangeness
-- static linkage is almost twice as slow, and almost as slow as
/bin/echo -n.  So to get a truer indication of the cost of a statically
linked shell, test with my favourite small program:

%%%
#include <sys/syscall.h>

	.globl	_start
_start:
	movl	$SYS_sync,%eax
	int	$0x80
	pushl	$0		# only to look like a sync library call (?)
	pushl	$0
	movl	$SYS_exit,%eax
	int	$0x80
%%%

This is my sync.S source file for sync(1) on x86 (must build on i386
using cc -o sync sync.S -nostdlib).

local:
    text	   data	    bss	    dec	    hex	filename
      18	      0	      0	     18	     12	sync

It does the same amount of error checking as /usr/src/bin/sync.c (none),
which compiles to:

freefall:
    text	   data	    bss	    dec	    hex	filename
  316330	  12092	  55184	 383606	  5da76	sync-static
    1503	    492	      8	   2003	    7d3	sync-dynamic

Putting this in the jot loop gives:

     local    /bin/sh    sync: 0.65 real 0.01 user 0.63 sys

but since sync(2) is a heavyweight syscall and I don't want to exercise
freefall's disks, remove the syscall from the program, so it just
does _exit(0):

    text	   data	    bss	    dec	    hex	filename
      11	      0	      0	     11	      b	syncfree-sync

     freefall /bin/sh    syncfree-sync: 0.29 real 0.01 user 0.11 sys
     local    /bin/sh    syncfree-sync: 0.17 real 0.00 user 0.17 sys

This shows that most of freefall's enormous slowness is for execing
its bloated executables, perhaps especially when they are on nfs
(oops).  Another test of null-static after copying it to /tmp shows
that nfs makes little difference.  However, syncfree-sync is much
faster when copied to /tmp (<= 0.08 seconds real.  Test not done, but
this result is read off from a later test).

Next, try bloating syncfree-sync with padding to the same size as
null-static:

%%%
#include <sys/syscall.h>

	.text
	.globl	_start
_start:
	pushl	$0
	pushl	$0
	movl	$SYS_exit,%eax
	int	$0x80
	.space	316370-11
.data
	.space	12156
.bss
	.space	55184
%%%
    text	   data	    bss	    dec	    hex	filename
  316370	  12156	  55184	 383710	  5dade	bloated-syncfree-sync

     freefall /bin/sh bloated-syncfree-sync: 0.08 real 0.00 user 0.08 sys (zfs)
     freefall /bin/sh bloated-syncfree-sync: 0.30 real 0.00 user 0.13 sys (nfs)
     local    /bin/sh bloated-syncfree-sync: 0.21 real 0.00 user 0.21 sys (ffs)

This shows that the kernel is still quite fast and the enormous slowness
on freefall is mainly in crtso.  I blame malloc() for this.  malloc()
first increases the size of a null statically linked program from ~1K
text to 310K text.  Then it increases the startup time by a factor of
50 or so.  For small utilities like echo and rm, the increases are
similar.  A small utility only needs to allocate about 8K of data (for
stdio buffers).  Since execing bloated-syncfree-sync is fast, a small
utility could do this allocation a few thousand times in the time that
crtso now takes to start up (the 300+K of padding only gives enough for
statically allocating 40 x 8K; expanding the padding by a factor of
50 might slow down the exec to the crtso time, but gives 2000 x 8K).
Of course, actually using the allocated areas will slow down both the
statically allocated and the dynamically allocated cases a lot.

More tests with a large program on small data (put 'cc -c null.c' in
the jot loop, where null.c is int main(){}):

     freefall /bin/sh clang: 22.53 real  6.35 user 12.15 sys (nfs)
     freefall /bin/sh   gcc: 35.28 real 13.14 user 17.45 sys (nfs)
     local    /bin/sh    cc: 17.50 real  6.72 user  2.64 sys (ffs)

The crtso slowness seems to be very significant even here.  Assume that
it is 6 seconds (divided by 1000) per exec.  clang is monolithic and
does only 1 exec per cc -c.  gcc is a small driver program that execs
cc1 and as (it used to exec a separate cpp too).  So gcc does 3 execs
per cc -c, and 6 seconds extra for the 2 extra execs accounts almost
exactly for clang being 12.75 seconds faster.
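The arithmetic behind that, spelled out (figures taken from the estimates
above, nothing new assumed):

```shell
# Back-of-envelope check of the exec-overhead argument: ~6 seconds of
# crtso startup per 1000 execs, and gcc does 2 extra execs (cc1, as)
# per cc -c, so over the 1000-iteration loop:
per_1000_execs=6
extra_execs=2
echo "$((per_1000_execs * extra_execs)) s extra, vs the observed 12.75 s gap"
```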

The `local' time apparently shows a large accounting bug.  Actually, it
is because I left a shell loop for testing this running in the background.
All the other 'local' times are not much affected by this, since the
background loop has low priority, and scheduling works so that it is
rarely run in competition with the tiny programs in the other tests.
But here the cc's compete with it significantly.  After fixing this
and also running the freefall tests on zfs:

     freefall /bin/sh clang: 19.69 real  6.74 user 12.82 sys (zfs)
     freefall /bin/sh   gcc: 28.51 real 12.75 user 15.47 sys (zfs, gcc-4.2.1)
     local    /bin/sh    cc:  8.95 real  6.17 user  2.74 sys (ffs, gcc-3.3.3)

gcc-4.2.1 is only 35% slower than gcc-3.3.3 on larger source files when it
is run locally:

     local /bin/sh gcc: 120.1 real 112.4  user 7.4 sys (ffs, gcc-3.3.3 -O1 -S)
     local /bin/sh gcc: 164.6 real 155.8  user 8.1 sys (ffs, gcc-3.3.3 -O2 -S)
     local /bin/sh gcc: 161.9 real 148.0  user 8.1 sys (ffs, gcc-4.2.1 -O1 -S)
     local /bin/sh gcc: 202.4 real 193.6  user 8.0 sys (ffs, gcc-4.2.1 -O2 -S)

Maybe malloc() would be faster with MALLOC_PRODUCTION.  I use
/etc/malloc.conf -> aj locally.  freefall doesn't have /etc/malloc.conf.
MALLOC_OPTIONS no longer works, and MALLOC_CONF is too large for me to
understand, so I don't know how to turn off non-production features
dynamically.

Bruce


