Date: Thu, 15 Nov 2012 18:07:27 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Chris Rees <utisoft@gmail.com>
Cc: src-committers@FreeBSD.org, "Simon J. Gerraty" <sjg@FreeBSD.org>, svn-src-all@FreeBSD.org, "David E. O'Brien" <obrien@FreeBSD.org>, svn-src-head@FreeBSD.org, Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: svn commit: r242102 - in head: contrib/bmake usr.bin/bmake
Message-ID: <20121115151622.J1179@besplex.bde.org>
In-Reply-To: <CADLo83_TQ0213jeo16J5X=vdKVbbYPq=WN2HZJCLkKMCP=RkFA@mail.gmail.com>
References: <201210252318.q9PNI6IQ069461@svn.freebsd.org> <20121114172823.GA20127@dragon.NUXI.org> <20121114184837.GA73505@kib.kiev.ua> <CADLo83_TQ0213jeo16J5X=vdKVbbYPq=WN2HZJCLkKMCP=RkFA@mail.gmail.com>
On Wed, 14 Nov 2012, Chris Rees wrote:

> On 14 Nov 2012 18:49, "Konstantin Belousov" <kostikbel@gmail.com> wrote:
>>
>> On Wed, Nov 14, 2012 at 09:28:23AM -0800, David O'Brien wrote:
>>> On Thu, Oct 25, 2012 at 11:18:06PM +0000, Simon J. Gerraty wrote:
>>>> Log:
>>>>   Merge bmake-20121010
>>>
>>> Hi Simon,
>>> I was kicking the tires on this and noticed bmake is dynamically linked.
>>>
>>> Can you change it to being statically linked?
>>>
>>> This issue most recently came up in freebsd-current.  See thread pieces
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033460.html
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033472.html
>>> http://lists.freebsd.org/pipermail/freebsd-current/2012-April/033473.html
>>
>> As you see, I prefer to not introduce new statically linked binaries into base.
>> If, for unfortunate turns of events, bmake is changed to be statically linked,
>> please obey WITH_SHARED_TOOLCHAIN.
>
> Or a /rescue/bmake for when speed is a concern would also be acceptable.

Yes, the big rescue executable is probably even better than dynamic
linkage for pessimizing speeds.

Sizes on freefall now:

% text      data     bss      dec      hex     filename
% 130265    1988     9992     142245   22ba5   /bin/sh
% 5256762   133964   2220464  7611190  742336  /rescue/sh
% -r--r--r--  1 root  wheel  3738610 Nov 11 06:48 /usr/lib/libc.a

The dynamically linked /bin/sh is deceptively small, although it is
larger than the statically linked /bin/sh in FreeBSD-1 for few new
features.  When executed, it expands to 16.5MB with 10MB RSS.  I don't
know how much of that is malloc bloat that wouldn't need to be copied
on fork, but it is a lot just to map.  /rescue/sh starts at 5MB and
expands to 15.5MB with 9.25MB RSS when executed.  So it is slightly
smaller, and its slowness is determined by its non-locality.  Perhaps
its non-locality is not as good for pessimization as libc's.

I don't use dynamic linkage of course.  /bin/sh is bloated by static
linkage (or rather by libc) in the FreeBSD-~5.2 that I usually run:

text     data   bss     dec     hex     filename
649623   8192   64056   721871  b03cf   /bin/sh

but this "only" expands to 864K with 580K RSS when executed.  This can
be forked a little faster than 10MB RSS.  In practice, the timings for

    time whatever/sh -c 'for i in $(jot 1000 1); do echo -n; done'

are:

freefall /bin/sh:     6.93 real  1.69 user  5.16 sys
freefall /rescue/sh:  6.86 real  1.65 user  5.13 sys
local /bin/sh:        0.21 real  0.01 user  0.18 sys

freefall:
FreeBSD 10.0-CURRENT #4 r242881M: Sun Nov 11 05:30:05 UTC 2012
    root@freefall.freebsd.org:/usr/obj/usr/src/sys/FREEFALL amd64
CPU: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz (2666.82-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x206c2  Family = 0x6  Model = 0x2c  Stepping = 2

local:
FreeBSD 5.2-CURRENT #4395: Sun Apr 8 12:15:03 EST 2012
    bde@besplex.bde.org:/c/obj/usr/src/sys/compile/BESPLEX.fw
...
CPU: AMD Athlon(tm) 64 Processor 3200+ (2010.05-MHz 686-class CPU)
  Origin = "AuthenticAMD"  Id = 0xf48  Stepping = 8

freefall may be pessimized by INVARIANTS.  It is pessimized by /bin/echo
being dynamically linked.  Normally shells use the builtin echo, so the
speed of /bin/echo is unimportant.  There is also some strangeness in
the timing for /bin/echo specifically.  Changing 'echo -n' to
'/bin/rm -f /etc/nonesuch' or /usr/bin/true reduces the times on
freefall by almost a factor of 2, although rm is larger and has to do
more:

freefall:
text    data   bss   dec    hex    filename
2661    540    8     3209   c89    /bin/echo
11026   884    152   12062  2f1e   /bin/rm
1420    484    8     1912   778    /usr/bin/true

(all dynamically linked to libc only; truss verifies that rm does a
little more).
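For reproducing the jot-loop timings above and below, a harness along
these lines is enough (a sketch only; it uses /usr/bin/time(1) rather
than a shell builtin, which changes the output format slightly but not
the comparison):

%%%
#!/bin/sh
# Run the same 1000-iteration loop under each shell; the command under
# test is the loop body.
for sh in /bin/sh /rescue/sh; do
	for cmd in 'echo -n' '/bin/rm -f /etc/nonesuch' '/usr/bin/true'; do
		echo "== $sh: $cmd"
		/usr/bin/time $sh -c "for i in \$(jot 1000 1); do $cmd; done"
	done
done
%%%

Only the shell binary and the loop body change between runs, so the
differences are dominated by fork/exec cost.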
freefall /bin/sh echo:      6.93 real  1.69 user  5.16 sys
freefall /bin/sh rm:        3.83 real  0.91 user  2.84 sys
freefall /bin/sh true:      3.68 real  0.75 user  2.85 sys
freefall /rescue/sh echo:   6.86 real  1.65 user  5.13 sys
freefall /rescue/sh rm:     3.69 real  0.83 user  2.78 sys
freefall /rescue/sh true:   3.67 real  0.85 user  2.74 sys
local /bin/sh echo:         0.21 real  0.01 user  0.18 sys
local /bin/sh rm:           0.22 real  0.02 user  0.19 sys
local /bin/sh true:         0.18 real  0.01 user  0.17 sys

local:
text     data   bss     dec     hex    filename
11926    60     768     12754   31d2   /bin/echo
380758   6752   61772   449282  6db02  /bin/rm
1639     40     604     2283    8eb    /usr/bin/true

(all statically linked.  I managed to debloat crtso and libc enough for
/usr/bin/true to be small.)

The sources for /bin/echo are excessively optimized for space in the
executable -- they have contortions to avoid using printf.  But this is
useless in -current, since crtso and libc drag in printf, so that the
null program int main(){} has size:

freefall (amd64):
text     data    bss     dec     hex     filename
316370   12156   55184   383710  5dade   null-static
1452     484     8       1944    798     null-dynamic

local (i386):
text     data    bss     dec     hex     filename
1490     40      604     2134    856     null-static
1203     208     32      1443    5a3     null-dynamic

Putting this null program in the jot loop gives a truer indication of
the cost of a statically linked shell:

freefall /bin/sh null-static:    6.36 real  1.51 user  4.45 sys
freefall /bin/sh null-dynamic:   3.92 real  0.85 user  2.71 sys
local /bin/sh null-static:       0.18 real  0.00 user  0.18 sys
local /bin/sh null-dynamic:      0.58 real  0.09 user  0.49 sys

The last 2 lines show the expected large cost of dynamic linkage for a
small program (3 times slower), but the freefall lines show strangeness
-- static linkage is almost twice as slow, and almost as slow as
/bin/echo -n.  So to get a truer indication of the cost of a statically
linked shell, test with my favourite small program:

%%%
#include <sys/syscall.h>

	.globl	_start
_start:
	movl	$SYS_sync,%eax
	int	$0x80
	pushl	$0	# only to look like a sync library call (?)
	pushl	$0
	movl	$SYS_exit,%eax
	int	$0x80
%%%

This is my sync.S source file for sync(1) on x86 (must build on i386
using cc -o sync sync.S -nostdlib).

local:
text   data   bss   dec   hex   filename
18     0      0     18    12    sync

It does the same amount of error checking as /usr/src/bin/sync.c (none),
which compiles to:

freefall:
text     data    bss     dec     hex     filename
316330   12092   55184   383606  5da76   sync-static
1503     492     8       2003    7d3     sync-dynamic

Putting this in the jot loop gives:

local /bin/sh sync:              0.65 real  0.01 user  0.63 sys

but since sync is a heavyweight syscall and I don't want to exercise
freefall's disks, remove the syscall from the program, so it just does
_exit(0):

text   data   bss   dec   hex   filename
11     0      0     11    b     syncfree-sync

freefall /bin/sh syncfree-sync:  0.29 real  0.01 user  0.11 sys
local /bin/sh syncfree-sync:     0.17 real  0.00 user  0.17 sys

This shows that most of freefall's enormous slowness is for execing its
bloated executables, perhaps especially when they are on nfs (oops).
Another test of null-static after copying it to /tmp shows that nfs
makes little difference.  However, syncfree-sync is much faster when
copied to /tmp (<= 0.08 seconds real.  That test was not done
separately, but the result can be read off from a later test below).
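The nfs-vs-local comparison can be repeated along the same lines (a
sketch only; it assumes the test binary was built in an nfs-mounted home
directory and that /tmp is on local storage, as appears to be the case
on freefall):

%%%
# Copy the statically linked test program to local storage and time the
# same loop against both copies; only the backing filesystem differs.
cp $HOME/syncfree-sync /tmp/syncfree-sync
for prog in $HOME/syncfree-sync /tmp/syncfree-sync; do
	echo "== $prog"
	/usr/bin/time /bin/sh -c "for i in \$(jot 1000 1); do $prog; done"
done
%%%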
Next, try bloating syncfree-sync with padding to the same size as
null-static:

%%%
#include <sys/syscall.h>

	.text
	.globl	_start
_start:
	pushl	$0
	pushl	$0
	movl	$SYS_exit,%eax
	int	$0x80

	.space	316370-11

	.data
	.space	12156

	.bss
	.space	55184
%%%

text     data    bss     dec     hex     filename
316370   12156   55184   383710  5dade   bloated-syncfree-sync

freefall /bin/sh bloated-syncfree-sync:  0.08 real  0.00 user  0.08 sys (zfs)
freefall /bin/sh bloated-syncfree-sync:  0.30 real  0.00 user  0.13 sys (nfs)
local /bin/sh bloated-syncfree-sync:     0.21 real  0.00 user  0.21 sys (ffs)

This shows that the kernel is still quite fast and that the enormous
slowness on freefall is mainly in crtso.  I blame malloc() for this.
malloc() first increases the size of a null statically linked program
from ~1K text to 310K text.  Then it increases the startup time by a
factor of 50 or so.  For small utilities like echo and rm, the increases
are similar.  A small utility only needs to allocate about 8K of data
(for stdio buffers).  Since execing bloated-syncfree-sync is fast, a
small utility could do this allocation a few thousand times in the time
that crtso now takes to start up.  (The 300+K of padding only gives
enough for statically allocating 40 x 8K.  Expanding the padding by a
factor of 50 might slow down the exec to the crtso time, but gives
2000 x 8K.)  Of course, actually using the allocated areas will slow
down both the statically allocated and the dynamically allocated cases
a lot.

More tests with a large program on small data (put 'cc -c null.c' in
the jot loop, where null.c is int main(){}):

freefall /bin/sh clang:  22.53 real   6.35 user  12.15 sys (nfs)
freefall /bin/sh gcc:    35.28 real  13.14 user  17.45 sys (nfs)
local /bin/sh cc:        17.50 real   6.72 user   2.64 sys (ffs)

The crtso slowness seems to be very significant even here.  Assume that
it is 6 seconds (divided by 1000) per exec.  clang is monolithic and
does only 1 exec per cc -c.  gcc is a small driver program that execs
cc1 and as (it used to exec a separate cpp too).  So gcc does 3 execs
per cc -c, and 6 seconds extra for each of the 2 extra execs accounts
almost exactly for clang being 12.75 seconds faster.

The `local' time apparently shows a large accounting bug.  Actually, it
is because I left a shell loop for testing this running in the
background.  All the other `local' times are not much affected by this,
since the background loop has low priority, and scheduling works so
that it is rarely run in competition with the tiny programs in the
other tests.  But here the cc's compete with it significantly.  After
fixing this and also running the freefall tests on zfs:

freefall /bin/sh clang:  19.69 real   6.74 user  12.82 sys (zfs)
freefall /bin/sh gcc:    28.51 real  12.75 user  15.47 sys (zfs, gcc-4.2.1)
local /bin/sh cc:         8.95 real   6.17 user   2.74 sys (ffs, gcc-3.3.3)

gcc-4.2.1 is only 35% slower than gcc-3.3.3 on larger source files when
it is run locally:

local /bin/sh gcc:  120.1 real  112.4 user  7.4 sys (ffs, gcc-3.3.3 -O1 -S)
local /bin/sh gcc:  164.6 real  155.8 user  8.1 sys (ffs, gcc-3.3.3 -O2 -S)
local /bin/sh gcc:  161.9 real  148.0 user  8.1 sys (ffs, gcc-4.2.1 -O1 -S)
local /bin/sh gcc:  202.4 real  193.6 user  8.0 sys (ffs, gcc-4.2.1 -O2 -S)

Maybe malloc() would be faster with MALLOC_PRODUCTION.  I use
/etc/malloc.conf -> aj locally.  freefall doesn't have /etc/malloc.conf.
MALLOC_OPTIONS no longer works, and MALLOC_CONF is too large for me to
understand, so I don't know how to turn off non-production features
dynamically.
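For reference, the two knobs mentioned above look roughly like this (a
sketch only: the symlink is the old flag-style option string used for
the local setting above, and spelling MALLOC_PRODUCTION as a make.conf
variable is an assumption about the build system):

%%%
# The local setting mentioned above (run as root):
ln -s aj /etc/malloc.conf

# MALLOC_PRODUCTION is a compile-time knob rather than a runtime one;
# adding it to /etc/make.conf and rebuilding libc (or world) is the
# usual way to compile out jemalloc's debugging features.
echo 'MALLOC_PRODUCTION=yes' >> /etc/make.conf
%%%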
Bruce