Date: Mon, 12 Dec 2005 06:04:15 +0100
From: Johan Bucht <bucht@acc.umu.se>
To: Kris Kennaway <kris@obsecurity.org>
Cc: current@freebsd.org
Subject: Re: New libc malloc patch
Message-ID: <439D04CF.1000204@acc.umu.se>
In-Reply-To: <20051212043023.GA16678@xor.obsecurity.org>
References: <B6653214-2181-4342-854D-323979D23EE8@canonware.com>
 <Pine.LNX.4.53.0511291121360.27754@regurgitate.ugcs.caltech.edu>
 <0B746373-8C29-4ADF-9218-311AE08F3834@canonware.com>
 <b41c75520512031245q48521143m@mail.gmail.com>
 <7318D807-9086-4817-A40B-50D6960880FB@canonware.com>
 <b41c75520512040451t360eb01u@mail.gmail.com>
 <12CA5E15-D006-441D-A24C-1BCD1A69D740@canonware.com>
 <439CC5DA.3080103@elischer.org>
 <439CC939.5080703@freebsd.org>
 <20051212012907.GA13640@xor.obsecurity.org>
 <20051212043023.GA16678@xor.obsecurity.org>

Kris Kennaway wrote:

>On Sun, Dec 11, 2005 at 08:29:07PM -0500, Kris Kennaway wrote:
>
>>I'll try to test this on a 4 CPU amd64 machine next.
>

Thanks for the time.

>phkmalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 5298176 adjusted timing: 4.173052 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 5299200 adjusted timing: 325.108643 seconds for 10000000 requests of 1024 bytes.
> Thread 5298176 adjusted timing: 325.202485 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 5414912 adjusted timing: 1133.238459 seconds for 10000000 requests of 1024 bytes.
> Thread 5299200 adjusted timing: 1134.525255 seconds for 10000000 requests of 1024 bytes.
> Thread 5298176 adjusted timing: 1134.539555 seconds for 10000000 requests of 1024 bytes.
>

Those times seem way too high even for a serial allocator. Is libpthread
performance really this bad on amd64, or is it broken?

>jemalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 1073760528 adjusted timing: 3.777175 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 1073760560 adjusted timing: 3.851702 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761584 adjusted timing: 3.887943 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 1073760528 adjusted timing: 3.866206 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761552 adjusted timing: 13.382795 seconds for 10000000 requests of 1024 bytes.
> Thread 1073762688 adjusted timing: 14.407229 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 4
>Starting test with 4 threads...
> Thread 1073760528 adjusted timing: 3.782923 seconds for 10000000 requests of 1024 bytes.
> Thread 1073763792 adjusted timing: 6.668655 seconds for 10000000 requests of 1024 bytes.
> Thread 1073762688 adjusted timing: 14.346569 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761584 adjusted timing: 14.680211 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 5
>Starting test with 5 threads...
> Thread 1073760560 adjusted timing: 4.748248 seconds for 10000000 requests of 1024 bytes.
> Thread 1073761584 adjusted timing: 9.898153 seconds for 10000000 requests of 1024 bytes.
> Thread 1073764896 adjusted timing: 13.019884 seconds for 10000000 requests of 1024 bytes.
> Thread 1073762688 adjusted timing: 15.326908 seconds for 10000000 requests of 1024 bytes.
> Thread 1073763792 adjusted timing: 15.442164 seconds for 10000000 requests of 1024 bytes.
>
>So it's 1.1 times faster for single-threaded, and 107 times faster
>with 3 threads.
>

The thread scheduling problem under 4BSD reported earlier, where the first
thread gets higher priority than the later threads, makes these numbers a
bit strange.

>With libthr instead of libpthread:
>
>phkmalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 5255680 adjusted timing: 2.357247 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 5256192 adjusted timing: 10.964918 seconds for 10000000 requests of 1024 bytes.
> Thread 5255680 adjusted timing: 11.001288 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 5255680 adjusted timing: 17.467754 seconds for 10000000 requests of 1024 bytes.
> Thread 5256704 adjusted timing: 17.724583 seconds for 10000000 requests of 1024 bytes.
> Thread 5256192 adjusted timing: 17.913381 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 4
>Starting test with 4 threads...
> Thread 5255680 adjusted timing: 42.715420 seconds for 10000000 requests of 1024 bytes.
> Thread 5256192 adjusted timing: 43.481252 seconds for 10000000 requests of 1024 bytes.
> Thread 5256704 adjusted timing: 43.871452 seconds for 10000000 requests of 1024 bytes.
> Thread 5257216 adjusted timing: 43.887820 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 5
>Starting test with 5 threads...
> Thread 5255680 adjusted timing: 139.316332 seconds for 10000000 requests of 1024 bytes.
> Thread 5257216 adjusted timing: 140.117720 seconds for 10000000 requests of 1024 bytes.
> Thread 5256192 adjusted timing: 140.134057 seconds for 10000000 requests of 1024 bytes.
> Thread 5256704 adjusted timing: 140.855289 seconds for 10000000 requests of 1024 bytes.
> Thread 5257728 adjusted timing: 140.865934 seconds for 10000000 requests of 1024 bytes.
>

Looks reasonable for a serial allocator.

>jemalloc:
>
># ./malloc-test 1024 10000000 1
>Starting test with 1 thread...
> Thread 1073742416 adjusted timing: 1.366353 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 2
>Starting test with 2 threads...
> Thread 1073742416 adjusted timing: 1.429485 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.530733 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 3
>Starting test with 3 threads...
> Thread 1073742416 adjusted timing: 1.419813 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743376 adjusted timing: 1.432790 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.490218 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 4
>Starting test with 4 threads...
> Thread 1073743376 adjusted timing: 1.447554 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 1.503659 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743856 adjusted timing: 1.503937 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.504926 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 5
>Starting test with 5 threads...
> Thread 1073743376 adjusted timing: 1.595239 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 1.689753 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 1.750115 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744336 adjusted timing: 1.744271 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743856 adjusted timing: 1.890269 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 6
>Starting test with 6 threads...
> Thread 1073743856 adjusted timing: 1.847653 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 2.018481 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743376 adjusted timing: 2.059817 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 2.129204 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744336 adjusted timing: 2.223751 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744816 adjusted timing: 2.293809 seconds for 10000000 requests of 1024 bytes.
># ./malloc-test 1024 10000000 20
>Starting test with 20 threads...
> Thread 1073744816 adjusted timing: 5.113769 seconds for 10000000 requests of 1024 bytes.
> Thread 1073751136 adjusted timing: 4.973369 seconds for 10000000 requests of 1024 bytes.
> Thread 1073750176 adjusted timing: 5.295912 seconds for 10000000 requests of 1024 bytes.
> Thread 1073745296 adjusted timing: 5.502331 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743856 adjusted timing: 5.614890 seconds for 10000000 requests of 1024 bytes.
> Thread 1073744336 adjusted timing: 5.608690 seconds for 10000000 requests of 1024 bytes.
> Thread 1073752096 adjusted timing: 5.555465 seconds for 10000000 requests of 1024 bytes.
> Thread 1073748736 adjusted timing: 5.650922 seconds for 10000000 requests of 1024 bytes.
> Thread 1073748256 adjusted timing: 6.608054 seconds for 10000000 requests of 1024 bytes.
> Thread 1073750656 adjusted timing: 7.144998 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742896 adjusted timing: 7.390905 seconds for 10000000 requests of 1024 bytes.
> Thread 1073746256 adjusted timing: 7.364728 seconds for 10000000 requests of 1024 bytes.
> Thread 1073742416 adjusted timing: 7.556064 seconds for 10000000 requests of 1024 bytes.
> Thread 1073749216 adjusted timing: 7.357179 seconds for 10000000 requests of 1024 bytes.
> Thread 1073752576 adjusted timing: 7.349483 seconds for 10000000 requests of 1024 bytes.
> Thread 1073747776 adjusted timing: 7.375179 seconds for 10000000 requests of 1024 bytes.
> Thread 1073751616 adjusted timing: 7.557854 seconds for 10000000 requests of 1024 bytes.
> Thread 1073743376 adjusted timing: 7.915978 seconds for 10000000 requests of 1024 bytes.
> Thread 1073749696 adjusted timing: 7.795219 seconds for 10000000 requests of 1024 bytes.
> Thread 1073745776 adjusted timing: 8.007392 seconds for 10000000 requests of 1024 bytes.
>

Seems to experience the same scheduling issues but to a lesser extent.

>So libthr is *much* faster than libpthread with both malloc
>implementations, but jemalloc is still 1.7 times faster for 1 thread
>and 80 times faster for 5 threads than phkmalloc.
>

This test only exercises local arena performance, which makes it the worst
case for serial allocators, since all threads contend for the same lock. At
the same time it is the best case for jemalloc, since all memory resides in
the local arena. That means no contention at all unless threads get hashed
into the same arena. Basically you are comparing the worst case of phkmalloc
against the best case of jemalloc. =)

Would be nice if someone could run some supersmack benchmarks.

>Kris
>
>P.S. Holy crap :)
>
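
To make the arena point above concrete, here is a minimal sketch in C of how threads might be spread over arenas by hashing a thread identifier. This is an illustration only, not code from the patch: the names arena_for_thread and max_arena_load, the hash constant, and the modulo mapping are all assumptions made for the example. With a single arena (the serial-allocator situation) every thread lands on the same lock; with several arenas, contention only appears when two threads hash to the same one.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: map a thread identifier to one of `narenas` arenas.
 * The multiplicative hash constant is illustrative; the real allocator may
 * pick arenas in a completely different way. Requires narenas >= 1.
 */
static unsigned
arena_for_thread(uintptr_t tid, unsigned narenas)
{
	uint64_t h = (uint64_t)tid * 0x9E3779B97F4A7C15ULL;

	return (unsigned)((h >> 32) % narenas);
}

/*
 * Hypothetical helper: return the largest number of threads that end up
 * sharing one arena. A result of 1 means no two threads share an arena
 * lock; a result equal to nthreads means every thread contends for the
 * same lock, as with a single-arena (serial) allocator.
 */
static unsigned
max_arena_load(const uintptr_t *tids, size_t nthreads, unsigned narenas)
{
	unsigned worst = 0;

	for (unsigned a = 0; a < narenas; a++) {
		unsigned load = 0;

		for (size_t i = 0; i < nthreads; i++) {
			if (arena_for_thread(tids[i], narenas) == a)
				load++;
		}
		if (load > worst)
			worst = load;
	}
	return worst;
}
```

Calling max_arena_load() with narenas set to 1 always returns nthreads, which is the situation the phkmalloc runs are in. With more arenas the result depends on how the identifiers happen to hash, so a benchmark like malloc-test only shows the contention-free case when that result is 1.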