Date: Tue, 23 Aug 2005 16:35:59 +0900 (JST)
From: NAKATA Maho <chat95@mac.com>
To: freebsd-amd64@FreeBSD.org
Subject: 1.5 times slower performance with SCHED_ULE than SCHED_4BSD
Message-ID: <20050823.163559.74703336.chat95@mac.com>
Hello list,

I noticed last year that SCHED_ULE is slower than SCHED_4BSD and raised a PR. At that time the result was not convincing, because 5.3-RELEASE/amd64 was not stable enough with a large amount of memory, etc. My recent 5.4-RELEASE/amd64 is extremely stable even with a large amount of memory under high load, so now I can say something definite. If you are interested in this e-mail, please test it yourself. I have also prepared statically linked binaries so the test is easy to reproduce, because it uses ATLAS (math/atlas-devel), which is a real pain to build.

My conclusion in short: SCHED_ULE is 1.5 times slower than SCHED_4BSD in FreeBSD 5.4-RELEASE/amd64. Both SCHED_4BSD and SCHED_ULE are definitely SMP aware, but SCHED_ULE's scheduling is not efficient for very large jobs, whereas 4BSD is almost optimal.

My Opteron box:
o Tyan S2885 Tiger K8W
o Opteron 242 x2 (1.6GHz each)
o Transcend 2G x8 (16G total)

How to repeat:
fetch http://people.freebsd.org/~maho/scheduler_amd64.tar.gz
tar xvfz scheduler_amd64.tar.gz
cd scheduler_amd64/
sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200

My results:

o 4BSD

sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200
kern.sched.name: 4BSD
NREPS ORDER  UPLO     N   LDA     TIME    MFLOP        RESID
===== ===== ===== ===== ===== ======== ======== ============
    0   Col    GE  7000  7000  157.028  4368.50 2.110725e-02
    0   Col    GE  7200  7200  168.527  4429.39 2.106386e-02
    0   Col    GE  7400  7400  185.014  4380.32 2.099199e-02
    0   Col    GE  7600  7600  198.622  4420.07 2.073756e-02
    0   Col    GE  7800  7800  214.284  4429.04 2.089531e-02
    0   Col    GE  8000  8000  232.126  4411.27 2.142018e-02
    0   Col    GE  8200  8200  255.809  4310.65 2.041516e-02
    0   Col    GE  8400  8400  265.088  4471.62 2.092699e-02
    0   Col    GE  8600  8600  285.403  4457.11 2.119786e-02
    0   Col    GE  8800  8800  306.969  4439.88 2.257722e-02
    0   Col    GE  9000  9000  324.456  4493.54 2.347010e-02
11 cases: 11 passed, 0 skipped, 0 failed
     4707.78 real     9019.12 user       38.34 sys

o ULE

sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200
kern.sched.name: ule
NREPS ORDER  UPLO     N   LDA     TIME    MFLOP        RESID
===== ===== ===== ===== ===== ======== ======== ============
    0   Col    GE  7000  7000  284.579  2410.49 2.110725e-02
    0   Col    GE  7200  7200  176.769  4222.87 2.106386e-02
    0   Col    GE  7400  7400  183.035  4427.67 2.099199e-02
    0   Col    GE  7600  7600  195.830  4483.10 2.073756e-02
    0   Col    GE  7800  7800  228.077  4161.20 2.089531e-02
    0   Col    GE  8000  8000  267.382  3829.61 2.142018e-02
    0   Col    GE  8200  8200  247.578  4453.95 2.041516e-02
    0   Col    GE  8400  8400  261.590  4531.42 2.092699e-02
    0   Col    GE  8600  8600  308.443  4124.18 2.119786e-02
    0   Col    GE  8800  8800  331.672  4109.20 2.257722e-02
    0   Col    GE  9000  9000  320.790  4544.91 2.347010e-02
11 cases: 11 passed, 0 skipped, 0 failed
     6964.19 real     8720.26 user       34.31 sys

o What is my test doing? What is xdinvtst_pt?

This program computes the inversion of randomly generated matrices in double precision. "_pt" means pthread: it creates two threads at a time to calculate the inversion of the matrix. The calculation runs from a 7000x7000 matrix up to 9000x9000, increasing the rows and columns by 200 each step. A minimal sketch of this kind of two-thread workload follows the build notes below.

o How to build xdinvtst_pt?

Build math/atlas-devel on an SMP machine; the port detects the number of installed processors. Building ATLAS takes a very long time (about 1.5 days) and requires typing "make" many times (10-20 times!) since the port is fragile. Then go down into the work directory, manually fix some makefiles to point to the Fortran BLAS/LAPACK (via math/lapack), and you can build it yourself. This is why I included it in the archive and prepared statically linked binaries.
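Just to illustrate, here is a minimal sketch of the kind of load the test puts on the scheduler: two pthreads doing sustained double-precision work at the same time. This is NOT the actual xdinvtst_pt source; the size N and the busy-work loop are placeholders for the ATLAS inversion. It can be compiled with, e.g., cc -O2 -pthread sketch.c.

/*
 * Minimal sketch -- not the actual xdinvtst_pt source.  Two pthreads
 * each run a sustained double-precision loop at the same time, which
 * is the shape of the load the benchmark puts on the scheduler.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000                          /* placeholder; the real test uses 7000..9000 */

static void *
worker(void *arg)
{
        int id = *(int *)arg;
        double *a = malloc((size_t)N * N * sizeof(double));
        double sum = 0.0;
        long i, j;

        if (a == NULL)
                return (NULL);
        for (i = 0; i < (long)N * N; i++)       /* fill a dense matrix */
                a[i] = (double)(i % 1000) * 1e-3 + id;
        for (i = 0; i < N; i++)                 /* dense FP busy work, stands in for the inversion */
                for (j = 0; j < (long)N * N; j++)
                        sum += a[j] * 1.0000001;
        printf("thread %d done (checksum %g)\n", id, sum);
        free(a);
        return (NULL);
}

int
main(void)
{
        pthread_t t[2];
        int id[2] = { 0, 1 };
        int i;

        /*
         * Two CPU-bound threads at a time, as xdinvtst_pt does: an
         * ideal SMP scheduler keeps one on each of the two CPUs for
         * the whole run.
         */
        for (i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, worker, &id[i]);
        for (i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
        return (0);
}

On the dual Opteron an ideal scheduler finishes the two workers in roughly the wall-clock time of one of them; if the scheduler keeps putting both on the same CPU, the wall-clock time stretches toward the sum of the two.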
o Performance of Opteron and effect of SMP

The theoretical double-precision peak using SSE2 on a 1.6GHz Opteron is 3.2 GFlops, so 6.4 GFlops for the dual-processor machine. The performance of the largest test (inversion of a 9000x9000 matrix in double precision) is about 4.5 GFlops, namely ~70% of the theoretical peak. This is very good; in my experience the best experimental performance on a single processor is ~80%, and such a figure is usually reached only by much more primitive calculations.

o ULE vs 4BSD

Please see these rows:
4BSD     0   Col    GE  9000  9000  324.456  4493.54 2.347010e-02
ule      0   Col    GE  9000  9000  320.790  4544.91 2.347010e-02

These lines show ~4.5 GFlops while inverting the matrix: 324 seconds for 4BSD and 320 seconds for ULE. This does not contradict my claim, because what ATLAS reports is the CPU time accumulated on both processors. In the best case that is ~160 seconds on one processor plus ~160 seconds on the other, which ATLAS measures as ~320 seconds in total. At least ~320 CPU-seconds are always needed to invert the matrix; how much wall-clock time passes is not measured here. With ULE, for example, ~240 seconds may pass on one processor and ~80 seconds on the other, so we *must* wait 240 seconds, while with 4BSD we only wait ~160 seconds.

The actual difference between ULE and 4BSD shows up in /usr/bin/time:

4BSD     4707.78 real     9019.12 user       38.34 sys
ULE      6964.19 real     8720.26 user       34.31 sys

and 6964.19/4707.78 = 1.479. The user time is ~9000 seconds for 4BSD and ~8700 seconds for ULE, while the real (wall-clock) time is ~4700 seconds for 4BSD and ~7000 seconds for ULE. So the time spent on the actual work is about the same (~9000s vs ~8700s), but the scheduling is not efficient for this calculation, so ULE needs more wall-clock time. (A small sketch of this real-vs-user measurement is attached as a postscript at the end of this message.)

o Scheduling threads / processes?

Scheduling threads and scheduling processes can be different, but other experiments show that if we run the same job two at a time as separate processes, ULE is also ~1.5 times slower than 4BSD.

o Conclusion

4BSD is near optimal for large calculations, and ULE is ~1.5 times slower than 4BSD. Both scheduling algorithms are SMP aware. I don't think ULE is a good choice as the default.

All the best,
-- NAKATA, Maho (maho@FreeBSD.org)
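P.S. To make the real-vs-user distinction concrete, here is a minimal sketch of a stripped-down /usr/bin/time. This is my own illustration, not the real utility's source, and error handling is omitted. It forks a command, collects the child's CPU time with wait4(), and prints real/user/sys the same way as the numbers quoted above.

/*
 * Sketch of a stripped-down /usr/bin/time: wall-clock ("real") time
 * comes from gettimeofday(), CPU ("user"/"sys") time comes from the
 * child's struct rusage returned by wait4().
 */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        struct timeval t0, t1;
        struct rusage ru;
        pid_t pid;
        int status;

        if (argc < 2) {
                fprintf(stderr, "usage: %s command [args ...]\n", argv[0]);
                return (1);
        }
        gettimeofday(&t0, NULL);
        pid = fork();
        if (pid == 0) {
                execvp(argv[1], &argv[1]);
                _exit(127);
        }
        wait4(pid, &status, 0, &ru);    /* collect the child's rusage */
        gettimeofday(&t1, NULL);

        printf("%12.2f real %12.2f user %12.2f sys\n",
            (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6,
            ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
            ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);
        return (WIFEXITED(status) ? WEXITSTATUS(status) : 1);
}

Because the user figure is the CPU time summed over every thread on both processors, it stays near ~9000 seconds under either scheduler, while the real figure grows whenever the two threads are not actually kept running in parallel.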