Date: Wed, 24 Sep 2008 00:52:59 -0400
From: Jeff Wheelhouse <freebsd-hackers@wheelhouse.org>
To: freebsd-hackers@freebsd.org
Subject: Major SMP problems with lstat/namei
Message-ID: <8185F68B-C443-4891-BEC2-5E3D453DDC93@wheelhouse.org>
We have encountered some serious SMP performance/scalability problems that we've tracked back to lstat/namei calls. I've written a quick benchmark with a pair of tests to simplify/measure the problem.

Both tests use a tree of directories: the top-level directory contains five subdirectories a, b, c, d, and e. Each subdirectory contains five subdirectories a, b, c, d, and e, and so on: 1 directory at level one, 5 at level two, 25 at level three, 125 at level four, 625 at level five, and 3125 at level six.

In the "realpath" test, a random path is constructed at the bottom of the tree (e.g. /tmp/lstat/a/b/c/d/e) and realpath() is called on that, provoking lstat() calls on every component of the path. This is to simulate a mix of high-contention and low-contention lstat() calls.

In the "lstat" test, lstat() is called directly on a random path at the bottom of the tree. Since there are 3125 bottom-level directories, this simulates relatively low-contention lstat() calls.

In both cases, the test repeats as many times as possible for 60 seconds. Each test is run simultaneously by multiple processes, with concurrency doubling progressively from 1 to 512.

What I found was that total throughput at concurrency 2 is about the same as at concurrency 1, probably indicating that the benchmark is pegged on some other resource limit. At concurrency 4, realpath drops to 31.8% of concurrency 1. At concurrency 8, performance is down to 18.3%. Meanwhile, CPU load goes to 80-90% system CPU. I've confirmed via ktrace and rusage that the CPU usage is all system time, and that lstat() is the *only* system call in the test (realpath() is called with an absolute path).

I then reran the 32-process test on 1-7 cores, and found that performance peaks at 2 cores and drops sharply from there. Eight cores run *fifteen* times slower than two cores.

The full test results are at the bottom of this message. This is on 6.3-RELEASE-p4 with vfs.lookup_shared=1.
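For readers who want to reproduce something similar, here is a minimal sketch of what the "lstat" test loop described above might look like. It is not the original benchmark source, which was not posted; the function names `random_tree_path` and `lstat_loop`, the batch size of 1000, and the use of rand() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* The five subdirectory names used at every level of the tree. */
static const char SUBDIRS[] = "abcde";

/*
 * Build a random path `depth` levels below `root`, e.g. root/a/b/c/d/e
 * for depth 5. Returns the path length, or -1 if buf is too small.
 * (Hypothetical helper, not taken from the original benchmark.)
 */
static int
random_tree_path(char *buf, size_t buflen, const char *root, int depth)
{
	size_t n = strlen(root);

	assert(depth >= 0);
	/* Each level adds "/x" (2 bytes); +1 for the terminating NUL. */
	if (n + 2 * (size_t)depth + 1 > buflen)
		return (-1);
	memcpy(buf, root, n);
	for (int i = 0; i < depth; i++) {
		buf[n++] = '/';
		buf[n++] = SUBDIRS[rand() % 5];
	}
	buf[n] = '\0';
	return ((int)n);
}

/*
 * Call lstat() on random bottom-level paths for roughly `seconds` of
 * wall-clock time and return the number of calls made. The result of
 * lstat() is ignored; only the call count matters here.
 */
static unsigned long
lstat_loop(const char *root, int depth, int seconds)
{
	char path[1024];
	struct stat sb;
	unsigned long calls = 0;
	time_t end = time(NULL) + seconds;

	while (time(NULL) < end) {
		/* Batch the calls so that time() does not dominate the loop. */
		for (int i = 0; i < 1000; i++) {
			if (random_tree_path(path, sizeof(path), root, depth) < 0)
				return (calls);
			(void)lstat(path, &sb);
			calls++;
		}
	}
	return (calls);
}
```

A worker process would call something like `lstat_loop("/tmp/lstat", 5, 60)` and print the returned count; running N such workers at once (for example via fork()) gives concurrency N. The "realpath" variant would presumably replace the lstat() call with realpath(path, resolved) on the same random path.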
I believe this is the same issue that was previously discussed as "2 x quad-core system is slower that 2 x dual core on FreeBSD", archived here:

http://lists.freebsd.org/pipermail/freebsd-stable/2007-November/038441.html

In that post, Kris Kennaway wrote:

> It is hard to say for certain without a direct profile comparison of the
> workload, but it is probably due to lockmgr contention. lockmgr is used
> for various locking operations to do with VFS data structures. It is
> known to have poor performance and scale very badly.

At this point, what I've got is one of those synthetic benchmarks, but it matches our production problems exactly. The one difference is that the production processes need a whole lot more RAM, so when this manifests they eventually backlog and the server death spirals through swap, which is a most unfortunate difference.

I've chased my way up the kernel source to kern_lstat(), where a shared lock is obtained, and then on to namei, where vfs.lookup_shared comes into play. Unfortunately, I don't understand lockmgr, I don't know how the macros and flags I see here relate to it, and I can't figure out what happened to the changes that Attilio Rao was working on. There didn't seem to be much other hope at the time.

This is becoming a huge problem for us. Is there anything at all that can be done, or any news? In the case linked above, improvement was made by changing a PHP setting that isn't applicable in our case.
Thanks,

Jeff

Results by concurrency (percentages are relative to concurrency 1):

Concurrency  Test          Total       %   Total/Sec  Total/Sec/Worker
          1  realpath    1409069    100%       23484             23484
          1  lstat       6828763    100%      113812            113812
          2  realpath    1450489    100%       24174             12087
          2  lstat       6891417  100.9%      114856             57428
          4  realpath     448693   31.8%        7478              1869
          4  lstat       3047933   44.6%       50798             12699
          8  realpath     258281   18.3%        4304               538
          8  lstat       1688728   24.7%       28145              3518
         16  realpath     179150   12.7%        2985               186
         16  lstat        966558   14.1%       16109              1006
         32  realpath     116982    8.3%        1949                60
         32  lstat        644703    9.4%       10745               335
         64  realpath     112050    7.9%        1867                29
         64  lstat        572798    8.3%        9546               149
        128  realpath     111544    7.9%        1859                14
        128  lstat        570800    8.3%        9513                74
        256  realpath      96461    6.8%        1607                 6
        256  lstat        580679    8.5%        9677                37
        512  realpath      91224    6.4%        1520                 2
        512  lstat        498342    7.2%        8305                16

realpath at concurrency 32, by number of cores:

Cores      Total  Total/Sec  Total/Sec/Worker
    1    1289527      21492               671
    2    1753625      29227               913
    3    1197896      19964               623
    4     631293      10521               328
    5     227814       3796               118
    6     153550       2559                79
    7     136013       2266                70