Date: Thu, 10 Dec 2009 12:09:03 -0500 From: Linda Messerschmidt <linda.messerschmidt@gmail.com> To: freebsd-questions <freebsd-questions@freebsd.org> Subject: 8.0: OpenSSL stat()'s NLS 500+ times causing extreme system load Message-ID: <237c27100912100909i9c0eb95jd348eb594d86c744@mail.gmail.com>
next in thread | raw e-mail | index | archive | help
We have a Nagios server (ports/net-mgmt/nagios) that has a lot of check_nrpe2 (ports/net-mgmt/nrpe2) checks. We recently upgraded the server it runs on to 8.0-STABLE (r199975). The performance has never been great, but now it's really atrocious and I'm trying to figure out what's going on. The machine (a dual-core Nehalem) has a load average of 5 - 10 at all times, and "top" shows 100% CPU usage, 75% "system" CPU usage. No process has more than a few % CPU though. This is due to the large number of very short-lived processes doing individual Nagios checks that don't live long enough to appear in top. I investigated in some more detail with ktrace and found that each execution of check_nrpe2 performs 520 stat() calls. The bulk of them look like this: 81915 check_nrpe2 CALL stat(0x7fbfde28,0x7fbfddc4) 81915 check_nrpe2 NAMI "/usr/share/nls/C/libc.cat" 81915 check_nrpe2 RET stat -1 errno 2 No such file or directory 81915 check_nrpe2 CALL stat(0x7fbfde28,0x7fbfddc4) 81915 check_nrpe2 NAMI "/usr/share/nls/libc/C" 81915 check_nrpe2 RET stat -1 errno 2 No such file or directory 81915 check_nrpe2 CALL stat(0x7fbfde28,0x7fbfddc4) 81915 check_nrpe2 NAMI "/usr/local/share/nls/C/libc.cat" 81915 check_nrpe2 RET stat -1 errno 2 No such file or directory 81915 check_nrpe2 CALL stat(0x7fbfde28,0x7fbfddc4) 81915 check_nrpe2 NAMI "/usr/local/share/nls/libc/C" 81915 check_nrpe2 RET stat -1 errno 2 No such file or directory 81915 check_nrpe2 CALL stat(0x7fbfde28,0x7fbfddc4) kdump also shows 70 calls to getpid, which seems excessive. (About 50 of them appear to be in a tight loop.) The check_nrpe2 program simply opens an SSL socket to a remote server, sends a short request and gets a short response. It is a pretty simple program. (~22k of source) The calls to getpid() bother me a bit, but I think the NLS is the real problem: $ kdump -E -t n | fgrep /nls/ | head -1 81915 check_nrpe2 0.016815 NAMI "/usr/share/nls/C/libc.cat" $ kdump -E -t n | fgrep /nls/ | tail -1 81915 check_nrpe2 0.135663 NAMI "/usr/local/share/nls/libc/C" $ kdump -E | tail -1 81915 check_nrpe2 0.222510 CALL exit(0x1) $ kdump -E -t n | fgrep /nls/ | wc 508 2540 32004 So this program spends over half its life looping over 508 stat() calls looking for a nonexistent libc.cat file. And then another chunk (probably a lot smaller, but not measured) looping over getpid(). Both appear to be related to SSL; if I set up nrpe not to use it, both excesses go away and the program finishes in about half the time, using about half the CPU resources. To confirm that it was SSL-related, I tried: $ ktrace openssl s_client -connect x2:5666 And I got the exact same stat() & getpid() behavior. Obviously there is some small CPU overhead associated with SSL. This is not about that. This is about the system overhead induced by calling stat 500+ times on a directory that doesn't exist. This gets a little worse. Because there are several checks running at any given time, there is a lot of contention to VFS lookup this handful of paths. That's an area where FreeBSD has known SMP performance issues I've seen discussed elsewhere, and this is a pathological worst case. The net result, a dual core machine is brought to its knees by a relatively simple Nagios setup. Anyway, long story short, why is OpenSSL doing this and how can we make it stop? Thanks for any suggestions!
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?237c27100912100909i9c0eb95jd348eb594d86c744>