From owner-freebsd-current Wed Sep 29 12:39:49 1999
Delivered-To: freebsd-current@freebsd.org
Received: from dt014n8c.san.rr.com (dt011n66.san.rr.com [204.210.13.102])
	by hub.freebsd.org (Postfix) with ESMTP id CF314159C8
	for ; Wed, 29 Sep 1999 12:29:00 -0700 (PDT)
	(envelope-from Doug@gorean.org)
Received: from localhost (doug@localhost)
	by dt014n8c.san.rr.com (8.9.3/8.8.8) with ESMTP id MAA32467
	for ; Wed, 29 Sep 1999 12:28:59 -0700 (PDT)
	(envelope-from Doug@gorean.org)
Date: Wed, 29 Sep 1999 12:28:59 -0700 (PDT)
From: Doug
X-Sender: doug@dt014n8c.san.rr.com
To: freebsd-current@freebsd.org
Subject: Weird sockname errors with -current and apache
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Greetings,

I'm using -current on some web server/CGI processing machines. Yes, I know all about using -current on production stuff, but we need the NFS, et al. fixes due to the heavy NFS client activity on these systems, and I'm willing to take the good with the bad.

I cvsup'ed and built world and kernel on or about 8/26 and these boxes ran fine for about 26 days. On 9/22 (Wednesday) I cvsup'ed and built world and kernel on one machine in order to take advantage of Matt's latest round of NFS, etc. fixes. That box ran well for two days, so I updated the rest of them on Friday (9/24) and took off for a happy weekend. Well, you know what happened: one box locked up on Saturday, I came in and rebooted it, then the other 4 boxes locked up on Sunday. *sigh*

The really annoying thing here is that there isn't ONE clear problem that I can point to. Also, when the boxes die they wedge solid. No console, serial or otherwise, and no DDB, so I can't find out exactly what they are doing when they die. I have the DDB_UNATTENDED option in the kernel because I have the boxes set up to recover themselves on boot and go back into service (previous to the 26 day uptime, panics were common).
I'm starting to think I should disable that; however, as far as I can see they aren't panic'ing, they are just freezing up, although they are ping'able.

We started out this project with Apache 1.3.6, and on Sept. 7 we moved to 1.3.9. These are dual PIII 500 machines with a half gig of RAM each.

The other annoying thing is that while I was checking the kernel, etc. logs for signs of problems, it hadn't occurred to me to check the apache error log. Once I did, I noticed that at least some of the symptoms I'm seeing go back as far as I have logs, even before the blessed 26 day uptime period.

Here is what I've seen. The first error I can find in any of the logs I have that seems related to the problem is this from apache's error log:

[Fri Aug 20 10:59:34 1999] [error] (22)Invalid argument: getsockname

Subsequently I've noticed that we get this error a LOT, usually coinciding with a period of time where the machine is wedged, after which it sometimes comes back, and sometimes doesn't (i.e., it stays wedged). When this happens it usually repeats about 15-20 times, followed by:

Virtual memory exceeded in `new'

then a NULL character (^@) in the apache log. Those errors are usually accompanied by a slew of "Premature end of script headers" messages, apparently related to the CGI processes that these web servers run dying off before they finish writing out their data.
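
To check whether the getsockname bursts and the "Premature end of script headers" messages really cluster in the same time windows, the error log could be tallied per minute. The following is only a minimal sketch in Python, not something taken from the setup described here: the function name tally_by_minute and the PATTERNS table are hypothetical, and it assumes every relevant entry carries the "[timestamp] [level] message" prefix shown in the line quoted above.

```python
import re
from collections import Counter
from datetime import datetime

# Error-log prefix as seen in the quoted entry, e.g.
# "[Fri Aug 20 10:59:34 1999] [error] (22)Invalid argument: getsockname"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<msg>.*)$")

# Substrings taken from the messages quoted in this post.
# The keys are illustrative labels only.
PATTERNS = {
    "getsockname": "getsockname",
    "premature_end": "Premature end of script headers",
}


def tally_by_minute(lines):
    """Count matching messages per (minute, label) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            # Lines without a timestamp prefix (such as the bare
            # "Virtual memory exceeded" line) are skipped.
            continue
        ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        minute = ts.replace(second=0)
        for label, needle in PATTERNS.items():
            if needle in m.group("msg"):
                counts[(minute, label)] += 1
    return counts
```

Feeding it the lines of the error log (for example from open() on the log file) and printing sorted(counts.items()) would show at a glance which minutes contain both kinds of message.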

We also have a slew of these errors in the apache logs at various times (there doesn't *seem* to be a correlation with the others, but I'm not sure) that look like:

[Mon Sep 13 12:51:03 1999] [warn] child process 82600 still did not exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83437 still did not exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 84136 still did not exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83698 still did not exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83703 still did not exit, sending a SIGTERM

Sometimes these happen at the same time as the other errors, sometimes they don't. When this one happens we get about 40 of them in a row.

In the system logs the only unusual thing I've seen (and I enable a LOT of logging) are these messages, which started over this past weekend:

/kernel: calcru: negative time of 4347162 usec for pid 6806 (httpd)

Once again, when these come they come in bunches, sometimes with a positive time value like this one, sometimes with a negative one. I'm used to seeing calcru messages related to the kernel misjudging the speed of the processor, but the recently added code that tells you the speed on SMP systems says that I have

CPU: Pentium III (498.75-MHz 686-class CPU)

which looks right to me.

Now, as if the above were not annoying enough, all of these problems could very well be related to the third party CGI processing engine (a program called Miva) which we have tracked down some bugs in before. Of course the machines freezing up is my main concern at this point, but the errors themselves could be coming from Miva.

Any suggestions on how to debug this problem further would be greatly appreciated. I'm going to start up some boxes today that don't have the DDB_UNATTENDED option enabled to see if they will in fact panic and drop to the debugger. Beyond that, I'm at a bit of a loss here.
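
For the calcru messages quoted above, a similar tally could show which processes they name and how often the reported value is actually negative. This is a rough sketch under stated assumptions: summarize_calcru is a hypothetical helper, and the regular expression is built only from the single message format quoted in this post, so other variants of the message would be missed.

```python
import re
from collections import defaultdict

# Built from the one message quoted in this post:
# "/kernel: calcru: negative time of 4347162 usec for pid 6806 (httpd)"
# An optional minus sign is allowed because some values are reported
# as negative.
CALCRU_RE = re.compile(
    r"calcru: negative time of (?P<usec>-?\d+) usec "
    r"for pid (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)


def summarize_calcru(lines):
    """Group calcru messages by process name.

    Returns a dict mapping process name to a dict with the message
    count, the set of pids seen, and how many values were negative.
    """
    summary = defaultdict(lambda: {"count": 0, "pids": set(), "negative": 0})
    for line in lines:
        m = CALCRU_RE.search(line)
        if not m:
            continue  # not a calcru message in the assumed format
        entry = summary[m.group("comm")]
        entry["count"] += 1
        entry["pids"].add(int(m.group("pid")))
        if int(m.group("usec")) < 0:
            entry["negative"] += 1
    return dict(summary)
```

If every entry in the result names httpd, that would at least suggest the messages track the web server processes rather than the system as a whole.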

TIA,

Doug