From owner-freebsd-current  Wed Sep 29 12:39:49 1999
Delivered-To: freebsd-current@freebsd.org
Received: from dt014n8c.san.rr.com (dt011n66.san.rr.com [204.210.13.102])
	by hub.freebsd.org (Postfix) with ESMTP id CF314159C8
	for <freebsd-current@freebsd.org>; Wed, 29 Sep 1999 12:29:00 -0700 (PDT)
	(envelope-from Doug@gorean.org)
Received: from localhost (doug@localhost)
	by dt014n8c.san.rr.com (8.9.3/8.8.8) with ESMTP id MAA32467
	for <freebsd-current@freebsd.org>; Wed, 29 Sep 1999 12:28:59 -0700 (PDT)
	(envelope-from Doug@gorean.org)
Date: Wed, 29 Sep 1999 12:28:59 -0700 (PDT)
From: Doug <Doug@gorean.org>
X-Sender: doug@dt014n8c.san.rr.com
To: freebsd-current@freebsd.org
Subject: Weird sockname errors with -current and apache
Message-ID: <Pine.BSF.4.10.9909291220320.32400-100000@dt014n8c.san.rr.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Greetings,

	I'm using -current on some web server/CGI processing machines. Yes
I know all about using -current on production stuff, but we need the NFS,
et al fixes due to the heavy NFS client activity on these systems, and
I'm willing to take the good with the bad. I cvsup'ed and built world and
kernel on or about 8/26 and these boxes ran fine for about 26 days. On
9/22 (Wednesday) I cvsup'ed and built world and kernel on one machine in
order to take advantage of Matt's latest round of NFS, etc. fixes. That
box ran well for two days so I updated the rest of them on Friday (9/24)
and took off for a happy weekend. Well, you know what happened: one box
locked up on Saturday, I came in and rebooted it, then the other 4 boxes
locked up on Sunday. *sigh*

	The really annoying thing here is that there isn't ONE clear problem
that I can point to. Also, when the boxes die they wedge solid. No
console, serial or otherwise, and no DDB so I can't find out exactly
what they are doing when they die. I have the DDB_UNATTENDED option in
the kernel because I have the boxes set up to recover themselves on boot
and go back into service (before the 26-day uptime, panics were
common). I'm starting to think I should disable that; however, as far as
I can see they aren't panic'ing, they are just freezing up, although
they are still ping'able. We started out this project with Apache 1.3.6,
and on Sept. 7 we moved to 1.3.9. These are dual PIII 500 machines with
a half gig of RAM each.

	The other annoying thing is that while I was checking the kernel, etc.
logs for signs of problems, it hadn't occurred to me to check the apache
error log. Once I did I noticed that at least some of the symptoms I'm
seeing go back as far as I have logs, even before the blessed 26 day
uptime period. Here is what I've seen.

The first error I can find in any of the logs I have that seems related
to the problem is this from apache's error log:

[Fri Aug 20 10:59:34 1999] [error] (22)Invalid argument: getsockname

Subsequently I've noticed that we get this error a LOT, usually
coinciding with a period of time where the machine is wedged, after
which it sometimes comes back, and sometimes doesn't (i.e., it stays
wedged). When this happens it usually repeats about 15-20 times,
followed by:

Virtual memory exceeded in `new' 

then a NUL character (^@) in the apache log. Those errors are usually
accompanied by a slew of "Premature end of script headers" messages,
apparently caused by the CGI process that these web servers run dying
before it finishes writing out its data.
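
	For anyone who wants to check their own logs for the same pattern,
here is a minimal sketch in Python of how the getsockname bursts could
be tallied per minute. It assumes error-log lines in the bracketed
format shown above and an English/C locale for the day and month names;
the function name getsockname_bursts is made up for illustration and is
not part of apache or any other tool.

```python
import re
from collections import Counter
from datetime import datetime

# Matches error-log lines of the form quoted in this message, e.g.
# [Fri Aug 20 10:59:34 1999] [error] (22)Invalid argument: getsockname
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] (?P<msg>.*)$")


def getsockname_bursts(lines):
    """Count getsockname errors per minute (hypothetical helper).

    Lines that do not match the bracketed format, such as the bare
    "Virtual memory exceeded" message, are skipped.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or "getsockname" not in m.group("msg"):
            continue
        # Assumes English day/month abbreviations (C locale).
        ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
        counts[ts.replace(second=0)] += 1
    return counts
```

Feeding it the lines of an error log and looking at the minutes with the
highest counts should show whether the bursts line up with the times a
box was wedged.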

	We also have a slew of these errors in the apache logs at various
times (doesn't *seem* to be a correlation with the others, but I'm not
sure) that look like:

[Mon Sep 13 12:51:03 1999] [warn] child process 82600 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83437 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 84136 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83698 still did not
exit, sending a SIGTERM
[Mon Sep 13 12:51:03 1999] [warn] child process 83703 still did not
exit, sending a SIGTERM

Sometimes these happen at the same time as the errors above, sometimes
they don't. When this one happens we get about 40 of them in a row.

	In the system logs the only unusual thing I've seen (and I enable a LOT
of logging) are these messages, which started over this past weekend. 

/kernel: calcru: negative time of 4347162 usec for pid 6806 (httpd) 

Once again, when these come they come in bunches, sometimes with a
positive time value like this one, sometimes with a negative one. I'm
used to seeing calcru messages related to the kernel misjudging the
speed of the processor, but the recently added code that tells you the
speed on SMP systems says that I have CPU: Pentium III (498.75-MHz
686-class CPU), which looks right to me.
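
	To see which processes these messages cluster around, a minimal
sketch in Python along the following lines could group them by pid. It
assumes the message format is exactly as quoted above; the function
name calcru_by_process is hypothetical.

```python
import re
from collections import defaultdict

# Matches kernel messages of the form quoted in this message, e.g.
# /kernel: calcru: negative time of 4347162 usec for pid 6806 (httpd)
# The optional minus sign covers the negative values described above.
CALCRU_RE = re.compile(
    r"calcru: negative time of (?P<usec>-?\d+) usec "
    r"for pid (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)


def calcru_by_process(lines):
    """Group reported usec values by (pid, command) (hypothetical helper)."""
    seen = defaultdict(list)
    for line in lines:
        m = CALCRU_RE.search(line)
        if m:
            key = (int(m.group("pid")), m.group("comm"))
            seen[key].append(int(m.group("usec")))
    return dict(seen)
```

Run over the system log, this gives one list of reported times per
process, which makes it easier to tell whether only httpd is affected.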

	Now, as if the above were not annoying enough, all of these
problems could very well be related to the third party CGI processing
engine (a program called Miva), in which we have tracked down some bugs
before. Of course the machines freezing up is my main concern at this
point, but the errors themselves could be coming from Miva.

	Any suggestions on how to debug this problem further would be
greatly appreciated. I'm going to start up some boxes today that don't
have the DDB_UNATTENDED option enabled to see if they will in fact panic
and drop to the debugger. Beyond that, I'm at a bit of a loss here. 

TIA,

Doug

