From: Daniela <dgw@liwest.at>
To: Robert Watson
cc: stable@freebsd.org
Date: Tue, 10 Jun 2003 15:25:46 +0000
Subject: Re: Server overloaded? Or is it a bug?

On Thursday 05 June 2003 20:19, Robert Watson wrote:
> So this tells us that interrupt delivery appears to be working fine for
> your NIC, that the network stack isn't completely hosed and can allocate
> packet buffers (mbufs), so it isn't memory-starved at that level of the
> system.
>
> Sockets are used only for locally terminated connections, and come out
> of a separate memory pool from packet buffers (well, it's a little more
> complicated than that, but that's enough to get the picture). The
> reason I wondered about this was that one of the classes of possible
> memory starvation is to reach the allocation limit on sockets. We
> allocate the socket (and TCP state) a couple of packets into the TCP
> setup, so if the TCP setup got partway completed and then there was no
> further response, we'd have a possible explanation.
>
> Since the connection completes, it's probably safe to assume the TCP
> state and socket were fully allocated, and the socket was returned by
> the kernel to the application, or at least, the kernel got pretty much
> to the point of returning it to the application.
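The allocation limits mentioned above can at least be watched from
userland with sysctlbyname(3). Below is a minimal sketch of such a
check; the exact sysctl names (kern.ipc.maxsockets,
kern.ipc.numopensockets and kern.ipc.nmbclusters) are my assumption and
may not all exist on every FreeBSD version, so a missing one is simply
reported:

/*
 * Print a few kern.ipc limits/counters via sysctlbyname(3).
 * The sysctl names are assumptions; missing ones are reported,
 * not treated as fatal.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static void
show(const char *name)
{
        int value;
        size_t len = sizeof(value);

        if (sysctlbyname(name, &value, &len, NULL, 0) == -1)
                printf("%-25s unavailable\n", name);
        else
                printf("%-25s %d\n", name, value);
}

int
main(void)
{
        show("kern.ipc.maxsockets");
        show("kern.ipc.numopensockets");
        show("kern.ipc.nmbclusters");
        return (0);
}

If the open-socket count ever creeps up to the maximum while the machine
is wedged, that would support the socket-limit theory.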
> Try using "slogin -v" or "ssh -v" on the client, and paste the results
> into an e-mail in response to this one. The SSH daemon does a lot of
> work to set up a new connection -- it forks a process or two, does name
> lookups, allocates pseudo-terminals, invokes PAM, and all kinds of
> other things. There are failure modes for each of these, and a bit
> more detail might let us track it down. Particularly useful might be
> the results of "slogin -v" both when the machine is operating normally,
> and when it's hosed. This will let us figure out about when during the
> process something failed, and what it might have been doing.
>
> > > If you can get partway through the banner but hang later, that
> > > might be the result of a file system deadlock of some sort.
> >
> > This is also possible, but what could have caused it? My file I/O is
> > not really heavy.
>
> Deadlock is a bit of a misnomer for what I have in mind. There are two
> classes of things that look like deadlocks: lock order problems, and
> lock leaks. ...
>
> So the VFS deadlock is somewhat of a shot in the dark, but it has
> pretty easy-to-identify symptoms, especially if you can get to a
> debugger. They're also fairly easy to analyze. ...
>
> I think we'll find that it's either a kernel problem, or an X problem
> triggering a kernel problem, so we're unlikely to find useful core
> dumps from applications. A system core might be useful, but hard to
> get without a serial console.
>
> Ok, so at the end of this all, here were my pieces of advice on
> debugging it, if you can reproduce it:
>
> (1) Compare "slogin -v" output in the before and after scenarios; that
> may tell us a lot about what's broken.
>
> (2) Despite the fact that you can't set up a serial console, set up a
> serial console. ...

Some strange things have happened over the last few days, all related to
processes:

(1) I have some zombies I cannot kill:

# ps ax
...
53410 pn Z 0:00.00 (kate)
...
# kill -9 53410
53410: No such process

The same thing happens with make.

(2) When I invoke the KDE System Guard, the process list won't show up.

(3) My processes receive a lot of signals (10 and 11, i.e. SIGBUS and
SIGSEGV on FreeBSD), about 30 times a day.

(4) Kate crashed when I tried to save a document, and then crashed every
time I opened it. So I ran it under gdb:

(gdb) run
Starting program: /usr/local/bin/kate
Deprecated bfd_read called at /usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 2627 in elfstab_build_psymtabs
Deprecated bfd_read called at /usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 933 in fill_symbuf
ERROR: Communication problem with kate, it probably crashed.
Program exited with code 0377.

As I never had problems like these before, I guess they are a side
effect of the crash. Is there any chance to debug this, or should I
rebuild my system? And, most important, could this be a new kernel bug?
If so, I would really like to debug it.

Daniela
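P.S. For comparison, here is what an ordinary zombie looks like (this is
just my own little test program, not something from the thread). The
child exits, the parent never waits, and ps(1) shows the child in state
Z. As far as I know, kill -9 on such a zombie simply succeeds and does
nothing, because the process is already dead; getting "No such process"
for a PID that ps(1) still lists is the really strange part:

/*
 * Create an ordinary zombie: the child exits immediately and the
 * parent delays reaping it, so "ps ax" shows the child in state Z.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        pid_t pid = fork();

        if (pid == -1)
                return (1);
        if (pid == 0)
                _exit(0);               /* child dies at once */

        printf("child %d should now show up as a zombie\n", (int)pid);
        sleep(60);                      /* run "ps ax" meanwhile */

        waitpid(pid, NULL, 0);          /* reaping removes the Z entry */
        return (0);
}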
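P.P.S. To see where the stray SIGBUS/SIGSEGV hits land, a handler like
the one below could be linked into a test program. This is only a
sketch: it assumes SA_SIGINFO and siginfo_t work on this FreeBSD
version, and fprintf(3) from a signal handler is not async-signal-safe,
but it is good enough for a quick look:

/*
 * Log the signal number and faulting address, then re-raise the
 * signal with the default action so a core dump is still produced.
 */
#include <signal.h>
#include <stdio.h>

static void
handler(int sig, siginfo_t *info, void *ctx)
{
        (void)ctx;
        fprintf(stderr, "caught signal %d at address %p\n",
            sig, info->si_addr);
        signal(sig, SIG_DFL);
        raise(sig);
}

int
main(void)
{
        struct sigaction sa;

        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int *)0 = 1;         /* fault on purpose as a demo */
        return (0);
}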