Date:      Mon, 9 Jun 2003 01:21:04 +0000
From:      Daniela <dgw@liwest.at>
To:        Robert Watson <rwatson@freebsd.org>
Cc:        stable@freebsd.org
Subject:   Re: Server overloaded? Or is it a bug?
Message-ID:  <200306090121.04733.dgw@liwest.at>
In-Reply-To: <Pine.NEB.3.96L.1030605154904.54608C-100000@fledge.watson.org>
References:  <Pine.NEB.3.96L.1030605154904.54608C-100000@fledge.watson.org>

On Thursday 05 June 2003 20:19, Robert Watson wrote:
> Sockets are used only for locally terminated connections, and come out of
> a separate memory pool from packet buffers (well, it's a little more
> complicated than that, but that's enough to get the picture).  The reason
> I wondered about this was that one of the classes of possible memory
> starvation is to reach the allocation limit on sockets.  We allocate the
> socket (and TCP state) a couple of packets into the TCP setup, so if the
> TCP setup got partway completed and then there was no further response,
> we'd have a possible explanation.
>
> Since the connection completes, it's probably safe to assume the TCP state
> and socket were fully allocated, and the socket was returned by the kernel
> to the application, or at least, the kernel got pretty much to the point
> of returning it to the application.

I'm almost sure that the socket was returned. It hung right after I pressed
Enter at the SSH password prompt. If I understand correctly, the connection
must already be established to get to this point.

> Try using "slogin -v" or "ssh -v" on the client, and paste the results
> into an e-mail in response to this one.  The SSH daemon does a lot of work
> to set up a new connection -- it forks a process or two, does name
> lookups, allocates pseudo-terminals, invokes PAM, and all kinds of other
> things.  There are failure modes for each of these, and a bit more detail
> might let us track it down.  Particularly useful might be the results of
> "slogin -v" both when the machine is operating normally, and when it's
> hosed.  This will let us figure out about when during the process
> something failed, and what it might have been doing.

I couldn't try ssh -v. I was on a Windoze machine where I only had an awful
graphical SSH client.
I guess it hung when it tried to fork or read the password file.
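
If both logs can be captured later (one "ssh -v" run while the machine is
healthy, one while it is hosed), a rough way to see where the hung session
stops is to find the first line at which the two logs differ. This is only a
sketch: the function name and the sample lines are made up for illustration
and are not real ssh output, and real logs would first need PIDs, ports and
similar run-specific values stripped out before comparing.

```python
from itertools import zip_longest


def first_divergence(normal_lines, hung_lines):
    """Return (index, normal_line, hung_line) for the first differing line.

    A missing line (one log is shorter) shows up as None.
    Returns None if both logs are identical.
    """
    for i, (a, b) in enumerate(zip_longest(normal_lines, hung_lines)):
        if a != b:
            return i, a, b
    return None


# Placeholder lines standing in for saved "ssh -v" output (hypothetical).
normal = [
    "step: connect",
    "step: key exchange",
    "step: password accepted",
    "step: shell started",
]
hung = [
    "step: connect",
    "step: key exchange",
]

result = first_divergence(normal, hung)
print(result)
```

The first line present in the normal log but missing from the hung one is
roughly the step the server never completed.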

> > >     If you can get partway through the banner but hang later, that
> > > might be the result of a file system deadlock of some sort.
> >
> > This is also possible, but what could have caused it? My file I/O is not
> > really heavy.
>
> Deadlock is a bit of a misnomer for what I have in mind.  There are two
> classes of things that look like deadlocks: lock order problems, and lock
> leaks.
>
> Lock order problems are real deadlocks, where you grab locks in the wrong
> order -- they tend to occur under high load, since race windows open up
> improving the chances of a problem, as well as increasing the probability
> of it occurring due to a high number of operations.  Common activities that
> increase the chance of a lock order reversal in FreeBSD's VFS include
> simultaneous use of chroot(), quotas, and vnode-backed vn/md devices.
> Quotas and vnodes both violate the lock order (although in ways that
> hardly ever manifest in practice), and chroot() tends to create less common
> lock acquisition orders for applications when running in kernel.  Nullfs is
> also a common cause of problems.  I think most of these are unlikely to be
> the problem in your environment, especially given that you don't have a
> massively high load with tens of thousands of simultaneous processes all
> installing world in chroot()'s on vn-backed file systems with quotas.

I'm not using any of these.
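
For readers following the thread, the lock order problem described above can
be illustrated in user space. The sketch below is not FreeBSD's VFS code; it
uses Python threads and hypothetical names, and it simply assumes that sorting
by id() is an acceptable global order. Two workers ask for the same pair of
locks in opposite orders, which is the classic recipe for a deadlock, but
because every acquisition goes through one fixed order, both finish.

```python
import threading


def acquire_in_order(*locks):
    """Acquire all locks in one fixed global order (here: by id())."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    """Release locks in the reverse of the order they were acquired."""
    for lock in reversed(ordered):
        lock.release()


lock_a = threading.Lock()
lock_b = threading.Lock()
completed = []


def worker(name, first, second, rounds=1000):
    # The caller's preferred order is ignored; the global order wins.
    for _ in range(rounds):
        held = acquire_in_order(first, second)
        release_all(held)
    completed.append(name)


t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join(timeout=10)
t2.join(timeout=10)
print(sorted(completed))
```

If each worker instead took its locks in the order it was given, the two
threads could each hold one lock while waiting forever for the other.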

> The second class of problems relates to lock leaks, which occur in unusual
> failure modes.  The implementation neglects to release a lock under some
> scenario, and the result is that no other process can ever acquire the
> lock.  These are relatively rare, but once in a while we bump into one,
> and it's a bit of a pain to debug.  The symptoms are very similar to a
> deadlock, since gradually processes stack up trying to acquire the lock
> while holding other locks, and typically this results in a "race to root",
> in which sets of processes hold pairs of locks down the file hierarchy,
> and eventually the root vnode lock can't be grabbed, so all processes
> doing name lookups from the root hang.  (Ouch).  NFS can also trigger
> races to roots: if an NFS server hangs, NFS client processes may be
> holding a vnode lock when the NFS server ceases to respond.  If processes
> hold multiple locks at a time (such as during lookup), this can also
> result in a race to the root.  There are some changes to -CURRENT
> submitted by Jeff Roberson, which greatly reduce the chances of this
> happening.  Since you're not using NFS, I believe, it's unlikely to relate
> to this.

I have an NFS server (at least I'm trying to set one up).
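
The lock leak case can be sketched the same way. Again this is an
illustration with made-up names, not kernel code: one code path returns on an
error without releasing the lock, so the next caller can never acquire it. A
timeout is used here only so that the demonstration does not hang.

```python
import threading

# Stands in for a single vnode lock (hypothetical name).
vnode_lock = threading.Lock()


def leaky_lookup(fail):
    vnode_lock.acquire()
    if fail:
        return "error"  # bug: the error path forgets to release the lock
    vnode_lock.release()
    return "ok"


def safe_lookup(fail):
    # The "with" block releases the lock on every exit path.
    with vnode_lock:
        if fail:
            return "error"
        return "ok"


leaky_lookup(fail=True)

# The lock was leaked, so a later caller cannot get it.
got_lock_after_leak = vnode_lock.acquire(timeout=0.1)
print(got_lock_after_leak)

# Clean up the leaked lock so the rest of the demonstration can continue.
vnode_lock.release()

safe_lookup(fail=True)
got_lock_after_safe = vnode_lock.acquire(timeout=0.1)
print(got_lock_after_safe)
vnode_lock.release()
```

In a real system nothing comes along to clean up the leaked lock, so waiters
pile up behind it, which matches the "race to root" symptom described above.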

> Hmm.  That sucks; a serial console is one of the single most useful
> debugging tools available, since it allows you to track the state of the
> system while the GUI is running.  Are you sure you can't? :-)  It can be
> an old IBM XT with a null modem cable...

I really have nothing I could use to set up a serial console.

> > I already have debug symbols everywhere. I have already rebooted, and I'm
> > now looking for application core dumps (however, I don't think an
> > application crashed). Maybe I can reproduce it, I still know everything
> > I did.
>
> I think we'll find that it's either a kernel problem, or an X problem
> triggering a kernel problem, so we're unlikely to find useful core dumps
> from applications.  A system core might be useful, but hard to get without
> a serial console.

If the kernel had panicked, I should have gotten a core dump, so we know it
did not panic (maybe this information helps).

Could this possibly be a DoS attack? I already had one, and the symptoms were
similar. But this time I had almost no Internet traffic (or the attacker had
already stopped by the time I looked).

> Ok, so at the end of this all, here were my pieces of advice on debugging
> it, if you can reproduce it:
>
> (1) Compare "slogin -v" to the system in the before and after scenarios,
>     that may tell us a lot about what's broken.
>
> (2) Despite the fact that you can't set up a serial console, set up a
>     serial console.
>
> :-)
>
> Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> robert@fledge.watson.org      Network Associates Laboratories


