From: Daniela <dgw@liwest.at>
To: Robert Watson
cc: stable@freebsd.org
Date: Tue, 10 Jun 2003 15:25:46 +0000
Subject: Re: Server overloaded? Or is it a bug?

On Thursday 05 June 2003 20:19, Robert Watson wrote:
> So this tells us that interrupt delivery appears to be working fine for
> your NIC, that the network stack isn't completely hosed and can allocate
> packet buffers (mbufs), so it isn't memory-starved at that level of the
> system.
>
> Sockets are used only for locally terminated connections, and come out
> of a separate memory pool from packet buffers (well, it's a little more
> complicated than that, but that's enough to get the picture). The
> reason I wondered about this was that one of the classes of possible
> memory starvation is to reach the allocation limit on sockets. We
> allocate the socket (and TCP state) a couple of packets into the TCP
> setup, so if the TCP setup got partway completed and then there was no
> further response, we'd have a possible explanation.
>
> Since the connection completes, it's probably safe to assume the TCP
> state and socket were fully allocated, and the socket was returned by
> the kernel to the application, or at least, the kernel got pretty much
> to the point of returning it to the application.
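The allocation limits mentioned above can at least be watched from
userland with sysctlbyname(3). Below is a minimal sketch of such a
check; the exact sysctl names (kern.ipc.maxsockets,
kern.ipc.numopensockets and kern.ipc.nmbclusters) are my assumption and
may not all exist on every FreeBSD version, so a missing one is simply
reported:

/*
 * Print a few kern.ipc limits/counters via sysctlbyname(3).
 * The sysctl names are assumptions; missing ones are reported,
 * not treated as fatal.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

static void
show(const char *name)
{
        int value;
        size_t len = sizeof(value);

        if (sysctlbyname(name, &value, &len, NULL, 0) == -1)
                printf("%-25s unavailable\n", name);
        else
                printf("%-25s %d\n", name, value);
}

int
main(void)
{
        show("kern.ipc.maxsockets");
        show("kern.ipc.numopensockets");
        show("kern.ipc.nmbclusters");
        return (0);
}

If the open-socket count ever creeps up to the maximum while the machine
is wedged, that would support the socket-limit theory.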
> Try using "slogin -v" or "ssh -v" on the client, and paste the results
> into an e-mail in response to this one. The SSH daemon does a lot of
> work to set up a new connection -- it forks a process or two, does name
> lookups, allocates pseudo-terminals, invokes PAM, and all kinds of
> other things. There are failure modes for each of these, and a bit
> more detail might let us track it down. Particularly useful might be
> the results of "slogin -v" both when the machine is operating normally,
> and when it's hosed. This will let us figure out about when during the
> process something failed, and what it might have been doing.
>
> > > If you can get partway through the banner but hang later, that
> > > might be the result of a file system deadlock of some sort.
> >
> > This is also possible, but what could have caused it? My file I/O is
> > not really heavy.
>
> Deadlock is a bit of a misnomer for what I have in mind. There are two
> classes of things that look like deadlocks: lock order problems, and
> lock leaks. ...
>
> So the VFS deadlock is somewhat of a shot in the dark, but it has
> pretty easy-to-identify symptoms, especially if you can get to a
> debugger. They're also fairly easy to analyze. ...
>
> I think we'll find that it's either a kernel problem, or an X problem
> triggering a kernel problem, so we're unlikely to find useful core
> dumps from applications. A system core might be useful, but hard to
> get without a serial console.
>
> Ok, so at the end of this all, here were my pieces of advice on
> debugging it, if you can reproduce it:
>
> (1) Compare "slogin -v" output in the before and after scenarios; that
> may tell us a lot about what's broken.
>
> (2) Despite the fact that you can't set up a serial console, set up a
> serial console. ...

Some strange things have happened over the last few days, all related to
processes:

(1) I have some zombies I cannot kill:

# ps ax
...
53410 pn Z 0:00.00 (kate)
...
# kill -9 53410
53410: No such process

The same thing happens with make.

(2) When I invoke the KDE System Guard, the process list won't show up.

(3) My processes receive a lot of signals (10 and 11, i.e. SIGBUS and
SIGSEGV on FreeBSD), about 30 times a day.

(4) Kate crashed when I tried to save a document, and then crashed every
time I opened it. So I ran it under gdb:

(gdb) run
Starting program: /usr/local/bin/kate
Deprecated bfd_read called at /usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 2627 in elfstab_build_psymtabs
Deprecated bfd_read called at /usr/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c line 933 in fill_symbuf
ERROR: Communication problem with kate, it probably crashed.
Program exited with code 0377.

As I never had problems like these before, I guess they are a side
effect of the crash. Is there any chance to debug this, or should I
rebuild my system? And, most important, could this be a new kernel bug?
If so, I would really like to debug it.

Daniela
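P.S. For comparison, here is what an ordinary zombie looks like (this is
just my own little test program, not something from the thread). The
child exits, the parent never waits, and ps(1) shows the child in state
Z. As far as I know, kill -9 on such a zombie simply succeeds and does
nothing, because the process is already dead; getting "No such process"
for a PID that ps(1) still lists is the really strange part:

/*
 * Create an ordinary zombie: the child exits immediately and the
 * parent delays reaping it, so "ps ax" shows the child in state Z.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        pid_t pid = fork();

        if (pid == -1)
                return (1);
        if (pid == 0)
                _exit(0);               /* child dies at once */

        printf("child %d should now show up as a zombie\n", (int)pid);
        sleep(60);                      /* run "ps ax" meanwhile */

        waitpid(pid, NULL, 0);          /* reaping removes the Z entry */
        return (0);
}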
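P.P.S. To see where the stray SIGBUS/SIGSEGV hits land, a handler like
the one below could be linked into a test program. This is only a
sketch: it assumes SA_SIGINFO and siginfo_t work on this FreeBSD
version, and fprintf(3) from a signal handler is not async-signal-safe,
but it is good enough for a quick look:

/*
 * Log the signal number and faulting address, then re-raise the
 * signal with the default action so a core dump is still produced.
 */
#include <signal.h>
#include <stdio.h>

static void
handler(int sig, siginfo_t *info, void *ctx)
{
        (void)ctx;
        fprintf(stderr, "caught signal %d at address %p\n",
            sig, info->si_addr);
        signal(sig, SIG_DFL);
        raise(sig);
}

int
main(void)
{
        struct sigaction sa;

        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int *)0 = 1;         /* fault on purpose as a demo */
        return (0);
}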