From owner-freebsd-current@FreeBSD.ORG Mon Nov 30 13:09:25 2009
From: "Robert N. M. Watson"
Date: Mon, 30 Nov 2009 08:09:16 -0500
To: Eirik Øverby
Cc: pyunyh@gmail.com, weldon@excelsusphoto.com, freebsd-current@freebsd.org, Gavin Atkinson
Subject: Re: FreeBSD 8.0 - network stack crashes?
List-Id: Discussions about the use of FreeBSD-current

On 30 Nov 2009, at 05:36, Eirik Øverby wrote:

> Short follow-up: Making OpenBSD use TCP mounts (it defaults to UDP) seems to solve the issue.
>
> So this is a UDP-NFS-related problem, it would seem?

Could well be. Let's try another debugging tactic -- there are two possible things going on here: resource leak, and resource exhaustion leading to deadlock.
If you shut down to single user mode from multi-user, and let the system quiesce for a few minutes, then run netstat -m, what does it look like? Do vast numbers of mbufs+clusters get freed, or do they remain accounted for as allocated?

(If they remain allocated, they were likely leaked, since most/all sockets will have been closed, releasing their resources on shutdown to single user when all processes are killed)

The theory of an mbuf leak in NFS isn't an unlikely theory -- the socket code there continues to change, and rare edge cases frequently lead to leaks (per my earlier e-mail). Perhaps there's a case the OpenBSD client is triggering that other NFS clients normally don't. If we think that's the case, the next step is usually to narrow down what causes the leak to trigger a lot (i.e., the backup starting), and then grab a packet trace that we can analyze with wireshark. We'll want to look at the types of errors being returned for RPCs and, in particular, if there's one that happens about the same number of times as the resource has leaked over the same window, look at the code and see if that error case is handled properly.

If this is definitely an NFS leak bug, we should get the NFS folks' attention by sticking "NFS mbuf leak" in the subject line and CC'ing rmacklem/dfr. :-)

Robert

> /Eirik
>
> On 30. nov. 2009, at 11.22, Eirik Øverby wrote:
>
>> Hi,
>>
>> I have something that might be more interesting than any counter ...
>> It seems to me as if the problem *only* manifests itself when an OpenBSD box is backing up to this FreeBSD 8.0-NFS-ZFS server. All other boxes are FreeBSD, and I have so far today been unable to reproduce the problem from any of those. As soon as I interrupted the backup running from OpenBSD, the mbuf cluster usage stabilized.
>>
>> How's that for a mystery in the morning?
>>
>> /Eirik
>>
>> On 29. nov. 2009, at 15.29, Robert Watson wrote:
>>
>>> On Sun, 29 Nov 2009, Eirik Øverby wrote:
>>>
>>>> I just did that (-rxcsum -txcsum -tso), but the numbers still keep rising. I'll wait and see if it goes down again, then reboot with those values to see how it behaves. But right away it doesn't look too good ..
>>>
>>> It would be interesting to know if any of the counters in the output of netstat -s grow linearly with the allocation count in netstat -m. Often times leaks are associated with edge cases in the stack (typically because if they are in common cases the bug is detected really quickly!) -- usually error handling, where in some error case the unwinding fails to free an mbuf that it should free. These are notoriously hard to track down, unfortunately, but the stats output (especially where delta alloc is linear to delta stat) may inform the situation some more.
>>>
>>> Robert N M Watson
>>> Computer Laboratory
>>> University of Cambridge
>>
>> _______________________________________________
>> freebsd-current@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-current
>> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
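
For anyone who wants to try the "delta alloc versus delta stat" comparison suggested in the thread, here is a rough sketch in Python. It is not a tool from the thread. It assumes you have saved two snapshots of netstat -s output as text and have read the growth in mbuf clusters off netstat -m yourself. The function names (parse_counters, rank_by_leak_ratio) and the sample numbers are made up for illustration. Counters are keyed by their position within each protocol section rather than by description, on the assumption that both snapshots list the same counters in the same order.

```python
import re

# Indented "<number> <description>" lines, as printed under each protocol.
COUNTER_LINE = re.compile(r"^\s+(\d+)\s+(.*\S)\s*$")
# Unindented "<protocol>:" header lines.
SECTION_LINE = re.compile(r"^(\S[^:]*):\s*$")


def parse_counters(text):
    """Return {(section, position): (value, description)} for one snapshot."""
    counters = {}
    section = ""
    position = 0
    for line in text.splitlines():
        header = SECTION_LINE.match(line)
        if header:
            section = header.group(1)
            position = 0
            continue
        match = COUNTER_LINE.match(line)
        if match:
            counters[(section, position)] = (int(match.group(1)), match.group(2))
            position += 1
    return counters


def rank_by_leak_ratio(before, after, leaked):
    """List counters that grew, closest growth to `leaked` first."""
    rows = []
    for key, (new_value, description) in after.items():
        if key not in before:
            continue
        delta = new_value - before[key][0]
        if delta <= 0:
            continue
        rows.append((abs(delta - leaked), key[0], description, delta))
    rows.sort()
    return [(section, description, delta) for _, section, description, delta in rows]


# Hypothetical snapshots with invented numbers, only to show the mechanics.
BEFORE = "udp:\n\t100 datagrams received\n\t3 with bad checksum\nip:\n\t500 total packets received\n"
AFTER = "udp:\n\t4100 datagrams received\n\t1003 with bad checksum\nip:\n\t9500 total packets received\n"

if __name__ == "__main__":
    # Suppose netstat -m showed 1000 more clusters allocated over the window.
    ranked = rank_by_leak_ratio(parse_counters(BEFORE), parse_counters(AFTER), 1000)
    for section, description, delta in ranked:
        print(f"{section}: +{delta} {description}")
```

A counter whose growth tracks the leaked count only points at code worth reading. It does not prove that the matching error path is the one leaking.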