From: Rick Macklem
Date: Mon, 7 Dec 2009 16:28:52 -0500 (EST)
To: "Robert N. M. Watson"
Cc: pyunyh@gmail.com, dfr@FreeBSD.org, weldon@excelsusphoto.com,
    freebsd-current@FreeBSD.org, Eirik Øverby, Gavin Atkinson
Subject: Re: FreeBSD 8.0 - network stack crashes?
References: <20091129013026.GA1355@michelle.cdnetworks.com>
    <74BFE523-4BB3-4748-98BA-71FBD9829CD5@anduin.net>
    <34AD565D-814A-446A-B9CA-AC16DD762E1B@anduin.net>
List-Id: Discussions about the use of FreeBSD-current

On Mon, 30 Nov 2009, Robert N. M. Watson wrote:

>
> On 30 Nov 2009, at 05:36, Eirik Øverby wrote:
>
>> Short follow-up: Making OpenBSD use TCP mounts (it defaults to UDP)
>> seems to solve the issue.
>>
>> So this is a UDP-NFS-related problem, it would seem?
>
> Could well be. Let's try another debugging tactic -- there are two
> possible things going on here: resource leak, and resource exhaustion
> leading to deadlock. If you shut down to single user mode from
> multi-user, and let the system quiesce for a few minutes, then run
> netstat -m, what does it look like? Do vast numbers of mbufs+clusters
> get freed, or do they remain accounted for as allocated?
>
> (If they remain allocated, they were likely leaked, since most/all
> sockets will have been closed, releasing their resources on shutdown to
> single user when all processes are killed)
>
> The theory of an mbuf leak in NFS isn't an unlikely theory -- the socket
> code there continues to change, and rare edge cases frequently lead to
> leaks (per my earlier e-mail). Perhaps there's a case the OpenBSD client
> is triggering that other NFS clients normally don't. If we think that's
> the case, the next step is usually to narrow down what causes the leak
> to trigger a lot (i.e., the backup starting), and then grab a packet
> trace that we can analyze with wireshark. We'll want to look at the
> types of errors being returned for RPCs and, in particular, if there's
> one that happens about the same number of times as the resource has
> leaked over the same window, look at the code and see if that error case
> is handled properly.
>
> If this is definitely an NFS leak bug, we should get the NFS folks
> attention by sticking "NFS mbuf leak" in the subject line and CC'ing
> rmacklem/dfr. :-)
>
It's a bit of a shot in the dark, but could you please test the
following patch? It patches for a possible mbuf leak + a possible
M_SONAME leak (I have no idea if these ever occur in practice?). It also
fixes a case where the return value for svc_reply_dg() would have been
TRUE for failure.

It was all I could see from a quick look.

rick

--- rpc/svc_dg.c.sav	2009-12-07 15:37:45.000000000 -0500
+++ rpc/svc_dg.c	2009-12-07 15:48:50.000000000 -0500
@@ -221,6 +221,8 @@
 	xdrmbuf_create(&xdrs, mreq, XDR_DECODE);
 	if (! xdr_callmsg(&xdrs, msg)) {
 		XDR_DESTROY(&xdrs);
+		if (raddr != NULL)
+			free(raddr, M_SONAME);
 		return (FALSE);
 	}
 
@@ -259,11 +261,13 @@
 		m_fixhdr(mrep);
 		error = sosend(xprt->xp_socket, addr, NULL, mrep, NULL,
 		    0, curthread);
-		if (!error) {
-			stat = TRUE;
+		if (error) {
+			stat = FALSE;
 		}
 	} else {
 		m_freem(mrep);
+		if (m != NULL)
+			m_freem(m);
 	}
 
 	XDR_DESTROY(&xdrs);
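
For anyone following along without the kernel source to hand, here is a minimal userland sketch of the error-path pattern the patch is aiming at. It is not the svc_dg.c code: tracked_alloc(), tracked_free(), recv_request(), send_reply() and the live_allocs counter are all hypothetical stand-ins invented for illustration, and the "send" step is only modelled as something that takes ownership of both buffers. The counter plays roughly the role that netstat -m plays in the test suggested above: if it does not return to zero after every path, something leaked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocation counter; nonzero at quiesce time means a leak. */
int live_allocs = 0;

/* Hypothetical wrappers that keep the counter in step with malloc/free. */
void *tracked_alloc(size_t n)
{
	void *p = malloc(n);

	if (p != NULL)
		live_allocs++;
	return (p);
}

void tracked_free(void *p)
{
	if (p != NULL) {
		free(p);
		live_allocs--;
	}
}

/*
 * Receive side: the peer address is allocated before the request is
 * decoded, so a decode failure has to release it before returning.
 * decode_ok is an illustrative switch, not a real decoder.
 */
bool recv_request(bool decode_ok, void **addr_out)
{
	void *raddr = tracked_alloc(16);

	if (raddr == NULL)
		return (false);
	if (!decode_ok) {
		tracked_free(raddr);	/* without this, the early return leaks */
		return (false);
	}
	*addr_out = raddr;		/* caller now owns the address */
	return (true);
}

/*
 * Reply side: start from success and clear it on any failure, so an
 * error can never be reported as true. In this sketch an encode failure
 * also counts as failure (an assumption of the sketch).
 */
bool send_reply(bool encode_ok, int send_error)
{
	bool stat = true;
	void *mrep = tracked_alloc(32);	/* stand-in for the reply header */
	void *m = tracked_alloc(32);	/* stand-in for the reply body */

	if (mrep == NULL || m == NULL) {
		tracked_free(mrep);
		tracked_free(m);
		return (false);
	}
	if (encode_ok) {
		/* Modelled send: consumes both buffers, success or not. */
		tracked_free(mrep);
		tracked_free(m);
		if (send_error != 0)
			stat = false;
	} else {
		/* Nothing was sent, so both buffers must be freed here. */
		tracked_free(mrep);
		tracked_free(m);
		stat = false;
	}
	return (stat);
}
```

Calling recv_request(false, &addr) and then checking that live_allocs is still 0 is the userland analogue of watching whether the mbuf counts drop back after the system quiesces.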