From owner-freebsd-net@FreeBSD.ORG  Wed Aug 13 14:58:37 2003
Date: Wed, 13 Aug 2003 16:57:28 -0500 (CDT)
From: Mike Silbersack <silby@silby.com>
To: Ed Maste <emaste@sandvine.com>
In-Reply-To: <FE045D4D9F7AED4CBFF1B3B813C8533701BD3CA0@mail.sandvine.com>
Message-ID: <20030813164935.M29363@odysseus.silby.com>
References: <FE045D4D9F7AED4CBFF1B3B813C8533701BD3CA0@mail.sandvine.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: "'freebsd-net@freebsd.org'" <freebsd-net@freebsd.org>
cc: John Baldwin <jhb@FreeBSD.org>
Subject: RE: TCP socket shutdown race condition


On Wed, 13 Aug 2003, Ed Maste wrote:

> I think I've found the problem.
>
> crfree() is called from a lot of places (I counted at least 20) including
> sodealloc() in the socket code, crcopy() etc.  It's called at splnet() from
> sodealloc().   I'm not sure what spl (if any) it might be called at from
> elsewhere, but certainly not splnet().
>
> I'm going to investigate the correct solution for this and supply a
> PR / patch, but for now let me know if more information is desired.
>
> -ed

Hm, sounds like you've done some solid debugging, and this should be easy
to fix.  However, perhaps we need to think about this a little longer
before we just switch to atomic operations or add an spl call within the
cr functions...
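
For concreteness, the "atomic operations" option would amount to something
like the sketch below.  This is userland C11, not 4.x kernel code: struct
cred, cred_alloc(), cred_hold(), and cred_free() are made-up stand-ins for
ucred, crget(), crhold(), and crfree(), and <stdatomic.h> stands in for
whatever atomic primitives the kernel would actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for struct ucred.  Only the reference count
 * matters for this sketch; the real structure has many more fields.
 */
struct cred {
	atomic_uint ref;
};

/* Allocate a credential holding one reference (assumed analogue of crget()). */
struct cred *
cred_alloc(void)
{
	struct cred *cr = malloc(sizeof(*cr));

	if (cr != NULL)
		atomic_init(&cr->ref, 1);
	return (cr);
}

/* Take an additional reference (assumed analogue of crhold()). */
void
cred_hold(struct cred *cr)
{
	atomic_fetch_add(&cr->ref, 1);
}

/*
 * Drop one reference (assumed analogue of crfree()).  The decrement and
 * the "was that the last one?" test are a single atomic step, so two
 * racing callers cannot both see the count reach zero, and neither can
 * miss it.  Returns true if this call freed the structure.
 */
bool
cred_free(struct cred *cr)
{
	unsigned int old = atomic_fetch_sub(&cr->ref, 1);

	assert(old > 0);	/* catch a free of an already-free credential */
	if (old == 1) {
		free(cr);
		return (true);
	}
	return (false);
}
```

That would close the window between the decrement and the test, but it
only protects the count itself, which is part of why I'd like to
understand the locking model before settling on it.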

As I understand it, 4.x uses just a single lock around anything entering
the kernel, which should prevent this type of problem.  However, something
more subtle may actually be going on.  What I'm thinking is that perhaps
there is a single entry point that calls the cr* functions and needs
broader locking, and the cr functions are simply where the problem shows
up.

John, can you give us a quick overview of how 4.x SMP works so that we can
determine the correct solution here?  My main question is this:  If CPU 1
is chugging along at a low spl and an interrupt comes in to CPU 2, can
CPU 2 wrest control away from CPU 1, and/or run the interrupt handler
concurrently?

Thanks,

Mike "Silby" Silbersack