From owner-freebsd-smp  Tue May 14  3:17:53 2002
Delivered-To: freebsd-smp@freebsd.org
Received: from rina.r.dl.itc.u-tokyo.ac.jp (cvsup2.r.dl.itc.u-tokyo.ac.jp [133.11.199.247])
	by hub.freebsd.org (Postfix) with ESMTP
	id 372B337B403; Tue, 14 May 2002 03:17:44 -0700 (PDT)
Received: from rina.r.dl.itc.u-tokyo.ac.jp (localhost [127.0.0.1])
	by rina.r.dl.itc.u-tokyo.ac.jp (8.12.3+3.5Wbeta/3.7W-rina.r-Nankai-Koya) with ESMTP id g4EAHf3i037251 ;
	Tue, 14 May 2002 19:17:42 +0900 (JST)
Message-Id: <200205141017.g4EAHf3i037251@rina.r.dl.itc.u-tokyo.ac.jp>
Date: Tue, 14 May 2002 19:17:40 +0900
From: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>
To: Alfred Perlstein <bright@mu.org>
Cc: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>,
	current@FreeBSD.org, smp@FreeBSD.org
Subject: Re: The updated socket patch and axing sotryfree() (Re: Locking down a socket, milestone 1)
In-Reply-To: <20020508161656.GV36741@elvis.mu.org>
References: <200204241110.g3OB8u8t006194@bunko>
	<200205081159.g48Bx63i045654@rina.r.dl.itc.u-tokyo.ac.jp>
	<20020508161656.GV36741@elvis.mu.org>
User-Agent: Wanderlust/2.8.1 (Something) SEMI/1.14.3 (Ushinoya) FLIM/1.14.3
 (=?ISO-8859-1?Q?Unebigory=F2mae?=) APEL/10.3 MULE XEmacs/21.1 (patch 14)
 (Cuyahoga Valley) (i386--freebsd)
Organization: Digital Library Research Division, Information Techinology Centre, The University of Tokyo
MIME-Version: 1.0 (generated by SEMI 1.14.3 - "Ushinoya")
Content-Type: text/plain; charset=US-ASCII
Sender: owner-freebsd-smp@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-smp.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-smp>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-smp>
X-Loop: FreeBSD.org

On Wed, 8 May 2002 09:16:56 -0700,
  Alfred Perlstein <bright@mu.org> said:

bright> * Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp> [020508 04:59] wrote:
>> 
>> I would like to commit this patch in one or two weeks to start working
>> on a possible race between a user process and a netisr kthread,
>> prevented by only the Giant lock at the moment.
>> 
>> When a user process calls sofree() for a listening socket, it attempts
>> to free the sockets in the connection queues by soabort().  If the
>> connection of an aborting socket gets dropped by a remote host (eg by
>> TCP RST), a netisr kthread also attempts to free the socket.  Since
>> the reference count of a socket in a connection queue is zero, this
>> would resust in doubly freeing a socket.
>> 
>> To solve that problem, I would like to axe sotryfree().  The PCB of a
>> socket and a connection queue should hold a reference to the
>> socket. This should make the reference count of an alive socket always
>> be >= 1, and ensure that there is only one referer to a socket to be
>> freed.
>> 
>> Comments?

bright> I'm not sure exactly how this solves the problem, there may need to
bright> be a global socket mutex, perhaps putting this sort of operation under
bright> that may do what you want.

Yes, at least, we should introduce a global lock to protect the
relation between sockets and PCBs.


bright> Off the top of my head...

bright> I think one way of doing it is storing the hashlist that the socket
bright> belongs to (inpcb_hash) inside the sockets.  That way after a lookup
bright> you will have the lock on the parent structure, a user process will
bright> have to follow the same locking paradym, basically look at the head
bright> socket, lock the hashlist, then manipulate the incomplete queue.

bright> Basically, protect this sort of operation via the hashlist because
bright> you pretty much need to anyway. :)


In order to solve the issue of deallocation race by a hashlist lock,
we *always* have to obtain a socket or a PCB by looking up a hashlist.
This is quite problematic because:

1. the lock order between a socket and a PCB gets tangled,

2. a hashlist introduces an overhead of calculating a hash index
   value, and

3. a hashlint lock cannot be per-socket or per-PCB, resulting in a
   contention under a huge number of socket operations or incoming
   packets.

In order to make our lock strategy readable and comprehensible, we
should keep a lock order as simple as the following:

1. a lock to protect the relation between/among objects, (eg the
   proctree lock or the allproc lock) and

2. a lock dedicated to a single object. (eg a proc lock)

A reference count allows us a flexible way to keep a lock order clean.
Once we grabbed a reference to an object, we can unlock it completely
to restart with locking any lock protecting a relation.  For instance,
in the interrupt handler of a protocol stack, you lock a hashlist to
look up the PCB appropriate to an incoming packet.  You then lock the
PCB to do some work.  If you have to modify the socket corresponding
to the PCB, hold a reference to the PCB and unlock it.  Now you can
lock the relation between sockets and PCBs to grab the socket safely.

This strategy should be applicable to a socket operation initiated by
a user process as well.  We will not have to worry about the lock
order between a socket and a PCB.

Another advantage of a reference count is its cost.  Provided that we
hold an appropriate lock, we can simply follow a pointer to obtain an
object.  This is much cheaper than we calculate a hash index.  We can
also reduce the contention over a lock because the lock of a reference
count is per-socket or per-PCB.


bright> Other than that, have you looked at what BSD/os does and what Linux
bright> does?  Do they get it wrong or have any particular drawbacks?

BSD/OS seems to ensure the existence of a PCB by locking the hashlist
of PCBs.  I am worrying about the fact that they lock the hashlist for
quite a long duration. (about a half of udp_input() hold a read lock,
and almost all of tcp_input() hold a *write* lock.)

-- 
Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp> <tanimura@FreeBSD.org>

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message