From owner-freebsd-net@FreeBSD.ORG  Sat Aug 20 23:02:29 2005
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
X-Original-To: net@FreeBSD.org
Delivered-To: freebsd-net@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id D53C516A421
	for <net@FreeBSD.org>; Sat, 20 Aug 2005 23:02:29 +0000 (GMT)
	(envelope-from ups@tree.com)
Received: from smtp.speedfactory.net (smtp.speedfactory.net [66.23.216.216])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 45E3043D55
	for <net@FreeBSD.org>; Sat, 20 Aug 2005 23:02:27 +0000 (GMT)
	(envelope-from ups@tree.com)
Received: (qmail 13794 invoked by uid 210); 20 Aug 2005 23:03:00 +0000
Received: from 66.23.216.49 by talon (envelope-from <ups@tree.com>,
	uid 201) with qmail-scanner-1.25st 
	(clamdscan: 0.85.1/1034. spamassassin: 3.0.2. perlscan: 1.25st.  
	Clear:RC:1(66.23.216.49):. 
	Processed in 0.039084 secs); 20 Aug 2005 23:03:00 -0000
X-Qmail-Scanner-Mail-From: ups@tree.com via talon
X-Qmail-Scanner: 1.25st (Clear:RC:1(66.23.216.49):. Processed in 0.039084 secs
	Process 13788)
Received: from 66-23-216-49.clients.speedfactory.net (HELO palm.tree.com)
	(66.23.216.49)
	by smtp.speedfactory.net with AES256-SHA encrypted SMTP;
	20 Aug 2005 23:03:00 +0000
Received: from [127.0.0.1] (ups@localhost.tree.com [127.0.0.1])
	by palm.tree.com (8.12.10/8.12.10) with ESMTP id j7KN2NrK021532;
	Sat, 20 Aug 2005 19:02:24 -0400 (EDT) (envelope-from ups@tree.com)
From: Stephan Uphoff <ups@tree.com>
To: John Baldwin <jhb@FreeBSD.org>
In-Reply-To: <200508181026.35502.jhb@FreeBSD.org>
References: <20050816170519.A74422@xorpc.icir.org>
	<20050818005739.A83776@xorpc.icir.org>
	<1124374713.1360.64660.camel@palm>
	<200508181026.35502.jhb@FreeBSD.org>
Content-Type: text/plain
Message-Id: <1124578943.1360.73407.camel@palm>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.6 
Date: Sat, 20 Aug 2005 19:02:23 -0400
Content-Transfer-Encoding: 7bit
Cc: Luigi Rizzo <rizzo@icir.org>, net@FreeBSD.org,
	Max Laier <max@love2party.net>, arch@FreeBSD.org,
	"freebsd-arch@freebsd.org" <freebsd-arch@FreeBSD.org>
Subject: Re: duplicate  read/write locks in net/pfil.c and netinet/ip_fw2.c
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
	<mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 20 Aug 2005 23:02:30 -0000

On Thu, 2005-08-18 at 10:26, John Baldwin wrote:
> On Thursday 18 August 2005 10:18 am, Stephan Uphoff wrote:
> > On Thu, 2005-08-18 at 03:57, Luigi Rizzo wrote:
> > > On Thu, Aug 18, 2005 at 03:32:19AM +0200, Max Laier wrote:
> > > > On Thursday 18 August 2005 02:02, Luigi Rizzo wrote:
> > >
> > > ...
> > >
> > > > > could you guys look at the following code and see if it makes sense,
> > > > > or tell me where i am wrong ?
> > > > >
> > > > > It should solve the starvation and blocking trylock problems,
> > > > > because the active reader does not hold the mutex in the critical
> > >
> > >                        ^^^^^^
> > >
> > > i meant 'writer', sorry... as max said even in the current implementation
> > > the reader does not hold the lock.
> > >
> > > > > int
> > > > > RLOCK(struct rwlock *rwl, int try)
> > > > > {
> > > > >         if (!try)
> > > > >                 mtx_lock(&rwl->m);
> > > > >         else if (!mtx_trylock(&rwl->m))
> > > > >                 return EBUSY;
> > > > >         if (rwl->writers == 0)  /* no writer, pass */
> > > > >                 rwl->readers++;
> > > > >         else {
> > > > >                 rwl->br++;
> > > > >                 cv_wait(&rwl->qr, &rwl->m);
> > > >
> > > >                   ^^^^^^^
> > > >
> > > > That we can't do.  That's exactly the thing the existing sx(9)
> > > > implementation does and where it breaks.  The problem is that cv_wait()
> > > > is an implicit sleep which breaks when we try to RLOCK() with other
> > > > mutex already acquired.
> > >
> > > but that is not a solvable problem given that the *LOCK may be blocking.
> > > And the cv_wait is not an unconditioned sleep, it is one where you
> > > release the lock right before ans wait for an event to wake you up.
> > > In fact i don't understand why you consider spinning and sleeping
> > > on a mutex two different things.
> >
> > The major difference between sleeping (cv_wait,msleep,..) and blocking
> > on a mutex is priority inheritance.
> > If you need to be able to use (non-spin) mutexes while holding a
> > [R|W]LOCK and use a [R|W]LOCK while holding a (non-spin) mutex then you
> > need to implement priority inheritance for [R|W]LOCKs.
> > For the (single) write lock holder tracking priority is easy. (just like
> > a (non-spin) mutex). However priority inheritance to multiple readers is
> > more difficult as one needs to keep track of all holders of the lock.
> > Keeping track of all readers requires pre-allocated memory resources.
> > This memory could come from
> > 	1) A limited global pool
> > 	2) A limited per [R|W]LOCK pool
> > 	3) A limited per thread pool
> > 	4) As a parameter for acquiring a RLOCK
> > None of the choices are really pretty.
> > (1),(2) and (3) can lead to limiting reader parallelism when running out
> > of resources. (4) may be practically for some cases since the memory
> > could be allocated from stack. However since the memory must be valid
> > while holding a read lock choice (4) makes some algorithms (than use for
> > example lock crabbing) a bit harder to implement.
> 
> Solaris handles the read case by only tracking the first thread to get a read 
> lock (referred to as the "owner of record" IIRC) and only propagating 
> priority to that thread and ignoring other readers.  They admit it's not 
> perfect as well.  That's mentioned in the Solaris Internals book.

Hi John,

If I recall correctly Solaris user threads get a better "system
priority" when running/blocking in the kernel. This prevents threads
holding critical kernel resources from being starved by user processes.
I think this would also be a good idea for FreeBSD to prevent unbound
priority inversions caused by vnode locks and other non-tracked
resources. However even with these changes I don't think the Solaris
implementation is the right way to go if we want to freely mix mutexes
and [R|W]locks.


Stephan