From owner-freebsd-arch@FreeBSD.ORG  Thu Aug 18 14:18:38 2005
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
X-Original-To: arch@freebsd.org
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id AD22116A41F
	for <arch@freebsd.org>; Thu, 18 Aug 2005 14:18:38 +0000 (GMT)
	(envelope-from ups@tree.com)
Received: from smtp.speedfactory.net (smtp.speedfactory.net [66.23.216.216])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 84B9543D48
	for <arch@freebsd.org>; Thu, 18 Aug 2005 14:18:37 +0000 (GMT)
	(envelope-from ups@tree.com)
Received: (qmail 9936 invoked by uid 210); 18 Aug 2005 14:19:00 +0000
Received: from 66.23.216.49 by talon (envelope-from <ups@tree.com>,
	uid 201) with qmail-scanner-1.25st 
	(clamdscan: 0.85.1/1030. spamassassin: 3.0.2. perlscan: 1.25st.  
	Clear:RC:1(66.23.216.49):. 
	Processed in 0.051598 secs); 18 Aug 2005 14:19:00 -0000
X-Qmail-Scanner-Mail-From: ups@tree.com via talon
X-Qmail-Scanner: 1.25st (Clear:RC:1(66.23.216.49):. Processed in 0.051598 secs
	Process 9926)
Received: from 66-23-216-49.clients.speedfactory.net (HELO palm.tree.com)
	(66.23.216.49)
	by smtp.speedfactory.net with AES256-SHA encrypted SMTP;
	18 Aug 2005 14:19:00 +0000
Received: from [127.0.0.1] (ups@localhost.tree.com [127.0.0.1])
	by palm.tree.com (8.12.10/8.12.10) with ESMTP id j7IEIXrK007363;
	Thu, 18 Aug 2005 10:18:35 -0400 (EDT) (envelope-from ups@tree.com)
From: Stephan Uphoff <ups@tree.com>
To: Luigi Rizzo <rizzo@icir.org>
In-Reply-To: <20050818005739.A83776@xorpc.icir.org>
References: <20050816170519.A74422@xorpc.icir.org>
	<200508170435.34688.max@love2party.net>
	<20050817170248.A70991@xorpc.icir.org>
	<200508180332.34895.max@love2party.net>
	<20050818005739.A83776@xorpc.icir.org>
Content-Type: text/plain
Message-Id: <1124374713.1360.64660.camel@palm>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.6 
Date: Thu, 18 Aug 2005 10:18:33 -0400
Content-Transfer-Encoding: 7bit
Cc: Max Laier <max@love2party.net>, net@freebsd.org, arch@freebsd.org
Subject: Re: duplicate  read/write locks in net/pfil.c and netinet/ip_fw2.c
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 18 Aug 2005 14:18:38 -0000

On Thu, 2005-08-18 at 03:57, Luigi Rizzo wrote:
> On Thu, Aug 18, 2005 at 03:32:19AM +0200, Max Laier wrote:
> > On Thursday 18 August 2005 02:02, Luigi Rizzo wrote:
> ...
> > > could you guys look at the following code and see if it makes sense,
> > > or tell me where i am wrong ?
> > >
> > > It should solve the starvation and blocking trylock problems,
> > > because the active reader does not hold the mutex in the critical
>                        ^^^^^^
> 
> i meant 'writer', sorry... as max said even in the current implementation
> the reader does not hold the lock.
> 
> > > int
> > > RLOCK(struct rwlock *rwl, int try)
> > > {
> > >         if (!try)
> > >                 mtx_lock(&rwl->m);
> > >         else if (!mtx_trylock(&rwl->m))
> > >                 return EBUSY;
> > >         if (rwl->writers == 0)  /* no writer, pass */
> > >                 rwl->readers++;
> > >         else {
> > >                 rwl->br++;
> > >                 cv_wait(&rwl->qr, &rwl->m);
> >                   ^^^^^^^
> > 
> > That we can't do.  That's exactly the thing the existing sx(9) implementation 
> > does and where it breaks.  The problem is that cv_wait() is an implicit sleep 
> > which breaks when we try to RLOCK() with other mutex already acquired.  
> 
> but that is not a solvable problem given that the *LOCK may be blocking.
> And the cv_wait is not an unconditioned sleep, it is one where you release
> the lock right before ans wait for an event to wake you up.
> In fact i don't understand why you consider spinning and sleeping
> on a mutex two different things.

The major difference between sleeping (cv_wait,msleep,..) and blocking
on a mutex is priority inheritance.
If you need to be able to use (non-spin) mutexes while holding a
[R|W]LOCK and use a [R|W]LOCK while holding a (non-spin) mutex then you
need to implement priority inheritance for [R|W]LOCKs.
For the (single) write lock holder tracking priority is easy. (just like
a (non-spin) mutex). However priority inheritance to multiple readers is
more difficult as one needs to keep track of all holders of the lock.
Keeping track of all readers requires pre-allocated memory resources.
This memory could come from
	1) A limited global pool
	2) A limited per [R|W]LOCK pool
	3) A limited per thread pool
	4) As a parameter for acquiring a RLOCK
None of the choices are really pretty.
(1),(2) and (3) can lead to limiting reader parallelism when running out
of resources. (4) may be practically for some cases since the memory
could be allocated from stack. However since the memory must be valid
while holding a read lock choice (4) makes some algorithms (than use for
example lock crabbing) a bit harder to implement.

> > Moreover will this break for recursive reads e.g.:
> > 
> > Thread 1: RLOCK() ...      RLOCK() -> cv_wait ...
> > Thread 2:          WLOCK() -> cv_wait ...
> > 
> > This is exactly what pfil_hooks must be able to do as the packet filter may 
> > want to call back to the stack in order to send rejects etc.
> 
> that's another story (also in issue in ipfw) and the way it is addressed
> elsewhere is by releasing and reaquiring the lock. In fact is the topic
> that started this thread.
> 
> > change).  The problem is, that we still have to invest 4 mutex operations for 
> > every access.  The current implementation has the same basic problem (though 
> > it only uses 2 mutex operations for the WLOCK/UNLOCK).  Ideally the RLOCK/ 
> > UNLOCK should be free unless there is a writer waiting for the lock.  In 
> 
> "free unless" means not free - you always have to check, be it through
> an atomic cmpswap or something else. But this is what mtx_lock does
> anyways in the fast path.
> 
> 	cheers
> 	luigi
>