Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 11 May 2013 14:43:00 -1000 (HST)
From:      Jeff Roberson <jroberson@jroberson.net>
To:        Marcel Moolenaar <marcel@FreeBSD.org>, jhb@freebsd.org, alfred@freebsd.org, attilio@freebsd.org
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r250411 - in head/sys: conf kern sys
Message-ID:  <alpine.BSF.2.00.1305111405421.2005@desktop>
In-Reply-To: <201305091628.r49GSI33039873@svn.freebsd.org>
References:  <201305091628.r49GSI33039873@svn.freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, 9 May 2013, Marcel Moolenaar wrote:

> Author: marcel
> Date: Thu May  9 16:28:18 2013
> New Revision: 250411
> URL: http://svnweb.freebsd.org/changeset/base/250411
>
> Log:
>  Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
>  locks. To support this, VNODE locks are created with the LK_IS_VNODE
>  flag. This flag is propagated down using the LO_IS_VNODE flag.
>
>  Note that WITNESS still records the LOR. Only the printing and the
>  optional entering into the kernel debugger is bypassed with the
>  WITNESS_NO_VNODE option.

I'm replying to the original commit because the resulting thread got way 
out of hand.  We need to all take a deep breath and take a pragmatic 
approach to solving the problem at hand.

Let me first say I understand the utility here as this is also coming up 
in my organization.  Test, and users, do not want to see erroneous warning 
messages.  I understand that.  Let's find a solution.

Secondly, I think this project has grown too far for us to commit changes 
like this without some focused discussion.  We need to be more mindful of 
the size of the impact and the number of people who are interested in a 
particular area.  I'm not picking on you Marcel because this sort of thing 
has been coming up lately and we have all been guilty of it from time to 
time.  There are more companies and individuals than ever trying to push 
work into the repository and we're having some growing pains.

I am intimately familiar with the problems that lead to these erroneous 
witness messages as I have tracked down many of them and am even 
responsible for the code that generates them in some cases.  Let me first 
outline a handful of generic problems.  The root cause is that witness can 
not determine the real order between two locks due to relationships too 
complex to describe with a pair of strings.

One example, which has been brought up, is the hierarchical nature of 
vnode locks.  This impacts vnodes within one filesystem but it also 
involves vnodes between two different filesystems as you cross mount 
points.  We can construct perfectly valid and deadlock free chains of 
mount points that have two different filesystem types in different orders 
which will LOR at the boundaries.  We already skip duplicates to avoid 
this problem within each filesystem.  We need to skip cross-filesystem 
duplicates, most desirably at the few specific places where this happens. 
This problem comes up especially for devfs because we lock devvps while 
file vnodes are locked but we lock devfs directories after the rootfs lock 
when crossing mountpoints in lookup.

A second example, is locks of a fundamentally different type that have a 
complex ordering relationship.  For example, a vnode lock may be acquired 
after a buf lock belonging to the parent's directory block.  A cg buf lock 
may be acquired after any file buf lock.  Here we want to ignore 
interactions between these two specific types at this particular location 
but not others as they may be unsafe.

The third example, is a complex locking pattern with shared locks as 
presented by dirhash.  We are seeing a similar pattern develop in the vm 
where we are going to use an exclusive object lock to protect pages or a 
shared object lock + a page lock.  The semantics only get more complex as 
we push for more scalability.  I expect to see more of these patterns 
develop.

None of these problems can be solved with names alone.  So far we've 
just lived with the warnings and we're no longer willing to accept that. 
What we need is a solution that blesses the specific instances and the 
specific lock classes involved without silencing legitimate warnings that 
may only occur after new code is added.  For example, it may be safe to 
add a sx lock around some vnode code but you may not notice that you LOR 
if you silence all witness warnings related to the vnode lock site.

I believe that the perfect solution would be a mechanism that could teach 
witness about and enforce these specific relationships.  However, that may 
be computationally prohibitive and too complex to code.  A more reasonable 
option would be to bless the specific relationships at the specific call 
sites.  Turning all witness off at particular sites or with particular 
types renders important infrastructure useless for very large functional 
areas.  It's also important to distinguish between squelching the error 
message from eliminating the other state that is saved at lock sites.

We already have lock names and types.  What I would propose we do is make 
the type 'vnode' for all vnodes and 'buf' for all bufs with the names used 
for the specific filesystems.  Then you could specify a DUPOK that 
automatically blesses any filesystem to filesystem related LORs.  In this 
way witness still records the call sites and unrelated LORs or panics 
still have the acquisition information.  You could eventually unwind this 
to only DUPOK at the specific currently known places that we anticipate 
multiple vnodes.

To solve the buf and other complex LORs involving multiple types you would 
make lock variants that accept a blessed list of the specific locks you 
are blessing.  We already have support for a blessed list in witness.  We 
just want something like this per-call.

For example;  When acquiring a child vnode lock after a parent lock when 
a parent's buf is held I would do something like this:

vn_lock_flags(vp, LK_EXCLUSIVE, { "vnode", "bufwait", NULL });

Written properly dead code elimination will take care of it for non-debug 
builds.  You could come up with other ways of writing this that don't 
involve passing pointers down the stack.  For example, making a unique 
token for this blessed pair and passing it in the high bits of the flags. 
I don't really care about the particular implementation but I think this 
is the right model.

Thanks,
Jeff

>
> Modified:
>  head/sys/conf/options
>  head/sys/kern/kern_lock.c
>  head/sys/kern/subr_witness.c
>  head/sys/kern/vfs_subr.c
>  head/sys/sys/lock.h
>  head/sys/sys/lockmgr.h
>
> Modified: head/sys/conf/options
> ==============================================================================
> --- head/sys/conf/options	Thu May  9 16:09:39 2013	(r250410)
> +++ head/sys/conf/options	Thu May  9 16:28:18 2013	(r250411)
> @@ -672,6 +672,7 @@ KTR_ENTRIES		opt_global.h
> KTR_VERBOSE		opt_ktr.h
> WITNESS			opt_global.h
> WITNESS_KDB		opt_witness.h
> +WITNESS_NO_VNODE	opt_witness.h
> WITNESS_SKIPSPIN	opt_witness.h
>
> # options for ACPI support
>
> Modified: head/sys/kern/kern_lock.c
> ==============================================================================
> --- head/sys/kern/kern_lock.c	Thu May  9 16:09:39 2013	(r250410)
> +++ head/sys/kern/kern_lock.c	Thu May  9 16:28:18 2013	(r250411)
> @@ -393,6 +393,8 @@ lockinit(struct lock *lk, int pri, const
> 		iflags |= LO_WITNESS;
> 	if (flags & LK_QUIET)
> 		iflags |= LO_QUIET;
> +	if (flags & LK_IS_VNODE)
> +		iflags |= LO_IS_VNODE;
> 	iflags |= flags & (LK_ADAPTIVE | LK_NOSHARE);
>
> 	lk->lk_lock = LK_UNLOCKED;
>
> Modified: head/sys/kern/subr_witness.c
> ==============================================================================
> --- head/sys/kern/subr_witness.c	Thu May  9 16:09:39 2013	(r250410)
> +++ head/sys/kern/subr_witness.c	Thu May  9 16:28:18 2013	(r250411)
> @@ -1289,7 +1289,19 @@ witness_checkorder(struct lock_object *l
> 			w->w_reversed = w1->w_reversed = 1;
> 			witness_increment_graph_generation();
> 			mtx_unlock_spin(&w_mtx);
> -
> +
> +#ifdef WITNESS_NO_VNODE
> +			/*
> +			 * There are known LORs between VNODE locks. They are
> +			 * not an indication of a bug. VNODE locks are flagged
> +			 * as such (LO_IS_VNODE) and we don't yell if the LOR
> +			 * is between 2 VNODE locks.
> +			 */
> +			if ((lock->lo_flags & LO_IS_VNODE) != 0 &&
> +			    (lock1->li_lock->lo_flags & LO_IS_VNODE) != 0)
> +				return;
> +#endif
> +
> 			/*
> 			 * Ok, yell about it.
> 			 */
>
> Modified: head/sys/kern/vfs_subr.c
> ==============================================================================
> --- head/sys/kern/vfs_subr.c	Thu May  9 16:09:39 2013	(r250410)
> +++ head/sys/kern/vfs_subr.c	Thu May  9 16:28:18 2013	(r250411)
> @@ -1037,7 +1037,7 @@ alloc:
> 	 * By default, don't allow shared locks unless filesystems
> 	 * opt-in.
> 	 */
> -	lockinit(vp->v_vnlock, PVFS, tag, VLKTIMEOUT, LK_NOSHARE);
> +	lockinit(vp->v_vnlock, PVFS, tag, VLKTIMEOUT, LK_NOSHARE | LK_IS_VNODE);
> 	/*
> 	 * Initialize bufobj.
> 	 */
>
> Modified: head/sys/sys/lock.h
> ==============================================================================
> --- head/sys/sys/lock.h	Thu May  9 16:09:39 2013	(r250410)
> +++ head/sys/sys/lock.h	Thu May  9 16:28:18 2013	(r250411)
> @@ -79,6 +79,7 @@ struct lock_class {
> #define	LO_SLEEPABLE	0x00100000	/* Lock may be held while sleeping. */
> #define	LO_UPGRADABLE	0x00200000	/* Lock may be upgraded/downgraded. */
> #define	LO_DUPOK	0x00400000	/* Don't check for duplicate acquires */
> +#define	LO_IS_VNODE	0x00800000	/* Tell WITNESS about a VNODE lock */
> #define	LO_CLASSMASK	0x0f000000	/* Class index bitmask. */
> #define LO_NOPROFILE    0x10000000      /* Don't profile this lock */
>
>
> Modified: head/sys/sys/lockmgr.h
> ==============================================================================
> --- head/sys/sys/lockmgr.h	Thu May  9 16:09:39 2013	(r250410)
> +++ head/sys/sys/lockmgr.h	Thu May  9 16:28:18 2013	(r250411)
> @@ -146,6 +146,7 @@ _lockmgr_args_rw(struct lock *lk, u_int
> #define	LK_NOWITNESS	0x000010
> #define	LK_QUIET	0x000020
> #define	LK_ADAPTIVE	0x000040
> +#define	LK_IS_VNODE	0x000080	/* Tell WITNESS about a VNODE lock */
>
> /*
>  * Additional attributes to be used in lockmgr().
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?alpine.BSF.2.00.1305111405421.2005>