From owner-freebsd-hackers@freebsd.org  Tue Mar 14 17:02:25 2017
Date: Tue, 14 Mar 2017 18:02:20 +0100
From: Peter Holm <peter@holm.cc>
To: Mark Johnston <markj@FreeBSD.org>
Cc: freebsd-hackers@FreeBSD.org
Subject: Re: draining high-frequency callouts
Message-ID: <20170314170220.GA22844@x2.osted.lan>
References: <20170110205711.GA86449@wkstn-mjohnston.west.isilon.com>
 <20170313082120.GA44651@x2.osted.lan>
 <20170313183813.GB57357@wkstn-mjohnston.west.isilon.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170313183813.GB57357@wkstn-mjohnston.west.isilon.com>
User-Agent: Mutt/1.5.23 (2014-03-12)

On Mon, Mar 13, 2017 at 11:38:13AM -0700, Mark Johnston wrote:
> On Mon, Mar 13, 2017 at 09:21:20AM +0100, Peter Holm wrote:
> > On Tue, Jan 10, 2017 at 12:57:12PM -0800, Mark Johnston wrote:
> > > I'm occasionally seeing an assertion failure in softclock_call_cc() when
> > > running DTrace tests on a system with hz=10000. The assertion
> > > (c->c_flags & CALLOUT_ACTIVE) != 0 is failing while a thread is
> > > concurrently draining the callout, which runs at a high frequency. At
> > > the time of the panic, that thread is spinning on the per-CPU callout
> > > lock after having been awoken from "codrain", and CALLOUT_PENDING is
> > > set on the callout. The callout is direct, i.e., it is executed in hard
> > > interrupt context.
> > > 
> > > I think this is what's happening:
> > > - callout_drain() is called while the callout is executing but after the
> > >   callout has rescheduled itself, and goes to sleep after having cleared
> > >   CALLOUT_ACTIVE.
> > > - softclock_call_cc() wakes up the callout_drain() caller, but the
> > >   callout fires again before the caller is scheduled.
> > > - the second softclock_call_cc() call sees that CALLOUT_ACTIVE is
> > >   cleared and panics.
> > > 
> > > Is there anything that prevents this scenario? Is it really correct to
> > > leave CALLOUT_ACTIVE cleared when the per-CPU callout lock must be
> > > dropped in order to acquire a sleepqueue lock?
> > > 
> > 
> > Is this the same problem?
> > 
> > panic: softclock_call_cc: act 0xfffff8000de64800 0
> 
> It's hard to say for sure. The minimal patch below fixed the problem for
> me - could you give it a try? I also did not see any problems while
> testing on Hans' branch.
> 
> diff --git a/sys/kern/kern_timeout.c b/sys/kern/kern_timeout.c
> index 5b70cf2033f5..a9c50fd98fbe 100644
> --- a/sys/kern/kern_timeout.c
> +++ b/sys/kern/kern_timeout.c
> @@ -1256,7 +1256,8 @@ again:
>  		 * Succeed we to stop it or not, we must clear the
>  		 * active flag - this is what API users expect.
>  		 */
> -		c->c_flags &= ~CALLOUT_ACTIVE;
> +		if ((flags & CS_DRAIN) == 0)
> +			c->c_flags &= ~CALLOUT_ACTIVE;
>  
>  		if ((flags & CS_DRAIN) != 0) {
>  			/*
> @@ -1315,6 +1316,7 @@ again:
>  				PICKUP_GIANT();
>  				CC_LOCK(cc);
>  			}
> +			c->c_flags &= ~CALLOUT_ACTIVE;
>  		} else if (use_lock &&
>  			   !cc_exec_cancel(cc, direct) && (drain == NULL)) {
>  			

I ran the test that triggered the panic all night with the patch applied,
then followed up with a buildworld plus a random mix of tests, for a
total of 24 hours.

No problems seen.
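
For anyone following along, the interleaving described in the quoted
analysis can be replayed with a small userspace model. This is only a
sketch under stated assumptions: it is single-threaded, the flag values
are made up, and struct model_callout, model_schedule(),
model_fire_begin() and replay_drain_race() are hypothetical names that
do not exist in kern_timeout.c. It only mirrors the order in which
CALLOUT_ACTIVE is cleared before and after the patch.

```c
#include <stdbool.h>

/* Illustrative flag values; only the flag names come from the thread. */
#define CALLOUT_ACTIVE  0x1
#define CALLOUT_PENDING 0x2

/* Hypothetical stand-in for struct callout: just the two flags. */
struct model_callout {
	int flags;
};

/* Models (re)arming the callout: it becomes active and pending. */
void
model_schedule(struct model_callout *c)
{
	c->flags |= CALLOUT_ACTIVE | CALLOUT_PENDING;
}

/*
 * Models the start of a callout run. Returns false where the kernel
 * would trip the (c->c_flags & CALLOUT_ACTIVE) != 0 assertion.
 */
bool
model_fire_begin(struct model_callout *c)
{
	if ((c->flags & CALLOUT_ACTIVE) == 0)
		return (false);
	c->flags &= ~CALLOUT_PENDING;
	return (true);
}

/*
 * Replays the suspected interleaving step by step.
 * clear_before_sleep == true models the original ordering, false the
 * ordering in the proposed patch. Returns true if no run ever saw the
 * callout without CALLOUT_ACTIVE.
 */
bool
replay_drain_race(bool clear_before_sleep)
{
	struct model_callout c = { 0 };

	model_schedule(&c);		/* initial arming */
	if (!model_fire_begin(&c))	/* first run starts */
		return (false);
	model_schedule(&c);		/* handler re-arms itself */

	/* The drain request arrives while the handler is still running. */
	if (clear_before_sleep)
		c.flags &= ~CALLOUT_ACTIVE;

	/*
	 * The drainer sleeps. The handler returns and wakes it, but the
	 * callout fires again before the drainer gets to run.
	 */
	if (!model_fire_begin(&c))
		return (false);

	/* The drainer finally runs and finishes stopping the callout. */
	c.flags &= ~(CALLOUT_ACTIVE | CALLOUT_PENDING);
	return (true);
}
```

In this model replay_drain_race(true) reports the failed check and
replay_drain_race(false) does not, which matches what the patch is
meant to achieve. It says nothing about other paths in the real code.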

-- 
Peter