From: Peter Holm
Date: Tue, 14 Mar 2017 18:02:20 +0100
To: Mark Johnston
Cc: freebsd-hackers@FreeBSD.org
Subject: Re: draining high-frequency callouts
Message-ID: <20170314170220.GA22844@x2.osted.lan>
References: <20170110205711.GA86449@wkstn-mjohnston.west.isilon.com> <20170313082120.GA44651@x2.osted.lan> <20170313183813.GB57357@wkstn-mjohnston.west.isilon.com>
In-Reply-To: <20170313183813.GB57357@wkstn-mjohnston.west.isilon.com>
List-Id: Technical Discussions relating to FreeBSD

On Mon, Mar 13, 2017 at 11:38:13AM -0700, Mark Johnston wrote:
> On Mon, Mar 13, 2017 at 09:21:20AM +0100, Peter Holm wrote:
> > On Tue, Jan 10, 2017 at 12:57:12PM -0800, Mark Johnston wrote:
> > > I'm occasionally seeing an assertion failure in softclock_call_cc() when
> > > running DTrace tests on a system with hz=10000. The assertion
> > > (c->c_flags & CALLOUT_ACTIVE) != 0 is failing while a thread is
> > > concurrently draining the callout, which runs at a high frequency. At
> > > the time of the panic, that thread is spinning on the per-CPU callout
> > > lock after having been awoken from "codrain", and CALLOUT_PENDING is
> > > set on the callout. The callout is direct, i.e., it is executed in hard
> > > interrupt context.
> > >
> > > I think this is what's happening:
> > > - callout_drain() is called while the callout is executing but after the
> > >   callout has rescheduled itself, and goes to sleep after having cleared
> > >   CALLOUT_ACTIVE.
> > > - softclock_call_cc() wakes up the callout_drain() caller, but the
> > >   callout fires again before the caller is scheduled.
> > > - the second softclock_call_cc() call sees that CALLOUT_ACTIVE is
> > >   cleared and panics.
> > >
> > > Is there anything that prevents this scenario? Is it really correct to
> > > leave CALLOUT_ACTIVE cleared when the per-CPU callout lock must be
> > > dropped in order to acquire a sleepqueue lock?
> > >
> >
> > Is this the same problem?
> >
> > panic: softclock_call_cc: act 0xfffff8000de64800 0
>
> It's hard to say for sure. The minimal patch below fixed the problem for
> me - could you give it a try? I also did not see any problems while
> testing on Hans' branch.
>
> diff --git a/sys/kern/kern_timeout.c b/sys/kern/kern_timeout.c
> index 5b70cf2033f5..a9c50fd98fbe 100644
> --- a/sys/kern/kern_timeout.c
> +++ b/sys/kern/kern_timeout.c
> @@ -1256,7 +1256,8 @@ again:
>  		 * Succeed we to stop it or not, we must clear the
>  		 * active flag - this is what API users expect.
>  		 */
> -		c->c_flags &= ~CALLOUT_ACTIVE;
> +		if ((flags & CS_DRAIN) == 0)
> +			c->c_flags &= ~CALLOUT_ACTIVE;
>
>  		if ((flags & CS_DRAIN) != 0) {
>  			/*
> @@ -1315,6 +1316,7 @@ again:
>  				PICKUP_GIANT();
>  				CC_LOCK(cc);
>  			}
> +			c->c_flags &= ~CALLOUT_ACTIVE;
>  		} else if (use_lock &&
>  			   !cc_exec_cancel(cc, direct) && (drain == NULL)) {
>

I ran the test that triggered the panic all night, then followed up with
a buildworld plus a random mix of tests, for a total of 24 hours.
No problems seen.

--
Peter