From owner-freebsd-current@freebsd.org Thu Dec 15 01:03:15 2016
Date: Wed, 14 Dec 2016 17:03:14 -0800
From: "Steven G. Kargl"
To: Mark Johnston
Cc: kargl@uw.edu, freebsd-current@freebsd.org, kib@freebsd.org, jhb@FreeBSD.org
Subject: Re: Revision 309657 to stack_machdep.c renders unbootable system
Message-ID: <20161215010314.GA56272@troutmask.apl.washington.edu>
Reply-To: kargl@uw.edu
References: <20161214194848.GA881@troutmask.apl.washington.edu>
 <20161214201416.GA64767@wkstn-mjohnston.west.isilon.com>
 <20161214221048.GB64767@wkstn-mjohnston.west.isilon.com>
 <20161214234804.GA26443@troutmask.apl.washington.edu>
 <20161215005012.GA84222@wkstn-mjohnston.west.isilon.com>
In-Reply-To: <20161215005012.GA84222@wkstn-mjohnston.west.isilon.com>
List-Id: Discussions about the use of FreeBSD-current

On Wed, Dec 14, 2016 at 04:50:21PM -0800, Mark Johnston wrote:
> On Wed, Dec 14, 2016 at 03:48:04PM -0800, Steven G. Kargl wrote:
> > On Wed, Dec 14, 2016 at 02:10:48PM -0800, Mark Johnston wrote:
> > > On Wed, Dec 14, 2016 at 12:14:16PM -0800, Mark Johnston wrote:
> > > > On Wed, Dec 14, 2016 at 11:49:26AM -0800, Steven G. Kargl wrote:
> > > > > Well, after 3 days of bisection, I finally found the commit
> > > > > that renders my system unbootable. The system does not panic.
> > > > > It simply gets stuck in some state. Nonfunctional keyboard,
> > > > > so can't break into debugger. No serial console available.
> > > > > The verbose dmesg.boot for a working kernel from revision
> > > > > 309656 is at
> > > > >
> > > > > http://troutmask.apl.washington.edu/~kargl/freebsd/dmesg.309656.txt
> > > > >
> > > > > The kernel config file is at
> > > > >
> > > > > http://troutmask.apl.washington.edu/~kargl/freebsd/SPEW.txt
> > > > >
> > > > > In looking at /usr/src/UPDATING, there is no warning that one
> > > > > can create a boat anchor by upgrading to 309657. If compiling
> > > > > a kernel with 'options DDB' is no longer supported, this should
> > > > > be stated in UPDATING. Or, UPDATING should state that 'options
> > > > > DDB' requires 'options STACK'. Or, 'options DDB' should simply
> > > > > do the right thing and pull in whatever 'options STACK' does.
> > > >
> > > > It is supported though - the point of that change was to fix a problem
> > > > that occurred when DDB is configured but STACK isn't. While testing I
> > > > tried every combination of the two options, and I just tried and
> > > > successfully booted a kernel with DDB and !STACK.
> > > >
> > > > Does the kernel boot successfully if STACK is added to your
> > > > configuration?
> > >
> > > I tried your config (plus virtio drivers) and was able to reproduce the
> > > hang in bhyve. Adding STACK "fixed" the hang, as did reverting part of
> > > my change to re-add dead code into the kernel. My VM was always hanging
> > > after printing
> > >
> > > 000.000050 [ 426] vtnet_netmap_attach virtio attached txq=1, txd=1024 rxq=1, rxd=1024
> > >
> > > Sure enough, removing "device netmap" from your config also fixes the
> > > hang. When the hang occurs, I can see with "bhyvectl --get-rip" that
> > > we're stuck in DELAY(), but I can't get a stack at that point. I think
> > > my change is an innocent bystander - it just happened to expose a latent
> > > issue elsewhere.
> > >
> > > I don't have much more time to look at this right now, but I'll look
> > > into it more tonight.
> >
> > Yes, adding STACK got me to a booting kernel. I can't remember
> > why I added netmap to my config file. Re-adding dead code seems to
> > point to some memory corruption issue or a rogue pointer. :(
>
> It's not quite that bad, as it turns out. The key is that
> adding/removing the dead code changes the ordering of the items in the
> sysinit linker set. I discovered that if the ctl(4) module is
> initialized before the vtnet driver attaches, the hang occurs, and
> reverting my commit results in a sysinit order where vtnet comes
> _before_ ctl(4). So my change triggers the problem just because it
> happens to perturb something in the compile-time linker.

Thanks for the explanation.

> The issue actually seems to be in 4BSD, and more specifically in r308564
> and r308565. Switching to ULE or reverting either of those two commits
> fixes the hang.

Oh, this is bad. The last time I checked (and it has been a while
ago), ULE has/had some very bad performance issues for numerical
computations that use OpenMPI (or likely any MPI implementation) if a
node becomes oversubscribed. 4BSD at least manages to recover.

Thanks for the pointer to r308564 and r308565. I'll take a look later
tonight as I've managed to break both firefox and chrome during
the upgrade.

-- 
Steve
http://troutmask.apl.washington.edu/~kargl/
https://www.youtube.com/watch?v=6hwgPfCcpyQ
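
For readers following the sysinit explanation quoted above, here is a
rough userland sketch of why link order can decide which of two
initializers runs first. It is not kernel code. The struct, the function
names, and the numeric subsystem/order values are all hypothetical, and
it assumes (this thread does not confirm it) that the ctl and vtnet
entries end up with the same sort key, so that only their position in
the linker set separates them.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a sysinit entry; NOT the kernel's struct sysinit. */
struct init_entry {
    unsigned subsystem;   /* coarse startup phase (made-up values) */
    unsigned order;       /* position within the phase (made-up values) */
    const char *name;     /* label used only by this sketch */
};

/*
 * Stable insertion sort by (subsystem, order).
 * Entries with equal keys are never swapped, so they keep the order in
 * which they were handed in, i.e. the order the linker emitted them.
 */
static void sort_entries(struct init_entry *v, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        struct init_entry key = v[i];
        size_t j = i;
        while (j > 0 &&
               (v[j - 1].subsystem > key.subsystem ||
                (v[j - 1].subsystem == key.subsystem &&
                 v[j - 1].order > key.order))) {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}

/*
 * Sorts the set in place, then reports whether `first` would run before
 * `second`. Returns 0 if either name is missing.
 */
static int runs_before(struct init_entry *v, size_t n,
                       const char *first, const char *second)
{
    size_t fi = n, si = n;

    sort_entries(v, n);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(v[i].name, first) == 0)
            fi = i;
        if (strcmp(v[i].name, second) == 0)
            si = i;
    }
    return fi < n && si < n && fi < si;
}
```

Feeding runs_before() the same three entries in two different input
orders (one with "vtnet" listed ahead of "ctl", one with them swapped,
both sharing one key) gives opposite answers, even though no entry's
declared subsystem or order changed. That is the sense in which an
unrelated commit that adds or removes an object from the link can
change the boot-time order.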