From: Ken Merry
Date: Mon, 24 Jul 2017 12:25:34 -0400
Subject: Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching
To: Steven Hartland
Cc: freebsd-stable@freebsd.org, "re@freebsd.org", Mark.Martinec+freebsd@ijs.si, Stephen Mcconnell

It is possible that the change I MFCed today (r321207 in head, r321415
in stable/11) is related, but Mark will have to boot his machine with
the fix to see if it makes any difference.

What happened in my case on one particular machine (not on most machines
in our lab running the same code) was that mps_wait_command() /
mpr_wait_command() would not wait the full 60 seconds for a write to the
DPM (Driver Persistent Mapping) table in the controller. So, it
reported that there was a timeout.

There is a secondary bug that is still in the mps(4) / mpr(4) drivers
when a timeout does happen: the error recovery code in the
wait_command() routine reinitializes the controller, which clears out
all the commands. When the wait_command() routine returns, the command
passed in has been freed, but the caller doesn't know that. So
the caller (it happens in a number of places) dereferences a pointer to
freed memory and the kernel panics.

I'm planning to fix that bug, too, if slm@ doesn't get to it first;
I've just had other bugs to fix first. Eliminating bogus timeouts
will eliminate most of the sources of those panics anyway.

Ken
-- 
Ken Merry
ken@FreeBSD.ORG

> On Jul 24, 2017, at 12:10 PM, Steven Hartland wrote:
>
> Based on your boot info you're using mps, so this could be related to mps fix committed to stable/11 today by ken@
> https://svnweb.freebsd.org/changeset/base/321415
>
> re@ cc'ed as this could cause hangs for others too on 11.1-RELEASE if this is the case.
>
> Regards
> Steve
>
> On 24/07/2017 15:55, Mark Martinec wrote:
>>> Thanks!
>>> Tried it, and the message (or a backtrace) does not show
>>> during a boot of a generic (patched) kernel, at least not in
>>> the last 40-lines screen before the hang occurs.
>>> (It also does not show during a "Safe mode" successful boot.)
>>
>> Btw (may or may not be relevant): after the above experiment
>> I have rebooted the machine in "Safe mode" (generic kernel,
>> EARLY_AP_STARTUP enabled by default) - and spent some time
>> doing non-intensive interactive work on this host (web browsing,
>> editor, shell, all under KDE) - and after about an hour the
>> machine froze: clock display not updating, keyboard unresponsive,
>> console virtual terminals inaccessible - so had to reboot.
>> According to fan speed the machine was idle.
>> The /var/log/messages does not show anything of interest
>> before the freeze. All disks are under ZFS.
>>
>> Can EARLY_AP_STARTUP have an effect also _after_ booting?
>> This host never hung during normal work when EARLY_AP_STARTUP
>> was disabled (or with 11.0 and earlier).
>>
>> Mark
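
For readers following the thread, the secondary bug described in the reply above (a wait routine whose error recovery frees the command while the caller still holds a pointer to it) can be illustrated with a small user-space sketch. Everything in it is hypothetical: the `toy_` names, the two structs, and the `timed_out` flag are invented for illustration and are not the mps(4) / mpr(4) code. Passing the command by reference so it can be cleared is only one possible way to let the caller know; it is not necessarily the fix that will be committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real mps(4)/mpr(4) types. */
struct toy_cmd {
	int id;
	int status;
};

struct toy_softc {
	struct toy_cmd *active;	/* the single in-flight command in this toy model */
	int reinit_count;	/* how many times recovery has run */
};

/* Simulated error recovery: reinitializing frees the outstanding command. */
void
toy_reinit(struct toy_softc *sc)
{
	free(sc->active);
	sc->active = NULL;
	sc->reinit_count++;
}

/*
 * Toy wait routine. The command is passed by reference so that, when
 * recovery frees it, the caller's pointer is cleared as well.
 * Returns 0 on completion and -1 on (simulated) timeout.
 */
int
toy_wait_command(struct toy_softc *sc, struct toy_cmd **cmp, bool timed_out)
{
	if (timed_out) {
		toy_reinit(sc);		/* frees *cmp via sc->active */
		*cmp = NULL;		/* tell the caller it is gone */
		return (-1);
	}
	(*cmp)->status = 1;		/* pretend the command completed */
	return (0);
}

/*
 * Toy caller. Returns the command status on success, -1 on timeout,
 * -2 if allocation fails. After a timeout it must not touch cm.
 */
int
toy_caller(struct toy_softc *sc, bool timed_out)
{
	struct toy_cmd *cm;
	int error, status;

	assert(sc != NULL);
	cm = calloc(1, sizeof(*cm));
	if (cm == NULL)
		return (-2);
	cm->id = 1;
	sc->active = cm;

	error = toy_wait_command(sc, &cm, timed_out);
	if (error != 0 || cm == NULL)
		return (error);		/* command already freed by recovery */

	status = cm->status;		/* safe: command is still ours */
	sc->active = NULL;
	free(cm);
	return (status);
}
```

In this model, a caller that ignored the return value and read `cm->status` after a timeout would be reading freed memory, which is the pattern the reply describes as causing the panic.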