Date: Fri, 14 Sep 2018 16:53:15 +0200
From: Mateusz Guzik <mjguzik@gmail.com>
To: Mike Tancsa <mike@sentex.net>
Cc: Glen Barber <gjb@freebsd.org>, George Neville-Neil <gnn@neville-neil.com>, Paul Holes <pholes@sentex.ca>, netperf-users@freebsd.org, netperf-admin@freebsd.org
Subject: Re: update of zoo to r338656 12.0 (was Re: zoo vs 12.0 (was: zoo vs 11.2-rc2)
Message-ID: <CAGudoHFv1AyWZiL1KsQwP1grkrz6s=eKmtSvFudzr%2BN9f7B4oQ@mail.gmail.com>
In-Reply-To: <2dcd8d1b-12f3-1da7-673c-8d24bc0eb948@sentex.net>
References: <CAGudoHGb3FtoWAroBzVdDks6S2td-nnqJcdrkAsoiT_Q1PCYJQ@mail.gmail.com>
 <e3a7c7b6-e564-b5fe-ddad-6332bf6c96a0@sentex.net>
 <A344C6D5-BF69-48E8-8C0A-3610FE5BA15F@neville-neil.com>
 <b3f74263-127f-0b33-8d35-8e5c245cf826@sentex.net>
 <8ca07d41-b753-9741-49be-150d42197edc@sentex.net>
 <7dc50e6a-191b-002d-9adf-df16e591c9da@sentex.net>
 <ea0b113f-7b45-1ec3-f752-fcfa0ef7b2c0@sentex.net>
 <20180914140300.GB52847@FreeBSD.org>
 <2dcd8d1b-12f3-1da7-673c-8d24bc0eb948@sentex.net>

On 9/14/18, Mike Tancsa <mike@sentex.net> wrote:
> On 9/14/2018 10:03 AM, Glen Barber wrote:
>> Mike,
>>
>> In the interest of morbid curiosity, could you rebuild the 12.0 kernel
>> without the 'options NUMA' line?  This was turned on very late, and too
>> close to the stable/12 branch, and I'd like to at least confirm this is
>> not in any way at fault.
> A couple of people are already working on the box.
>
> If its an MFI driver issue, I could put a spare card in one of the zoo
> members that makes use of NUMA and has more than one domain ?  I just
> tried a mfi card in an EPYC based machine with the same rev, and it
> boots up OK. But his only has one NUMA domain.
>
> I think pig would have multiple numa domains as does flix1a which noone
> seems to be on right now.
>

lynx1-4, pig1 and flix* all do have multiple nodes. lynx* has the
fastest boot cycle if you can plop a controller in there.

Rebooting without NUMA as a sanity check is definitely a good idea, but I
doubt that's it. I like the idea of using the above boxes with an mfi
controller in hopes of reproducing the issue.

Looking at the differences between the driver in head and stable/11, I see
two changes, one of which looks extremely interesting:

commit a1d4bb9b4447414168dc2ffc8d5c74a1ef8bb152
Author: scottl <scottl@FreeBSD.org>
Date:   Fri Sep 8 17:51:19 2017 +0000

    Fix intrhook release in MFI as well

diff --git a/sys/dev/mfi/mfi.c b/sys/dev/mfi/mfi.c
index 28054d9bf7d..91ec872558a 100644
--- a/sys/dev/mfi/mfi.c
+++ b/sys/dev/mfi/mfi.c
@@ -1263,8 +1263,6 @@ mfi_startup(void *arg)
 
 	sc = (struct mfi_softc *)arg;
 
-	config_intrhook_disestablish(&sc->mfi_ich);
-
 	sc->mfi_enable_intr(sc);
 	sx_xlock(&sc->mfi_config_lock);
 	mtx_lock(&sc->mfi_io_lock);
@@ -1273,6 +1271,8 @@ mfi_startup(void *arg)
 		mfi_syspdprobe(sc);
 	mtx_unlock(&sc->mfi_io_lock);
 	sx_xunlock(&sc->mfi_config_lock);
+
+	config_intrhook_disestablish(&sc->mfi_ich);
 }
 
 static void

Note that this may have no relation to the problem whatsoever, but booting
a kernel with this change reverted would definitely help.

If a zoo-testable box is confirmed to hang, I can take it from there
myself; the least I can do is bisect and chase the guilty. :)

-- 
Mateusz Guzik <mjguzik gmail.com>