From: Mateusz Guzik
Date: Fri, 14 Sep 2018 16:53:15 +0200
Subject: Re: update of zoo to r338656 12.0 (was Re: zoo vs 12.0 (was: zoo vs 11.2-rc2)
To: Mike Tancsa
Cc: Glen Barber, George Neville-Neil, Paul Holes, netperf-users@freebsd.org, netperf-admin@freebsd.org

On 9/14/18, Mike Tancsa wrote:
> On 9/14/2018 10:03 AM, Glen Barber wrote:
>> Mike,
>>
>> In the interest of morbid curiosity, could you rebuild the 12.0 kernel
>> without the 'options NUMA' line?  This was turned on very late, and too
>> close to the stable/12 branch, and I'd like to at least confirm this is
>> not in any way at fault.
> A couple of people are already working on the box.
>
> If it's an MFI driver issue, I could put a spare card in one of the zoo
> members that makes use of NUMA and has more than one domain? I just
> tried a mfi card in an EPYC based machine with the same rev, and it
> boots up OK. But this only has one NUMA domain.
>
> I think pig would have multiple numa domains as does flix1a which no one
> seems to be on right now.
>

lynx1-4, pig1 and flix* all do have multiple nodes. lynx* has the fastest
boot cycle if you can plop a controller in there.

Rebooting without NUMA as a sanity check is definitely a good idea, but
I doubt that's it.

I like the idea of using the above boxes with an mfi controller in hopes
of reproducing the issue.

Looking at the differences between the driver in head and stable/11, I see
2 changes, one of which looks extremely interesting:

commit a1d4bb9b4447414168dc2ffc8d5c74a1ef8bb152
Author: scottl
Date:   Fri Sep 8 17:51:19 2017 +0000

    Fix intrhook release in MFI as well

diff --git a/sys/dev/mfi/mfi.c b/sys/dev/mfi/mfi.c
index 28054d9bf7d..91ec872558a 100644
--- a/sys/dev/mfi/mfi.c
+++ b/sys/dev/mfi/mfi.c
@@ -1263,8 +1263,6 @@ mfi_startup(void *arg)
 
 	sc = (struct mfi_softc *)arg;
 
-	config_intrhook_disestablish(&sc->mfi_ich);
-
 	sc->mfi_enable_intr(sc);
 	sx_xlock(&sc->mfi_config_lock);
 	mtx_lock(&sc->mfi_io_lock);
@@ -1273,6 +1271,8 @@ mfi_startup(void *arg)
 		mfi_syspdprobe(sc);
 	mtx_unlock(&sc->mfi_io_lock);
 	sx_xunlock(&sc->mfi_config_lock);
+
+	config_intrhook_disestablish(&sc->mfi_ich);
 }
 
 static void

Note that this may have no relation to the problem whatsoever, but booting
a kernel with this change reverted would definitely help.

If a zoo-testable box is confirmed to hang, I can take it from there
myself; the least I can do is bisect and chase the guilty. :)

-- 
Mateusz Guzik