From owner-freebsd-stable@freebsd.org  Mon Jul 24 17:33:13 2017
Return-Path: <owner-freebsd-stable@freebsd.org>
Delivered-To: freebsd-stable@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 09FDCCFF575
 for <freebsd-stable@mailman.ysv.freebsd.org>;
 Mon, 24 Jul 2017 17:33:13 +0000 (UTC)
 (envelope-from wlosh@bsdimp.com)
Received: from mail-it0-x233.google.com (mail-it0-x233.google.com
 [IPv6:2607:f8b0:4001:c0b::233])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id C7DB070CE6
 for <freebsd-stable@freebsd.org>; Mon, 24 Jul 2017 17:33:12 +0000 (UTC)
 (envelope-from wlosh@bsdimp.com)
Received: by mail-it0-x233.google.com with SMTP id h199so38552591ith.1
 for <freebsd-stable@freebsd.org>; Mon, 24 Jul 2017 10:33:12 -0700 (PDT)
MIME-Version: 1.0
Sender: wlosh@bsdimp.com
Received: by 10.79.58.17 with HTTP; Mon, 24 Jul 2017 10:33:11 -0700 (PDT)
X-Originating-IP: [2603:300b:6:5100:1863:f3dc:1906:3a1e]
In-Reply-To: <accdd071-035f-215b-d2a9-d1aa1c83f705@FreeBSD.org>
References: <587928B3.2050607@grosbein.net>
 <20170113193726.GC77535@wkstn-mjohnston.west.isilon.com>
 <587A0E12.7070205@grosbein.net> <59746BD5.5010301@grosbein.net>
 <20170724014445.GA20872@raichu> <59762849.5090208@grosbein.net>
 <accdd071-035f-215b-d2a9-d1aa1c83f705@FreeBSD.org>
From: Warner Losh <imp@bsdimp.com>
Date: Mon, 24 Jul 2017 11:33:11 -0600
X-Google-Sender-Auth: 8n38EJ9K7Qj6h_UtzGKmnVbB3aI
Message-ID: <CANCZdfqMLFBzhn_jcCEaJ=M3Q7jr_+YOx3=AkCjV1F8ek-gKPA@mail.gmail.com>
Subject: Re: stable/11 debugging kernel unable to produce crashdump again
To: Alexander Motin <mav@freebsd.org>
Cc: Eugene Grosbein <eugen@grosbein.net>, Mark Johnston <markj@freebsd.org>, 
 FreeBSD Stable <freebsd-stable@freebsd.org>
Content-Type: text/plain; charset="UTF-8"
X-Content-Filtered-By: Mailman/MimeDel 2.1.23
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-stable>, 
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable/>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Jul 2017 17:33:13 -0000

I've often wondered why, for CAM at least, we don't automatically fall back
to the dump path when scheduling is stopped, rather than having two different
interfaces and special knowledge of this in a lot of places...
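
To make the idea concrete, here is a minimal userland sketch, not kernel code. Every name in it (sched_stopped, io_submit and friends) is a hypothetical stand-in rather than an existing CAM or GEOM interface. It only shows the shape of a single entry point that picks the polled path by itself once scheduling has stopped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's "scheduler is stopped" state. */
static bool sched_stopped = false;

/* Which path a request ended up taking. */
enum io_path { IO_PATH_NORMAL, IO_PATH_POLLED };

struct io_req {
	const void *buf;
	size_t len;
};

/* Normal path: a real kernel would queue the request and sleep here. */
static enum io_path
io_submit_normal(const struct io_req *req)
{
	(void)req;
	return (IO_PATH_NORMAL);
}

/* Dump-style path: a real kernel would poll the hardware, never sleep. */
static enum io_path
io_submit_polled(const struct io_req *req)
{
	(void)req;
	return (IO_PATH_POLLED);
}

/*
 * Single entry point: callers never need to know whether the system is
 * panicking; the fallback decision lives in exactly one place.
 */
enum io_path
io_submit(const struct io_req *req)
{
	assert(req != NULL);
	if (sched_stopped)
		return (io_submit_polled(req));
	return (io_submit_normal(req));
}
```

With something along these lines, a shutdown handler would call io_submit() unconditionally and would not need its own panic-time special case.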

Warner

On Mon, Jul 24, 2017 at 11:25 AM, Alexander Motin <mav@freebsd.org> wrote:

> I guess the problem with g_raid_shutdown_post_sync in case of panic can
> be explained by the fact that it tries to write clean metadata in the regular
> (not dumping) way while the system is already in panic mode and there is no
> proper scheduling.  Maybe it could just be bypassed in case of dumping
> (should be trivial and probably OK), or use g_raid_subdisk_kerneldump()
> in that case instead of normal GEOM I/O.
>
> On 24.07.2017 20:03, Eugene Grosbein wrote:
> > CCing mav@ as graid expert.
> >
> > On 24.07.2017 08:44, Mark Johnston wrote:
> >
> >>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump:
> >>>
> >>> - "call doadump" from DDB prompt works just fine;
> >>> - "shutdown -r now" reboots the system without problems;
> >>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs just after showing uptime
> >>> instead of continuing with crashdump generation; same if "real" panic occurs.
> >>>
> >>> Same for debug.minidump set to 1 or 0. How do I debug this?
> >>
> >> I'm not able to reproduce the problem in bhyve using r321401. Looking
> >> at the code, the culprits might be cngrab(), or one of the
> >> shutdown_post_sync eventhandlers. Since you're apparently able to see
> >> the console output at the time of the panic, I guess it's probably the
> >> latter. Could you try your test with the patch below applied? It'll
> >> print a bunch of "entering post_sync"/"leaving post_sync" messages with
> >> addresses that can be resolved using kgdb. That'll help determine where
> >> we're getting stuck.
> >>
> >> Index: sys/sys/eventhandler.h
> >> ===================================================================
> >> --- sys/sys/eventhandler.h   (revision 321401)
> >> +++ sys/sys/eventhandler.h   (working copy)
> >> @@ -85,7 +85,11 @@
> >>                      _t = (struct eventhandler_entry_ ## name *)_ep; \
> >>                      CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
> >>                          (void *)_t->eh_func);                       \
> >> +                    if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> >> +                            printf("entering post_sync %p\n", (void *)_t->eh_func); \
> >>                      _t->eh_func(_ep->ee_arg , ## __VA_ARGS__);      \
> >> +                    if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> >> +                            printf("leaving post_sync %p\n", (void *)_t->eh_func); \
> >>                      EHL_LOCK((list));                               \
> >>              }                                                       \
> >>      }                                                               \
> >>
> >
> > Thanks, this helped:
> >
> > $ addr2line -f -e kernel.debug 0xffffffff80919c00
> > g_raid_shutdown_post_sync
> > /home/src/sys/geom/raid/g_raid.c:2458
> >
> > That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just before
> > crashdump generation but works just fine during normal system shutdown.
> >
> > I should note that my graid RAID1 is currently running in a degraded state
> > due to a dead SSD module that does not respond. Here is part of the boot log:
> >
> > ahcich5: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> > ahcich5: Poll timeout on slot 2 port 0
> > ahcich5: is 00000000 cs 00000004 ss 00000000 rs 00000004 tfd 80 serr 00000000 cmd 0000c217
> > (aprobe2:ahcich5:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> > (aprobe2:ahcich5:0:0:0): CAM status: Command timeout
> > (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
> > run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
> > ahcich5: Poll timeout on slot 3 port 0
> > ahcich5: is 00000000 cs 00000008 ss 00000000 rs 00000008 tfd 80 serr 00000000 cmd 0000c317
> > (aprobe2:ahcich5:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> > (aprobe2:ahcich5:0:0:0): CAM status: Command timeout
> > (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
> > [skip]
> > Trying to mount root from ufs:/dev/raid/r0s4a [rw,noatime]...
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > GEOM_RAID: Intel-c291fe96: Force array start due to timeout.
> > GEOM_RAID: Intel-c291fe96: Disk ada0 state changed from NONE to ACTIVE.
> > GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from NONE to STALE.
> > GEOM_RAID: Intel-c291fe96: Array started.
> > GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from STALE to ACTIVE.
> > GEOM_RAID: Intel-c291fe96: Volume r0 state changed from STARTING to DEGRADED.
> > GEOM_RAID: Intel-c291fe96: Provider raid/r0 for volume r0 created.
>
> --
> Alexander Motin
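
The debugging technique used in the quoted patch (print a message before and after each handler, so the last "entering" without a matching "leaving" names the handler that hangs) can be tried outside the kernel. The following is only a userland sketch of that idea: struct handler, invoke_traced() and count_call() are hypothetical names, not the eventhandler(9) API, and handlers are identified by list index rather than by address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef void (*handler_fn)(void *arg);

/* Hypothetical handler list entry: a callback plus its argument. */
struct handler {
	handler_fn fn;
	void *arg;
};

/*
 * Call every handler in order, logging before and after each call.
 * The log is flushed immediately so the messages survive a hang.
 * Returns the number of handlers that ran to completion.
 */
size_t
invoke_traced(const struct handler *list, size_t n, FILE *log)
{
	size_t done = 0;

	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		fprintf(log, "entering handler %zu\n", i);
		fflush(log);
		list[i].fn(list[i].arg);
		fprintf(log, "leaving handler %zu\n", i);
		fflush(log);
		done++;
	}
	return (done);
}

/* Example handler: increments the int counter passed as its argument. */
void
count_call(void *arg)
{
	++*(int *)arg;
}
```

If one handler never returns, the last line in the log is its "entering" message, which is the same signal that pointed at g_raid_shutdown_post_sync() in this thread.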