From owner-freebsd-stable@FreeBSD.ORG  Fri Nov 22 11:18:41 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id D4C0C98F;
 Fri, 22 Nov 2013 11:18:41 +0000 (UTC)
Received: from constantine.ingresso.co.uk (constantine.ingresso.co.uk
 [IPv6:2a02:b90:3002:e550::3])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mx1.freebsd.org (Postfix) with ESMTPS id 9D47C2B29;
 Fri, 22 Nov 2013 11:18:41 +0000 (UTC)
Received: from dilbert.london-internal.ingresso.co.uk ([10.64.50.6]
 helo=dilbert.ingresso.co.uk)
 by constantine.ingresso.co.uk with esmtps (TLSv1:DHE-RSA-AES256-SHA:256)
 (Exim 4.80.1 (FreeBSD)) (envelope-from <petefrench@ingresso.co.uk>)
 id 1Vjoko-0000Nv-Kv; Fri, 22 Nov 2013 11:18:30 +0000
Received: from petefrench by dilbert.ingresso.co.uk with local (Exim 4.80.1
 (FreeBSD)) (envelope-from <petefrench@ingresso.co.uk>)
 id 1Vjokn-000OuU-1Y; Fri, 22 Nov 2013 11:18:29 +0000
To: petefrench@ingresso.co.uk, trociny@FreeBSD.org
Subject: Re: Hast locking up under 9.2
In-Reply-To: <20131121203711.GA3736@gmail.com>
Message-Id: <E1Vjokn-000OuU-1Y@dilbert.ingresso.co.uk>
From: Pete French <petefrench@ingresso.co.uk>
Date: Fri, 22 Nov 2013 11:18:29 +0000
Cc: freebsd-stable@freebsd.org
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.16
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable/>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 22 Nov 2013 11:18:41 -0000

> I remember already asking you about the replication mode you were using and
> don't remember you answering. One of the significant changes is memsync
> mode, which is the default in 9.2 (it was fullsync in earlier versions).
> So if you are using the default settings you can try switching to fullsync
> as a workaround.

Yes, I am using the default settings, so that is something I
can try. After three days of downtime last week I will not try it
in the immediate future though, for fear of my colleagues wanting
to strangle me :-) I will enable it on the test system however, and try it
on live in a couple of weeks if I can.
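
The workaround discussed above should come down to one line in
/etc/hast.conf. A minimal sketch, assuming the replication mode is set
globally rather than per resource (the rest of the file is not shown and
is left unchanged):

```
# Assumed placement: global section of /etc/hast.conf.
# Overrides the 9.2 default (memsync) with the pre-9.2 default.
replication fullsync
```

The same keyword can also go inside an individual resource section if only
one resource should change mode.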

> signal=6 means that hastd crashed because an assertion failed.
> Usually an "Assertion failed ..." message precedes this line in the
> logs. Don't you see such a message? It might be very helpful.

Yes, I do actually!

"Assertion failed: (!hio->hio_done), function write_complete, file /usr/src/sbin/hastd/primary.c, line 1130."

> Do you always see this error when it gets stuck?

That I do not know, I am afraid - I was too busy getting the systems back online
to have time to try and reconcile the downtimes with what is in the logfiles.
It was only yesterday that I started trying to trace what might have
happened.

> Unfortunately the crash did not generate a core (due to capsicum). When
> I want to get a coredump I rebuild hastd with CFLAGS+=-DHAVE_CAPSICUM
> removed from the Makefile (and with debugging symbols). There might be an
> easier method but I don't know.
>
> If you don't find the assertion message and the crashes are
> reproducible, it would be helpful to rebuild hastd with symbols and
> capsicum disabled to make it coredump and provide the backtrace.
>
> Also, when hastd gets stuck you can generate a core of the
> live process with gcore(1).

I didn't know about gcore - that's a very useful feature! The crash
is reproducible, but not on any machine that I could actually
crash without causing extensive downtime to the rest of the business,
unfortunately. I can't deliberately crash our master database and
it doesn't crash on the test setup we have. But what I can do is to run it up
live again with your suggested change to the config, and if it gets stuck
try and generate some more useful debugging information then.

> What revision are you using? Recently there was a fix for crashes
> triggered by this failed assertion:
>
>  Assertion failed: (amp->am_memtab[ext] > 0), function
>  activemap_write_complete, file activemap.c, line 351.

I'm using r257795 - I did an upgrade to get the fix for the above assertion,
and in general I keep an eye on the commits and anything involving hast
or zfs I take as soon as I can to try and improve stability.

Thanks for the help - if I get any more info I will let
you know, or if the above assertion helps you track something down
then I may be able to try some patches.

cheers,

-pete.