From: Pete French
Date: Fri, 22 Nov 2013 11:18:29 +0000
To: petefrench@ingresso.co.uk, trociny@FreeBSD.org
Cc: freebsd-stable@freebsd.org
Subject: Re: Hast locking up under 9.2
In-Reply-To: <20131121203711.GA3736@gmail.com>
List-Id: Production branch of FreeBSD source code

> I remember already asking you about replication mode you was using and
> don't remember you answered. One of the significant changes is memsync
> mode, which is default in 9.2 (it was fullsync in earlier versions).
> So if you are using default settings you can try switching to fullsync
> as a workaround.

Yes, I am using the default settings, so that is something I can try.
After three days of downtime last week I will not try it in the immediate
future though, for fear of my colleagues wanting to strangle me :-) Will
enable on the test system however, and try on live in a couple of weeks
if I can.

> signal=6 means that hastd crashed due to some assertion failed.
> Usually "Assertion failed ..." message precedes this line in the
> logs. Don't you see such a message? It might be very helpful.

Yes, I do actually!

"Assertion failed: (!hio->hio_done), function write_complete, file /usr/src/sbin/hastd/primary.c, line 1130."

> Do you always see this error when it gets stuck?

That I do not know, I am afraid - I was too busy getting the systems back
online to have time to try and reconcile the downtimes with what is in
the logfiles. It was only yesterday that I started trying to trace what
might have happened.

> Unfortunately the crash did not generated core (due to capsicum). When
> I want to get a coredump I rebuild hastd with CFLAGS+=-DHAVE_CAPSICUM
> removed in Makefile (and with debugging symbols). There might be an
> easier method but I don't know.
>
> If you don't find the assertion message and the crashes are
> reproducible, it would be helpful to rebuild hastd with symbols and
> capsicum disabled to make it coredump and provide the backtrace.
>
> Also, when you have hastd got stuck you can generate a core of the
> live process with gcore(1).

I didn't know about gcore - that's a very useful feature! The crash is
reproducible, but unfortunately not on any machine that I could actually
crash without causing extensive downtime to the rest of the business. I
can't deliberately crash our master database, and it doesn't crash on the
test setup we have. But what I can do is to run it up live again with
your suggested change to the config, and if it gets stuck try and
generate some more useful debugging then.

> What revision are you using?
> Recently there was a fix for crashes
> triggered by this failed assertion:
>
> Assertion failed: (amp->am_memtab[ext] > 0), function
> activemap_write_complete, file activemap.c, line 351.

I'm using r257795 - I did an upgrade to get the fix for the above
assertion, and in general I keep an eye on the commits, and anything
involving hast or zfs I take as soon as I can to try and improve
stability.

Thanks for the help - if I get any more info I will let you know, or if
the above assertion helps you track something down then I may be able to
try some patches.

cheers,

-pete.